Google Research Blog
The latest news from Research at Google
A Billion Words: Because today's language modeling standard should be higher
Wednesday, April 30, 2014
Posted by Dave Orr, Product Manager, and Ciprian Chelba, Research Scientist
Language is chock full of ambiguity, and it can turn up in surprising places. Many words are hard to tell apart without context: most Americans pronounce “ladder” and “latter” identically, for instance. Keyboard inputs on mobile devices have a similar problem, especially for
. For example, the input patterns for “Yankees” and “takes” look very similar:
Photo credit: Kurt Partridge
But in this context -- the previous two words, “New York” -- “Yankees” is much more likely.
One key way computers use context is with language models. These are used for predictive keyboards, but also for speech recognition, machine translation, spelling correction, query suggestions, and so on. Often those models are specialized: word order in queries can be very different from word order in web pages. Either way, an accurate language model with wide coverage drives the quality of all these applications.
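To make the “New York” example concrete, here is a minimal sketch of a count-based trigram model. The tiny corpus and the function names are invented for illustration only; this is not the model or data described in this post, and a real language model is trained on vastly more text and uses smoothing.

```python
from collections import Counter, defaultdict

# Toy corpus invented for illustration; not real training data.
CORPUS = [
    "the new york yankees won again",
    "new york yankees tickets on sale",
    "the new york yankees lost",
    "it takes time to learn",
    "new york takes pride in its parks",
]

def train_trigram_counts(sentences):
    """Count how often each word follows each two-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            counts[(w1, w2)][w3] += 1
    return counts

def next_word_probability(counts, context, word):
    """Maximum-likelihood estimate of P(word | context); 0.0 for unseen contexts."""
    followers = counts.get(context)
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())

counts = train_trigram_counts(CORPUS)
p_yankees = next_word_probability(counts, ("new", "york"), "yankees")
p_takes = next_word_probability(counts, ("new", "york"), "takes")
print(p_yankees, p_takes)
```

On this toy corpus, “yankees” follows “new york” three times out of four, so the model prefers it over “takes” even when the raw keyboard input is ambiguous.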
Due to interactions between components, one thing that can be tricky when evaluating the quality of such complex systems is error attribution. Good engineering practice is to evaluate the quality of each module separately, including the language model. We believe that the field could benefit from a large, standard set with benchmarks for easy comparison and experiments with new modeling techniques.
To that end, we are releasing scripts that convert a set of public data into a language modeling benchmark consisting of over a billion words, with standardized training and test splits, described in an
. Alongside the scripts, we’re releasing the processed data in one convenient location, together with the training and test data. This will make it much easier for the research community to quickly reproduce results, and we hope it will speed up progress on these tasks.
The benchmark scripts and data are freely available, and can be found here:
The field needs a new and better standard benchmark. Currently, researchers report from a set of their choice, and results are very hard to reproduce because of a lack of a standard in preprocessing. We hope that this will solve both those problems, and become the standard benchmark for language modeling experiments. As more researchers use the new benchmark, comparisons will be easier and more accurate, and progress will be faster.
For all the researchers out there, try out this model, run your experiments, and let us know how it goes -- or publish, and we’ll enjoy finding your results at conferences and in journals.
Lens Blur in the new Google Camera app
Wednesday, April 16, 2014
Posted by Carlos Hernández, Software Engineer
One of the biggest advantages of
over camera phones is the ability to achieve shallow depth of field and bokeh effects. Shallow depth of field makes the object of interest "pop" by bringing the foreground into focus and de-emphasizing the background. Achieving this optical effect has traditionally required a big lens and aperture, and therefore hasn’t been possible using the camera on your mobile phone or tablet.
That all changes with Lens Blur, a new mode in the Google Camera app. It lets you take a photo with a shallow depth of field using just your Android phone or tablet. Unlike a regular photo, Lens Blur lets you change the point or level of focus after the photo is taken. You can choose to make any object come into focus simply by tapping on it in the image. By changing the depth-of-field slider, you can simulate different aperture sizes, to achieve bokeh effects ranging from subtle to surreal (e.g.,
). The new image is rendered instantly, allowing you to see your changes in real time.
Lens Blur replaces the need for a large optical system with algorithms that simulate a larger lens and aperture. Instead of capturing a single photo, you move the camera in an upward sweep to capture a whole series of frames. From these photos, Lens Blur uses computer vision algorithms to create a 3D model of the world, estimating the depth (distance) to every point in the scene. Here’s an example -- on the left is a raw input photo, in the middle is a “depth map” where darker things are close and lighter things are far away, and on the right is the result blurred by distance:
Here’s how we do it. First, we pick out visual features in the scene and track them over time, across the series of images. Using computer vision algorithms known as Structure-from-Motion (SfM) and
, we compute the camera’s 3D position and orientation and the 3D positions of all those image features throughout the series.
Once we’ve got the 3D pose of each photo, we compute the depth of each pixel in the reference photo using Multi-View Stereo (MVS) algorithms. MVS works the way human stereo vision does: given the location of the same object in two different images, we can triangulate the 3D position of the object and compute the distance to it. How do we figure out which pixel in one image corresponds to a pixel in another image? MVS measures how similar they are -- on mobile devices, one particularly simple and efficient way is computing the Sum of Absolute Differences (SAD) of the RGB colors of the two pixels.
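As a rough illustration of the SAD measure, here is a minimal sketch that scores a reference pixel against a few candidate pixels and picks the closest one. The pixel values and the helper names `sad` and `best_match` are made up for this example; a real MVS pipeline compares many pixels along many candidate depths.

```python
def sad(pixel_a, pixel_b):
    """Sum of Absolute Differences between two (R, G, B) tuples."""
    return sum(abs(a - b) for a, b in zip(pixel_a, pixel_b))

def best_match(reference_pixel, candidates):
    """Index of the candidate pixel with the lowest SAD cost."""
    costs = [sad(reference_pixel, candidate) for candidate in candidates]
    return min(range(len(candidates)), key=costs.__getitem__)

# Hypothetical pixel values, chosen only to illustrate the computation.
reference = (200, 120, 40)
candidates = [(10, 10, 10), (198, 125, 43), (255, 255, 255)]

match_index = best_match(reference, candidates)
match_cost = sad(reference, candidates[match_index])
print(match_index, match_cost)
```

A lower SAD means the two pixels look more alike, so the second candidate (cost 10) is the most plausible correspondence here.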
Now it’s an optimization problem: we try to build a depth map where all the corresponding pixels are most similar to each other. But that’s typically not a well-posed optimization problem -- you can get the same similarity score for different depth maps. To address this ambiguity, the optimization also incorporates assumptions about the 3D geometry of a scene, called a “prior,” that favors reasonable solutions. For example, you can often assume two pixels near each other are at a similar depth. Finally, we use Markov Random Field inference methods to solve the optimization problem.
Having computed the depth map, we can re-render the photo, blurring pixels by differing amounts depending on the pixel’s depth, aperture and location relative to the focal plane. The focal plane determines which pixels to blur, with the amount of blur increasing proportionally with the distance of each pixel to that focal plane. This is all achieved by simulating a physical lens using the
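The proportional relationship between blur and distance from the focal plane can be sketched as follows. This is a simplified illustration, not the lens simulation used in the app: the `aperture_scale` parameter, the function name, and the depth values are assumptions made for this example.

```python
def blur_radii(depth_map, focal_depth, aperture_scale):
    """Blur radius per pixel, proportional to distance from the focal plane.

    aperture_scale is a hypothetical knob standing in for the depth-of-field
    slider: larger values mimic a wider aperture and give stronger blur.
    """
    return [
        [aperture_scale * abs(depth - focal_depth) for depth in row]
        for row in depth_map
    ]

# Hypothetical 2x2 depth map (arbitrary units); the focal plane is at 2.0.
depth_map = [
    [1.0, 2.0],
    [4.0, 2.0],
]
radii = blur_radii(depth_map, focal_depth=2.0, aperture_scale=1.5)
print(radii)
```

Pixels on the focal plane get a radius of zero and stay sharp, while pixels nearer or farther away get progressively larger blur radii.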
The algorithms used to create the 3D photo run entirely on the mobile device, and are closely related to the computer vision algorithms used in 3D mapping features like Google Maps
. We hope you have fun with your
Sawasdee ka Voice Search
Wednesday, April 02, 2014
Posted by Keith Hall and Richard Sproat, Staff Research Scientists, Speech
Typing on mobile devices can be difficult, especially when you're on the go. Google Voice Search gives you a fast, easy, and natural way to search by speaking your queries instead of typing them. In Thailand, Voice Search has been one of the most requested services, so we’re excited to now offer users there the ability to speak queries in Thai, adding to over 75 languages and accents in which you can talk to Google.
To power Voice Search, we teach computers to understand the sounds and words that build spoken language. We trained our speech recognizer to understand Thai by collecting speech samples from hundreds of volunteers in Bangkok, which enabled us to build this recognizer in just a fraction of the time it took to build other models. Our helpers were asked to read popular queries in their native tongue, in a variety of acoustic conditions such as in restaurants, out on busy streets, and inside cars.
Each new language for voice recognition often requires our research team to tackle new challenges, and Thai was no exception.
Segmentation is a major challenge in Thai, as the Thai script has no spaces between words, so it is harder to know where a word begins and ends. Therefore, we created a Thai segmenter to help our system recognize words better. For example: ตากลม can be segmented as ตาก ลม or ตา กลม. We collected a large corpus of text and asked Thai speakers to manually annotate plausible segmentations. We then trained a sequence segmenter on this data, allowing it to generalize beyond the annotated data.
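To see why ตากลม is ambiguous, here is a minimal sketch that enumerates every way to split a string into words from a small lexicon. Note that this is a plain dictionary lookup, not the trained sequence segmenter described above, and the four-word lexicon is an assumption made for this example.

```python
def segmentations(text, lexicon):
    """Return every way to split text into consecutive lexicon words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results

# Tiny illustrative lexicon; a real system would need far wider coverage.
LEXICON = {"ตา", "ตาก", "ลม", "กลม"}

candidates = segmentations("ตากลม", LEXICON)
for words in candidates:
    print(" ".join(words))
```

Both splits are valid according to the lexicon, which is why a segmenter needs to learn from annotated text which one is plausible in context.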
Numbers are an important part of any language: when the string “87” appears on a web page, we need to know how people would say it. As with over 40 other languages, we included a number grammar for Thai that tells us that “87” would be read as แปดสิบเจ็ด.
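As a small illustration of what a number grammar does, here is a sketch that spells out the integers 0 to 99 in Thai. It is a hand-written toy covering only that range, not the grammar used in Voice Search, and the function name `read_number` is an assumption for this example.

```python
UNITS = ["ศูนย์", "หนึ่ง", "สอง", "สาม", "สี่", "ห้า", "หก", "เจ็ด", "แปด", "เก้า"]
TEN = "สิบ"

def read_number(n):
    """Spell out an integer from 0 to 99 in Thai."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0-99")
    if n < 10:
        return UNITS[n]
    tens, ones = divmod(n, 10)
    if tens == 1:
        tens_word = TEN                # 10-19 start with plain "ten"
    elif tens == 2:
        tens_word = "ยี่" + TEN         # twenty uses a special prefix
    else:
        tens_word = UNITS[tens] + TEN
    if ones == 0:
        ones_word = ""
    elif ones == 1:
        ones_word = "เอ็ด"              # a final 1 is read differently after tens
    else:
        ones_word = UNITS[ones]
    return tens_word + ones_word

print(read_number(87))
```

Even in this small range there are irregular forms (twenty, and a final 1), which is why each language needs its own grammar rather than a digit-by-digit lookup.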
Thai users often mix English words, such as brand or artist names, with Thai in both spoken and written language, which adds complexity to our acoustic models, lexicon models, and segmentation models. We addressed this by introducing ‘code switching’, which allows Voice Search to recognize when different languages are being spoken interchangeably and adjust phonetic transliteration accordingly.
Many Thai users leave out accents and tone markers when they search (e.g., โน๊ตบุก instead of โน้ตบุ๊ก, or หมูหยอง instead of หมูหย็อง), so we created a special algorithm to restore accents and tones in the search results, so that our Thai users see properly formatted text in the majority of cases.
We’re particularly excited that Voice Search can help people find locally relevant information, ranging from travel directions to the nearest restaurant, without having to type long phrases in Thai.
Voice Search is available for Android devices running Jelly Bean and above. It will be available for older Android releases and iOS users soon.
Making Blockly Universally Accessible
Tuesday, April 01, 2014
Posted by Neil Fraser, Chief Interplanetary Liaison
We work hard to make our products accessible to people everywhere, in every culture. Today we’re expanding our outreach efforts to support a traditionally underserved community -- those who call themselves "tlhIngan."
Google's Blockly programming environment is used in K-12 classrooms around the world to teach programming. But the world is not enough. Students on
have had difficulty learning to code because most of the teaching tools aren't available in their native language. Additionally, many existing tools are too fragile for their pedagogical approach. As a result, Klingons have found it challenging to enter computer science. This is reflected in the fact that less than 2% of Google engineers are Klingon.
Today we launch a full translation of Blockly in Klingon. It incorporates Klingon cultural norms to facilitate learning in this unique population:
Blockly has no syntax errors. This reduces frustration, and reduces the number of computers thrown through bulkheads.
Variables are untyped. Type errors can too easily be perceived as a challenge to the honor of a student's family (and we’ve seen where that ends).
Debugging and bug reports have been omitted; our research indicates that in the event of a bug, they prefer the entire program to just blow up.
Get a little keyboard dirt under your fingernails. Learn that although
is delicious, code structure should not resemble it. And above all, be proud that
You can try out the demo
or get involved