Google Research Blog
The latest news from Research at Google
Introducing the TensorFlow Research Cloud
Wednesday, May 17, 2017
Posted by Zak Stone, Product Manager for TensorFlow
Researchers require enormous computational resources to train the machine learning (ML) models that have delivered recent breakthroughs in
neural machine translation
, and many other domains. We believe that significantly larger amounts of computation will make it possible for researchers to invent new types of ML models that will be even more accurate and useful.
To accelerate the pace of open machine-learning research, we are introducing the
TensorFlow Research Cloud
(TFRC), a cluster of 1,000
that will be made available free of charge to support a broad range of computationally-intensive research projects that might not be possible otherwise.
The TensorFlow Research Cloud offers researchers the following benefits:
Access to Google’s all-new
that accelerate both training
Up to 180 teraflops of floating-point performance per Cloud TPU
64 GB of ultra-high-bandwidth memory per Cloud TPU
Familiar TensorFlow programming interfaces
sign up here
to request to be notified when the TensorFlow Research Cloud application process opens, and you can optionally share more information about your computational needs. We plan to evaluate applications on a rolling basis in search of the most creative and ambitious proposals.
The TensorFlow Research Cloud program is not limited to academia — we recognize that people with a wide range of affiliations, roles, and expertise are making major machine learning research contributions, and we especially encourage those with non-traditional backgrounds to apply. Access will be granted to selected individuals for limited amounts of compute time, and researchers are welcome to apply multiple times with multiple projects.
Since the main goal of the TensorFlow Research Cloud is to benefit the open machine learning research community as a whole, successful applicants will be expected to do the following:
Share their TFRC-supported research with the world through peer-reviewed publications, open-source code, blog posts, or other open media
Share concrete, constructive feedback with Google to help us improve the TFRC program and the underlying Cloud TPU platform over time
Imagine a future in which ML acceleration is abundant and develop new kinds of machine learning models in anticipation of that future
For businesses interested in using Cloud TPUs for proprietary research and development, we will offer a parallel Cloud TPU Alpha program. You can
sign up here
to learn more about this program. We recommend participating in the Cloud TPU Alpha program if you are interested in any of the following:
Accelerating training of proprietary ML models; models that take weeks to train on other hardware can be trained in days or even hours on Cloud TPUs
Accelerating batch processing of industrial-scale datasets: images, videos, audio, unstructured text, structured data, etc.
Processing live requests in production using larger and more complex ML models than ever before
We hope the
TensorFlow Research Cloud
will allow as many researchers as possible to explore the frontier of machine learning research and extend it with new discoveries! We encourage you to
sign up today
to be among the first to know as more information becomes available.
Using Machine Learning to Explore Neural Network Architecture
Wednesday, May 17, 2017
Posted by Quoc Le & Barret Zoph, Research Scientists, Google Brain team
At Google, we have successfully applied deep learning models to many applications, from
. Typically, our machine learning models are painstakingly designed by a team of engineers and scientists. This process of manually designing machine learning models is difficult because the search space of all possible models can be combinatorially large — a typical 10-layer network can have ~10
candidate networks! For this reason, the process of designing networks often takes a significant amount of time and experimentation by those with significant machine learning expertise.
architecture. Design of this network required many years of careful experimentation and refinement from initial versions of convolutional architectures.
To make this process of designing machine learning models much more accessible, we’ve been exploring ways to automate the design of machine learning models. Among many algorithms we’ve studied,
reinforcement learning algorithms
 have shown great promise. But in this blog post, we’ll focus on our reinforcement learning approach and the early results we’ve gotten so far.
In our approach (which we call "AutoML"), a controller neural net can propose a “child” model architecture, which can then be trained and evaluated for quality on a particular task. That feedback is then used to inform the controller how to improve its proposals for the next round. We repeat this process thousands of times — generating new architectures, testing them, and giving that feedback to the controller to learn from. Eventually the controller learns to assign high probability to areas of architecture space that achieve better accuracy on a held-out validation dataset, and low probability to areas of architecture space that score poorly. Here’s what the process looks like:
We’ve applied this approach to two heavily benchmarked datasets in deep learning: image recognition with
and language modeling with
. On both datasets, our approach can design models that achieve accuracies on par with state-of-art models designed by machine learning experts (including some on our own team!).
So, what kind of neural nets does it produce? Let’s take one example: a recurrent architecture that’s trained to predict the next word on the Penn Treebank dataset. On the left here is a neural net designed by human experts. On the right is a recurrent architecture created by our method:
The machine-chosen architecture does share some common features with the human design, such as using addition to combine input and previous hidden states. However, there are some notable new elements — for example, the machine-chosen architecture incorporates a multiplicative combination (the left-most blue node on the right diagram labeled “
”). This type of combination is not common for recurrent networks, perhaps because researchers see no obvious benefit for having it. Interestingly, a simpler form of this approach was
by human designers, who also argued that this multiplicative combination can actually alleviate gradient vanishing/exploding issues, suggesting that the machine-chosen architecture was able to discover a useful new neural net architecture.
This approach may also teach us something about why certain types of neural nets work so well. The architecture on the right here has many channels so that the gradient can flow backwards, which may help explain why
work better than standard
Going forward, we’ll work on careful analysis and testing of these machine-generated architectures to help refine our understanding of them. If we succeed, we think this can inspire new types of neural nets and make it possible for non-experts to create neural nets tailored to their particular needs, allowing machine learning to have a greater impact to everyone.
Large-Scale Evolution of Image Classifiers
Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, Alex Kurakin. International Conference on Machine Learning, 2017.
Neural Architecture Search with Reinforcement Learning
Barret Zoph, Quoc V. Le. International Conference on Learning Representations, 2017.
Efficient Smart Reply, now for Gmail
Wednesday, May 17, 2017
Posted by Brian Strope, Research Scientist, and Ray Kurzweil, Engineering Director, Google Research
Last year we launched Smart Reply, a feature for
Inbox by Gmail
machine learning to suggest replies to email
. Since the initial release, usage of Smart Reply has grown significantly, making up about 12% of replies in Inbox on mobile. Based on our examination of the use of Smart Reply in Inbox and our ideas about how humans learn and use language, we have created a new version of
Smart Reply for Gmail
. This version increases the percentage of usable suggestions and is more algorithmically efficient.
Novel thinking: hierarchy
Inspired by how humans understand languages and concepts, we turned to hierarchical models of language, an approach that uses
hierarchies of modules, each of which can learn, remember, and recognize a sequential pattern
The content of language is deeply hierarchical, reflected in the structure of language itself, going from letters to words to phrases to sentences to paragraphs to sections to chapters to books to authors to libraries, etc. Consider the message, "That interesting person at the cafe we like gave me a glance." The hierarchical chunks in this sentence are highly variable. The subject of the sentence is "That interesting person at the cafe we like." The modifier "interesting" tells us something about the writer's past experiences with the person. We are told that the location of an incident involving both the writer and the person is "at the cafe." We are also told that "we," meaning the writer and the person being written to, like the cafe. Additionally, each word is itself part of a hierarchy, sometimes more than one. A cafe is a type of restaurant which is a type of store which is a type of establishment, and so on.
In proposing an appropriate response to this message we might consider the meaning of the word "glance," which is potentially ambiguous. Was it a positive gesture? In that case, we might respond, "Cool!" Or was it a negative gesture? If so, does the subject say anything about how the writer felt about the negative exchange? A lot of information about the world, and an ability to make reasoned judgments, are needed to make subtle distinctions.
Given enough examples of language, a machine learning approach can discover many of these subtle distinctions. Moreover, a hierarchical approach to learning is well suited to the hierarchical nature of language. We have found that this approach works well for suggesting possible responses to emails. We use a hierarchy of modules, each of which considers features that correspond to sequences at different temporal scales, similar to how we understand speech and language.
Each module processes inputs and provides transformed representations of those inputs on its outputs (which are, in turn, available for the next level). In the Smart Reply system, and the figure above, the repeated structure has two layers of hierarchy. The first makes each feature useful as a predictor of the final result, and the second combines these features. By definition, the second works at a more abstract representation and considers a wider timescale.
By comparison, the initial release of Smart Reply encoded input emails word-by-word with a
(LSTM) recurrent neural network, and then decoded potential replies with yet another word-level LSTM. While this type of modeling is very effective in many contexts, even with Google infrastructure, it’s an approach that requires substantial computation resources. Instead of working word-by-word, we found an effective and highly efficient path by processing the problem more all-at-once, by comparing a simple hierarchy of vector representations of multiple features corresponding to longer time spans.
We have also considered whether the mathematical space of these vector representations is implicitly semantic. Do the hierarchical network representations reflect a coarse “understanding” of the actual meaning of the inputs and the responses in order to determine which go together, or do they reflect more consistent syntactical patterns? Given many real examples of which pairs go together and, perhaps more importantly which do not, we found that our networks are surprisingly effective and efficient at deriving representations that meet the training requirements.
So far we see that the system can find responses that are on point, without an overlap of keywords or even synonyms of keywords.More directly, we’re delighted when the system suggests results that show understanding and are helpful.
The key to this work is the confidence and trust people give us when they use the Smart Reply feature. As always, thank you for showing us the ways that work (and the ways that don’t!). With your help, we’ll do our best to keep learning.
Natural Language Processing
Natural Language Understanding
Coarse Discourse: A Dataset for Understanding Online Discussions
Tuesday, May 16, 2017
Posted by Praveen Paritosh, Senior Research Scientist, Ka Wong, Senior Data Scientist
Every day, participants of online communities form and share their opinions, experiences, advice and social support, most of which is expressed freely and without much constraint. These online discussions are often a key resource of information for many important topics, such as parenting, fitness, travel and more. However, these discussions also are intermixed with a clutter of disagreements, humor, flame wars and trolling, requiring readers to filter the content before getting the information they are looking for. And while the field of
actively explores ways to allow users to more efficiently find, navigate and consume this content, there is a lack of shared datasets on forum discussions to aid in understanding these discussions a bit better.
To aid researchers in this space, we are releasing the
Coarse Discourse dataset
, the largest dataset of annotated online discussions to date. The Coarse Discourse contains over half a million human annotations of publicly available online discussions on a random sample of over 9,000 threads from 130 communities from
To create this dataset, we developed the Coarse Discourse taxonomy of forum comments by going through a small set of forum threads, reading every comment, and deciding what role the comments played in the discussion. We then repeated and revised this exercise with crowdsourced human editors to validate the reproducibility of the taxonomy's discourse types, which include: announcement, question, answer, agreement, disagreement, appreciation, negative reaction, elaboration, and humor. From this data, over 100,000 comments were independently annotated by the crowdsourced editors for discourse type and relation. Along with the raw annotations from crowdsourced editors, we also provide the
Coarse Discourse annotation task guidelines
used by the editors to help with collecting data for other forums and refining the task further.
An example thread annotated with discourse types and relations.
suggest that question answering is a prominent use case in most communities, while some communities are more converationally focused, with back-and-forth interactions.
For machine learning and natural language processing researchers trying to characterize the nature of online discussions, we hope that this dataset is a useful resource. Visit our
to download the data. For more details, check out our
Characterizing Online Discussion Using Coarse Discourse Sequences
This work was done by Amy Zhang during her internship at Google. We would also like to thank Bryan Culbertson, Olivia Rhinehart, Eric Altendorf, David Huynh, Nancy Chang, Chris Welty and our crowdsourced editors.
Natural Language Processing
Neural Network-Generated Illustrations in Allo
Thursday, May 11, 2017
Posted by Jennifer Daniel, Expressions Creative Director, Allo
Taking, sharing, and viewing selfies has become a daily habit for many — the car selfie, the cute-outfit selfie, the travel selfie, the I-woke-up-like-this selfie. Apart from a social capacity, self-portraiture has long served as a means for self and identity exploration. For some, it’s about figuring out who they are. For others it’s about projecting how they want to be perceived. Sometimes it’s both.
Photography in the form of a selfie is a very direct form of expression. It comes with a set of rules bounded by reality. Illustration, on the other hand, empowers people to define themselves - it’s warmer and less fraught than reality.
Today, Google is introducing a feature in
that uses a combination of neural networks and the work of artists to turn your selfie into a personalized sticker pack.
Simply snap a selfie
, and it’ll return an automatically generated illustrated version of you, on the fly, with customization options to help you personalize the stickers even further.
What makes you,
The traditional computer vision approach to mapping selfies to art would be to analyze the pixels of an image and algorithmically determine attribute values by looking at pixel values to measure color, shape, or texture. However, people today take selfies in all types of lighting conditions and poses. And while people can easily pick out and recognize qualitative features, like eye color, regardless of the lighting condition, this is a very complex task for computers. When people look at eye color, they don’t just interpret the pixel values of blue or green, but take into account the
surrounding visual context
In order to account for this, we explored how we could enable an algorithm to pick out qualitative features in a manner similar to the way people do, rather than the traditional approach of hand coding how to interpret every permutation of lighting condition, eye color, etc. While we could have trained a large convolutional neural network from scratch to attempt to accomplish this, we wondered if there was a more efficient way to get results, since we expected that learning to interpret a face into an illustration would be a very iterative process.
That led us to run some experiments, similar to
, on some of Google's existing more general-purpose computer vision neural networks. We discovered that a few neurons among the millions in these networks were good at focusing on things they weren’t explicitly trained to look at that seemed useful for creating personalized stickers. Additionally, by virtue of being large general-purpose neural networks they had already figured out how to abstract away things they didn’t need. All that was left to do was to provide a much smaller number of human labeled examples to teach the classifiers to isolate out the qualities that the neural network already knew about the image.
To create an illustration of you that captures the qualities that would make it recognizable to your friends, we worked alongside an artistic team to create illustrations that represented a wide variety of features. Artists initially designed a set of hairstyles, for example, that they thought would be representative, and with the help of human raters we used these hairstyles to train the network to match the right illustration to the right selfie. We then asked human raters to judge the sticker output against the input image to see how well it did. In some instances, they determined that some styles were not well represented, so the artists created more that the neural network could learn to identify as well.
Raters were asked to classify hairstyles that the icon on the left resembled closest. Then, once consensus was reached, resident artist
drew a representation of what they had in common.
Avoiding the uncanny valley
In the study of aesthetics, a well-known problem is the
- the hypothesis that human replicas which appear almost, but not exactly, like real human beings can feel repulsive. In machine learning, this could be compounded if were confronted by a computer’s perception of you, versus how you may think of yourself, which can be at odds.
Rather than aim to replicate a person’s appearance exactly, pursuing a lower resolution model, like emojis and stickers, allows the team to explore expressive representation by returning an image that is less about reproducing reality and more about breaking the rules of representation.
The team worked with artist
to design the features that make up more than 563 quadrillion combinations.
Translating pixels to artistic illustrations
Reconciling how the computer perceives you with how you perceive yourself and what you want to project is truly an artistic exercise. This makes a customization feature that includes different hairstyles, skin tones, and nose shapes, essential. After all, illustration by its very nature can be subjective. Aesthetics are defined by race, culture, and class which can lead to creating zones of exclusion without consciously trying. As such, we strove to create a space for a range of race, age, masculinity, femininity, and/or androgyny. Our teams continue to evaluate the research results to help prevent against incorporating biases while training the system.
Creating a broad palette for identity and sentiment
There is no such thing as a ‘universal aesthetic’ or ‘a singular you’. The way people talk to their parents is different than how they talk to their friends which is different than how they talk to their colleagues. It’s not enough to make an avatar that is a literal representation of yourself when there are many versions of you. To address that, the Allo team is working with a range of artistic voices to help others extend their own voice. This first style that launched today speaks to your sarcastic side but the next pack might be more cute for those sincere moments. Then after that, maybe they’ll turn you into a dog. If emojis broadened the world of communication it’s not hard to imagine how this technology and language evolves. What will be most exciting is listening to what people say with it.
This feature is starting to roll out in
Allo today for Android
, and will come soon to Allo on iOS.
This work was made possible through a collaboration of the Allo Team and
researchers at Google. We additionally thank Lamar Abrams, Koji Ashida, Forrester Cole, Jennifer Daniel, Shiraz Fuman, Dilip Krishnan, Inbar Mosseri, Aaron Sarna, Aaron Maschinot and Bhavik Singh.
Updating Google Maps with Deep Learning and Street View
Wednesday, May 03, 2017
Posted by Julian Ibarz, Staff Software Engineer, Google Brain Team and Sujoy Banerjee, Product Manager, Ground Truth Team
Every day, Google Maps provides useful directions, real-time traffic information and information on businesses to millions of people. In order to provide the best experience for our users, this information has to constantly mirror an ever-changing world. While Street View cars collect millions of images daily, it is impossible to manually analyze more than 80 billion high resolution images collected to date in order to find new, or updated, information for Google Maps. One of the goals of the Google’s Ground Truth team is to enable the automatic extraction of information from our geo-located imagery to improve Google Maps.
Attention-based Extraction of Structured Information from Street View Imagery
”, we describe our approach to accurately read street names out of very challenging Street View images in many countries, automatically, using a deep neural network. Our algorithm achieves 84.2% accuracy on the challenging
French Street Name Signs
(FSNS) dataset, significantly outperforming the previous state-of-the-art systems. Importantly, our system is easily extensible to extract other types of information out of Street View images as well, and now helps us automatically extract business names from store fronts. We are excited to announce that this model is now
Example of street name from the FSNS dataset correctly transcribed by our system. Up to four views of the same sign are provided.
Text recognition in a natural environment is a challenging computer vision and machine learning problem. While traditional
Optical Character Recognition
(OCR) systems mainly focus on extracting text from scanned documents, text acquired from natural scenes is more challenging due to visual artifacts, such as distortion, occlusions, directional blur, cluttered background or different viewpoints. Our efforts to solve this research challenge first began in 2008, when we used
neural networks to blur faces and license plates
in Street View images to protect the privacy of our users. From this initial research, we realized that with enough labeled data, we could additionally use machine learning not only to protect the privacy of our users, but also to automatically improve Google Maps with relevant up-to-date information.
In 2014, Google’s Ground Truth team published a state-of-the-art method for
reading street numbers
Street View House Numbers
(SVHN) dataset, implemented by then summer intern (now Googler)
. This work was not only of academic interest but was critical in making Google Maps more accurate. Today, over one-third of addresses globally have had their location improved thanks to this system. In some countries, such as Brazil, this algorithm has improved more than 90% of the addresses in Google Maps today, greatly improving the usability of our maps.
The next logical step was to extend these techniques to street names. To solve this problem, we created and
released French Street Name Signs
(FSNS), a large training dataset of more than 1 million street names. The FSNS dataset was a multi-year effort designed to allow anyone to improve their OCR models on a challenging and real use case. FSNS dataset is much larger and more challenging than SVHN in that accurate recognition of street signs may require combining information from many different images.
These are examples of challenging signs that are properly transcribed by our system by selecting or combining understanding across images. The second example is extremely challenging by itself, but the model learned a language model prior that enables it to remove ambiguity and correctly read the street name. Note that in the FSNS dataset, random noise is used in the case where less than four independent views are available of the same physical sign.
With this training set, Google intern Zbigniew Wojna spent the summer of 2016 developing a deep learning model architecture to automatically label new Street View imagery. One of the interesting strengths of our new model is that it can normalize the text to be consistent with our naming conventions, as well as ignore extraneous text, directly from the data itself.
Example of text normalization learned from data in Brazil. Here it changes “AV.” into “Avenida” and “Pres.” into “Presidente” which is what we desire.
In this example, the model is not confused from the fact that there is two street names, properly normalizes “Av” into “Avenue” as well as correctly ignores the number “1600”.
While this model is accurate, it did show a sequence error rate of 15.8%. However, after analyzing failure cases, we found that 48% of them were due to ground truth errors, highlighting the fact that this model is on par with the label quality (a full analysis our error rate can be found in
This new system, combined with the one extracting street numbers, allows us to create new addresses directly from imagery, where we previously didn’t know the name of the street, or the location of the addresses. Now, whenever a Street View car drives on a newly built road, our system can analyze the tens of thousands of images that would be captured, extract the street names and numbers, and properly create and locate the new addresses, automatically, on Google Maps.
But automatically creating addresses for Google Maps is not enough -- additionally we want to be able to provide navigation to businesses by name. In 2015, we published “
Large Scale Business Discovery from Street View Imagery
”, which proposed an approach to accurately detect business store-front signs in Street View images. However, once a store front is detected, one still needs to accurately extract its name for it to be useful -- the model must figure out which text is the business name, and which text is not relevant. We call this extracting “structured text” information out of imagery. It is not just text, it is text with semantic meaning attached to it.
Using different training data, the same model architecture that we used to read street names can also be used to accurately extract business names out of business facades. In this particular case, we are able to only extract the business name which enables us to verify if we already know about this business in Google Maps, allowing us to have more accurate and up-to-date business listings.
The system is correctly able to predict the business name ‘Zelina Pneus’, despite not receiving any data about the true location of the name in the image. Model is not confused by the tire brands that the sign indicates are available at the store.
Applying these large models across our more than 80 billion Street View images requires a lot of computing power. This is why the Ground Truth team was the first user of Google's TPUs, which were
publicly announced earlier this year
, to drastically reduce the computational cost of the inferences of our pipeline.
People rely on the accuracy of Google Maps in order to assist them. While keeping Google Maps up-to-date with the ever-changing landscape of cities, roads and businesses presents a technical challenge that is far from solved, it is the goal of the Ground Truth team to drive cutting-edge innovation in machine learning to create a better experience for over one billion Google Maps users.
Optical Character Recognition
Experimental Nighttime Photography with Nexus and Pixel
Tuesday, April 25, 2017
Posted by Florian Kainz, Software Engineer, Google Daydream
On a full moon night last year I carried a professional DSLR camera, a heavy lens and a tripod up to a hilltop in the
just north of San Francisco to take a picture of the Golden Gate Bridge and the lights of the city behind it.
A view of the Golden Gate Bridge from the Marin Headlands, taken with a DSLR camera (Canon 1DX, Zeiss Otus 28mm f/1.4 ZE). Click
for the full resolution image.
I thought the photo of the moonlit landscape came out well so I showed it to my (then) teammates in
, a Google Research team that focuses on
- developing algorithms that assist in taking pictures, usually with smartphones and similar small cameras. Seeing my nighttime photo, one of the Gcam team members challenged me to re-take it, but with a phone camera instead of a DSLR. Even though cameras on cellphones have come a long way, I wasn’t sure whether it would be possible to come close to the DSLR shot.
Probably the most successful Gcam project to date is
the image processing pipeline that enables the HDR+ mode
in the camera app on Nexus and Pixel phones. HDR+ allows you to take photos at low-light levels by rapidly shooting a burst of up to ten short exposures and averaging them them into a single image, reducing blur due to camera shake while collecting enough total light to yield
surprisingly good pictures
. Of course there are limits to what HDR+ can do. Once it gets dark enough the camera just cannot gather enough light and challenging shots like nighttime landscapes are still beyond reach.
To learn what was possible with a cellphone camera in extremely low-light conditions, I looked to the experimental
app, written by Marc Levoy and presented at the
ICCV 2015 Extreme Imaging Workshop
, which can produce pictures with even less light than HDR+. It does this by accumulating more exposures, and merging them under the assumption that the scene is static and any differences between successive exposures must be due to camera motion or sensor noise. The app reduces noise further by dropping image resolution to about 1 MPixel. With SeeInTheDark it is just possible to take pictures, albeit fairly grainy ones, by the light of the full moon.
However, in order to keep motion blur due to camera shake and moving objects in the scene at acceptable levels, both HDR+ and SeeInTheDark must keep the exposure times for individual frames below roughly one tenth of a second. Since the user can’t hold the camera perfectly still for extended periods, it doesn’t make sense to attempt to merge a large number of frames into a single picture. Therefore, HDR+ merges at most ten frames, while SeeInTheDark progressively discounts older frames as new ones are captured. This limits how much light the camera can gather and thus affects the quality of the final pictures at very low light levels.
Of course, if we want to take high-quality pictures of low-light scenes (such as a landscape illuminated only by the moon), increasing the exposure time to more than one second and mounting the phone on a tripod or placing it on some other solid support makes the task a lot easier. Google’s Nexus 6P and Pixel phones support exposure times of 4 and 2 seconds respectively. As long as the scene is static, we should be able to record and merge dozens of frames to produce a single final image, even if shooting those frames takes several minutes.
Even with the use of a tripod, a sharp picture requires the camera’s lens to be focused on the subject, and this can be tricky in scenes with very low light levels. The two autofocus mechanisms employed by cellphone cameras —
— fail when it’s dark enough that the camera's image sensor returns mostly noise. Fortunately, the interesting parts of outdoor scenes tend to be far enough away that simply setting the focus distance to infinity produces sharp images.
Experiments & Results
Taking all this into account, I wrote a simple Android camera app with manual control over exposure time, ISO and focus distance. When the shutter button is pressed the app waits a few seconds and then records up to 64 frames with the selected settings. The app saves the raw frames captured from the sensor as
files, which can later be downloaded onto a PC for processing.
To test my app, I visited the
Point Reyes lighthouse
on the California coast some thirty miles northwest of San Francisco on a full moon night. I pointed a Nexus 6P phone at the building and shot a burst of 32 four-second frames at ISO 1600. After covering the camera lens with opaque adhesive tape I shot an additional 32 black frames. Back at the office I loaded the raw files into Photoshop. The individual frames were very grainy, as one would expect given the tiny sensor in a cellphone camera, but computing the mean of all 32 frames cleaned up most of the grain, and subtracting the mean of the 32 black frames removed faint grid-like patterns caused by local variations in the sensor's black level. The resulting image, shown below, looks surprisingly good.
Point Reyes lighthouse at night, photographed with Google Nexus 6P (full resolution image
The lantern in the lighthouse is overexposed, but the rest of the scene is sharp, not too grainy, and has pleasing, natural looking colors. For comparison, a hand-held HDR+ shot of the same scene looks like this:
Point Reyes Lighthouse at night, hand-held HDR+ shot (full resolution image
). The inset rectangle has been brightened in Photoshop to roughly match the previous picture.
Satisfied with these results, I wanted to see if I could capture a nighttime landscape as well as the stars in the clear sky above it, all in one picture. When I took the photo of the lighthouse a thin layer of clouds conspired with the bright moonlight to make the stars nearly invisible, but on a clear night a two or four second exposure can easily capture the brighter stars. The stars are not stationary, though;
they appear to rotate around the celestial poles
, completing a full turn every 24 hours. The motion is slow enough to be invisible in exposures of only a few seconds, but over the minutes it takes to record a few dozen frames the stars move enough to turn into streaks when the frames are merged. Here is an example:
The North Star above Mount Burdell, single 2-second exposure. (full resolution image
Mean of 32 2-second exposures (full resolution image
Seeing streaks instead of pinpoint stars in the sky can be avoided by shifting and rotating the original frames such that the stars align. Merging the aligned frames produces an image with a clean-looking sky, and many faint stars that were hidden by noise in the individual frames become visible. Of course, the ground is now motion-blurred as if the camera had followed the rotation of the sky.
Mean of 32 2-second exposures, stars aligned (full resolution image
We now have two images; one where the ground is sharp, and one where the sky is sharp, and we can combine them into a single picture that is sharp everywhere. In Photoshop the easiest way to do that is with a hand-painted layer mask. After adjusting brightness and colors to taste, slight cropping, and removing an ugly "No Trespassing" sign we get a presentable picture:
The North Star above Mount Burdell, shot with Google Pixel, final image (full resolution image
Using Even Less Light
The pictures I've shown so far were shot on nights with a full moon, when it was bright enough that one could easily walk outside without a lantern or a flashlight. I wanted to find out if it was possible to take cellphone photos in even less light. Using a Pixel phone, I tried a scene illuminated by a three-quarter moon low in the sky, and another one with no moon at all. Anticipating more noise in the individual exposures, I shot 64-frame bursts. The processed final images still look fine:
Wrecked fishing boat in Inverness and the Big Dipper, 64 2-second exposures, shot with Google Pixel (full resolution image
Stars above Pierce Point Ranch, 64 2-second exposures, shot with Google Pixel (full resolution image
In the second image the distant lights of the cities around the San Francisco Bay caused the sky near the horizon to glow, but without moonlight the night was still dark enough to make the Milky Way visible. The picture looks noticeably grainier than my earlier moonlight shots, but it's not too bad.
Pushing the Limits
How far can we go? Can we take a cellphone photo with only starlight - no moon, no artificial light sources nearby, and no background glow from a distant city?
To test this I drove to a point on the California coast a little north of the mouth of the
, where nights can get really dark, and pointed my Pixel phone at the summer sky above the ocean. Combining 64 two-second exposures taken at ISO 12800, and 64 corresponding black black frames did produce a recognizable image of the Milky Way. The constellations
are clearly visible, and squinting hard enough one can just barely make out the horizon and one or two rocks in the ocean, but overall, this is not a picture you'd want to print out and frame. Still, this may be the lowest-light cellphone photo ever taken.
Only starlight, shot with Google Pixel (full resolution image
Here we are approaching the limits of what the Pixel camera can do. The camera cannot handle exposure times longer than two seconds. If this restriction was removed we could expose individual frames for eight to ten seconds, and the stars still would not show noticeable motion blur. With longer exposures we could lower the ISO setting, which would significantly reduce noise in the individual frames, and we would get a correspondingly cleaner and more detailed final picture.
Getting back to the original challenge - using a cellphone to reproduce a night-time DSLR shot of the Golden Gate - I did that. Here is what I got:
Golden Gate Bridge at night, shot with Google Nexus 6P (full resolution image
The Moon above San Francisco, shot with Google Nexus 6P (full resolution image
At 9 to 10 MPixels the resolution of these pictures is not as high as what a DSLR camera might produce, but otherwise image quality is surprisingly good: the photos are sharp all the way into the corners, there is not much visible noise, the captured dynamic range is sufficient to avoid saturating all but the brightest highlights, and the colors are pleasing.
Trying to find out if phone cameras might be suitable for outdoor nighttime photography was a fun experiment, and clearly the result is yes, they are. However, arriving at the final images required a lot of careful post-processing on a desktop computer, and the procedure is too cumbersome for all but the most dedicated cellphone photographers. However, with the right software a phone should be able to process the images internally, and if steps such as painting layer masks by hand can be eliminated, it might be possible to do point-and-shoot photography in very low light conditions. Almost - the cellphone would still have to rest on the ground or be mounted on a tripod.
Here’s a Google Photos album
with more examples of photos that were created with the technique described above.
Adaptive Data Analysis
Automatic Speech Recognition
Electronic Commerce and Algorithms
Google Cloud Platform
Google Play Apps
Google Science Fair
Google Voice Search
High Dynamic Range Imaging
Internet of Things
Natural Language Processing
Natural Language Understanding
Optical Character Recognition
Public Data Explorer
Security and Privacy
Site Reliability Engineering
Give us feedback in our
Official Google Blog
Public Policy Blog
Lat Long Blog
Ads Developer Blog
Android Developers Blog