Google Research Blog
The latest news from Research at Google
Experimental Nighttime Photography with Nexus and Pixel
Tuesday, April 25, 2017
Posted by Florian Kainz, Software Engineer, Google Daydream
On a full moon night last year I carried a professional DSLR camera, a heavy lens and a tripod up to a hilltop in the
just north of San Francisco to take a picture of the Golden Gate Bridge and the lights of the city behind it.
A view of the Golden Gate Bridge from the Marin Headlands, taken with a DSLR camera (Canon 1DX, Zeiss Otus 28mm f/1.4 ZE). Click
for the full resolution image.
I thought the photo of the moonlit landscape came out well so I showed it to my (then) teammates in
, a Google Research team that focuses on
- developing algorithms that assist in taking pictures, usually with smartphones and similar small cameras. Seeing my nighttime photo, one of the Gcam team members challenged me to re-take it, but with a phone camera instead of a DSLR. Even though cameras on cellphones have come a long way, I wasn’t sure whether it would be possible to come close to the DSLR shot.
Probably the most successful Gcam project to date is
the image processing pipeline that enables the HDR+ mode
in the camera app on Nexus and Pixel phones. HDR+ allows you to take photos at low-light levels by rapidly shooting a burst of up to ten short exposures and averaging them them into a single image, reducing blur due to camera shake while collecting enough total light to yield
surprisingly good pictures
. Of course there are limits to what HDR+ can do. Once it gets dark enough the camera just cannot gather enough light and challenging shots like nighttime landscapes are still beyond reach.
To learn what was possible with a cellphone camera in extremely low-light conditions, I looked to the experimental
app, written by Marc Levoy and presented at the
ICCV 2015 Extreme Imaging Workshop
, which can produce pictures with even less light than HDR+. It does this by accumulating more exposures, and merging them under the assumption that the scene is static and any differences between successive exposures must be due to camera motion or sensor noise. The app reduces noise further by dropping image resolution to about 1 MPixel. With SeeInTheDark it is just possible to take pictures, albeit fairly grainy ones, by the light of the full moon.
However, in order to keep motion blur due to camera shake and moving objects in the scene at acceptable levels, both HDR+ and SeeInTheDark must keep the exposure times for individual frames below roughly one tenth of a second. Since the user can’t hold the camera perfectly still for extended periods, it doesn’t make sense to attempt to merge a large number of frames into a single picture. Therefore, HDR+ merges at most ten frames, while SeeInTheDark progressively discounts older frames as new ones are captured. This limits how much light the camera can gather and thus affects the quality of the final pictures at very low light levels.
Of course, if we want to take high-quality pictures of low-light scenes (such as a landscape illuminated only by the moon), increasing the exposure time to more than one second and mounting the phone on a tripod or placing it on some other solid support makes the task a lot easier. Google’s Nexus 6P and Pixel phones support exposure times of 4 and 2 seconds respectively. As long as the scene is static, we should be able to record and merge dozens of frames to produce a single final image, even if shooting those frames takes several minutes.
Even with the use of a tripod, a sharp picture requires the camera’s lens to be focused on the subject, and this can be tricky in scenes with very low light levels. The two autofocus mechanisms employed by cellphone cameras —
— fail when it’s dark enough that the camera's image sensor returns mostly noise. Fortunately, the interesting parts of outdoor scenes tend to be far enough away that simply setting the focus distance to infinity produces sharp images.
Experiments & Results
Taking all this into account, I wrote a simple Android camera app with manual control over exposure time, ISO and focus distance. When the shutter button is pressed the app waits a few seconds and then records up to 64 frames with the selected settings. The app saves the raw frames captured from the sensor as
files, which can later be downloaded onto a PC for processing.
To test my app, I visited the
Point Reyes lighthouse
on the California coast some thirty miles northwest of San Francisco on a full moon night. I pointed a Nexus 6P phone at the building and shot a burst of 32 four-second frames at ISO 1600. After covering the camera lens with opaque adhesive tape I shot an additional 32 black frames. Back at the office I loaded the raw files into Photoshop. The individual frames were very grainy, as one would expect given the tiny sensor in a cellphone camera, but computing the mean of all 32 frames cleaned up most of the grain, and subtracting the mean of the 32 black frames removed faint grid-like patterns caused by local variations in the sensor's black level. The resulting image, shown below, looks surprisingly good.
Point Reyes lighthouse at night, photographed with Google Nexus 6P (full resolution image
The lantern in the lighthouse is overexposed, but the rest of the scene is sharp, not too grainy, and has pleasing, natural looking colors. For comparison, a hand-held HDR+ shot of the same scene looks like this:
Point Reyes Lighthouse at night, hand-held HDR+ shot (full resolution image
). The inset rectangle has been brightened in Photoshop to roughly match the previous picture.
Satisfied with these results, I wanted to see if I could capture a nighttime landscape as well as the stars in the clear sky above it, all in one picture. When I took the photo of the lighthouse a thin layer of clouds conspired with the bright moonlight to make the stars nearly invisible, but on a clear night a two or four second exposure can easily capture the brighter stars. The stars are not stationary, though;
they appear to rotate around the celestial poles
, completing a full turn every 24 hours. The motion is slow enough to be invisible in exposures of only a few seconds, but over the minutes it takes to record a few dozen frames the stars move enough to turn into streaks when the frames are merged. Here is an example:
The North Star above Mount Burdell, single 2-second exposure. (full resolution image
Mean of 32 2-second exposures (full resolution image
Seeing streaks instead of pinpoint stars in the sky can be avoided by shifting and rotating the original frames such that the stars align. Merging the aligned frames produces an image with a clean-looking sky, and many faint stars that were hidden by noise in the individual frames become visible. Of course, the ground is now motion-blurred as if the camera had followed the rotation of the sky.
Mean of 32 2-second exposures, stars aligned (full resolution image
We now have two images; one where the ground is sharp, and one where the sky is sharp, and we can combine them into a single picture that is sharp everywhere. In Photoshop the easiest way to do that is with a hand-painted layer mask. After adjusting brightness and colors to taste, slight cropping, and removing an ugly "No Trespassing" sign we get a presentable picture:
The North Star above Mount Burdell, shot with Google Pixel, final image (full resolution image
Using Even Less Light
The pictures I've shown so far were shot on nights with a full moon, when it was bright enough that one could easily walk outside without a lantern or a flashlight. I wanted to find out if it was possible to take cellphone photos in even less light. Using a Pixel phone, I tried a scene illuminated by a three-quarter moon low in the sky, and another one with no moon at all. Anticipating more noise in the individual exposures, I shot 64-frame bursts. The processed final images still look fine:
Wrecked fishing boat in Inverness and the Big Dipper, 64 2-second exposures, shot with Google Pixel (full resolution image
Stars above Pierce Point Ranch, 64 2-second exposures, shot with Google Pixel (full resolution image
In the second image the distant lights of the cities around the San Francisco Bay caused the sky near the horizon to glow, but without moonlight the night was still dark enough to make the Milky Way visible. The picture looks noticeably grainier than my earlier moonlight shots, but it's not too bad.
Pushing the Limits
How far can we go? Can we take a cellphone photo with only starlight - no moon, no artificial light sources nearby, and no background glow from a distant city?
To test this I drove to a point on the California coast a little north of the mouth of the
, where nights can get really dark, and pointed my Pixel phone at the summer sky above the ocean. Combining 64 two-second exposures taken at ISO 12800, and 64 corresponding black black frames did produce a recognizable image of the Milky Way. The constellations
are clearly visible, and squinting hard enough one can just barely make out the horizon and one or two rocks in the ocean, but overall, this is not a picture you'd want to print out and frame. Still, this may be the lowest-light cellphone photo ever taken.
Only starlight, shot with Google Pixel (full resolution image
Here we are approaching the limits of what the Pixel camera can do. The camera cannot handle exposure times longer than two seconds. If this restriction was removed we could expose individual frames for eight to ten seconds, and the stars still would not show noticeable motion blur. With longer exposures we could lower the ISO setting, which would significantly reduce noise in the individual frames, and we would get a correspondingly cleaner and more detailed final picture.
Getting back to the original challenge - using a cellphone to reproduce a night-time DSLR shot of the Golden Gate - I did that. Here is what I got:
Golden Gate Bridge at night, shot with Google Nexus 6P (full resolution image
The Moon above San Francisco, shot with Google Nexus 6P (full resolution image
At 9 to 10 MPixels the resolution of these pictures is not as high as what a DSLR camera might produce, but otherwise image quality is surprisingly good: the photos are sharp all the way into the corners, there is not much visible noise, the captured dynamic range is sufficient to avoid saturating all but the brightest highlights, and the colors are pleasing.
Trying to find out if phone cameras might be suitable for outdoor nighttime photography was a fun experiment, and clearly the result is yes, they are. However, arriving at the final images required a lot of careful post-processing on a desktop computer, and the procedure is too cumbersome for all but the most dedicated cellphone photographers. However, with the right software a phone should be able to process the images internally, and if steps such as painting layer masks by hand can be eliminated, it might be possible to do point-and-shoot photography in very low light conditions. Almost - the cellphone would still have to rest on the ground or be mounted on a tripod.
Here’s a Google Photos album
with more examples of photos that were created with the technique described above.
Research at Google and ICLR 2017
Monday, April 24, 2017
Posted by Ian Goodfellow, Staff Research Scientist, Google Brain Team
This week, Toulon, France hosts the
5th International Conference on Learning Representations
(ICLR 2017), a conference focused on how one can learn meaningful and useful representations of data for
. ICLR includes conference and workshop tracks, with invited talks along with oral and poster presentations of some of the latest research on deep learning, metric learning, kernel learning, compositional models, non-linear structured prediction, and issues regarding non-convex optimization.
At the forefront of innovation in cutting-edge technology in
, Google focuses on both theory and application, developing learning approaches to understand and generalize. As Platinum Sponsor of ICLR 2017, Google will have a strong presence with over 50 researchers attending (many from the
Google Brain team
Google Research Europe
), contributing to and learning from the broader academic research community by presenting papers and posters, in addition to participating on organizing committees and in workshops.
If you are attending ICLR 2017, we hope you'll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for billions of people. You can also learn more about our research being presented at ICLR 2017 in the list below (Googlers highlighted in
Area Chairs include:
Program Chairs include:
Understanding Deep Learning Requires Rethinking Generalization
(Best Paper Award)
, Benjamin Recht*, Oriol Vinyals
Semi-Supervised Knowledge Transfer for Deep Learning from Private Training Data
(Best Paper Award)
Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
Shixiang (Shane) Gu*, Timothy Lillicrap, Zoubin Ghahramani, Richard E.
Neural Architecture Search with Reinforcement Learning
Adversarial Machine Learning at Scale
Ian J. Goodfellow
Capacity and Trainability in Recurrent Neural Networks
Improving Policy Gradient by Exploring Under-Appreciated Rewards
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
, Krzysztof Maziarz,
Unrolled Generative Adversarial Networks
, Ben Poole*, David Pfau,
Categorical Reparameterization with Gumbel-Softmax
, Shixiang (Shane) Gu*, Ben Poole*
Decomposing Motion and Content for Natural Video Sequence Prediction
Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin,
Density Estimation Using Real NVP
Latent Sequence Decompositions
William Chan*, Yu Zhang*,
, Navdeep Jaitly*
Learning a Natural Language Interface with Neural Programmer
Quoc V. Le
, Andrew McCallum*, Dario
Deep Information Propagation
, Surya Ganguli,
Identity Matters in Deep Learning
, Tengyu Ma
A Learned Representation For Artistic Style
Adversarial Training Methods for Semi-Supervised Text Classification
Andrew M. Dai
Quoc V. Le
Learning to Remember Rare Events
Deep Learning with Dynamic Computation Graphs
HolStep: A Machine Learning Dataset for Higher-order Logic Theorem Proving
Workshop Track Abstracts
Particle Value Functions
Chris J. Maddison,
, Nicolas Heess, Arnaud Doucet, Andriy Mnih, Yee Whye Teh
Neural Combinatorial Optimization with Reinforcement Learning
Quoc V. Le
Short and Deep: Sketching and Neural Networks
Explaining the Learning Dynamics of Direct Feedback Alignment
Samuel S. Schoenholz
Training a Subsampling Mechanism in Expectation
Tuning Recurrent Neural Networks with Reinforcement Learning
, Shixiang (Shane) Gu*, Richard E. Turner,
REBAR: Low-Variance, Unbiased Gradient Estimates for Discrete Latent Variable Models
, Andriy Mnih, Chris J. Maddison,
Adversarial Examples in the Physical World
Regularizing Neural Networks by Penalizing Confident Output Distributions
Unsupervised Perceptual Rewards for Imitation Learning
Changing Model Behavior at Test-time Using Reinforcement Learning
* Work performed while at Google
† Work performed while at OpenAI
PhotoScan: Taking Glare-Free Pictures of Pictures
Thursday, April 20, 2017
Posted by Ce Liu, Michael Rubinstein, Mike Krainin and Bill Freeman, Research Scientists
Yesterday, we released an
update to PhotoScan
, an app for
that allows you to digitize photo prints with just a smartphone. One of the key features of PhotoScan is the ability to remove glare from prints, which are often glossy and reflective, as are the plastic album pages or glass-covered picture frames that host them. To create this feature, we developed a unique blend of computer vision and image processing techniques that can carefully align and combine several slightly different pictures of a print to separate the glare from the image underneath.
A regular digital picture of a physical print.
Glare-free digital output from PhotoScan
When taking a single picture of a photo, determining which regions of the picture are the actual photo and which regions are glare is challenging to do automatically. Moreover, the glare may often saturate regions in the picture, rendering it impossible to see or recover the parts of the photo underneath it. But if we take several pictures of the photo while moving the camera, the position of the glare tends to change, covering different regions of the photo. In most cases we found that every pixel of the photo is likely not to be covered by glare in at least one of the pictures. While no single view may be glare-free, we can combine multiple pictures of the printed photo taken at different angles to remove the glare. The challenge is that the images need to be aligned very accurately in order to combine them properly, and this processing needs to run very quickly on the phone to provide a near instant experience.
The captured, input images (5 in total).
If we stabilize the images on the photo, we can see just the glare moving, covering different parts of the photo. Notice no single image is glare-free.
Our technique is inspired by our earlier work published at
, which we dubbed “
”. It uses similar principles to remove various types of obstructions from the field of view. However, the algorithm we originally proposed was based on a generative model where the motion and appearance of both the main scene and the obstruction layer are estimated. While that model is quite powerful and can remove a variety of obstructions, it is too computationally expensive to be run on smartphones. We therefore developed a simpler model that treats glare as an outlier, and only attempts to register the underlying, glare-free photo. While this model is simpler, the task is still quite challenging as the registration needs to be highly accurate and robust.
How it Works
We start from a series of pictures of the print taken by the user while moving the camera. The first picture - the “reference frame” - defines the desired output viewpoint. The user is then instructed to take four additional frames. In each additional frame, we detect
sparse feature points
) and use them to establish
mapping each frame to the reference frame.
Left: Detected feature matches between the reference frame and each other frame (left), and the warped frames according to the estimated homographies (right).
While the technique may sound straightforward, there is a catch - homographies are only able to align flat images. But printed photos are often not entirely flat (as is the case with the example shown above). Therefore, we use
— a fundamental, computer vision representation for motion, which establishes pixel-wise mapping between two images — to correct the non-planarities. We start from the homography-aligned frames, and compute “flow fields” to warp the images and further refine the registration. In the example below, notice how the corners of the photo on the left slightly “move” after registering the frames using only homographies. The right hand side shows how the photo is better aligned after refining the registration using optical flow.
Comparison between the warped frames using homographies (left) and after the additional warp refinement using optical flow (right).
The difference in the registration is subtle, but has a big impact on the end result. Notice how small misalignments manifest themselves as duplicated image structures in the result, and how these artifacts are alleviated with the additional flow refinement.
Comparison between the glare removal result with (right) and without (left) optical flow refinement. In the result using homographies only (left), notice artifacts around the eye, nose and teeth of the person, and duplicated stems and flower petals on the fabric.
Here too, the challenge was to make optical flow, a naturally slow algorithm, work very quickly on the phone. Instead of computing optical flow at each pixel as done traditionally (the number of flow vectors computed is equal to the number of input pixels), we represent a flow field by a smaller number of control points, and express the motion at each pixel in the image as a function of the motion at the control points. Specifically, we divide each image into tiled, non-overlapping cells to form a coarse grid, and represent the flow of a pixel in a cell as the
of the flow at the four corners of the cell that contains it.
The grid setup for grid optical flow. A point p is represented as the
of the four corner points of the cell that encapsulates it.
Illustration of the computed flow field on one of the frames.
The flow color coding: orientation and magnitude represented by hue and saturation, respectively.
This results in a much smaller problem to solve, since the number of flow vectors to compute now equals the number of grid points, which is typically much smaller than the number of pixels. This process is similar in nature to the spline-based image registration described in
Szeliski and Coughlan (1997)
. With this algorithm, we were able to reduce the optical flow computation time by a factor of ~40 on a Pixel phone!
Flipping between the homography-registered frame and the flow-refined warped frame (using the above flow field), superimposed on the (clean) reference frame, shows how the computed flow field “snaps” image parts to their corresponding parts in the reference frame, improving the registration.
Finally, in order to compose the glare-free output, for any given location in the registered frames, we examine the pixel values, and use a soft minimum algorithm to obtain the darkest observed value. More specifically, we compute the expectation of the minimum brightness over the registered frames, assigning less weight to pixels close to the (warped) image boundaries. We use this method rather than computing the minimum directly across the frames due to the fact that corresponding pixels at each frame may have slightly different brightness. Therefore, per-pixel minimum can produce visible seams due to sudden intensity changes at boundaries between overlaid images.
Regular minimum (left) versus soft minimum (right) over the registered frames.
The algorithm can support a variety of scanning conditions — matte and gloss prints, photos inside or outside albums, magazine covers.
To get the final result, the Photos team has developed a method that automatically detects and crops the photo area, and rectifies it to a frontal view. Because of perspective distortion, the scanned rectangular photo usually appears to be a quadrangle on the image. The method analyzes image signals, like color and edges, to figure out the exact boundary of the original photo on the scanned image, then applies a geometric transformation to rectify the quadrangle area back to its original rectangular shape yielding high-quality, glare-free digital version of the photo.
So overall, quite a lot going on under the hood, and all done almost instantaneously on your phone! To give PhotoScan a try, download the app on
Teaching Machines to Draw
Thursday, April 13, 2017
Posted by David Ha, Google Brain Resident
Abstract visual communication is a key part of how people convey ideas to one another. From a young age, children develop the ability to depict objects, and arguably even emotions, with only a few pen strokes. These simple drawings may not resemble reality as captured by a photograph, but they do tell us something about how people represent and reconstruct images of the world around them.
Vector drawings produced by sketch-rnn.
In our recent paper, “
A Neural Representation of Sketch Drawings
”, we present a generative
recurrent neural network
capable of producing sketches of common objects, with the goal of training a machine to draw and generalize abstract concepts in a manner similar to humans. We train our model on a dataset of hand-drawn sketches, each represented as a sequence of motor actions controlling a pen: which direction to move, when to lift the pen up, and when to stop drawing. In doing so, we created a model that potentially has many applications, from assisting the creative process of an artist, to helping teach students how to draw.
While there is a already a large body of existing work on generative modelling of images using neural networks, most of the work focuses on modelling
represented as a 2D grid of pixels. While these models are currently able to generate realistic images, due to the high dimensionality of a 2D grid of pixels, a key challenge for them is to generate images with coherent structure. For example, these models sometimes produce amusing images of cats with three or more eyes, or dogs with multiple heads.
Examples of animals generated with the wrong number of body parts, produced using previous GAN models trained on 128x128 ImageNet dataset. The image above is Figure 29 of
Generative Adversarial Networks
, Ian Goodfellow, NIPS 2016 Tutorial.
In this work, we investigate a lower-dimensional vector-based representation inspired by how people draw. Our model, sketch-rnn, is based on the
(seq2seq) autoencoder framework. It incorporates
as recurrent neural network cells. The goal of a seq2seq autoencoder is to train a network to encode an input sequence into a vector of floating point numbers, called a
vector, and from this latent vector reconstruct an output sequence using a decoder that replicates the input sequence as closely as possible.
Schematic of sketch-rnn.
In our model, we deliberately add noise to the latent vector. In our paper, we show that by inducing noise into the communication channel between the encoder and the decoder, the model is no longer be able to reproduce the input sketch exactly, but instead must learn to capture the essence of the sketch as a noisy latent vector. Our decoder takes this latent vector and produces a sequence of motor actions used to construct a new sketch. In the figure below, we feed several actual sketches of cats into the encoder to produce reconstructed sketches using the decoder.
Reconstructions from a model trained on cat sketches.
It is important to emphasize that the reconstructed cat sketches are
copies of the input sketches, but are instead
sketches of cats with similar characteristics as the inputs. To demonstrate that the model is not simply copying from the input sequence, and that it actually learned something about the way people draw cats, we can try to feed in non-standard sketches into the encoder:
When we feed in a sketch of a three-eyed cat, the model generates a similar looking cat that has two eyes instead, suggesting that our model has learned that cats usually only have two eyes. To show that our model is not simply choosing the closest normal-looking cat from a large collection of memorized cat-sketches, we can try to input something totally different, like a sketch of a toothbrush. We see that the network generates a cat-like figure with long whiskers that mimics the features and orientation of the toothbrush. This suggests that the network has learned to encode an input sketch into a set of abstract cat-concepts embedded into the latent vector, and is also able to reconstruct an entirely new sketch based on this latent vector.
Not convinced? We repeat the experiment again on a model trained on pig sketches and arrive at similar conclusions. When presented with an eight-legged pig, the model generates a similar pig with only four legs. If we feed a truck into the pig-drawing model, we get a pig that looks a bit like the truck.
Reconstructions from a model trained on pig sketches.
To investigate how these latent vectors encode conceptual animal features, in the figure below, we first obtain two latent vectors encoded from two very different pigs, in this case a pig head (in the green box) and a full pig (in the orange box). We want to get a sense of how our model learned to represent pigs, and one way to do this is to interpolate between the two different latent vectors, and visualize each generated sketch from each interpolated latent vector. In the figure below, we visualize how the sketch of the pig head slowly morphs into the sketch of the full pig, and in the process show how the model organizes the concepts of pig sketches. We see that the latent vector controls the relatively position and size of the nose relative to the head, and also the existence of the body and legs in the sketch.
Latent space interpolations generated from a model trained on pig sketches.
We would also like to know if our model can learn representations of multiple animals, and if so, what would they look like? In the figure below, we generate sketches from interpolating latent vectors between a cat head and a full pig. We see how the representation slowly transitions from a cat head, to a cat with a tail, to a cat with a fat body, and finally into a full pig. Like a child learning to draw animals, our model learns to construct animals by attaching a head, feet, and a tail to its body. We see that the model is also able to draw cat heads that are distinct from pig heads.
Latent Space Interpolations from a model trained on sketches of both cats and pigs.
These interpolation examples suggest that the latent vectors indeed encode conceptual features of a sketch. But can we use these features to augment other sketches without such features - for example, adding a body to a cat's head?
Learned relationships between abstract concepts, explored using latent vector arithmetic.
Indeed, we find that sketch drawing analogies are possible for our model trained on both cat and pig sketches. For example, we can subtract the latent vector of an encoded pig head from the latent vector of a full pig, to arrive at a vector that represents the concept of a body. Adding this difference to the latent vector of a cat head results in a full cat (i.e. cat head + body = full cat). These drawing analogies allow us to explore how the model organizes its latent space to represent different concepts in the manifold of generated sketches.
In addition to the research component of this work, we are also super excited about potential creative applications of sketch-rnn. For instance, even in the simplest use case, pattern designers can apply sketch-rnn to generate a large number of similar, but unique designs for textile or wallpaper prints.
Similar, but unique cats, generated from a single input sketch (green and yellow boxes).
As we saw earlier, a model trained to draw pigs can be made to draw pig-like trucks if given an input sketch of a truck. We can extend this result to applications that might help creative designers come up with abstract designs that can resonate more with their target audience.
For instance, in the figure below, we feed sketches of four different chairs into our cat-drawing model to produce four chair-like cats. We can go further and incorporate the interpolation methodology described earlier to explore the latent space of chair-like cats, and produce a large grid of generated designs to select from.
Exploring the latent space of generated chair-cats.
Exploring the latent space between different objects can potentially enable creative designers to find interesting intersections and relationships between different drawings.
Exploring the latent space of generated sketches of everyday objects.
Latent space interpolation from left to right, and then top to bottom.
We can also use the decoder module of sketch-rnn as a standalone model and train it to predict different possible endings of incomplete sketches. This technique can lead to applications where the model assists the creative process of an artist by suggesting alternative ways to finish an incomplete sketch. In the figure below, we draw different incomplete sketches (in red), and have the model come up with different possible ways to complete the drawings.
The model can start with incomplete sketches (the red partial sketches to the left of the vertical line) and automatically generate different completions.
We can take this concept even further, and have different models complete the same incomplete sketch. In the figures below, we see how to make the same circle and square figures become a part of various ants, flamingos, helicopters, owls, couches and even paint brushes. By using a diverse set of models trained to draw various objects, designers can explore creative ways to communicate meaningful visual messages to their audience.
Predicting the endings of the same circle and square figures (center) using various sketch-rnn models trained to draw different objects.
We are very excited about the future possibilities of generative vector image modelling. These models will enable many exciting new creative applications in a variety of different directions. They can also serve as a tool to help us improve our understanding of our own creative thought processes. Learn more about sketch-rnn by reading our paper, “
A Neural Representation of Sketch Drawings
We thank Ian Johnson, Jonas Jongejan, Martin Wattenberg, Mike Schuster, Ben Poole, Kyle Kastner, Junyoung Chung, Kyle McDonald for their help with this project. This work was done as part of the
Google Brain Residency
Introducing tf-seq2seq: An Open Source Sequence-to-Sequence Framework in TensorFlow
Tuesday, April 11, 2017
Posted by Anna Goldie and Denny Britz, Research Software Engineer and Google Brain Resident, Google Brain Team
(Crossposted on the
Google Open Source Blog
Last year, we announced
Google Neural Machine Translation
(GNMT), a sequence-to-sequence (“
”) model which is now used in Google Translate production systems. While GNMT achieved huge improvements in translation quality, its impact was limited by the fact that the framework for training these models was unavailable to external researchers.
Today, we are excited to introduce
, an open source seq2seq framework in TensorFlow that makes it easy to experiment with seq2seq models and achieve state-of-the-art results. To that end, we made the tf-seq2seq codebase clean and modular, maintaining full test coverage and documenting all of its functionality.
Our framework supports various configurations of the standard seq2seq model, such as depth of the encoder/decoder, attention mechanism, RNN cell type, or beam size. This versatility allowed us to discover optimal hyperparameters and outperform other frameworks, as described in our paper, “
Massive Exploration of Neural Machine Translation Architectures
A seq2seq model translating from Mandarin to English. At each time step, the encoder takes in one Chinese character and its own previous state (black arrow), and produces an output vector (blue arrow). The decoder then generates an English translation word-by-word, at each time step taking in the last word, the previous state, and a weighted combination of all the outputs of the encoder (aka attention , depicted in blue) and then producing the next English word. Please note that in our implementation we use wordpieces  to handle rare words.
In addition to machine translation, tf-seq2seq can also be applied to any other sequence-to-sequence task (i.e. learning to produce an output sequence given an input sequence), including machine summarization, image captioning, speech recognition, and conversational modeling. We carefully designed our framework to maintain this level of generality and provide tutorials, preprocessed data, and other utilities for
We hope that you will use tf-seq2seq to accelerate (or kick off) your own deep learning research. We also welcome your contributions to our
, where we have a variety of open issues that we would love to have your help with!
We’d like to thank Eugene Brevdo, Melody Guan, Lukasz Kaiser, Quoc V. Le, Thang Luong, and Chris Olah for all their help. For a deeper dive into how seq2seq models work, please see the resources below.
Massive Exploration of Neural Machine Translation Architectures
Denny Britz, Anna Goldie, Minh-Thang Luong, Quoc Le
Sequence to Sequence Learning with Neural Networks
Ilya Sutskever, Oriol Vinyals, Quoc V. Le. NIPS, 2014
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. ICLR, 2015
Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean. Technical Report, 2016
Attention and Augmented Recurrent Neural Networks
Chris Olah, Shan Carter. Distill, 2016
Neural Machine Translation and Sequence-to-sequence Models: A Tutorial
Announcing the 2017 Google PhD Fellows for North America, Europe and the Middle East
Monday, April 10, 2017
Posted by Michael Rennaker, Program Manager
Google created the
PhD Fellowship program
in 2009 to recognize and support outstanding graduate students doing exceptional research in Computer Science and related disciplines. Now in its eighth year, our fellowship program has supported hundreds of future faculty, industry researchers, innovators and entrepreneurs.
Reflecting our continuing commitment to supporting and building relationships with the academic community, we are excited to announce the 33 recipients from North America, Europe and the Middle East. We offer our sincere congratulations to Google’s 2017 Class of Google PhD Fellows.
Algorithms, Optimizations and Markets
Chiu Wai Sam Wong,
University of California, Berkeley
University of Southern California
University of Illinois, Urbana-Champaign
University of Minnesota - Twin Cities
Fondation Sciences Mathématiques de Paris
University College London
New York University
University of Amsterdam
University of Toronto
Machine Perception, Speech Technology and Computer Vision
Saarland University - Saarbrücken GSCS and MPI Institute for Informatics
Imperial College London
University of California, San Diego
University of Texas, Austin
Natural Language Processing
Jianpeng Cheng, The University of Edinburgh
Kevin Clark, Stanford University
Tim Rocktaschel, University College London
Privacy and Security
ENS - École Normale Supérieure
University of Maryland, College Park
Programming Languages and Software Engineering
Christoffer Quist Adamsen,
Muhammad Ali Gulzar,
University of California, Los Angeles
Structured Data and Database Management
University of Illinois, Urbana-Champaign
Systems and Networking
Ahmed M. Said Mohamed Tawfik Issa,
Georgia Institute of Technology
University of California, Irvine
University of California, Berkeley
Predicting Properties of Molecules with Machine Learning
Friday, April 07, 2017
Posted by George Dahl, Research Scientist, Google Brain Team
Recently there have been many exciting applications of machine learning (ML) to chemistry, particularly in chemical search problems, from
to finding better
. Historically, chemists have used numerical approximations to Schrödinger’s equation, such as
Density Functional Theory
(DFT), in these sorts of chemical searches. However, the computational cost of these approximations limits the size of the search. In the hope of enabling larger searches, several research groups have created ML models to predict chemical properties using training data generated by DFT (e.g.
Rupp et al.
Behler and Parrinello
). Expanding upon this previous work, we have been applying various modern ML methods to the
–a public collection of molecules paired with DFT-computed electronic, thermodynamic, and vibrational properties.
We have recently posted two papers describing our research in this area that grew out of a collaboration between the
Google Brain team
, the Google Accelerated Science team,
, and the
University of Basel
includes a new featurization of molecules and a systematic assessment of a multitude of machine learning methods on the QM9 benchmark. After trying many existing approaches on this benchmark, we worked on improving the most promising deep neural network models.
The resulting second paper, “
Neural Message Passing for Quantum Chemistry
,” describes a model family called Message Passing Neural Networks (MPNNs), which are defined abstractly enough to include many previous neural net models that are invariant to graph symmetries. We developed novel variations within the MPNN family which significantly outperform all baseline methods on the QM9 benchmark, with improvements of nearly a factor of four on some targets.
One reason molecular data is so interesting from a machine learning standpoint is that one natural representation of a molecule is as a graph with
atoms as nodes and bonds as edges
. Models that can leverage inherent symmetries in data will tend to generalize better — part of the success of convolutional neural networks on images is due to their ability to incorporate our prior knowledge about the invariances of image data (e.g. a picture of a dog shifted to the left is still a picture of a dog). Invariance to graph symmetries is a particularly desirable property for machine learning models that operate on graph data, and there has been a lot of interesting research in this area as well (e.g.
Li et al.
Duvenaud et al.
Kearnes et al.
Defferrard et al.
). However, despite this progress, much work remains. We would like to find the best versions of these models for chemistry (and other) applications and map out the connections between different models proposed in the literature.
Our MPNNs set a new state of the art for predicting all 13 chemical properties in QM9. On this particular set of molecules, our model can predict 11 of these properties accurately enough to potentially be useful to chemists, but up to 300,000 times faster than it would take to simulate them using DFT. However, much work remains to be done before MPNNs can be of real practical use to chemists. In particular, MPNNs must be applied to a significantly more diverse set of molecules (e.g. larger or with a more varied set of heavy atoms) than exist in QM9. Of course, even with a realistic training set, generalization to very different molecules could still be poor. Overcoming both of these challenges will involve making progress on questions–such as generalization–that are at the heart of machine learning research.
Predicting the properties of molecules is a practically important problem that both benefits from advanced machine learning techniques and presents interesting fundamental research challenges for learning algorithms. Eventually, such predictions could aid the design of new medicines and materials that benefit humanity. At Google, we feel that it’s important to disseminate our research and to help train new researchers in machine learning. As such, we are delighted that the first and second authors of our MPNN paper are
Google Brain residents
Federated Learning: Collaborative Machine Learning without Centralized Training Data
Thursday, April 06, 2017
Posted by Brendan McMahan and Daniel Ramage, Research Scientists
Standard machine learning approaches require centralizing the training data on one machine or in a datacenter. And Google has built one of the most secure and robust cloud infrastructures for processing this data to make our services better. Now for models trained from user interaction with mobile devices, we're introducing an additional approach:
Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices (like the
Mobile Vision API
On-Device Smart Reply
) by bringing model
to the device as well.
It works like this: your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.
Your phone personalizes the model locally, based on your usage (A). Many users' updates are aggregated (B) to form a consensus change (C) to the shared model, after which the procedure is repeated.
Federated Learning allows for smarter models, lower latency, and less power consumption, all while ensuring privacy. And this approach has another immediate benefit: in addition to providing an update to the shared model, the improved model on your phone can also be used immediately, powering experiences personalized by the way you use your phone.
We're currently testing Federated Learning in
Gboard on Android
, the Google Keyboard. When Gboard shows a suggested query, your phone locally stores information about the current context and whether you clicked the suggestion. Federated Learning processes that history on-device to suggest improvements to the next iteration of Gboard’s query suggestion model.
To make Federated Learning possible, we had to overcome many algorithmic and technical challenges. In a typical machine learning system, an optimization algorithm like
Stochastic Gradient Descent
(SGD) runs on a large dataset partitioned homogeneously across servers in the cloud. Such highly iterative algorithms require low-latency, high-throughput connections to the training data. But in the Federated Learning setting, the data is distributed across millions of devices in a highly uneven fashion. In addition, these devices have significantly higher-latency, lower-throughput connections and are only intermittently available for training.
These bandwidth and latency limitations motivate our
Federated Averaging algorithm
, which can train deep networks using 10-100x less communication compared to a naively federated version of SGD. The key idea is to use the powerful processors in modern mobile devices to compute higher quality updates than simple gradient steps. Since it takes fewer iterations of high-quality updates to produce a good model, training can use much less communication. As upload speeds are typically
than download speeds, we also developed a novel way to reduce upload communication costs up to another 100x by
using random rotations and quantization. While these approaches are focused on training deep networks, we've also
for high-dimensional sparse convex models which excel on problems like click-through-rate prediction.
Deploying this technology to millions of heterogenous phones running Gboard requires a sophisticated technology stack. On device training uses a miniature version of
. Careful scheduling ensures training happens only when the device is idle, plugged in, and on a free wireless connection, so there is no impact on the phone's performance.
Your phone participates in Federated Learning only
when it won't negatively impact your experience.
The system then needs to communicate and aggregate the model updates in a secure, efficient, scalable, and fault-tolerant way. It's only the combination of research with this infrastructure that makes the benefits of Federated Learning possible.
Federated learning works without the need to store user data in the cloud, but we're not stopping there. We've developed a
Secure Aggregation protocol
that uses cryptographic techniques so a coordinating server can only decrypt the average update if 100s or 1000s of users have participated — no individual phone's update can be inspected before averaging. It's the first protocol of its kind that is practical for deep-network-sized problems and real-world connectivity constraints. We designed Federated Averaging so the coordinating server only needs the average update, which allows Secure Aggregation to be used; however the protocol is general and can be applied to other problems as well. We're working hard on a production implementation of this protocol and expect to deploy it for Federated Learning applications in the near future.
Our work has only scratched the surface of what is possible. Federated Learning can't solve all machine learning problems (for example, learning to
recognize different dog breeds
by training on carefully labeled examples), and for many other models the necessary training data is already stored in the cloud (like training spam filters for Gmail). So Google will continue to advance the state-of-the-art for cloud-based ML, but we are also committed to ongoing research to expand the range of problems we can solve with Federated Learning. Beyond Gboard query suggestions, for example, we hope to improve the language models that power your keyboard based on what you actually type on your phone (which can have a style all its own) and photo rankings based on what kinds of photos people look at, share, or delete.
Applying Federated Learning requires machine learning practitioners to adopt new tools and a new way of thinking: model development, training, and evaluation with no direct access to or labeling of raw data, with communication cost as a limiting factor. We believe the user benefits of Federated Learning make tackling the technical challenges worthwhile, and are publishing our work with hopes of a widespread conversation within the machine learning community.
This post reflects the work of many people in Google Research, including Blaise Agüera y Arcas, Galen Andrew, Dave Bacon, Keith Bonawitz, Chris Brumme, Arlie Davis, Jac de Haan, Hubert Eichner, Wolfgang Grieskamp, Wei Huang, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Nicholas Kong, Ben Kreuter, Alison Lentz, Stefano Mazzocchi, Sarvar Patel, Martin Pelikan, Aaron Segal, Karn Seth, Ananda Theertha Suresh, Iulia Turc, Felix Yu, Antonio Marcedone and our partners in the Gboard team.
Keeping fake listings off Google Maps
Thursday, April 06, 2017
Posted by Doug Grundman, Maps Anti-Abuse, and Kurt Thomas, Security & Anti-Abuse Research
(Crossposted on the
Google Security blog
Google My Business
enables millions of business owners to create listings and share information about their business on Google Maps and Search, making sure everything is up-to-date and accurate for their customers. Unfortunately, some actors attempt to abuse this service to register fake listings in order to defraud legitimate business owners, or to
charge exorbitant service fees for services
Over a year ago, we teamed up with the University of California, San Diego to research the actors behind fake listings, in order to improve our products and keep our users safe. The full report, “
Pinning Down Abuse on Google Maps
”, will be presented tomorrow at the 2017
International World Wide Web Conference
Our study shows that fewer than 0.5% of local searches lead to fake listings. We’ve also improved how we verify new businesses, which has reduced the number of fake listings by 70% from its all-time peak back in June 2015.
What is a fake listing?
For over a year, we tracked the bad actors behind fake listings. Unlike email-based scams
selling knock-off products online
, local listing scams require physical proximity to potential victims. This fundamentally changes both the scale and types of abuse possible.
Bad actors posing as locksmiths, plumbers, electricians, and other contractors were the most common source of abuse—roughly 2 out of 5 fake listings. The actors operating these fake listings would cycle through non-existent postal addresses and disposable VoIP phone numbers even as their listings were discovered and disabled. The purported addresses for these businesses were irrelevant as the contractors would travel directly to potential victims.
Another 1 in 10 fake listings belonged to real businesses that bad actors had improperly claimed ownership over, such as hotels and restaurants. While making a reservation or ordering a meal was indistinguishable from the real thing, behind the scenes, the bad actors would deceive the actual business into paying referral fees for organic interest.
How does Google My Business verify information?
Google My Business currently verifies the information provided by business owners before making it available to users. For freshly created listings, we physically mail a postcard to the new listings’ address to ensure the location really exists. For businesses changing owners, we make an automated call to the listing’s phone number to verify the change.
Unfortunately, our research showed that these processes can be abused to get fake listings on Google Maps. Fake contractors would request hundreds of postcard verifications to non-existent suites at a single address, such as 123 Main St #456 and 123 Main St #789, or to stores that provided PO boxes. Alternatively, a phishing attack could maliciously repurpose freshly verified business listings by tricking the legitimate owner into sharing verification information sent either by phone or postcard.
Keeping deceptive businesses out — by the numbers
Leveraging our study’s findings, we’ve made significant changes to how we verify addresses and are even
piloting an advanced verification process
for locksmiths and plumbers. Improvements we’ve made include prohibiting bulk registrations at most addresses, preventing businesses from relocating impossibly far from their original address without additional verification, and detecting and ignoring intentionally mangled text in address fields designed to confuse our algorithms. We have also adapted our anti-spam machine learning systems to detect data discrepancies common to fake or deceptive listings.
Combined, here’s how these defenses stack up:
We detect and disable 85% of fake listings before they even appear on Google Maps.
We’ve reduced the number of abusive listings by 70% from its peak back in June 2015.
We’ve also reduced the number of impressions to abusive listings by 70%.
As we’ve shown, verifying local information comes with a number of unique anti-abuse challenges. While fake listings may slip through our defenses from time to time, we are constantly improving our systems to better serve both users and business owners.
Security and Privacy
Adaptive Data Analysis
Automatic Speech Recognition
Electronic Commerce and Algorithms
Google Cloud Platform
Google Play Apps
Google Science Fair
Google Voice Search
High Dynamic Range Imaging
Internet of Things
Natural Language Processing
Natural Language Understanding
Optical Character Recognition
Public Data Explorer
Security and Privacy
Site Reliability Engineering
Give us feedback in our
Official Google Blog
Public Policy Blog
Lat Long Blog
Ads Developer Blog
Android Developers Blog