Google Research Blog
The latest news from Research at Google
Adding Sound Effect Information to YouTube Captions
Thursday, March 23, 2017
Posted by Sourish Chaudhuri, Software Engineer, Sound Understanding
The effect of audio on our perception of the world can hardly be overstated. Its importance as a communication medium via speech is obviously the most familiar, but there is also significant information conveyed by ambient sounds. These ambient sounds create context that we instinctively respond to, like getting startled by sudden commotion, the use of music as a narrative element, or how laughter is used as an audience cue in sitcoms.
YouTube has provided automatic caption tracks for videos, focusing heavily on speech transcription in order to make the content hosted more accessible. However, without similar descriptions of the ambient sounds in videos, much of the information and impact of a video is not captured by speech transcription alone. To address this, we announced the addition of sound effect information to the automatic caption track in YouTube videos, enabling greater access to the richness of all the audio content.
In this post, we discuss the backend system developed for this effort, a collaboration among the Accessibility, Sound Understanding and YouTube teams that used machine learning (ML) to enable the first ever automatic sound effect captioning system for YouTube.
Click the CC button to see the sound effect captioning system in action.
The application of ML, in this case a Deep Neural Network (DNN) model, to the captioning task presented unique challenges. While the process of analyzing the time-domain audio signal of a video to detect various ambient sounds is similar to other well known classification problems (such as object detection in images), in a product setting the solution faces additional difficulties. In particular, given an arbitrary segment of audio, we need our models to be able to 1) detect the desired sounds, 2) temporally localize the sound in the segment and 3) effectively integrate it in the caption track, which may have parallel and independent speech recognition results.
A DNN Model for Ambient Sound
The first challenge we faced in developing the model was the task of obtaining enough labeled data suitable for training our neural network. While labeled ambient sound information is difficult to come by, we were able to generate a large enough dataset for training using weakly labeled data. But of all the ambient sounds in a given video, which ones should we train our DNN to detect?
For the initial launch of this feature, we chose [APPLAUSE], [MUSIC] and [LAUGHTER], prioritized based upon our analysis of human-created caption tracks that indicates that they are among the most frequent sounds that are manually captioned. While the sound space is obviously far richer and provides even more contextually relevant information than these three classes, the semantic information conveyed by these sound effects in the caption track is relatively unambiguous, as opposed to sounds like [RING] which raises the question of “what was it that rang – a bell, an alarm, a phone?”
Much of our initial work on detecting these ambient sounds also included developing the infrastructure and analysis frameworks to enable scaling for future work, including both the detection of sound events and their integration into the automatic caption track. Investing in the development of this infrastructure has the added benefit of allowing us to easily incorporate more sound types in the future, as we expand our algorithms to understand a wider vocabulary of sounds (e.g. [RING], [KNOCK], [BARK]). In doing so, we will be able to incorporate the detected sounds into the narrative to provide more relevant information (e.g. [PIANO MUSIC], [RAUCOUS APPLAUSE]) to viewers.
Dense Detections to Captions
When a video is uploaded to YouTube, the sound effect recognition pipeline runs on the audio stream in the video. The DNN looks at short segments of audio and predicts whether that segment contains any one of the sound events of interest. Since multiple sound effects can co-occur, our model makes a prediction at each time step for each of the sound effects. The segment window is then slid to the right (i.e. to a slightly later point in time) and the model is used to make a prediction again, and so on until it reaches the end. This results in a dense stream of the (likelihood of) presence of each of the sound events in our vocabulary at 100 frames per second.
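To make the sliding-window idea concrete, here is a minimal Python sketch. It is an illustration only, not the production pipeline: the window length, the `classify` callback, and `toy_classifier` (a stand-in for the DNN) are all hypothetical. The only detail taken from the post is the rate of 100 predictions per second, each holding an independent score for every sound class.

```python
from typing import Callable, Dict, List, Sequence

FRAME_RATE_HZ = 100                # from the post: 100 predictions per second
HOP_SECONDS = 1.0 / FRAME_RATE_HZ  # so the window advances 10 ms at a time


def dense_predictions(
    samples: Sequence[float],
    sample_rate: int,
    window_seconds: float,
    classify: Callable[[Sequence[float]], Dict[str, float]],
) -> List[Dict[str, float]]:
    """Slide a fixed-length window over the audio and score every position."""
    window = int(round(window_seconds * sample_rate))
    hop = int(round(HOP_SECONDS * sample_rate))
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must span at least one sample")
    frames = []
    start = 0
    while start + window <= len(samples):
        # One dict per time step, with an independent score per sound class,
        # because several sound effects can be present at once.
        frames.append(classify(samples[start:start + window]))
        start += hop
    return frames


def toy_classifier(segment: Sequence[float]) -> Dict[str, float]:
    """Hypothetical stand-in for the DNN; it only looks at signal energy."""
    energy = sum(x * x for x in segment) / len(segment)
    loud = min(1.0, energy)
    return {"APPLAUSE": loud, "MUSIC": 1.0 - loud, "LAUGHTER": 0.0}
```

For one second of audio at 16 kHz and an assumed 0.5 second window, `dense_predictions(samples, 16000, 0.5, toy_classifier)` yields one score dictionary per 10 ms step.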
The dense prediction stream is not directly exposed to the user, of course, since that would result in captions flickering on and off, and because we know that a number of sound effects have some degree of temporal continuity when they occur; e.g. “music” and “applause” will usually be present for a few seconds at least. To incorporate this intuition, we smooth over the dense prediction stream using a modified Viterbi algorithm containing two states: ON and OFF, with the predicted segments for each sound effect corresponding to the ON state. The figure below provides an illustration of the process in going from the dense detections to the final segments determined to contain sound effects of interest.
(Left) The dense sequence of probabilities from our DNN for the occurrence over time of a single sound category in a video. (Center) Binarized segments based on the modified Viterbi algorithm. (Right) The duration-based filter removes segments that are shorter in duration than desired for the class.
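The smoothing step can be sketched with a textbook two-state Viterbi decoder. This is a plain version written for illustration; the post does not describe what the "modified" variant changes, so the emission model (the DNN probability is used directly as the ON likelihood) and the `stay` transition probability are assumptions.

```python
import math
from typing import List, Sequence

OFF, ON = 0, 1


def viterbi_on_off(probs: Sequence[float], stay: float = 0.9,
                   eps: float = 1e-6) -> List[int]:
    """Return the most likely OFF/ON state sequence for per-frame probabilities.

    `stay` is the assumed probability of remaining in the same state from one
    frame to the next; values close to 1 suppress flicker more strongly.
    """
    if not probs:
        return []
    log_stay = math.log(stay)
    log_switch = math.log(1.0 - stay)

    def emit(p: float):
        p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
        return (math.log(1.0 - p), math.log(p))   # (OFF, ON) log-likelihoods

    score = list(emit(probs[0]))                  # uniform prior on first state
    back = []                                     # back[t-1][s] = best previous state
    for p in probs[1:]:
        e = emit(p)
        new_score, pointers = [], []
        for state in (OFF, ON):
            from_same = score[state] + log_stay
            from_other = score[1 - state] + log_switch
            if from_same >= from_other:
                best, prev = from_same, state
            else:
                best, prev = from_other, 1 - state
            new_score.append(best + e[state])
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the better final state.
    state = OFF if score[OFF] >= score[ON] else ON
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these settings a single-frame spike inside a quiet stretch is absorbed into OFF, while a sustained run of high probabilities survives as an ON segment.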
A classification-based system such as this one will certainly have some errors, and needs to be able to trade off false positives against missed detections as per the product goals. For example, due to the weak labels in the training dataset, the model was often confused between events that tended to co-occur: a segment labeled “laugh” would usually contain both speech and laughter, and the model for “laugh” would have a hard time distinguishing them in test data. In our system, we allow further restrictions based on time spent in the ON state (i.e. do not determine sound X to be detected unless it was determined to be present for at least Y seconds) to push performance toward a desired point in the precision-recall curve.
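The "at least Y seconds" restriction can be sketched as a filter over ON segments. The function names and the example threshold are hypothetical; only the 100 frames per second rate comes from the post.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def on_segments(path: Sequence[int], frame_rate_hz: int = 100) -> List[Segment]:
    """Convert a per-frame 0/1 state sequence into (start, end) times."""
    segments = []
    start = None
    for i, state in enumerate(path):
        if state and start is None:
            start = i                                   # segment opens
        elif not state and start is not None:
            segments.append((start / frame_rate_hz, i / frame_rate_hz))
            start = None                                # segment closes
    if start is not None:                               # still ON at the end
        segments.append((start / frame_rate_hz, len(path) / frame_rate_hz))
    return segments


def filter_by_duration(segments: Sequence[Segment],
                       min_seconds: float) -> List[Segment]:
    """Keep only segments that stayed ON for at least `min_seconds`."""
    return [(s, e) for s, e in segments if e - s >= min_seconds]
```

Raising `min_seconds` for a class trades recall for precision: short, likely spurious detections are dropped, at the cost of also dropping genuinely brief events.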
Once we were satisfied with the performance of our system in temporally localizing sound effect captions based on our offline evaluation metrics, we were faced with the following: how do we combine the sound effect and speech captions to create a single automatic caption track, and how (or when) do we present sound effect information to the user to make it most useful to them?
Adding Sound Effect Information into the Automatic Captions Track
Once we had a system capable of accurately detecting and classifying the ambient sounds in video, we investigated how to convey that information to the viewer in an effective way. In collaboration with our User Experience (UX) research teams, we explored various design options and tested them in a qualitative pilot usability study. The participants of the study had different hearing levels and varying needs for captions. We asked participants a number of questions including whether it improved their overall experience, their ability to follow events in the video and extract relevant information from the caption track, to understand the effect of variables such as:
Using separate parts of the screen for speech and sound effect captions.
Interleaving the speech and sound effect captions as they occur.
Only showing sound effect captions at the end of sentences or when there is a pause in speech (even if they occurred in the middle of speech).
How hearing users perceive captions when watching with the sound off.
While it wasn’t surprising that almost all users appreciated the added sound effect information when it was accurate, we also paid specific attention to the feedback when the sound detection system made an error (a false positive when determining presence of a sound, or failing to detect an occurrence). This presented a surprising result: when sound effect information was incorrect, it did not detract from the participant’s experience in roughly 50% of the cases. Based upon participant feedback, the reasons for this appear to be:
Participants who could hear the audio were able to ignore the inaccuracies.
Participants who could not hear the audio interpreted the error as the presence of a sound event, and that they had not missed out on critical speech information.
Overall, users reported that they would be fine with the system making the occasional mistake as long as it was able to provide good information far more often than not.
Our work toward enabling automatic sound effect captions for YouTube videos and the initial rollout is a step toward making the richness of content in videos more accessible to our users who experience videos in different ways and in different environments that require captions. We’ve developed a framework to enrich the automatic caption track with sound effects, but there is still much to be done here. We hope that this will spur further work and discussion in the community around improving captions using not only automatic techniques, but also around ways to make creator-generated and community-contributed caption tracks richer (including perhaps, starting with the auto-captions) and better to further improve the viewing experience for our users.
Distill: Supporting Clarity in Machine Learning
Monday, March 20, 2017
Posted by Shan Carter, Software Engineer and Chris Olah, Research Scientist, Google Brain Team
Science isn't just about discovering new results. It’s also about human understanding. Scientists need to develop notations, analogies, visualizations, and explanations of ideas. This human dimension of science isn't a minor side project. It's deeply tied to the heart of science.
That’s why, in collaboration with OpenAI, DeepMind, YC Research, and others, we’re excited to announce the launch of Distill, a new open science journal and ecosystem supporting human understanding of machine learning. Distill is an independent organization, dedicated to fostering a new segment of the research community.
Modern web technology gives us powerful new tools for expressing this human dimension of science. We can create interactive diagrams and user interfaces that enable intuitive exploration of research ideas. Over the last few years we've seen
An interactive diagram explaining the Neural Turing Machine from Olah & Carter, 2016.
Unfortunately, while there are a plethora of conferences and journals in machine learning, there aren’t any research venues that are dedicated to publishing this kind of work. This is partly an issue of focus, and partly because traditional publication venues can't, by virtue of their medium, support interactive visualizations. Without a venue to publish in, many significant contributions don’t count as “real academic contributions” and their authors can’t access the academic support structure.
That’s why Distill aims to build an ecosystem to support this kind of work, starting with three pieces: a research journal, prizes recognizing outstanding work, and tools to facilitate the creation of interactive articles.
Distill is an ecosystem to support clarity in Machine Learning
Led by a diverse steering committee of leaders from the machine learning and user interface communities, we are very excited to see where Distill will go. To learn more about Distill, see the
or read the
Announcing Guetzli: A New Open Source JPEG Encoder
Thursday, March 16, 2017
Posted by Robert Obryk and Jyrki Alakuijala, Software Engineers, Google Research Europe
(Cross-posted on the Google Open Source Blog)
At Google, we care about giving users the best possible online experience, both through our own services and products and by contributing new tools and industry standards for use by the online community. That’s why we’re excited to announce Guetzli, a new open source algorithm that creates high quality JPEG images with file sizes 35% smaller than currently available methods, enabling webmasters to create webpages that can load faster and use even less data.
Guetzli [guɛtsli], “cookie” in Swiss German, is a JPEG encoder for digital images and web graphics that can enable faster online experiences by producing smaller JPEG files while still maintaining compatibility with existing browsers, image processing applications and the JPEG standard. From the practical viewpoint this is very similar to our Zopfli algorithm, which produces smaller PNG and gzip files without needing to introduce a new format, and different than the techniques used in RNN-based image compression, which all need client changes for compression gains at internet scale.
The visual quality of JPEG images is directly correlated to its multi-stage compression process: color space transform, discrete cosine transform, and quantization. Guetzli specifically targets the quantization stage in which the more visual quality loss is introduced, the smaller the resulting file. Guetzli strikes a balance between minimal loss and file size by employing a search algorithm that tries to overcome the difference between the psychovisual modeling of JPEG's format, and Guetzli’s psychovisual model, which approximates color perception and visual masking in a more thorough and detailed way than what is achievable by simpler color transforms and the discrete cosine transform. However, while Guetzli creates smaller image file sizes, the tradeoff is that these search algorithms take significantly longer to create compressed images than currently available methods.
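To see why the quantization stage controls the size/quality tradeoff, the sketch below applies an 8x8 discrete cosine transform and then rounds the coefficients to multiples of a step size. It is a simplified illustration and not Guetzli's method: real JPEG encoders use a per-frequency quantization table rather than the single `step` assumed here, and Guetzli's psychovisual search is not modeled at all.

```python
import math
from typing import List

N = 8  # JPEG operates on 8x8 blocks
Block = List[List[float]]


def dct_2d(block: Block) -> Block:
    """Orthonormal 2-D DCT-II of an 8x8 block."""
    def alpha(k: int) -> float:
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def quantize(coeffs: Block, step: float) -> List[List[int]]:
    """Round every coefficient to the nearest multiple of `step` (lossy)."""
    return [[round(c / step) for c in row] for row in coeffs]


def count_nonzero(quantized: List[List[int]]) -> int:
    """Fewer nonzero values generally means fewer bits after entropy coding."""
    return sum(1 for row in quantized for value in row if value != 0)
```

Quantizing the same block with a larger step zeroes out more coefficients, which shrinks the file but discards more detail; choosing where that loss is least visible is the part an encoder can tune.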
Figure 1. 16x16 pixel synthetic example of a phone line hanging against a blue sky, traditionally a case where JPEG compression algorithms suffer from artifacts. Uncompressed original is on the left. Guetzli (on the right) shows fewer ringing artifacts than libjpeg (middle) and has a smaller file size.
And while Guetzli produces smaller image file sizes without sacrificing quality, we additionally found that in experiments where compressed image file sizes are kept constant, human raters consistently preferred the images Guetzli produced over libjpeg images, even when the libjpeg files were the same size or even slightly larger. We think this makes the slower compression a worthy tradeoff.
Figure 2. 20x24 pixel zoomed areas from a picture of a cat’s eye. Uncompressed original on the left. Guetzli (on the right) shows fewer ringing artifacts than libjpeg (middle) without requiring a larger file size.
It is our hope that webmasters and graphic designers will find Guetzli useful and apply it to their photographic content, making users’ experience smoother on image-heavy websites in addition to reducing load times and bandwidth costs for mobile users. Last, we hope that the new explicitly psychovisual approach in Guetzli will inspire further image and video compression research.
An Upgrade to SyntaxNet, New Models and a Parsing Competition
Wednesday, March 15, 2017
Posted by David Weiss and Slav Petrov, Research Scientists
At Google, we continuously improve the language understanding capabilities used in applications including the generation of email responses. Last summer, we open-sourced SyntaxNet, a neural-network framework for analyzing and understanding the grammatical structure of sentences. Included in our release was Parsey McParseface, a state-of-the-art model that we had trained for analyzing English, followed quickly by a collection of pre-trained models for 40 additional languages, which we dubbed Parsey’s Cousins. While we were excited to share our research and to provide these resources to the broader community, building machine learning systems that work well for languages other than English remains an ongoing challenge. We are excited to announce a few new research resources, available now, that address this problem.
We are releasing a major upgrade to SyntaxNet. This upgrade incorporates nearly a year’s worth of our research on multilingual language understanding, and is available to anyone interested in building systems for processing and understanding text. At the core of the upgrade is a
that enables learning of richly layered representations of input sentences. More specifically, the upgrade extends
to allow joint modeling of multiple levels of linguistic structure, and to allow neural-network architectures to be created dynamically during processing of a sentence or document.
Our upgrade makes it easy, for example, to build character-based models that learn to compose individual characters into words (e.g. ‘c-a-t’ spells ‘cat’). By doing so, the models can learn that words can be related to each other because they share common parts (e.g. ‘cats’ is the plural of ‘cat’ and shares the same stem; ‘wildcat’ is a type of ‘cat’). Parsey and Parsey’s Cousins, on the other hand, operated over sequences of words. As a result, they were forced to memorize words seen during training and relied mostly on the context to determine the grammatical function of previously unseen words.
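The benefit of looking inside words can be illustrated without any neural network. The sketch below uses character n-gram overlap as a deliberately simple stand-in for a learned character-based model (it is not how SyntaxNet works internally): related word forms share sub-word pieces, while a word-level lookup table has nothing to say about a word it never saw. The function names, the boundary markers and the toy vocabulary are all hypothetical.

```python
from typing import Dict, Set


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Character n-grams of a word, with ^ and $ marking its boundaries."""
    padded = "^" + word + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def word_level_id(word: str, vocabulary: Dict[str, int], unknown: int = 0) -> int:
    """A word-level model can only look words up; unseen words collapse to one id."""
    return vocabulary.get(word, unknown)
```

Here `overlap("cat", "cats")` is well above `overlap("cat", "dog")`, whereas `word_level_id` maps every unseen word to the same unknown id, regardless of its spelling.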
As an example, consider the following (meaningless but grammatically correct) sentence:
The gostak distims the doshes.
This sentence was originally coined by Andrew Ingraham, who explained: “You do not know what this means; nor do I. But if we assume that it is English, we know that the doshes are distimmed by the gostak. We know too that one distimmer of doshes is a gostak." Systematic patterns in
allow us to guess the grammatical function of words even when they are completely novel: we understand that ‘doshes’ is the plural of the noun ‘dosh’ (similar to the ‘cats’ example above) or that ‘distim’ is the third person singular of the verb distim. Based on this analysis we can then derive the overall structure of this sentence even though we have never seen the words before.
To showcase the new capabilities provided by our upgrade to SyntaxNet, we are releasing a set of new pretrained models called ParseySaurus. These models use the character-based input representation mentioned above and are thus much better at predicting the meaning of new words based both on their spelling and how they are used in context. The ParseySaurus models are far more accurate than Parsey’s Cousins (reducing errors by as much as 25%), particularly for morphologically rich languages like Russian, or agglutinative languages like Turkish and Hungarian. In those languages there can be dozens of forms for each word and many of these forms might never be observed during training, even in a very large corpus.
Consider the following fictitious Russian sentence, where again the stems are meaningless, but the suffixes define an unambiguous interpretation of the sentence structure:
Even though our Russian ParseySaurus model has never seen these words, it can correctly analyze the sentence by inspecting the character sequences which constitute each word. In doing so, the system can determine many properties of the words (notice how many more morphological features there are here than in the English example). To see the sentence as ParseySaurus does, here is a visualization of how the model analyzes this sentence:
Each square represents one node in the neural network graph, and lines show the connections between them. The left-side “tail” of the graph shows the model consuming the input as one long string of characters. These are intermittently passed to the right side, where the rich web of connections shows the model composing words into phrases and producing a syntactic parse. Check out the full-size rendering
You might be wondering whether character-based modeling is all we need or whether there are other techniques that might be important. SyntaxNet has lots more to offer, like
and different training objectives, but there are of course also many other possibilities. To find out what works well in practice we are helping co-organize, together with Charles University and other colleagues, a multilingual parsing competition at this year’s Conference on Computational Natural Language Learning (CoNLL) with the goal of building syntactic parsing systems that work well in real-world settings and for 45 different languages.
The competition is made possible by the Universal Dependencies (UD) initiative, whose goal is to develop cross-linguistically consistent treebanks. Because machine learned models can only be as good as the data that they have access to, we have been contributing data to UD. For the competition, we partnered with UD and
to build a new multilingual evaluation set consisting of 1000 sentences that have been translated into 20+ different languages and annotated by linguists with parse trees. This evaluation set is the first of its kind (in the past, each language had its own independent evaluation set) and will enable more consistent cross-lingual comparisons. Because the sentences have the same meaning and have been annotated according to the same guidelines, we will be able to get closer to answering the question of which languages might be harder to parse.
We hope that the upgraded SyntaxNet framework and our pre-trained ParseySaurus models will inspire researchers to participate in the competition. We have additionally created a
showing how to load a
image and train models on the Google Cloud Platform, to facilitate participation by smaller teams with limited resources. So, if you have an idea for making your own models with the SyntaxNet framework, sign up to compete! We believe that the configurations that we are releasing are a good place to start, but we look forward to seeing how participants will be able to extend and improve these models or perhaps create better ones!
Thanks to everyone involved who made this competition happen, including our collaborators at
, who provide another baseline implementation to make it easy to enter the competition. Happy parsing from the main developers, Chris Alberti, Daniel Andor, Ivan Bogatyy, Mark Omernick, Zora Tung and Ji Ma!
Quick Access in Drive: Using Machine Learning to Save You Time
Friday, March 10, 2017
Posted by Sandeep Tata, Software Engineer, Google Research
At Google, we research cutting-edge machine learning (ML) techniques that allow us to provide products and services aimed at helping you focus on what’s important. From providing
helping you respond to emails
, it is our goal to help you save time, making life — and work — a little more convenient.
have shown that finding information is second only to managing email as a drain on workplace productivity. To help address this, last year we launched Quick Access, a feature in Google Drive that uses ML to surface the most relevant documents as soon as you visit the Google Drive home screen. Originally available only for G Suite customers on Android, Quick Access is now available for anyone who uses Google Drive (on the
), saving you from having to enter a search or to browse through your folders. Our metrics show that Quick Access takes you to the documents you need in half the time compared to manually navigating or searching.
Quick Access uses deep neural networks to determine patterns from various signals, such as activity in Drive, meetings on your Calendar, and more, to anticipate your needs and show the appropriate documents on the Drive home screen. Traditional ML approaches require domain experts to derive complex features from data, which are in turn used to train the model. For Quick Access, however, we constructed thousands of simple features from the various signals above (for instance, the timestamps of the last 20 edit events on a document would constitute 20 simple input features), and combined them with the power of deep neural networks to learn from the aggregated activity of our users. By using deep neural networks we were able to develop accurate predictive models with simpler features and less feature engineering effort.
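As a rough illustration of what "simple features" could look like, the sketch below turns the most recent edit timestamps of a document into a fixed-length vector of 20 numbers. The function name, the encoding as "seconds before now", and the padding value for documents with fewer than 20 edits are assumptions made for this example; the post only states that the timestamps of the last 20 edit events form 20 input features.

```python
from typing import List, Sequence

NUM_EVENTS = 20  # from the post: the last 20 edit events


def edit_recency_features(
    edit_timestamps: Sequence[float],
    now: float,
    num_events: int = NUM_EVENTS,
    missing: float = -1.0,
) -> List[float]:
    """Build one feature per recent edit, newest first, padded to a fixed length."""
    recent = sorted(edit_timestamps, reverse=True)[:num_events]
    features = [now - t for t in recent]                    # age of each edit in seconds
    features += [missing] * (num_events - len(features))    # pad short histories
    return features
```

No hand-crafted aggregate (such as "edited weekly") is computed here; the idea described in the post is that a deep network can learn such patterns from many raw features like these.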
Quick Access suggestions on the top row in Drive on a desktop browser.
The model computes a relevance score for each of the documents in Drive and the top scoring documents are presented on the home screen. For example, if you have a Calendar entry for a meeting with a coworker in the next few minutes, Quick Access might predict that the presentation you’ve been working on with that coworker is more relevant compared to your monthly budget spreadsheet or the photos you uploaded last week. If you’ve been updating a spreadsheet every weekend, then next weekend, Quick Access will likely display that spreadsheet ahead of the other documents you viewed during the week.
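Selecting what to show is then a top-k problem over the per-document scores. A minimal sketch, assuming the scores are already available as a dictionary (the document names, score values and `k` below are made up):

```python
import heapq
from typing import Dict, List


def top_documents(scores: Dict[str, float], k: int = 4) -> List[str]:
    """Return the k document ids with the highest relevance score, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical scores for illustration only.
example_scores = {"deck": 0.92, "budget": 0.31, "photos": 0.12, "notes": 0.55}
suggested = top_documents(example_scores, k=2)
```

`heapq.nlargest` avoids sorting the whole collection, which matters when only a handful of suggestions are drawn from many documents.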
We hope Quick Access helps you use Drive more effectively, allowing you to save time and be more productive. To learn more, watch this talk from Google Cloud Next ‘17 that dives into more details on the ML behind Quick Access.
Thanks to Alexandrin Popescul and Marc Najork for contributions that made this application of machine learning technology possible. This work was in close collaboration with several engineers on the Drive team including Sean Abraham, Brian Calaci, Mike Colagrosso, Mike Procopio, Jesse Sterr, and Timothy Vis.
Assisting Pathologists in Detecting Cancer with Deep Learning
Friday, March 03, 2017
Posted by Martin Stumpe, Technical Lead, and Lily Peng, Product Manager
A pathologist’s report after reviewing a patient’s biological tissue samples is often the gold standard in the diagnosis of many diseases. For cancer in particular, a pathologist’s diagnosis has a profound impact on a patient’s therapy. The reviewing of pathology slides is a very complex task, requiring years of training to gain the expertise and experience to do well.
Even with this extensive training, there can be substantial variability in the diagnoses given by different pathologists for the same patient, which can lead to misdiagnoses. For example, agreement in diagnosis for some forms of breast cancer can be
as low as 48%
for prostate cancer. The lack of agreement is not surprising given the massive amount of information that must be reviewed in order to make an accurate diagnosis. Pathologists are responsible for reviewing all the biological tissues visible on a slide. However, there can be many slides per patient, each of which is 10+ gigapixels when digitized at 40X magnification. Imagine having to go through a thousand 10 megapixel (MP) photos, and having to be responsible for every pixel. Needless to say, this is a lot of data to cover, and often time is limited.
To address these issues of limited time and diagnostic variability, we are investigating how deep learning can be applied to digital pathology, by creating an automated detection algorithm that can naturally complement pathologists’ workflow. We used images (graciously provided by the Radboud University Medical Center) which have also been used for the 2016 ISBI Camelyon Challenge to train algorithms that were optimized for localization of breast cancer that has spread (metastasized) to lymph nodes adjacent to the breast.
The results? Standard “off-the-shelf” deep learning approaches like Inception (aka GoogLeNet) worked reasonably well for both tasks, although the tumor probability prediction heatmaps produced were a bit noisy. After additional customization, including training networks to examine the image at different magnifications (much like what a pathologist does), we showed that it was possible to train a model that either matched or exceeded the performance of a pathologist who had unlimited time to examine the slides.
Left: Images from two lymph node biopsies. Middle: earlier results of our deep learning tumor detection. Right: our current results. Notice the visibly reduced noise (potential false positives) between the two versions.
In fact, the prediction heatmaps produced by the algorithm had improved so much that the localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint. We were not the only ones to see promising results, as other groups were getting scores as high as 81% with the same dataset. Even more exciting for us was that our model generalized very well, even to images that were acquired from a different hospital using different scanners. For full details, see our paper “Detecting Cancer Metastases on Gigapixel Pathology Images”.
A closeup of a lymph node biopsy. The tissue contains a breast cancer metastasis as well as macrophages, which look similar to tumor but are benign normal tissue. Our algorithm successfully identifies the tumor region (bright green) and is not confused by the macrophages.
While these results are promising, there are a few important caveats to consider.
Like most metrics, the FROC localization score is not perfect. Here, the FROC score is defined as the sensitivity (percentage of tumors detected) at a few pre-defined average false positives per slide. It is pretty rare for a pathologist to make a false positive call (mistaking normal cells as tumor). For example, the score of 73% mentioned above corresponds to a 73% sensitivity and zero false positives. By contrast, our algorithm’s sensitivity rises when more false positives are allowed. At 8 false positives per slide, our algorithms had a sensitivity of 92%.
These algorithms perform well for the tasks for which they are trained, but lack the breadth of knowledge and experience of human pathologists — for example, being able to detect other abnormalities that the model has not been explicitly trained to classify (e.g. inflammatory process, autoimmune disease, or other types of cancer).
To ensure the best clinical outcome for patients, these algorithms need to be incorporated in a way that complements the pathologist’s workflow. We envision that algorithms such as ours could improve the efficiency and consistency of pathologists. For example, pathologists could reduce their false negative rates (percentage of undetected tumors) by reviewing the top ranked predicted tumor regions including up to 8 false positive regions per slide. As another example, these algorithms could enable pathologists to easily and accurately measure tumor size, a factor that is associated with prognosis.
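The FROC caveat above is easier to follow with a small sketch of how such a score can be computed: lower the detection threshold step by step, track sensitivity against average false positives per slide, and average the sensitivity reached at a few fixed false positive rates. The data layout and the default rates below are assumptions for illustration (the post only says "a few pre-defined" rates and mentions 8 false positives per slide), and ties between confidence values are not treated specially.

```python
from typing import Hashable, List, Optional, Sequence, Tuple

# (confidence, id of the tumor that was hit, or None for a false positive)
Detection = Tuple[float, Optional[Hashable]]


def froc_score(
    detections: Sequence[Detection],
    num_tumors: int,
    num_slides: int,
    fp_rates: Sequence[float] = (0.25, 0.5, 1, 2, 4, 8),  # assumed defaults
) -> float:
    """Mean sensitivity at the given average false positives per slide."""
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    found = set()
    false_positives = 0
    # Each point is (avg false positives per slide, sensitivity) at one threshold.
    curve: List[Tuple[float, float]] = [(0.0, 0.0)]
    for _, tumor_id in ordered:
        if tumor_id is None:
            false_positives += 1
        else:
            found.add(tumor_id)        # repeated hits on one tumor count once
        curve.append((false_positives / num_slides, len(found) / num_tumors))
    sensitivities = [
        max(sens for fp, sens in curve if fp <= rate) for rate in fp_rates
    ]
    return sum(sensitivities) / len(sensitivities)
```

A reader who never raises a false alarm contributes the same sensitivity at every rate, while an algorithm's sensitivity can keep climbing as more false positives are tolerated, which is why the single averaged number hides part of the comparison.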
Training models is just the first of many steps in translating interesting research to a real product. From clinical validation to regulatory approval, much of the journey from “bench to bedside” still lies ahead — but we are off to a very promising start, and we hope by sharing our work, we will be able to accelerate progress in this space.
For those who might be interested, the
, which builds upon the 2016 challenge, is currently underway.
The pathologist ended up spending 30 hours on this task on 130 slides.
Google Research Awards 2016
Thursday, February 23, 2017
Posted by Maggie Johnson, Director of Education and University Relations, Google
We’ve just completed another round of the Google Research Awards, our annual open call for proposals on computer science and related topics including natural language processing. Our grants cover tuition for a graduate student and provide both faculty and students the opportunity to work directly with Google researchers and engineers.
This round we received 876 proposals covering 44 countries and over 300 universities. After expert reviews and committee discussions, we decided to fund 143 projects. Here are a few observations from this round:
The subject areas that received the most support were
Proposals related to Machine learning represented 20% of the total submissions received, up from 12% in 2015.
Proportionally, proposals from Europe had a 4% higher acceptance rate, attributed to our increased research presence in Zürich.
Congratulations to the well-deserving recipients of this round’s awards. If you are interested in applying for the next round (deadline is September 30th), please visit
for more information.