Google Research Blog
Evaluation of Speech for the Google Assistant
Thursday, December 21, 2017
Posted by Enrique Alfonseca, Staff Research Scientist, Google Assistant
Voice interactions with technology are becoming a key part of our lives, from asking your phone for traffic conditions on the way to work to using a smart device at home to turn on the lights or play music. The Google Assistant is designed to provide help and information across a variety of platforms, and is built to bring together a number of products, including Google Maps, Search, Google Photos, third-party services, and more. For some of these products, we have released specific evaluation guidelines, like
Search Quality Rating Guidelines
. However, the Google Assistant needs its own guidelines, as many of its interactions use what is called “eyes-free technology,” where there is no screen as part of the experience.
In the past we have received requests to see our evaluation guidelines from academics who are researching improvements in voice interactions, question answering and voice-guided exploration. To facilitate their evaluations, we are
publishing some of the first Google Assistant guidelines
. It is our hope that making these guidelines public will help the research community build and evaluate their own systems.
Creating the Guidelines
For many queries, responses are presented on the display (like a phone) with a graph, a table, or an interactive element, like you’d see for [
weather this weekend
].
But spoken responses are very different from display results, as what’s on screen needs to be translated into useful speech. Furthermore, the contents of the voice response are sometimes sourced from the web, and in those cases it’s important to provide the user with a link to the original source. While users looking at their mobile device can click through to read the original web page, an eyes-free solution presents unique challenges. In order to generate the optimal audio response, we use a combination of
explicit linguistic knowledge and deep learning solutions
that allow us to keep answers grammatical, fluent and concise.
How do we ensure that we consistently meet user expectations on quality, across all answer types and languages? One of the tools we use to measure this is human evaluation. In these evaluations, we ask raters to make sure that answers are satisfactory across several dimensions:
Information Satisfaction:
the content of the answer should meet the information needs of the user.
Length:
when a displayed answer is too long, users can quickly scan it visually and locate the relevant information. For voice answers, that is not possible. It is much more important to ensure that we provide a helpful amount of information, hopefully not too much or too little. Some of
our previous work
is currently in use for identifying the most relevant fragments of answers.
Formulation:
it is much easier to understand a badly formulated written answer than an ungrammatical spoken answer, so more care has to be taken to ensure grammatical correctness.
Elocution:
spoken answers must have proper pronunciation and prosody. Improvements in text-to-speech generation, such as
WaveNet
and
Tacotron 2
, are quickly reducing the gap with human performance.
The current version of the guidelines can be found
here
. Of course, guidelines are often updated, and these are just a snapshot of something that is a living, changing, always-work-in-progress evaluation!
Labels:
Information Retrieval
,
Search
,
TTS
,
Voice Search
Tacotron 2: Generating Human-like Speech from Text
Tuesday, December 19, 2017
Posted by Jonathan Shen and Ruoming Pang, Software Engineers, on behalf of the Google Brain and Machine Perception Teams
Generating very natural sounding speech from text (text-to-speech, TTS) has been a research goal for decades. There has been great progress in TTS research over the last few years and many individual pieces of a complete TTS system have greatly improved. Incorporating ideas from past work such as
Tacotron
and
WaveNet
, we added more improvements to end up with our new system,
Tacotron 2
. Our approach does not use complex linguistic and acoustic features as input. Instead, we generate human-like speech from text using neural networks trained using only speech examples and corresponding text transcripts.
A full description of our new system can be found in our paper “
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
.” In a nutshell it works like this: We use a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally these features are converted to a 24 kHz waveform using a
WaveNet
-like architecture.
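The numbers in the paragraph above fix how many waveform samples each spectrogram frame has to account for. The following is a minimal sketch of that arithmetic, not code from the Tacotron 2 system; the function names are hypothetical, and edge padding is ignored.

```python
from typing import Tuple

# Values stated in the post.
SAMPLE_RATE_HZ = 24_000   # output waveform sample rate
FRAME_SHIFT_MS = 12.5     # one spectrogram frame every 12.5 ms
NUM_FEATURES = 80         # dimensions per spectrogram frame


def samples_per_frame(sample_rate_hz: int, frame_shift_ms: float) -> int:
    """Number of waveform samples spanned by one frame shift."""
    return round(sample_rate_hz * frame_shift_ms / 1000.0)


def spectrogram_shape(duration_s: float) -> Tuple[int, int]:
    """(frames, features) for an utterance of the given length, ignoring padding."""
    hop = samples_per_frame(SAMPLE_RATE_HZ, FRAME_SHIFT_MS)
    num_samples = round(duration_s * SAMPLE_RATE_HZ)
    return num_samples // hop, NUM_FEATURES


hop = samples_per_frame(SAMPLE_RATE_HZ, FRAME_SHIFT_MS)
shape = spectrogram_shape(2.0)
print(hop, shape)
```

Under these assumptions, each frame corresponds to 300 output samples, so the final waveform-generation stage has to expand every 80-dimensional frame into 300 audio samples.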
A detailed look at Tacotron 2's model architecture. The lower half of the image describes the sequence-to-sequence model that maps a sequence of letters to a spectrogram. For technical details, please refer to
the paper
.
You can listen to some of the
Tacotron 2 audio samples
that demonstrate the results of our state-of-the-art TTS system. In an evaluation where we asked human listeners to rate the naturalness of the generated speech, we obtained a score that was comparable to that of professional recordings.
While our samples sound great, there are still some difficult problems to be tackled. For example, our system has difficulties pronouncing complex words (such as “
decorum
” and “
merlot
”), and in extreme cases it can even randomly generate strange noises. Also, our system cannot yet generate audio in realtime. Furthermore, we cannot yet control the generated speech, such as directing it to sound happy or sad. Each of these is an interesting research problem on its own.
Acknowledgements
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu, Sound Understanding team, TTS Research team, and TensorFlow team.
Labels:
Audio
,
Deep Learning
,
Google Brain
,
Machine Perception
,
Publications
,
Research
,
TTS
Introducing NIMA: Neural Image Assessment
Monday, December 18, 2017
Posted by Hossein Talebi, Software Engineer, and Peyman Milanfar, Research Scientist, Machine Perception
Quantification of image quality and aesthetics has been a long-standing problem in image processing and computer vision. While technical quality assessment deals with measuring pixel-level degradations such as noise, blur, compression artifacts, etc., aesthetic assessment captures semantic level characteristics associated with emotions and beauty in images. Recently, deep
convolutional neural networks
(CNNs) trained with human-labelled data have been used to
address the subjective nature of image quality
for specific classes of images, such as landscapes. However, these approaches can be limited in their scope, as they typically categorize images into two classes of low and high quality. Our proposed method predicts the distribution of ratings. This leads to a more accurate quality prediction with higher correlation to the ground truth ratings, and is applicable to general images.
In “
NIMA: Neural Image Assessment
” we introduce a deep CNN that is trained to predict which images a typical user would rate as looking good (technically) or attractive (aesthetically). NIMA relies on the success of state-of-the-art deep
object recognition
networks, building on their ability to understand general categories of objects despite many variations. Our proposed network can be used not only to score images reliably and with high correlation to human perception, but also for a variety of labor-intensive and subjective tasks such as intelligent photo editing, optimizing visual quality for increased user engagement, or minimizing perceived visual errors in an imaging pipeline.
Background
In general, image quality assessment can be categorized into full-reference and no-reference approaches. If a reference “ideal” image is available, quality can be measured against it with metrics such as
PSNR
,
SSIM
, etc. When a reference image is not available, “blind” (or no-reference) approaches rely on statistical models to predict image quality. The main goal of both approaches is to predict a quality score that correlates well with human perception. In a deep CNN approach to image quality assessment, weights are initialized by training on object classification related datasets (e.g.
ImageNet
), and then fine-tuned on annotated data for perceptual quality assessment tasks.
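To make the full-reference case concrete, here is a minimal sketch of PSNR, which compares a distorted image against its reference through the mean squared error. Images are represented as flat lists of pixel values purely for illustration, and the `psnr` helper and the 8-bit peak value of 255 are assumptions of this sketch rather than anything specified in the post.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], distorted: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels between two equally sized images."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [10, 20, 30, 40]
slightly_off = [11, 21, 29, 41]     # every pixel differs by 1
badly_off = [30, 0, 60, 10]         # much larger errors
print(psnr(reference, slightly_off), psnr(reference, badly_off))
```

A higher PSNR means the distorted image is closer to the reference. A no-reference method has no `reference` argument to compare against, which is why it must predict quality from the image alone.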
NIMA
Typical aesthetic prediction methods categorize images as low/high quality. This is despite the fact that each image in the training data is associated with a histogram of human ratings, rather than a single binary score. A histogram of ratings is an indicator of the overall quality of an image, as well as of the agreement among raters. In our approach, instead of classifying images as low/high quality or regressing to the mean score, the NIMA model produces a distribution of ratings for any given image: on a scale of 1 to 10, NIMA assigns a likelihood to each of the possible scores. This is more directly in line with how training data is typically captured, and it turns out to be a better predictor of human preferences when measured against other approaches (more details are available in our
paper
).
Various functions of the NIMA vector score (such as the mean) can then be used to rank photos aesthetically. Some test photos from the large-scale database for Aesthetic Visual Analysis (
AVA
) dataset, as ranked by NIMA, are shown below. Each AVA photo is scored by an average of 200 people in response to
photography contests
. After training, the aesthetic ranking of these photos by NIMA closely matches the mean scores given by human raters. We find that NIMA performs equally well on other datasets, with predicted quality scores close to human ratings.
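As a rough illustration of ranking by the mean of a predicted rating distribution, the sketch below turns 10-bin distributions into mean scores and sorts images by them. The distributions and file names are made up for illustration and are not NIMA outputs; `mean_score` and `rank_by_mean` are hypothetical helpers.

```python
from typing import Dict, List, Sequence

SCORES = list(range(1, 11))  # the 1-to-10 rating scale described in the post


def mean_score(distribution: Sequence[float]) -> float:
    """Expected rating under a 10-bin distribution (normalized defensively)."""
    if len(distribution) != len(SCORES):
        raise ValueError("expected one probability per score from 1 to 10")
    total = sum(distribution)
    return sum(s * p for s, p in zip(SCORES, distribution)) / total


def rank_by_mean(predictions: Dict[str, Sequence[float]]) -> List[str]:
    """Image names sorted from highest to lowest mean score."""
    return sorted(predictions, key=lambda name: mean_score(predictions[name]),
                  reverse=True)


# Hypothetical distributions, invented for this example (not model outputs).
predictions = {
    "uniform.jpg": [0.1] * 10,
    "high.jpg": [0, 0, 0, 0, 0.1, 0.2, 0.4, 0.2, 0.1, 0],
    "low.jpg": [0.1, 0.2, 0.4, 0.2, 0.1, 0, 0, 0, 0, 0],
}
ranked = rank_by_mean(predictions)
print(ranked)
```

Other functions of the same distribution, such as its spread, could be used in the same way to reflect how much raters are expected to disagree.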
Ranking some examples labelled with the “landscape” tag from
AVA
dataset using NIMA. Predicted NIMA (and ground truth) scores are shown below each image.
NIMA scores can also be used to compare the quality of images of the same subject which may have been distorted in various ways. Images shown in the following example are part of the
TID2013
test set, which contains various types and levels of distortions.
Ranking some examples from
TID2013
dataset using NIMA. Predicted NIMA scores are shown below each image.
Perceptual Image Enhancement
As we’ve shown in another recent
paper
, quality and aesthetic scores can also be used to perceptually tune image enhancement operators. In other words, maximizing NIMA score as part of a loss function can increase the likelihood of enhancing perceptual quality of an image. The following example shows that NIMA can be used as a training loss to tune a tone enhancement algorithm. We observed that the baseline aesthetic ratings can be improved by contrast adjustments directed by the NIMA score. Consequently, our model is able to guide a deep CNN filter to find aesthetically near-optimal settings of its parameters, such as brightness, highlights and shadows.
NIMA can be used as a training loss to enhance images. In this example, local tone and contrast of images are enhanced by training a deep CNN with NIMA as its loss. Test images are obtained from the
MIT-Adobe FiveK dataset
.
Looking Ahead
Our work on NIMA suggests that quality assessment models based on machine learning may be capable of a wide range of useful functions. For instance, we may enable users to easily find the best pictures among many, or even enable improved picture-taking with real-time feedback to the user. On the post-processing side, these models may be used to guide enhancement operators to produce perceptually superior results. In a direct sense, the NIMA network (and others like it) can act as reasonable, though imperfect, proxies for human taste in photos and possibly videos. We’re excited to share these results, though we know that the quest to do better in understanding what quality and aesthetics mean is an ongoing challenge, one that will involve continued retraining and testing of our models.
Labels:
Computational Imaging
,
Computer Vision
,
Image Processing
,
Machine Learning
Improving End-to-End Models For Speech Recognition
Thursday, December 14, 2017
Posted by Tara N. Sainath, Research Scientist, Speech Team and Yonghui Wu, Software Engineer, Google Brain Team
Traditional automatic speech recognition (ASR) systems, used for a variety of voice search applications at Google, consist of an acoustic model (AM), a pronunciation model (PM) and a language model (LM), all of which are independently trained, and often manually designed, on different datasets [1]. AMs take acoustic features and predict a set of subword units, typically context-dependent or context-independent phonemes. Next, a hand-designed lexicon (the PM) maps a sequence of
phonemes
produced by the acoustic model to words. Finally, the LM assigns probabilities to word sequences. Training independent components creates added complexities and is suboptimal compared to training all components jointly. Over the last several years, end-to-end systems, which attempt to learn these separate components jointly as a single system, have grown in popularity. While these end-to-end models have shown promising results in the literature [2, 3], it is not yet clear if such approaches can improve on current state-of-the-art conventional systems.
Today we are excited to share “
State-of-the-art Speech Recognition With Sequence-to-Sequence Models
[4],” which describes a new end-to-end model that surpasses the performance of a conventional production system [1]. We show that our end-to-end system achieves a
word error rate
(WER) of 5.6%, which corresponds to a 16% relative improvement over a strong conventional system which achieves a 6.7% WER. Additionally, the end-to-end model used to output the initial word hypothesis, before any hypothesis rescoring, is 18 times smaller than the conventional model, as it contains no separate LM and PM.
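For readers unfamiliar with the metric, the sketch below shows one common way to compute WER (word-level edit distance divided by the number of reference words) and checks the relative improvement quoted above. The example sentences and both helper functions are illustrative assumptions, not the evaluation code used for the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the new WER reduces the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# One deleted word ("the") and one substituted word out of five.
wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light")
gain = relative_improvement(6.7, 5.6)
print(wer, round(gain * 100, 1))
```

With the figures from the post, (6.7 - 5.6) / 6.7 is roughly 0.164, which matches the reported 16% relative improvement.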
Our system builds on the Listen-Attend-Spell (LAS) end-to-end architecture, first presented in [2]. The LAS architecture consists of 3 components. The
listener
encoder component, which is similar to a standard AM, takes a time-frequency representation of the input speech signal,
x
, and uses a set of neural network layers to map the input to a higher-level feature representation,
h^enc
. The output of the encoder is passed to an
attender
, which uses
h^enc
to learn an alignment between input features
x
and predicted subword units {y_n, …, y_0}, where each subword is typically a
grapheme
or
wordpiece
. Finally, the output of the attention module is passed to the
speller
(i.e., decoder), similar to an LM, that produces a probability distribution over a set of hypothesized words.
Components of the LAS End-to-End Model.
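To give a feel for what the attender computes, here is a toy sketch of a single attention step: it scores each encoder frame against the current decoder state, normalizes the scores into alignment weights, and returns their weighted sum as a context vector. Dot-product scoring is an assumption made for brevity; the post does not specify the scoring function, and the actual model uses learned neural network layers.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(h_enc: Sequence[Sequence[float]],
           decoder_state: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One attention step over encoder outputs (dot-product scoring assumed)."""
    # Score each encoder frame against the decoder state.
    scores = [sum(h * q for h, q in zip(frame, decoder_state)) for frame in h_enc]
    weights = softmax(scores)  # alignment over input frames
    dim = len(h_enc[0])
    # Context vector: alignment-weighted sum of encoder frames.
    context = [sum(w * frame[d] for w, frame in zip(weights, h_enc))
               for d in range(dim)]
    return weights, context


# Three toy encoder frames; the decoder state points at the first one.
h_enc = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
weights, context = attend(h_enc, decoder_state=[5.0, 0.0])
print(weights, context)
```

In the full model, the speller would consume a context vector like this one at each output step to predict the next subword unit.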
All components of the LAS model are trained jointly as a single end-to-end neural network, instead of as separate modules like conventional systems, making it much simpler.
Additionally, because the LAS model is fully neural, there is no need for external, manually designed components such as finite state transducers, a lexicon, or text normalization modules. Finally, unlike conventional models, end-to-end models do not require bootstrapping from decision trees or time alignments generated from a separate system, and can be trained given only pairs of text transcripts and the corresponding acoustics.
In [4], we introduce a variety of novel structural improvements, including improving the attention vectors passed to the decoder and training with longer subword units (i.e., wordpieces). In addition, we introduce numerous optimization improvements for training, including the use of minimum word error rate training [5]. These structural and optimization improvements account for the 16% relative improvement over the conventional model.
Another exciting potential application for this research is multi-dialect and multi-lingual systems, where the simplicity of optimizing a single neural network makes such a model very attractive. Here data for all dialects/languages can be combined to train one network, without the need for a separate AM, PM and LM for each dialect/language. We find that these models work well on 7 English dialects [6] and 9 Indian languages [7], while outperforming a model trained separately on each individual language/dialect.
While we are excited by our results, our work is not done. Currently, these models cannot process speech in real time [8, 9, 10], which is a strong requirement for latency-sensitive applications such as voice search. In addition, these models still compare negatively to production when evaluated on live production data. Furthermore, our end-to-end model is trained on 22 million audio-text utterance pairs, compared to a conventional system that is typically trained on significantly larger corpora. In addition, our proposed model is not able to learn proper spellings for rarely used words such as proper nouns, which is normally performed with a hand-designed PM. Our ongoing efforts are now focused on addressing these challenges.
Acknowledgements
This work was done as a strong collaborative effort between Google Brain and Speech teams. Contributors include Tara Sainath, Rohit Prabhavalkar, Bo Li, Kanishka Rao, Shankar Kumar, Shubham Toshniwal, Michiel Bacchiani and Johan Schalkwyk from the Speech team; as well as Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-cheng Chiu, Anjuli Kannan, Ron Weiss, Navdeep Jaitly, William Chan, Yu Zhang and Jan Chorowski from the Google Brain team. The work is described in more detail in papers [4-12].
Labels:
Acoustic Modeling
,
Deep Learning
,
Google Brain
,
Speech Recognition
A Summary of the First Conference on Robot Learning
Wednesday, December 13, 2017
Posted by Vincent Vanhoucke, Principal Scientist, Google Brain Team and Melanie Saldaña, Program Manager, University Relations
Whether in the form of autonomous vehicles, home assistants or disaster rescue units, robotic systems of the future will need to be able to operate safely and effectively in human-centric environments. In contrast to their industrial counterparts, they will require a very high level of perceptual awareness of the world around them, and will need to adapt to continuous changes in both their goals and their environment. Machine learning is a natural answer to both the problems of perception and generalization to unseen environments, and with the recent rapid progress in computer vision and learning capabilities, applying these new technologies to the field of robotics is becoming a very central research question.
This past November, Google helped kickstart and host the first
Conference on Robot Learning (CoRL)
at our campus in Mountain View. The goal of CoRL was to bring machine learning and robotics experts together for the first time in a single-track conference, in order to foster new research avenues between the two disciplines. The sold-out conference attracted 350 researchers from many institutions worldwide, who collectively presented
74 original papers
, along with
5 keynotes
by some of the most innovative researchers in the field.
Prof. Sergey Levine, CoRL 2017 co-chair, answering audience questions.
Sayna Ebrahimi (UC Berkeley) presenting her research.
Videos of the inaugural CoRL are available on the
conference website
. Additionally, we are delighted to announce that next year, CoRL moves to Europe! CoRL 2018 will be chaired by
Professor Aude Billard
from the
École Polytechnique Fédérale de Lausanne
, and will tentatively be held in the
Eidgenössische Technische Hochschule
(ETH) in Zürich on October 29th-31st, 2018. Looking forward to seeing you there!
Prof. Ken Goldberg, CoRL 2017 co-chair, and Jeffrey Mahler (UC Berkeley) during a break.
Labels:
conference
,
conferences
,
Google Brain
,
Machine Learning
,
Robotics
,
University Relations
TFGAN: A Lightweight Library for Generative Adversarial Networks
Tuesday, December 12, 2017
Posted by Joel Shor, Senior Software Engineer, Machine Perception
(Crossposted on the
Google Open Source Blog
)
Training a neural network usually involves defining a loss function, which tells the network how close or far it is from its objective. For example, image classification networks are often given a loss function that penalizes them for giving wrong classifications; a network that mislabels a dog picture as a cat will get a high loss. However, not all problems have easily-defined loss functions, especially if they involve human perception, such as
image compression
or
text-to-speech systems
.
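As a small illustration of an easily defined loss, the sketch below uses cross-entropy, one common choice for classification, to show how a confident wrong answer on a dog picture is penalized far more than a confident right one. The probabilities are invented for the example and `cross_entropy` is a hypothetical helper.

```python
import math
from typing import Dict


def cross_entropy(predicted_probs: Dict[str, float], true_label: str) -> float:
    """Negative log-probability that the network assigned to the correct class."""
    return -math.log(predicted_probs[true_label])


# Made-up network outputs for a picture whose true label is "dog".
confident_and_right = {"dog": 0.95, "cat": 0.05}
confident_and_wrong = {"dog": 0.05, "cat": 0.95}

low_loss = cross_entropy(confident_and_right, "dog")
high_loss = cross_entropy(confident_and_wrong, "dog")
print(low_loss, high_loss)
```

No comparably simple formula says how natural a compressed image looks or how human a synthesized voice sounds, which is the gap the text goes on to address.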
Generative Adversarial Networks
(GANs), a machine learning technique that has led to improvements in a wide range of applications including
generating images from text
,
superresolution
, and
helping robots learn to grasp
, offer a solution. However, GANs introduce new theoretical and software engineering challenges, and it can be difficult to keep up with the rapid pace of GAN research.
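The idea can be sketched with the standard adversarial losses from the original GAN formulation: a discriminator is trained to tell real examples from generated ones, and the generator is trained to fool it, so the discriminator effectively becomes a learned loss function. This is a minimal sketch of that textbook formulation (with the commonly used non-saturating generator loss), not the TFGAN implementation; the input probabilities are invented for illustration.

```python
import math
from typing import Sequence


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Penalize low scores on real examples and high scores on generated ones."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake: Sequence[float]) -> float:
    """Non-saturating generator loss: reward fooling the discriminator."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Discriminator outputs are probabilities that an input is real (made-up values).
undecided = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
sharp = discriminator_loss(d_real=[0.9, 0.9], d_fake=[0.1, 0.1])
fooled = generator_loss(d_fake=[0.5, 0.5])
caught = generator_loss(d_fake=[0.1, 0.1])
print(undecided, sharp, fooled, caught)
```

The two losses pull in opposite directions: as the discriminator gets better at spotting generated examples, its own loss falls while the generator's loss rises.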
A video of a generator improving over time. It begins by producing random noise, and eventually learns to generate MNIST digits.
In order to make GANs easier to experiment with, we’ve open sourced
TFGAN
, a lightweight library designed to make it easy to train and evaluate GANs. It provides the infrastructure to easily train a GAN, provides well-tested loss and evaluation metrics, and gives easy-to-use
examples
that highlight the expressiveness and flexibility of TFGAN. We’ve also released a
tutorial
that includes a high-level API to quickly get a model trained on your data.
This demonstrates the effect of an adversarial loss on
image compression
. The top row shows image patches from the
ImageNet dataset
. The middle row shows the results of compressing and decompressing an image through an image compression neural network trained on a traditional loss. The bottom row shows the results from a network trained with a traditional loss and an adversarial loss. The GAN-loss images are sharper and more detailed, even if they are less like the original.
TFGAN supports experiments in a few important ways. It provides simple function calls that cover the majority of GAN use-cases so you can get a model running on your data in just a few lines of code, but is built in a modular way to cover more exotic GAN designs as well. You can just use the modules you want — loss, evaluation, features, training, etc. are all independent. TFGAN’s lightweight design also means you can use it alongside other frameworks, or with native TensorFlow code. GAN models written using TFGAN will easily benefit from future infrastructure improvements, and you can select from a large number of already-implemented losses and features without having to rewrite your own. Lastly, the code is well-tested, so you don’t have to worry about numerical or statistical mistakes that are easily made with GAN libraries.
Most neural text-to-speech (TTS) systems produce over-smoothed spectrograms. When applied to the
Tacotron
TTS system, a GAN can recreate some of the realistic texture, which reduces artifacts in the resulting audio.
When you use TFGAN, you’ll be using the same infrastructure that many Google researchers use, and you’ll have access to the cutting-edge improvements that we develop with the library. Anyone can contribute to the GitHub repositories, which we hope will facilitate code-sharing among ML researchers and users.
Labels:
Machine Learning
,
open source
,
Software
,
TensorFlow
Introducing Appsperiments: Exploring the Potentials of Mobile Photography
Monday, December 11, 2017
Posted by Alex Kauffmann, Interaction Researcher, Google Research
Each of the world's approximately two billion smartphone owners is carrying a camera capable of capturing photos and video of a tonal richness and quality unimaginable even five years ago. Until recently, those cameras behaved mostly as optical sensors, capturing light and operating on the resulting image's pixels. The next generation of cameras, however, will have the capability to blend hardware and computer vision algorithms that also operate on an image's semantic content, enabling radically new creative mobile photo and video applications.
Today, we're launching the first installment of a series of photography
appsperiments
: usable and useful mobile photography experiences built on experimental technology. Our "appsperimental" approach was inspired in part by
Motion Stills
, an app developed by researchers at Google that converts short videos into cinemagraphs and time lapses using experimental stabilization and rendering technologies. Our appsperiments replicate this approach by building on other technologies in development at Google. They rely on
object recognition
,
person segmentation
, stylization algorithms, efficient image encoding and decoding technologies, and perhaps most importantly, fun!
Storyboard
Storyboard (
Android
) transforms your videos into single-page comic layouts, entirely on device. Simply shoot a video and load it in Storyboard. The app automatically selects interesting video frames, lays them out, and applies one of six visual styles. Save the comic or pull down to refresh and instantly produce a new one. There are approximately 1.6 trillion different possibilities!
Selfissimo!
Selfissimo! (
iOS
,
Android
) is an automated selfie photographer that snaps a stylish black and white photo each time you pose. Tap the screen to start a photoshoot. The app encourages you to pose and captures a photo whenever you stop moving. Tap the screen to end the session and review the resulting contact sheet, saving individual images or the entire shoot.
Scrubbies
Scrubbies (
iOS
) lets you easily manipulate the speed and direction of video playback to produce delightful video loops that highlight actions, capture funny faces, and replay moments. Shoot a video in the app and then remix it by scratching it like a DJ. Scrubbing with one finger plays the video. Scrubbing with two fingers captures the playback so you can save or share it.
Try them out and tell us what you think using the in-app feedback links. The feedback and ideas we get from the new and creative ways people use our appsperiments will help guide some of the technology we develop next.
Acknowledgements
These appsperiments represent a collaboration across many teams at Google. We would like to thank the core contributors Andy Dahley, Ashley Ma, Dexter Allen, Ignacio Garcia Dorado, Madison Le, Mark Bowers, Pascal Getreuer, Robin Debreuil, Suhong Jin, and William Lindmeier. We also wish to give special thanks to Buck Bourdon, Hossein Talebi, Kanstantsin Sokal, Karthik Raveendran, Matthias Grundmann, Peyman Milanfar, Suril Shah, Tomas Izo, Tyler Mullen, and Zheng Sun.
Labels:
Computational Photography
,
Machine Perception
,
Research
