Google Research Blog
The latest news from Research at Google
Google voice search: faster and more accurate
Thursday, September 24, 2015
Posted by Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays and Johan Schalkwyk – Google Speech Team
Back in 2012,
that Google voice search had taken a new turn by adopting
Deep Neural Networks
(DNNs) as the core technology used to model the sounds of a language. These replaced the 30-year old standard in the industry: the Gaussian Mixture Model (GMM). DNNs were better able to assess which sound a user is producing at every instant in time, and with this they delivered greatly increased speech recognition accuracy.
Today, we’re happy to announce we built even better neural network acoustic models using
Connectionist Temporal Classification
sequence discriminative training techniques
. These models are a special extension of
recurrent neural networks
(RNNs) that are more accurate, especially in noisy environments, and they are blazingly fast!
In a traditional speech recognizer, the waveform spoken by a user is split into small consecutive slices or “frames” of 10 milliseconds of audio. Each frame is analyzed for its frequency content, and the resulting feature vector is passed through an acoustic model such as a DNN that outputs a probability distribution over all the phonemes (sounds) in the model. A Hidden Markov Model (HMM) helps to impose some temporal structure on this sequence of probability distributions. This is then combined with other knowledge sources such as a Pronunciation Model that links sequences of sounds to valid words in the target language and a Language Model that expresses how likely given word sequences are in that language. The recognizer then reconciles all this information to determine the sentence the user is speaking. If the user speaks the word “museum” for example - /m j u z i @ m/ in
- it may be hard to tell where the /j/ sound ends and where the /u/ starts, but in truth the recognizer doesn’t care where exactly that transition happens: All it cares about is that these sounds were spoken.
Our improved acoustic models rely on Recurrent Neural Networks (RNN). RNNs have feedback loops in their topology, allowing them to model temporal dependencies: when the user speaks /u/ in the previous example, their articulatory apparatus is coming from a /j/ sound and from an /m/ sound before. Try saying it out loud - “museum” - it flows very naturally in one breath, and RNNs can capture that. The type of RNN used here is a
Long Short-Term Memory
(LSTM) RNN which, through memory cells and a sophisticated gating mechanism, memorizes information better than other RNNs. Adopting such models
already improved the quality
of our recognizer significantly.
The next step was to train the models to recognize phonemes in an utterance without requiring them to make a prediction for each time instant. With Connectionist Temporal Classification, the models are trained to output a sequence of “spikes” that reveals the sequence of sounds in the waveform. They can do this in any way as long as the sequence is correct.
The tricky part though was how to make this happen in real-time. After many iterations, we managed to train streaming, unidirectional, models that consume the incoming audio in larger chunks than conventional models, but do actual computations less often. With this, we drastically reduced computations and made the recognizer much faster. We also added artificial noise and reverberation to the training data, making the recognizer more robust to ambient noise. You can watch a model learning a sentence
We now had a faster and more accurate acoustic model and were excited to launch it on real voice traffic. However, we had to solve another problem - the model was delaying its phoneme predictions by about 300 milliseconds: it had just learned it could make better predictions by listening further ahead in the speech signal! This was smart, but it would mean extra latency for our users, which was not acceptable. We solved this problem by training the model to output phoneme predictions much closer to the ground-truth timing of the speech.
The CTC recognizer outputs spikes as it identifies various phonetic units (in various colors) in the input speech signal. The x-axis shows the acoustic input timing for phonemes and y-axis shows the posterior probabilities as predicted by the neural network. The dotted line shows where the model chooses not to output a phoneme.
We are happy to announce that
our new acoustic models
are now used for voice searches and commands in the
(on Android and iOS), and for dictation on Android devices. In addition to requiring much lower computational resources, the new models are more accurate, robust to noise, and faster to respond to voice search queries - so give it a try, and happy (voice) searching!
Google Voice Search
A Beginner’s Guide to Deep Neural Networks
Tuesday, September 22, 2015
Posted by Natalie Hammel and Lorraine Yurshansky, creators of Nat & Lo’s 20% Project
Last year, we (a couple of people who knew nothing about how voice search works) set out to make a video
about the research that’s gone into teaching computers to recognize speech and understand language
Making the video was eye-opening and brain-opening. It introduced us to concepts we’d never heard of – like machine learning and artificial neural networks – and ever since, we’ve been kind of fascinated by them. Machine learning, in particular, is a very active area of Computer Science research, with far-ranging applications beyond voice search – like
image recognition and description
Google Voice transcription
So... still curious to know more (and having just started
) we found Google researchers
and ambushed them with our machine learning questions.
This video is our attempt to distill what we learned from talking with them, but if anything in it piques your curiosity, or you have other questions, you’re in luck! On
Friday, September 25, at 1 PM PDT / 4 PM EST
Greg and Chris will be doing an
Ask Me Anything on Reddit
(see the calendar
) to answer your deep learning questions.
Everyone who’s curious is welcome to join, ask questions, and hopefully gain a better understanding of the world of machine learning and deep neural networks. (And we’ll be hanging out with them, too...in case you have any questions about video making or dogs.) We hope to see you this Friday!
Information sharing for more efficient network utilization and management
Thursday, September 17, 2015
Andreas Terzis, Software Engineer
As Internet traffic has grown and changed, Google and other content and application providers have worked cooperatively with Internet service providers (ISPs) so that services can be delivered quickly, efficiently and cost-effectively. For example, rather than content having to traverse a long distance and many different networks to reach an Internet access provider’s network, a content provider might store (cache) the data close by and interconnect (‘peer’) directly with the access provider. Google has invested billions of dollars in the network and infrastructure necessary to bring our services as close to your Internet access provider’s front door as possible, for free – which both reduces ISPs’ costs and improves the user experience.
Content and application providers can also tune their services for congested and/or lower bandwidth environments. For instance, YouTube detects how smoothly a video is playing and adjusts the quality to account for temporary fluctuations in bandwidth or congestion. In the
Google Video Quality Report
, we transparently reveal the speeds YouTube is experiencing on different networks.
As more of Internet traffic becomes encrypted, some network operators have expressed concern about the effect encryption might have on their ability to manage their networks. We don’t think there has to be a trade-off here – there are ways to do effective network management of encrypted traffic today, and, through further cooperation between content and application providers and ISPs, we believe this could be made easier while still respecting encryption.
To spur discussion and collaboration on this front, we recently submitted a
to a workshop organized by the
Internet Architecture Board
outlining some ideas. We advocate for a model where ISPs selectively share network state to content and applications providers, enabling them to adapt to available network resources.
For example, we recently proposed to the
Internet Engineering Task Force
the concept of
(TG), whereby mobile network operators could share information about the throughput of a radio downlink. Preliminary field tests in a production LTE network showed that TG reduces YouTube join latency, defined as the amount of time until the video starts playing, by 8% on average, rebuffering time by 20% on average, and rebuffer count by 2% on average. In addition to improving quality of experience for users, this mechanism improves the utilization of providers’ networks. Encryption of traffic would have no impact on the efficacy of this approach; it works equally well with encrypted and unencrypted traffic.
Throughput Guidance is one possible solution and many questions remain unanswered. It’s still relatively early days in our exploration of this and the other measures in our short paper, and we’re looking forward to getting feedback and collaborating with network operators and others.
Crowdsourcing a Text-to-Speech voice for low resource languages (episode 1)
Tuesday, September 08, 2015
Posted by Linne Ha, Senior Program Manager, Google Research for Low Resource Languages
Building a decent text-to-speech (TTS) voice for any language can be challenging, but creating one – a good, intelligible one – for a low resource language can be downright impossible. By definition, working with low resource languages can feel like a losing proposition – from the get go, there is not enough audio data, and the data that exists may be questionable in quality. High quality audio data, and lots of it, is key to developing a high quality machine learning model. To make matters worse, most of the world’s oldest, richest spoken languages fall into this category. There are currently over 300 languages, each spoken by at least one million people, and most will be overlooked by technologists for various reasons. One important reason is that there is not enough data to conduct meaningful research and development.
Project Unison is an on-going Google research effort, in collaboration with the Speech team, to explore innovative approaches to building a TTS voice for low resource languages – quickly, inexpensively and efficiently. This blog post will be one of several to track progress of this experiment and to share our experience with the research community at large – our successes and failures in a trial and error, iterative approach – as our adventure plays out.
One of the most critical aspects of building a TTS system is acquiring audio data. The traditional way to do this is in a professional recording studio with a voice talent, sound engineer and a voice director. The process can take considerable time and can be quite expensive. People often mistake voice talent work to be similar to a news reader, but it is highly specialized and the work can be very difficult.
Such investments in time and money may yield great audio, but the catch is that even if you’ve created the best TTS voice from these recordings, at best it will still sound exactly like the voice talent - the person who provided the raw audio data. (We’ve read the
about people who have fallen for their GPS voice to find that they are real people with real names.) So the interesting problem here from a research perspective is how to create a voice that sounds human but is not identifiable as a singular person.
Crowd-sourcing projects for automatic speech recognition
(ASR) for Google Voice Search had been successful in the past, with public volunteers eager to participate by providing voice samples. For ASR, the objective is to collect from a diversity of speakers and environments, capturing varying regional accents. The polar opposite is true of TTS, where one unique speaker, with the standard accent and in a soundproof studio is the basic criteria.
Many years ago,
, Digital Signal Processing researcher on the TTS team in Google London, wrote a “manifesto” for acoustic data collection for 2000 languages. In his document, he gave technical specifications on how to convert an average room into a recording studio.
, software engineer in Google Research for Low Resource Languages, built a tool that we call “ChitChat”, a portable recording studio, using Yannis’ specifications. This web app allows users to read the prompt, playback the recording and even assess the noise level of the room.
in ASR, we knew that the right tool could solve the crowd sourcing problem. ChitChat allowed us to experiment in different environments to get an idea of what kind of office space would work and what kind of problems we might encounter. After experimenting with several different laptops and tablets, we were able to find a computer that recognized the necessary peripherals (the microphone, USB converter, and preamp) for under $2,000 – much cheaper than a recording studio!
Now we needed multiple speakers of a single language. For us, it was a no-brainer to pilot Project Unison with Bangladeshi Googlers, all of whom are passionate about getting Google products to their home country (the success of Android products in Bangladesh is an example of this). Googlers by and large are passionate about their work and many offer their 20% time as a way to help, to improve or to experiment on something that may or may not work because they care. The Bangladeshi Googlers are no exception. They embodied our objectives for a crowdsourcing innovation: out of many, we could achieve (literally) one voice.
With multiple speakers, we would target speakers of similar vocal profiles and adapt them to create a blended voice. Statistical parametric synthesis is not new, but the advances in recent technology have improved quality and proved to be a lightweight solution for a project like ours.
In May of this year, we auditioned 15 Bangaldeshi Googlers in Mountain View. From these recordings, the broader Bangladeshi Google community voted blindly for their preferred voice.
, software engineer in Machine Intelligence, was chosen as our reference for the Bangla voice. We then narrowed down the group to five speakers based on these criteria: Dhaka accent, male (to match Zakaria’s), similarity in pitch and tone, and availability for recordings. The original plan of a spectral analysis using
proved to be unnecessary with our limited pool of candidates.
All 5 software engineers –
Md. Arifuzzaman Arif
Sabbir Yousuf Sanny
recorded over 3 days in the anechoic chamber, a makeshift sound-proofed room at the Mountain View campus just before Ramadan.
, who had helped with the Korean TTS recordings, directed our volunteers.
TPM Mohammad Khan measures the distance from the speaker to the mic to keep the sound quality consistent across all speakers.
Analytical Linguist HyunJeong Choe coaches SWE Ahmed Chowdury on how to speak in a friendly, knowledgeable, "Googly" voice
ChitChat allowed us to troubleshoot on the fly as recordings could be monitored from another room using the admin panel. In total, we recorded 2000 Bangla and English phrases mined from Wikipedia. In 30-60 minute intervals, the participants recorded over 250 sentences each.
In this session, we discovered an issue: a sudden drop in amplitude at high frequencies in a few recordings. We were worried that all the recordings might have to be scrapped.
As illustrated in the third image, speaker3 has a drop in energy above 13kHz which is visible in the graph and may be present at speech, distorting the speaker’s voice to sound as if he were speaking through a tube.
Another challenge was that we didn’t have a pronunciation lexicon for Bangla as spoken in Bangladesh. We worked initially with
the publicly available TTS data from the Indian Institute of Information Technology
, but this represented the variant of Bangla spoken in West Bengal (India), which differs from the speech we recorded. Our internally designed pronunciation rules for Bengali were also aimed at West Bengal and would need to be revised later.
Deciding to proceed anyway,
, Speech software engineer and lead for TTS for Low Resource Languages in Google London, built an initial prototype voice. Using the preliminary text normalization rules created by
, Speech and Language Processing researcher, the first voice we attempted proved to be surprisingly good. The problem in the high frequencies we had seen in the recordings is undetectable in the parametric voice.
When we return to the sound studio to record an additional 200 longer sentences, we plan to try an upgrade of the USB converter. Meanwhile,
, Natural Language Understanding software engineer, has worked with a team of native speakers on a pronunciation and lexicon and model that better matches the phonology of colloquial Bangladeshi Bangla. Alexander will use the additional recordings and the new pronunciation dictionary to build the second version.
NEXT UP: Building a parametric voice with multiple speaker data (Ep.2)
Automatic Speech Recognition
Adaptive Data Analysis
Automatic Speech Recognition
Electronic Commerce and Algorithms
Google Cloud Platform
Google Play Apps
Google Science Fair
Google Voice Search
High Dynamic Range Imaging
Internet of Things
Natural Language Processing
Natural Language Understanding
Optical Character Recognition
Public Data Explorer
Security and Privacy
Site Reliability Engineering
Give us feedback in our
Official Google Blog
Public Policy Blog
Lat Long Blog
Ads Developer Blog
Android Developers Blog