Google Research Blog
The latest news from Research at Google
Under the hood of Croatian, Filipino, Ukrainian, and Vietnamese in Google Voice Search
Thursday, July 25, 2013
Posted by Eugene Weinstein and Pedro Moreno, Google Speech Team
Although we’ve been working on speech recognition for several years, every new language requires our engineers and scientists to tackle unique challenges. Our most recent additions (Croatian, Filipino, Ukrainian, and Vietnamese) required creative solutions to reflect how each language is used across devices and in everyday conversations.
For example, since Vietnamese is a tonal language, we had to explore how to take tones into consideration. One simple technique is to model the tone and vowel combinations (tonemes) directly in our lexicons. This, however, has the side effect of a larger phonetic inventory. As a result we had to come up with special algorithms to handle the increased complexity. Additionally, Vietnamese is a heavily diacritized language, with tone markers on a majority of syllables. Since Google Search is very good at returning valid results even when diacritics are omitted, our Vietnamese users frequently omit the diacritics when typing their queries. This creates difficulties for the speech recognizer, which selects its vocabulary from typed queries. To address this, we created a special diacritic restoration algorithm which enables us to present properly formatted text to our users in the majority of cases.
Filipino also presented interesting challenges. Much like in other multilingual societies such as Hong Kong, India, and South Africa, Filipinos often mix several languages in their daily life. This is called code switching. Code switching complicates the design of pronunciation, language, and acoustic models. Speech scientists are effectively faced with a dilemma: should we build one system per language, or should we combine all languages into one?
In such situations we prefer to model the reality of daily language use in our speech recognizer design. If users mix several languages, our recognizers should do their best in modeling this behavior. Hence our Filipino voice search system, while mainly focused on the Filipino language, also allows users to mix in English terms.
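One common way to let a recognizer accept mixed-language input is to interpolate a primary-language model with a secondary one. The sketch below shows that general idea with invented word probabilities and an invented mixing weight; the post does not say this is how the Filipino system works.

```python
# Minimal sketch of language-model interpolation as one way to support
# code switching (illustrative probabilities and weight; not the
# production approach).
def interpolated_prob(word, p_filipino, p_english, lam=0.8):
    """P(word) = lam * P_fil(word) + (1 - lam) * P_eng(word)."""
    return lam * p_filipino.get(word, 0.0) + (1 - lam) * p_english.get(word, 0.0)

p_fil = {"kumusta": 0.02, "salamat": 0.03}   # toy Filipino unigram model
p_eng = {"download": 0.01, "salamat": 0.0}   # toy English unigram model

# A Filipino word keeps most of its probability mass...
print(interpolated_prob("salamat", p_fil, p_eng))   # ~0.024
# ...while an English-only term still gets nonzero probability,
# so mixed queries are not ruled out.
print(interpolated_prob("download", p_fil, p_eng))  # ~0.002
```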
The algorithms we’re using to model how speech sounds are spoken in each language make use of our distributed large-scale neural network learning infrastructure (yes, the same one that spontaneously discovered cats on YouTube!). By partitioning the gigantic parameter set of the model, and by evaluating each partition on a separate computation server, we’re able to achieve unprecedented levels of parallelism in training acoustic models.
The more people use Google speech recognition products, the more accurate the technology becomes. These new neural network technologies will help us bring you lots of improvements and many more languages in the future.
11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts
Wednesday, July 17, 2013
Posted by Dave Orr, Amar Subramanya, Evgeniy Gabrilovich, and Michael Ringgaard, Google Research
“I assume that by knowing the truth you mean knowing things as they really are.”
- Plato
When you type in a search query -- perhaps Plato -- are you interested in the string of letters you typed? Or the concept or entity represented by that string? Most likely the latter. But knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval -- you have to know what the string actually refers to. The Knowledge Graph and Freebase are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.
We’ve previously released data to help with disambiguation and recently awarded $1.2M in research grants to work on related problems. Today we’re taking another step: releasing data consisting of nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.
These Freebase Annotations of the ClueWeb Corpora (FACC) consist of ClueWeb09 FACC and ClueWeb12 FACC. Across the two corpora, 11 billion phrases that refer to concepts and entities in Freebase were automatically labeled with their unique identifiers (Freebase MIDs).
Since the annotation process was automatic, it likely made mistakes. We optimized for precision over recall, so the algorithm skipped a phrase if it wasn’t confident enough of the correct MID. If you prefer even higher precision, we include confidence levels, so you can filter out the lower-confidence annotations that we did include.
Based on a review of a sample of documents, we believe the precision is about 80-85%, and recall (which is inherently difficult to measure in situations like this) is in the range of 70-85%. Not every ClueWeb document is included in this corpus: documents in which we found no entities were excluded from the set. A document might be excluded because there were no entities to be found, because the entities in question weren’t in Freebase, or because none of the entities were resolved at a confidence level above the threshold.
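Filtering by confidence might look like the following sketch; the record layout and the MIDs shown are hypothetical placeholders, not the actual FACC file format.

```python
# A minimal sketch of filtering automatic entity annotations by
# confidence. The record fields and MIDs below are hypothetical.
annotations = [
    {"phrase": "Plato",    "mid": "/m/0abc01", "confidence": 0.97},
    {"phrase": "Republic", "mid": "/m/0abc02", "confidence": 0.41},
]

def filter_by_confidence(records, threshold=0.8):
    """Keep only annotations at or above the confidence threshold,
    trading recall for higher precision."""
    return [r for r in records if r["confidence"] >= threshold]

high_precision = filter_by_confidence(annotations)
print([r["phrase"] for r in high_precision])  # ['Plato']
```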
The ClueWeb data is used in multiple TREC tracks. You may also be interested in our annotations of several TREC query sets, including those from the Million Query Track and Web Track.
If you would prefer a human-annotated set, you might want to look at the Wikilinks Corpus we released last year. Entities there were disambiguated by links to Wikipedia, inserted by the authors of the page, which is effectively a form of human annotation.
You can find more detail and download the data on the pages for the two sets: ClueWeb09 FACC and ClueWeb12 FACC. You can also subscribe to our data release mailing list to learn about releases as they happen.
Special thanks to Jamie Callan and Juan Caicedo Carvajal for their help throughout the annotation project.
New research from Google shows that 88% of the traffic generated by mobile search ads is not replaced by traffic originating from mobile organic search
Tuesday, July 16, 2013
Posted by Shaun Lysen, Statistician at Google
Oftentimes, people are presented with two choices after making a search on their devices: they can either click on the organic results for their query, or on the ads that appear on the page. Website owners who want to build a strong online presence often wonder how to balance organic search and paid search ads in driving website traffic. But what happens when ads are paused? Would businesses see an increase in organic traffic that could make up for the loss in paid traffic? To answer these questions, we released a “Search Ads Pause” analysis in 2011 showing that 89% of traffic generated by search ads is not replaced by organic clicks.
As smartphones have become increasingly important to consumers, we recently conducted the same studies for mobile devices to understand the role of mobile search ads in driving site traffic. From March 2012 to April 2013, we ran 327 unique studies across US-based mobile advertising accounts from 12 key industries.
We selected AdWords accounts that exhibited sharp changes in advertisers’ spending on mobile search (ad spend) and identified stable periods before the spend change (pre-period) and after it (post-period). We observed the number of organic and paid clicks, and the number of times organic results appeared on the first page of search results (impressions), during both the pre-period and the post-period. We then created a proprietary statistical model to predict what the number of organic and paid clicks would have been in the post-period had the ad spend not changed, and compared those figures to the actual number of clicks observed. This allowed us to estimate what percentage of paid clicks are incremental, i.e., clicks whose visits to the advertiser’s site would not have been replaced by visits from organic clicks.
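The incrementality estimate described above can be illustrated with a toy calculation: compare observed post-period organic clicks against the counterfactual prediction, and see what fraction of the lost paid clicks was not made up. All numbers below are invented, and Google’s actual statistical model is proprietary.

```python
# Toy incrementality calculation (invented numbers; the real model is
# a proprietary counterfactual prediction, not this arithmetic alone).
def incremental_fraction(paid_clicks_lost, organic_observed, organic_predicted):
    """Fraction of lost paid clicks NOT replaced by extra organic clicks.

    organic_predicted is what the model says organic clicks would have
    been had ad spend not changed."""
    organic_gain = max(0, organic_observed - organic_predicted)
    return 1 - organic_gain / paid_clicks_lost

# Suppose pausing ads removed 1,000 paid clicks, and organic clicks came
# in at 2,120 versus a predicted 2,000 without the change.
print(incremental_fraction(1000, 2120, 2000))  # -> 0.88
```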
The final results showed that mobile search ads contribute to a very high proportion of incremental traffic to websites. On average, 88% of mobile paid clicks are lost and not recovered when a mobile search campaign is paused. This finding is consistently high across the 12 key industries, including automotive, travel, retail and more. The full study, including details around the methodology and findings, can be found in the paper ‘Incremental Clicks Impact of Mobile Search Advertising’.
Google Databoard: A new way to explore industry research
Tuesday, July 09, 2013
Posted by Adam Grunewald, Mobile Marketing Manager
It’s important for people to stay up to date on the most recent research and insights related to their work or personal lives. But it can be difficult to keep up with all the new studies and updated data out there. To make life a bit easier, we’re introducing a new take on how research can be presented. The Databoard for Research Insights enables people to explore and interact with some of Google’s recent research in a unique and immersive way. The Databoard uses responsive design to offer an engaging experience across devices. Additionally, the tool is a new venture into data visualization and shareability, with bite-sized charts and stats that can be shared with your friends or coworkers. The Databoard is currently home to several of Google’s market research studies for businesses, but we believe that this way of conveying data can work across all forms of research.
Here are some of the things that make the Databoard different from other ways research is released today:
Easy to use
All of the information in the Databoard is presented in a bite-sized way so that you can quickly find relevant information. You can explore an entire study or jump straight to the topics or data points you care about. The Databoard is also optimized for all devices so you can explore the research on your computer, tablet or smartphone.
Meant to be shared
Most people, when they find a compelling piece of data, want to share it! Whether it’s with a colleague, a client, or a community on a blog or social network, compelling insights and data are meant to be shared. With the Databoard, you can easily share individual charts and insights, or collections of data, with anyone through email or social networks; just look for the share button at the top of each chart or insight.
Create a cohesive story
Most research studies set out to answer a specific question, like how people use their smartphones in stores, or how a specific type of consumer shops. This means that businesses need to look across multiple pieces of research to craft a comprehensive business or marketing strategy. With this in mind, the Databoard lets you curate a customized infographic out of the charts or data points you find important across multiple Google research studies. Creating an infographic is quick and easy, and you can share the finished product with your friends or colleagues.
The Databoard is currently home to six research studies, including The New Multi-screen World, Mobile In-store Shopper Research and Mobile Search Moments. New studies will be added frequently. To get started creating your own infographic, visit the Databoard now.
Conference Report: USENIX Annual Technical Conference (ATC) 2013
Wednesday, July 03, 2013
Posted by Murray Stokely, Google Storage Analytics Team
This year marks Google’s eleventh consecutive year as a sponsor of the USENIX Annual Technical Conference (ATC), just one of the co-located events at USENIX Federated Conference Week (FCW), which combines numerous conferences and workshops covering fields such as Autonomic Computing, Feedback Computing and much more in an intensive week of research, trends, and community interaction.
ATC provides a broad forum for computing systems research with an emphasis on implementations and experimental results. In addition to the Googlers presenting publications, we had two members on the ATC program committee, as well as keynote speakers, invited speakers, panelists, committee members, and participants at the other co-located events at FCW.
In the paper Janus: Optimal Flash Provisioning for Cloud Storage Workloads, Googler Christoph Albrecht and co-authors demonstrated a system that allows users to make informed flash memory provisioning and partitioning decisions in cloud-scale distributed file systems that include both flash storage and disk tiers. As flash memory is still expensive, it is best to use it only for workloads that can make good use of it. Janus creates long-term workload characterizations based on RPC samples and file age metadata. It uses these workload characterizations to formulate and solve an optimization problem that maximizes the reads sent to the flash tier. Based on evaluations of workloads using Janus, which has been in use at Google for the past six months, the authors conclude that the recommendation system is quite effective: flash hit rates using the optimized recommendations are 47-76% higher than with the flash used as an unpartitioned tier.
In packetdrill: Scriptable Network Stack Testing, from Sockets to Packets, Google’s Neal Cardwell and co-authors showcased a portable, open-source scripting tool that enables testing the correctness and performance of network protocols. Despite their importance in modern computer systems, network protocols often undergo only ad hoc testing before their deployment, in large part due to their complexity. Furthermore, new algorithms have unforeseen interactions with other features, so testing has only become more daunting as TCP has evolved. The packetdrill tool was instrumental in the development of three new features for Linux TCP (Early Retransmit, Fast Open, and Loss Probes) and allowed the authors to find and fix 10 bugs in Linux. The team now uses packetdrill in all phases of the development process for the kernel used in one of the world’s largest Linux installations. In the hope that sharing packetdrill with the community will make the process of improving Internet protocols easier, the source code and test scripts for packetdrill have been made freely available.
There were also additional refereed publications with Google co-authors at some of the co-located events at FCW, notably NicPic: Scalable and Accurate End-Host Rate Limiting, which outlines a system that enables accurate network traffic scheduling in a scalable fashion, and AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service, a system that efficiently handles dynamic application workloads, reducing both penalties and user dissatisfaction.
Google is proud to support the academic community through conference participation and sponsorship. In particular, we are happy to mention one of the other interesting papers from this year’s USENIX FCW, co-authored by former Google PhD Fellowship recipient Ashok Anand: MiG: Efficient Migration of Desktop VM Using Semantic Compression.
USENIX is a supporter of open access, so the papers and videos from the talks are available on the conference website.
Natural Language Understanding-focused awards announced
Tuesday, July 02, 2013
Posted by Massimiliano Ciaramita, Research Scientist and David Harper, Head University Relations (EMEA)
Some of the biggest challenges for the scientific community today involve understanding the principles and mechanisms that underlie natural language use on the Web. An example of a long-standing problem is language ambiguity: when somebody types the word “Rio” in a query, do they mean the city, a movie, a casino, or something else? Understanding the difference can be crucial to helping users get the answer they are looking for. In the past few years, a significant effort in industry and academia has focused on disambiguating language with respect to Web-scale knowledge repositories such as Wikipedia and Freebase. These resources are used primarily as canonical, although incomplete, collections of “entities”. As entities are often connected in multiple ways, e.g., explicitly via hyperlinks and implicitly via factual information, such resources can naturally be thought of as (knowledge) graphs. This work has provided the first breakthroughs towards anchoring language on the Web to interpretable, albeit initially shallow, semantic representations. Google has brought the vision of semantic search directly to millions of users via the adoption of the Knowledge Graph. This massive change to search technology has also been called a shift “from strings to things”.
Understanding natural language is at the core of Google's work to help people get the information they need as quickly and easily as possible. At Google we work hard to advance the state of the art in natural language processing, to improve the understanding of fundamental principles, and to solve the algorithmic and engineering challenges to make these technologies part of everyday life. Language is inherently productive; an infinite number of meaningful new expressions can be formed by combining the meaning of their components systematically. The logical next step is the semantic modeling of structured meaningful expressions -- in other words, “what is said” about entities. We envision that knowledge graphs will support the next leap forward in language understanding towards scalable compositional analyses, by providing a universe of entities, facts and relations upon which semantic composition operations can be designed and implemented.
So we’ve just awarded over $1.2 million in natural language understanding research awards to university research groups doing work in this area. Research topics range from semantic parsing to statistical models of life stories, and from novel compositional inference to representation approaches for modeling relations and events in the Knowledge Graph.
These awards went to researchers in nine universities and institutions worldwide, selected after a rigorous internal review:
Mark Johnson and Lan Du (Macquarie University) and Wray Buntine (NICTA) for “Generative models of Life Stories”
Percy Liang and Christopher Manning (Stanford University) for “Tensor Factorizing Knowledge Graphs”
Sebastian Riedel (University College London) and Andrew McCallum (University of Massachusetts, Amherst) for “Populating a Knowledge Base of Compositional Universal Schema”
Ivan Titov (University of Amsterdam) for “Learning to Reason by Exploiting Grounded Text Collections”
Hans Uszkoreit (Saarland University and DFKI), Feiyu Xu (DFKI and Saarland University) and Roberto Navigli (Sapienza University of Rome) for “Language Understanding cum Knowledge Yield”
Luke Zettlemoyer (University of Washington) for “Weakly Supervised Learning for Semantic Parsing with Knowledge Graphs”
We believe the results will be broadly useful to product development and will further scientific research. We look forward to working with these researchers, and we hope we will jointly push the frontier of natural language understanding research to the next level.