Google Research Blog
The latest news from Research at Google
Efficient Smart Reply, now for Gmail
Wednesday, May 17, 2017
Posted by Brian Strope, Research Scientist, and Ray Kurzweil, Engineering Director, Google Research
Last year we launched Smart Reply, a feature for Inbox by Gmail that uses machine learning to suggest replies to email. Since the initial release, usage of Smart Reply has grown significantly, making up about 12% of replies in Inbox on mobile. Based on our examination of the use of Smart Reply in Inbox and our ideas about how humans learn and use language, we have created a new version of Smart Reply for Gmail. This version increases the percentage of usable suggestions and is more algorithmically efficient.
Novel thinking: hierarchy
Inspired by how humans understand languages and concepts, we turned to hierarchical models of language, an approach that uses hierarchies of modules, each of which can learn, remember, and recognize a sequential pattern.
The content of language is deeply hierarchical, reflected in the structure of language itself, going from letters to words to phrases to sentences to paragraphs to sections to chapters to books to authors to libraries, etc. Consider the message, "That interesting person at the cafe we like gave me a glance." The hierarchical chunks in this sentence are highly variable. The subject of the sentence is "That interesting person at the cafe we like." The modifier "interesting" tells us something about the writer's past experiences with the person. We are told that the location of an incident involving both the writer and the person is "at the cafe." We are also told that "we," meaning the writer and the person being written to, like the cafe. Additionally, each word is itself part of a hierarchy, sometimes more than one. A cafe is a type of restaurant which is a type of store which is a type of establishment, and so on.
In proposing an appropriate response to this message we might consider the meaning of the word "glance," which is potentially ambiguous. Was it a positive gesture? In that case, we might respond, "Cool!" Or was it a negative gesture? If so, does the subject say anything about how the writer felt about the negative exchange? A lot of information about the world, and an ability to make reasoned judgments, are needed to make subtle distinctions.
Given enough examples of language, a machine learning approach can discover many of these subtle distinctions. Moreover, a hierarchical approach to learning is well suited to the hierarchical nature of language. We have found that this approach works well for suggesting possible responses to emails. We use a hierarchy of modules, each of which considers features that correspond to sequences at different temporal scales, similar to how we understand speech and language.
Each module processes inputs and provides transformed representations of those inputs on its outputs (which are, in turn, available for the next level). In the Smart Reply system, and the figure above, the repeated structure has two layers of hierarchy. The first makes each feature useful as a predictor of the final result, and the second combines these features. By definition, the second works at a more abstract representation and considers a wider timescale.
By comparison, the initial release of Smart Reply encoded input emails word-by-word with a long short-term memory (LSTM) recurrent neural network, and then decoded potential replies with yet another word-level LSTM. While this type of modeling is very effective in many contexts, even with Google infrastructure it is an approach that requires substantial computational resources. Instead of working word-by-word, we found an effective and highly efficient path by processing the problem more all-at-once: comparing a simple hierarchy of vector representations of multiple features corresponding to longer time spans.
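To make the contrast concrete, here is a minimal sketch of the idea in Python -- scoring candidate replies by comparing fixed-length message encodings built from a two-level feature hierarchy. The names, sizes, and feature choices are our own illustrative assumptions, not the production model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the real model's features and dimensions differ.
VOCAB, DIM = 10_000, 64
ngram_emb = rng.normal(scale=0.1, size=(VOCAB, DIM))  # level 1: feature embeddings
W2 = rng.normal(scale=0.1, size=(DIM, DIM))           # level 2: combines features

def features(text):
    """Hash word unigrams and bigrams to rows of the embedding table."""
    words = text.lower().split()
    grams = words + [" ".join(p) for p in zip(words, words[1:])]
    return [hash(g) % VOCAB for g in grams]

def encode(text):
    """Embed features, average them, then mix once more at a wider scope."""
    level1 = ngram_emb[features(text)].mean(axis=0)
    return np.tanh(level1 @ W2)

def rank_replies(message, candidates):
    """Score each candidate reply by dot product with the message encoding."""
    m = encode(message)
    return sorted(candidates, key=lambda r: -float(m @ encode(r)))

# With untrained weights the ranking is arbitrary; this shows structure only.
print(rank_replies("Want to grab lunch tomorrow?",
                   ["Sure, sounds good!", "Congratulations!", "See attached."]))
```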
Semantics
We have also considered whether the mathematical space of these vector representations is implicitly semantic. Do the hierarchical network representations reflect a coarse “understanding” of the actual meaning of the inputs and the responses in order to determine which go together, or do they merely reflect consistent syntactic patterns? Given many real examples of which pairs go together and, perhaps more importantly, which do not, we found that our networks are surprisingly effective and efficient at deriving representations that meet the training requirements.
So far we see that the system can find responses that are on point, without an overlap of keywords or even synonyms of keywords. More directly, we’re delighted when the system suggests results that show understanding and are helpful.
The key to this work is the confidence and trust people give us when they use the Smart Reply feature. As always, thank you for showing us the ways that work (and the ways that don’t!). With your help, we’ll do our best to keep learning.
An Upgrade to SyntaxNet, New Models and a Parsing Competition
Wednesday, March 15, 2017
Posted by David Weiss and Slav Petrov, Research Scientists
At Google, we continuously improve the language understanding capabilities used in applications ranging from generation of email responses to translation. Last summer, we open-sourced SyntaxNet, a neural-network framework for analyzing and understanding the grammatical structure of sentences. Included in our release was Parsey McParseface, a state-of-the-art model that we had trained for analyzing English, followed quickly by a collection of pre-trained models for 40 additional languages, which we dubbed Parsey's Cousins. While we were excited to share our research and to provide these resources to the broader community, building machine learning systems that work well for languages other than English remains an ongoing challenge. We are excited to announce a few new research resources, available now, that address this problem.
SyntaxNet Upgrade
We are releasing a major upgrade to SyntaxNet. This upgrade incorporates nearly a year’s worth of our research on multilingual language understanding, and is available to anyone interested in building systems for processing and understanding text. At the core of the upgrade is a new technology that enables learning of richly layered representations of input sentences. More specifically, the upgrade extends TensorFlow to allow joint modeling of multiple levels of linguistic structure, and to allow neural-network architectures to be created dynamically during processing of a sentence or document.
Our upgrade makes it easy, for example, to build character-based models that learn to compose individual characters into words (e.g. ‘c-a-t’ spells ‘cat’). By doing so, the models can learn that words can be related to each other because they share common parts (e.g. ‘cats’ is the plural of ‘cat’ and shares the same stem; ‘wildcat’ is a type of ‘cat’). Parsey and Parsey’s Cousins, on the other hand, operated over sequences of words. As a result, they were forced to memorize words seen during training and relied mostly on context to determine the grammatical function of previously unseen words.
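As a rough illustration of character composition (a toy sketch, not the SyntaxNet architecture; the weights here are random and untrained):

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 32
char_emb = {c: rng.normal(scale=0.1, size=DIM) for c in "abcdefghijklmnopqrstuvwxyz"}
W_h = rng.normal(scale=0.1, size=(DIM, DIM))  # recurrent weights (untrained)
W_x = rng.normal(scale=0.1, size=(DIM, DIM))  # input weights (untrained)

def compose(word):
    """Fold a word's characters into one vector, left to right."""
    h = np.zeros(DIM)
    for c in word:
        h = np.tanh(h @ W_h + char_emb[c] @ W_x)
    return h

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# With trained weights, words sharing a stem ("cat", "cats") end up with
# related vectors; here the weights are random, so only the mechanics matter.
print(cosine(compose("cat"), compose("cats")))
print(cosine(compose("cat"), compose("dog")))
```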
As an example, consider the following (meaningless but grammatically correct) sentence: “The gostak distims the doshes.” This sentence was originally coined by Andrew Ingraham, who explained: “You do not know what this means; nor do I. But if we assume that it is English, we know that the doshes are distimmed by the gostak. We know too that one distimmer of doshes is a gostak.” Systematic patterns in morphology and syntax allow us to guess the grammatical function of words even when they are completely novel: we understand that ‘doshes’ is the plural of the noun ‘dosh’ (similar to the ‘cats’ example above) or that ‘distims’ is the third person singular of the verb ‘distim’. Based on this analysis we can then derive the overall structure of this sentence even though we have never seen the words before.
ParseySaurus
To showcase the new capabilities provided by our upgrade to SyntaxNet, we are releasing a set of new pretrained models called ParseySaurus. These models use the character-based input representation mentioned above and are thus much better at predicting the meaning of new words based both on their spelling and how they are used in context. The ParseySaurus models are far more accurate than Parsey’s Cousins (reducing errors by as much as 25%), particularly for morphologically-rich languages like Russian, or agglutinative languages like Turkish and Hungarian. In those languages there can be dozens of forms for each word, and many of these forms might never be observed during training, even in a very large corpus.
Consider the following fictitious Russian sentence, where again the stems are meaningless, but the suffixes define an unambiguous interpretation of the sentence structure:
Even though our Russian ParseySaurus model has never seen these words, it can correctly analyze the sentence by inspecting the character sequences which constitute each word. In doing so, the system can determine many properties of the words (notice how many more morphological features there are here than in the English example). To see the sentence as ParseySaurus does, here is a visualization of how the model analyzes this sentence:
Each square represents one node in the neural network graph, and lines show the connections between them. The left-side “tail” of the graph shows the model consuming the input as one long string of characters. These are intermittently passed to the right side, where the rich web of connections shows the model composing words into phrases and producing a syntactic parse. Check out the full-size rendering here.
A Competition
You might be wondering whether character-based modeling is all we need or whether there are other techniques that might be important. SyntaxNet has lots more to offer, like beam search and different training objectives, but there are of course also many other possibilities. To find out what works well in practice, we are helping co-organize, together with Charles University and other colleagues, a multilingual parsing competition at this year’s Conference on Computational Natural Language Learning (CoNLL), with the goal of building syntactic parsing systems that work well in real-world settings and for 45 different languages.
The competition is made possible by the Universal Dependencies (UD) initiative, whose goal is to develop cross-linguistically consistent treebanks. Because machine-learned models can only be as good as the data that they have access to, we have been contributing data to UD since 2013. For the competition, we partnered with UD and DFKI to build a new multilingual evaluation set consisting of 1000 sentences that have been translated into 20+ different languages and annotated by linguists with parse trees. This evaluation set is the first of its kind (in the past, each language had its own independent evaluation set) and will enable more consistent cross-lingual comparisons. Because the sentences have the same meaning and have been annotated according to the same guidelines, we will be able to get closer to answering the question of which languages might be harder to parse.
We hope that the upgraded SyntaxNet framework and the pre-trained ParseySaurus models will inspire researchers to participate in the competition. We have additionally created a tutorial showing how to load a Docker image and train models on the Google Cloud Platform, to facilitate participation by smaller teams with limited resources. So, if you have an idea for making your own models with the SyntaxNet framework, sign up to compete! We believe that the configurations that we are releasing are a good place to start, but we look forward to seeing how participants will be able to extend and improve these models or perhaps create better ones!
Thanks to everyone involved who made this competition happen, including our collaborators at UD-Pipe, who provide another baseline implementation to make it easy to enter the competition. Happy parsing from the main developers, Chris Alberti, Daniel Andor, Ivan Bogatyy, Mark Omernick, Zora Tung and Ji Ma!
On-Device Machine Intelligence
Thursday, February 09, 2017
Posted by Sujith Ravi, Staff Research Scientist, Google Research
To build the cutting-edge technologies that enable conversational understanding and image recognition, we often apply combinations of machine learning technologies such as deep neural networks and graph-based machine learning. However, the machine learning systems that power most of these applications run in the cloud, are computationally intensive, and have significant memory requirements. What if you want machine intelligence to run on your personal phone or smartwatch, or on IoT devices, regardless of whether they are connected to the cloud?
Yesterday, we announced the launch of Android Wear 2.0, along with brand new wearable devices, that will run Google's first entirely “on-device” ML technology for powering smart messaging. This on-device ML system, developed by the Expander research team, enables technologies like Smart Reply to be used for any application, including third-party messaging apps, without ever having to connect with the cloud…so now you can respond to incoming chat messages directly from your watch, with a tap.
The research behind this began last year while our team was developing the machine learning systems that enable conversational understanding capability in Allo and Inbox. The Android Wear team reached out to us and was interested to know whether it would be possible to deploy this Smart Reply technology directly onto a smart device. Because of the limited computing power and memory on smart devices, we quickly realized that it was not possible to do so. Our product manager, Patrick McGregor, realized that this presented a unique challenge and an opportunity for the Expander team to return to the drawing board to design a completely new, lightweight machine learning architecture — not only to enable Smart Reply on Android Wear, but also to power a wealth of other on-device mobile applications. Together with Tom Rudick, Nathan Beach, and other colleagues from the Android Wear team, we set out to build the new system.
Learning with Projections
A simple strategy to build lightweight conversational models might be to create a small dictionary of common rules (input → reply mappings) on the device and use a naive look-up strategy at inference time. This can work for simple prediction tasks involving a small set of classes using a handful of features (such as binary sentiment classification from text, e.g. “I love this movie” conveys a positive sentiment whereas the sentence “The acting was horrible” is negative). But it does not scale to complex natural language tasks involving rich vocabularies and the wide language variability observed in chat messages. On the other hand, machine learning models like recurrent neural networks (such as LSTMs), in conjunction with graph learning, have proven to be extremely powerful tools for complex sequence learning in natural language understanding tasks, including Smart Reply. However, compressing such rich models to fit in device memory and produce robust predictions at low computation cost (rapidly, on-demand) is extremely challenging. Early experiments with restricting the model to predict only a small handful of replies, or with other techniques like quantization or character-level models, did not produce useful results.
Instead, we built a different solution for the on-device ML system. We first use a fast, efficient mechanism to group similar incoming messages and project them to similar (“nearby”) bit vector representations. While there are several ways to perform this projection step, such as using word embeddings or encoder networks, we employ a modified version of locality sensitive hashing (LSH) to reduce dimension from millions of unique words to a short, fixed-length sequence of bits. This allows us to compute a projection for an incoming message very fast, on the fly, and with a small memory footprint on the device, since we do not need to store the incoming messages, word embeddings, or even the full model used for training.
Projection step: Similar messages are grouped together and projected to nearby vectors. For example, the messages “hey, how's it going?” and “How's it going buddy?” share similar content and might be projected to the same vector 11100011. Another related message, “Howdy, everything going well?”, is mapped to a nearby vector 11100110 that differs in only 2 bits.
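For intuition, here is a minimal sketch of the random-hyperplane flavor of LSH in Python; the on-device system uses a modified scheme whose details are not spelled out here, so treat the feature hashing and sizes below as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
DIM, BITS = 1_000, 8                   # toy feature space and projection length
planes = rng.normal(size=(BITS, DIM))  # one random hyperplane per output bit

def featurize(text):
    """Hash words into a sparse count vector (a stand-in for real features)."""
    v = np.zeros(DIM)
    for w in text.lower().replace("?", "").replace(",", "").split():
        v[hash(w) % DIM] += 1.0
    return v

def project(text):
    """Each bit records which side of a hyperplane the feature vector falls on."""
    return "".join("1" if d > 0 else "0" for d in planes @ featurize(text))

# Similar messages tend to agree on most bits; unrelated ones do not.
print(project("hey, how's it going?"))
print(project("How's it going buddy?"))
print(project("The quarterly report is attached."))
```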
Next, our system takes the incoming message along with its projections and jointly trains a “message projection model” that learns to predict likely replies using our semi-supervised graph learning framework. The graph learning framework enables training a robust model by combining semantic relationships from multiple sources — message/reply interactions, word/phrase similarity, semantic cluster information — and learning useful projection operations that can be mapped to good reply predictions.
Learning step: (Top) Messages along with projections and corresponding replies, if available, are used in a machine learning framework to jointly learn a “message projection model”. (Bottom) The message projection model learns to associate replies with the projections of the corresponding incoming messages. For example, the model projects two different messages, “Howdy, everything going well?” and “How’s it going buddy?” (bottom center), to nearby bit vectors and learns to map these to relevant replies (bottom right).
It’s worth noting that while the message projection model can be trained using complex machine learning architectures and the power of the cloud, as described above, the model itself resides and performs inference completely on device. Apps running on the device can pass a user’s incoming messages and receive reply predictions from the on-device model without data leaving the device. The model can also be adapted to cater to the user’s writing style and individual preferences to provide a personalized experience.
Inference step: The model applies the learned projections to an incoming message (or sequence of messages) and suggests relevant and diverse replies. Inference is performed on the device, allowing the model to adapt to user data and personal writing styles.
To get the on-device system to work out of the box, we had to make a few additional improvements, such as optimizing to speed up on-device computation and generating rich, diverse replies from the model. A forthcoming scientific publication will describe the on-device machine learning work in more detail.
Converse from Your Wrist
When we embarked on our journey to build this technology from scratch, we weren’t sure if the predictions would be useful or of sufficient quality. We’re quite surprised and excited about how well it works even on Android wearable devices with very limited computation and memory resources. We look forward to continuing to improve the models to provide users with more delightful conversational experiences, and we will be leveraging this on-device ML platform to enable completely new applications in the months to come.
You can now use this feature to respond to your messages directly from your Google watches or any watch that runs Android Wear 2.0. It is already enabled on Google Hangouts, Google Messenger, and many third-party messaging apps. We also provide an API for developers of third-party Wear apps.
Acknowledgements
On behalf of the Google Expander team, I would also like to thank the following people who helped make this technology a success: Andrei Broder, Andrew Tomkins, David Singleton, Mirko Ranieri, Robin Dua and Yicheng Fan.
A Large Corpus for Supervised Word-Sense Disambiguation
Wednesday, January 18, 2017
Posted by Colin Evans and Dayu Yuan, Software Engineers
Understanding the various meanings of a particular word in text is key to understanding language. For example, in the sentence “he will receive stock in the reorganized company”, we know that “stock” refers to “the capital raised by a business or corporation through the issue and subscription of shares”, as defined in the New Oxford American Dictionary (NOAD), based on the context. However, there are more than 10 other definitions for “stock” in NOAD, ranging from “goods in a store” to “a medieval device for punishment”. For a computer algorithm, distinguishing between these meanings is so difficult that it has been described as “AI-complete” in the past (Navigli, 2009; Ide and Veronis, 1998; Mallery, 1988).
In order to help further progress on this challenge, we’re happy to announce the release of word-sense annotations on the popular MASC and SemCor datasets, manually annotated with senses from the NOAD. We’re also releasing mappings from the NOAD senses to English Wordnet, which is more commonly used by the research community. This is one of the largest releases of fully sense-annotated English corpora.
Supervised Word-Sense Disambiguation
Humans distinguish between meanings of words in text easily because we have access to an enormous amount of common-sense knowledge about how the world works, and how this connects to language. For an example of the difficulty, “[stock] in a business” implies the financial sense, but “[stock] in a bodega” is more likely to refer to goods on the shelves of a store, even though a bodega is a kind of business. Acquiring sufficient knowledge in a form that a machine can use, and then applying it to understanding the words in text, is a challenge.
Supervised word-sense disambiguation (WSD) is the problem of building a machine-learned system, using human-labeled data, that can assign a dictionary sense to all words used in text (in contrast to entity disambiguation, which focuses on nouns, mostly proper). Building a supervised model that performs better than just assigning the most frequent sense of a word without considering the surrounding text is difficult, but supervised models can perform well when supplied with significant amounts of training data (Navigli, 2009).
By releasing this dataset, it is our hope that the research community will be able to further the advance of algorithms that allow machines to understand language better, allowing applications such as:
Facilitating the automatic construction of databases from text in order to answer questions and connect knowledge in documents. For example, understanding that a “hemi engine” is a kind of automotive machinery, and a “locomotive engine” is a kind of train, or that “Kanye West is a star” implies that he is a celebrity, but “Sirius is a star” implies that it is an astronomical object.
Disambiguating words in queries, so that results for “date palm” and “date night”, or “web spam” and “spam recipe”, can have distinct interpretations for different senses, and documents returned from a query have the same meaning that is implied by the query.
Manual Annotation
In the manually labeled data sets that we are releasing, each sense annotation is labeled by five raters. To ensure high quality of the sense annotation, raters are first trained with gold annotations, which were labeled by experienced linguists in a separate pilot study before the annotation task. The figure below shows an example of a rater’s work page in our annotation tool.
The left side of the page lists all candidate dictionary senses (in this case, for the word “general”). Example sentences from the dictionary are also provided. The words to be annotated, highlighted within a sentence, are shown on the right side of the work page. Besides linking a dictionary sense to a word, raters could also label one of three exceptions: (1) The word is a typo, (2) None of the above, and (3) I can’t decide. Raters could also check whether the word usage is a metaphor and leave comments.
The sense annotation task used for this data release achieves an inter-rater reliability score of 0.869 using Krippendorff's alpha (α >= 0.67 is considered an acceptable level of reproducibility, and α >= 0.80 is considered a highly reproducible result) (Krippendorff, 2004). Annotation counts are listed below.
         Total   noun   verb    adjective   adverb
SemCor   115k    38k    57k     11.6k       8.6k
MASC     133k    50k    12.7k   13.6k       4.2k
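For reference, an agreement score like the alpha above can be computed with the open-source krippendorff Python package; the ratings below are invented for illustration, and this is not necessarily the tooling used for the release:

```python
# pip install krippendorff
import numpy as np
import krippendorff

# Rows are raters, columns are items; values are chosen sense IDs,
# with np.nan marking items a rater skipped. All ratings are invented.
ratings = np.array([
    [1, 2, 3, 3, 2, 1, 4, 1, 2, np.nan],
    [1, 2, 3, 3, 2, 2, 4, 1, 2, 5],
    [np.nan, 3, 3, 3, 2, 1, 4, 1, 1, 5],
])

# Sense labels are categories, so the nominal level of measurement applies.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"alpha = {alpha:.3f}")
```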
Wordnet Mappings
We’ve also included two sets of mappings from NOAD to Wordnet. A smaller set of 2200 words was manually mapped in a process similar to the sense annotations described above, and a larger set was created algorithmically. Together, these mappings allow for resources in Wordnet to be applied to this NOAD corpus, and for systems built using Wordnet to be evaluated using this corpus.
You can learn more about our full research results on this corpus, using LSTM-based language models and semi-supervised learning, in “Semi-supervised Word Sense Disambiguation with Neural Models”.
Acknowledgements
The datasets were built with help from Eric Altendorf, Heng Chen, Jutta Degener, Ryan Doherty, David Huynh, Ji Li, Julian Richardson and Binbin Ruan.
Graph-powered Machine Learning at Google
Thursday, October 06, 2016
Posted by Sujith Ravi, Staff Research Scientist, Google Research
Recently, there have been significant advances in machine learning that enable computer systems to solve complex real-world problems. One of those advances is Google’s large-scale, graph-based machine learning platform, built by the Expander team in Google Research. A technology that is behind many of the Google products and features you may use every day, graph-based machine learning is a powerful tool that can be used to power useful features such as reminders in Inbox and smart messaging in Allo, or used in conjunction with deep neural networks to power the latest image recognition system in Google Photos.
Learning with Minimal Supervision
Much of the recent success in deep learning, and machine learning in general, can be attributed to models that demonstrate high predictive capacity when trained on large amounts of labeled data -- often millions of training examples. This is commonly referred to as “supervised learning”, since it requires supervision in the form of labeled data to train the machine learning systems. (Conversely, some machine learning methods operate directly on raw data without any supervision, a paradigm referred to as unsupervised learning.)
However, the more difficult the task, the harder it is to get sufficient high-quality labeled data. It is often prohibitively labor intensive and time-consuming to collect labeled data for every new problem. This motivated the Expander research team to build new technology for powering machine learning applications at scale and with minimal supervision.
Expander’s technology draws inspiration from how humans learn to generalize and bridge the gap between what they already know (labeled information) and novel, unfamiliar observations (unlabeled information). Known as “semi-supervised” learning, this powerful technique enables us to build systems that can work in situations where training data may be sparse. Key advantages of a graph-based semi-supervised machine learning approach are that (a) labeled and unlabeled data are modeled jointly during learning, leveraging the underlying structure in the data, and (b) multiple types of signals (for example, relational information from the Knowledge Graph along with raw features) can easily be combined into a single graph representation and learned over. This is in contrast to other machine learning approaches, such as neural network methods, in which it is typical to first train a system using labeled data with features and then apply the trained system to unlabeled data.
Graph Learning: How It Works
At its core, Expander’s platform combines semi-supervised machine learning with large-scale graph-based learning by building a multi-graph representation of the data with nodes corresponding to objects or concepts and edges connecting concepts that share similarities. The graph typically contains both labeled data (nodes associated with a known output category or label) and unlabeled data (nodes for which no labels were provided). Expander’s framework then performs semi-supervised learning to label all nodes jointly by propagating label information across the graph.
However, this is easier said than done! We have to (1) learn efficiently at scale with minimal supervision (i.e., tiny amount of labeled data), (2) operate over multi-modal data (i.e., heterogeneous representations and various sources of data), and (3) solve challenging prediction tasks (i.e., large, complex output spaces) involving high dimensional data that might be noisy.
One of the primary ingredients in the entire learning process is the graph and the choice of connections. Graphs come in all sizes and shapes, and can be combined from multiple sources. We have observed that it is often beneficial to learn over multi-graphs that combine information from multiple types of data representations (e.g., image pixels, object categories, and chat response messages for PhotoReply in Allo). The Expander team’s graph learning platform automatically generates graphs directly from data, based on the inferred or known relationships between data elements. The data can be structured (for example, relational data) or unstructured (for example, sparse or dense feature representations extracted from raw data).
To understand how Expander’s system learns, let us consider an example graph shown below.
There are two types of nodes in the graph: “grey” represents unlabeled data, whereas the colored nodes represent labeled data. Relationships between node data are represented via edges, and the thickness of each edge indicates the strength of the connection. We can formulate the semi-supervised learning problem on this toy graph as follows: predict a color (“red” or “blue”) for every node in the graph. Note that the specific choice of graph structure and colors depends on the task. For example, as shown in this research paper we recently published, a graph that we built for the Smart Reply feature in Inbox represents email messages as nodes, and colors indicate semantic categories of user responses (e.g., “yes”, “awesome”, “funny”).
The Expander graph learning framework solves this labeling task by treating it as an optimization problem. At the simplest level, it learns a color label assignment for every node in the graph such that neighboring nodes are assigned similar colors depending on the strength of their connection. A naive way to solve this would be to try to learn a label assignment for all nodes at once -- this method does not scale to large graphs. Instead, we can optimize the problem formulation by propagating colors from labeled nodes to their neighbors, and then repeating the process. In each step, an unlabeled node is assigned a label by inspecting color assignments of its neighbors. We can update every node’s label in this manner and iterate until the whole graph is colored. This process is a far more efficient way to optimize the same problem and the sequence of iterations converges to a unique solution in this case. The solution at the end of the graph propagation looks something like this:
Semi-supervised learning on a graph
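A minimal sketch of that propagation loop on a toy graph, using simple weighted majority voting (Expander's actual objective is richer than this):

```python
# Toy label propagation: spread "red"/"blue" from seed nodes over weighted edges.
edges = {  # undirected; weight = connection strength
    ("a", "b"): 1.0, ("b", "c"): 2.0, ("c", "d"): 1.0,
    ("d", "e"): 2.0, ("b", "e"): 0.5,
}
seeds = {"a": "red", "e": "blue"}  # labeled nodes keep their colors

nodes = {n for pair in edges for n in pair}
nbrs = {n: [] for n in nodes}
for (u, v), w in edges.items():
    nbrs[u].append((v, w))
    nbrs[v].append((u, w))

labels = dict(seeds)
for _ in range(10):  # iterate until the coloring stabilizes
    new = dict(seeds)
    for n in nodes - seeds.keys():
        votes = {}
        for v, w in nbrs[n]:
            if v in labels:
                votes[labels[v]] = votes.get(labels[v], 0.0) + w
        if votes:  # assign the color with the strongest neighbor support
            new[n] = max(votes, key=votes.get)
    if new == labels:
        break
    labels = new

print(labels)  # every reachable node ends up colored
```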
In practice, we use complex optimization functions defined over the graph structure, which incorporate additional information and constraints for semi-supervised graph learning and can lead to hard, non-convex problems. The real challenge, however, is to scale this efficiently to graphs containing billions of nodes and trillions of edges, and to complex tasks involving billions of different label types.
To tackle this challenge, we created the approach outlined in “Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation”, published last year. It introduces a streaming algorithm that processes information propagated from neighboring nodes in a distributed manner, which makes it work on very large graphs. In addition, it addresses other practical concerns: notably, it guarantees that the space complexity, or memory requirements, of the system stays constant regardless of the difficulty of the task, i.e., the overall system uses the same amount of memory whether the number of prediction labels is two (as in the above toy example), a million, or even a billion. This enables wide-ranging applications for natural language understanding, machine perception, user modeling and even joint multimodal learning for tasks involving multiple modalities such as text, image and video inputs.
Language Graphs for Learning Humor
As an example use of graph-based machine learning, consider emotion labeling, a language understanding task in Smart Reply for Inbox, where the goal is to label words occurring in natural language text with their fine-grained emotion categories. A neural network model is first applied to a text corpus to learn word embeddings, i.e., a mathematical vector representation of the meaning of each word. The dense embedding vectors are then used to build a sparse graph where nodes correspond to words and edges represent semantic relationships between them. Edge strength is computed using similarity between embedding vectors -- low-similarity edges are ignored. We seed the graph with emotion labels known a priori for a few nodes (e.g., “laugh” is labeled as “funny”) and then apply semi-supervised learning over the graph to discover emotion categories for the remaining words (e.g., “ROTFL” gets labeled as “funny” owing to its multi-hop semantic connection to the word “laugh”).
Learning emotion associations using graph constructed from word embedding vectors
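That graph-construction step can be sketched on toy vectors as brute-force cosine similarity with a threshold -- fine at this scale, though, as noted below, not at Expander's (the embeddings here are invented):

```python
import numpy as np

# Invented 4-d "embeddings"; real word vectors come from a trained network.
emb = {
    "laugh":  np.array([0.90, 0.10, 0.00, 0.20]),
    "rotfl":  np.array([0.80, 0.20, 0.10, 0.30]),
    "giggle": np.array([0.85, 0.15, 0.05, 0.20]),
    "tax":    np.array([0.00, 0.90, 0.80, 0.10]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

THRESHOLD = 0.9  # drop low-similarity edges to keep the graph sparse
words = list(emb)
edges = {}
for i, u in enumerate(words):
    for v in words[i + 1:]:
        s = cosine(emb[u], emb[v])
        if s >= THRESHOLD:
            edges[(u, v)] = s

print(edges)  # weighted edges that could feed the propagation sketched earlier
```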
For applications involving large datasets or dense representations that are observed (e.g., pixels from images) or learned using neural networks (e.g., embedding vectors), it is infeasible to compute pairwise similarity between all objects to construct edges in the graph. The Expander team solves this problem by leveraging approximate, linear-time graph construction algorithms.
Graph-based Machine Intelligence in Action
The Expander team’s machine learning system is now being used on massive graphs (containing billions of nodes and trillions of edges) to recognize and understand concepts in natural language, images, videos, and queries, powering Google products for applications like reminders, question answering, language translation, visual object recognition, dialogue understanding, and more.
We are excited that with the recent release of Allo, millions of chat users are now experiencing smart messaging technology powered by the Expander team’s system for understanding and assisting with chat conversations in multiple languages. Also, this technology isn’t used only for large-scale models in the cloud: as announced this past week, Android Wear has opened up an on-device Smart Reply capability for developers that will provide smart replies for any messaging application. We’re excited to tackle even more challenging Internet-scale problems with Expander in the years to come.
Acknowledgements
We wish to acknowledge the hard work of all the researchers, engineers, product managers, and leaders across Google who helped make this technology a success. In particular, we would like to highlight the efforts of Allan Heydon, Andrei Broder, Andrew Tomkins, Ariel Fuxman, Bo Pang, Dana Movshovitz-Attias, Fritz Obermeyer, Krishnamurthy Viswanathan, Patrick McGregor, Peter Young, Robin Dua, Sujith Ravi and Vivek Ramavajjala.
Meet Parsey’s Cousins: Syntax for 40 languages, plus new SyntaxNet capabilities
Monday, August 08, 2016
Posted by Chris Alberti, Dave Orr & Slav Petrov, Google Natural Language Understanding Team
Just in time for ACL 2016, we are pleased to announce that Parsey McParseface, released in May as part of SyntaxNet and the basis for the Cloud Natural Language API, now has 40 cousins! Parsey’s Cousins is a collection of pretrained syntactic models for 40 languages, capable of analyzing the native language of more than half of the world’s population at often unprecedented accuracy. To better address the linguistic phenomena occurring in these languages, we have endowed SyntaxNet with new abilities for Text Segmentation and Morphological Analysis.
When we released Parsey, we were already planning to expand to more languages, and it soon became clear that this was both urgent and important, because researchers were having trouble creating top notch SyntaxNet models for other languages.
The reason for that is a little bit subtle. SyntaxNet, like other TensorFlow models, has a lot of knobs to turn, which affect accuracy and speed. These knobs are called hyperparameters, and they control things like the learning rate and its decay, momentum, and random initialization. Because neural networks are more sensitive to the choice of these hyperparameters than many other machine learning algorithms, picking the right hyperparameter setting is very important. Unfortunately, there is no tested and proven way of doing this, and picking good hyperparameters is mostly an empirical science -- we try a bunch of settings and see what works best.
An additional challenge is that training these models can take a long time -- several days on very fast hardware. Our solution is to train many models in parallel via MapReduce, and when one looks promising, train a bunch more models with similar settings to fine-tune the results. This can really add up -- on average, we train more than 70 models per language. The plot below shows how the accuracy varies depending on the hyperparameters as training progresses. The best models are up to 4% absolute more accurate than ones trained without hyperparameter tuning.
Held-out set accuracy for various English parsing models with different hyperparameters (each line corresponds to one training run with specific hyperparameters). In some cases training is a lot slower and in many cases a suboptimal choice of hyperparameters leads to significantly lower accuracy. We are releasing the best model that we were able to train for each language.
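The search itself can be as simple as sampling settings at random and keeping the best. A toy sketch of that loop, where the objective function is an invented stand-in for a real training run:

```python
import random

random.seed(0)

def train_and_eval(lr, decay, momentum):
    """Stand-in for a full training run; returns a held-out accuracy."""
    # An invented smooth objective with one peak -- real runs take days each.
    return (0.94 - (lr - 0.07) ** 2
                 - (decay - 0.96) ** 2 / 10
                 - (momentum - 0.90) ** 2 / 5)

best = None
for trial in range(70):  # on average, more than 70 runs per language
    params = dict(lr=random.uniform(0.01, 0.20),
                  decay=random.uniform(0.80, 1.00),
                  momentum=random.uniform(0.50, 0.99))
    acc = train_and_eval(**params)
    if best is None or acc > best[0]:
        best = (acc, params)

print(f"best accuracy {best[0]:.4f} with {best[1]}")
```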
In order to do a good job at analyzing the grammar of other languages, it was not sufficient to just fine-tune our English setup. We also had to expand the capabilities of SyntaxNet. The first extension is a model for text segmentation, which is the task of identifying word boundaries. In languages like English, this isn’t very hard -- you can mostly look for spaces and punctuation. In Chinese, however, this can be very challenging, because words are not separated by spaces. To correctly analyze dependencies between Chinese words, SyntaxNet needs to understand text segmentation -- and now it does.
Analysis of a Chinese string into a parse tree showing dependency labels, word tokens, and parts of speech (read top to bottom for each word token).
The second extension is a model for morphological analysis. Morphology is a language feature that is poorly represented in English. It describes inflection: i.e., how the grammatical function and meaning of a word change as its spelling changes. In English, we add an -s to a word to indicate plurality. In Russian, a heavily inflected language, morphology can indicate number, gender, whether the word is the subject or object of a sentence, possessives, prepositional phrases, and more. To understand the syntax of a sentence in Russian, SyntaxNet needs to understand morphology -- and now it does.
Parse trees showing dependency labels, parts of speech, and morphology.
As you might have noticed, the parse trees for all of the sentences above look very similar. This is because we follow the content-head principle, under which dependencies are drawn between content words, with function words becoming leaves in the parse tree. This idea was developed by the Universal Dependencies project in order to increase parallelism between languages. Parsey’s Cousins are trained on treebanks provided by this project and are designed to be cross-linguistically consistent and thus easier to use in multi-lingual language understanding applications.
Using the same set of labels across languages can help us understand how sentences in different languages, or variations in the same language, convey the same meaning. In all of the above examples, the root indicates the main verb of the sentence, and there is a passive nominal subject (indicated by the arc labeled ‘nsubjpass’) and a passive auxiliary (‘auxpass’). If you look closely, you will also notice some differences, because the grammar of each language differs. For example, English uses the preposition ‘by,’ where Russian uses morphology to mark that the phrase ‘the publisher (издателем)’ is in the instrumental case -- the meaning is the same, it is just expressed differently.
Google has been involved in the Universal Dependencies project since its inception, and we are very excited to be able to bring together our efforts on datasets and modeling. We hope that this release will facilitate research progress in building computer systems that can understand all of the world’s languages.
Parsey's Cousins can be found on GitHub, along with Parsey McParseface and SyntaxNet.
Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source
Thursday, May 12, 2016
Posted by Slav Petrov, Senior Staff Research Scientist
At Google, we spend a lot of time thinking about how computer systems can read and understand human language in order to process it in intelligent ways. Today, we are excited to share the fruits of our research with the broader community by releasing SyntaxNet, an open-source neural network framework implemented in TensorFlow that provides a foundation for Natural Language Understanding (NLU) systems. Our release includes all the code needed to train new SyntaxNet models on your own data, as well as Parsey McParseface, an English parser that we have trained for you and that you can use to analyze English text.
Parsey McParseface is built on powerful machine learning algorithms that learn to analyze the linguistic structure of language, and that can explain the functional role of each word in a given sentence. Because Parsey McParseface is the most accurate such model in the world, we hope that it will be useful to developers and researchers interested in automatic extraction of information, translation, and other core applications of NLU.
How does SyntaxNet work?
SyntaxNet is a framework for what’s known in academic circles as a syntactic parser, which is a key first component in many NLU systems. Given a sentence as input, it tags each word with a part-of-speech (POS) tag that describes the word's syntactic function, and it determines the syntactic relationships between words in the sentence, represented in the dependency parse tree. These syntactic relationships are directly related to the underlying meaning of the sentence in question. To take a very simple example, consider the following dependency tree for Alice saw Bob:
This structure encodes that Alice and Bob are nouns and saw is a verb. The main verb saw is the root of the sentence, and Alice is the subject (nsubj) of saw, while Bob is its direct object (dobj). As expected, Parsey McParseface analyzes this sentence correctly, but it also understands the following more complex example:
This structure again encodes the fact that Alice and Bob are the subject and object respectively of saw, and in addition that Alice is modified by a relative clause with the verb reading, that saw is modified by the temporal modifier yesterday, and so on. The grammatical relationships encoded in dependency structures allow us to easily recover the answers to various questions, for example whom did Alice see?, who saw Bob?, what had Alice been reading about?, or when did Alice see Bob?
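To see how such questions reduce to simple lookups once a parse exists, here is a toy sketch using hand-written (head, label, dependent) triples for Alice saw Bob; this is an illustrative representation, not SyntaxNet's output format:

```python
# Dependency arcs as (head, label, dependent) triples, written by hand.
parse = [
    ("saw", "nsubj", "Alice"),  # Alice is the subject of "saw"
    ("saw", "dobj", "Bob"),     # Bob is the direct object of "saw"
]

def dependent(head, label):
    """Answer questions like 'who saw Bob?' by following one labeled arc."""
    return next(d for h, lab, d in parse if h == head and lab == label)

print("Who saw Bob?        ->", dependent("saw", "nsubj"))  # Alice
print("Whom did Alice see? ->", dependent("saw", "dobj"))   # Bob
```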
Why is Parsing So Hard For Computers to Get Right?
One of the main problems that makes parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate-length sentences -- say 20 or 30 words long -- to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives and find the most plausible structure given the context. As a very simple example, the sentence Alice drove down the street in her car has at least two possible dependency parses:
The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can either modify drove or street; this example is an instance of what is called prepositional phrase attachment ambiguity.
Humans do a remarkable job of dealing with ambiguity, almost to the point where the problem is unnoticeable; the challenge is for computers to do the same. Multiple ambiguities such as these in longer sentences conspire to give a combinatorial explosion in the number of possible structures for a sentence. Usually the vast majority of these structures are wildly implausible, but are nevertheless possible and must be somehow discarded by a parser.
SyntaxNet applies neural networks to the ambiguity problem. An input sentence is processed from left to right, with dependencies between words being incrementally added as each word in the sentence is considered. At each point in processing, many decisions may be possible -- due to ambiguity -- and a neural network gives scores for competing decisions based on their plausibility. For this reason, it is very important to use beam search in the model. Instead of simply taking the first-best decision at each point, multiple partial hypotheses are kept at each step, with hypotheses only being discarded when there are several other higher-ranked hypotheses under consideration. An example of a left-to-right sequence of decisions that produces a simple parse is shown below for the sentence I booked a ticket to Google.
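A generic beam search skeleton looks like the following sketch; the moves and scores are invented placeholders rather than SyntaxNet's transition system and network:

```python
import heapq
import random

random.seed(3)
MOVES = ["shift", "attach-left", "attach-right"]  # simplified move set

def score(history, move):
    """Stand-in for the neural network's plausibility score for one move."""
    return random.random()

def beam_search(n_steps, beam_size=4):
    beam = [(0.0, [])]  # (cumulative score, sequence of decisions so far)
    for _ in range(n_steps):
        candidates = [(s + score(hist, m), hist + [m])
                      for s, hist in beam for m in MOVES]
        # Keep only the highest-scoring partial hypotheses at each step.
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beam[0]  # the best-scoring complete sequence

best_score, decisions = beam_search(n_steps=6)
print(decisions)
```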
Furthermore, as described in our paper, it is critical to tightly integrate learning and search in order to achieve the highest prediction accuracy. Parsey McParseface and other SyntaxNet models are some of the most complex networks that we have trained with the TensorFlow framework at Google. Given some data from the Google-supported Universal Dependencies project, you can train a parsing model on your own machine.
So How Accurate is Parsey McParseface?
On a standard benchmark consisting of randomly drawn English newswire sentences (the 20-year-old Penn Treebank), Parsey McParseface recovers individual dependencies between words with over 94% accuracy, beating our own previous state-of-the-art results, which were already better than any previous approach. While there are no explicit studies in the literature about human performance, we know from our in-house annotation projects that linguists trained for this task agree in 96-97% of the cases. This suggests that we are approaching human performance -- but only on well-formed text. Sentences drawn from the web are a lot harder to analyze, as we learned from the Google WebTreebank (released in 2011). Parsey McParseface achieves just over 90% parse accuracy on this dataset.
While the accuracy is not perfect, it’s certainly high enough to be useful in many applications. The major source of errors at this point is examples such as the prepositional phrase attachment ambiguity described above, which require real-world knowledge (e.g. that a street is not likely to be located in a car) and deep contextual reasoning. Machine learning (and in particular, neural networks) has made significant progress in resolving these ambiguities, but our work is still cut out for us: we would like to develop methods that can learn world knowledge and enable equal understanding of natural language across all languages and contexts.
To get started, see the SyntaxNet code and download the Parsey McParseface parser model. Happy parsing from the main developers, Chris Alberti, David Weiss, Daniel Andor, Michael Collins & Slav Petrov.
On the Personalities of Dead Authors
Wednesday, February 24, 2016
Posted by Marc Pickett, Software Engineer, Chris Tar, Engineering Manager and Brian Strope, Research Scientist
“Great, ice cream for dinner!”
How would you interpret that? If a 6 year old says it, it feels very different than if a parent says it. People are good at inferring the deeper meaning of language based on both the context in which something was said, and their knowledge of the personality of the speaker.
But can one program a computer to understand the intended meaning from natural language in a way similar to us? Developing a system that knows definitions of words and rules of grammar is one thing, but giving a computer conversational context along with the expectations of a speaker’s behaviors and language patterns is quite another!
To tackle this challenge, a Natural Language Understanding research group, led by Ray Kurzweil, works on building systems able to understand natural language at a deeper level. By experimenting with systems able to perceive and project different personality types, we aim to enable computers to interpret the meaning of natural language in a way similar to how we do.
One way to explore this research is to build a system capable of sentence prediction. Can we build a system that can, given a sentence from a book and knowledge of the author’s style and “personality”, predict what the author is most likely to write next?
We started by utilizing the works of a thousand different authors found on Project Gutenberg to see if we could train a Deep Neural Network (DNN) to predict, given an input sentence, what sentence would come next. The idea was to see whether a DNN could -- given millions of lines from a jumble of authors -- “learn” a pattern or style that would lead one sentence to follow another.
This initial system had no author ID at the input -- we just gave it pairs (line, following line) from 80% of the literary sample (saving 20% of it as a validation holdout). The labels at the output of the network are a simple YES or NO, depending on whether the example was truly a pair of sentences in sequence from the training data, or a randomly matched pair. This initial system had an error rate of 17.2%, where a random guess would be 50%. A slightly more sophisticated version also adds a fixed number of previous sentences for context, which decreased the error to 12.8%.
We then improved that initial system by giving the network an additional signal per example: a unique ID representing the author. We told it who was saying what. All examples from that author were now accompanied by this ID during training time. The new system learned to leverage the Author ID, and decreased the relative error by 12.3% compared to the previous system (from 12.8% down to 11.1%). At some level, the system is saying “I've been told that this is Shakespeare, who tends to write like this, so I'll take that into account when weighing which sentence is more likely to follow”. On a slightly different ranking task (pick which of two responses most likely follows, instead of just a yes/no on a given trigger/response pair), including the fixed window of previous sentences along with this author ID resulted in an error rate of less than 5%.
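In modern terms, the shape of such a network might look like the following Keras sketch; the sizes, features, and layers are hypothetical reconstructions, not the system described above:

```python
import tensorflow as tf

# Hypothetical sizes; the real system's features and dimensions are not public.
VOCAB, N_AUTHORS, DIM = 20_000, 1_000, 64

sent_a = tf.keras.Input(shape=(None,), dtype="int32")  # current sentence tokens
sent_b = tf.keras.Input(shape=(None,), dtype="int32")  # candidate next sentence
author = tf.keras.Input(shape=(), dtype="int32")       # author ID

embed = tf.keras.layers.Embedding(VOCAB, DIM)
pool = tf.keras.layers.GlobalAveragePooling1D()
author_vec = tf.keras.layers.Embedding(N_AUTHORS, DIM)(author)  # "author vector"

merged = tf.keras.layers.Concatenate()(
    [pool(embed(sent_a)), pool(embed(sent_b)), author_vec])
hidden = tf.keras.layers.Dense(128, activation="relu")(merged)
follows = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)  # YES/NO label

model = tf.keras.Model([sent_a, sent_b, author], follows)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```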
The 300-dimensional vectors our system derived to do these predictions are presumably representative of the author’s word choice, thinking, and style. We call these “author vectors”, analogous to word vectors or paragraph vectors. To get an intuitive sense of what these vectors are capturing, we projected the 300-dimensional space into two dimensions and plotted them, as shown in the figure below. This gives some semblance of the similarity and relative positions of authors in the space.
A two-dimensional representation of the vector embeddings for some of the authors in our study. To project the 300-dimensional vectors to two dimensions, we used the t-SNE algorithm. Note that contemporaries and influencers tend to be near each other (e.g., Nathaniel Hawthorne and Herman Melville, or Marlowe and Shakespeare).
It is interesting to consider which dimensions are most pertinent to defining personality and style, and which are more related to content or areas of interest. In the example above, we find Shakespeare and Marlowe in adjacent space. At the very least, these two dimensions reflect similarities of contemporary authors, but are there also measurable variables corresponding to “snark”, or humor, or sarcasm? Or perhaps there is something related to interests in sports?
With this working, we wondered, “How would the model respond to the questions of a personality test?” But to simulate how different authors might respond to questions found in such tests, we needed a network that, rather than strictly making a yes/no decision, would produce a yes/no decision influenced by the author vector -- including for sentences it hasn't seen before.
To simulate different authors’ responses to questions, we use the author vectors described above as inputs to our more general networks. In that way, we get the performance and generalization of the network across all authors and text it learned on, but influenced by what’s unique to a chosen author. Combined with our generative model, these vectors allow us to generate responses as different authors. In effect, one can chat with a statistical representation of the text written by Shakespeare!
Once we set the author vector for a chosen author, we posed the Myers-Briggs questions to the system as the “current sentence” and gave the Myers-Briggs response options as the next-sentence candidates. When we asked “Are you more of”: “a private person” or “an outgoing person” of our model of Shakespeare’s texts, it predicted “a private person”. When we changed the author vector to Mark Twain and posed the same question, we got “an outgoing person”.
If you're interested in more predictions our models made, here's the complete list for the small dataset of authors that we used. We have no reason to believe that these assessments are particularly accurate, since our systems weren't trained to do that well. Also, the responses are based on the writings of the author. Dialogs from fictional characters are not necessarily representative of the author’s actual personality. But we do know that these kinds of text-based systems can predict these kinds of classifications (for example, this UPenn study used language use in public posts to predict users' personality traits). So we thought it would be interesting to see what we could get from our early models.
Though we can in no way claim that these models accurately respond with what the authors would have said, there are a few amusing anecdotes. When asked “Who is your favorite author?” and given the options “Mark Twain”, “William Shakespeare”, “Myself”, and “Nobody”, the Twain model responded with “Mark Twain” and the Shakespeare model responded with “William Shakespeare”. Another example comes from the personality test: “When the phone rings”, Shakespeare's model “hope[s] someone else will answer”, while Twain's “[tries] to get to it first”. Fitting, perhaps, since the telephone was patented during Twain's lifetime, but after Shakespeare's.
This work is an early step towards better understanding intent, and how long-term context influences interpretation of text. In addition to being fun and interesting, this work has the potential to enrich products through personalization. For example, it could help provide more personalized response options for the recently introduced Smart Reply feature in Inbox by Gmail.
New ways to add Reminders in Inbox by Gmail
Wednesday, June 17, 2015
Posted by Dave Orr, Google Research Product Manager
07/11/2016 update - this feature is no longer supported. We apologize for the confusion and continue to look for ways to make Inbox for Gmail more useful.
Last week, Inbox by Gmail opened up and improved many of your favorite features, including two new ways to add Reminders.
First up, when someone emails you a to-do, Inbox can now suggest adding a Reminder so you don’t forget. Here's how it looks if your spouse emails you and asks you to buy milk on the way home:
To help you add Reminders, the Google Research team used natural language understanding technology to teach Inbox to recognize to-dos in email.
And much like Gmail and Inbox get better when you report spam, your feedback helps improve these suggested Reminders. You can accept or reject them with a single click:
The other new way to add Reminders in Inbox is to create Reminders in Google Keep -- they will appear in Inbox with a link back to the full note in Google Keep.
Hopefully, this little extra help gets you back to what matters more quickly and easily. Try the new features out, and as always, let us know what you think using the feedback link in the app.
Teaching machines to read between the lines (and a new corpus with entity salience annotations)
Monday, August 25, 2014
Posted by Dan Gillick, Research Scientist, and Dave Orr, Product Manager
Language understanding systems are largely trained on freely available data, such as the
Penn Treebank
, perhaps the most widely used linguistic resource ever created. We have previously released
lots of linguistic data
ourselves, to contribute to the language understanding community as well as encourage further research into these areas.
Now, we’re releasing a new dataset, based on another great resource: the
New York Times Annotated Corpus
, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages
use of the metadata
for all kinds of things, and has set up
a forum
to discuss related research.
We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people -- we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.
One way to approach the problem is to look for words that appear more often than their ordinary rates. For example, if you see the word “coach” 5 times in a
581 word article
, and compare that to the usual frequency of “coach” --
more like 5 in 330,000 words
-- you have reason to suspect the article has something to do with coaching. The term “basketball” is even more extreme, appearing 150,000 times more often than usual. This is the idea of the famous
TFIDF
, long used to index web pages.
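To make the arithmetic concrete, here is a back-of-the-envelope sketch of that frequency ratio, using the “coach” numbers above (the background rate is approximate):

```python
def frequency_ratio(count_in_doc, doc_len, count_in_background, background_len):
    """How many times more often a term appears in this document
    than in ordinary text: the heart of the TFIDF intuition."""
    doc_rate = count_in_doc / doc_len
    background_rate = count_in_background / background_len
    return doc_rate / background_rate

# "coach": 5 occurrences in a 581-word article,
# versus roughly 5 per 330,000 words in general text.
print(round(frequency_ratio(5, 581, 5, 330000)))  # ~568x more frequent than usual
```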
Congratulations to
Becky Hammon
, the NBA's first full-time female assistant coach! Image via Wikipedia.
Term ratios are a start, but we can do better. Search indexing these days is much more involved, using for example the distances between pairs of words on a page to capture their relatedness. Now, with the
Knowledge Graph
, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real world, which we already know quite a bit about.
Background information about entities ought to help us decide which of them are most salient. After all, an article’s author assumes her readers have some general understanding of the world, and probably a bit about sports too. Using background knowledge, we might be able to infer that the WNBA is a salient entity in the Becky Hammon article even though it only appears once.
To encourage research on leveraging background information, we are releasing a large dataset of annotations to accompany the New York Times Annotated Corpus, including resolved
Freebase entity IDs
and labels indicating which entities are salient. The salience annotations are determined by automatically aligning entities in the document with entities in accompanying human-written abstracts. Details of the salience annotations and some baseline results are described in our recent paper:
A New Entity Salience Task with Millions of Training Examples
(Jesse Dunietz and Dan Gillick).
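In simplified form, that alignment rule amounts to the sketch below: an entity in the article body is marked salient if the same resolved entity also appears in the human-written abstract. The real pipeline additionally runs coreference and entity resolution, which are elided here, and the MIDs shown are placeholders:

```python
def label_salience(doc_entity_mids, abstract_entity_mids):
    """Mark a document entity salient if the same resolved entity
    also appears in the human-written abstract."""
    abstract_set = set(abstract_entity_mids)
    return {mid: mid in abstract_set for mid in doc_entity_mids}

# Placeholder Freebase MIDs:
print(label_salience(["/m/aaa1", "/m/bbb2"], ["/m/aaa1"]))
# {'/m/aaa1': True, '/m/bbb2': False}
```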
Since our entity resolver works better for named entities like WNBA than for nominals like “coach” (this is the notoriously difficult
word sense disambiguation
problem, which we’ve
previously touched on
), the annotations are limited to names.
Below is sample output for a document. The first line contains the NYT document ID and the headline; each subsequent line includes an entity index, an indicator for salience, the mention count for this entity in the document as determined by our coreference system, the text of the first mention of the entity, the byte offsets (start and end) for the first mention of the entity, and the resolved Freebase MID.
Features like mention count and document positioning give reasonable salience predictions. But because they only describe what’s explicitly in the document, we expect that a system that uses background information to expose what’s implicit could give better results.
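For readers who want to work with the release, here is a sketch of parsing one entity line and applying a shallow baseline of that kind. We assume tab-separated fields in the order described above; consult the project page for the authoritative format:

```python
from collections import namedtuple

Entity = namedtuple("Entity", "index salient mentions first_text start end mid")

def parse_entity_line(line):
    """Parse one entity line: index, salience flag, mention count,
    first-mention text, byte offsets, and Freebase MID."""
    index, salient, mentions, first_text, start, end, mid = line.rstrip("\n").split("\t")
    return Entity(int(index), salient == "1", int(mentions),
                  first_text, int(start), int(end), mid)

def predict_salient(entity, early_cutoff=1000):
    """Shallow baseline: guess salient if the entity is mentioned
    repeatedly or first appears early in the document."""
    return entity.mentions >= 2 or entity.start < early_cutoff
```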
Download the data directly
from Google Drive
, or visit the project home page with more information at
our Google Code site
. We look forward to seeing what you come up with!
A Billion Words: Because today's language modeling standard should be higher
Wednesday, April 30, 2014
Posted by Dave Orr, Product Manager, and Ciprian Chelba, Research Scientist
Language is chock full of ambiguity, and it can turn up in surprising places. Many words are hard to tell apart without context: most Americans
pronounce “ladder” and “latter” identically
, for instance. Keyboard inputs on mobile devices have a similar problem, especially for
IME keyboards
. For example, the input patterns for “Yankees” and “takes” look very similar:
Photo credit: Kurt Partridge
But in this context -- the previous two words, “New York” -- “Yankees” is much more likely.
One key way computers use context is with
language models
. These are used for predictive keyboards, but also speech recognition, machine translation, spelling correction, query suggestions, and so on. Often those are specialized: word order for queries versus web pages can be very different. Either way, having an accurate language model with wide coverage drives the quality of all these applications.
Because the components of such complex systems interact, error attribution can be tricky when evaluating their quality. Good engineering practice is to evaluate the quality of each module separately, including the language model. We believe the field could benefit from a large, standard dataset with benchmarks, enabling easy comparison and experiments with new modeling techniques.
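As a concrete, if toy, illustration of both points -- context-driven prediction and stand-alone evaluation -- here is a tiny add-one-smoothed bigram model with a perplexity calculation. Benchmark-quality models use far better smoothing (e.g. Kneser-Ney) or neural networks:

```python
import math
from collections import Counter

def train_bigram(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    def prob(prev, word):
        # Add-one (Laplace) smoothed P(word | prev).
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob

def perplexity(prob, tokens):
    # Intrinsic language-model metric: exp of the average negative log-likelihood.
    log_likelihood = sum(math.log(prob(p, w)) for p, w in zip(tokens, tokens[1:]))
    return math.exp(-log_likelihood / (len(tokens) - 1))

train = "new york yankees win again as new york celebrates".split()
prob = train_bigram(train)
print(prob("york", "yankees") > prob("york", "takes"))  # True: context disambiguates
print(perplexity(prob, "new york yankees win".split()))  # lower is better
```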
To that end,
we are releasing scripts
that convert a set of public data into a language modeling benchmark of over a billion words, with standardized training and test splits, described in an
arXiv paper
. Along with the scripts, we’re releasing the processed data, including the training and test splits, in one convenient location. This will make it much easier for the research community to quickly reproduce results, and we hope it will speed up progress on these tasks.
The benchmark scripts and data are freely available, and can be found here:
http://www.statmt.org/lm-benchmark/
The field needs a new and better standard benchmark. Currently, researchers report results on datasets of their choice, and those results are very hard to reproduce because there is no standard preprocessing. We hope that this release will solve both problems and become the standard benchmark for language modeling experiments. As more researchers use the new benchmark, comparisons will be easier and more accurate, and progress will be faster.
For all the researchers out there, try out this benchmark, run your experiments, and let us know how it goes -- or publish, and we’ll enjoy finding your results at conferences and in journals.
Free Language Lessons for Computers
Tuesday, December 03, 2013
Posted by Dave Orr, Google Research Product Manager
Not everything that can be counted counts.
Not everything that counts can be counted.
-
William Bruce Cameron
50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.
These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.
But data by itself doesn’t mean much. Data is only valuable in the right context, and only if it leads to increased knowledge. Labeled data is critical to train and evaluate machine-learned systems in many arenas, improving systems that can increase our ability to understand the world. Advances in natural language understanding, information retrieval, information extraction, computer vision, etc. can help us
tell stories
, mine for valuable insights, or
visualize information
in beautiful and compelling ways.
That’s why we are pleased to be able to release sets of labeled data from various domains and with various annotations, some automatic and some manual. Our hope is that the research community will use these datasets in ways both straightforward and surprising, to improve systems for annotation or understanding, and perhaps launch new efforts we haven’t thought of.
Here’s a listing of the major datasets we’ve released in the last year, or you can subscribe to our
mailing list
. Please tell us what you’ve managed to accomplish, or send us pointers to papers that use this data. We want to see what the research world can do with what we’ve created.
50,000 Lessons on How to Read: a Relation Extraction Corpus
What is it
: A human-judged dataset of two relations involving public figures on
Wikipedia
: about 10,000 examples of “place of birth” and 40,000 examples of “attended or graduated from an institution.”
Where can I find it
:
https://code.google.com/p/relation-extraction-corpus/
I want to know more
: Here’s a
handy blog post
with a broader explanation, descriptions and examples of the data, and plenty of links to learn more.
11 Billion Clues in 800 Million Documents
What is it
: We took the ClueWeb corpora and automatically labeled concepts and entities with
Freebase concept IDs
, an example of entity resolution. This dataset is huge: nearly 800 million web pages.
Where can I find it
: We released two corpora:
ClueWeb09 FACC
and
ClueWeb12 FACC
.
I want to know more
: We described the process and results in a recent blog post.
Features Extracted From YouTube Videos for Multiview Learning
What is it
: Multiple feature families from a set of public YouTube videos of games. The videos are labeled with one of 30 categories, and each has an associated set of visual, auditory, and textual features.
Where can I find it
: The data and more information can be obtained from the
UCI machine learning repository (multiview video dataset)
, or from
Google’s repository
.
I want to know more
: Read more about the data and uses for it
here
.
40 Million Entities in Context
What is it
: A disambiguation set consisting of pointers to 10 million web pages with 40 million entities that have links to Wikipedia. This is another entity resolution corpus, since the links can be used to disambiguate the mentions, but unlike the ClueWeb example above, the links are inserted by the web page authors and can therefore be considered human annotation.
Where can I find it
: Here’s the
WikiLinks corpus
, and tools can be found to help use this data on our partner’s page:
Umass Wiki-links
.
I want to know more
: Other disambiguation sets, data formats, ideas for uses of this data, and more can be found at our
blog post announcing the release
.
Distributing the Edit History of Wikipedia Infoboxes
What is it
: The edit history of 1.8 million infoboxes in Wikipedia pages in one handy resource. Attributes on Wikipedia change over time, and some of them change more than others. Understanding attribute change is important for extracting accurate and useful information from Wikipedia.
Where can I find it
:
Download from Google
or from
Wikimedia Deutschland
.
I want to know more
: We
posted
a detailed look at the data, the process for gathering it, and where to find it. You can also read a
paper
we published on the release.
Note the change in the capital of Palau.
Syntactic Ngrams over Time
What is it
: We automatically syntactically analyzed 350 billion words from the 3.5 million English language books in
Google Books
, and collated and released the results as billions of unique tree fragments with counts, sorted into types. The underlying corpus is the same one behind the recently updated
Google Ngram Viewer
.
Where can I find it
:
http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
I want to know more
: We discussed the nature of dependency parses and describe the data and release in a
blog post
. We also published a
paper about the release
.
Dictionaries for linking Text, Entities, and Ideas
What is it
: We created a large database of 175 million strings paired with 7.5 million concepts, annotated with counts, all mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor-text spans that link to the concepts in question. (A sketch of one way to use such a dictionary follows this entry.)
Where can I find it
:
http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2
I want to know more
: A description of the data, several examples, and ideas for uses for it can be found in a
blog post
or in the
associated paper
.
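One classic use of such a dictionary is turning anchor-text counts into P(concept | string) for ranking candidate concepts. Here is a minimal sketch with made-up counts; the on-disk format of the release may differ:

```python
# string -> {Wikipedia article: anchor count}; counts are illustrative.
dictionary = {
    "jaguar": {"Jaguar_Cars": 8000, "Jaguar": 5000, "Jacksonville_Jaguars": 2000},
}

def candidates(string):
    """Rank candidate concepts for a string by P(concept | string)."""
    counts = dictionary.get(string, {})
    total = sum(counts.values()) or 1
    return sorted(((concept, n / total) for concept, n in counts.items()),
                  key=lambda pair: -pair[1])

print(candidates("jaguar"))
# [('Jaguar_Cars', 0.533...), ('Jaguar', 0.333...), ('Jacksonville_Jaguars', 0.133...)]
```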
Other datasets
Not every release had its own blog post describing it. Here are some other releases:
Automatic
Freebase annotations
of Trec’s Million Query and Web track queries.
A
set of Freebase triples
that have been deleted from Freebase over time -- 63 million of them.