Google Research Blog
The latest news from Research at Google
Meet Parsey’s Cousins: Syntax for 40 languages, plus new SyntaxNet capabilities
Monday, August 08, 2016
Posted by Chris Alberti, Dave Orr & Slav Petrov, Google Natural Language Understanding Team
Just in time for ACL 2016, we are pleased to announce that Parsey McParseface, released in May as part of SyntaxNet and the basis for the Cloud Natural Language API, now has 40 cousins! Parsey’s Cousins is a collection of pretrained syntactic models for 40 languages, capable of analyzing the native language of more than half of the world’s population at often unprecedented accuracy. To better address the linguistic phenomena occurring in these languages we have endowed SyntaxNet with new abilities for Text Segmentation and Morphological Analysis.
When we released Parsey, we were already planning to expand to more languages, and it soon became clear that this was both urgent and important, because researchers were having trouble creating top notch SyntaxNet models for other languages.
The reason for that is a little bit subtle. SyntaxNet, like other TensorFlow models, has a lot of knobs to turn, which affect accuracy and speed. These knobs are called hyperparameters, and control things like the learning rate and its decay, momentum, and random initialization. Because neural networks are more sensitive to the choice of these hyperparameters than many other machine learning algorithms, picking the right hyperparameter setting is very important. Unfortunately there is no tested and proven way of doing this and picking good hyperparameters is mostly an empirical science -- we try a bunch of settings and see what works best.
An additional challenge is that training these models can take a long time, several days on very fast hardware. Our solution is to train many models in parallel via MapReduce, and when one looks promising, train a bunch more models with similar settings to fine-tune the results. This can really add up -- on average, we train more than 70 models per language. The plot below shows how the accuracy varies depending on the hyperparameters as training progresses. The best models are up to 4% absolute more accurate than ones trained without hyperparameter tuning.
Held-out set accuracy for various English parsing models with different hyperparameters (each line corresponds to one training run with specific hyperparameters). In some cases training is a lot slower and in many cases a suboptimal choice of hyperparameters leads to significantly lower accuracy. We are releasing the best model that we were able to train for each language.
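To make that tuning workflow concrete, here is a minimal sketch of a random hyperparameter sweep. The train_and_evaluate function and the parameter ranges are illustrative assumptions, not SyntaxNet's actual training code; in practice each call would be a long-running job launched in parallel.

import random

def train_and_evaluate(learning_rate, decay_steps, momentum, seed):
    # Hypothetical stand-in for a full training run that returns held-out accuracy.
    ...
    return random.random()  # placeholder result for the sketch

def random_sweep(num_trials=70):
    """Sample hyperparameters, train one model per setting, keep the best."""
    best = None
    for _ in range(num_trials):
        params = {
            "learning_rate": 10 ** random.uniform(-3, -1),
            "decay_steps": random.choice([2500, 5000, 10000]),
            "momentum": random.uniform(0.7, 0.95),
            "seed": random.randrange(10**6),  # controls random initialization
        }
        accuracy = train_and_evaluate(**params)
        if best is None or accuracy > best[0]:
            best = (accuracy, params)
    return best

if __name__ == "__main__":
    print(random_sweep(num_trials=10))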
In order to do a good job at analyzing the grammar of other languages, it was not sufficient to just fine-tune our English setup. We also had to expand the capabilities of SyntaxNet. The first extension is a model for text segmentation, which is the task of identifying word boundaries. In languages like English, this isn’t very hard -- you can mostly look for spaces and punctuation. In Chinese, however, this can be very challenging, because words are not separated by spaces. To correctly analyze dependencies between Chinese words, SyntaxNet needs to understand text segmentation -- and now it does.
Analysis of a Chinese string into a parse tree showing dependency labels, word tokens, and parts of speech (read top to bottom for each word token).
The second extension is a model for morphological analysis. Morphology is a language feature that is poorly represented in English. It describes inflection: i.e., how the grammatical function and meaning of the word changes as its spelling changes. In English, we add an -s to a word to indicate plurality. In Russian, a heavily inflected language, morphology can indicate number, gender, whether the word is the subject or object of a sentence, possessives, prepositional phrases, and more. To understand the syntax of a sentence in Russian, SyntaxNet needs to understand morphology -- and now it does.
Parse trees showing dependency labels, parts of speech, and morphology.
As you might have noticed, the parse trees for all of the sentences above look very similar. This is because we follow the content-head principle, under which dependencies are drawn between content words, with function words becoming leaves in the parse tree. This idea was developed by the Universal Dependencies project in order to increase parallelism between languages. Parsey’s Cousins are trained on treebanks provided by this project and are designed to be cross-linguistically consistent and thus easier to use in multi-lingual language understanding applications.
Using the same set of labels across languages can help us understand how sentences in different languages, or variations in the same language, convey the same meaning. In all of the above examples, the root indicates the main verb of the sentence and there is a passive nominal subject (indicated by the arc labeled with ‘nsubjpass’) and a passive auxiliary (‘auxpass’). If you look closely, you will also notice some differences because the grammar of each language differs. For example, English uses the preposition ‘by,’ where Russian uses morphology to mark that the phrase ‘the publisher (издателем)’ is in instrumental case -- the meaning is the same, it is just expressed differently.
Google has been involved in the Universal Dependencies project since its inception and we are very excited to be able to bring together our efforts on datasets and modeling. We hope that this release will facilitate research progress in building computer systems that can understand all of the world’s languages.
Parsey's Cousins can be found on GitHub, along with Parsey McParseface and SyntaxNet.
Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source
Thursday, May 12, 2016
Posted by Slav Petrov, Senior Staff Research Scientist
At Google, we spend a lot of time thinking about how computer systems can read and understand human language in order to process it in intelligent ways. Today, we are excited to share the fruits of our research with the broader community by releasing SyntaxNet, an open-source neural network framework implemented in TensorFlow that provides a foundation for Natural Language Understanding (NLU) systems. Our release includes all the code needed to train new SyntaxNet models on your own data, as well as Parsey McParseface, an English parser that we have trained for you and that you can use to analyze English text.
Parsey McParseface is built on powerful machine learning algorithms that learn to analyze the linguistic structure of language, and that can explain the functional role of each word in a given sentence. Because Parsey McParseface is the most accurate such model in the world, we hope that it will be useful to developers and researchers interested in automatic extraction of information, translation, and other core applications of NLU.
How does SyntaxNet work?
SyntaxNet is a framework for what’s known in academic circles as a syntactic parser, which is a key first component in many NLU systems. Given a sentence as input, it tags each word with a part-of-speech (POS) tag that describes the word's syntactic function, and it determines the syntactic relationships between words in the sentence, represented in the dependency parse tree. These syntactic relationships are directly related to the underlying meaning of the sentence in question. To take a very simple example, consider the following dependency tree for Alice saw Bob:
This structure encodes that Alice and Bob are nouns and saw is a verb. The main verb saw is the root of the sentence and Alice is the subject (nsubj) of saw, while Bob is its direct object (dobj). As expected, Parsey McParseface analyzes this sentence correctly, but also understands the following more complex example:
This structure again encodes the fact that Alice and Bob are the subject and object respectively of saw, and in addition that Alice is modified by a relative clause with the verb reading, that saw is modified by the temporal modifier yesterday, and so on. The grammatical relationships encoded in dependency structures allow us to easily recover the answers to various questions, for example whom did Alice see?, who saw Bob?, what had Alice been reading about? or when did Alice see Bob?.
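To illustrate how such questions can be read off a dependency parse, here is a minimal sketch that represents a parse as (head, label, dependent) triples and looks up subjects and objects. The triples are hand-written for this example, not actual SyntaxNet output.

# Hand-written dependency triples for "Alice saw Bob" (illustrative only).
PARSE = [
    ("saw", "nsubj", "Alice"),   # Alice is the subject of saw
    ("saw", "dobj", "Bob"),      # Bob is the direct object of saw
]

def dependents_of(verb, label, parse):
    """Return the dependents of `verb` attached with the given label."""
    return [dep for head, lab, dep in parse if head == verb and lab == label]

print(dependents_of("saw", "nsubj", PARSE))  # who saw Bob?        -> ['Alice']
print(dependents_of("saw", "dobj", PARSE))   # whom did Alice see? -> ['Bob']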
Why is Parsing So Hard For Computers to Get Right?
One of the main problems that makes parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate length sentences - say 20 or 30 words in length - to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context. As a very simple example, the sentence Alice drove down the street in her car has at least two possible dependency parses:
The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can either modify drove or street; this example is an instance of what is called prepositional phrase attachment ambiguity.
Humans do a remarkable job of dealing with ambiguity, almost to the point where the problem is unnoticeable; the challenge is for computers to do the same. Multiple ambiguities such as these in longer sentences conspire to give a combinatorial explosion in the number of possible structures for a sentence. Usually the vast majority of these structures are wildly implausible, but are nevertheless possible and must be somehow discarded by a parser.
SyntaxNet applies neural networks to the ambiguity problem. An input sentence is processed from left to right, with dependencies between words being incrementally added as each word in the sentence is considered. At each point in processing many decisions may be possible—due to ambiguity—and a neural network gives scores for competing decisions based on their plausibility. For this reason, it is very important to use beam search in the model. Instead of simply taking the first-best decision at each point, multiple partial hypotheses are kept at each step, with hypotheses only being discarded when there are several other higher-ranked hypotheses under consideration. An example of a left-to-right sequence of decisions that produces a simple parse is shown below for the sentence I booked a ticket to Google.
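As a rough illustration of beam search over a left-to-right sequence of decisions, here is a minimal sketch. The score and next_decisions functions are placeholders standing in for the neural network and the parser's transition system; this is not the actual SyntaxNet implementation.

import heapq

def beam_search(sentence, score, next_decisions, beam_size=8):
    """Keep the `beam_size` best partial decision sequences at each step."""
    beam = [(0.0, [])]  # (cumulative score, decisions so far)
    # One decision per word here for simplicity; a real transition system
    # typically takes more steps per sentence.
    for _ in range(len(sentence)):
        candidates = []
        for total, decisions in beam:
            for decision in next_decisions(sentence, decisions):
                candidates.append((total + score(sentence, decisions, decision),
                                   decisions + [decision]))
        # Retain only the highest-scoring hypotheses.
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])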
Furthermore, as described in our paper, it is critical to tightly integrate learning and search in order to achieve the highest prediction accuracy. Parsey McParseface and other SyntaxNet models are some of the most complex networks that we have trained with the TensorFlow framework at Google. Given some data from the Google supported Universal Dependencies project, you can train a parsing model on your own machine.
So How Accurate is Parsey McParseface?
On a standard benchmark consisting of randomly drawn English newswire sentences (the 20-year-old Penn Treebank), Parsey McParseface recovers individual dependencies between words with over 94% accuracy, beating our own previous state-of-the-art results, which were already better than any previous approach. While there are no explicit studies in the literature about human performance, we know from our in-house annotation projects that linguists trained for this task agree in 96-97% of the cases. This suggests that we are approaching human performance—but only on well-formed text. Sentences drawn from the web are a lot harder to analyze, as we learned from the Google WebTreebank (released in 2011). Parsey McParseface achieves just over 90% parse accuracy on this dataset.
While the accuracy is not perfect, it’s certainly high enough to be useful in many applications. The major source of errors at this point is examples such as the prepositional phrase attachment ambiguity described above, which require real-world knowledge (e.g. that a street is not likely to be located in a car) and deep contextual reasoning. Machine learning (and in particular, neural networks) has made significant progress in resolving these ambiguities. But our work is still cut out for us: we would like to develop methods that can learn world knowledge and enable equal understanding of natural language across all languages and contexts.
To get started, see the SyntaxNet code and download the Parsey McParseface parser model. Happy parsing from the main developers, Chris Alberti, David Weiss, Daniel Andor, Michael Collins & Slav Petrov.
On the Personalities of Dead Authors
Wednesday, February 24, 2016
Posted by Marc Pickett, Software Engineer, Chris Tar, Engineering Manager and Brian Strope, Research Scientist
“Great, ice cream for dinner!”
How would you interpret that? If a 6 year old says it, it feels very different than if a parent says it. People are good at inferring the deeper meaning of language based on both the context in which something was said, and their knowledge of the personality of the speaker.
But can one program a computer to understand the intended meaning of natural language in a way similar to how we do? Developing a system that knows definitions of words and rules of grammar is one thing, but giving a computer conversational context along with the expectations of a speaker’s behaviors and language patterns is quite another!
To tackle this challenge, a Natural Language Understanding research group, led by Ray Kurzweil, works on building systems able to understand natural language at a deeper level. By experimenting with systems able to perceive and project different personality types, it is our goal to enable computers to interpret the meaning of natural language similar to the way we do.
One way to explore this research is to build a system capable of sentence prediction. Can we build a system that can, given a sentence from a book and knowledge of the author’s style and “personality”, predict what the author is most likely to write next?
We started by utilizing the works of a thousand different authors found on Project Gutenberg to see if we could train a Deep Neural Network (DNN) to predict, given an input sentence, what sentence would come next. The idea was to see whether a DNN could - given millions of lines from a jumble of authors - “learn” a pattern or style that would lead one sentence to follow another.
This initial system had no author ID at the input - we just gave it pairs (line, following line) from 80% of the literary sample (saving 20% of it as a validation holdout). The labels at the output of the network are a simple YES or NO, depending on whether the example was truly a pair of sentences in sequence from the training data, or a randomly matched pair. This initial system had an error rate of 17.2%, where a random guess would be 50%. A slightly more sophisticated version also adds a fixed number of previous sentences for context, which decreased the error down to 12.8%.
We then improved that initial system by giving the network an additional signal per example: a unique ID representing the author. We told it who was saying what. All examples from that author were now accompanied by this ID during training time. The new system learned to leverage the Author ID, and decreased the relative error by 12.3% compared to the previous system (from 12.8% down to 11.1%). At some level, the system is saying “I've been told that this is Shakespeare, who tends to write like this, so I'll take that into account when weighing which sentence is more likely to follow”. On a slightly different ranking task (pick which of two responses most likely follows, instead of just a yes/no on a given trigger/response pair), including the fixed window of previous sentences along with this author ID resulted in an error rate of less than 5%.
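As a rough sketch of this kind of setup, the snippet below scores (sentence, candidate next sentence) pairs with a yes/no classifier that also consumes an author embedding. The encoder, layer sizes, and weights are illustrative assumptions, not the actual research system.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 300-d sentence encodings and 300-d author vectors.
SENT_DIM, AUTHOR_DIM, NUM_AUTHORS = 300, 300, 1000

author_vectors = rng.normal(size=(NUM_AUTHORS, AUTHOR_DIM)) * 0.01  # learned during training
W = rng.normal(size=(2 * SENT_DIM + AUTHOR_DIM,)) * 0.01            # classifier weights
b = 0.0

def encode(sentence):
    """Placeholder sentence encoder (e.g. averaged word vectors)."""
    return rng.normal(size=SENT_DIM)

def p_follows(current, candidate, author_id):
    """Probability that `candidate` follows `current` for this author."""
    features = np.concatenate(
        [encode(current), encode(candidate), author_vectors[author_id]])
    return 1.0 / (1.0 + np.exp(-(features @ W + b)))

# Score a candidate response for a given author (cf. the ranking task above).
print(p_follows("To be, or not to be:", "that is the question.", author_id=42))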
The 300 dimensional vectors our system derived to do these predictions are presumably representative of the Author’s word choice, thinking, and style. We call these “Author vectors”, analogous to word vectors or paragraph vectors. To get an intuitive sense of what these vectors are capturing, we projected the 300 dimensional space into two dimensions and plotted them as shown in the figure below. This gives some semblance of similarity and relative positions of authors in the space.
A two-dimensional representation of the vector embeddings for some of the authors in our study. To project the 300 dimensional vectors to two dimensions, we used the t-SNE algorithm. Note that contemporaries and influencers tend to be near each other (e.g., Nathaniel Hawthorne and Herman Melville, or Marlowe and Shakespeare).
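A minimal sketch of this kind of projection, assuming the author vectors are available as a NumPy array and using scikit-learn's t-SNE implementation; the array and author names here are made up for illustration.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Hypothetical 300-dimensional author vectors, one row per author.
author_names = ["Shakespeare", "Marlowe", "Twain", "Melville", "Hawthorne"]
author_vectors = np.random.RandomState(0).normal(size=(len(author_names), 300))

# Project to two dimensions; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(author_vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for name, (x, y) in zip(author_names, coords):
    plt.annotate(name, (x, y))
plt.savefig("author_vectors_tsne.png")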
It is interesting to consider which dimensions are most pertinent to defining personality and style, and which are more related to content or areas of interest. In the example above, we find Shakespeare and Marlowe in adjacent space. At the very least, these two dimensions reflect similarities of contemporary authors, but are there also measurable variables corresponding to “snark”, or humor, or sarcasm? Or perhaps there is something related to interests in sports?
With this working, we wondered, “How would the model respond to the questions of a personality test?” But to simulate how different authors might respond to questions found in such tests, we needed an NN that, rather than strictly making a yes/no decision, would produce a yes/no decision while being influenced by the author vector - including on sentences it hasn't seen before.
To simulate different authors’ responses to questions, we use the author vectors described above as inputs to our more general networks. In that way, we get the performance and generalization of the network across all authors and text it learned on, but influenced by what’s unique to a chosen author. Combined with our generative model, these vectors allow us to generate responses as different authors. In effect, one can chat with a statistical representation of the text written by Shakespeare!
Once we set the author vector for a chosen author, we posed the Myers Briggs questions to the system as the “current sentence” and gave the Myers Briggs response options as the next-sentence candidates. When we asked “Are you more of”: “a private person” or “an outgoing person” to our model of Shakespeare’s texts, it predicted “a private person”. When we changed the author vector to Mark Twain and posed the same question, we got “an outgoing person”.
If you're interested in more predictions our models made, here's the complete list for the small dataset of authors that we used. We have no reason to believe that these assessments are particularly accurate, since our systems weren't trained to do that well. Also, the responses are based on the writings of the author. Dialogs from fictional characters are not necessarily representative of the author’s actual personality. But we do know that these kinds of text-based systems can predict these kinds of classifications (for example this UPenn study used language use in public posts to predict users' personality traits). So we thought it would be interesting to see what we could get from our early models.
Though we can in no way claim that these models accurately respond with what the authors would have said, there are a few amusing anecdotes. When we asked “Who is your favorite author?” and gave the options “Mark Twain”, “William Shakespeare”, “Myself”, and “Nobody”, the Twain model responded with “Mark Twain” and the Shakespeare model responded with “William Shakespeare”. Another example comes from the personality test: “When the phone rings”, Shakespeare's model “hope[s] someone else will answer”, while Twain's “[tries] to get to it first”. Fitting, perhaps, since the telephone was patented during Twain's lifetime, but after Shakespeare's.
This work is an early step towards better understanding intent, and how long-term context influences interpretation of text. In addition to being fun and interesting, this work has the potential to enrich products through personalization. For example, it could help provide more personalized response options for the recently introduced Smart Reply feature in Inbox by Gmail.
New ways to add Reminders in Inbox by Gmail
Wednesday, June 17, 2015
Posted by Dave Orr, Google Research Product Manager
Last week, Inbox by Gmail opened up and improved many of your favorite features, including two new ways to add Reminders.
First up, when someone emails you a to-do, Inbox can now suggest adding a Reminder so you don’t forget. Here's how it looks if your spouse emails you and asks you to buy milk on the way home:
To help you add Reminders, the Google Research team used natural language understanding technology to teach Inbox to recognize to-dos in email.
And much like Gmail and Inbox get better when you report spam, your feedback helps improve these suggested Reminders. You can accept or reject them with a single click:
The other new way to add Reminders in Inbox is to create Reminders in Google Keep--they will appear in Inbox with a link back to the full note in Google Keep.
Hopefully, this little extra help gets you back to what matters more quickly and easily. Try the new features out, and as always, let us know what you think using the feedback link in the app.
Teaching machines to read between the lines (and a new corpus with entity salience annotations)
Monday, August 25, 2014
Posted by Dan Gillick, Research Scientist, and Dave Orr, Product Manager
Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas.
Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.
We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people -- we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.
One way to approach the problem is to look for words that appear more often than their ordinary rates. For example, if you see the word “coach” 5 times in a 581 word article, and compare that to the usual frequency of “coach” -- more like 5 in 330,000 words -- you have reason to suspect the article has something to do with coaching. The term “basketball” is even more extreme, appearing 150,000 times more often than usual. This is the idea of the famous TFIDF, long used to index web pages.
Congratulations to Becky Hammon, first female NBA coach! Image via Wikipedia.
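Here is a minimal sketch of that term-ratio idea, comparing in-document frequency to a background rate. The background counts are made-up numbers for illustration.

from collections import Counter

# Made-up background rates: occurrences per word in a large reference corpus.
BACKGROUND_RATE = {"coach": 5 / 330_000, "basketball": 1 / 1_000_000, "the": 0.05}

def salience_ratios(document_words):
    """Ratio of observed frequency to background frequency for each term."""
    counts = Counter(w.lower() for w in document_words)
    total = sum(counts.values())
    ratios = {}
    for word, count in counts.items():
        background = BACKGROUND_RATE.get(word)
        if background:
            ratios[word] = (count / total) / background
    return ratios

doc = ["coach"] * 5 + ["basketball"] * 3 + ["the"] * 50 + ["game"] * 10
print(sorted(salience_ratios(doc).items(), key=lambda kv: -kv[1]))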
Term ratios are a start, but we can do better. Search indexing these days is much more involved, using for example the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real world which we already know quite a bit about.
Background information about entities ought to help us decide which of them are most salient. After all, an article’s author assumes her readers have some general understanding of the world, and probably a bit about sports too. Using background knowledge, we might be able to infer that the WNBA is a salient entity in the Becky Hammon article even though it only appears once.
To encourage research on leveraging background information, we are releasing a large dataset of annotations to accompany the New York Times Annotated Corpus, including resolved Freebase entity IDs and labels indicating which entities are salient. The salience annotations are determined by automatically aligning entities in the document with entities in accompanying human-written abstracts. Details of the salience annotations and some baseline results are described in our recent paper: A New Entity Salience Task with Millions of Training Examples (Jesse Dunietz and Dan Gillick).
Since our entity resolver works better for named entities like WNBA than for nominals like “coach” (this is the notoriously difficult word sense disambiguation problem, which we’ve previously touched on), the annotations are limited to names.
Below is sample output for a document. The first line contains the NYT document ID and the headline; each subsequent line includes an entity index, an indicator for salience, the mention count for this entity in the document as determined by our coreference system, the text of the first mention of the entity, the byte offsets (start and end) for the first mention of the entity, and the resolved Freebase MID.
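Assuming a tab-separated layout matching the description above (check the released files for the exact delimiter and column order), a minimal parsing sketch might look like this; the field names are just for illustration.

import csv
from dataclasses import dataclass

@dataclass
class EntityAnnotation:
    index: int
    salient: bool
    mention_count: int
    first_mention: str
    start_byte: int
    end_byte: int
    freebase_mid: str

def read_document(lines):
    """Parse one document block: a header line, then one line per entity."""
    doc_id, headline = lines[0].split("\t", 1)
    entities = []
    for row in csv.reader(lines[1:], delimiter="\t"):
        index, salient, count, mention, start, end, mid = row
        entities.append(EntityAnnotation(int(index), salient == "1", int(count),
                                         mention, int(start), int(end), mid))
    return doc_id, headline, entities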
Features like mention count and document positioning give reasonable salience predictions. But because they only describe what’s explicitly in the document, we expect a system that uses background information to expose what’s implicit could give better results.
Download the data directly from Google Drive, or visit the project home page with more information at our Google Code site. We look forward to seeing what you come up with!
A Billion Words: Because today's language modeling standard should be higher
Wednesday, April 30, 2014
Posted by Dave Orr, Product Manager, and Ciprian Chelba, Research Scientist
Language is chock full of ambiguity, and it can turn up in surprising places. Many words are hard to tell apart without context: most Americans pronounce “ladder” and “latter” identically, for instance. Keyboard inputs on mobile devices have a similar problem, especially for IME keyboards. For example, the input patterns for “Yankees” and “takes” look very similar:
Photo credit: Kurt Partridge
But in this context -- the previous two words, “New York” -- “Yankees” is much more likely.
One key way computers use context is with language models. These are used for predictive keyboards, but also speech recognition, machine translation, spelling correction, query suggestions, and so on. Often those are specialized: word order for queries versus web pages can be very different. Either way, having an accurate language model with wide coverage drives the quality of all these applications.
Due to interactions between components, one thing that can be tricky when evaluating the quality of such complex systems is error attribution. Good engineering practice is to evaluate the quality of each module separately, including the language model. We believe that the field could benefit from a large, standard set with benchmarks for easy comparison and experiments with new modeling techniques.
To that end, we are releasing scripts that convert a set of public data into a language modeling benchmark of over a billion words, with standardized training and test splits, described in an arXiv paper. Along with the scripts, we’re releasing the processed data in one convenient location, along with the training and test data. This will make it much easier for the research community to quickly reproduce results, and we hope will speed up progress on these tasks.
The benchmark scripts and data are freely available, and can be found here:
http://www.statmt.org/lm-benchmark/
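Language models on this benchmark are typically compared by perplexity on the held-out test set. Here is a minimal sketch of that computation for any model exposing per-word probabilities; the model.prob interface is a stand-in for illustration, not part of the released scripts.

import math

def perplexity(model, sentences):
    """Perplexity = exp(average negative log-probability per token)."""
    log_prob_sum, token_count = 0.0, 0
    for sentence in sentences:
        context = ["<s>"]
        for word in sentence + ["</s>"]:
            log_prob_sum += math.log(model.prob(word, context))  # stand-in API
            context.append(word)
            token_count += 1
    return math.exp(-log_prob_sum / token_count)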
The field needs a new and better standard benchmark. Currently, researchers report results on test sets of their choice, and results are very hard to reproduce because of a lack of a standard in preprocessing. We hope that this release will solve both those problems, and become the standard benchmark for language modeling experiments. As more researchers use the new benchmark, comparisons will be easier and more accurate, and progress will be faster.
For all the researchers out there, try out this model, run your experiments, and let us know how it goes -- or publish, and we’ll enjoy finding your results at conferences and in journals.
Free Language Lessons for Computers
Tuesday, December 03, 2013
Posted by Dave Orr, Google Research Product Manager
Not everything that can be counted counts.
Not everything that counts can be counted.
- William Bruce Cameron
50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.
These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.
But data by itself doesn’t mean much. Data is only valuable in the right context, and only if it leads to increased knowledge. Labeled data is critical to train and evaluate machine-learned systems in many arenas, improving systems that can increase our ability to understand the world. Advances in natural language understanding, information retrieval, information extraction, computer vision, etc. can help us tell stories, mine for valuable insights, or visualize information in beautiful and compelling ways.
That’s why we are pleased to be able to release sets of labeled data from various domains and with various annotations, some automatic and some manual. Our hope is that the research community will use these datasets in ways both straightforward and surprising, to improve systems for annotation or understanding, and perhaps launch new efforts we haven’t thought of.
Here’s a listing of the major datasets we’ve released in the last year, or you can subscribe to our mailing list. Please tell us what you’ve managed to accomplish, or send us pointers to papers that use this data. We want to see what the research world can do with what we’ve created.
50,000 Lessons on How to Read: a Relation Extraction Corpus
What is it: A human-judged dataset of two relations involving public figures on Wikipedia: about 10,000 examples of “place of birth” and 40,000 examples of “attended or graduated from an institution.”
Where can I find it: https://code.google.com/p/relation-extraction-corpus/
I want to know more: Here’s a handy blog post with a broader explanation, descriptions and examples of the data, and plenty of links to learn more.
11 Billion Clues in 800 Million Documents
What is it: We took the ClueWeb corpora and automatically labeled concepts and entities with Freebase concept IDs, an example of entity resolution. This dataset is huge: nearly 800 million web pages.
Where can I find it: We released two corpora: ClueWeb09 FACC and ClueWeb12 FACC.
I want to know more: We described the process and results in a recent blog post.
Features Extracted From YouTube Videos for Multiview Learning
What is it: Multiple feature families from a set of public YouTube videos of games. The videos are labeled with one of 30 categories, and each has an associated set of visual, auditory, and textual features.
Where can I find it: The data and more information can be obtained from the UCI machine learning repository (multiview video dataset), or from Google’s repository.
I want to know more: Read more about the data and uses for it here.
40 Million Entities in Context
What is it: A disambiguation set consisting of pointers to 10 million web pages with 40 million entities that have links to Wikipedia. This is another entity resolution corpus, since the links can be used to disambiguate the mentions, but unlike the ClueWeb example above, the links are inserted by the web page authors and can therefore be considered human annotation.
Where can I find it: Here’s the WikiLinks corpus, and tools can be found to help use this data on our partner’s page: Umass Wiki-links.
I want to know more: Other disambiguation sets, data formats, ideas for uses of this data, and more can be found at our blog post announcing the release.
Distributing the Edit History of Wikipedia Infoboxes
What is it: The edit history of 1.8 million infoboxes in Wikipedia pages in one handy resource. Attributes on Wikipedia change over time, and some of them change more than others. Understanding attribute change is important for extracting accurate and useful information from Wikipedia.
Where can I find it: Download from Google or from Wikimedia Deutschland.
I want to know more: We posted a detailed look at the data, the process for gathering it, and where to find it. You can also read a paper we published on the release.
Note the change in the capital of Palau.
Syntactic Ngrams over Time
What is it: We automatically syntactically analyzed 350 billion words from the 3.5 million English language books in Google Books, and collated and released a set of fragments -- billions of unique tree fragments with counts sorted into types. The underlying corpus is the same one that underlies the recently updated Google Ngram Viewer.
Where can I find it: http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
I want to know more: We discussed the nature of dependency parses and describe the data and release in a blog post. We also published a paper about the release.
Dictionaries for linking Text, Entities, and Ideas
What is it: We created a large database pairing 175 million strings with 7.5 million concepts, annotated with counts, which were mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor text spans that link to the concepts in question.
Where can I find it: http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2
I want to know more: A description of the data, several examples, and ideas for uses for it can be found in a blog post or in the associated paper.
Other datasets
Not every release had its own blog post describing it. Here are some other releases:
Automatic Freebase annotations of Trec’s Million Query and Web track queries.
A set of Freebase triples that have been deleted from Freebase over time -- 63 million of them.
Enhancing Linguistic Search with the Google Books Ngram Viewer
Thursday, October 17, 2013
Posted by Slav Petrov and Dipanjan Das, Research Scientists
Our book scanning effort, now in its eighth year, has put tens of millions of books online. Beyond the obvious benefits of being able to discover books and search through them, the project lets us take a step back and learn what the entire collection tells us about culture and language.
Launched in 2010 by Jon Orwant and Will Brockman, the Google Books Ngram Viewer lets you search for words and phrases over the centuries, in English, Chinese, Russian, French, German, Italian, Hebrew, and Spanish. It’s become popular for both casual explorations into language usage and serious linguistic research, and this summer we decided to provide some new ways to search with it.
With our interns Jason Mann, Lu Yang, and David Zhang, we’ve added three new features. The first is wildcards: by putting an asterisk as a placeholder in your query, you can retrieve the ten most popular replacements. For instance, what noun most often follows “Queen” in English fiction? The answer is “Elizabeth”:
This graph also reveals that the frequency of mentions of the most popular queens has been decreasing steadily over time. (Language expert Ben Zimmer shows some other interesting examples in his Atlantic article.) Right-clicking collapses all of the series into a sum, allowing you to see the overall change.
Another feature we’ve added is the ability to search for inflections: different grammatical forms of the same word. (Inflections of the verb “eat” include “ate”, “eating”, “eats”, and “eaten”.) Here, we can see that the phrase “changing roles” has recently surged in popularity in English fiction, besting “change roles”, which earlier dethroned “changed roles”:
Curiously, this switching doesn’t happen when we add non-fiction into the mix: “changing roles” is persistently on top, with an odd dip in the late 1980s. As with wildcards, right-clicking collapses and expands the data:
Finally, we’ve implemented the most common feature request from our users: the ability to search for multiple capitalization styles simultaneously. Until now, searching for common capitalizations of “Mother Earth” required using a plus sign to combine ngrams (e.g., “Mother Earth + mother Earth + mother earth”), but now the case-insensitive checkbox makes it easier:
As with our other two features, right-clicking toggles whether the variants are shown.
We hope these features help you discover and share interesting trends in language use!
11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts
Wednesday, July 17, 2013
Posted by Dave Orr, Amar Subramanya, Evgeniy Gabrilovich, and Michael Ringgaard, Google Research
“I assume that by knowing the truth you mean knowing things as they really are.”
- Plato
When you type in a search query -- perhaps Plato -- are you interested in the string of letters you typed? Or the concept or entity represented by that string? But knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval -- you have to know what the string actually refers to. The Knowledge Graph and Freebase are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.
We’ve previously released data to help with disambiguation and recently awarded $1.2M in research grants to work on related problems. Today we’re taking another step: releasing data consisting of nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.
These Freebase Annotations of the ClueWeb Corpora (FACC) consist of ClueWeb09 FACC and ClueWeb12 FACC. 11 billion phrases that refer to concepts and entities in Freebase were automatically labeled with their unique identifiers (Freebase MID’s). For example:
Since the annotation process was automatic, it likely made mistakes. We optimized for precision over recall, so the algorithm skipped a phrase if it wasn’t confident enough of the correct MID. If you prefer higher precision, we include confidence levels, so you can filter out lower confidence annotations that we did include.
Based on review of a sample of documents, we believe the precision is about 80-85%, and recall, which is inherently difficult to measure in situations like this, is in the range of 70-85%. Not every ClueWeb document is included in this corpus; documents in which we found no entities were excluded from the set. A document might be excluded because there were no entities to be found, because the entities in question weren’t in Freebase, or because none of the entities were resolved at a confidence level above the threshold.
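As a sketch of the kind of confidence filtering described above, assuming a tab-separated annotation file with a confidence column and a Freebase MID column (the column layout here is an assumption; consult the corpus documentation for the real schema):

import csv

def high_confidence_annotations(path, min_confidence=0.9,
                                confidence_col=5, mid_col=7):
    """Yield rows whose confidence passes the threshold (column indices assumed)."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if float(row[confidence_col]) >= min_confidence:
                yield row[mid_col], row

# Example usage (hypothetical file name):
# for mid, row in high_confidence_annotations("clueweb09_facc_part_00.tsv"):
#     print(mid)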
The ClueWeb data is used in multiple TREC tracks. You may also be interested in our annotations of several TREC query sets, including those from the Million Query Track and Web Track.
If you would prefer a human-annotated set, you might want to look at the Wikilinks Corpus we released last year. Entities there were disambiguated by links to Wikipedia, inserted by the authors of the page, which is effectively a form of human annotation.
You can find more detail and download the data on the pages for the two sets: ClueWeb09 FACC and ClueWeb12 FACC. You can also subscribe to our data release mailing list to learn about releases as they happen.
Special thanks to Jamie Callan and Juan Caicedo Carvajal for their help throughout the annotation project.
Natural Language Understanding-focused awards announced
Tuesday, July 02, 2013
Posted by Massimiliano Ciaramita, Research Scientist and David Harper, Head University Relations (EMEA)
Some of the biggest challenges for the scientific community today involve understanding the principles and mechanisms that underlie natural language use on the Web. An example of a long-standing problem is language ambiguity; when somebody types the word “Rio” in a query do they mean the city, a movie, a casino, or something else? Understanding the difference can be crucial to help users get the answer they are looking for. In the past few years, a significant effort in industry and academia has focused on disambiguating language with respect to Web-scale knowledge repositories such as Wikipedia and Freebase. These resources are used primarily as canonical, although incomplete, collections of “entities”. As entities are often connected in multiple ways, e.g., explicitly via hyperlinks and implicitly via factual information, such resources can be naturally thought of as (knowledge) graphs. This work has provided the first breakthroughs towards anchoring language in the Web to interpretable, albeit initially shallow, semantic representations. Google has brought the vision of semantic search directly to millions of users via the adoption of the Knowledge Graph. This massive change to search technology has also been called a shift “from strings to things”.
Understanding natural language is at the core of Google's work to help people get the information they need as quickly and easily as possible. At Google we work hard to advance the state of the art in natural language processing, to improve the understanding of fundamental principles, and to solve the algorithmic and engineering challenges to make these technologies part of everyday life. Language is inherently productive; an infinite number of meaningful new expressions can be formed by combining the meaning of their components systematically. The logical next step is the semantic modeling of structured meaningful expressions -- in other words, “what is said” about entities. We envision that knowledge graphs will support the next leap forward in language understanding towards scalable compositional analyses, by providing a universe of entities, facts and relations upon which semantic composition operations can be designed and implemented.
So we’ve just awarded over $1.2 million to support several natural language understanding research awards given to university research groups doing work in this area. Research topics range from semantic parsing to statistical models of life stories and novel compositional inference and representation approaches to modeling relations and events in the Knowledge Graph.
These awards went to researchers in nine universities and institutions worldwide, selected after a rigorous internal review:
Mark Johnson and Lan Du (Macquarie University) and Wray Buntine (NICTA) for “Generative models of Life Stories”
Percy Liang and Christopher Manning (Stanford University) for “Tensor Factorizing Knowledge Graphs”
Sebastian Riedel (University College London) and Andrew McCallum (University of Massachusetts, Amherst) for “Populating a Knowledge Base of Compositional Universal Schema”
Ivan Titov (University of Amsterdam) for “Learning to Reason by Exploiting Grounded Text Collections”
Hans Uszkoreit (Saarland University and DFKI), Feiyu Xu (DFKI and Saarland University) and Roberto Navigli (Sapienza University of Rome) for “Language Understanding cum Knowledge Yield”
Luke Zettlemoyer (University of Washington) for “Weakly Supervised Learning for Semantic Parsing with Knowledge Graphs”
We believe the results will be broadly useful to product development and will further scientific research. We look forward to working with these researchers, and we hope we will jointly push the frontier of natural language understanding research to the next level.
Distributing the Edit History of Wikipedia Infoboxes
Thursday, May 30, 2013
Posted by Enrique Alfonseca, Google Research
Aside from its value as a general-purpose encyclopedia, Wikipedia is also one of the most widely used resources to acquire, either automatically or semi-automatically, knowledge bases of structured data. Much research has been devoted to automatically building disambiguation resources, parallel corpora and structured knowledge from Wikipedia. Still, most of those projects have been based on single snapshots of Wikipedia, extracting the attribute values that were valid at a particular point in time. So about a year ago we compiled and released a data set that allows researchers to see how data attributes can change over time.
Figure 1. Infobox for the Republic of Palau in 2006 and 2013 showing the capital change.
Many attributes vary over time. These include the presidents of countries, the spouses of people, the populations of cities and the number of employees of companies. Every Wikipedia page has an associated history from which the users can view and compare past versions. Having the historical values of Infobox entries available would provide a historical overview of change affecting each entry, to understand which attributes are more likely to change over time or have a regularity in their changes, and which ones attract more user interest and are actually updated in a timely fashion. We believe that such a resource will also be useful in training systems to learn to extract data from documents, as it will allow us to collect more training examples by matching old values of an attribute inside old pages.
For this reason, we released, in collaboration with Wikimedia Deutschland e.V., a resource containing all the edit history of infoboxes in Wikipedia pages. While this was already available indirectly in Wikimedia’s full history dumps, the smaller size of the released dataset will make it easier to download and process this data. The released dataset contains 38,979,871 infobox attribute updates for 1,845,172 different entities, and it is available for download both from Google and from Wikimedia Deutschland’s Toolserver page. A description of the dataset can be found in our paper WHAD: Wikipedia Historical Attributes Data, accepted for publication at the Language Resources and Evaluation journal.
What kind of information can be learned from this data? Some examples from preliminary analyses include the following:
Every country in the world has a population in its Wikipedia attribute, which is updated at least yearly for more than 90% of them. The average error rate with respect to the yearly World Bank estimates is between two and three percent, mostly due to rounding.
50% of deaths are updated into Wikipedia infoboxes within a couple of days... but for scientists it takes 31 days to reach 50% coverage!
For the last episode of TV shows, the airing date is updated for 50% of them within 9 days; for the first episode of TV shows, it takes 106 days.
While infobox attribute updates will be much easier to process as they transition into the Wikidata project, we are not there yet and we believe that the availability of this dataset will facilitate the study of changing attribute values. We are looking forward to the results of those studies.
Thanks to Googler Jean-Yves Delort and Guillermo Garrido and Anselmo Peñas from UNED for putting this dataset together, and to Angelika Mühlbauer and Kai Nissen from Wikipedia Deutschland for their support. Thanks also to Thomas Hofmann and Fernando Pereira for making this data release possible.
Syntactic Ngrams over Time
Thursday, May 23, 2013
Posted by Yoav Goldberg, Professor at Bar Ilan University & Post-doc at Google 2011-2013
We are proud to announce the release of a very large dataset of counted dependency tree fragments from the English Books Corpus. This resource will help researchers, among other things, to model the meaning of English words over time and create better natural-language analysis tools. The resource is based on information derived from a syntactic analysis of the text of millions of English books.
Sentences in languages such as English have structure. This structure is called syntax, and knowing the syntax of a sentence is a step towards understanding its meaning. The process of taking a sentence and transforming it into a syntactic structure is called parsing. At Google, we parse a lot of text every day, in order to better understand it and be able to provide better results and services in many of our products.
There are many kinds of syntactic representations (you may be familiar with sentence diagramming), and at Google we've been focused on a certain type of syntactic representation called "dependency trees". The dependency-tree representation is centered on words and the relations between them. Each word in a sentence can either modify or be modified by other words. The various modifications can be represented as a tree, in which each node is a word.
For example, the sentence "we really like syntax" is analyzed as:
The verb "like" is the main word of the sentence. It is modified by a subject (denoted nsubj) "we", a direct object (denoted dobj) "syntax", and an adverbial modifier "really".
An interesting property of syntax is that, in many cases, one could recover the structure of a sentence without knowing the meaning of most of the words. For example, consider the sentence "the krumpets gnorked the koof with a shlap". We bet you could infer its structure, and tell that a group of somethings called "krumpets" did something called "gnorking" to something called a "koof", and that they did so with a "shlap".
This property by which you could infer the structure of the sentence based on various hints, without knowing the actual meaning of the words, is very useful. For one, it suggests that even a computer could do a reasonable job at such an analysis, and indeed it can! While still not perfect, parsing algorithms these days can analyze sentences with impressive speed and accuracy. For instance, our parser correctly analyzes the made-up sentence above.
Let's try a more difficult example. Something rather long and literary, like the opening sentence of One Hundred Years of Solitude by Gabriel García Márquez, as translated by Gregory Rabassa:
Many years later, as he faced the firing squad, Colonel Aureliano Buendía was to remember that distant afternoon when his father took him to discover ice.
Pretty good for an automatic process, eh?
And it doesn’t end here. Once we know the structure of many sentences, we can use these structures to infer the meaning of words, or at least find words which have a similar meaning to each other.
For example, consider the fragments:
"order a XYZ"
"XYZ is tasty"
"XYZ with ketchup"
"juicy XYZ"
By looking at the words modifying XYZ and their relations to it, you could probably infer that XYZ is a kind of food. And even if you are a robot and don't really know what a "food" is, you could probably tell that the XYZ must be similar to other unknown concepts such as "steak" or "tofu".
But maybe you don't want to infer anything. Maybe you already know what you are looking for, say "tasty food". In order to find such tasty food, one could collect the list of words which are objects of the verb "ate", and are commonly modified by the adjective "tasty" and "juicy". This should provide you a large list of yummy foods.
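A minimal sketch of that kind of query over counted fragments, assuming the fragments have been loaded as (head, relation, modifier, count) tuples; the tuples below are toy examples, not the released data format.

from collections import Counter

# Toy counted fragments: (head word, relation, modifier word, count).
FRAGMENTS = [
    ("ate", "dobj", "steak", 1200),
    ("ate", "dobj", "tofu", 300),
    ("ate", "dobj", "glass", 5),
    ("steak", "amod", "tasty", 800),
    ("steak", "amod", "juicy", 650),
    ("tofu", "amod", "tasty", 90),
]

def tasty_foods(fragments):
    """Words that are objects of 'ate' AND modified by 'tasty' or 'juicy'."""
    eaten = Counter()
    tasty = set()
    for head, rel, mod, count in fragments:
        if head == "ate" and rel == "dobj":
            eaten[mod] += count
        if rel == "amod" and mod in ("tasty", "juicy"):
            tasty.add(head)
    return [(word, n) for word, n in eaten.most_common() if word in tasty]

print(tasty_foods(FRAGMENTS))  # -> [('steak', 1200), ('tofu', 300)]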
Imagine what you could achieve if you had hundreds of millions of such fragments. The possibilities are endless, and we are curious to know what the research community may come up with. So we parsed a lot of text (over 3.5 million English books, or roughly 350 billion words), extracted such tree fragments, counted how many times each fragment appeared, and put the counts online for everyone to download and play with.
350 billion words is a lot of text, and the resulting dataset of fragments is very, very large. The resulting datasets, each representing a particular type of tree fragments, contain billions of unique items, and each dataset’s compressed files takes tens of gigabytes. Some coding and data analysis skills will be required to process it, but we hope that with this data amazing research will be possible, by experts and non-experts alike.
The dataset is based on the English Books corpus, the same dataset behind the ngram-viewer. This time there is no easy-to-use GUI, but we still retain the time information, so for each syntactic fragment, you know not only how many times it appeared overall, but also how many times it appeared in each year -- so you could, for example, look at the subjects of the word “drank” at each decade from 1900 to 2000 and learn how drinking habits changed over time (much more ‘beer’ and ‘coffee’, somewhat less ‘wine’ and ‘glass’ (probably ‘of wine’); there’s also a drop in ‘whisky’ and an increase in ‘alcohol’; brandy catches on around the 1930s and starts dropping around the 1980s; there is an increase in ‘juice’, and, thankfully, some decrease in ‘poison’).
The dataset is described in detail in this scientific paper, and is available for download here.
50,000 Lessons on How to Read: a Relation Extraction Corpus
Thursday, April 11, 2013
Posted by Dave Orr, Product Manager, Google Research
One of the most difficult tasks in NLP is called relation extraction. It’s an example of information extraction, one of the goals of natural language understanding. A relation is a semantic connection between (at least) two entities. For instance, you could say that Jim Henson was in a spouse relation with Jane Henson (and in a creator relation with many beloved characters and shows).
The goal of relation extraction is to learn relations from unstructured natural language text. The relations can be used to answer questions (“Who created Kermit?”), learn which proteins interact in the biomedical literature, or to build a database of hundreds of millions of entities and billions of relations to try and help people explore the world’s information.
To help researchers investigate relation extraction, we’re releasing a human-judged dataset of two relations about public figures on Wikipedia: nearly 10,000 examples of “place of birth”, and over 40,000 examples of “attended or graduated from an institution”. Each of these was judged by at least 5 raters, and can be used to train or evaluate relation extraction systems. We also plan to release more relations of new types in the coming months.
(Update: you can find additional relations here.)
Each relation is in the form of a triple: the relation in question, called a predicate; the subject of the relation; and the object of the relation. In the relation “Stephen Hawking graduated from Oxford,” Stephen Hawking is the subject, graduated from is the relation, and Oxford University is the object. Subjects and objects are represented by their Freebase MID’s, and the relation is defined as a Freebase property. So in this case, the triple would be represented as:
"pred": "/education/education/institution"
"sub": "/m/01tdnyh"
"obj": "/m/07tgn"
Just having the triples is interesting enough if you want a database of entities and relations, but doesn’t make much progress towards training or evaluating a relation extraction system. So we’ve also included the evidence for the relation, in the form of a URL and an excerpt from the web page that our raters judged. We’re also including examples where the evidence does not support the relation, so you have negative examples for use in training better extraction systems. Finally, we included IDs and actual judgments of individual raters, so that you can filter triples by agreement.
Gory Details
The corpus itself, extracted from Wikipedia, can be found here:
https://code.google.com/p/relation-extraction-corpus/
The files are in JSON format. Each line is a triple with the following fields:
pred: predicate of a triple
sub: subject of a triple
obj: object of a triple
evidences: an array of evidences for this triple
url: the web page from which this evidence was obtained
snippet: short piece of text supporting the triple
judgments: an array of judgments from human annotators
rater: hash code of the identity of the annotator
judgment: judgment of the annotator. It can take the values "yes" or "no"
Here’s an example:
{"pred":"/people/person/place_of_birth","sub":"/m/026_tl9","obj":"/m/02_286","evidences":[{"url":"http://en.wikipedia.org/wiki/Morris_S._Miller","snippet":"Morris Smith Miller (July 31, 1779 -- November 16, 1824) was a United States Representative from New York. Born in New York City, he graduated from Union College in Schenectady in 1798. He studied law and was admitted to the bar. Miller served as private secretary to Governor Jay, and subsequently, in 1806, commenced the practice of his profession in Utica. He was president of the village of Utica in 1808 and judge of the court of common pleas of Oneida County from 1810 until his death."}],"judgments":[{"rater":"11595942516201422884","judgment":"yes"},{"rater":"16169597761094238409","judgment":"yes"},{"rater":"1014448455121957356","judgment":"yes"},{"rater":"16651790297630307764","judgment":"yes"},{"rater":"1855142007844680025","judgment":"yes"}]}
The web is chock full of information, put there to be read and learned from. Our hope is that this corpus is a small step towards computational understanding of the wealth of relations to be found everywhere you look.
This dataset is licensed by Google Inc. under the
Creative Commons Attribution-Sharealike 3.0
license.
Thanks to Shaohua Sun, Ni Lao, and Rahul Gupta for putting this dataset together.
Thanks also to Michael Ringgaard, Fernando Pereira, Amar Subramanya, Evgeniy Gabrilovich, and John Giannandrea for making this data release possible.
Learning from Big Data: 40 Million Entities in Context
Friday, March 08, 2013
Posted by Dave Orr, Amar Subramanya, and Fernando Pereira, Google Research
When someone mentions Mercury, are they talking about the
planet
, the
god
, the
car
, the
element
,
Freddie
, or one of some
89 other possibilities
? This problem is called
disambiguation
(a word that is itself
ambiguous
), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a
fruit
with a
giant tech company
?), computers need help.
To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages -- over 100 times bigger than the next largest corpus (about 100,000 documents, see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (
an idea we’ve discussed before
), then the anchor text can be thought of as a mention of the corresponding entity.
Dataset                            Number of Mentions    Number of Entities
Bentivogli et al. (data) (2008)    43,704                709
Day et al. (2008)                  less than 55,000      3,660
Artiles et al. (data) (2010)       57,357                300
Wikilinks Corpus                   40,323,863            2,933,659
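As a rough illustration of the anchor-text heuristic described above, a check along these lines could decide whether a link looks like an entity mention; the normalization and the exact notion of “closely matches” are our assumptions for illustration, not the corpus’s actual construction code.

import re

def looks_like_entity_mention(anchor_text, wikipedia_title):
    # Rough stand-in for "anchor text closely matches the target page title":
    # replace underscores, drop a trailing parenthetical, case-fold, compare.
    # A real pipeline would likely be more forgiving than exact equality.
    def norm(s):
        s = s.replace("_", " ")
        s = re.sub(r"\s*\(.*?\)\s*$", "", s)
        return " ".join(s.lower().split())
    return norm(anchor_text) == norm(wikipedia_title)

# e.g. a link whose anchor "Buick Riviera" points at .../wiki/Buick_Riviera
print(looks_like_entity_mention("Buick Riviera", "Buick_Riviera"))  # True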
What might you do with this data? Well, we’ve already written one
ACL paper on cross-document co-reference
(and received lots of requests for the underlying data, which partly motivates this release). And really, we look forward to seeing what you are going to do with it! But here are a few ideas:
Look into
coreference
-- when different mentions refer to the same entity -- or
entity resolution
-- matching a mention to the underlying entity
Work on the bigger problem of
cross-document coreference
, which is how to find out if different web pages are talking about the same person or other entity
Learn things about entities by aggregating information across all the documents they’re mentioned in
Type tagging
tries to assign types (they could be broad, like person, location, or specific, like amusement park ride) to entities. To the extent that the Wikipedia pages contain the type information you’re interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia.
Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn’t possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.
Gory Details
How do you actually get the data? It’s right here:
Google’s Wikilinks Corpus
. Tools and data with extra context can be found on our partners’ page:
UMass Wiki-links
. Understanding the corpus, however, is a little bit involved.
For copyright reasons, we cannot distribute actual annotated web pages. Instead, we’re providing an index of URLs, and the tools to create the dataset, or whichever slice of it you care about, yourself. Specifically, we’re providing:
The URLs of all the pages that contain labeled mentions, which are links to English Wikipedia
The anchor text of the link (the mention string), the Wikipedia link target, and the byte offset of the link for every page in the set
The byte offset of the 10 least frequent words on the page, to act as a signature to ensure that the underlying text hasn’t changed -- think of this as a version, or fingerprint, of the page
Software tools (on the
UMass site
) to: download the web pages; extract the mentions, with ways to recover if the byte offsets don’t match; select the text around the mentions as local context; and compute evaluation metrics over predicted entities.
The format looks like this:
URL http://1967mercurycougar.blogspot.com/2009_10_01_archive.html
MENTION Lincoln Continental Mark IV 40110 http://en.wikipedia.org/wiki/Lincoln_Continental_Mark_IV
MENTION 1975 MGB roadster 41481 http://en.wikipedia.org/wiki/MG_MGB
MENTION Buick Riviera 43316 http://en.wikipedia.org/wiki/Buick_Riviera
MENTION Oldsmobile Toronado 43397 http://en.wikipedia.org/wiki/Oldsmobile_Toronado
TOKEN seen 58190
TOKEN crush 63118
TOKEN owners 69290
TOKEN desk 59772
TOKEN relocate 70683
TOKEN promote 35016
TOKEN between 70846
TOKEN re 52821
TOKEN getting 68968
TOKEN felt 41508
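Since only URLs, offsets and fingerprints are distributed, here is a minimal sketch of how one might parse these records and check the rare-word fingerprint against a page you download yourself. It assumes the fields after the record type are tab-separated (the multi-word anchor text suggests they are) and that offsets index into the raw page bytes; this is not the official UMass tooling, which also handles recovery when offsets no longer match.

from collections import namedtuple

Mention = namedtuple("Mention", "anchor offset target")

def parse_wikilinks(path):
    # Parse URL / MENTION / TOKEN records like the ones above into
    # {page_url: {"mentions": [...], "tokens": [(word, offset), ...]}}.
    # Assumes tab-separated fields after the record type.
    pages, current = {}, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if parts[0] == "URL":
                current = {"mentions": [], "tokens": []}
                pages[parts[1]] = current
            elif parts[0] == "MENTION" and current is not None:
                current["mentions"].append(
                    Mention(parts[1], int(parts[2]), parts[3]))
            elif parts[0] == "TOKEN" and current is not None:
                current["tokens"].append((parts[1], int(parts[2])))
    return pages

def fingerprint_matches(page_bytes, tokens):
    # Check that each rare word still sits at its recorded byte offset,
    # i.e. that the page you fetched is the version that was annotated.
    # Treating offsets as raw byte positions is our assumption here.
    return all(
        page_bytes[off:off + len(w.encode("utf-8"))] == w.encode("utf-8")
        for w, off in tokens
    )

If the fingerprint check fails, the page has likely changed since the crawl; the tools on the UMass site include ways to recover and re-align the offsets.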
We’d love to hear what you’re working on, and look forward to what you can do with 40 million mentions across over 10 million web pages!
Thanks to our collaborators at
UMass Amherst
:
Sameer Singh
and
Andrew McCallum
.
Recap of NAACL-12 including two Best Paper awards for Googlers
Thursday, June 14, 2012
Posted by Ryan McDonald, Research Scientist, Google Research
This past week, researchers from across the world descended on Montreal for the
Conference of the North American Chapter of the Association for Computational Linguistics
(NAACL). NAACL, as with other Association for Computational Linguistics meetings (ACL), is a premier meeting for researchers who study natural language processing (NLP). This includes applications such as
machine translation
and
sentiment analysis
, but also low-level language technologies such as the automatic analysis of morphology, syntax, semantics and discourse.
Like many applied fields in computer science, NLP underwent a transformation in the mid ‘90s from a primarily rule- and knowledge-based discipline to one whose methods are predominantly statistical and leverage advances in large data and machine learning. This trend continues at NAACL. Two common themes dealt with a historical deficiency of machine-learned NLP systems -- that they require expensive and difficult-to-obtain annotated data in order to achieve high accuracies. To this end, there were a number of studies on unsupervised and weakly-supervised learning for NLP systems, which aim to learn from large corpora containing little to no linguistic annotations, instead relying only on observed regularities in the data or easily obtainable annotations. This typically led to much talk during the question periods about how reliable it might be to use services such as
Mechanical Turk
to get the detailed annotations needed for difficult language prediction tasks. Multilinguality in statistical systems also appeared to be a common theme as researchers have continued to move their focus from building systems for resource-rich languages (e.g., English) to building systems for the rest of the world’s languages, many of which do not have any annotated resources. Work here ranged from focused studies on single languages to studies aiming to develop techniques for a wide variety of languages, leveraging morphology, parallel data and regularities across closely related languages.
There was also an abundance of papers on text analysis for non-traditional domains. This includes the now standard tracks on sentiment analysis, but combined with this, a new focus on social-media, and in particular NLP for microblogs. There was even a paper on predicting whether a given bill will pass committee in the U.S. Congress based on the text of the bill. The presentation of this paper included
the entire video on how a bill becomes a law
.
There were
two keynote talks
. The first talk by Ed Hovy of the Information Sciences Institute of the University of Southern California was on “A New Semantics: Merging Propositional and Distributional Information.” Prof. Hovy gave his insights into the challenge of bringing together distributional (statistical) lexical semantics and compositional semantics, which has been a need espoused recently by many leaders in the field. The second, by James W. Pennebaker, was called “A, is, I, and, the: How our smallest words reveal the most about who we are.” As a psychologist, Prof. Pennebaker represented the “outsider” keynote that typically draws a lot of interest from the audience, and he did not disappoint. Prof. Pennebaker spoke about how the use of function words can provide interesting social observations. One example was personal pronouns like “we”: increased usage tends to make listeners feel the speaker is colder and more distant, rather than engaged with the audience and accessible. This is partly due to a second and increasingly more common meaning of “we” that is much more like “you,” e.g., when a boss says: “We must increase sales”.
Finally, this year the organizers of NAACL decided to do something new called “NLP Idol.” The idea was to have four senior researchers in the community select a paper from the past that they think will have (or should have) more impact on future directions of NLP research. The idea is to pluck a paper from obscurity and bring it to the limelight. Each researcher presented their case and three judges gave feedback American Idol-style, with
Brian Roark
hosting a la Ryan Seacrest. The winner was "PAM - A Program That Infers Intentions," published in Inside Computer Understanding in 1981 by
Robert Wilensky
, which was selected and presented by
Ray Mooney
. PAM (“Plan Applier Mechanism”) was a system for understanding agents and their plans, and more generally, what is happening in a discourse and why. Some of the questions that PAM could answer were astonishing, which reminded the audience (or me at least) that while statistical methods have brought NLP broader coverage, this often comes at the cost of the specificity and deep knowledge representation that previous closed-world language understanding systems could achieve. This echoed sentiments in Prof. Hovy’s invited talk.
Ever since the early days of Google, Googlers have had a presence at NAACL and other ACL-affiliated events. NAACL this year was no different. Googlers authored three papers at the conference, one of which merited the conference’s Best Full Paper Award and another the IBM Best Student Paper Award:
Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure
-
IBM Best Student Paper Award
Oscar Täckström (Google intern), Ryan McDonald (Googler), Jakob Uszkoreit (Googler)
Vine Pruning for Efficient Multi-Pass Dependency Parsing
-
Best Full Paper Award
Alexander Rush (Google intern) and Slav Petrov (Googler)
Unsupervised Translation Sense Clustering
Mohit Bansal (Google intern), John DeNero (Googler), Dekang Lin (Googler)
Many Googlers were also active participants in the NAACL workshops, June 7 - 8:
Computational Linguistics for Literature
David Elson (Googler), Anna Kazantseva, Rada Mihalcea, Stan Szpakowicz
Automatic Knowledge Base Construction/Workshop on Web-scale Knowledge Extraction
Invited Speaker
- Fernando Pereira, Research Director (Googler)
Workshop on Inducing Linguistic Structure
Accepted Paper
- Capitalization Cues Improve Dependency Grammar Induction
Valentin I. Spitkovsky (Googler), Hiyan Alshawi (Googler) and Daniel Jurafsky
Workshop on Statistical Machine Translation
Program Committee members
- Keith Hall, Shankar Kumar, Zhifei Li, Klaus Macherey, Wolfgang Macherey, Bob Moore, Roy Tromble, Jakob Uszkoreit, Peng Xu, Richard Zens, Hao Zhang (Googlers)
Workshop on the Future of Language Modeling for HLT
Invited Speaker
- Language Modeling at Google, Shankar Kumar (Googler)
Accepted Paper
- Large-scale discriminative language model reranking for voice-search
Preethi Jyothi, Leif Johnson (Googler), Ciprian Chelba (Googler) and Brian Strope (Googler)
First Workshop on Syntactic Analysis of Non-Canonical Language
Invited Speaker
- Keith Hall (Googler)
Shared Task Organizers
- Slav Petrov, Ryan McDonald (Googlers)
Evaluation Metrics and System Comparison for Automatic Summarization
Program Committee member
- Katja Filippova (Googler)