Google Research Blog
The latest news from Research at Google
Distributing the Edit History of Wikipedia Infoboxes
Thursday, May 30, 2013
Posted by Enrique Alfonseca, Google Research
Aside from its value as a general-purpose encyclopedia, Wikipedia is also one of the most widely used resources for acquiring, either automatically or semi-automatically, knowledge bases of structured data. Much research has been devoted to automatically building such knowledge bases from Wikipedia. Still, most of those projects have been based on single snapshots of Wikipedia, extracting the attribute values that were valid at a particular point in time. So about a year ago we compiled and released a data set that allows researchers to see how data attributes change over time.
Figure 1. Infobox for the Republic of Palau in 2006 and 2013 showing the capital change.
Many attributes vary over time. These include the presidents of countries, the spouses of people, the populations of cities and the number of employees of companies. Every Wikipedia page has an associated history from which users can view and compare past versions. Having the historical values of infobox entries available provides an overview of the changes affecting each entry: which attributes are more likely to change over time or change with some regularity, and which ones attract more user interest and are actually updated in a timely fashion. We believe that such a resource will also be useful in training systems to learn to extract data from documents, as it will allow us to collect more training examples by matching old values of an attribute inside old pages.
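The matching idea can be sketched in a few lines: pair each old page revision with the attribute value that was valid at that time, and keep the sentences that mention it. Everything below, the revision snapshots, the attribute spans, and the naive sentence splitting, is a hypothetical illustration of the approach, not the actual dataset format:

```python
def matching_examples(revisions, attribute_history):
    """Yield (sentence, value) training pairs.

    revisions: list of (year, page_text) snapshots of a Wikipedia page.
    attribute_history: list of (first_year, last_year, value) spans for
    one infobox attribute, e.g. the capital of Palau.
    """
    for year, text in revisions:
        for first, last, value in attribute_history:
            if first <= year <= last:
                # Naive sentence split; a real system would use a tokenizer.
                for sentence in text.split(". "):
                    if value in sentence:
                        yield sentence.strip(), value

# Hypothetical snapshots of the Palau article and its "capital" attribute.
revisions = [
    (2005, "Palau is an island country. Its capital is Koror."),
    (2012, "Palau is an island country. Its capital is Ngerulmud."),
]
capital_history = [(1994, 2006, "Koror"), (2006, 2013, "Ngerulmud")]

examples = list(matching_examples(revisions, capital_history))
```

Each yielded sentence is a plausible training example for an extractor, because the infobox tells us which value the sentence should support at that point in time.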
For this reason, we released, in collaboration with Wikimedia Deutschland e.V., a resource containing the full edit history of infoboxes in Wikipedia pages. While this information was already available indirectly in Wikimedia’s full history dumps, the smaller size of the released dataset makes it easier to download and process. The released dataset contains 38,979,871 infobox attribute updates for 1,845,172 different entities, and it is available for download. A description of the dataset can be found in our paper WHAD: Wikipedia Historical Attributes Data, accepted for publication in the Language Resources and Evaluation journal.
What kind of information can be learned from this data? Some examples from preliminary analyses include the following:
Every country in the world has a population in its Wikipedia attribute, which is updated at least yearly for more than 90% of them. The average error rate with respect to the yearly World Bank estimates is between two and three percent, mostly due to rounding.
50% of deaths are updated into Wikipedia infoboxes within a couple of days... but for scientists it takes 31 days to reach 50% coverage!
For the last episode of TV shows, the airing date is updated for 50% of them within 9 days; for the first episode of TV shows, it takes 106 days.
While infobox attribute updates will become much easier to process as they transition into the Wikidata project, we are not there yet, and we believe that the availability of this dataset will facilitate the study of changing attribute values. We are looking forward to the results of those studies.
Thanks to Googler Jean-Yves Delort for putting this dataset together, and to Angelika Mühlbauer and Kai Nissen from Wikimedia Deutschland for their support. Thanks also to everyone who made this data release possible.
Open Access for Publications
Wednesday, May 29, 2013
Posted by Alfred Spector, Vice President, Engineering
The Association for Computing Machinery (ACM) recently introduced a new option for publication rights management, wherein researchers can choose to pay for the public to have perpetual open access to the publication. Google applauds this new option, and today we are announcing that we will pay the open access fees for all articles by Google researchers that are published in ACM journals. Other professional societies also offer open access options for some of their publications, and we pay those open access fees as well.
Google has always believed that by improving access to the world’s knowledge, we can help improve everyone’s lives. When it comes to scientific research, we have long argued that open access to publications speeds up research, accelerates innovation, and helps grow the global economy.
Policies like ACM’s continue to demonstrate the sustainability of open access publishing, and they will also provide better access to the papers that we write at Google. We encourage researchers everywhere to pursue open access options when publishing articles, and to continue to make publications available as widely as your rights allow.
Explore more with Mapping with Google
Tuesday, May 28, 2013
Posted by Tina Ornduff, Program Manager
In September 2012 we launched Course Builder, an open source learning platform that lets educators, or anyone with something to teach, create online courses. This was our experimental first step into the world of online education, and since then the features of Course Builder have continued to evolve. Mapping with Google, our latest course, showcases new features of the platform.
From your own backyard all the way to Mount Everest, Google Maps and Google Earth are here to help you explore the world. You can learn to harness the world’s most comprehensive and accurate mapping tools by registering for Mapping with Google. Mapping with Google is a self-paced, online course developed to help you better navigate the world around you by improving your use of the new Google Maps, Maps Engine Lite, and Google Earth. All registrants will receive an invitation to preview the new Google Maps.
Through a combination of video and text lessons, activities, and projects, you’ll learn to do much more than look up directions or find your house from outer space. Tell a story of your favorite locations with rich 3D imagery, or plot sights to see on your upcoming trip and share with your travel buddies. During the course, you’ll have the opportunity to learn from Google experts and collaborate with a worldwide community of participants, via Google+ Hangouts and a course forum.
Mapping with Google will be offered from June 10 to June 24, and you can choose whether to explore the features of Google Maps, Google Earth, or both. In addition, you’ll have the option to complete a project, applying the skills you’ve learned to earn a certificate. Visit the course site to learn more and register today.
The world is a big place; we like to think that you can make it a bit more manageable and adventurous with Google’s mapping tools.
Syntactic Ngrams over Time
Thursday, May 23, 2013
Posted by Yoav Goldberg, Professor at Bar Ilan University & Post-doc at Google 2011-2013
We are proud to announce the release of a very large dataset of counted dependency tree fragments from the English Books Corpus. This resource will help researchers, among other things, to model the meaning of English words over time and create better natural-language analysis tools. The resource is based on information derived from a syntactic analysis of the text of millions of English books.
Sentences in languages such as English have structure. This structure is called syntax, and knowing the syntax of a sentence is a step towards understanding its meaning. The process of taking a sentence and transforming it into a syntactic structure is called parsing. At Google, we parse a lot of text every day, in order to better understand it and be able to provide better results and services in many of our products.
There are many kinds of syntactic representations, and at Google we've been focused on a certain type called "dependency trees". The dependency-tree representation is centered around words and the relations between them: each word in a sentence can either modify or be modified by other words, and the various modifications can be represented as a tree in which each node is a word.
For example, the sentence "we really like syntax" is analyzed as follows:
The verb "like" is the main word of the sentence. It is modified by a subject (denoted nsubj) "we", a direct object (denoted dobj) "syntax", and an adverbial modifier "really".
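In code, such a tree is often stored as a flat list of tokens, each pointing at the index of its head. The sketch below encodes the parse just described; the encoding and label names are illustrative conventions, not our parser's actual output format:

```python
# Each token records its head's index (0 = the artificial root) and the
# label of the dependency relation, encoding the parse of
# "we really like syntax".
sentence = [
    # (index, word, head_index, label)
    (1, "we",     3, "nsubj"),
    (2, "really", 3, "advmod"),
    (3, "like",   0, "ROOT"),
    (4, "syntax", 3, "dobj"),
]

# Map indices back to words so we can name each token's head.
words = {i: w for i, w, _, _ in sentence}
words[0] = "ROOT"

# Flatten the tree into (head, label, dependent) triples.
triples = [(words[h], lab, w) for _i, w, h, lab in sentence]
```

Reading off the triples recovers exactly the description above: "like" has a subject "we", an adverbial modifier "really", and a direct object "syntax".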
An interesting property of syntax is that, in many cases, one could recover the structure of a sentence without knowing the meaning of most of the words. For example, consider the sentence "the krumpets gnorked the koof with a shlap". We bet you could infer its structure, and tell that a group of somethings called "krumpets" did something called "gnorking" to something called a "koof", and that they did so with a "shlap".
This property by which you could infer the structure of the sentence based on various hints, without knowing the actual meaning of the words, is very useful. For one, it suggests that even a computer could do a reasonable job at such an analysis, and indeed it can! While still not perfect, parsing algorithms these days can analyze sentences with impressive speed and accuracy. For instance, our parser correctly analyzes the made-up sentence above.
Let's try a more difficult example, something rather long and literary, like the opening sentence of One Hundred Years of Solitude by Gabriel García Márquez, as translated by Gregory Rabassa:
Many years later, as he faced the firing squad, Colonel Aureliano Buendía was to remember that distant afternoon when his father took him to discover ice.
Our parser gets this one right, too. Pretty good for an automatic process, eh?
And it doesn’t end here. Once we know the structure of many sentences, we can use these structures to infer the meaning of words, or at least find words which have a similar meaning to each other.
For example, consider the fragments:
"order a XYZ"
"XYZ is tasty"
"XYZ with ketchup"
By looking at the words modifying XYZ and their relations to it, you could probably infer that XYZ is a kind of food. And even if you are a robot and don't really know what a "food" is, you could probably tell that the XYZ must be similar to other unknown concepts such as "steak" or "tofu".
But maybe you don't want to infer anything. Maybe you already know what you are looking for, say "tasty food". In order to find such tasty food, one could collect the list of words which are objects of the verb "ate" and are commonly modified by the adjectives "tasty" and "juicy". This should provide you with a large list of yummy foods.
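This context-based notion of similarity can be made concrete with a toy computation: describe each word by counts of the dependency contexts it appears in, and compare words by the cosine of their count vectors. The context counts below are invented for illustration:

```python
import math
from collections import Counter

# Toy dependency-context vectors: each word is described by counts of
# (relation, neighbor) pairs observed in parsed text.
contexts = {
    "steak": Counter({("dobj_of", "ate"): 5, ("amod", "tasty"): 3}),
    "tofu":  Counter({("dobj_of", "ate"): 4, ("amod", "tasty"): 2}),
    "chair": Counter({("dobj_of", "bought"): 3, ("amod", "wooden"): 4}),
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

sim_food = cosine(contexts["steak"], contexts["tofu"])
sim_other = cosine(contexts["steak"], contexts["chair"])
```

Because "steak" and "tofu" share contexts (both are eaten, both are called tasty) while "steak" and "chair" share none, the food pair comes out more similar, even though nothing in the data says what a food is.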
Imagine what you could achieve if you had hundreds of millions of such fragments. The possibilities are endless, and we are curious to know what the research community may come up with. So we parsed a lot of text (over 3.5 million English books, or roughly 350 billion words), extracted such tree fragments, counted how many times each fragment appeared, and put the counts online for everyone to download and play with.
350 billion words is a lot of text, and the resulting dataset of fragments is very, very large. The datasets, each representing a particular type of tree fragment, contain billions of unique items, and each dataset’s compressed files take tens of gigabytes. Some coding and data-analysis skills will be required to process them, but we hope that with this data amazing research will be possible, by experts and non-experts alike.
The dataset is based on the English Books corpus, the same dataset behind the Google Books Ngram Viewer. This time there is no easy-to-use GUI, but we still retain the time information, so for each syntactic fragment you know not only how many times it appeared overall, but also how many times it appeared in each year. You could, for example, look at the subjects of the word “drank” in each decade from 1900 to 2000 and learn how drinking habits changed over time: much more ‘beer’ and ‘coffee’, somewhat less ‘wine’ and ‘glass’ (the latter probably from ‘glass of wine’). There is also a drop in ‘whisky’ and an increase in ‘alcohol’. ‘Brandy’ catches on around the 1930s and starts dropping around the 1980s. There is an increase in ‘juice’, and, thankfully, some decrease in ‘poison’.
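A record in such a collection might be processed along these lines. The line layout sketched here, a head word, the fragment's tokens as word/part-of-speech/label/head-index, a total count, and then tab-separated "year,count" pairs, is an assumption made for illustration; consult the paper for the authoritative file format:

```python
# One hypothetical line of a counted-fragment file.
line = ("drank\twe/PRP/nsubj/2 drank/VBD/ROOT/0 beer/NN/dobj/2"
        "\t120\t1950,20\t1960,40\t1970,60")

def parse_line(line: str):
    """Split one record into its head word, tokens, and counts."""
    fields = line.split("\t")
    head, ngram, total = fields[0], fields[1], int(fields[2])
    # Each token is word/POS/dependency-label/head-index.
    tokens = [tuple(tok.split("/")) for tok in ngram.split(" ")]
    # Remaining fields are "year,count" pairs.
    by_year = {int(y): int(c) for y, c in
               (pair.split(",") for pair in fields[3:])}
    return head, tokens, total, by_year

head, tokens, total, by_year = parse_line(line)
```

Aggregating such records over all fragments whose head is "drank" and whose dependent carries the dobj label is exactly how one would chart drinking habits by decade.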
The dataset is described in detail in our paper, and is available for download.
Launching the Quantum Artificial Intelligence Lab
Thursday, May 16, 2013
Posted by Hartmut Neven, Director of Engineering
We believe quantum computing may help solve some of the most challenging computer science problems, particularly in machine learning. Machine learning is all about building better models of the world to make more accurate predictions. If we want to cure diseases, we need better models of how they develop. If we want to create effective environmental policies, we need better models of what’s happening to our climate. And if we want to build a more useful search engine, we need to better understand spoken questions and what’s on the web so you get the best answer.
So today we’re launching the Quantum Artificial Intelligence Lab. NASA’s Ames Research Center will host the lab, which will house a quantum computer from D-Wave, and the Universities Space Research Association (USRA) will invite researchers from around the world to share time on it. Our goal: to study how quantum computing might advance machine learning.
Machine learning is highly difficult; it’s what mathematicians call an “NP-hard” problem. That’s because building a good model is really a creative act. As an analogy, consider what it takes to architect a house. You’re balancing lots of constraints -- budget, usage requirements, space limitations, etc. -- but still trying to create the most beautiful house you can. A creative architect will find a great solution. Mathematically speaking, the architect is solving an optimization problem, and creativity can be thought of as the ability to come up with a good solution given an objective and constraints.
Classical computers aren’t well suited to these types of creative problems. Solving such problems can be imagined as trying to find the lowest point on a surface covered in hills and valleys. Classical computing might use what’s called “gradient descent”: start at a random spot on the surface, look around for a lower spot to walk down to, and repeat until you can’t walk downhill anymore. But all too often that gets you stuck in a “local minimum” -- a valley that isn’t the very lowest point on the surface.
That’s where quantum computing comes in. It lets you cheat a little, giving you some chance to “tunnel” through a ridge to see if there’s a lower valley hidden beyond it. This gives you a much better shot at finding the true lowest point -- the optimal solution.
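The local-minimum trap is easy to reproduce numerically. The sketch below uses random restarts as a crude classical stand-in for the wider exploration that tunneling provides; the landscape and all parameters are invented for illustration:

```python
import random

# A one-dimensional "landscape" with two valleys: a shallow one near
# x = +1 and a deeper one near x = -1 (the 0.3*x term tilts the surface).
def f(x):
    return (x * x - 1) ** 2 + 0.3 * x

def grad(x):
    return 4 * x * (x * x - 1) + 0.3

def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent: walk downhill until stuck."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Starting on the right-hand slope, descent settles in the shallow valley.
stuck = f(descend(1.5))

# Restarting from many random points explores both basins and finds the
# deeper valley that single-start descent misses.
random.seed(0)
best = min(f(descend(random.uniform(-2, 2))) for _ in range(20))
```

Single-start descent ends up at a value near 0.29 in the shallow valley, while the wider search reaches the negative-valued global minimum near x = -1, which is the gap that a tunneling-style search aims to close without brute-force restarts.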
We’ve already developed some quantum machine learning algorithms. One produces very compact, efficient recognizers -- very useful when you’re short on power, as on a mobile device. Another can handle highly polluted training data, where a high percentage of the examples are mislabeled, as they often are in the real world. And we’ve learned some useful principles: e.g., you get the best results not with pure quantum computing, but by mixing quantum and classical computing.
Can we move these ideas from theory to practice, building real solutions on quantum hardware? Answering this question is what the Quantum Artificial Intelligence Lab is for. We hope it helps researchers construct more efficient and more accurate models for everything from speech recognition, to web search, to protein folding. We actually think quantum machine learning may provide the most creative problem-solving process under the known laws of physics. We’re excited to get started with NASA Ames, D-Wave, the USRA, and scientists from around the world.