Google Research Blog
The latest news from Research at Google
Distributing the Edit History of Wikipedia Infoboxes
Thursday, May 30, 2013
Posted by Enrique Alfonseca, Google Research
Aside from its value as a general-purpose encyclopedia, Wikipedia is also one of the most widely used resources to acquire, either automatically or semi-automatically, knowledge bases of structured data. Much research has been devoted to automatically building disambiguation resources, parallel corpora and structured knowledge from Wikipedia. Still, most of those projects have been based on single snapshots of Wikipedia, extracting the attribute values that were valid at a particular point in time. So about a year ago we compiled and released a data set that allows researchers to see how data attributes can change over time.
Figure 1. Infobox for the Republic of Palau in 2006 and 2013 showing the capital change.
Many attributes vary over time. These include the presidents of countries, the spouses of people, the populations of cities and the number of employees of companies. Every Wikipedia page has an associated history from which users can view and compare past versions. Having the historical values of Infobox entries available would provide a historical overview of the changes affecting each entry, helping us understand which attributes are more likely to change over time or show regularity in their changes, and which ones attract more user interest and are actually updated in a timely fashion. We believe that such a resource will also be useful in training systems to learn to extract data from documents, as it will allow us to collect more training examples by matching old values of an attribute inside old pages.
For this reason, we released, in collaboration with Wikimedia Deutschland e.V., a resource containing the full edit history of infoboxes in Wikipedia pages. While this information was already available indirectly in Wikimedia's full history dumps, the smaller size of the released dataset makes it easier to download and process. The released dataset contains 38,979,871 infobox attribute updates for 1,845,172 different entities, and it is available for download. A description of the dataset can be found in our paper WHAD: Wikipedia Historical Attributes Data, accepted for publication in the Language Resources and Evaluation journal.
What kind of information can be learned from this data? Some examples from preliminary analyses include the following:
Every country in the world has a population attribute in its Wikipedia infobox, which is updated at least yearly for more than 90% of them. The average error rate with respect to the yearly World Bank estimates is between two and three percent, mostly due to rounding.
50% of deaths are updated into Wikipedia infoboxes within a couple of days... but for scientists it takes 31 days to reach 50% coverage!
For the last episode of TV shows, the airing date is updated for 50% of them within 9 days; for the first episode, it takes 106 days.
While infobox attribute updates will be much easier to process as they transition into the Wikidata project, we are not there yet, and we believe that the availability of this dataset will facilitate the study of changing attribute values. We are looking forward to the results of those studies.
Thanks to Googler Jean-Yves Delort and to Guillermo Garrido and Anselmo Peñas from UNED for putting this dataset together, and to Angelika Mühlbauer and Kai Nissen from Wikimedia Deutschland for their support. Thanks also to Thomas Hofmann and Fernando Pereira for making this data release possible.
Learning from Big Data: 40 Million Entities in Context
Friday, March 08, 2013
Posted by Dave Orr, Amar Subramanya, and Fernando Pereira, Google Research
When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it's necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.
To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages -- over 100 times bigger than the next largest corpus (about 100,000 documents; see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we've discussed before), then the anchor text can be thought of as a mention of the corresponding entity.
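As a rough illustration of that matching heuristic (a sketch under our own assumptions, not the exact pipeline used to build the corpus), one could scan a page's hyperlinks, keep those pointing at English Wikipedia, and compare the normalized anchor text with the title implied by the link target; the function names below are ours:

import re
from urllib.parse import unquote

# Hypothetical sketch: find <a> tags that link to English Wikipedia.
WIKI_LINK = re.compile(
    r'<a\s[^>]*href="https?://en\.wikipedia\.org/wiki/([^"#?]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL)

def title_from_target(path):
    # "Lincoln_Continental_Mark_IV" -> "lincoln continental mark iv"
    return unquote(path).replace('_', ' ').strip().lower()

def candidate_mentions(html):
    """Yield (anchor text, Wikipedia target) pairs whose anchor text
    closely matches the target article's title (here: exact match after
    lowercasing; the real criterion may be looser)."""
    for target, anchor in WIKI_LINK.findall(html):
        anchor_text = ' '.join(re.sub(r'<[^>]+>', ' ', anchor).split())
        if anchor_text.lower() == title_from_target(target):
            yield anchor_text, 'http://en.wikipedia.org/wiki/' + target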
Dataset                          | Number of Mentions | Number of Entities
Bentivogli et al. (2008) (data)  | 43,704             | 709
Day et al. (2008)                | less than 55,000   | 3,660
Artiles et al. (2010) (data)     | 57,357             | 300
Wikilinks Corpus                 | 40,323,863         | 2,933,659
What might you do with this data? Well, we've already written one ACL paper on cross-document co-reference (and received lots of requests for the underlying data, which partly motivates this release). And really, we look forward to seeing what you are going to do with it! But here are a few ideas:
Look into coreference -- when different mentions refer to the same entity -- or entity resolution -- matching a mention to the underlying entity
Work on the bigger problem of cross-document coreference, which is how to find out whether different web pages are talking about the same person or other entity
Learn things about entities by aggregating information across all the documents they're mentioned in
Type tagging tries to assign types (they could be broad, like person or location, or specific, like amusement park ride) to entities. To the extent that the Wikipedia pages contain the type information you're interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia.
Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn't possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.
Gory Details
How do you actually get the data? It's right here: Google's Wikilinks Corpus. Tools and data with extra context can be found on our partners' page: UMass Wiki-links. Understanding the corpus, however, is a little bit involved.
For copyright reasons, we cannot distribute actual annotated web pages. Instead, we're providing an index of URLs, and the tools to create the dataset, or whichever slice of it you care about, yourself. Specifically, we're providing:
The URLs of all the pages that contain labeled mentions, which are links to English Wikipedia
The anchor text of the link (the mention string), the Wikipedia link target, and the byte offset of the link for every page in the set
The byte offset of the 10 least frequent words on the page, to act as a signature to ensure that the underlying text hasn't changed -- think of this as a version, or fingerprint, of the page
Software tools (on the UMass site) to: download the web pages; extract the mentions, with ways to recover if the byte offsets don't match; select the text around the mentions as local context; and compute evaluation metrics over predicted entities.
The format looks like this:
URL http://1967mercurycougar.blogspot.com/2009_10_01_archive.html
MENTION Lincoln Continental Mark IV 40110 http://en.wikipedia.org/wiki/Lincoln_Continental_Mark_IV
MENTION 1975 MGB roadster 41481 http://en.wikipedia.org/wiki/MG_MGB
MENTION Buick Riviera 43316 http://en.wikipedia.org/wiki/Buick_Riviera
MENTION Oldsmobile Toronado 43397 http://en.wikipedia.org/wiki/Oldsmobile_Toronado
TOKEN seen 58190
TOKEN crush 63118
TOKEN owners 69290
TOKEN desk 59772
TOKEN relocate 70683
TOKEN promote 35016
TOKEN between 70846
TOKEN re 52821
TOKEN getting 68968
TOKEN felt 41508
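A minimal reader for records in this shape might look as follows. This is our own sketch, assuming whitespace-separated fields with the byte offset and Wikipedia target always last on a MENTION line; the released files and the UMass tools are the authoritative reference.

from collections import namedtuple

Mention = namedtuple('Mention', 'text offset target')
Token = namedtuple('Token', 'word offset')   # the page-fingerprint words

def read_wikilinks(lines):
    """Group URL / MENTION / TOKEN lines into one record per web page."""
    url, mentions, tokens = None, [], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == 'URL':
            if url is not None:
                yield url, mentions, tokens
            url, mentions, tokens = parts[1], [], []
        elif parts[0] == 'MENTION':
            *words, offset, target = parts[1:]
            mentions.append(Mention(' '.join(words), int(offset), target))
        elif parts[0] == 'TOKEN':
            tokens.append(Token(parts[1], int(parts[2])))
    if url is not None:
        yield url, mentions, tokens

After downloading each page yourself, the TOKEN entries can be checked against the fetched text to make sure the byte offsets still line up, as described above.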
We’d love to hear what you’re working on, and look forward to what you can do with 40 million mentions across over 10 million web pages!
Thanks to our collaborators at UMass Amherst: Sameer Singh and Andrew McCallum.
From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas
Friday, May 18, 2012
Posted by Valentin Spitkovsky and Peter Norvig, Research Team
Yet in each word some concept there must be... — from Goethe's Faust (Part I, Scene III)
Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google's core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.
How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine-grained can be broadened to a desired level of generality using Wikipedia's groupings of articles into hierarchical categories.
The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article's canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept's url. Our database thus includes weights that measure degrees of association. For example, the top two entries for football indicate that it is an ambiguous term, which is almost twice as likely to refer to what we in the US call soccer:
text = football
url                     | count
1. Association football | 44,984
2. American football    | 23,373
⋮
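A small sketch of this words-to-concepts direction, assuming the triples are stored one per line in a tab-separated file (the file layout and the names below are our assumptions; see the README accompanying the data for the actual format):

import csv
from collections import defaultdict

def load_dictionary(path):
    """Read (text, url, count) triples from a tab-separated file."""
    string_to_concepts = defaultdict(dict)
    with open(path, encoding='utf-8') as f:
        for text, url, count in csv.reader(f, delimiter='\t'):
            string_to_concepts[text][url] = int(count)
    return string_to_concepts

def rank_concepts(string_to_concepts, text):
    """Concepts for a string, as (url, probability), most likely first."""
    counts = string_to_concepts.get(text, {})
    total = sum(counts.values())
    return sorted(((url, count / total) for url, count in counts.items()),
                  key=lambda pair: pair[1], reverse=True)

With the counts above, rank_concepts(dictionary, 'football') would put Association football ahead of American football by roughly a factor of two.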
An inverted index can be used to perform reverse look-ups, identifying salient terms for each concept. Some of the highest-scoring strings — including synonyms and translations — for both sports are listed below:
concept: “soccer”
football and Football, Soccer and soccer, Association football, fútbol and Fútbol, footballer, Futbol and futbol, Fußball, futebol, futbolista, サッカー, 축구, footballeur, Fußballspieler, sepak bola, 足球, فوتبال, футболист, כדורגל, piłkarz, voetbalclub, ฟุตบอล, bóng đá, voetbal, Foutbaal, futebolista, لعبة كرة القدم, fotbal
concept: “football”
American football, football and Football, fútbol americano, football américain, アメリカンフットボール, American football rules, futebol americano, فوتبال آمریکایی, 美式足球, football americano, Amerikan futbolu, Le Football Américain, football field, อเมริกันฟุตบอล, פוטבול, كرة القدم الأمريكية, Futbol amerykański, 미식축구, futbolu amerykańskiego, football team, американского футбола, Amerikai futball, sepak bola Amerika, football player, američki fudbal, 反則, كرة القدم الأميركية
Associated counts can easily be turned into percentages. The following table illustrates the concept-to-words dictionary direction — which may be useful for paraphrasing, summarization and topic modeling — for the idea of soft drink, restricted to English (and normalized for punctuation, pluralization and capitalization differences):
url = Soft_drink
text                                     | %
1. soft drink (and soft-drinks)          | 28.6
2. soda (and sodas)                      | 5.5
3. soda pop                              | 0.9
4. fizzy drinks                          | 0.6
5. carbonated beverages (and beverage)   | 0.3
6. non-alcoholic                         | 0.2
7. soft                                  | 0.1
8. pop                                   | 0.1
9. carbonated soft drink (and drinks)    | 0.1
10. aerated water                        | 0.1
11. non-alcoholic drinks (and drink)     | 0.1
12. soft drink controversy               | 0.0
13. citrus-flavored soda                 | 0.0
14. carbonated                           | 0.0
15. soft drink topics                    | 0.0
⋮
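Reversing the same structure gives the concept-to-words direction. Building on the load_dictionary sketch above (again with our own names, and without the punctuation, pluralization and capitalization normalization applied in the table), percentages like those shown for Soft_drink fall out directly:

from collections import defaultdict

def invert(string_to_concepts):
    """Turn the words-to-concepts map into a concept-to-words index."""
    concept_to_strings = defaultdict(dict)
    for text, concepts in string_to_concepts.items():
        for url, count in concepts.items():
            concept_to_strings[url][text] = count
    return concept_to_strings

def top_strings(concept_to_strings, url, k=15):
    """The k most frequent strings for a concept, with percentages."""
    counts = concept_to_strings.get(url, {})
    total = sum(counts.values()) or 1
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(text, 100.0 * count / total) for text, count in ranked]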
The words-to-concepts dictionary direction can disambiguate senses and link entities, which are often highly ambiguous, since people, places and organizations can (nearly) all be named after each other. The next table shows the top concepts meant by the string Stanford, which refers to all three (and other) types:
text = Stanford
url                                   | %    | type
1. Stanford University                | 50.3 | ORGANIZATION
2. Stanford (disambiguation)          | 7.7  | a disambiguation page
3. Stanford, California               | 7.5  | LOCATION
4. Stanford Cardinal football         | 5.7  | ORGANIZATION
5. Stanford Cardinal                  | 4.1  | multiple athletic programs
6. Stanford Cardinal men's basketball | 2.0  | ORGANIZATION
7. Stanford prison experiment         | 2.0  | a famous psychology experiment
8. Stanford, Kentucky                 | 1.7  | LOCATION
9. Stanford, Norfolk                  | 1.0  | LOCATION
10. Bank of the West Classic          | 1.0  | a recurring sporting event
11. Stanford, Illinois                | 0.9  | LOCATION
12. Leland Stanford                   | 0.9  | PERSON
13. Charles Villiers Stanford         | 0.8  | PERSON
14. Stanford, New York                | 0.8  | LOCATION
15. Stanford, Bedfordshire            | 0.8  | LOCATION
⋮
The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data.
We hope that this release will fuel numerous creative applications that haven't been previously thought of!
Produced by Angel X. Chang and Valentin I. Spitkovsky; parts of this work are descended from an earlier collaboration between Eneko Agirre of the University of the Basque Country's Ixa Group and Stanford's NLP Group, including Eric Yeh, presently of SRI International, and our Ph.D. advisors, Christopher D. Manning and Daniel Jurafsky.