Google Research Blog
The latest news from Research at Google
Languages of the World (Wide Web)
Thursday, July 07, 2011
Posted by Daniel Ford and Josh Batson
The web is vast and infinite. Its pages link together in a complex network, containing remarkable structures and patterns. Some of the clearest patterns relate to language.
Most web pages link to other pages on the same web site, and the few off-site links they have are almost always to other pages in the same language. It's as if each language has its own web which is loosely linked to the webs of other languages. However, there is a small but significant number of off-site links between languages. These give tantalizing hints of the world beyond the virtual.
To see the connections between languages, start by taking the several billion most important pages on the web in 2008, including all pages in smaller languages, and look at the off-site links between these pages. The particular choice of pages in our corpus here reflects decisions about what is 'important'. For example, in a language with few pages every page is considered important, while for languages with more pages some selection method is required, such as one based on PageRank.
We can use our corpus to draw a very simple graph of the web, with a node for each language and an edge between two languages if more than one percent of the off-site links in the first language land on pages in the second. To make things a little clearer, we only show the languages which have at least a hundred thousand pages and have a strong link with another language, meaning at least 1% of off-site links go to that language. We also leave out English, which we'll discuss more in a moment. (Figure 1)
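The graph construction described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the figure: the function name `language_graph`, the input dictionaries, the language codes, and the choice to apply the one percent cutoff as "at least" are all assumptions made for the example.

```python
from collections import defaultdict

def language_graph(link_counts, page_counts, min_pages=100_000,
                   threshold=0.01, exclude=("en",)):
    """Return directed edges (src, dst, share) between languages.

    link_counts: {(src_lang, dst_lang): number of off-site links}  (hypothetical input)
    page_counts: {lang: number of pages in the corpus}             (hypothetical input)
    """
    # Total off-site links leaving each language, same-language links included.
    totals = defaultdict(int)
    for (src, _dst), n in link_counts.items():
        totals[src] += n

    edges = []
    for (src, dst), n in link_counts.items():
        if src == dst or src in exclude or dst in exclude:
            continue  # only cross-language edges, with English left out
        if page_counts.get(src, 0) < min_pages or page_counts.get(dst, 0) < min_pages:
            continue  # only show languages with enough pages
        if totals[src] == 0:
            continue
        share = n / totals[src]
        if share >= threshold:  # assumption: "at least" one percent
            edges.append((src, dst, share))
    return sorted(edges)
```

With made-up counts in which 150 of 1,000 off-site links from Ukrainian pages land on Russian pages, the function would return a single edge from `"uk"` to `"ru"` with a share of 0.15.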
Looking at the language web in 2008, we see a surprisingly clear map of Europe and Asia.
The language linkages invite explanations around geopolitics, linguistics, and historical associations.
Figure 1: Language links on the web.
The outlines of the Iberian and Scandinavian Peninsulas are clearly visible, which suggests geographic rather than purely linguistic associations.
Examining links between other languages, it seems that many are explained by people and communities which speak both languages.
The language webs of many former Soviet republics link back to the Russian web, with the strongest link from Ukrainian. While Russia is the major importer of Ukrainian products, the bilingual nature of Ukraine is a more plausible explanation. Most Ukrainians speak both languages, and Russian is even the dominant language in large parts of the country.
The link from Arabic to French speaks to the long connection between France and its former colonies. In many of these countries Arabic and French are now commonly spoken together, and there has been significant emigration from these countries to France.
Another strong link is between the Malay/Malaysian and Indonesian webs. Malaysia and Indonesia share a border, but more importantly the languages are nearly eighty percent cognate, meaning speakers of one can easily understand the other.
What about the size of each language web? Both the number of sites in each language and the number of URLs seen by Google's crawler follow an exponential distribution, although the ordering for each is slightly different (Figure 2). The exact number of pages in each language in 2008 is unknown, since multiple URLs may point to the same page and some pages may not have been seen at all. However, the language of an uncrawled URL can be guessed from the dominant language of its site. In fact, calendar pages and other infinite spaces mean that there really is an unlimited number of pages on the web, though some are more useful than others.
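As a rough illustration of guessing a URL's language from its site, here is a minimal Python sketch. It assumes a "site" is simply the URL's hostname and that the dominant language is the most common language among the site's crawled pages; both are simplifications chosen for the example, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

def dominant_language_by_site(crawled):
    """Map each hostname to the most common language among its crawled pages.

    crawled: iterable of (url, language) pairs for pages whose language is known.
    """
    per_site = {}
    for url, lang in crawled:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip URLs without a hostname
        per_site.setdefault(host, Counter())[lang] += 1
    return {host: counts.most_common(1)[0][0] for host, counts in per_site.items()}

def guess_language(url, site_languages, default=None):
    """Guess the language of an uncrawled URL from its site's dominant language."""
    return site_languages.get(urlsplit(url).hostname, default)
```

A URL on a site with no crawled pages falls back to `default`, since there is nothing to base a guess on.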
Figure 2: The number of sites and seen URLs per language are roughly exponentially distributed.
The largest language on the web, in terms of size and centrality, has always been English, but where is it on our map?
Every language on the web has strong links to English, usually with around twenty percent of off-site links and occasionally over forty-five percent, such as from Tagalog/Filipino, spoken in the Philippines, and Urdu, principally spoken in Pakistan (Figure 3). In both the Philippines and Pakistan, English is one of the two official languages.
Figure 3: Language links to and from English
You might wonder whether off-site links landing on English pages can be explained simply by the number of English pages available to be linked to. The webs of other languages in our corpus typically have sixty to eighty percent of their out-language links to English pages. However, only 38 percent of the pages and 42 percent of the sites in our set are English, yet English attracts 79 percent of all out-language links from other languages.
Chinese and Japanese also seem unusual because there are relatively few links from pages in these languages to pages in English. This is despite the fact that Japanese and Chinese sites are the most popular non-English sites for English sites to link to. However, the number of sites in a language is a strong predictor of its 'introversion', that is, the fraction of its off-site links that point to pages in the same language. Taking this into account shows that the Chinese and Japanese webs are not unusually introverted given their size. In general, language webs with more sites are more introverted, perhaps due to better availability of content. (Figure 4)
Figure 4: Language size vs introversion.
There is a roughly linear relationship between the (log) number of sites in a language and the fraction of off-site links which point to pages in the same language, with a correlation of 0.9 if English is removed. However, only 45 percent of off-site links from English pages are to other English pages, making English the most extroverted web language given its size. Other notable outliers are the Hindi web, which is unusually introverted, and the Tagalog and Malay webs which are unusually extroverted.
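To make the introversion measure and the reported correlation concrete, here is a small Python sketch. The input dictionaries and function names are hypothetical, the base-10 logarithm is an assumption (the text only says "log"), and the Pearson correlation is written out by hand so the example needs nothing beyond the standard library.

```python
import math

def introversion(link_counts, lang):
    """Fraction of off-site links from `lang` pages that land on `lang` pages."""
    total = sum(n for (src, _dst), n in link_counts.items() if src == lang)
    same = link_counts.get((lang, lang), 0)
    return same / total if total else 0.0

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def size_vs_introversion(site_counts, link_counts, exclude=("en",)):
    """Correlate log10(number of sites) with introversion, leaving out `exclude`."""
    langs = [lang for lang in site_counts if lang not in exclude]
    xs = [math.log10(site_counts[lang]) for lang in langs]
    ys = [introversion(link_counts, lang) for lang in langs]
    return pearson(xs, ys)
```

Excluding English by default mirrors the text, which reports the correlation with English removed.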
We can generate another map by connecting languages if the number of links from one to the other is 50 times greater than expected given the number of out-of-language links and the size of the language linked to (Figure 5). This time, the native languages of India show up clearly. Surprising links include those from Hindi to Ukrainian, Kurdish to Swedish, Swahili to Tagalog and Bengali, and Esperanto to Polish.
Figure 5: Unexpected connections, given the size of each language.
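One way to read "50 times greater than expected" is sketched below in Python. The baseline here is an assumption, not necessarily the model behind Figure 5: it supposes that a language's out-of-language links would, absent any special relationship, be spread over the other languages in proportion to their size. The function name and inputs are hypothetical.

```python
def unexpected_links(link_counts, sizes, factor=50.0):
    """Return (src, dst, ratio) where observed links exceed `factor` times expected.

    link_counts: {(src_lang, dst_lang): number of off-site links}  (hypothetical input)
    sizes:       {lang: size of that language's web, e.g. number of sites}
    """
    # Out-of-language links leaving each source language.
    out_links = {}
    for (src, dst), n in link_counts.items():
        if src != dst:
            out_links[src] = out_links.get(src, 0) + n

    total_size = sum(sizes.values())
    edges = []
    for (src, dst), n in link_counts.items():
        if src == dst or src not in sizes or dst not in sizes:
            continue
        other_size = total_size - sizes[src]
        if other_size <= 0:
            continue
        # Assumed baseline: links spread in proportion to the target's size.
        expected = out_links[src] * sizes[dst] / other_size
        if expected > 0 and n / expected > factor:
            edges.append((src, dst, n / expected))
    return sorted(edges)
```

Under this baseline, a small target language that receives even a modest number of links from one source can stand out strongly, which is consistent with the surprising pairs listed above.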
What's happened since 2008? The languages of the web have become more densely connected. There is now significant content in even more languages, and these languages are more closely linked. We hope that tools like Google page translation, voice translation, and other services will accelerate this process and bring more people in the world closer together, whichever languages they speak.
9 July 2011: As has been pointed out in the comments, in both the Philippines and Pakistan, English is one of the two official languages; however, the Philippines was not a British colony.