Google Research Blog
The latest news from Research at Google
A Multilingual Corpus of Automatically Extracted Relations from Wikipedia
Tuesday, June 02, 2015
Posted by Shankar Kumar, Google Research Scientist and Manaal Faruqui, Carnegie Mellon University PhD candidate
In Natural Language Processing, relation extraction is the task of assigning a semantic relationship between a pair of arguments. As an example, a relationship between the phrases “Ottawa” and “Canada” is “is the capital of”. These extracted relations could be used in a variety of applications ranging from Question Answering to building databases from unstructured text.
While relation extraction systems work accurately for English and a few other languages, where tools for syntactic analysis such as parsers, part-of-speech taggers and named entity analyzers are readily available, there has been relatively little work on developing such systems for most of the world's languages, where linguistic analysis tools do not yet exist. Fortunately, because we do have translation systems between English and many other languages (such as Google Translate), we can translate text from a non-English language to English, perform relation extraction on the English text, and project the extracted relations back to the foreign language.
Relation extraction in a Spanish sentence using the cross-lingual relation extraction pipeline.
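The projection step of this pipeline can be illustrated with a minimal sketch. It assumes that a translation and a word alignment between the English and source tokens are already available, and that an English extractor has returned a relation as three token spans. The names `project_span` and `project_relation`, the alignment format, and the toy sentence are hypothetical and are not taken from the paper, which may use a different projection method.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def project_span(span: Span, alignment: Dict[int, List[int]]) -> Optional[Span]:
    """Map an English span to the smallest source span covering its aligned tokens."""
    targets = [s for e in range(span[0], span[1] + 1) for s in alignment.get(e, [])]
    if not targets:
        return None  # nothing in this span is aligned to the source sentence
    return (min(targets), max(targets))


def project_relation(
    relation: Tuple[Span, Span, Span],
    alignment: Dict[int, List[int]],
    source_tokens: List[str],
) -> Optional[Tuple[str, ...]]:
    """Project (arg1, relation phrase, arg2) spans onto the source sentence."""
    projected = []
    for span in relation:
        src = project_span(span, alignment)
        if src is None:
            return None  # drop relations that cannot be fully projected
        projected.append(" ".join(source_tokens[src[0]: src[1] + 1]))
    return tuple(projected)


# Toy example (assumed data): a Spanish sentence and its English translation.
source_tokens = ["Ottawa", "es", "la", "capital", "de", "Canadá"]
english_tokens = ["Ottawa", "is", "the", "capital", "of", "Canada"]
# English token index -> aligned source token indices (monotone here by luck).
alignment = {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5]}
# Spans an English extractor might return: arg1, relation phrase, arg2.
english_relation = ((0, 0), (1, 4), (5, 5))

result = project_relation(english_relation, alignment, source_tokens)
print(result)
```

Taking the smallest covering span is only one possible design choice; it keeps the projected phrase contiguous but can pull in unrelated words when the alignment is noisy.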
In Multilingual Open Relation Extraction Using Cross-lingual Projection, which will appear at the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), we use this idea of cross-lingual projection to develop an algorithm that extracts open-domain relation tuples, i.e. tuples in which an arbitrary phrase can describe the relation between the arguments, in multiple languages from Wikipedia. In this work, we also evaluated the quality of the extracted relations using human annotations in French, Hindi and Russian.
Since there is no such publicly available corpus of multilingual relations, we are releasing a dataset of automatically extracted relations from the Wikipedia corpus in 61 languages, along with the manually annotated relations in 3 languages (French, Hindi and Russian). It is our hope that our data will help researchers working on natural language processing and encourage novel applications in a wide variety of languages.
We wish to thank Bruno Cartoni, Vitaly Nikolaev, Hidetoshi Shimokawa, Kishore Papineni, John Giannandrea and their teams for making this data release possible. This dataset is licensed by Google Inc. under the Creative Commons Attribution-ShareAlike 3.0 License.
Google Handwriting Input in 82 languages on your Android mobile device
Wednesday, April 15, 2015
Posted by Thomas Deselaers, Daniel Keysers, Henry Rowley, Li-Lun Wang, Victor Cărbune, Ashok Popat, Dhyanesh Narayanan, Handwriting Team, Google Research
Entering text on mobile devices is still considered inconvenient by many; touchscreen keyboards, although much improved over the years, require a lot of attention to hit the right buttons. Voice input is an option, but there are situations where it is not feasible, such as in a noisy environment or during a meeting. Handwriting offers a natural and intuitive method of text entry that complements typing and speech input. However, until recently there have been many languages where enabling this functionality presented significant challenges.
Today we launched Google Handwriting Input, which lets users handwrite text on their Android mobile device as an additional input method for any Android app. Google Handwriting Input supports 82 languages in 20 distinct scripts, and works with both printed and cursive writing input with or without a stylus. Beyond text input, it also provides a fun way to enter hundreds of emojis by drawing them (simply press and hold the ‘enter’ button to switch modes). Google Handwriting Input works with or without an Internet connection.
By building on large-scale language modeling and robust multi-language OCR, and by incorporating large-scale neural networks and approximate nearest neighbor search for character classification, Google Handwriting Input supports languages that can be challenging to type on a virtual keyboard. For example, keyboards for ideographic languages (such as Chinese) are often based on a particular dialect of the language, so they may be hard to use for someone who does not know that dialect. Additionally, keyboards for complex script languages (like many South Asian languages) are less standardized and may be unfamiliar. Even for languages where virtual keyboards are more widely used (like English or Spanish), some users find that handwriting is more intuitive, faster, and generally more comfortable.
Writing 'Hello' in Chinese, German, and Tamil.
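To give a rough idea of what approximate nearest neighbor search for character classification can look like, here is a minimal sketch using random-hyperplane hashing. This is one common approximate technique, chosen for illustration only; the post does not say which method Google Handwriting Input uses. The class `HyperplaneIndex`, the three-dimensional feature vectors, and the labels are all hypothetical.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class HyperplaneIndex:
    """Buckets vectors by the signs of their projections on random hyperplanes."""

    def __init__(self, dim, num_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
        self.buckets = defaultdict(list)

    def _key(self, vec):
        # One boolean per hyperplane: which side of the plane the vector lies on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes)

    def add(self, vec, label):
        self.buckets[self._key(vec)].append((vec, label))

    def query(self, vec):
        """Return the label of the most similar vector in the query's bucket, if any."""
        candidates = self.buckets.get(self._key(vec), [])
        if not candidates:
            return None  # approximate search may miss neighbors in other buckets
        return max(candidates, key=lambda item: cosine(vec, item[0]))[1]


# Hypothetical feature vectors for two handwritten characters.
index = HyperplaneIndex(dim=3)
index.add([1.0, 0.0, 0.2], "A")
index.add([0.0, 1.0, 0.1], "B")

label = index.query([1.0, 0.0, 0.2])
print(label)
```

Only the vectors in the query's bucket are compared, which is what makes the search approximate: it trades a chance of missing the true nearest neighbor for far fewer comparisons.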
Google Handwriting Input is the result of many years of research at Google. Initially, cloud-based handwriting recognition supported the Translate Apps on Android and iOS, Mobile Search, and Google Input Tools (in Chrome, ChromeOS, Gmail and Docs, translate.google.com, and the Docs symbol picker). However, other products required recognizers to run directly on an Android device without an Internet connection. So we worked to make recognition models smaller and faster for use in Android handwriting input methods for Simplified and Traditional Chinese, Cantonese, and Hindi, as well as multi-language support in Gesture Search. Google Handwriting Input combines these efforts, allowing recognition both on-device and in the cloud (by tapping on the cloud icon) in any Android app.
You can install Google Handwriting Input from the Play Store here. More information and FAQs can be found here.
Making Blockly Universally Accessible
Tuesday, April 01, 2014
Posted by Neil Fraser, Chief Interplanetary Liaison
We work hard to make our products accessible to people everywhere, in every culture. Today we’re expanding our outreach efforts to support a traditionally underserved community -- those who call themselves "tlhIngan."
Google's Blockly programming environment is used in K-12 classrooms around the world to teach programming. But the world is not enough. Students on Qo'noS have had difficulty learning to code because most of the teaching tools aren't available in their native language. Additionally, many existing tools are too fragile for their pedagogical approach. As a result, Klingons have found it challenging to enter computer science. This is reflected in the fact that less than 2% of Google engineers are Klingon.
Today we launch a full translation of Blockly in Klingon. It incorporates Klingon cultural norms to facilitate learning in this unique population:
Blockly has no syntax errors. This reduces frustration, and reduces the number of computers thrown through bulkheads.
Variables are untyped. Type errors can too easily be perceived as a challenge to the honor of a student's family (and we’ve seen where that ends).
Debugging and bug reports have been omitted; our research indicates that in the event of a bug, they prefer the entire program to just blow up.
Get a little keyboard dirt under your fingernails. Learn that although ghargh is delicious, code structure should not resemble it. And above all, be proud that tlhIngan maH. Qapla'!
You can try out the demo here or get involved here.
Google Translate welcomes you to the Indic web
Tuesday, June 21, 2011
Posted by Ashish Venugopal, Research Scientist
(Cross-posted on the Translate Blog and the Official Google Blog)
Beginning today, you can explore the linguistic diversity of the Indian sub-continent with Google Translate, which now supports five new experimental alpha languages: Bengali, Gujarati, Kannada, Tamil and Telugu. In India and Bangladesh alone, more than 500 million people speak these five languages. Since 2009, we’ve launched a total of 11 alpha languages, bringing the current number of languages supported by Google Translate to 63.
Indic languages differ from English in many ways, presenting several exciting challenges when developing their respective translation systems. Indian languages often use the Subject Object Verb (SOV) ordering to form sentences, unlike English, which uses Subject Verb Object (SVO) ordering. This difference in sentence structure makes it harder to produce fluent translations; the more words that need to be reordered, the more chance there is to make mistakes when moving them. Tamil, Telugu and Kannada are also highly agglutinative, meaning a single word often includes affixes that represent additional meaning, like tense or number. Fortunately, our research to improve Japanese (an SOV language) translation helped us with the word order challenge, while our work translating languages like German, Turkish and Russian provided insight into the agglutination problem.
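One rough way to quantify how much reordering an SOV-to-SVO translation needs is to count inversions in the word alignment, i.e. pairs of words whose relative order flips between source and output. The sketch below is an illustration under that assumption, not a description of Google Translate's internals; the English glosses of SOV sentences and the function name `count_inversions` are hypothetical.

```python
from typing import List


def count_inversions(order: List[int]) -> int:
    """Count pairs that appear in opposite order (O(n^2), fine for one sentence)."""
    return sum(
        1
        for i in range(len(order))
        for j in range(i + 1, len(order))
        if order[i] > order[j]
    )


# Each list gives, for every word of the English (SVO) output in order,
# the position of the corresponding word in a hypothetical SOV source gloss.
short = [0, 2, 1]          # gloss "I apple eat" -> "I eat apple"
longer = [0, 4, 1, 2, 3]   # gloss "I the red apple eat" -> "I eat the red apple"

print(count_inversions(short), count_inversions(longer))
```

In this toy measure the verb must cross every word of the object phrase, so longer objects produce more inversions and more opportunities for a reordering mistake.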
You can expect translations for these new alpha languages to be less fluent and include many more untranslated words than some of our more mature languages, like Spanish or Chinese, which have much more of the web content that powers our statistical machine translation approach. Despite these challenges, we release alpha languages when we believe that they help people better access the multilingual web. If you notice incorrect or missing translations for any of our languages, please correct us; we enjoy learning from our mistakes and your feedback helps us graduate new languages from alpha status. If you’re a translator, you’ll also be able to take advantage of our machine translated output when using the Google Translator Toolkit.
Since these languages each have their own unique scripts, we’ve enabled a transliterated input method for those of you without Indian language keyboards. For example, if you type in the word “nandri,” it will generate the Tamil word நன்றி (see what it means). To see all these beautiful scripts in action, you’ll need to install fonts* for each language.
We hope that the launch of these new alpha languages will help you better understand the Indic web and encourage the publication of new content in Indic languages, taking us five alpha steps closer to a web without language barriers.
*Download the fonts for each language: Tamil, Telugu, Bengali, Gujarati and Kannada.