Google Research Blog
The latest news from Research at Google
Slicing and dicing data for interactive visualization
Monday, February 28, 2011
Posted by Benjamin Yolken, Google Public Data Product Manager
A year ago, we introduced the
Google Public Data Explorer
, a tool that allows users to interactively explore public-interest datasets from a variety of influential sources like the World Bank, IMF, Eurostat, and the US Census Bureau. Today, users can visualize over 300 metrics across
31 datasets
, including everything from
labor productivity
(OECD) to
Internet speed
(Ookla) to
gender balance in parliaments
(UNECE) to
government debt levels
(IMF) to
population density by municipality
(Statistics Catalonia), with more data being added every week.
Last week, as part of the launch of our
dataset upload interface
, we released one of the key pieces of technology behind the product: the
Dataset Publishing Language
(DSPL). We created this format to address a key problem in the Public Data Explorer and similar tools: existing data formats don’t provide enough information to support easy yet powerful data exploration by non-technical users.
DSPL addresses this by adding an additional layer of metadata on top of the raw, tabular data in a dataset. This metadata, expressed in XML, describes the
concepts
in the dataset, for instance “country”, “gender”, “population”, and “unemployment”, giving descriptions, URLs, formatting properties, etc. for each. These concepts are then referenced in
slices
, which partition the concepts into
dimensions
(i.e., categories) and
metrics
(i.e., quantitative values) and link them with the underlying data tables (provided in CSV format). This structure, along with some additional metadata, is what allows us to provide rich, interactive dataset visualizations in the Public Data Explorer.
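To make this concrete, here is a hypothetical, heavily abbreviated DSPL fragment sketching the concepts-and-slices structure described above. The element names and the `country`/`population` concepts are our own illustration, not taken from an official sample; consult the DSPL documentation for the authoritative schema.

```xml
<!-- Illustrative sketch of a DSPL dataset definition (abbreviated). -->
<dspl xmlns="http://schemas.google.com/dspl/2010">
  <concepts>
    <!-- A dimension-style concept: a category the data is broken down by. -->
    <concept id="country">
      <info><name><value>Country</value></name></info>
    </concept>
    <!-- A metric-style concept: a quantitative value being measured. -->
    <concept id="population">
      <info><name><value>Population</value></name></info>
    </concept>
  </concepts>
  <slices>
    <!-- A slice partitions concepts into dimensions and metrics and
         links them to an underlying CSV data table. -->
    <slice id="country_population">
      <dimension concept="country"/>
      <metric concept="population"/>
      <table ref="country_population_table"/>
    </slice>
  </slices>
</dspl>
```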
With the release of DSPL, we hope to accelerate the process of making the world’s datasets searchable, visualizable, and understandable, without requiring a PhD in statistics. We encourage you to
read more
about the format and try it yourself, both in the
Public Data Explorer
and in your own software. Stay tuned for more DSPL extensions and applications in the future!
Where does my data live?
Friday, February 25, 2011
Posted by Daniel Ford, Senior Mathematician
Have you ever wondered what happens when you upload a photo to Picasa, or where your Gmail messages or YouTube videos are stored? How is it that you can read or watch them from anywhere, at any time?
If you stored your data on a single hard disk, like the one in your personal computer, then the disk would eventually fail and your data would be lost forever. If you want to protect your data from the possibility of such a failure, you can store copies across many different disks so that if any one fails then you just access the data from another.
However, once storage systems get large enough, anything and everything can and does go wrong. You have to plan not just for disk failures but for server, network, and entire datacenter failures. Add to this software bugs and maintenance operations and you have a whole lot more failures.
Using measurements from dozens of Google data centers, we found that almost-simultaneous failure of many servers in a data center has the greatest impact on availability. On the other hand, disk failures have relatively little impact because our systems are specifically designed to cope with these failures.
Once you have a model of failures, you can also look at the impact of various design choices. Where exactly should you place your data replicas? How fast do you need to recover from losing a disk or server? What encoding scheme, or how many replicas of the data, is enough for a desired level of availability? For example, we found that storing data across multiple data centers reduces data unavailability by many orders of magnitude compared to keeping the same number of replicas in a single data center. The added complexity and potentially slower recovery times are worth it for better availability, lower storage overhead, or even both at once.
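The intuition behind the multi-data-center result can be sketched with a toy probability model. This is not the statistical model from the paper; the failure probabilities below are made up, and the key (simplifying) assumption is that whole-data-center outages are the dominant correlated-failure mode, so replicas inside one data center can all become unavailable at once.

```python
def unavailability(replicas_per_dc, num_dcs, p_disk=1e-3, p_dc=1e-4):
    """Toy model: chance that every replica is unavailable at once.

    Assumes independent disk failures (probability p_disk each) and
    independent whole-data-center outages (probability p_dc each).
    Both figures are illustrative, not measured values.
    """
    # All copies inside one data center are lost if the data center is
    # down, or if every local replica's disk happens to be down.
    per_dc = p_dc + (1 - p_dc) * p_disk ** replicas_per_dc
    # Data is unreachable only if every data center holding it fails.
    return per_dc ** num_dcs

# Three replicas in one data center: dominated by the DC-outage term.
single_dc = unavailability(replicas_per_dc=3, num_dcs=1)
# One replica in each of three data centers: needs three simultaneous
# outages, so unavailability drops by several orders of magnitude.
multi_dc = unavailability(replicas_per_dc=1, num_dcs=3)
print(single_dc, multi_dc)
```

Under these made-up numbers, spreading the same three replicas across data centers improves unavailability by roughly five orders of magnitude, which mirrors the qualitative finding quoted above.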
As you can see, something as simple as storing your photos, mail, or videos becomes a lot more involved when you want to be sure it's always available.
In our paper,
Availability in Globally Distributed Storage Systems
, we characterize the availability of cloud storage systems, based on extensive monitoring of Google's main storage infrastructure, and the sources of failure which affect availability. We also present statistical models for reasoning about the impact of design choices such as data placement, recovery speed, and replication strategies, including replication across multiple data centers.
A Runtime Solution for Online Contention Detection and Response
Friday, February 25, 2011
Posted by Jason Mars, Software Engineering Intern
In our recent paper,
Contention Aware Execution: Online Contention Detection and Response
, we have made a big step forward in addressing an important and pressing problem in the field of Computer Science today. This work appears in the
2010 Proceedings of the International Symposium on Code Generation and Optimization (CGO)
and was awarded the CGO 2010 Best Presentation Award at the conference.
One of the greatest challenges when using multicore processors arises when critical resources, such as the on-chip caches, are shared by multiple executing programs. If these programs simultaneously place heavy demands on shared resources, they may be forced to "take turns," and as a result, unpredictable and abrupt slowdowns may occur. This unexpected "cross-core interference" is especially problematic for the latency-sensitive applications found in Google's datacenters, such as web search. The commonly used solution is to dedicate separate machines to each application; however, this leaves the processing capabilities of multicore processors underutilized.
In our work, we present the Contention Aware Execution Runtime (CAER) environment, a lightweight runtime solution that minimizes cross-core interference while maximizing utilization. CAER leverages the performance monitoring capabilities ubiquitous in current state-of-the-art multicore processors to infer and respond to cross-core interference, and requires no added hardware support. Our experiments show that when using CAER, we are able to increase the utilization of the multicore CPU by 58% on average, while bringing the performance penalty due to co-location from 17% down to just 4% on average.
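The detect-and-respond idea can be sketched as a simple monitoring loop. This is not the actual CAER implementation: `read_llc_misses` is a hypothetical stub standing in for a real hardware performance-counter read (on Linux one would typically go through the `perf_event_open` interface), and the threshold heuristic is our own illustration of the general approach of inferring contention from counter spikes.

```python
import time

def should_throttle(miss_rate, baseline_rate, threshold=2.0):
    # Contention heuristic: a last-level-cache miss rate well above the
    # rate measured when running alone suggests cross-core interference.
    return miss_rate > threshold * baseline_rate

def contention_aware_loop(read_llc_misses, run_batch_quantum, baseline_rate,
                          steps=100, interval_s=0.01, backoff_s=0.05):
    """Sample the miss counter each interval; when contention is inferred,
    briefly pause the low-priority batch task instead of letting it
    compete with the latency-sensitive co-runner."""
    for _ in range(steps):
        before = read_llc_misses()
        time.sleep(interval_s)
        rate = (read_llc_misses() - before) / interval_s
        if should_throttle(rate, baseline_rate):
            time.sleep(backoff_s)   # back off: shared cache is contended
        else:
            run_batch_quantum()     # safe to make progress
```

The design point this illustrates is that detection happens online, with no application changes and no extra hardware: everything the runtime needs is already exposed by the processor's performance counters.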
Congratulations to Ken Thompson
Tuesday, February 22, 2011
Posted by Bill Coughran, Senior Vice President of Engineering
I’m happy to share that
Ken Thompson
has been chosen as the recipient of the prestigious
Japan Prize
. The Japan Prize is bestowed for achievements in science and technology that promote the peace and prosperity of mankind.
Ken was awarded the prize along with Dennis Ritchie for their development of the UNIX operating system in 1969 while at Bell Labs. UNIX changed the direction of computing as a whole and paved the way for the development of personal computers and the server systems that power the Internet.
It’s an enormous source of pride for us to have such amazing talent working here and Ken continues to serve as an inspiration to the rest of us. We’re excited to see what Ken will come up with next.
You can read the full press release
here
.
Query Language Modeling for Voice Search
Thursday, February 17, 2011
Posted by Ciprian Chelba, Research Scientist
About three years ago we set a goal to enable speaking to the Google Search engine on smart-phones. On the language modeling side, the motivation was that we had access to large amounts of typed text data from our users. At the same time, that meant that the users also had a clear expectation for how they would interact with a speech-enabled version of the Google Search application.
The challenge lay in the scale of the problem and the perceived sparsity of the query data. Our paper,
Query Language Modeling for Voice Search
, describes the approach we took, and the empirical findings along the way.
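For readers unfamiliar with language modeling, the core idea is to estimate how likely a word sequence is from counts over a text corpus (here, typed queries). The following toy bigram model with simple interpolation smoothing is only a miniature illustration of the approach; the paper's models are n-gram models trained at vastly larger scale with proper smoothing and distributed infrastructure, and the example queries and weights below are made up.

```python
from collections import Counter

def train_bigram_lm(queries, lambda_=0.7):
    """Train a toy interpolated bigram language model over query strings."""
    unigrams, bigrams = Counter(), Counter()
    for q in queries:
        tokens = ["<s>"] + q.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def prob(word, prev):
        # Interpolate the bigram estimate with the unigram estimate so
        # unseen word pairs still get nonzero probability.
        p_uni = unigrams[word] / total
        p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
        return lambda_ * p_bi + (1 - lambda_) * p_uni

    return prob

prob = train_bigram_lm([
    "pizza near me",
    "weather near me",
    "pizza delivery",
])
# The model prefers continuations seen often in the query log:
# "near me" occurs twice, "near delivery" never.
assert prob("me", "near") > prob("delivery", "near")
```

In voice search, a model like this (at scale) helps the recognizer pick the transcription that looks most like a real query among acoustically similar alternatives.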
Besides data availability, the project succeeded due to our excellent computational platform, the culture built around teams that wholeheartedly tackle such challenges with the conviction that they will set a new bar, and a collaborative mindset that leverages resources across the company. In this case we used training data made available by colleagues working in query spelling correction, query stream sampling procedures devised for search quality evaluation, the
open finite state tools
, and
distributed language modeling infrastructure
built for machine translation.
Perhaps the most satisfying part of this research project was its impact on the end-user: when presenting the poster at SLT 2010 in Berkeley I offered to demo Google Voice Search, and often got the answer “Thanks, I already use it!”.