Google Research Blog
The latest news from Research at Google
The reusable holdout: Preserving validity in adaptive data analysis
Thursday, August 06, 2015
Posted by Moritz Hardt, Research Scientist
Machine learning and statistical analysis play an important role at the forefront of scientific and technological progress. But with all data analysis, there is a danger that findings observed in a particular sample do not generalize to the underlying population from which the data were drawn. A popular XKCD cartoon illustrates that if you test sufficiently many different colors of jelly beans for correlation with acne, you will eventually find one color that correlates with acne at a p-value below the infamous 0.05 significance level.
Image credit: XKCD
Unfortunately, the problem of false discovery is even more delicate than the cartoon suggests. Correcting reported p-values for a fixed number of multiple tests is a fairly well understood topic in statistics. A simple approach is to multiply each p-value by the number of tests, but there are more sophisticated tools. However, almost all existing approaches to ensuring the validity of statistical inferences assume that the analyst performs a fixed procedure chosen before the data are examined. For example, “test all 20 flavors of jelly beans”. In practice, however, the analyst is informed by data exploration, as well as the results of previous analyses. How did the scientist choose to study acne and jelly beans in the first place? Often such choices are influenced by previous interactions with the same data. This adaptive behavior of the analyst leads to an increased risk of spurious discoveries that are neither prevented nor detected by standard approaches. Each adaptive choice the analyst makes multiplies the number of analyses that could possibly follow; it is often difficult or impossible to describe and analyze the exact experimental setup ahead of time.
In The Reusable Holdout: Preserving Validity in Adaptive Data Analysis, a joint work with Cynthia Dwork (Microsoft Research), Vitaly Feldman (IBM Almaden Research Center), Toniann Pitassi (University of Toronto), Omer Reingold (Samsung Research America) and Aaron Roth (University of Pennsylvania), to appear in Science tomorrow, we present a new methodology for navigating the challenges of adaptivity. A central application of our general approach is the reusable holdout mechanism that allows the analyst to safely validate the results of many adaptively chosen analyses without the need to collect costly fresh data each time.
The curse of adaptivity
A beautiful example of how false discovery arises as a result of adaptivity is Freedman’s paradox. Suppose that we want to build a model that explains “systolic blood pressure” in terms of hundreds of variables quantifying the intake of various kinds of food. In order to reduce the number of variables and simplify our task, we first select some promising looking variables, for example, those that have a positive correlation with the response variable (systolic blood pressure). We then fit a linear regression model on the selected variables. To measure the goodness of our model fit, we crank out a standard F-test from our favorite statistics textbook and report the resulting p-value.
Inference after selection: We first select a subset of the variables based on a data-dependent criterion and then fit a linear model on the selected variables.
Freedman showed that the reported p-value is highly misleading: even if the data were completely random, with no correlation whatsoever between the response variable and the data points, we’d likely observe a significant p-value! The bias stems from the fact that we selected a subset of the variables adaptively based on the data, but never accounted for this fact. There is a huge number of possible subsets of variables that we could have selected from. The mere fact that we chose one test over another by peeking at the data creates a selection bias that invalidates the assumptions underlying the F-test.
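To see how strong this effect can be, here is a small simulation in the spirit of Freedman’s setup. It is an illustrative sketch (not code from the paper), assuming the numpy and statsmodels packages; the selection threshold is an arbitrary choice.

```python
# Hedged illustration of Freedman's paradox: the data are pure noise, but
# selecting variables by their sample correlation with the response before
# running a standard F-test yields misleadingly small p-values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, d = 100, 200                      # 100 observations, 200 candidate variables
X = rng.standard_normal((n, d))      # food-intake variables (pure noise)
y = rng.standard_normal(n)           # "systolic blood pressure" (also noise)

# Adaptive step: keep only variables that look positively correlated with y.
corr = (X * (y - y.mean())[:, None]).mean(axis=0)
selected = X[:, corr > 0.1]

# Fit a linear model on the selected variables and report the overall F-test.
fit = sm.OLS(y, sm.add_constant(selected)).fit()
print(f"{selected.shape[1]} variables selected, F-test p-value = {fit.f_pvalue:.4f}")
```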
Freedman’s paradox bears an important lesson. Significance levels of standard procedures do not capture the vast number of analyses one can choose to carry out or to omit. For this reason, adaptivity is one of the primary explanations of why research findings are frequently false, as was argued by Gelman and Loken, who aptly refer to adaptivity as the “garden of the forking paths”.
Machine learning competitions and holdout sets
Adaptivity is not just an issue with p-values in the empirical sciences; it affects other domains of data science as well. Machine learning competitions are a perfect example. Competitions have become an extremely popular format for solving prediction and classification problems of all sorts.
Each team in the competition has full access to a publicly available training set which they use to build a predictive model for a certain task such as image classification. Competitors can repeatedly submit a model and see how the model performs on a fixed holdout data set not available to them. The central component of any competition is the public leaderboard which ranks all teams according to the prediction accuracy of their best model so far on the holdout. Every time a team makes a submission they observe the score of their model on the same holdout data. This methodology is inspired by the classic holdout method for validating the performance of a predictive model.
Ideally, the holdout score gives an accurate estimate of the true performance of the model on the underlying distribution from which the data were drawn. However, this is only the case when the model is independent of the holdout data! In contrast, in a competition the model generally incorporates previously observed feedback from the holdout set. Competitors work adaptively and iteratively with the feedback they receive. An improved score for one submission might convince the team to tweak their current approach, while a lower score might cause them to try out a different strategy. But the moment a team modifies their model based on a previously observed holdout score, they create a dependency between the model and the holdout data that invalidates the assumption of the classic holdout method. As a result, competitors may begin to overfit to the holdout data that supports the leaderboard. This means that their score on the public leaderboard continues to improve, while the true performance of the model does not. In fact, unreliable leaderboards are a widely observed phenomenon in machine learning competitions.
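The effect is easy to reproduce in a toy simulation. The sketch below (an illustration, not code from our paper) submits nothing but random guesses, yet keeping whichever submission scores best on a fixed holdout steadily inflates the leaderboard while true accuracy stays at chance.

```python
# Hedged toy illustration: submissions are pure guessing, yet repeatedly keeping
# whichever submission scores best on a fixed holdout makes the leaderboard
# score drift upward while true accuracy stays at 50%.
import numpy as np

rng = np.random.default_rng(0)
n_holdout, n_submissions = 2000, 500
labels = rng.integers(0, 2, n_holdout)          # holdout labels

best_score = 0.0
for _ in range(n_submissions):
    guess = rng.integers(0, 2, n_holdout)       # a "model" that guesses randomly
    score = (guess == labels).mean()            # feedback from the leaderboard
    best_score = max(best_score, score)         # teams keep their best tweak

print(f"leaderboard score: {best_score:.3f}, true accuracy: 0.500")
```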
Reusable holdout sets
A standard proposal for coping with adaptivity is simply to discourage it. In the empirical sciences, this proposal is known as pre-registration and requires the researcher to specify the exact experimental setup ahead of time. While possible in some simple cases, it is in general too restrictive as it runs counter to today’s complex data analysis workflows.
Rather than limiting the analyst, our approach provides means of reliably verifying the results of an arbitrary adaptive data analysis. The key tool for doing so is what we call the reusable holdout method. As with the classic holdout method discussed above, the analyst is given unfettered access to the training data. What changes is that there is a new algorithm in charge of evaluating statistics on the holdout set. This algorithm ensures that the holdout set maintains the essential guarantees of fresh data over the course of many estimation steps.
The limit of the method is determined by the size of the holdout set: as our theory shows, the number of times that the holdout set may be used grows roughly as the square of the number of collected data points in the holdout.
Armed with the reusable holdout, the analyst is free to explore the training data and verify tentative conclusions on the holdout set. It is now entirely safe to use any information provided by the holdout algorithm in the choice of new analyses to carry out, or the tweaking of existing models and parameters.
A general methodology
The reusable holdout is only one instance of a broader methodology that is, perhaps surprisingly, based on differential privacy, a notion of privacy preservation in data analysis. At its core, differential privacy is a notion of stability requiring that any single sample should not influence the outcome of the analysis significantly.
Example of a stable learning algorithm: Deletion of any single data point does not affect the accuracy of the classifier much.
A beautiful line of work in machine learning shows that various notions of stability imply generalization. That is, any sample estimate computed by a stable algorithm (such as the prediction accuracy of a model on a sample) must be close to what we would observe on fresh data.
What sets differential privacy apart from other stability notions is that it is preserved by adaptive composition. Combining multiple algorithms that each preserve differential privacy yields a new algorithm that also satisfies differential privacy, albeit at some quantitative loss in the stability guarantee. This is true even if the output of one algorithm influences the choice of the next. This strong adaptive composition property is what makes differential privacy an excellent stability notion for adaptive data analysis.
In a nutshell, the reusable holdout mechanism is simply this: access the holdout set only through a suitable differentially private algorithm. It is important to note, however, that the user does not need to understand differential privacy to use our method. The user interface of the reusable holdout is the same as that of the widely used classical method.
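As a rough illustration of what such a holdout algorithm can look like, here is a simplified sketch in the spirit of the mechanism analyzed in the paper (called Thresholdout there). The exact thresholds, noise distributions, and budget accounting in the paper differ, so treat this only as a schematic.

```python
# Simplified sketch (schematic only): answer queries about the holdout set by
# returning the training estimate unless it drifts too far from the holdout
# estimate, in which case return a noised holdout estimate.
import numpy as np

class ReusableHoldout:
    def __init__(self, holdout_data, threshold=0.04, noise_scale=0.01, seed=0):
        self.holdout = holdout_data
        self.threshold = threshold
        self.noise = noise_scale
        self.rng = np.random.default_rng(seed)

    def query(self, statistic, train_data):
        """`statistic` maps a dataset to a number, e.g. a model's accuracy."""
        train_val = statistic(train_data)
        holdout_val = statistic(self.holdout)
        # Perturb the comparison so the analyst cannot learn the holdout exactly.
        gap = abs(train_val - holdout_val) + self.rng.laplace(0, self.noise)
        if gap <= self.threshold:
            return train_val                     # training estimate is trustworthy
        return holdout_val + self.rng.laplace(0, self.noise)  # noisy holdout answer
```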
Reliable benchmarks
A closely related work with Avrim Blum dives deeper into the problem of maintaining a reliable leaderboard in machine learning competitions (see this blog post for more background). While the reusable holdout could directly be used for this purpose, it turns out that a variant of the reusable holdout, which we call the Ladder algorithm, provides even better accuracy.
This method is not just useful for machine learning competitions, since there are many problems that are roughly equivalent to that of maintaining an accurate leaderboard in a competition. Consider, for example, a performance benchmark that a company uses to test improvements to a system internally before deploying them in a production system. As the benchmark data set is used repeatedly and adaptively for tasks such as model selection, hyper-parameter search and testing, there is a danger that eventually the benchmark becomes unreliable.
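The core idea of the Ladder is easy to sketch. The version below is a hedged simplification rather than the published algorithm (which specifies the step size and rounding more carefully): a submission only moves the reported score if it beats the previous best by a meaningful margin.

```python
# Hedged sketch of the Ladder idea: report a new leaderboard score only when a
# submission improves on the best previous score by more than a step size; this
# limits how much information about the holdout leaks back to the competitors.
class Ladder:
    def __init__(self, holdout_labels, step=0.01):
        self.labels = holdout_labels
        self.step = step
        self.best = 0.0                     # best reported score so far

    def submit(self, predictions):
        score = sum(p == y for p, y in zip(predictions, self.labels)) / len(self.labels)
        if score >= self.best + self.step:
            # Round to the step size so only coarse information is released.
            self.best = round(score / self.step) * self.step
        return self.best
```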
Conclusion
Modern data analysis is inherently an adaptive process. Attempts to limit what data scientists will do in practice are ill-fated. Instead we should create tools that respect the usual workflow of data science while at the same time increasing the reliability of data-driven insights. It is our goal to continue exploring techniques that can help to create more reliable validation procedures and benchmarks that track true performance more accurately than existing methods.
Automatically making sense of data
Tuesday, December 02, 2014
Posted by Kevin Murphy, Research Scientist and David Harper, Head of University Relations, EMEA
While the availability and size of data sets across a wide range of sources, from medical to scientific to commercial, continue to grow, there are relatively few people trained in the statistical and machine learning methods required to test hypotheses, make predictions, and otherwise create interpretable knowledge from this data. But what if one could automatically discover human-interpretable trends in data in an unsupervised way, and then summarize these trends in textual and/or visual form?
To help make progress in this area, Professor Zoubin Ghahramani and his group at the University of Cambridge received a Google Focused Research Award in support of The Automatic Statistician project, which aims to build an "artificial intelligence for data science".
So far, the project has mostly been focussing on finding trends in time series data. For example, suppose we measure the levels of solar irradiance over time, as shown in this plot:
This time series clearly exhibits several sources of variation: it is approximately periodic (with a period of about 11 years, known as the Schwabe cycle), but with notably low levels of activity in the late 1600s. It would be useful to automatically discover these kinds of regularities (as well as irregularities), to help further basic scientific understanding, as well as to help make more accurate forecasts in the future.
We can model such data using non-parametric statistical models based on Gaussian processes. Such methods require the specification of a kernel function which characterizes the nature of the underlying function that can accurately model the data (e.g., is it periodic? is it smooth? is it monotonic?). While the parameters of this kernel function are estimated from data, the form of the kernel itself is typically specified by hand, and relies on the knowledge and experience of a trained data scientist.
Prof Ghahramani's group has developed an algorithm that can automatically discover a good kernel, by searching through an open-ended space of sums and products of kernels as well as other compositional operations. After model selection and fitting, the Automatic Statistician translates each kernel into a text description describing the main trends in the data in an easy-to-understand form.
The compositional structure of the space of statistical models neatly maps onto compositionally constructed sentences allowing for the automatic description of the statistical models produced by any kernel. For example, in a product of kernels, one kernel can be mapped to a standard noun phrase (e.g. ‘a periodic function’) and the other kernels to appropriate modifiers of this noun phrase (e.g. ‘whose shape changes smoothly’, ‘with growing amplitude’). The end result is an automatically generated 5-15 page report describing the patterns in the data with figures and tables supporting the main claims. Here is an extract of the report produced by their system for the solar irradiance data:
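As a rough illustration of both steps, the sketch below (not the Automatic Statistician's actual code) hand-builds a composite kernel of the kind the system searches for, using scikit-learn's Gaussian process tools, and prints a toy textual description of its structure. The kernel choices, placeholder data, and wording are illustrative assumptions.

```python
# Illustrative sketch: a hand-built composite kernel of the kind the Automatic
# Statistician searches for automatically, plus a toy textual description.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

# Periodic component (roughly 11-year cycle) times a slowly varying envelope,
# plus observation noise.
kernel = (ExpSineSquared(length_scale=1.0, periodicity=11.0) * RBF(length_scale=50.0)
          + WhiteKernel(noise_level=0.1))

years = np.arange(1700, 2000, dtype=float).reshape(-1, 1)    # placeholder inputs
irradiance = np.sin(2 * np.pi * years.ravel() / 11.0)        # placeholder signal
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(years, irradiance)

# Toy kernel-to-text mapping in the spirit of the report generator.
descriptions = {"ExpSineSquared": "a periodic function",
                "RBF": "whose shape changes smoothly",
                "WhiteKernel": "plus uncorrelated noise"}
print(" ".join(descriptions[k] for k in ["ExpSineSquared", "RBF", "WhiteKernel"]))
```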
Extract of the report for the solar irradiance data, automatically generated by the Automatic Statistician.
The Automatic Statistician is currently being generalized to find patterns in other kinds of data, such as multidimensional regression problems and relational databases. A web-based demo of a simplified version of the system was launched in August 2014. It allowed a user to upload a dataset, and to receive an automatically produced analysis after a few minutes. An expanded version of the service will be launched in early 2015 (we will post details when available). We believe this will have many applications for anyone interested in Data Science.
Excellent Papers for 2011
Thursday, March 22, 2012
Posted by Corinna Cortes and Alfred Spector, Google Research
UPDATE: Added Theo Vassilakis as an author for "Dremel: Interactive Analysis of Web-Scale Datasets".
Googlers across the company actively engage with the scientific community by publishing technical papers, contributing open-source packages, working on standards, introducing new APIs and tools, giving talks and presentations, participating in ongoing technical debates, and much more. Our publications offer technical and algorithmic advances, feature aspects we learn as we develop novel products and services, and shed light on some of the technical challenges we face at Google.
In an effort to highlight some of our work, we periodically select a number of publications to be featured on this blog. We first posted a set of papers on this blog in mid-2010 and subsequently discussed them in more detail in the following blog postings. In a second round, we highlighted new noteworthy papers from the latter half of 2010. This time we honor the influential papers authored or co-authored by Googlers covering all of 2011, roughly 10% of our total publications. It’s tough choosing, so we may have left out some important papers. So, do see the publications list to review the complete group.
In the coming weeks we will be offering a more in-depth look at these publications, but here are some summaries:
Audio processing
“Cascades of two-pole–two-zero asymmetric resonators are good models of peripheral auditory function”, Richard F. Lyon, Journal of the Acoustical Society of America, vol. 130 (2011), pp. 3893-3904.
Lyon's long title summarizes a result that he has been working toward over many years of modeling sound processing in the inner ear. This nonlinear cochlear model is shown to be "good" with respect to psychophysical data on masking, physiological data on mechanical and neural response, and computational efficiency. These properties derive from the close connection between wave propagation and filter cascades. This filter-cascade model of the ear is used as an efficient sound processor for several machine hearing projects at Google.
Electronic Commerce and Algorithms
“Online Vertex-Weighted Bipartite Matching and Single-bid Budgeted Allocations”, Gagan Aggarwal, Gagan Goel, Chinmay Karande, Aranyak Mehta, SODA 2011.
The authors introduce an elegant and powerful algorithmic technique to the area of online ad allocation and matching: a hybrid of random perturbations and greedy choice to make decisions on the fly. Their technique sheds new light on classic matching algorithms, and can be used, for example, to pick one among a set of relevant ads, without knowing in advance the demand for ad slots on future web page views.
“Milgram-routing in social networks”, Silvio Lattanzi, Alessandro Panconesi, D. Sivakumar, Proceedings of the 20th International Conference on World Wide Web, WWW 2011, pp. 725-734.
Milgram’s "six-degrees-of-separation experiment" and the fascinating small world hypothesis that follows from it have generated a lot of interesting research in recent years. In this landmark experiment, Milgram showed that people unknown to each other are often connected by surprisingly short chains of acquaintances. In the paper we prove theoretically and experimentally how a recent model of social networks, "Affiliation Networks", offers an explanation of this phenomenon and inspires interesting techniques for local routing within social networks.
“Non-Price Equilibria in Markets of Discrete Goods”, Avinatan Hassidim, Haim Kaplan, Yishay Mansour, Noam Nisan, EC, 2011.
We present a correspondence between markets of indivisible items and a family of auction-based n-player games. We show that a market has a price-based (Walrasian) equilibrium if and only if the corresponding game has a pure Nash equilibrium. We then turn to markets which do not have a Walrasian equilibrium (which is the interesting case), and study properties of the mixed Nash equilibria of the corresponding games.
HCI
“From Basecamp to Summit: Scaling Field Research Across 9 Locations”, Jens Riegelsberger, Audrey Yang, Konstantin Samoylov, Elizabeth Nunge, Molly Stevens, Patrick Larvie, CHI 2011 Extended Abstracts.
The paper reports on our experience with a basecamp research hub to coordinate logistics and ongoing real-time analysis with research teams in the field. We also reflect on the implications for the meaning of research in a corporate context, where much of the value may lie less in a final report and more in the curated impressions and memories our colleagues take away from the research trip.
“User-Defined Motion Gestures for Mobile Interaction”, Jaime Ruiz, Yang Li, Edward Lank, CHI 2011: ACM Conference on Human Factors in Computing Systems, pp. 197-206.
Modern smartphones contain sophisticated sensors that can detect rich motion gestures, deliberate movements of the device by end-users to invoke commands. However, little is known about best practices in motion gesture design for the mobile computing paradigm. We systematically studied the design space of motion gestures via a guessability study that elicits end-user motion gestures to invoke commands on a smartphone device. The study revealed consensus among our participants on parameters of movement and on mappings of motion gestures onto commands, from which we developed a taxonomy for motion gestures and compiled an end-user inspired motion gesture set. The work lays the foundation of motion gesture design, a new dimension for mobile interaction.
Information Retrieval
“Reputation Systems for Open Collaboration”, B.T. Adler, L. de Alfaro, A. Kulshreshtha, I. Pye, Communications of the ACM, vol. 54 No. 8 (2011), pp. 81-87.
This paper describes content-based reputation algorithms that rely on automated content analysis to derive user and content reputation, and their applications for Wikipedia and Google Maps. The Wikipedia reputation system WikiTrust relies on a chronological analysis of user contributions to articles, metering positive or negative increments of reputation whenever new contributions are made. The Google Maps system Crowdsensus compares the information provided by users on map business listings and computes both a likely reconstruction of the correct listing and a reputation value for each user. Algorithmic-based user incentives ensure the trustworthiness of evaluations of Wikipedia entries and Google Maps business information.
Machine Learning and Data Mining
“Domain adaptation in regression”, Corinna Cortes, Mehryar Mohri, Proceedings of The 22nd International Conference on Algorithmic Learning Theory, ALT 2011.
Domain adaptation is one of the most important and challenging problems in machine learning. This paper presents a series of theoretical guarantees for domain adaptation in regression, gives an adaptation algorithm based on that theory that can be cast as a semi-definite programming problem, derives an efficient solution for that problem by using results from smooth optimization, shows that the solution can scale to relatively large data sets, and reports extensive empirical results demonstrating the benefits of this new adaptation algorithm.
“On the necessity of irrelevant variables”, David P. Helmbold, Philip M. Long, ICML, 2011.
Relevant variables sometimes do much more good than irrelevant variables do harm, so that it is possible to learn a very accurate classifier using predominantly irrelevant variables. We show that this holds given an assumption that formalizes the intuitive idea that the variables are non-redundant. For problems like this it can be advantageous to add many additional variables, even if only a small fraction of them are relevant.
“Online Learning in the Manifold of Low-Rank Matrices”, Gal Chechik, Daphna Weinshall, Uri Shalit, Neural Information Processing Systems (NIPS 23), 2011, pp. 2128-2136.
Learning measures of similarity from examples of similar and dissimilar pairs is a problem that is hard to scale. LORETA uses retractions, an operator from matrix optimization, to learn low-rank similarity matrices efficiently. This makes it possible to learn similarities between objects like images or texts when they are represented using many more features than was possible before.
Machine Translation
“Training a Parser for Machine Translation Reordering”, Jason Katz-Brown, Slav Petrov, Ryan McDonald, Franz Och, David Talbot, Hiroshi Ichikawa, Masakazu Seno, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP '11).
Machine translation systems often need to understand the syntactic structure of a sentence to translate it correctly. Traditionally, syntactic parsers are evaluated as standalone systems against reference data created by linguists. Instead, we show how to train a parser to optimize reordering accuracy in a machine translation system, resulting in measurable improvements in translation quality over a more traditionally trained parser.
“Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation”, Ashish Venugopal, Jakob Uszkoreit, David Talbot, Franz Och, Juri Ganitkevitch, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).
We propose a general method to watermark and probabilistically identify the structured results of machine learning algorithms with an application in statistical machine translation. Our approach does not rely on controlling or even knowing the inputs to the algorithm and provides probabilistic guarantees on the ability to identify collections of results from one’s own algorithm, while being robust to limited editing operations.
“Inducing Sentence Structure from Parallel Corpora for Reordering”, John DeNero, Jakob Uszkoreit, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Automatically discovering the full range of linguistic rules that govern the correct use of language is an appealing goal, but extremely challenging. Our paper describes a targeted method for discovering only those aspects of linguistic syntax necessary to explain how two different languages differ in their word ordering. By focusing on word order, we demonstrate an effective and practical application of unsupervised grammar induction that improves a Japanese to English machine translation system.
Multimedia and Computer Vision
“Kernelized Structural SVM Learning for Supervised Object Segmentation”, Luca Bertelli, Tianli Yu, Diem Vu, Burak Gokturk, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition 2011.
The paper proposes a principled way for computers to learn how to segment the foreground from the background of an image given a set of training examples. The technology is built upon a specially designed nonlinear segmentation kernel under the recently proposed structured SVM learning framework.
“Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths”, Matthias Grundmann, Vivek Kwatra, Irfan Essa, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011).
Casually shot videos captured by handheld or mobile cameras suffer from a significant amount of shake. Existing in-camera stabilization methods dampen high-frequency jitter but do not suppress low-frequency movements and bounces, such as those observed in videos captured by a walking person. On the other hand, most professionally shot videos usually consist of carefully designed camera configurations, using specialized equipment such as tripods or camera dollies, and employ ease-in and ease-out for transitions. Our stabilization technique automatically converts casual shaky footage into more pleasant and professional looking videos by mimicking these cinematographic principles. The original, shaky camera path is divided into a set of segments, each approximated by either constant, linear or parabolic motion, using an algorithm based on robust L1 optimization. The stabilizer has been part of the YouTube Editor (youtube.com/editor) since March 2011.
“The Power of Comparative Reasoning”, Jay Yagnik, Dennis Strelow, David Ross, Ruei-Sung Lin, International Conference on Computer Vision (2011).
The paper describes a theory-derived vector space transform that converts vectors into sparse binary vectors such that Euclidean space operations on the sparse binary vectors imply rank space operations in the original vector space. The transform a) does not need any data-driven supervised/unsupervised learning, b) can be computed from polynomial expansions of the input space in linear time (in the degree of the polynomial), and c) can be implemented in 10 lines of code. We show competitive results on similarity search and sparse coding (for classification) tasks.
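As a hedged illustration only (the paper's actual transform and parameters may differ), a rank-based code in this spirit can be computed with random permutations, as in the sketch below; the window size and number of codes are arbitrary choices here.

```python
# Illustrative rank-order transform (a sketch, not the paper's exact method):
# each output code depends only on which entry of a small random window is
# largest, so it reflects rank comparisons rather than raw magnitudes.
import numpy as np

def rank_codes(x, num_codes=64, window=4, seed=0):
    rng = np.random.default_rng(seed)   # fixed seed => same permutations for every vector
    x = np.asarray(x)
    codes = []
    for _ in range(num_codes):
        perm = rng.permutation(len(x))
        codes.append(int(np.argmax(x[perm[:window]])))
    return np.array(codes)               # can be one-hot encoded into a sparse binary vector

# Vectors with the same ordering of entries receive identical codes.
print(rank_codes([0.1, 0.5, 0.3, 0.9, 0.2]).tolist()[:8])
print(rank_codes([1.0, 5.0, 3.0, 9.0, 2.0]).tolist()[:8])
```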
NLP
“Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections”, Dipanjan Das, Slav Petrov, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL '11), 2011, Best Paper Award.
We would like to have natural language processing systems for all languages, but obtaining labeled data for all languages and tasks is unrealistic and expensive. We present an approach which leverages existing resources in one language (for example English) to induce part-of-speech taggers for languages without any labeled training data. We use graph-based label propagation for cross-lingual knowledge transfer and use the projected labels as features in a hidden Markov model trained with the Expectation Maximization algorithm.
Networks
“TCP Fast Open”, Sivasankar Radhakrishnan, Yuchung Cheng, Jerry Chu, Arvind Jain, Barath Raghavan, Proceedings of the 7th International Conference on emerging Networking EXperiments and Technologies (CoNEXT), 2011.
TCP Fast Open enables data exchange during TCP’s initial handshake. It decreases application network latency by one full round-trip time, a significant speedup for today's short Web transfers. Our experiments on popular websites show that Fast Open reduces whole-page load time by over 10% on average, and in some cases up to 40%.
“Proportional Rate Reduction for TCP”, Nandita Dukkipati, Matt Mathis, Yuchung Cheng, Monia Ghobadi, Proceedings of the 11th ACM SIGCOMM Conference on Internet Measurement 2011, Berlin, Germany, November 2-4, 2011.
Packet losses increase latency of Web transfers and negatively impact user experience. Proportional rate reduction (PRR) is designed to recover from losses quickly, smoothly and accurately by pacing out retransmissions across received ACKs during TCP’s fast recovery. Experiments on Google Web and YouTube servers in U.S. and India demonstrate that PRR reduces the TCP latency of connections experiencing losses by 3-10% depending on response size.
Security and Privacy
“Automated Analysis of Security-Critical JavaScript APIs”, Ankur Taly, Úlfar Erlingsson, John C. Mitchell, Mark S. Miller, Jasvir Nagra, IEEE Symposium on Security & Privacy (SP), 2011.
As software is increasingly written in high-level, type-safe languages, attackers have fewer means to subvert system fundamentals, and attacks are more likely to exploit errors and vulnerabilities in application-level logic. This paper describes a generic, practical defense against such attacks, which can protect critical application resources even when those resources are partially exposed to attackers via software interfaces. In the context of carefully-crafted fragments of JavaScript, the paper applies formal methods and semantics to prove that these defenses can provide complete, non-circumventable mediation of resource access; the paper also shows how an implementation of the techniques can establish the properties of widely-used software, and find previously-unknown bugs.
“App Isolation: Get the Security of Multiple Browsers with Just One”, Eric Y. Chen, Jason Bau, Charles Reis, Adam Barth, Collin Jackson, 18th ACM Conference on Computer and Communications Security, 2011.
We find that anecdotal advice to use a separate web browser for sites like your bank is indeed effective at defeating most cross-origin web attacks. We also prove that a single web browser can provide the same key properties, for sites that fit within the compatibility constraints.
Speech
“Improving the speed of neural networks on CPUs”, Vincent Vanhoucke, Andrew Senior, Mark Z. Mao, Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011.
As deep neural networks become state-of-the-art in real-time machine learning applications such as speech recognition, computational complexity is fast becoming a limiting factor in their adoption. We show how to best leverage modern CPU architectures to significantly speed-up their inference.
“Bayesian Language Model Interpolation for Mobile Speech Input”, Cyril Allauzen, Michael Riley, Interspeech 2011.
Voice recognition on the Android platform must contend with many possible target domains, e.g., search, maps, SMS. For each of these, a domain-specific language model was built by linearly interpolating several n-gram LMs from a common set of Google corpora. The current work has found a way to efficiently compute a single n-gram language model with accuracy very close to the domain-specific LMs but with considerably less complexity at recognition time.
Statistics
“Large-Scale Parallel Statistical Forecasting Computations in R”, Murray Stokely, Farzan Rohani, Eric Tassone, JSM Proceedings, Section on Physical and Engineering Sciences, 2011.
This paper describes the implementation of a framework for utilizing distributed computational infrastructure from within the R interactive statistical computing environment, with applications to time series forecasting. This system is widely used by the statistical analyst community at Google for data analysis on very large data sets.
Structured Data
“Dremel: Interactive Analysis of Web-Scale Datasets”, Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Communications of the ACM, vol. 54 (2011), pp. 114-123.
Dremel is a scalable, interactive ad-hoc query system. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. Besides continued growth internally at Google, Dremel now also backs an increasing number of external services, including BigQuery, and UIs such as the AdExchange front-end.
“Representative Skylines using Threshold-based Preference Distributions”, Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Richard J. Lipton, Jim Xu, International Conference on Data Engineering (ICDE), 2011.
The paper adopts a principled approach towards representative skylines and formalizes the problem of displaying k tuples such that the probability that a random user clicks on one of them is maximized. This requires mathematically modeling (a) the likelihood with which a user is interested in a tuple, as well as (b) how one negotiates the lack of knowledge of an explicit set of users. This work presents theoretical and experimental results showing that the suggested algorithm significantly outperforms previously suggested approaches.
“Hyper-local, directions-based ranking of places”, Petros Venetis, Hector Gonzalez, Alon Y. Halevy, Christian S. Jensen, PVLDB, vol. 4(5) (2011), pp. 290-30.
Click-through information is one of the strongest signals we have for ranking web pages. We propose an equivalent signal for ranking real-world places: the number of times that people ask for precise directions to the address of the place. We show that this signal is competitive in quality with human reviews while being much cheaper to collect; we also show that the signal can be incorporated efficiently into a location search system.
Systems
“Power Management of Online Data-Intensive Services”, David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, Thomas F. Wenisch, Proceedings of the 38th ACM International Symposium on Computer Architecture, 2011.
Compute and data intensive Web services (such as Search) are a notoriously hard target for energy savings techniques. This article characterizes the statistical hardware activity behavior of servers running Web search and discusses the potential opportunities of existing and proposed energy savings techniques.
“The Impact of Memory Subsystem Resource Sharing on Datacenter Applications”, Lingjia Tang, Jason Mars, Neil Vachharajani, Robert Hundt, Mary-Lou Soffa, ISCA, 2011.
In this work, the authors expose key characteristics of an emerging class of Google-style workloads and show how to enhance system software to take advantage of these characteristics to improve efficiency in data centers. The authors find that across datacenter applications, there is both a sizable benefit and a potential degradation from improperly sharing micro-architectural resources on a single machine (such as on-chip caches and bandwidth to memory). The impact of co-locating threads from multiple applications with diverse memory behavior changes the optimal mapping of thread to cores for each application. By employing an adaptive thread-to-core mapper, the authors improved the performance of the datacenter applications by up to 22% over status quo thread-to-core mapping, achieving performance within 3% of optimal.
“Language-Independent Sandboxing of Just-In-Time Compilation and Self-Modifying Code”, Jason Ansel, Petr Marchenko, Úlfar Erlingsson, Elijah Taylor, Brad Chen, Derek Schuff, David Sehr, Cliff L. Biffle, Bennet S. Yee, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2011.
Since its introduction in the early '90s, Software Fault Isolation, or SFI, has been a static code technique, commonly perceived as incompatible with dynamic libraries, runtime code generation, and other dynamic code. This paper describes how to address this limitation and explains how the SFI techniques in Google Native Client were extended to support modern language implementations based on just-in-time code generation and runtime instrumentation. This work is already deployed in Google Chrome, benefitting millions of users, and was developed over a summer collaboration with three Ph.D. interns; it exemplifies how Research at Google is focused on rapidly bringing significant benefits to our users through groundbreaking technology and real-world products.
“Thialfi: A Client Notification Service for Internet-Scale Applications”, Atul Adya, Gregory Cooper, Daniel Myers, Michael Piatek, Proc. 23rd ACM Symposium on Operating Systems Principles (SOSP), 2011, pp. 129-142.
This paper describes a notification service that scales to hundreds of millions of users, provides sub-second latency in the common case, and guarantees delivery even in the presence of a wide variety of failures. The service has been deployed in several popular Google applications including Chrome, Google Plus, and Contacts.
CDC Birth Vital Statistics in BigQuery
Friday, January 13, 2012
Posted by Dan Vanderkam, Software Engineer
Google’s BigQuery Service lets enterprises and developers crunch large-scale data sets quickly. But what if you don’t have a large-scale data set of your own?
To help the data-less masses, BigQuery offers several large, public data sets. One of these is the natality data set, which records information about live births in the United States. The data is derived from the Division of Vital Statistics at the Centers for Disease Control and Prevention, which has collected an electronic record of birth statistics since 1969. It is one of the longest-running electronic records in existence.
Each row in this database represents a live birth. Using simple queries, you can discover fascinating trends from the last forty years.
For example, here’s the average age of women giving birth to their first child:
The average age has increased from 21.3 years in 1969 to 25.1 years in 2008. Using more complex queries, one could analyze the factors which have contributed to this increase, e.g., whether it can be explained by the changing racial/ethnic composition of the population.
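For readers who want to compute this kind of aggregate themselves, here is a hypothetical sketch using the BigQuery Python client; the table path and the ever_born column used to approximate "first child" are assumptions rather than details from this post.

```python
# Hypothetical sketch of a query behind a plot like the one above. The table
# path and column names (year, mother_age, ever_born) are assumptions here.
from google.cloud import bigquery

client = bigquery.Client()  # requires Google Cloud credentials
query = """
    SELECT year, AVG(mother_age) AS avg_age_at_first_birth
    FROM `bigquery-public-data.samples.natality`
    WHERE ever_born = 1              -- approximate "first child" (assumed column)
    GROUP BY year
    ORDER BY year
"""
for row in client.query(query).result():
    print(row.year, round(row.avg_age_at_first_birth, 1))
```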
You can see more examples like this one on the BigQuery site.
Google at the Joint Statistical Meetings in Miami
Monday, August 22, 2011
Posted by Marianna Dizik, Statistician
The Joint Statistical Meetings (JSM) were held in Miami, Florida, this year. Nearly 5,000 participants from academia and industry came to present and discuss the latest in statistical research, methodology, and applications. Similar to previous years, several Googlers shared expertise in large-scale experimental design and implementation, statistical inference with massive datasets and forecasting, data mining, parallel computing, and much more.
Our session "Statistics: The Secret Weapon of Successful Web Giants" attracted over one hundred people, a surprising turnout for an 8:30 AM session! Revolution Analytics reviewed it in their official blog post "How Google uses R to make online advertising more effective".
The following talks were given by Googlers at JSM 2011. Please check the upcoming Proceedings of the JSM 2011 for the full papers.
Statistical Plumbing: Effective use of classical statistical methods for large scale applications
Author(s): Ni Wang, Yong Li, Daryl Pregibon, and Rachel Schutt
Parallel Computations in R, with Applications for Statistical Forecasting
Author(s): Murray Stokely, Farzan Rohani, and Eric Tassone
Conditional Regression Models
Author(s): William D. Heavlin
The Effectiveness of Display Ads
Author(s): Tim Hesterberg, Diane Lambert, David X. Chan, Or Gershony, and Rong Ge
Measuring Ad Effectiveness Using Continuous Geo Experiments
Author(s): Jon Vaver, Deepak Kumar, and Jim Koehler
Post-Stratification and Network Sampling
Author(s): Rachel Schutt, Andrew Gelman, and Tyler McCormick
Google has participated in JSM each year since 2004. We have been increasing our involvement significantly by providing sponsorship, organizing and giving talks at sessions and roundtables, teaching courses and workshops, hosting a booth with demos of new Google products, submitting posters, and more. This year Googlers participated in sessions sponsored by the ASA sections for Statistical Learning and Data Mining, Statistics and Marketing, Statistical Computing, Bayesian Statistical Science, Health Policy Statistics, Statistical Graphics, Quality and Productivity, Physical and Engineering Sciences, and Statistical Education.
We also hosted the Google faculty reception, which was well-attended by faculty and their promising students. Google hires a growing number of statisticians and we were happy to participate in JSM again this year. People had a chance to talk to Googlers, ask about working here, encounter elements of Google culture (good food! T-shirts! 3D puzzles!), meet old and make new friends, and just have fun!
Thanks to everyone who presented, attended, or otherwise engaged with the statistical community at JSM this year. We’re looking forward to seeing you in San Diego next year.