Google Research Blog
The latest news from Research at Google
Headset “Removal” for Virtual and Mixed Reality
Tuesday, February 21, 2017
Posted by Vivek Kwatra, Research Scientist and Christian Frueh, Avneesh Sud, Software Engineers
Virtual Reality (VR) enables remarkably immersive experiences, offering new ways to view the world and the ability to explore novel environments. However, compared to physical reality, sharing these experiences with others can be difficult, as VR headsets make it challenging to create a complete picture of the people participating in the experience.
Some of this disconnect is alleviated by Mixed Reality (MR), a related medium that shares the virtual context of a VR user in a two-dimensional video format, allowing other viewers to get a feel for the user's virtual experience. Even though MR facilitates sharing, the headset continues to block facial expressions and eye gaze, presenting a significant hurdle to a fully engaging experience and complete view of the person in VR.
Google researchers, in collaboration with partner teams, have been working on solutions to address this problem, in which we reveal the user's face by virtually "removing" the headset, creating a realistic see-through effect.
A VR user captured in front of a green screen is blended with the virtual environment to generate the MR output: traditional MR output has the user's face occluded, while our result reveals the face. Note how the headset is modified with a marker to aid tracking.
Our approach uses a combination of 3D vision, machine learning, and graphics techniques, and is best explained in the context of enhancing Mixed Reality video. It consists of three main components:
Dynamic face model capture
The core idea behind our technique is to use a 3D model of the user's face as a proxy for the hidden face. This proxy is used to synthesize the face in the MR video, thereby creating an impression of the headset being removed. First, we capture a personalized 3D face model for the user with what we call gaze-dependent dynamic appearance. This initial calibration step requires the user to sit in front of a color+depth camera and a monitor, and then track a marker on the monitor with their eyes. We use this one-time calibration procedure (which typically takes less than a minute) to acquire a 3D face model of the user, and learn a database that maps appearance images (or textures) to different eye-gaze directions and blinks. This gaze database (i.e., the face model with textures indexed by eye-gaze) allows us to dynamically change the appearance of the face during synthesis and generate any desired eye-gaze, thus making the synthesized face look natural and alive.
On the left, the user’s face is captured by a camera as she tracks a marker on the monitor with her eyes. On the right, we show the dynamic nature of reconstructed 3D face model: by moving or clicking on the mouse, we are able to simulate both apparent eye gaze and blinking.
Calibration and Alignment
Creating a Mixed Reality video
requires a specialized setup consisting of an external camera, calibrated and time-synced with the headset. The camera captures a video stream of the VR user in front of a green screen and then composites a cutout of the user with the virtual world to create the final MR video. An important step here is to accurately estimate the calibration (the fixed 3D transformation) between the camera and headset coordinate systems. These calibration techniques typically involve significant manual intervention and are done in multiple steps. We simplify the process by adding a physical marker to the front of the headset and tracking it visually in 3D, which allows us to optimize for the calibration parameters automatically from the VR session.
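To make the calibration step concrete, here is a minimal sketch, in Python with NumPy, of the standard Kabsch/Procrustes alignment one could use to recover a fixed rigid transform from corresponding 3D points (for example, marker positions expressed in the external camera frame versus the headset-tracking frame). This illustrates the general technique under invented data, not Google's actual calibration pipeline.

```python
# A minimal sketch (standard Kabsch alignment, not the production pipeline) of
# recovering a fixed rigid transform between two coordinate systems from
# corresponding 3D points gathered during a VR session.
import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Return rotation R and translation t minimizing ||R @ src_i + t - dst_i||."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Synthetic check: recover a known rotation/translation from noisy marker points.
rng = np.random.RandomState(0)
true_R, _ = np.linalg.qr(rng.randn(3, 3))
if np.linalg.det(true_R) < 0:                    # make it a proper rotation
    true_R[:, 0] *= -1
true_t = np.array([0.1, -0.2, 0.5])
pts_headset = rng.randn(50, 3)
pts_camera = pts_headset @ true_R.T + true_t + 1e-3 * rng.randn(50, 3)
R, t = rigid_transform(pts_headset, pts_camera)
print(np.allclose(R, true_R, atol=1e-2), np.allclose(t, true_t, atol=1e-2))
```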
For headset “removal”, we need to align the 3D face model with the visible portion of the face in the camera stream, so that they would blend seamlessly with each other. A reasonable proxy to this alignment is to place the face model just behind the headset. The calibration described above, coupled with VR headset tracking, provides sufficient information to determine this placement, allowing us to modify the camera stream by rendering the virtual face into it.
Compositing and Rendering
Having tackled the alignment, the last step involves producing a suitable rendering of the 3D face model, consistent with the content in the camera stream. We are able to reproduce the true eye-gaze of the user by combining our dynamic gaze database with an HTC Vive headset that has been modified by SMI to incorporate eye-tracking technology. Images from these eye trackers lack sufficient detail to directly reproduce the occluded face region, but are well suited to provide fine-grained gaze information. Using the live gaze data from the tracker, we synthesize a face proxy that accurately represents the user's attention and blinks. At run-time, the gaze database, captured in the preprocessing step, is searched for the most appropriate face image corresponding to the query gaze state, while also respecting aesthetic considerations such as temporal smoothness. Additionally, to account for lighting changes between gaze database acquisition and run-time, we apply color correction and feathering, so that the synthesized face region matches the rest of the face.
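The run-time lookup can be pictured as a nearest-neighbor search with a smoothness penalty. Below is a minimal sketch of selecting a stored face texture that matches the live gaze reading while discouraging abrupt switches between frames; the gaze states, texture ids, and cost weights are invented for illustration and are not the production system.

```python
# A minimal sketch of looking up a pre-captured gaze database: pick the entry
# whose gaze state is closest to the live reading, penalizing jumps away from
# the previously chosen texture so the synthesis stays temporally smooth.
import numpy as np

# Hypothetical database: each entry pairs a gaze state (yaw, pitch, blink)
# with a face texture, represented here by just an integer id.
gaze_states = np.array([
    [-10.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [10.0, 5.0, 0.0],
    [0.0, 0.0, 1.0],   # blink
])
texture_ids = np.array([0, 1, 2, 3])

def select_texture(query_gaze, prev_idx, smoothness=2.0):
    """Pick the entry minimizing gaze distance plus a temporal-jump penalty."""
    gaze_cost = np.linalg.norm(gaze_states - query_gaze, axis=1)
    jump_cost = smoothness * (np.arange(len(gaze_states)) != prev_idx)
    best = int(np.argmin(gaze_cost + jump_cost))
    return best, texture_ids[best]

prev = 1
for query in ([0.5, 0.2, 0.0], [9.0, 4.0, 0.0], [0.0, 0.0, 1.0]):
    prev, tex = select_texture(np.array(query), prev)
    print("selected texture", tex)
```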
Humans are highly sensitive to artifacts on faces, and even small imperfections in the synthesis of the occluded face region can feel unnatural and distracting, a phenomenon known as the "uncanny valley." To mitigate this problem, we do not remove the headset completely; instead, we have chosen a user experience that conveys a "scuba mask" effect by compositing the color-corrected face proxy with a translucent headset. Reminding the viewer of the presence of the headset helps us avoid the uncanny valley, and also makes our algorithm robust to small errors in alignment and color correction.
This modified camera stream, displaying a see-through headset, with the user’s face revealed and their true eye-gaze recreated, is subsequently merged with the virtual environment to create the final MR video.
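A rough way to picture the "scuba mask" compositing is as two alpha blends: the synthesized face proxy is painted into the occluded region, and the headset is then re-drawn at reduced opacity. The sketch below is purely illustrative; the masks, opacity value, and image layout are assumptions rather than the production pipeline.

```python
# A minimal sketch (illustrative only) of the compositing step: blend the
# synthesized face proxy into the camera frame, then overlay a translucent
# headset so the viewer is still reminded that a headset is present.
import numpy as np

def composite(camera, face_proxy, face_mask, headset, headset_mask,
              headset_alpha=0.4):
    """Images are float arrays in [0, 1] of shape (H, W, 3); masks are (H, W, 1)."""
    # 1. Paint the synthesized face proxy into the occluded region.
    out = camera * (1 - face_mask) + face_proxy * face_mask
    # 2. Re-draw the headset at reduced opacity ('scuba mask' effect).
    out = out * (1 - headset_mask * headset_alpha) + headset * headset_mask * headset_alpha
    return out

# Tiny synthetic example: 4x4 frame with a fully masked region.
h = w = 4
camera = np.zeros((h, w, 3))
face = np.ones((h, w, 3))
headset = np.full((h, w, 3), 0.2)
mask = np.ones((h, w, 1))
print(composite(camera, face, mask, headset, mask)[0, 0])  # blended pixel value
```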
Results and Extensions
We have used our headset removal technology to enhance Mixed Reality videos, allowing the medium not only to convey a VR user's interaction with the virtual environment but also to show their face in a natural and convincing fashion. The example below demonstrates our technology applied to an artist using Google Tilt Brush in a virtual environment:
An artist creates 3D art using Google Tilt Brush, shown in Mixed Reality. On the top is the traditional MR result where the face is hidden behind the headset. On the bottom is our result, which reveals the entire face and eyes for a more natural and engaging experience.
While we have shown the potential of our technology, its applications extend beyond Mixed Reality. Headset removal is poised to enhance communication and social interaction in VR itself with diverse applications like VR video conference meetings, multiplayer VR gaming, and exploration with friends and family. Going from an utterly blank headset to being able to see, with photographic realism, the faces of fellow VR users promises to be a significant transition in the VR world, and we are excited to be a part of it.
The CS Capacity Program - New Tools and SIGCSE 2017
Thursday, February 16, 2017
Posted by Chris Stephenson, Head of Computer Science Education Strategy
The CS Capacity program
was launched in March of 2015 to help address a
dramatic increase in undergraduate computer science enrollments
that is creating serious resource and pedagogical challenges for many colleges and universities. Over the last two years, a diverse group of universities have been working to develop successful strategies that support the expansion of high-quality CS programs at the undergraduate level. Their work focuses on innovations in teaching and technologies that support scaling while ensuring the engagement of women and underrepresented students. These innovations could provide assistance to many other institutions that are challenged to provide a high-quality educational experience to an increasing number of introductory-level students.
The cohort of CS Capacity institutions includes George Mason University, Mount Holyoke College, Rutgers University, and the University of California, Berkeley, which are working individually, and Duke University, North Carolina State University, the University of Florida, and the University of North Carolina, which are working together. Each of these institutions brings a unique approach to addressing CS capacity challenges. Two years into the program, we're sharing an update on some of the great projects and ideas to emerge so far.
At George Mason, for example, computer science professor Jeff Offutt and his team have developed an online system to provide self-paced learning for CS1 and CS2 classes that allows learners to move through the learning materials more quickly or slowly depending on their needs. The system, called SPARC, includes course content, practice and assessment exercises (including automated testing), mini-lectures, and daily inspirations. This team has also launched a program to recruit and train undergraduate tutorial assistants to increase learning support. For more information on SPARC, contact Jeff Offutt at email@example.com.
The MaGE Peer Mentor program at Mount Holyoke College is addressing its increasing CS student enrollment by preparing undergraduate peer mentors to provide effective feedback on coding assignments and contribute to an inclusive learning environment. One of the major elements of this program is an online course that helps to recruit and train students to be undergraduate peer mentors. Mount Holyoke has made the entire online course curriculum for the peer mentor program available so that other institutions can incorporate all or part of it to assist with preparing their own student tutors. For more information on the MaGE curriculum, contact Heather Pon-Barry at firstname.lastname@example.org.
MaGE Program Students and Faculty from Mount Holyoke College
At the University of California, Berkeley, the CS Capacity team is focused on providing access to increased and better tutoring. They've instituted a small-group tutoring program that includes weekend mastery learning sessions, increased office hours support, designated discussion sections, project checkpoint deadlines, exam/homework/lab/discussion walkthrough videos, and a new office hours app that tracks student satisfaction with office hours. For more information on Berkeley's interventions, contact Josh Hug at email@example.com.
The CS Capacity team at Rutgers has been exploring the gender gap at multiple levels using a longitudinal study across four required CS classes (paper to be published in the proceedings of the SIGCSE 2017 Technical Symposium). They're investigating several factors that may impact the retention of women and underrepresented student populations, including intention to major in CS, grades, and prior experience. They've also been defining an additional feature set to improve their use of their course management system, which provides automated grading. This work includes building a hint system to provide more information for students who are struggling with a concept or assignment, crowd-sourcing grading, and studying how students think about CS content and the kinds of errors they are making. The Rutgers team will be publishing their study results in the proceedings of the SIGCSE 2017 Technical Symposium. For more information on these tools, contact Andrew Tjang at firstname.lastname@example.org.
The team consisting of Duke, NCSU, UNC, and UF has produced, and plans to share, tools to improve the student learning experience. My Digital Hand (MDH) is a free online tool for managing and tracking one-to-one peer teaching sessions (for example, helping to keep track of how many hours peer mentors are spending with mentees). MDH supports best practice in peer teaching and mitigates some of the observed challenges in taking peer teaching to scale. The team has also been working on ASCEND (Adaptive Student Computing Environment with Natural Language Dialogue), an Eclipse plug-in designed to facilitate remote synchronous peer teaching sessions. Students can share their projects with a peer teaching fellow (PTF) and chat as the PTF leads the student through a session. ASCEND helps instructors better understand current practice by logging all programming actions and textual chats in real time to a database. For more information on these tools, contact Jeff Forbes at email@example.com.
Several of the CS Capacity principal investigators will be presenting papers on these new interventions and tools at the SIGCSE conference in March. Faculty from the CS Capacity program will also be presenting a panel and roundtable discussion session called "New Tools and Solutions to Address the CS Capacity Crunch." If you're attending SIGCSE this year, we hope you'll join us on Thursday, March 9, from 3:45-5:00 pm.
Given the likelihood that CS undergraduate enrollments will continue to climb, it is critical that the CS education community continue to find, test, and share solutions and tools that enable institutions to effectively teach more students while maintaining the quality of the education experience for students. Faculty from the CS Capacity program will continue to share their solutions and results with the community via CS education conferences and publications.
An updated YouTube-8M, a video understanding challenge, and a CVPR workshop. Oh my!
Wednesday, February 15, 2017
Posted by Paul Natsev, Software Engineer
Last September, we released the YouTube-8M dataset, which spans millions of videos labeled with thousands of classes, in order to spur innovation and advancement in large-scale video understanding. More recently, other teams at Google have released additional datasets that, along with YouTube-8M, can be used to accelerate image and video understanding. To further these goals, today we are releasing an update to the dataset, and in collaboration with Google Cloud Machine Learning, we are also organizing a video understanding competition and an affiliated CVPR 2017 workshop.
An Updated YouTube-8M
The new and improved YouTube-8M includes cleaner and more verbose labels (twice as many labels per video, on average), a cleaned-up set of videos, and, for the first time, pre-computed audio features, based on a state-of-the-art audio modeling architecture, in addition to the previously released visual features. The audio and visual features are synchronized in time, at 1-second temporal granularity, which makes YouTube-8M a large-scale multi-modal dataset and opens up opportunities for exciting new research on joint audio-visual (temporal) modeling. Key statistics on the new version are illustrated below.
A tree-map visualization of the updated YouTube-8M dataset, organized into 24 high-level verticals, including the top-200 most frequent entities, plus the top-5 entities for each vertical.
Sample videos from the top-18 high-level verticals in the YouTube-8M dataset.
The Google Cloud & YouTube-8M Video Understanding Challenge
We are also excited to announce the Google Cloud & YouTube-8M Video Understanding Challenge, in partnership with Kaggle. The challenge invites participants to build audio-visual content classification models using YouTube-8M as training data, and to then label ~700K unseen test videos. It will be hosted as a Kaggle competition, sponsored by Google Cloud, and will feature a $100,000 prize pool for the top performers. In order to enable wider participation in the competition, Google Cloud is also offering credits so participants can optionally do model training and exploration using Google Cloud Machine Learning. Open-source TensorFlow code, implementing a few baseline classification models for YouTube-8M, along with training and evaluation scripts, is available on GitHub. For details on getting started with local or cloud-based training, please see our getting started guide on Kaggle.
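As a rough picture of what such a model can look like, here is a minimal sketch of an audio-visual classifier: independent per-class logistic regression over concatenated video-level visual and audio feature vectors. The dimensions and data below are toy stand-ins (the released features are roughly 1024-d visual and 128-d audio, with thousands of classes; treat those figures as assumptions), and this is not the official starter code.

```python
# A minimal sketch (not the official baseline) of a multi-modal video-level
# classifier: per-class logistic regression over concatenated visual + audio
# features, trained with plain batch gradient descent on fake data.
import numpy as np

rng = np.random.RandomState(0)
VISUAL_DIM, AUDIO_DIM, NUM_CLASSES, N = 64, 16, 10, 500  # toy sizes

# Fake precomputed features and multi-label targets standing in for the dataset.
features = np.hstack([rng.randn(N, VISUAL_DIM), rng.randn(N, AUDIO_DIM)])
labels = (rng.rand(N, NUM_CLASSES) < 0.1).astype(float)

W = np.zeros((VISUAL_DIM + AUDIO_DIM, NUM_CLASSES))
b = np.zeros(NUM_CLASSES)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on the multi-label cross-entropy loss.
for step in range(200):
    probs = sigmoid(features @ W + b)
    W -= 0.1 * features.T @ (probs - labels) / N
    b -= 0.1 * (probs - labels).mean(axis=0)

# Score the classes for one unseen video's concatenated features.
test = np.hstack([rng.randn(VISUAL_DIM), rng.randn(AUDIO_DIM)])
scores = sigmoid(test @ W + b)
print("top classes:", np.argsort(-scores)[:5])
```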
The CVPR 2017 Workshop on YouTube-8M Large-Scale Video Understanding
We will announce the results of the challenge and host invited talks by distinguished researchers at the 1st YouTube-8M Workshop, to be held July 26, 2017, at the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) in Honolulu, Hawaii. The workshop will also feature presentations by top-performing challenge participants and a selected set of paper submissions. We invite researchers to submit papers describing novel research, experiments, or applications based on the YouTube-8M dataset, including papers summarizing their participation in the above challenge.
We designed this dataset with scale and diversity in mind, and hope lessons learned here will generalize to many video domains (YouTube-8M captures over 20 diverse video domains). We believe the challenge can also accelerate research by enabling researchers without access to big data or compute clusters to explore and innovate at previously unprecedented scale. Please join us in advancing video understanding!
This post reflects the work of many others within Machine Perception at Google Research, including Sami Abu-El-Haija, Anja Hauth, Nisarg Kothari, Joonseok Lee, Hanhan Li, Sobhan Naderi Parizi, Rahul Sukthankar, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan, Jiang Wang, as well as Philippe Poutonnet and Mike Styer from Google Cloud, and our partners at Kaggle. We are grateful for the support and advice from many others at Google Research, Google Cloud, and YouTube, and especially thank Aren Jansen, Jort Gemmeke, Dan Ellis, and the Google Research Sound Understanding team for providing the audio features in the updated dataset.
Announcing TensorFlow 1.0
Wednesday, February 15, 2017
Posted by Amy McDonald Sandjideh, Technical Program Manager, TensorFlow
In just over a year, TensorFlow has helped researchers, engineers, artists, students, and many others make progress with everything from early detection of skin cancer to preventing blindness in diabetics. We're excited to see people using TensorFlow in over 6000 open-source repositories online.
Today, as part of the first annual TensorFlow Developer Summit, hosted in Mountain View and livestreamed around the world, we're announcing TensorFlow 1.0:
It's faster: TensorFlow 1.0 is incredibly fast! XLA lays the groundwork for even more performance improvements in the future, and our documentation now includes tips & tricks for tuning your models to achieve maximum speed. We'll soon publish updated implementations of several popular models to show how to take full advantage of TensorFlow 1.0, including a 7.3x speedup on 8 GPUs for Inception v3 and 58x speedup for distributed Inception v3 training on 64 GPUs!
It’s more flexible:
TensorFlow 1.0 introduces a high-level API for TensorFlow, with tf.layers, tf.metrics, and tf.losses modules. We’ve also announced the inclusion of a new tf.keras module that provides full compatibility with
, another popular high-level neural networks library.
It’s more production-ready than ever:
TensorFlow 1.0 promises Python API stability (details
), making it easier to pick up new features without worrying about breaking your existing code.
Other highlights from TensorFlow 1.0:
Python APIs have been changed to resemble NumPy more closely. For this and other backwards-incompatible changes made to support API stability going forward, please use our handy migration guide.
Experimental APIs for Java and Go.
Higher-level API modules tf.layers, tf.metrics, and tf.losses.
Experimental release of XLA, a domain-specific compiler for TensorFlow graphs that targets CPUs and GPUs. XLA is rapidly evolving; expect to see more progress in upcoming releases.
Introduction of the TensorFlow Debugger (tfdbg), a command-line interface and API for debugging live TensorFlow programs.
New Android demos for object detection and localization, and camera-based image stylization.
Installation improvements: Python 3 docker images have been added, and TensorFlow's pip packages are now PyPI compliant. This means TensorFlow can now be installed with a simple invocation of pip install tensorflow.
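As promised above, here is a minimal sketch of the new high-level modules, assuming a TensorFlow 1.x installation: tf.layers for model structure, tf.losses for the training objective, and the Session-based execution of that era. The toy data, layer sizes, and learning rate are arbitrary choices for illustration.

```python
# A minimal sketch (TensorFlow 1.x API) of a tiny regression model built with
# tf.layers and trained with a tf.losses objective.
import numpy as np
import tensorflow as tf

# Placeholder inputs: 4 features in, 1 regression target out.
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.placeholder(tf.float32, shape=[None, 1])

# Two dense layers built with tf.layers instead of raw variable handling.
hidden = tf.layers.dense(x, units=16, activation=tf.nn.relu)
output = tf.layers.dense(hidden, units=1)

# tf.losses collects the objective; a plain SGD optimizer minimizes it.
loss = tf.losses.mean_squared_error(labels=y, predictions=output)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_x = np.random.rand(32, 4).astype(np.float32)
    batch_y = np.random.rand(32, 1).astype(np.float32)
    for _ in range(100):
        _, l = sess.run([train_op, loss], feed_dict={x: batch_x, y: batch_y})
    print("final loss:", l)
```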
We’re thrilled to see the pace of development in the TensorFlow community around the world. To hear more about TensorFlow 1.0 and how it’s being used, you can watch the
TensorFlow Developer Summit talks on YouTube
, covering recent updates from higher-level APIs to TensorFlow on mobile to our new
compiler, as well as the exciting ways that TensorFlow is being used:
for a link to the livestream and video playlist (individual talks will be posted online later in the day).
The TensorFlow ecosystem continues to grow with new techniques like TensorFlow Fold for dynamic batching and new developer tools, along with updates to our existing tools. We're incredibly grateful to the community of contributors, educators, and researchers who have made advances in deep learning available to everyone. We look forward to working with you on forums like our discussion group and at future events.
On-Device Machine Intelligence
Thursday, February 09, 2017
Posted by Sujith Ravi, Staff Research Scientist, Google Research
To build the cutting-edge technologies that enable conversational understanding, we often apply combinations of machine learning technologies such as deep neural networks and graph-based machine learning. However, the machine learning systems that power most of these applications run in the cloud, are computationally intensive, and have significant memory requirements. What if you want machine intelligence to run on your personal phone or smartwatch, or on Internet of Things (IoT) devices, regardless of whether they are connected to the cloud?
Yesterday, we announced the launch of Android Wear 2.0, along with brand new wearable devices, that will run Google's first entirely "on-device" ML technology for powering smart messaging. This on-device ML system, developed by the Expander research team, enables technologies like Smart Reply to be used for any application, including third-party messaging apps, without ever having to connect with the cloud…so now you can respond to incoming chat messages directly from your watch, with a tap.
The research behind this began last year while our team was developing the machine learning systems that enable conversational understanding in our messaging products. The Android Wear team reached out to us and was interested to know whether it would be possible to deploy this Smart Reply technology directly onto a smart device. Because of the limited computing power and memory on smart devices, we quickly realized that it was not possible to do so. Our product manager, Patrick McGregor, realized that this presented a unique challenge and an opportunity for the Expander team to return to the drawing board to design a completely new, lightweight machine learning architecture, not only to enable Smart Reply on Android Wear, but also to power a wealth of other on-device mobile applications. Together with Tom Rudick, Nathan Beach, and other colleagues from the Android Wear team, we set out to build the new system.
Learning with Projections
A simple strategy to build lightweight conversational models might be to create a small dictionary of common rules (input → reply mappings) on the device and use a naive look-up strategy at inference time. This can work for simple prediction tasks involving a small set of classes using a handful of features (such as binary sentiment classification from text, e.g. "I love this movie" conveys a positive sentiment whereas the sentence "The acting was horrible" is negative). But it does not scale to complex natural language tasks involving rich vocabularies and the wide language variability observed in chat messages. On the other hand, machine learning models like recurrent neural networks (such as LSTMs), in conjunction with word embeddings, have proven to be extremely powerful tools for complex sequence learning in natural language understanding tasks, including Smart Reply. However, compressing such rich models to fit in device memory and produce robust predictions at low computation cost (rapidly, on-demand) is extremely challenging. Early experiments with restricting the model to predict only a small handful of replies or using other compression techniques did not produce useful results.
Instead, we built a different solution for the on-device ML system. We first use a fast, efficient mechanism to group similar incoming messages and project them to similar ("nearby") bit vector representations. While there are several ways to perform this projection step, we employ a modified version of locality sensitive hashing (LSH) to reduce dimension from millions of unique words to a short, fixed-length sequence of bits. This allows us to compute a projection for an incoming message very fast, on-the-fly, with a small memory footprint on the device, since we do not need to store the incoming messages, word embeddings, or even the full model used for training.
Similar messages are grouped together and projected to nearby vectors. For example, the messages "hey, how's it going?" and "How's it going buddy?" share similar content and might be projected to the same vector 11100011. Another related message "Howdy, everything going well?" is mapped to a nearby vector 11100110 that differs only in 2 bits.
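To make the projection step concrete, here is a minimal sketch of an LSH-style random projection in Python: each message is hashed into a bag-of-words vector and reduced to a short bit signature by the signs of a few fixed random hyperplanes. This illustrates the general idea only; the bit length, feature hashing, and tokenization are invented, and this is not the modified LSH used in the production system.

```python
# A minimal sketch (not Google's on-device system) of an LSH-style random
# projection that maps a message to a short, fixed-length bit vector.
# Messages with overlapping words tend to receive nearby bit vectors.
import hashlib
import numpy as np

NUM_BITS = 8           # length of the projected bit signature
FEATURE_DIM = 2 ** 16  # hashed bag-of-words feature space

def featurize(message: str) -> np.ndarray:
    """Hash each lowercased token into a sparse bag-of-words vector."""
    vec = np.zeros(FEATURE_DIM)
    for token in message.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % FEATURE_DIM
        vec[idx] += 1.0
    return vec

# Fixed random hyperplanes define the locality-sensitive hash.
rng = np.random.RandomState(42)
planes = rng.randn(NUM_BITS, FEATURE_DIM)

def project(message: str) -> str:
    """Sign of each hyperplane dot product yields one bit of the signature."""
    bits = (planes @ featurize(message)) >= 0
    return ''.join('1' if b else '0' for b in bits)

print(project("hey, how's it going?"))
print(project("How's it going buddy?"))  # often shares most bits with the above
```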
Next, our system takes the incoming message along with its projections and jointly trains a "message projection model" that learns to predict likely replies using our graph-based machine learning framework. The graph learning framework enables training a robust model by combining semantic relationships from multiple sources (message/reply interactions, word/phrase similarity, semantic cluster information) and learning useful projection operations that can be mapped to good reply predictions.
(Top) Messages, along with additional signals when available, are used in a machine learning framework to jointly learn a "message projection model". (Bottom) The message projection model learns to associate replies with the projections of the corresponding incoming messages. For example, the model projects two different messages "Howdy, everything going well?" and "How's it going buddy?" (bottom center) to nearby bit vectors and learns to map these to relevant replies (bottom right).
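Continuing the illustration, the sketch below shows one way to think of a "message projection model": projected signatures are associated with observed replies during training, and at inference time the nearest stored signature (by Hamming distance) yields candidate replies. The projection function, training pairs, and lookup strategy here are toy stand-ins, not the Expander team's model.

```python
# A minimal, self-contained sketch of associating replies with projected
# message signatures and suggesting replies via nearest-signature lookup.
import zlib
from collections import Counter, defaultdict

NUM_BITS = 8

def toy_project(message: str) -> str:
    """Toy projection: each token sets one bit, so similar messages share bits."""
    bits = [0] * NUM_BITS
    for token in message.lower().replace("?", "").replace(",", "").split():
        bits[zlib.crc32(token.encode()) % NUM_BITS] = 1
    return "".join(map(str, bits))

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

# "Training": associate observed replies with the projections of their messages.
reply_table = defaultdict(Counter)
training_pairs = [
    ("hey, how's it going?", "Pretty good, you?"),
    ("how's it going buddy?", "Not bad!"),
    ("want to grab lunch?", "Sure, noon works."),
]
for message, reply in training_pairs:
    reply_table[toy_project(message)][reply] += 1

def suggest(message: str, top_k: int = 2):
    """Find the nearest stored signature and return its most common replies."""
    sig = toy_project(message)
    nearest = min(reply_table, key=lambda s: hamming(s, sig))
    return [r for r, _ in reply_table[nearest].most_common(top_k)]

print(suggest("howdy, everything going well?"))
```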
It’s worth noting that while the message projection model can be trained using complex machine learning architectures and the power of the cloud, as described above, the model itself resides and performs inference completely on device. Apps running on the device can pass a user’s incoming messages and receive reply predictions from the on-device model without data leaving the device. The model can also be adapted to cater to the user’s writing style and individual preferences to provide a personalized experience.
The model applies the learned projections to an incoming message (or sequence of messages) and suggests relevant and diverse replies. Inference is performed on the device, allowing the model to adapt to user data and personal writing styles.
To get the on-device system to work out of the box, we had to make a few additional improvements, such as optimizing to speed up computations on the device and generating rich, diverse replies from the model. We will describe the on-device machine learning work in more detail in a forthcoming scientific publication.
Converse from Your Wrist
When we embarked on our journey to build this technology from scratch, we weren’t sure if the predictions would be useful or of sufficient quality. We’re quite surprised and excited about how well it works even on Android wearable devices with very limited computation and memory resources. We look forward to continuing to improve the models to provide users with more delightful conversational experiences, and we will be leveraging this on-device ML platform to enable completely new applications in the months to come.
You can now use this feature to respond to your messages directly from any watch that runs Android Wear 2.0. It is already enabled on Google Hangouts, Google Messenger, and many third-party messaging apps. We also provide an API for developers of third-party Wear apps.
On behalf of the Google Expander team, I would also like to thank the following people who helped make this technology a success: Andrei Broder, Andrew Tomkins, David Singleton, Mirko Ranieri, Robin Dua and Yicheng Fan.
Announcing TensorFlow Fold: Deep Learning With Dynamic Computation Graphs
Tuesday, February 07, 2017
Posted by Moshe Looks, Marcello Herreshoff and DeLesley Hutchins, Software Engineers
In much of machine learning, data used for training and inference undergoes a preprocessing step, where multiple inputs (such as images) are scaled to the same dimensions and stacked into batches. This lets high-performance deep learning libraries like TensorFlow run the same computation graph across all the inputs in the batch in parallel. Batching exploits the parallel-processing capabilities of modern GPUs and multi-core CPUs to speed up execution. However, there are many problem domains where the size and structure of the input data varies, such as parse trees in natural language understanding, abstract syntax trees in source code, DOM trees for web pages, and more. In these cases, the different inputs have different computation graphs that don't naturally batch together, resulting in poor processor, memory, and cache utilization.
Today we are releasing TensorFlow Fold to address these challenges. TensorFlow Fold makes it easy to implement deep-learning models that operate over data of varying size and structure. Furthermore, TensorFlow Fold brings the benefits of batching to such models, resulting in a speedup of more than 10x on CPU, and more than 100x on GPU, over alternative implementations. This is made possible by dynamic batching, introduced in our paper Deep Learning with Dynamic Computation Graphs.
This animation shows a recursive neural network run with dynamic batching. Operations with the same color are batched together, which lets TensorFlow run them faster. The Embed operation converts words to vector representations. The fully connected (FC) operation combines word vectors to form vector representations of phrases. The output of the network is a vector representation of an entire sentence. Although only a single parse tree of a sentence is shown, the same network can run, and batch together operations, over multiple parse trees of arbitrary shapes and sizes.
The TensorFlow Fold library initially builds a separate computation graph from each input. Because the individual inputs may have different sizes and structures, the computation graphs may as well. Dynamic batching then automatically combines these graphs to take advantage of opportunities for batching, both within and across inputs, and inserts additional instructions to move data between the batched operations (see our paper for technical details).
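To illustrate the idea (in plain NumPy rather than the TensorFlow Fold API), the sketch below encodes several differently shaped binary parse trees at once by grouping nodes by depth, so that each group of Embed and FC operations runs as one batched matrix operation. The vocabulary, dimensions, and combiner are invented for the example.

```python
# A minimal sketch of dynamic batching: nodes from many differently shaped
# trees are grouped by depth, so each group runs as one batched operation
# instead of many small per-node ones.
import numpy as np

DIM = 8
rng = np.random.RandomState(0)
EMBED = {w: rng.randn(DIM) for w in ["the", "cat", "sat", "dog", "ran", "fast"]}
W = rng.randn(DIM, 2 * DIM) * 0.1  # shared "FC" combiner for two child vectors

def depth(tree):
    """Leaves (words) have depth 0; internal nodes are 1 + max child depth."""
    return 0 if isinstance(tree, str) else 1 + max(depth(c) for c in tree)

def encode_forest(trees):
    """Encode a batch of binary parse trees, batching all nodes at each depth."""
    memo = {}  # id(node) -> computed vector
    for d in range(max(depth(t) for t in trees) + 1):
        nodes = []
        def collect(t):
            if depth(t) == d and id(t) not in memo:
                nodes.append(t)
            if not isinstance(t, str):
                for c in t:
                    collect(c)
        for t in trees:
            collect(t)
        if not nodes:
            continue
        if d == 0:
            # One batched embedding lookup for every leaf word.
            batch = np.stack([EMBED[n] for n in nodes])
        else:
            # One batched FC over the concatenated child vectors of every node.
            kids = np.stack([np.concatenate([memo[id(c)] for c in n]) for n in nodes])
            batch = np.tanh(kids @ W.T)
        for node, vec in zip(nodes, batch):
            memo[id(node)] = vec
    return [memo[id(t)] for t in trees]

# Two parse trees of different shapes, encoded together.
t1 = (("the", "cat"), ("sat", "fast"))
t2 = ("dog", ("ran", "fast"))
print([v.shape for v in encode_forest([t1, t2])])  # [(8,), (8,)]
```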
To learn more, head over to our GitHub site. We hope that TensorFlow Fold will be useful for researchers and practitioners implementing neural networks with dynamic computation graphs in TensorFlow.
This work was done under the supervision of Peter Norvig.
Advancing Research on Video Understanding with the YouTube-BoundingBoxes Dataset
Monday, February 06, 2017
Posted by Esteban Real, Vincent Vanhoucke, Jonathon Shlens, Google Brain team and
Stefano Mazzocchi, Google Research
One of the most challenging research areas in machine learning today is enabling computers to understand what a scene is about. For example, while humans know that a ball that disappears behind a wall only to reappear a moment later is very likely the same object, this is not at all obvious to an algorithm. Understanding this requires not only a global picture of what objects are contained in each frame of a video, but also where those objects are located within the frame and how their locations change over time. Just last year we published YouTube-8M, a dataset consisting of automatically labelled YouTube videos. And while this helps further progress in the field, it is only one piece of the puzzle.
Today, in order to facilitate progress in video understanding research, we are introducing YouTube-BoundingBoxes, a dataset consisting of 5 million bounding boxes spanning 23 object categories, densely labeling segments from 210,000 YouTube videos. To date, this is the largest manually annotated video dataset containing bounding boxes, which track objects in temporally contiguous frames. The dataset is designed to be large enough to train large-scale models, and to be representative of videos captured in natural settings. Importantly, the human-labelled annotations contain objects as they appear in the real world, with partial occlusions, motion blur, and natural lighting.
Summary of dataset statistics. Relative number of detections in existing image (red) and video (blue) datasets; the YouTube-BoundingBoxes dataset (YT-BB) is at the bottom. The three columns are counts for: classification annotations, bounding boxes, and unique videos with bounding boxes. Full details on the dataset can be found in the accompanying paper.
A key feature of this dataset is that bounding box annotations are provided for entire video segments. These bounding box annotations may be used to train models that explicitly leverage this temporal information to identify and track objects over time. In a video, individual annotated objects might become entirely occluded and later return in subsequent frames. These annotations of individual objects are sometimes not recognizable from individual frames, but can be understood and recognized in the context of the video if the objects are localized and tracked accurately.
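As a small illustration of the kind of temporal reasoning these per-frame boxes enable, here is a minimal sketch that greedily links detections across consecutive frames by intersection-over-union (IoU). The box format, threshold, and matching strategy are invented for the example and are unrelated to how the dataset itself was annotated.

```python
# A minimal sketch of linking per-frame bounding boxes into object tracks by
# greedy IoU matching between consecutive frames.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Box:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a.xmax, b.xmax) - max(a.xmin, b.xmin))
    iy = max(0.0, min(a.ymax, b.ymax) - max(a.ymin, b.ymin))
    inter = ix * iy
    union = ((a.xmax - a.xmin) * (a.ymax - a.ymin)
             + (b.xmax - b.xmin) * (b.ymax - b.ymin) - inter)
    return inter / union if union > 0 else 0.0

def link_tracks(frames: List[List[Box]], thresh: float = 0.5):
    """Greedily link each box to the best-overlapping box in the previous frame."""
    tracks: List[List[Tuple[int, Box]]] = []
    prev: Dict[int, int] = {}  # box index in previous frame -> track index
    for t, boxes in enumerate(frames):
        cur: Dict[int, int] = {}
        for i, box in enumerate(boxes):
            best_j, best_iou = -1, thresh
            for j, track_idx in prev.items():
                score = iou(box, frames[t - 1][j])
                if score > best_iou:
                    best_j, best_iou = j, score
            cur[i] = prev[best_j] if best_j >= 0 else len(tracks)
            if best_j < 0:
                tracks.append([])
            tracks[cur[i]].append((t, box))
        prev = cur
    return tracks

# Two frames with one object moving slightly to the right: a single track.
frames = [[Box(0.10, 0.10, 0.30, 0.30)], [Box(0.12, 0.10, 0.32, 0.30)]]
print(len(link_tracks(frames)))  # 1
```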
Three video segments, sampled at 1 frame per second. The final frame of each example shows how it is visually challenging to recognize the bounded object, due to blur or occlusion (train example, blue arrow). However, temporally related frames, where the object has been more clearly identified, can allow object classes to be inferred. Note how only visible parts are included in the box: the orange arrow in the bear example (middle row) points to the hidden head. The dog example illustrates tight bounding boxes that track the tail (orange arrows) and foot (blue arrows). The airplane example illustrates how partial objects are annotated (first frame) and tracked across changes in perspective, occlusions, and camera cuts.
We hope that this dataset might ultimately aid the computer vision and machine learning community and lead to new methods for analyzing and understanding real-world vision problems. You can learn more about the dataset in the accompanying paper.
This work was greatly helped along by Xin Pan, Thomas Silva, Mir Shabber Ali Khan, Ashwin Kakarla and many others, as well as support and advice from Manfred Georg, Sami Abu-El-Haija, Susanna Ricco and George Toderici.