Modeling Images, Videos and Text Using the Caffe Deep Learning Library


In the first part of the course, I will describe several recent advances in the automatic generation of natural language descriptions for images and video. Image and video description has important applications in human-robot interaction, indexing and retrieval, and audio description for the blind. I will start with a deep learning model that combines a convolutional network with a recurrent network to generate sentences from images or fixed-length videos. I will then describe a sequence-to-sequence neural network that learns to generate captions for brief videos of variable length. The model is trained on video-sentence pairs and naturally learns both the temporal structure of the frame sequence and a model of the generated sentences, i.e. a language model. To further handle ambiguity over multiple objects and locations, the model incorporates convolutional networks with Multiple Instance Learning (MIL) to consider objects at different positions and scales simultaneously. I will show how this multi-scale, multi-instance convolutional network is integrated with a sequence-to-sequence recurrent neural network to generate sentence descriptions from the visual representation. This architecture is the first end-to-end trainable deep neural network capable of multi-scale region processing for variable-length video description. I will show results of captioning YouTube videos and Hollywood movies.
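The encode-then-decode idea behind such sequence-to-sequence captioners can be sketched in a few lines. The following is a minimal, untrained illustration only: it substitutes a plain tanh RNN for the LSTM used in practice, uses random weights, and all names, dimensions, and token ids are assumptions for the sake of the example. The key point it shows is that the same recurrent step first reads a variable-length sequence of frame features (emitting nothing), then generates words one at a time while feeding each predicted word back in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: frame-feature size, hidden size, vocabulary size.
FEAT, HID, VOCAB = 8, 16, 10

# Shared recurrent parameters (random here; learned from video-sentence
# pairs in the real model). A plain tanh RNN stands in for the LSTM.
W_in = rng.normal(scale=0.1, size=(HID, FEAT + VOCAB))
W_h = rng.normal(scale=0.1, size=(HID, HID))
W_out = rng.normal(scale=0.1, size=(VOCAB, HID))

def step(x, h):
    """One recurrent step: combine input x with the previous hidden state h."""
    return np.tanh(W_in @ x + W_h @ h)

def describe(frames, max_words=5, bos=0, eos=1):
    """Encode a variable-length frame sequence, then decode a word sequence."""
    h = np.zeros(HID)
    pad_word = np.zeros(VOCAB)
    # Encoding stage: read CNN frame features one by one, emit nothing.
    for f in frames:
        h = step(np.concatenate([f, pad_word]), h)
    # Decoding stage: frames are exhausted, so feed back predicted words.
    pad_frame = np.zeros(FEAT)
    word, sentence = bos, []
    for _ in range(max_words):
        onehot = np.zeros(VOCAB)
        onehot[word] = 1.0
        h = step(np.concatenate([pad_frame, onehot]), h)
        word = int(np.argmax(W_out @ h))
        if word == eos:
            break
        sentence.append(word)
    return sentence

# A "video" of 7 frames of (fake) CNN features; note the caption length
# is independent of the number of input frames.
caption = describe(rng.normal(size=(7, FEAT)))
```

Because encoding and decoding share one recurrent network, the caption length is decoupled from the video length, which is what lets the model handle videos of variable length.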
In the second part of the course, I will discuss how these deep language and vision models can be implemented using the Caffe library. Caffe (Convolutional Architecture for Fast Feature Embedding) is a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Caffe’s expressive architecture encourages application and innovation: models and optimization are defined by configuration rather than hard-coded. A single flag switches between CPU and GPU, so one can train on a GPU machine and then deploy to commodity clusters or mobile devices. Caffe’s extensible code fosters active development, and many contributors have provided state-of-the-art computer vision models. Its speed makes Caffe well suited to both research experiments and industry deployment (as fast as 1 ms/image for inference and 4 ms/image for learning). Caffe already powers academic research projects, startup prototypes, and large-scale industrial applications in vision, speech, and multimedia. This tutorial will equip researchers and developers with the tools and know-how needed to incorporate deep learning into their work. I will cover basic Caffe usage and walk through step-by-step notebook examples, including the language and vision models discussed in the first part of the course.
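To make "defined by configuration without hard-coding" concrete, here is a minimal sketch of a single layer in Caffe's prototxt format; the layer name, blob names, and parameter values are illustrative, not taken from any particular model in the course:

```
# A convolutional layer described entirely in configuration:
# no C++ or Python code is needed to add it to a network.
layer {
  name: "conv1"            # illustrative layer name
  type: "Convolution"
  bottom: "data"           # input blob
  top: "conv1"             # output blob
  convolution_param {
    num_output: 96         # number of filters
    kernel_size: 11
    stride: 4
  }
}
```

A whole network is a list of such layer blocks, and the solver (learning rate, momentum, snapshotting) is a second prototxt file. The single-flag CPU/GPU switch mentioned above is exposed in pycaffe as caffe.set_mode_cpu() and caffe.set_mode_gpu().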