mil-nce
TensorFlow 1
Model Details
An I3D-Text pretrained model that can be used as a video feature extractor or to compute similarity scores between short video clips and sentences. The model was trained using uncurated narrated instructional videos only. More details about the training and the model can be found in [1].
References
[1] Antoine Miech*, Jean-Baptiste Alayrac*, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. arXiv:1912.06430 (*equal contribution).
Example Use
Import TensorFlow and TensorFlow Hub:
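A minimal sketch; since the model targets TensorFlow 1, the TF1 behavior has to be enabled explicitly when running under TF2:

```python
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub

tf.disable_v2_behavior()  # the module uses the TF1 hub.Module API
```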
Inputs should be tensors of the following type:
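The placeholder names below (images, words) are illustrative; what matters are the dtypes and shapes, sketched here for a batch of video clips and a batch of sentences. T=32 and H=W=224 are example values consistent with the recommendation in the note that follows:

```python
# Batch of video clips: float32 RGB frames, values assumed to be
# normalized in [0, 1], shape [B, T, H, W, 3].
images = tf.placeholder(tf.float32, shape=[None, 32, 224, 224, 3])
# Batch of sentences given as raw strings, shape [B].
words = tf.placeholder(tf.string, shape=[None])
```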
NB: The video network is fully convolutional (with global average pooling in time and space at the end). However, we recommend using T=32 frames (same as during training). For H and W we have been using values from 200 to 256.
Load the model in testing mode:
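A minimal sketch, assuming the handle of the I3D variant published under deepmind/mil-nce on tfhub.dev:

```python
module = hub.Module("https://tfhub.dev/deepmind/mil-nce/i3d/1")
```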
Alternatively, you can load the video model in training mode to activate the training behavior of batch normalization:
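This assumes the module exports a train-tagged graph variant, the standard TF1 Hub convention for toggling batch normalization:

```python
module = hub.Module("https://tfhub.dev/deepmind/mil-nce/i3d/1",
                    trainable=True, tags={"train"})
```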
Inference:
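The signature names "video" and "text" below are assumed from the two output dictionaries described next:

```python
vision_output = module(images, signature="video", as_dict=True)
text_output = module(words, signature="text", as_dict=True)
```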
Note that vision_output is a dictionary containing two keys:

- mixed_5c: the global average pooled feature from I3D, of dimension 1024. It should be used for classification on downstream tasks.
- video_embedding: the video embedding (size 512) from the joint text-video space. It should be used to compute similarity scores with text inputs using the text embedding.

text_output is also a dictionary, containing a single key:

- text_embedding: the text embedding (size 512) from the joint text-video space. To compute the similarity score between a text and a video, take the dot product between text_embedding and video_embedding.
Computing all the pairwise video-text similarities:
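A sketch that turns the two embedding batches into a full score matrix with a single matrix product; the fed video frames and sentences are made-up examples:

```python
import numpy as np

video_embedding = vision_output["video_embedding"]  # shape [B, 512]
text_embedding = text_output["text_embedding"]      # shape [B, 512]
# Entry (i, j) is the dot product between sentence i and video clip j.
all_scores = tf.matmul(text_embedding, video_embedding, transpose_b=True)

with tf.train.MonitoredSession() as session:
    scores = session.run(all_scores, feed_dict={
        images: np.random.rand(2, 32, 224, 224, 3).astype(np.float32),
        words: ["mixing the dough", "sanding a plank"],
    })
```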