mil-nce
TensorFlow 1
Model Details
An I3D-Text pretrained model that can be used as a video feature extractor or to compute similarity scores between short video clips and sentences. The model was trained using uncurated narrated instructional videos only. More details about the training and the model can be found in [1].
References
[1] Antoine Miech*, Jean-Baptiste Alayrac*, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. arXiv:1912.06430 (*equal contribution).
Example Use
Import TensorFlow and TensorFlow Hub:
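A minimal sketch; since the model targets TensorFlow 1, the TF1 behavior has to be enabled explicitly when running under TF2:

```python
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub

tf.disable_v2_behavior()  # the module uses the TF1 hub.Module API
```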
Inputs should be tensors of the following type:
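The placeholder names below (images, words) are illustrative; what matters are the dtypes and shapes, sketched here for a batch of video clips and a batch of sentences. T=32 and H=W=224 are example values consistent with the recommendation in the note that follows:

```python
# Batch of video clips: float32 RGB frames, values assumed to be
# normalized in [0, 1], shape [B, T, H, W, 3].
images = tf.placeholder(tf.float32, shape=[None, 32, 224, 224, 3])
# Batch of sentences given as raw strings, shape [B].
words = tf.placeholder(tf.string, shape=[None])
```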
NB: The video network is fully convolutional (with global average pooling in time and space at the end). However, we recommend using T=32 frames (same as during training). For H and W we have been using values from 200 to 256.
Load the model in testing mode:
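A minimal sketch, assuming the handle of the I3D variant published under deepmind/mil-nce on tfhub.dev:

```python
module = hub.Module("https://tfhub.dev/deepmind/mil-nce/i3d/1")
```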
Alternatively, you can load the video model in training mode to activate the training behavior of batch normalization:
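This assumes the module exports a train-tagged graph variant, the standard TF1 Hub convention for toggling batch normalization:

```python
module = hub.Module("https://tfhub.dev/deepmind/mil-nce/i3d/1",
                    trainable=True, tags={"train"})
```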
Inference:
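The signature names "video" and "text" below are assumed from the two output dictionaries described next:

```python
vision_output = module(images, signature="video", as_dict=True)
text_output = module(words, signature="text", as_dict=True)
```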
Note that vision_output is a dictionary containing two keys:

- mixed_5c: the global average pooled feature from I3D, of dimension 1024. It should be used for classification on downstream tasks.
- video_embedding: the video embedding (size 512) from the joint text-video space. It should be used to compute similarity scores with text inputs using the text embedding.

text_output is also a dictionary, containing a single key:

- text_embedding: the text embedding (size 512) from the joint text-video space. To compute the similarity score between a text and a video, take the dot product between text_embedding and video_embedding.
Computing all the pairwise video-text similarities:
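A sketch that turns the two embedding batches into a full score matrix with a single matrix product; the fed video frames and sentences are made-up examples:

```python
import numpy as np

video_embedding = vision_output["video_embedding"]  # shape [B, 512]
text_embedding = text_output["text_embedding"]      # shape [B, 512]
# Entry (i, j) is the dot product between sentence i and video clip j.
all_scores = tf.matmul(text_embedding, video_embedding, transpose_b=True)

with tf.train.MonitoredSession() as session:
    scores = session.run(all_scores, feed_dict={
        images: np.random.rand(2, 32, 224, 224, 3).astype(np.float32),
        words: ["mixing the dough", "sanding a plank"],
    })
```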