MoveNet

TensorFlow 2

Model Details

A convolutional neural network model that runs on RGB images and predicts human joint locations of a single person. The model is designed to be run in the browser using TensorFlow.js or on devices using TF Lite in real time, targeting movement/fitness activities. This variant, MoveNet.SinglePose.Thunder, is a higher-capacity model (compared to MoveNet.SinglePose.Lightning) that delivers better prediction quality while still achieving real-time (>30 FPS) speed. Naturally, thunder will lag behind the lightning, but it will pack a punch.


Model Specifications


The following sections describe the general model information. Please see the model card for more detailed information and quantitative analysis.


Model Architecture


MobileNetV2 image feature extractor with Feature Pyramid Network decoder (to stride of 4) followed by CenterNet prediction heads with custom post-processing logic. Thunder uses a depth multiplier of 1.75.


Intended Use


Primary Intended Uses

  1. Optimized to be run in the browser environment using TensorFlow.js with WebGL support or on-device with TF Lite.
  2. Tuned to be robust when detecting fitness/fast movement with difficult poses and/or motion blur.
  3. Most suitable for detecting the pose of a single person who is 3ft ~ 6ft away from a device’s webcam that captures the video stream.
  4. Focuses on detecting the pose of the person who is closest to the image center and ignores any other people in the image frame (i.e. background people rejection).
  5. The model predicts 17 human keypoints of the full body even when they are occluded. For keypoints that are outside of the image frame, the model emits low confidence scores. A confidence threshold (recommended default: 0.3) can be used to filter out low-confidence predictions, as sketched below.
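
To illustrate the thresholding in point 5, here is a minimal sketch. It assumes the single-pose model returns a [1, 1, 17, 3] float32 tensor whose last axis holds (y, x, score) per keypoint; that shape and the helper name are assumptions of this sketch, not specified in this card.

import numpy as np

# Recommended default threshold from this card.
CONFIDENCE_THRESHOLD = 0.3

def confident_keypoints(keypoints_with_scores, threshold=CONFIDENCE_THRESHOLD):
    # Assumed input shape: [1, 1, 17, 3] -> squeeze to [17, 3].
    kpts = np.squeeze(keypoints_with_scores)
    # Keep only (index, y, x, score) tuples whose score clears the threshold.
    return [
        (i, float(y), float(x), float(s))
        for i, (y, x, s) in enumerate(kpts)
        if s >= threshold
    ]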


Primary Intended Users


  1. People who build applications (e.g. fitness/physical movement, AR entertainment) that require very fast inference and good quality single-person pose detection (with background people rejection) on standard consumer devices (e.g. laptops, tablets, cell phones).


Out-of-scope Use Cases


  1. This model is not intended for detecting poses of multiple people in the image.
  2. Any form of surveillance or identity recognition is explicitly out of scope and not enabled by this technology.
  3. The model does not store/use/send any information in the input images at inference time.

License


This model is released under the Apache 2.0 license. If you intend to use it beyond the permitted usage, please consult the model owners ahead of time.


Example Use


Model Specifications


The following sections describe the general model information. Please see the model card for more detailed information and quantitative analysis.


Model Architecture


MobileNetV2 image feature extractor with Feature Pyramid Network decoder (to stride of 4) followed by CenterNet prediction heads with custom post-processing logic. Lightning uses a depth multiplier of 1.5.


Inputs

A frame of video or an image, represented as an int32 tensor of dynamic shape 1xHxWx3, where H and W need to be multiples of 32 and the larger dimension is recommended to be 256. To prepare the input image tensor, resize (and pad if needed) the image such that the above conditions hold; please see the Usage section for a more detailed explanation. Note that the size of the input image controls the tradeoff between speed and accuracy, so choose the value that best suits your application. The channel order is RGB with values in [0, 255].
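
As a concrete sketch of this sizing rule (the helper name, and the choice of 256 as the longer side, are illustrative rather than part of the model API): scale the image so its larger side becomes 256, then round each side up to the nearest multiple of 32; the remainder is filled by padding.

import math

def target_input_size(height, width, longer_side=256, multiple=32):
    # Scale so the larger dimension lands on `longer_side`.
    scale = longer_side / max(height, width)
    # Round each scaled dimension up to the nearest multiple of 32;
    # tf.image.resize_with_pad can then fill the difference with padding.
    pad_to = lambda v: int(math.ceil(v / multiple) * multiple)
    return pad_to(height * scale), pad_to(width * scale)

# A 720p frame (720x1280, HxW) maps to 160x256, matching the Usage example below.
assert target_input_size(720, 1280) == (160, 256)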


Outputs


A float32 tensor of shape [1, 6, 56].

  1. The first dimension is the batch dimension, which is always equal to 1.
  2. The second dimension corresponds to the maximum number of instance detections. The model can detect up to 6 people in the image frame simultaneously.
  3. The third dimension represents the predicted bounding box and keypoint locations and scores. The first 17 * 3 elements are the keypoint locations and scores in the format [y_0, x_0, s_0, y_1, x_1, s_1, …, y_16, x_16, s_16], where y_i, x_i, s_i are the yx-coordinates (normalized to the image frame, i.e. in the range [0.0, 1.0]) and the confidence score of the i-th joint, respectively. The order of the 17 keypoint joints is: [nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, right ankle]. The remaining 5 elements [ymin, xmin, ymax, xmax, score] represent the region of the bounding box (in normalized coordinates) and the confidence score of the instance, as unpacked in the sketch below.
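
The following minimal sketch unpacks this layout; the function and dictionary field names are illustrative choices, not part of the model's API.

import numpy as np

# Joint order as documented above.
KEYPOINT_NAMES = [
    'nose', 'left eye', 'right eye', 'left ear', 'right ear',
    'left shoulder', 'right shoulder', 'left elbow', 'right elbow',
    'left wrist', 'right wrist', 'left hip', 'right hip',
    'left knee', 'right knee', 'left ankle', 'right ankle',
]

def unpack_detections(raw, score_threshold=0.3):
    # `raw` is the [1, 6, 56] 'output_0' tensor; iterate the 6 instance rows.
    people = []
    for row in np.asarray(raw)[0]:
        keypoints = row[:51].reshape(17, 3)   # (y, x, score) per joint
        ymin, xmin, ymax, xmax, score = row[51:]
        if score < score_threshold:           # drop low-confidence instances
            continue
        people.append({
            'box': (float(ymin), float(xmin), float(ymax), float(xmax)),
            'score': float(score),
            'keypoints': dict(zip(KEYPOINT_NAMES, keypoints.tolist())),
        })
    return people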


Usage


The following code snippet shows how to load and run model inference on an input image in Python. We recommend resizing and padding your image to a height/width such that:

  1. The height and width are both multiples of 32.
  2. The height-to-width ratio is close to the original image's aspect ratio (while still covering it).
  3. The larger side is 256 (adjust this based on your speed/accuracy requirements). For example, a 720p image (i.e. 720x1280 (HxW)) should be resized and padded to a 160x256 image.


# Import TF and TF Hub libraries.
import tensorflow as tf
import tensorflow_hub as hub

# Load the input image.
image_path = 'PATH_TO_YOUR_IMAGE'
image = tf.io.read_file(image_path)
image = tf.image.decode_jpeg(image)
image = tf.expand_dims(image, axis=0)
# Resize and pad the image to keep the aspect ratio and fit the expected size.
image = tf.cast(tf.image.resize_with_pad(image, 256, 256), dtype=tf.int32)

# Download the model from TF Hub.
model = hub.load("https://www.example.com/models/google/movenet/TensorFlow2/multipose-lightning/1")
movenet = model.signatures['serving_default']

# Run model inference.
outputs = movenet(image)
# Output is a [1, 6, 56] tensor.
keypoints = outputs['output_0']


Intended Use


Primary Intended Uses


  1. Optimized to be run in the browser environment using TensorFlow.js with WebGL support or on-device with TF Lite.
  2. Tuned to be robust when detecting fitness/fast movement with difficult poses and/or motion blur.
  3. Most suitable for detecting the poses of multiple people who are 3ft ~ 6ft away from a device’s webcam that captures the video stream.
  4. Detects up to 6 people and their poses in the image frame.
  5. The model predicts 17 human keypoints of the full body even when they are occluded. For keypoints that are outside of the image frame, the model emits low confidence scores. A confidence threshold can be used to filter out low-confidence predictions.


Primary Intended Users


  1. People who build applications (e.g. fitness/physical movement, AR entertainment) that require very fast inference and good-quality multi-person pose detection on standard consumer devices (e.g. laptops, tablets, cell phones). For users whose applications assume only a single person in the image, we recommend checking out our MoveNet.SinglePose model.


Out-of-scope Use Cases


  1. Any form of surveillance or identity recognition is explicitly out of scope and not enabled by this technology.
  2. The model does not store/use/send any information in the input images at inference time.


Changelog


  1. Initial release.


