MOT, TF model customization and distributed training

Python project, TensorFlow.

First, this article describes how to convert a simple object detector into a Multi-Object Tracking (MOT) system capable of keeping identities to follow subjects along a sequence. Second, it shows how to customize and retrain a model from the TensorFlow Object Detection API. Unlike the previous article, we will parse the VOC2012 dataset with modern tools instead of implementing our own parser from scratch. We will also distribute the training over multiple GPUs.

GitHub link: https://github.com/Apiquet/Tracking_SSD_ReID

First, I will share the notes I have taken from recent papers about tracking. Then, we will move forward with our tracker implementation for the MOT task.

Table of contents

  1. Papers
    1. Tracktor++
    2. Deep SORT
    3. JDE
  2. Objectives
  3. Tracker implementation with SSD from previous article
  4. Tracker implementation with SSD from TensorFlow Hub
  5. Customize a model from the TensorFlow Object Detection API and distributed training on multiple GPUs
    1. Pascal VOC 2012 with tensorflow_datasets
    2. Distribute a dataset over multiple GPUs
    3. Customize and retrain TensorFlow models
  6. Conclusion

1) Papers

Papers to analyze:


Tracktor++ (code):

“We push tracking-by-detection to the limit by using only an object detection method to perform tracking.”

  • no training or optimization on tracking data
  • exploit the bounding box regression of an object detector to predict the position of an object in the next frame (convert a detector into a Tracktor)
  • detect object locations independently in each frame, then link corresponding detections across time
  • Two main steps:
    • the regression of the object detector at frame t aligns the already existing track bounding boxes of frame t-1; the others are killed
    • the object detector (or a given set of public detections) provides a set of detections for frame t, from which new tracks are initialized
  • Detector
    • Core element: a regression-based detector; a Faster R-CNN with ResNet-101 and Feature Pyramid Networks (FPN) trained on the MOT17Det [45] pedestrian detection dataset
    • Faster R-CNN applies a Region Proposal Network to generate a multitude of bounding box proposals for each potential object
    • Feature maps for each proposal are extracted via Region of Interest (RoI) pooling and passed to the classification and regression heads
    • the classification head assigns a score evaluating the likelihood that the proposal shows a pedestrian
    • regression head refines the bounding box location tightly around an object
    • non-maximum suppression (NMS) is then applied to the refined bounding box proposals
  • Tracktor
    • extract the spatial and temporal positions, i.e., trajectories
    • trajectory regression achieved by regressing the bounding box of frame t-1 to the object’s new position at frame t
    • For Faster R-CNN, this corresponds to applying RoI pooling on the features of the current frame but with the previous bounding box coordinates (assumption: target has moved only slightly)
    • Two cases for deactivating a trajectory:
      • object leaving the scene or occluded by a non-object
      • occlusions between objects are handled by applying NMS

Two extensions are available to improve identity preservation across frames:

  • Motion model: to compensate for cases where the assumption that an object’s position changes only slightly from frame to frame does not hold
    • moving cameras (objects’ positions change too much from frame to frame) -> camera motion compensation (CMC) by aligning frames via image registration using the Enhanced Correlation Coefficient (ECC)
    • low frame rate (objects’ positions change too much from frame to frame) -> apply a constant velocity assumption (CVA)
  • Re-identification: short-term re-identification based on appearance vectors generated by a Siamese neural network (trained on tracking ground truth data)
    • compare the distance in the embedding space between the deactivated tracks and the newly detected ones
    • re-identify via a threshold

Deep SORT (code):

  • Kalman filtering in image space and frame-by-frame data association
  • Robustness increased by adding CNN trained to discriminate pedestrians on a large-scale person re-identification dataset
  • For each track, a counter counts the number of frames since the last successful measurement association; it is reset to 0 when the track is found again. If it reaches a threshold, the track is considered to have left the scene and is deleted.
  • A new track is initiated for each unassociated detection and classified as tentative for its first three frames (three successful measurement associations are expected to keep it)
  • Assignment problem -> Hungarian algorithm: motion and appearance information are integrated through a combination of two appropriate metrics.
  • Mahalanobis distance provides information about possible object locations based on motion (useful for short-term predictions).
  • Cosine distance used for appearance information (useful to recover identities after long-term occlusions)
  • Matching Cascade: Instead of solving associations in a global assignment problem, they solve a series of subproblems. It introduces a matching cascade that gives priority to more frequently seen objects
  • The CNN used was trained on a large-scale person re-identification dataset [21] that contains over 1,100,000 images of 1,261 pedestrians
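
To make the two metrics concrete, they can be written roughly as below (notation: d_j is the j-th detection, (y_i, S_i) the Kalman-predicted mean and covariance of track i, r_j the detection's appearance vector, R_i the gallery of past appearance vectors of track i; check the paper for the exact formulation):

d^(1)(i, j) = (d_j − y_i)^T S_i^(−1) (d_j − y_i)              (motion metric: Mahalanobis distance)
d^(2)(i, j) = min{ 1 − r_j^T r_k^(i)  |  r_k^(i) ∈ R_i }       (appearance metric: cosine distance)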

JDE (code):

  • Joint Detection and Embedding (JDE) model in a single-shot deep network
  • Feature Pyramid Network as base architecture
  • training process as a multi-task learning problem with anchor classification, box regression, and embedding learning
  • evaluation metrics: AP is employed to evaluate the performance of the detector; the retrieval metric True Accept Rate (TAR) at a certain False Alarm Rate (FAR) is adopted to evaluate the quality of the embedding

2) Objectives

Nowadays, we can notice that top tracking algorithms use a neural network trained for object detection, followed by a re-identification (re-ID) strategy. For instance, Tracktor++ uses a Siamese network to achieve re-ID.

The first neural network for object detection we are going to use is the SSD developed in the previous article: https://apiquet.com/2020/11/07/ssd300-implementation/.

The following illustration was made thanks to tensorflow.python.keras.utils.vis_utils.plot_model and paint.net to help visualize our final SSD architecture:
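
For reference, the diagram export itself boils down to a single call (ssd_model here stands for whatever variable holds the SSD instance, and the output file name is arbitrary):

from tensorflow.keras.utils import plot_model

# export the architecture diagram as an image, including layer output shapes
plot_model(ssd_model, to_file='ssd_architecture.png', show_shapes=True)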

This neural network, trained for object detection on 21 different classes (Pascal VOC 2012), will be used for box detection; the tracker will then be used to keep identities.

The first tracker implemented will be a naive tracker that only uses the SSD’s detections and keeps each subject’s id by computing IoU between the boxes of frame n and frame n+1. A Siamese CNN could also be developed to track the subjects, using the SSD’s detections only as box proposals when needed.

3) Tracker implementation with SSD from previous article

A first tracker we could implement only takes as input the boxes proposed by the neural network trained for object detection. This tracker could have the following behavior:

  • frame n: get the boxes/classes from the SSD model and give an ID for each subject
  • frame n+1: compute the IoU (explained and implemented in the previous article) between the new boxes and the ones of the frame n. If the IoU > threshold and the classes are the same, we propagate the IDs.

This tracker can be implemented with two classes:

  • A class for a subject (a minimal sketch follows this list) with the following attributes:
    • category (person, dog, etc.)
    • box (cx, cy, w, h)
    • lifespan: number of times the subject was found (we could then filter to avoid displaying a subject only seen once or twice)
    • dying: a counter incremented when the subject is not found in the current frame; the subject is removed when it reaches a certain value
    • seen: if the subject was seen in the current frame
    • identity: the subject’s id
  • A main class with:
    • list of alive tracked subjects
    • max id given (to be increased each time we find a new subject)
    • a method to remove subjects whose dying counter exceeds a threshold
    • a method to compute IoU
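
A minimal sketch of the subject class could look like the following (the attribute names match the snippet further below; the constructor signature and default values are assumptions, not necessarily the repo's exact code):

class _subjectTracked:
    """One tracked subject: its category, box and bookkeeping counters."""
    def __init__(self, category, loc, identity):
        self.category = category  # class predicted by the SSD (person, dog, etc.)
        self.loc = loc            # box as (cx, cy, w, h)
        self.identity = identity  # unique id assigned by the tracker
        self.lifespan = 1         # number of frames in which the subject was found
        self.dying = 0            # frames elapsed since the subject was last seen
        self.seen = False         # whether the subject was seen in the current frame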

The tracker’s behavior could be:

  • loop over the different categories given by SSD
  • compute IoU between the tracked subjects and the current boxes given by SSD (of the selected category), as sketched right after this list
  • give the tracked subject’s id to the box with IoU > threshold
  • Finally, create new tracked subjects with all the left boxes
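
The snippet below calls an IoU helper, computeJaccardIdx, that returns both a boolean mask of boxes above the threshold and the raw IoU values. Here is a minimal sketch of such a method of the main class, assuming boxes in (cx, cy, w, h) format (the repo's actual implementation may differ):

def computeJaccardIdx(self, subject_box, boxes, threshold):
    """Return (mask of boxes with IoU >= threshold, IoU values)."""
    def to_corners(b):
        # convert (cx, cy, w, h) to (xmin, ymin, xmax, ymax)
        return tf.concat([b[..., :2] - b[..., 2:] / 2.0,
                          b[..., :2] + b[..., 2:] / 2.0], axis=-1)

    subj = to_corners(tf.reshape(subject_box, [1, 4]))
    boxes_c = to_corners(boxes)
    # intersection rectangle between the subject box and every candidate box
    inter_min = tf.maximum(subj[:, :2], boxes_c[:, :2])
    inter_max = tf.minimum(subj[:, 2:], boxes_c[:, 2:])
    inter_wh = tf.maximum(inter_max - inter_min, 0.0)
    inter_area = inter_wh[:, 0] * inter_wh[:, 1]
    # union = area(subject) + area(candidates) - intersection
    union = subject_box[2] * subject_box[3] + boxes[:, 2] * boxes[:, 3] - inter_area
    iou = inter_area / (union + 1e-8)
    return iou >= threshold, iou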

The following code is the main class implementation:

# loop over the categories
for category in tf.unique(categories)[0]:
    cat_idx = categories == category

    for subject in self.subjects:
        # no need to compute IoU if the subject has a different class
        if subject.category != category:
            continue
        iou, values = self.computeJaccardIdx(subject.loc, boxes, 0.3)
        iou = tf.math.logical_and(iou, cat_idx)
        iou = tf.dtypes.cast(iou, tf.int16)
        # verify that at least one box has an IoU >= 0.3
        if tf.reduce_sum(iou, axis=0) != 0:
            subject.seen = True
            # give subject's id to box with max IoU score
            idx_max_iou = values.numpy().argmax(axis=0)
            subject.loc = boxes[idx_max_iou]
            subject.dying = 0
            identity[idx_max_iou] = subject.identity
            lifespan[idx_max_iou] = subject.lifespan
# increase the dying counter for any subject that was not seen
for subject in self.subjects:
    if subject.seen is False:
        subject.dying += 1
# create a new subject for each box that did not get an id
for i, box in enumerate(boxes):
    if identity[i] == -1:
        self.max_id += 1
        new_subject = _subjectTracked(categories[i], box, self.max_id)
        self.subjects.append(new_subject)
        identity[i] = self.max_id

This tracker has been implemented here: https://github.com/Apiquet/Tracking_SSD_ReID/blob/main/models/NaiveTracker.py

We can now add a single line of code to the function used in the previous article to plot the SSD’s predictions:

identities = tracker(classes, boxes)
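
For context, here is a hedged sketch of where that line sits in a per-frame loop (assuming the class in models/NaiveTracker.py is named NaiveTracker; video_frames, ssd_model, get_ssd_predictions and draw_boxes are placeholders, not functions from the project):

from models.NaiveTracker import NaiveTracker

tracker = NaiveTracker()
for frame in video_frames:
    # placeholder: run the SSD and get the classes and boxes for this frame
    classes, boxes = get_ssd_predictions(ssd_model, frame)
    # the tracker propagates existing ids or assigns new ones to the boxes
    identities = tracker(classes, boxes)
    # placeholder: draw the boxes and identities on the frame
    draw_boxes(frame, boxes, classes, identities)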

And we get the tracking results:

Unlike in the previous article, we now give an ID to each detected subject, and the tracker keeps it throughout the sequence.

Even though the SSD model created in the previous article achieves good results, I decided to use a neural network from TensorFlow Hub for the following reasons:

  • In a professional use case, we won’t spend time implementing an entire architecture if it is already available under a non-restrictive license. We therefore need to know how to use resources such as TensorFlow Hub to get an existing model and then modify it to fit our use case,
  • The model implemented in the previous article has limited performance due to the limited computation resources available on Colab and the limited time I wanted to spend training it,
  • The previous model has an input shape of 300×300, which makes detecting distant objects difficult (this choice was made to reduce processing/training time),
  • If the model used for object detection comes from TensorFlow Hub, the code will be more generic should we later want to change the network used for box proposals,
  • Some readers may only be interested in the tracker implementation without needing the SSD network from the previous article,
  • Some readers may want to test the tracker implementation with different object detectors to compare performance,
  • Some readers may want to learn how to find an existing architecture and how to modify it to fit a specific use case.

4) Tracker implementation with SSD from TensorFlow Hub

For the reasons mentioned previously, we are going to use an SSD model provided by TensorFlow Hub and available here: tfhub.dev/tensorflow/ssd_mobilenet_v2/2. This model will also be used for box proposals.

The following code creates our detector:

import tensorflow_hub as hub
module_handle = "https://tfhub.dev/google/openimages_v4/ssd/mobilenet_v2/1"
model = hub.load(module_handle)
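
As a quick sanity check, the loaded module can be run on a single frame as follows (the 'default' signature and the output keys follow the module's TF Hub documentation; 'frame.jpg' is just a placeholder path):

import tensorflow as tf

detector = model.signatures['default']

# the module expects a float32 image in [0, 1] with shape [1, height, width, 3]
image = tf.image.decode_jpeg(tf.io.read_file('frame.jpg'), channels=3)
image = tf.image.convert_image_dtype(image, tf.float32)[tf.newaxis, ...]

result = detector(image)
boxes = result['detection_boxes']             # [N, 4] as (ymin, xmin, ymax, xmax)
classes = result['detection_class_entities']  # e.g. b'Horse', b'Person', b'Dog'
scores = result['detection_scores']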

We can then use the same tracker developed in the previous section, but since the model used for box proposals is more accurate, we can run it on harder sequences using the pltPredOnVideoTfHub() function available here: https://github.com/Apiquet/Tracking_SSD_ReID/blob/main/utils/eval.py

pltPredOnVideoTfHub(detector, video_path, out_path, score_threshold=0.112,
                    start_idx=270, end_idx=440, skip=1, resize=(500, 200),
                    tracker=tracker, fps=20, input_shape=(512, 512),
                    targets=["Horse", "Person", "Dog"], lifespan_thres=8)

It is difficult for the tracker to keep the identity of each horse but we can see that some of them are well followed. We can also notice that the person and the dog are very well tracked.

5) Customize a model from the TensorFlow Object Detection API and distributed training on multiple GPUs

5-1) Pascal VOC 2012 with tensorflow_datasets

In the previous article, we saw how to implement our own parser to convert the raw .xml annotations into data that a neural network can use for training. This time, we are going to use modern tools such as tensorflow_datasets to do so. The following code downloads the dataset, splits it and divides it into batches:

splits = ['train[:80%]', 'train[80%:90%]', 'train[90%:]']

(train_examples, validation_examples, test_examples), info = tfds.load('voc/2012', batch_size=32, with_info=True, split=splits)

We also get information in the “info” object to understand how the data is structured:

We can notice that the images are in uint8 (0 to 255), so we will need to normalize them. Then, we have the “objects” field with the box annotations and corresponding labels. To build our dataset, we simply need to create a function that returns the normalized images at the right resolution, the one-hot encoded labels and the box coordinates:

@tf.function
def format_image(tensor):
    # normalize the images to [0, 1] and resize them to IMAGE_SIZE
    images = tf.image.resize(tensor['image'], IMAGE_SIZE) / 255.0
    return images, tf.one_hot(tensor['objects']['label'], 20), tensor['objects']['bbox']

# number of examples in the original train split, used to size the shuffle buffer
num_examples = info.splits['train'].num_examples

train_batches = train_examples.shuffle(num_examples // 4).map(format_image).prefetch(1)
validation_batches = validation_examples.map(format_image)
test_batches = test_examples.map(format_image).batch(1)

This simple function is all we need to prepare our dataset for training. We can verify the final shape of the train batches:
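
One way to do that is to peek at a single batch (the exact values depend on IMAGE_SIZE and on the number of objects per image):

# take one batch and print the tensor shapes
for images, labels, boxes in train_batches.take(1):
    print(images.shape, labels.shape, boxes.shape)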

The shapes look good, so the train batches are all set. We will now see how to distribute them over multiple GPUs to speed up the training.

5-2) Distribute a dataset over multiple GPUs

To create a distributed training, TensorFlow provides tf.distribute.Strategy with many strategies explained here: https://www.tensorflow.org/api_docs/python/tf/distribute

Even if we only have a single GPU, we can implement a distributed strategy called OneDeviceStrategy, which places the dataset and model on a single device. Thanks to it, if we get a second GPU in the future, or if someone with multiple GPUs wants to run our code, simply changing OneDeviceStrategy to MirroredStrategy will be enough to train the model on all the available GPUs.

First, we need to find the available GPU and define the strategy:

devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)
gpu_name = "GPU:0"

# Can be changed to MirroredStrategy if multiple GPU available
strategy = tf.distribute.OneDeviceStrategy(device=gpu_name)
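
For reference, if several GPUs are visible, the switch mentioned above is a single line:

# use every visible GPU; replaces the OneDeviceStrategy above
strategy = tf.distribute.MirroredStrategy()
print('Number of replicas in sync:', strategy.num_replicas_in_sync)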

Then, we need to distribute the dataset over all the GPUs:

def distribute_datasets(strategy, train_batches, validation_batches, test_batches):
    train_dist_dataset = strategy.experimental_distribute_dataset(train_batches)
    val_dist_dataset = strategy.experimental_distribute_dataset(validation_batches)
    test_dist_dataset = strategy.experimental_distribute_dataset(test_batches)
    return train_dist_dataset, val_dist_dataset, test_dist_dataset

train_dist_dataset, val_dist_dataset, test_dist_dataset = distribute_datasets(strategy, train_batches, validation_batches, test_batches)

Once our dataset is distributed, we can download the model, customize it and retrain it on our data.

5-3) Customize and retrain TensorFlow models

Once the TensorFlow Object Detection API is installed, we have access to multiple useful methods:

from object_detection.utils import config_util
from object_detection.builders import model_builder

These two modules allow us to set a model’s config (number of output classes, frozen layers, etc.) and to build the model. In the Object Detection API (https://github.com/tensorflow/models/tree/master/research/object_detection), we can see all the available models under models/. We are going to use the model ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.

First, we need to get the model’s config, set the number of classes to 20 and set the freeze_batchnorm attribute:

num_classes = 20
pipeline_config = 'models/research/object_detection/configs/tf2/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.config'

configs = config_util.get_configs_from_pipeline_file(pipeline_config)
model_config = configs['model']
model_config.ssd.num_classes = num_classes
model_config.ssd.freeze_batchnorm = True
detection_model = model_builder.build(
    model_config=model_config, is_training=True)

We also need to restore the weights:

checkpoint_path = 'models/research/object_detection/test_data/checkpoint/ckpt-0'
fake_box_predictor = tf.compat.v2.train.Checkpoint(
    _base_tower_layers_for_heads=detection_model._box_predictor._base_tower_layers_for_heads,
    # _prediction_heads=detection_model._box_predictor._prediction_heads,
    #    (i.e., the classification head that we will not restore)
    _box_prediction_head=detection_model._box_predictor._box_prediction_head,
    )
fake_model = tf.compat.v2.train.Checkpoint(
          _feature_extractor=detection_model._feature_extractor,
          _box_predictor=fake_box_predictor)
ckpt = tf.compat.v2.train.Checkpoint(model=fake_model)
ckpt.restore(checkpoint_path).expect_partial()

Finally, we need to run the model on a fake image to finish the set up:

image, shapes = detection_model.preprocess(tf.zeros([1, 640, 640, 3]))
prediction_dict = detection_model.predict(image, shapes)
_ = detection_model.postprocess(prediction_dict, shapes)

The model is now ready for training with a simple loop based on the usual GradientTape():

with tf.GradientTape() as tape:
    prediction_dict = model.predict(preprocessed_images, shapes)
    losses_dict = model.loss(prediction_dict, shapes)
    total_loss = (losses_dict['Loss/localization_loss']
                  + losses_dict['Loss/classification_loss'])
gradients = tape.gradient(total_loss, vars_to_fine_tune)
optimizer.apply_gradients(zip(gradients, vars_to_fine_tune))
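
To actually run this step under the strategy defined earlier, the per-replica losses have to be reduced into a single value. A minimal sketch, assuming the code above is wrapped in a train_step(images, groundtruth_boxes, groundtruth_classes) function (a name used here for illustration):

@tf.function
def distributed_train_step(images, groundtruth_boxes, groundtruth_classes):
    # run one training step on every replica, then sum the per-replica losses
    per_replica_losses = strategy.run(
        train_step, args=(images, groundtruth_boxes, groundtruth_classes))
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)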

Please see the notebook on my GitHub profile if the full training loop code is needed.

6) Conclusion

In this article, we learned how to add a module to a model to extend its capabilities: in this case, we extended an object detector into a Multi-Object Tracker. We also learned how to use a model from TensorFlow Hub and how to customize and retrain a model from the TensorFlow Object Detection API. Finally, we learned how to parse a dataset with tensorflow_datasets and how to distribute a training over multiple GPUs.


Here you can find my project:

https://github.com/Apiquet/Tracking_SSD_ReID

Video source: coveer