Semantic Understanding of Vehicle Activity

DISCLAIMER: This post is an excerpt from the capstone project report for my bachelor's degree. Harish Kumar, Gokull Subramanian, and I were the contributors.

If you are interested in reading the full project report, it can be found here.

Languages and libraries used: Python, OpenCV, TensorFlow, PyTorch, Keras

Motivation

Autonomous vehicles, as well as vehicles with ADAS features, use multiple sensors and fuse their data to make decisions that assist or automate the driving process. Sensors such as LiDAR and RADAR provide measurements in metric units, which helps in making decisions in real-world environments. However, the high cost of such sensors is a major drawback. In recent years, researchers as well as autonomous vehicle manufacturers have been exploring the feasibility of using only RGB cameras for perception, largely owing to the impact of deep learning on computer vision over the past decade. This project focuses on the perception aspect of autonomous vehicles using data from a single RGB camera.

Objective and Challenges

Our primary objective was to predict the semantic activity of the ego vehicle as well as the other vehicles in the scene. The goal was to build an end-to-end activity classifier based on a video object segmentation approach that could perform object detection, semantic segmentation, and instance-level tracking, and then use that information to predict the activity. The first challenge was to find a single dataset containing ground-truth object labels, pixel-level segmentation labels, and pixel-level optical flow labels. At the time, no such dataset existed.

So the objective was instead achieved using three deep neural networks for three tasks: object detection and tracking, lane detection, and optical flow estimation. Object detection and tracking was used to detect and track the vehicles in the scene. Lane detection was used to understand the context of the road. Optical flow estimation was used to capture the motion of vehicles. The outputs of the three networks were manually integrated to interpret the semantic vehicle activity. The secondary objective was to run this integration on an embedded board; its performance was evaluated and the results are presented.

Proposed Pipeline

Understanding semantic vehicle activity requires information about the objects in the scene, their motion, and the road itself. This information must also be temporal (related across frames), which is why tracking plays an important role. The object detection network outputs the classes of the detected objects and provides bounding boxes to the tracking algorithm as input. The tracking algorithm uses these bounding box coordinates to track the objects across frames. The lane detection algorithm detects all visible lane markings in the scene in a distinct manner so that the ego lane can be identified in every frame. Optical flow estimation provides the flow vector of each pixel in the frame as input to the manual integration step. An abstract model of the proposed pipeline is given in the figures below.
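As a rough illustration of this flow, the sketch below shows a per-frame loop that calls the three networks and the tracker and hands their outputs to an integration step. The wrapper names (run_yolo, run_deep_sort, run_scnn, run_pwcnet, infer_activity) are hypothetical placeholders, not the project's actual API.

```python
# Minimal sketch of the per-frame integration loop. The model wrappers are
# passed in as callables so the loop itself stays self-contained.
import cv2

def process_video(path, run_yolo, run_deep_sort, run_scnn, run_pwcnet,
                  infer_activity):
    cap = cv2.VideoCapture(path)
    ok, prev_frame = cap.read()
    results = []
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        detections = run_yolo(frame)               # classes, boxes, scores
        tracks = run_deep_sort(frame, detections)  # boxes with tracking IDs
        lanes = run_scnn(frame)                    # per-lane pixel coordinates
        flow = run_pwcnet(prev_frame, frame)       # H x W x 2 flow field
        results.append(infer_activity(tracks, lanes, flow))
        prev_frame = frame
    cap.release()
    return results
```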

Figure 3.1: Proposed Pipeline
Figure 3.2: Integration Pipeline

Object Detection and Tracking

Problem Definition

Semantic understanding of vehicle activity demands identification of the vehicles in videos. Object detection can be used to detect objects in each frame, and tracking can be used to associate the detections across frames.

A tracking-by-detection approach was proposed, treating object detection and tracking as two separate tasks. Object detection involves classification and localization of objects in an image. The tracking algorithm then tracks multiple objects across all frames.

Need for the Task

Object detection is an important task in scene perception, and it is performed on every frame for the following reasons:

  • To classify the objects present in the set of pixel areas.
  • To localize the objects using a bounding box in an image.

Tracking is essential for the following reasons:

  • To associate the information acquired from object detection across frames and assign a tracking ID to each object.
  • To track the objects along successive frames until they exit the camera's field of view.
  • To mitigate identity switches caused by occlusion, motion blur, and lighting conditions.

YOLO Network Architecture

YOLO is a single-shot object detector that takes an image as input and outputs a tensor containing all bounding boxes, class predictions, and confidence scores.

To understand the YOLO network architecture, it can be divided into two parts: the YOLO body and the YOLO head. The first part extracts features and is considered the base network. In YOLOv3, Darknet-53 is used as the base network (the YOLO body) to extract features from the input. The extracted features are then passed to the YOLO head, which performs the detection, i.e., the localization and classification of objects.

The YOLO body (Darknet-53) extracts features from an image at three different scales, for detecting small, medium, and large objects accurately, and these features are fed into the YOLO head.

To simplify the Darknet-53 architecture, it can be divided into CONV2D blocks and residual blocks, as shown in the figure.

Figure 3.2: DarkNet-53 Architecture

Each CONV2D block consists of a convolution layer followed by a leaky ReLU layer, as in the figure. A residual block (Res-Block in the figure) is made of one CONV2D block followed by n residual units, where n defines the number of residual units. Each residual unit consists of two CONV2D blocks stacked together, with the input added to their output.
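The following is a minimal PyTorch sketch of these building blocks, not the project's exact code; the batch normalization layer inside the CONV2D block follows the standard Darknet-53 design.

```python
# Sketch of a Darknet-style CONV2D block (conv + batch norm + leaky ReLU)
# and a residual unit of two stacked CONV2D blocks added back to the input.
import torch.nn as nn

class Conv2DBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResidualUnit(nn.Module):
    """A 1x1 bottleneck block followed by a 3x3 block, added to the input."""
    def __init__(self, channels):
        super().__init__()
        self.block1 = Conv2DBlock(channels, channels // 2, kernel_size=1)
        self.block2 = Conv2DBlock(channels // 2, channels, kernel_size=3)

    def forward(self, x):
        return x + self.block2(self.block1(x))
```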

Figure 3.3: YOLO v3 Architecture

YOLOv3 generates output tensors at three different scales containing the bounding boxes, confidence scores, and class predictions. These output tensors are subjected to a few post-processing steps, namely IoU computation and Non-Maximum Suppression (NMS), to remove redundant bounding boxes. Object detection thus yields bounding boxes, class names, and confidence scores; the bounding boxes are then used for tracking.
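A minimal NumPy sketch of this post-processing step is shown below; the IoU threshold is an illustrative default, not necessarily the value used in the project.

```python
# IoU computation and greedy non-maximum suppression over detected boxes.
import numpy as np

def iou(box, boxes):
    """box: [x1, y1, x2, y2]; boxes: (N, 4). Returns IoU of box with each row."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_max_suppression(boxes, scores, iou_thresh=0.45):
    """Keep the highest-scoring box, drop overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep
```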

Deep SORT Algorithm

Deep SORT (Simple Online and Realtime Tracking with a deep association metric), an improved version of SORT, is used as the tracker in our tracking-by-detection approach. It relies on conventional vision algorithms for tracking, but by adding a deep association metric it can sustain tracks through long-term occlusions. The input detections are assumed to be noisy and the camera is assumed to be uncalibrated. The Deep SORT pipeline is shown in Figure 3.4.

Figure 3.4: Deep SORT Pipeline

Integration of Detection and Tracking

In the tracking-by-detection approach, bounding boxes and tracking IDs are taken from the tracker output, but the tracking algorithm generally has no notion of class labels, so the class labels have to be fetched from the YOLO output. The number of bounding boxes from YOLO and from tracking will not always be the same, as the tracker handles false positives and false negatives, so associating class names with tracked bounding boxes is a challenge. To mitigate this problem, all class labels are stored. For each track, the first frame in which it is tracked is found and the class label is taken from the stored list; this class label from the first tracked frame is then carried over to all subsequent frames until the object is lost.
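The following sketch illustrates this association logic; the helper names and the IoU-based matching are illustrative assumptions rather than the project's exact code.

```python
# Assign each track a class the first time it appears, by matching it to the
# YOLO detection whose box overlaps it most, then keep that class for the
# rest of the track's life.
def box_iou(a, b):
    """a, b: (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def assign_classes(tracks, detections, track_classes):
    """tracks: [(track_id, box)]; detections: [(class_name, box)];
    track_classes: dict carried across frames, track_id -> class name."""
    labelled = []
    for track_id, t_box in tracks:
        if track_id not in track_classes and detections:
            best_class, _ = max(detections, key=lambda d: box_iou(t_box, d[1]))
            track_classes[track_id] = best_class
        labelled.append((track_id, track_classes.get(track_id, "unknown"), t_box))
    return labelled
```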

Results

The results of object detection using YOLO are shown below.

Figure 3.6: Output of YOLO

Tracking IDs and bounding boxes are generated by the tracker. The results of tracking alone are shown below.

Figure 3.7: Output of Deep SORT

The results of tracking and object detection are overlaid and shown below.

Figure 3.8: Output of YOLO and Deep SORT

The labels after integrating tracking and detection are shown below. Each label contains the class label from YOLO and the tracking ID and bounding box from the tracker.

Figure 3.9: Integration of Object Detection and Tracking Outputs

Lane Detection

Problem Definition

The deep learning architecture should detect all the lanes and differentiate the ego lane from the other detected lanes despite partial occlusion. It should also be able to predict precise lane curves.

Need for the Task

Interest in developing lane detection solutions has grown with the demand for ADAS and self-driving cars. Drivers depend on lanes not only for safe driving but also as visual cues (e.g., pavement markings) that indicate what is and is not allowed (e.g., lane changes, direction changes). The integration discussed later in the report uses the lane as the object of reference for detecting ego motion, i.e., the motion of the camera on board the vehicle. Lane detection thus plays an important part in analysing vehicle lane change activity.

Preamble of the Network Used

Traditionally, Markov Random Fields and Conditional Random Fields were used to compute spatial relationships. Message passing, another spatial relationship computation process in which each pixel receives information from the pixels around it, is computationally expensive and hard to implement in real time. These methods are generally applied to the output of CNN models, but the top hidden layer contains rich information and could be a better place for the spatial relationship model. The Spatial Convolutional Neural Network (SCNN) offers better run time, and its spatial relationship model runs over this information-rich top layer.

Positive Attributes

The positive attributes of the SCNN network are as follows:

  • The adopted network has the capability to detect lanes despite partial occlusion.
  • SCNN is computationally efficient: message passing is realized in a sequential propagation scheme rather than each pixel receiving information from all the pixels around it.
  • It shows good ability to predict fine lane curves and offers a good balance between speed (fps) and accuracy.
  • It is capable of detecting up to four lanes and differentiates the ego lane from the others.
  • It gives output in the form of pixel coordinates of the detected lane points, which makes integration easier.

Network Architecture

SCNN treats rows or columns of feature maps as layers and applies convolution, nonlinear activation, and sum operations sequentially, forming a deep neural network. This allows information to be passed between neurons in the same layer. The word "spatial" in SCNN denotes propagating spatial information via this specific CNN structure design.

The architecture of the adopted network is shown in Figure 3.10. The network resizes the input image to 800×288 using OpenCV's linear interpolation and sends it as a 3-D tensor of size C × H × W, where C, H, and W denote the number of channels, rows, and columns respectively. The input tensor is passed through the first 13 layers of VGG16, with weights initialized from the VGG16 model, followed by atrous convolution with rate 4, which strikes a good balance between efficiency and accuracy. Fast bilinear interpolation by an additional factor of 8 then recovers the feature maps at the original resolution. The probability maps from the softmax layer are passed to another small network that predicts the existence of lane markings. For lanes whose existence value exceeds 0.5, the network searches every row in the corresponding probability map for the pixel location with the highest response, and these locations are connected by cubic splines. The detected lane pixel coordinates are then overlaid on the input image with different colour codes for the four lanes; the region between the detected blue lane and the detected green lane is the ego lane. The metric evaluation of the network is not within the scope of this project.
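A rough sketch of the per-lane decoding step, assuming one (H, W) probability map and one existence score per lane; the variable shapes, cutoff values, and the SciPy spline are assumptions based on the description above.

```python
# For a lane whose existence score exceeds 0.5, take the highest-response
# column in every row of its probability map and connect the points with a
# cubic spline.
import numpy as np
from scipy.interpolate import CubicSpline

def decode_lane(prob_map, existence, exist_thresh=0.5, min_prob=0.3):
    """prob_map: (H, W) map for one lane; existence: scalar score.
    Returns a list of (x, y) lane points or None if the lane is absent."""
    if existence < exist_thresh:
        return None
    rows, cols = [], []
    for r in range(prob_map.shape[0]):
        c = int(np.argmax(prob_map[r]))
        if prob_map[r, c] > min_prob:  # illustrative cutoff for empty rows
            rows.append(r)
            cols.append(c)
    if len(rows) < 4:
        return None
    spline = CubicSpline(rows, cols)   # column position as a function of row
    return [(int(spline(r)), r) for r in range(min(rows), max(rows) + 1)]
```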

Figure 3.10: SCNN Lane Detection Architecture

Results

The SCNN algorithm was run on video sets downloaded from the internet. The ego lane is the region between the blue and green detected lanes. The algorithm performed impressively, detecting lanes despite partial occlusion and varying lighting conditions. Figure 3.11, Figure 3.12, Figure 3.13, and Figure 3.14 show detection results for lanes under different conditions. A few frames had unsatisfactory results (Figure 3.15), in which two different lane colours were overlaid on the same lane. When the separation between dashed lane markings increases, the algorithm performs unreliably. The eccentricity of the lanes also has a large impact on lane detection.

Figure 3.11: SCNN Result for Straight Lanes
Figure 3.12: SCNN Result for Curved Lanes
Figure 3.13: SCNN Result for Partially Occluded Lanes
Figure 3.14: SCNN Result for Lanes Under Shadow
Figure 3.15: Poor Results from SCNN

Optical Flow Estimation

Problem Definition

Optical flow is the movement of brightness patterns across frames caused by the apparent movement of objects in the real world. The optical flow vector of every pixel in every frame of the acquired video has to be determined by the network.

Need for the Task

Each pixel's flow vector has two components, along the X and Y directions, which are projections of the actual 3-D motion of objects onto the image plane. Estimating this flow is essential because it lets us extract information about the motion of every object in the scene. The motion of the ego vehicle can also be determined using static objects (known beforehand) in the scene.

Preamble of the Network Used

Traditionally, optical flow estimation has been solved using image processing techniques. It was one of the areas of computer vision where deep learning could not make a quick impact, largely due to the lack of ground truth: manual labelling was laborious and time consuming, as it involved labelling the motion of every pixel.

Many traditional algorithms optimized a complex energy function, which was computationally expensive, and assumed brightness constancy and spatial smoothness constraints to predict the optical flow. CNNs were initially used only as components in such algorithms, for tasks like sparse-to-dense interpolation, cost volume construction, and sparse matching. The most recent methods used cost volumes, pyramid creation, and warping, but they were not real-time.

Positive Attributes

The positive attributes of PWC-Net are as follows:

  • PWC-Net uses computationally light CNN layers, cost volumes, and warping compared to energy minimization approaches.
  • It constructs only a partial cost volume, making it more memory and computation efficient.
  • It uses feature pyramids instead of image pyramids, making it more robust to shadows and lighting changes.
  • It combines deep learning with domain knowledge to reduce model size as well as improve performance.

Network Architecture

PWC stands for pyramidal processing, warping, and cost volume, the simple principles around which the method is designed. It can be divided into five major parts: the feature pyramid extractor, the warping layer, the cost volume layer, the optical flow estimator, and the context network.

Figure 3.16: PWC-Net Architecture

Feature pyramid extractor

Two consecutive frames are taken as the two input images, and n-level pyramids of feature representations are created, with the input images at the bottom. Layers of convolutional filters downsample the features by a factor of 2 at each pyramid level. The number of feature channels roughly doubles with each level, from 16 at the first level to 196 at the sixth. A Siamese network is used to encode the two frames, and leaky ReLU is used as the activation function after every convolutional layer.
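A minimal PyTorch sketch of this pyramid idea follows; the layer counts and kernel sizes are simplified assumptions, not the exact PWC-Net layers.

```python
# A shared-weight (Siamese) feature pyramid: each level halves the spatial
# resolution with a strided convolution and increases the channel count.
import torch.nn as nn

class FeaturePyramid(nn.Module):
    def __init__(self, channels=(16, 32, 64, 96, 128, 196)):
        super().__init__()
        levels, in_ch = [], 3
        for out_ch in channels:
            levels.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.LeakyReLU(0.1),
                nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
                nn.LeakyReLU(0.1)))
            in_ch = out_ch
        self.levels = nn.ModuleList(levels)

    def forward(self, x):
        feats = []
        for level in self.levels:
            x = level(x)
            feats.append(x)
        return feats  # level 1 (finest) ... level 6 (coarsest)
```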

Warping layer

At the nth level, bilinear interpolation is used to warp the features of the second frame towards the first frame using the 2× upsampled flow from the (n+1)th level. For backpropagation, the gradients with respect to the flow and the CNN feature inputs are computed.
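A sketch of the warping step using PyTorch's grid_sample, assuming the flow is stored in pixels with channel 0 as the X displacement and channel 1 as the Y displacement.

```python
# Sample the second frame's features at positions displaced by the flow,
# using bilinear interpolation; grid_sample expects a grid in [-1, 1].
import torch
import torch.nn.functional as F

def warp(features, flow):
    """features: (B, C, H, W) from frame 2; flow: (B, 2, H, W) in pixels."""
    b, _, h, w = features.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(features.device)  # (2, H, W)
    target = grid.unsqueeze(0) + flow                # displaced sampling positions
    target_x = 2.0 * target[:, 0] / max(w - 1, 1) - 1.0   # normalise x to [-1, 1]
    target_y = 2.0 * target[:, 1] / max(h - 1, 1) - 1.0   # normalise y to [-1, 1]
    sample_grid = torch.stack((target_x, target_y), dim=3)  # (B, H, W, 2)
    return F.grid_sample(features, sample_grid, mode="bilinear",
                         align_corners=True)
```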

Cost volume layer

The cost volume defines the range of the search for corresponding features between consecutive frames and stores the cost of matching them. It is defined as the correlation between the features of the first frame and the warped features of the second frame. An important point is that motion at the top of the pyramid is amplified as the level decreases: a small motion at the top can become a significant motion at the full resolution.
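A sketch of a partial cost volume as a channel-normalized correlation over a small search window; the search radius is an illustrative value.

```python
# For each pixel, correlate the first frame's feature vector with the warped
# second-frame features over a (2d+1) x (2d+1) neighbourhood.
import torch
import torch.nn.functional as F

def cost_volume(feat1, feat2_warped, max_disp=4):
    """feat1, feat2_warped: (B, C, H, W). Returns (B, (2d+1)^2, H, W)."""
    b, c, h, w = feat1.shape
    padded = F.pad(feat2_warped, [max_disp] * 4)
    costs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            costs.append((feat1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)
```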

Optical flow estimator

The optical flow estimator is a CNN in its own right; its architecture is shown in Figure 3.17, and it operates at the second level of the pyramid. Here too, leaky ReLU is used as the activation function after each convolutional layer. The cost volume, the first image's features, and the upsampled optical flow are the inputs to this network. The number of feature channels decreases from 128 to 32 across its layers. The final layer has no activation function, as it outputs the optical flow at the nth level.

Figure 3.17: Optical Flow Estimator Network

Context network

The context network is used to post-process the flow (Figure 3.18). It is also applied at the second pyramid level, and a leaky ReLU follows each convolutional layer. Its inputs are the estimated optical flow and the features of the penultimate layer of the optical flow estimator. The dilation constants given at the end control the separation between input units in the horizontal and vertical directions. The output of the context network is the refined optical flow.

Figure 3.18: Context Network

Results

Each pixel has a 2-D flow vector, which can be converted to polar coordinates for visualization. The visualization is colour coded such that hue represents the direction and saturation represents the magnitude of the flow vector, as shown in Figure 3.19. The optical flow output for a sample frame is shown in Figure 3.20.
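A small OpenCV sketch of this colour coding, following the common HSV visualization pattern:

```python
# Convert a dense flow field to a colour image: hue encodes direction,
# saturation encodes magnitude.
import cv2
import numpy as np

def flow_to_color(flow):
    """flow: (H, W, 2) float32 array of per-pixel displacements."""
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((flow.shape[0], flow.shape[1], 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)        # direction
    hsv[..., 1] = cv2.normalize(mag, None, 0, 255,
                                cv2.NORM_MINMAX).astype(np.uint8)  # magnitude
    hsv[..., 2] = 255
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```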

Figure 3.19: Colour Coding for Optical Flow Visualization
Figure 3.20: Sample Frame and its Optical Flow Output from PWC-Net

Integration and Inference

Object detection and tracking, lane detection, and optical flow estimation cannot provide meaningful data on their own. Their outputs have to be combined to extract meaningful information about vehicle activity and produce the final labels. This integration of the outputs of the three networks mainly involves image processing techniques and pixel-level processing.

Metric labels involve quantifying numerical parameters such as distance and speed, whereas semantic labels involve purely qualitative labelling. Semantics can only indicate whether a change is happening or not; it cannot say how fast or how slow the change is, as that would all be relative. The work carried out in this project extracts semantic labels for vehicle activity, since metric labels cannot be obtained reliably from data from a single camera. There are techniques to estimate depth and obtain metric distances to objects in the scene, but they are not reliable and robust enough with a single camera. Moreover, the recent trend in autonomous driving is to do more with a single camera, and manufacturers such as Tesla are aspiring towards it. Hence, the labels given for every vehicle are semantic, and the details are illustrated in Figure 4.1.

Figure 4.1 shows how the final label for each vehicle is obtained. The final label consists of two parts.

  • The first label describes the motion of the ego vehicle itself, whether it is at rest or moving, and is displayed at the bottom of each frame.
  • The second label is given for each vehicle and follows the format: class name - tracking ID - motion along the road - motion across the road.

The class name of the object is the first part of the label and is obtained from YOLOv3. The tracking ID for the corresponding object is obtained from Deep SORT and appended to the class name. The labels describing vehicle motion along the road and across the road are appended at the end.
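A small sketch of how such a two-part label could be assembled; the exact string format is assumed from the description above, not taken from the project code.

```python
# Assemble the per-frame ego label and the per-vehicle labels.
def vehicle_label(class_name, track_id, along_road, across_road):
    return f"{class_name} - {track_id} - {along_road} - {across_road}"

def frame_labels(ego_moving, vehicles):
    """vehicles: list of (class_name, track_id, along_road, across_road)."""
    ego_label = "ego vehicle: moving" if ego_moving else "ego vehicle: at rest"
    return ego_label, [vehicle_label(*v) for v in vehicles]

# Example: vehicle_label("car", 7, "forward", "no change")
# -> "car - 7 - forward - no change"
```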

Figure 4.1: Final Output Labels

Vehicle Motion Along the Road

The third part of the final label for each vehicle is its motion along the road. This is where the results from optical flow matter most. The decision tree for predicting vehicle motion along the road is shown in Figure 4.2.

Figure 4.2: Decision Tree for Prediction of Vehicle Motion Along the Road

Reference Frame

First of all, to define any motion, a reference frame has to be set. In this case, the lane markings on the road are considered as the reference for all motion as they are stationary. The ego vehicle motion and motion of the vehicles on the scene are predicted with reference to these lane markings.

Decision Tree Level 0

If the ego vehicle is at rest, the ego lane markings detected by the lane detection network will also be stationary, so their flow vectors will have zero magnitude. If the vehicle is in motion, the lane markings will have a considerable amount of flow in the negative Y direction.

Decision Tree Level 1

Once the ego vehicle's motion is decided, there are three possibilities of motion for each vehicle in the scene in either case: rest, forward, and oncoming. For this level, a relative optical flow value is obtained by subtracting the Y component of the flow vector of the ego lane pixels from the Y component of the flow vector of the object.

Note that in both levels of the decision tree, only the average flow vector of the ego lane markings is considered, as the other lanes are too eccentric and reduce the average magnitude of the lane flow vector. The relative flow value obtained is used to decide the motion of the vehicle: it can be almost equal to zero, or have a positive or negative value (high or low), corresponding to the cases presented in Figure 4.2.
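The sketch below illustrates these two decision levels. The thresholds and the mapping of the relative flow's sign to the "forward" and "oncoming" labels are illustrative assumptions; the actual decisions follow Figure 4.2.

```python
# Level 0: decide ego motion from the ego lane flow.
# Level 1: decide each vehicle's motion from its flow relative to the lane.
import numpy as np

EGO_STILL_THRESH = 0.5  # assumed: mean |Y flow| of ego lane below this => at rest
REL_FLOW_THRESH = 1.0   # assumed: |relative Y flow| below this => no relative motion

def classify_along_road(lane_flow_y, object_flow_y):
    """lane_flow_y: Y flow of ego-lane marking pixels;
    object_flow_y: Y flow of pixels inside the vehicle's bounding box."""
    ego_flow = float(np.mean(lane_flow_y))
    ego_label = "moving" if abs(ego_flow) > EGO_STILL_THRESH else "at rest"
    relative = float(np.mean(object_flow_y)) - ego_flow
    if abs(relative) < REL_FLOW_THRESH:
        vehicle_label = "at rest" if ego_label == "at rest" else "forward"
    elif relative < 0:
        vehicle_label = "forward"   # sign-to-label mapping assumed
    else:
        vehicle_label = "oncoming"
    return ego_label, vehicle_label
```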

Vehicle Motion Across the Road

The fourth and final part of the label for each vehicle describes its motion across the lane. Due to perspective projection, even if a vehicle moves away from the ego vehicle without changing lanes, there is still a component of optical flow along the X direction, which makes optical flow ambiguous for this case. A different approach is therefore taken, illustrated in Figure 4.4.

Figure 4.4: Lane Change Condition

Two conditions have to be satisfied to assign a lane change label. First, the point of intersection of the lane marking and the bottom edge of the object's bounding box should lie between the left and right bottom corners of the box; this indicates that the vehicle is not travelling within a single lane. Second, the point of intersection should move significantly towards either the left or the right corner between successive frames, and labels such as "left to right" or "right to left" are given accordingly. If there is no significant movement, "no change" is assigned as the label. The various cases for lane change prediction are shown in Figure 4.3.
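The sketch below illustrates these two conditions geometrically. The drift threshold and the mapping of drift direction to the "left to right" and "right to left" labels are assumptions; the actual cases follow Figure 4.3.

```python
# Condition 1: the lane marking crosses the bottom edge of the bounding box
# between its corners. Condition 2: the crossing point drifts towards one
# corner across successive frames.
def lane_position_in_box(lane_points, box):
    """lane_points: list of (x, y) lane pixels; box: (x1, y1, x2, y2).
    Returns the fraction (0..1) of the bottom edge where the lane crosses it,
    or None if it does not cross between the two bottom corners."""
    x1, _, x2, y2 = box
    candidates = [(abs(y - y2), x) for x, y in lane_points if x1 <= x <= x2]
    if not candidates:
        return None
    ix = min(candidates)[1]
    return (ix - x1) / max(x2 - x1, 1)

def classify_across_road(prev_pos, curr_pos, drift_thresh=0.05):
    # drift_thresh and the direction-to-label mapping are assumed values
    if prev_pos is None or curr_pos is None:
        return "no change"
    drift = curr_pos - prev_pos
    if drift > drift_thresh:
        return "right to left"
    if drift < -drift_thresh:
        return "left to right"
    return "no change"
```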

Figure 4.3: Cases for Lane Changing Prediction

In any case, the tracking ID will always be available in the final label. The class name might be missing in a few frames, but this is rare, as YOLO detects almost all objects accurately across frames. The labels for vehicle motion along and across the road are present in all frames, as the decision logic covers all cases. The final output with labels for a few sample frames is shown in Figure 4.5.

Figure 4.5: Final Results with Labels

Inferencing

Inferencing for the final integration was carried out on both the embedded platform and a PC, using video sets obtained from internet sources. The processing was done frame by frame. Since the PC offers external GPU support, inferencing was much faster on the PC.

Inferencing on Embedded Platform

A Raspberry Pi 3B (Table 4.1) was chosen as the primary processor, assisted by the Intel Neural Compute Stick 2 (NCS-2) to run the deep neural networks (Figure 4.6). The Intel NCS-2 is powered by Intel's Vision Processing Unit (VPU), the Intel Movidius Myriad X, which includes an on-chip neural network accelerator called the Neural Compute Engine. It has 16 programmable SHAVE (Streaming Hybrid Architecture Vector Engine) cores that accelerate deep learning inference. Deploying the models on the NCS required installing the Intel OpenVINO toolkit and converting the checkpoint files to XML and binary files.

Table 4.1: Embedded Platform Specifications

Device model: Raspberry Pi 3B
Central Processing Unit (CPU): 4 × ARM Cortex-A53, 1.2 GHz
GPU: Broadcom VideoCore IV
RAM: 1 GB LPDDR2 (900 MHz)
Storage: 32 GB micro SD card, Class 10
Power supply: 6 V, 2-3 A
Operating system: Raspbian Stretch
Figure 4.6: Embedded Hardware with Intel NCS-2

The NCS is a relatively new device designed specifically for inferencing deep neural networks. Some neural network layers, such as argmax, are not supported by it for reasons that are not yet clear, and online community support is still limited. It has a USB form factor and can be plugged directly into the Raspberry Pi.

YOLOv3 ran at 2 fps on the NCS-2. Due to the lack of support for some layers, PWC-Net and SCNN could not be deployed on it; future work could modify these layers to suit the NCS.

Inferencing on PC

The integration script was run on a PC, reading the video and processing it frame by frame. Each network is called for every frame and its output data is collected. The CPU and GPU specifications for inference are the same as those mentioned in Chapter 3 (Table 3.1, Table 3.2). The speed of inference was 5 fps.

Table 3.1: CPU Specifications

Motherboard: ASUS A68HM-K
Processor: AMD A6-7400K Radeon R5, 6 Compute Cores 2C+4G
Frequency: 3500 MHz
Datapath width: 64-bit
System memory: DIMM DDR3 16 GB (2 × 8 GB), 600 MHz
Storage: Kingston A400 120 GB SATA 3 2.5" Solid State Drive

Table 3.2: GPU Specifications

Number of CUDA cores: 4352
Number of tensor cores: 544
Single precision performance: 13.4 TFLOPS
Memory: 11 GB GDDR6
Memory speed: 14 Gbps
Base clock: 1350 MHz
Boost clock: 1545 MHz

Conclusion and Future Scope

The objective of the project was to interpret the semantic activity of the vehicles in the scene. Initially, drivable area detection was chosen as the task for understanding the road environment, but ego lane detection later proved to be more efficient and computationally less expensive. The three deep neural network based algorithms were chosen from different research articles on the basis of run time, model complexity, and accuracy. The outputs of the three networks are integrated to predict the semantic vehicle activity.

The inferred labels were accurate for most test cases, but some videos had very noisy output, i.e., the labels varied drastically. The result of the integration was heavily dependent on the accuracy of the three networks. In some frames, YOLO was not able to detect an object while Deep SORT was still able to predict its bounding box; as a result, the class was not identified in those frames. The output of the ego lane detection was affected by eccentric lane markings, and the lane detection algorithm struggles when the separation between dashed lane markings increases.

The embedded implementation was carried out on a Raspberry Pi supported by an Intel NCS-2 to boost the inference process. The processing frame rate was lower than expected, which can be attributed to the complexity of the neural networks and the limitations of the processor. One major setback was that some layers of the neural networks were not supported by the Intel NCS-2, so some models could not be run on it.

This project work can be further enhanced in the following ways:

  • Smoothing the output of the integration to make it less noisy.
  • Training the networks for better accuracy, preferably with Indian road data.
  • Making the DNN pipeline end-to-end instead of relying on manual integration.
  • Achieving true real-time performance on an embedded board.