Using Kalman filtering in video content analysis

The transition from analog to digital video is bringing long-awaited benefits to security systems, largely because digital compression allows more image data to be transmitted and stored. New advances come with a price, however.

Digital video encourages the deployment of more cameras, but that requires more personnel to monitor the cameras. Storing the video can reduce the amount to be reviewed, since the motion vectors and detectors that are used in compression can be used to eliminate the frames with no significant activity. However, since motion vectors and detectors offer no information as to what is occurring, someone must physically screen the captured video to determine suspicious activity.

As a result, there is a push to develop methods that will significantly increase the effectiveness of monitoring security and surveillance video. Video content analysis (VCA), also known as video analytics, electronically recognizes the significant features within a series of frames and allows systems to issue alerts when specific types of events occur, speeding real-time security response. VCA automatically searches the captured video for specific content, relieving personnel from tedious hours of reviewing and decreasing the number of personnel needed to screen camera video.

VCA techniques are continually being developed to make widespread implementation feasible in the years ahead. One certainty is that VCA will require a great deal of processing to identify objects of interest in the vast stream of video pixel data. In addition, VCA systems must be programmable to meet variations in application, recognize different content and adapt to evolving algorithms.

Newly available video processors provide an exceptionally high level of performance and programming flexibility for compression, VCA and other digital video requirements. Software platforms and tools that complement the processors simplify development for security and surveillance products. As VCA techniques develop, they can be readily implemented into the enabling technology.

VCA flow
There is no international standard for VCA, but this is the generic flow:

1. A longer sequence is separated into individual scenes or shots that are to be analyzed. Since different scenes have different histograms, or color frequency distributions, a frame in which the histogram changes radically from that of the previous frame can be treated as a scene change.

2. Changing foreground objects within the scene are detected as separate from the static background.

3. Individual foreground objects are extracted or segmented, then tracked from frame to frame. Tracking involves detecting the position and speed of the object.

4. If recognition is necessary, the features of the object are extracted so the object can be classified.

5. If the event is something of interest, an alert is issued to the management software and/or personnel.
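The histogram test in step 1 can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists of 0-255 values; the function names, the bin count and the 0.5 threshold are illustrative choices, not values from the article.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of a frame given as rows of pixel values."""
    counts = [0] * bins
    total = 0
    for row in frame:
        for value in row:
            counts[value * bins // max_value] += 1
            total += 1
    return [c / total for c in counts]


def is_scene_change(prev_frame, curr_frame, threshold=0.5):
    """Flag a scene change when consecutive histograms differ radically.

    The distance is half the sum of absolute bin differences, so it ranges
    from 0.0 (identical distributions) to 1.0 (no overlap at all).
    The threshold is an assumed, hand-picked value.
    """
    h_prev = histogram(prev_frame)
    h_curr = histogram(curr_frame)
    distance = 0.5 * sum(abs(a - b) for a, b in zip(h_prev, h_curr))
    return distance > threshold


dark = [[10, 12], [11, 13]]
dark_shifted = [[12, 10], [13, 11]]   # same scene, pixels rearranged
bright = [[240, 250], [245, 235]]     # radically different distribution

same_scene = is_scene_change(dark, dark_shifted)
new_scene = is_scene_change(dark, bright)
```

Because only the distribution of values is compared, motion within a scene leaves the histogram largely unchanged, while a cut to a different scene shifts it sharply.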

Foreground/background detection
VCA is built on the ability to detect activity that changes in the foreground against a generally static and uninteresting background. In the past, foreground/background detection was computationally limited. Today, higher-performance digital signal processors and video processors make it possible to execute more-complex detection algorithms.

In general, there are two methods of foreground/background detection: non-adaptive methods, which use only a few video frames and do not maintain a background model, and adaptive methods, which maintain a background model that evolves over time. In adaptive VCA algorithms, feedback from steps 2 through 4 of the VCA flow is sent to update and maintain the background model, which is then used as input for step 1.

Non-adaptive detection
In the simplest non-adaptive case, each pixel in the previous frame is subtracted from the corresponding pixel in the current frame in order to determine the absolute difference. The pixel's absolute difference is then compared with a predetermined threshold value that represents a “zero” level after compensating for noise in the scene and from the imager. If the absolute difference exceeds the threshold, the corresponding pixel belongs to the foreground. Otherwise, the pixel belongs to the background.
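As a rough sketch of this frame-differencing test, assuming grayscale frames held as nested lists and an arbitrary threshold of 20 (both are assumptions for illustration):

```python
def frame_difference_mask(prev_frame, curr_frame, threshold=20):
    """Return a binary mask: 1 where |current - previous| exceeds the threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


previous = [[100, 100, 100],
            [100, 100, 100]]
current = [[102, 180, 100],
           [99, 175, 101]]

# Small changes (noise) stay background; the large change in the
# middle column is marked as foreground.
mask = frame_difference_mask(previous, current)
```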

Short-term video object tracking and recognition in a controlled environment is possible using three frames. Even so, non-adaptive methods are useful only in highly supervised, short-term tracking applications without significant changes in the video scene. When scene or background changes occur, manual re-initialization is required. If not, errors accumulate over time, making the results unreliable.

Adaptive detection
Because of the limitations of non-adaptive methods, adaptive foreground and background detection is being implemented in VCA applications. Adaptive detection maintains a background model that is continuously updated by blending in data from every new video frame. Adaptive methods require more processing than non-adaptive methods, and the sophistication of the background model can vary.

In a basic adaptive method, the algorithm subtracts the background model pixel-by-pixel from the current frame to determine the foreground (as opposed to the non-adaptive algorithm's subtraction of subsequent frames). Results are also fed back into the model, thus adapting it to continual background changes without the need to reset. This method is effective for many video surveillance scenarios in which objects are constantly moving or background noises are present a significant portion of the time.
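One common way to realize such a model is an exponential running average, sketched below. The class name, the blending factor alpha and the threshold are assumptions made for illustration; the article does not specify how the blending is done.

```python
class RunningAverageBackground:
    """Background model kept as a per-pixel exponential running average."""

    def __init__(self, first_frame, alpha=0.05, threshold=20):
        self.model = [[float(v) for v in row] for row in first_frame]
        self.alpha = alpha          # assumed blending weight for each new frame
        self.threshold = threshold  # assumed foreground decision threshold

    def apply(self, frame):
        """Return the foreground mask for frame, then blend frame into the model."""
        mask = []
        for y, row in enumerate(frame):
            mask_row = []
            for x, value in enumerate(row):
                background = self.model[y][x]
                # Subtract the model (not the previous frame) from the current pixel.
                mask_row.append(1 if abs(value - background) > self.threshold else 0)
                # Feed the new data back into the model.
                self.model[y][x] = (1 - self.alpha) * background + self.alpha * value
            mask.append(mask_row)
        return mask


detector = RunningAverageBackground([[100, 100]], alpha=0.5)
first_mask = detector.apply([[100, 200]])   # second pixel jumps to 200
```

A small alpha makes the model adapt slowly, so passing objects barely disturb it; a large alpha absorbs changes quickly but can also absorb slow-moving objects into the background.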

More complex foreground/background detection is based on a statistical background model in which every background pixel in a given frame is modeled as a random variable that follows a Gaussian distribution. The mean and standard deviation of each individual pixel evolve over time, based on video data from every frame.
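A single-Gaussian version of this idea might look like the following sketch for one pixel. The learning rate, the initial standard deviation, the 2.5-sigma decision rule and the choice to update the statistics only when the pixel matches the background are all assumptions, not details given in the article.

```python
import math


class GaussianPixelModel:
    """Running Gaussian (mean, variance) estimate for one background pixel."""

    def __init__(self, initial_value, initial_std=10.0, rate=0.05, k=2.5):
        self.mean = float(initial_value)
        self.variance = initial_std ** 2
        self.rate = rate  # assumed learning rate
        self.k = k        # assumed number of standard deviations for the test

    def update(self, value):
        """Classify value, adapt the model if it is background, return True for foreground."""
        deviation = value - self.mean
        is_foreground = abs(deviation) > self.k * math.sqrt(self.variance)
        if not is_foreground:
            # Let the mean and variance evolve with the new observation.
            self.mean += self.rate * deviation
            self.variance += self.rate * (deviation ** 2 - self.variance)
        return is_foreground


pixel = GaussianPixelModel(100)
results = [pixel.update(v) for v in (101, 99, 102, 180)]
```

Small fluctuations around 100 are absorbed into the statistics, while the jump to 180 falls far outside the learned distribution and is reported as foreground.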

Object tracking/recognition
After foreground/background detection, a mask is created. All the parts of a single object may not be connected, because of environmental noise, so a computationally intensive process of morphological dilation is implemented before connecting all the parts as a whole object. Dilation involves imposing a grid on the mask, counting foreground pixels in each area of the grid and turning on the rest of the pixels in each area where the count indicates that separated objects should be connected. After dilation and component connection, a bounding box is derived for each object. The box represents the smallest rectangle containing the entire object as it might appear in different frames, resulting in segmentation.
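The grid-based dilation, component connection and bounding-box steps can be sketched as follows. The cell size, the minimum count, the use of 4-connectivity and the function names are assumptions chosen to keep the example small; boxes are returned as inclusive pixel coordinates (left, top, right, bottom).

```python
from collections import deque


def grid_dilate(mask, cell=2, min_count=1):
    """Turn on every pixel of each grid cell holding at least min_count foreground pixels."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            ys = range(top, min(top + cell, height))
            xs = range(left, min(left + cell, width))
            if sum(mask[y][x] for y in ys for x in xs) >= min_count:
                for y in ys:
                    for x in xs:
                        out[y][x] = 1
    return out


def bounding_boxes(mask):
    """Find 4-connected foreground regions and return one bounding box per region."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            left = right = x0
            top = bottom = y0
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes


raw_mask = [
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
fragments = bounding_boxes(raw_mask)             # three isolated pixels
objects = bounding_boxes(grid_dilate(raw_mask))  # two merged objects
```

In this toy mask the two fragments in the top rows merge into one object after dilation, while the distant pixel in the bottom-right corner remains a separate object.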

Tracking segmented foreground objects involves three steps: predicting where each object should be located for the current frame, determining which object best matches its description, and correcting the object trajectories to predict the next frame. The first and third steps are accomplished by means of a recursive Kalman filter. Since only the object's position can be observed in a single frame, it is necessary to calculate its speed and next position instantaneously using matrix computations.

At the start of the process, the filter is initialized to the foreground object's position relative to the background model. For every frame in which the object is tracked, the filter predicts the relative position of the foreground object in the succeeding frame. When the scene moves to the succeeding frame, the filter locates the object and corrects the trajectory.
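The predict and correct cycle can be sketched with a constant-velocity Kalman filter for a single image axis. The constant-velocity motion model, the noise values and the class name are assumptions made for illustration; the 2x2 matrix arithmetic is written out by hand so the example needs no external library.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one image axis with state [position, velocity]."""

    def __init__(self, position, position_var=1.0, velocity_var=100.0,
                 process_var=0.01, measurement_var=1.0):
        self.pos = float(position)  # initialized to the first observed position
        self.vel = 0.0              # velocity is unknown at the start
        # Covariance matrix P (2x2) stored as four scalars.
        self.p00, self.p01 = position_var, 0.0
        self.p10, self.p11 = 0.0, velocity_var
        self.q = process_var        # assumed process noise on each diagonal term
        self.r = measurement_var    # assumed noise of the detected position

    def predict(self, dt=1.0):
        """Project the state one frame ahead: x = F x, P = F P F^T + Q."""
        self.pos += self.vel * dt
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.pos

    def correct(self, measured_position):
        """Correct the prediction with the observed position (H = [1, 0])."""
        residual = measured_position - self.pos
        s = self.p00 + self.r                 # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s   # Kalman gain
        self.pos += k0 * residual
        self.vel += k1 * residual
        # P = (I - K H) P
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


# An object observed at x = 5, 10, ..., 50 (moving 5 pixels per frame).
track_x = ConstantVelocityKalman1D(position=0.0)
for observed_x in range(5, 55, 5):
    track_x.predict()
    track_x.correct(observed_x)
next_x = track_x.predict()  # predicted position in the succeeding frame
```

Although only positions are observed, the filter recovers a velocity estimate close to 5 pixels per frame and uses it to predict the next position. A two-dimensional tracker could run one such filter per axis, or use a single four-state filter.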

The second step in tracking involves data association, which determines the correspondence of objects across frames based on similarities in features. Object size, shape and location can be based on the bounding boxes and their overlap from frame to frame. Velocity is a matter of prediction by the Kalman filter, and histograms associate different objects with their colors. However, any or all of these features can change.
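Bounding-box overlap is the simplest of these cues to demonstrate. The sketch below pairs predicted track boxes with newly detected boxes by intersection over union (IoU), using a greedy best-first match. The greedy strategy, the 0.3 minimum overlap and the function names are assumptions; boxes are (left, top, right, bottom) in continuous coordinates.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (left, top, right, bottom)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def associate(predicted, detected, min_iou=0.3):
    """Greedily match track ids to detection indices in order of descending IoU."""
    candidates = sorted(
        ((iou(box, det), track_id, det_index)
         for track_id, box in predicted.items()
         for det_index, det in enumerate(detected)),
        reverse=True,
    )
    matches, used_detections = {}, set()
    for score, track_id, det_index in candidates:
        if score < min_iou:
            break  # remaining candidates overlap too little to be trusted
        if track_id in matches or det_index in used_detections:
            continue
        matches[track_id] = det_index
        used_detections.add(det_index)
    return matches


predicted = {"truck": (10, 10, 50, 40), "person": (100, 20, 115, 60)}
detected = [(102, 22, 117, 62), (14, 12, 54, 42)]
matches = associate(predicted, detected)
```

A fuller implementation would combine overlap with the other features mentioned above, such as color histograms and predicted velocity, so that a match survives when any single feature changes.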

Consider the case of a white truck with a red cab that approaches the camera along the street, pulls into a driveway, reverses and drives away in the opposite direction. All of the features of the object have changed in the course of the scene: size, shape, speed and color. The software must be able to accommodate these changes in order to identify the truck accurately. Additionally, when multiple objects are being tracked, the software must be able to distinguish features among them.

The complexities of tracking lead to problems associated with classifying objects. For instance, it is easier for the system to issue an alert when any object crosses a line in front of the camera than to issue one only when a human being crosses the line. The dimensions of the object and its speed can provide a vector for rough classification, but more information is required for finer classification.
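The simpler of the two cases, detecting that any tracked object has crossed a line, needs only the object's centroid in two consecutive frames. The sketch below treats the line as infinite and tests for a change of side using a cross product; the function names and coordinates are illustrative assumptions.

```python
def side_of_line(point, line_start, line_end):
    """Cross-product sign: positive on one side, negative on the other, 0 on the line."""
    (px, py), (ax, ay), (bx, by) = point, line_start, line_end
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def crossed_line(prev_centroid, curr_centroid, line_start, line_end):
    """True when consecutive centroids lie strictly on opposite sides of the line."""
    before = side_of_line(prev_centroid, line_start, line_end)
    after = side_of_line(curr_centroid, line_start, line_end)
    return before * after < 0


# A horizontal virtual line at y = 50 across the field of view.
line_a, line_b = (0, 50), (100, 50)
alert = crossed_line((40, 45), (42, 55), line_a, line_b)     # moves across the line
no_alert = crossed_line((40, 45), (45, 47), line_a, line_b)  # stays on one side
```

Nothing in this test says what kind of object crossed, which is why restricting the alert to human beings requires the additional classification discussed here.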

A larger object provides more pixel information, though possibly too much for fast classification. In this case, dimensional reduction techniques are required for real-time response, even though later investigation may use the full pixel information available in the stored frames.

Effective VCA implementation must overcome a number of challenges other than object classification. These include changes in light levels resulting from nightfall, water surfaces, clouds, wind in trees, rain, snow and fog; tracking the paths of objects that cross, causing the foreground pixels of each to merge briefly and then separate; and tracking objects from view to view in multiple-camera systems. Solving these problems is still a work in progress in VCA.

VCA system design
Implementing VCA and video encoding requires a high-performance processor that suits varied deployments. The emergence of new analytic techniques demands programming flexibility, which can be addressed with processors that combine high-performance programmable DSP and RISC microprocessor cores with video hardware co-processors. The right processor also needs to integrate high-speed communication peripherals and video signal chains to reduce system component counts and costs.

Using this type of solution to integrate VCA within a camera offers a robust, efficient form of network implementation. VCA software can also be integrated within PCs that serve as concentration units for multiple cameras. In addition to the VCA flow itself, there may be a need for preprocessing steps that handle de-interlacing before the foreground/background detection and other analytic steps.

The application software may add processing steps for object recognition or other purposes. Both one- and two-processor design versions provide headroom for additional software functions.

Adaptive methods of separating foreground objects from the background, then tracking objects and, if necessary, classifying suspicious activities are all aspects of VCA that require a high level of real-time processing computation and adaptability. DSP-based video processors offer the performance needed for VCA and video encoding, along with programming flexibility that can adapt to changes in application requirements and techniques.

The net effect is the raising of video security to a new level.

Cheng Peng is a DSP video applications engineer at Texas Instruments. Peng received his PhD in electrical engineering at Texas A&M University and joined TI in August 2002.

A version of this article has been previously published on Signal Processing DesignLine
