Vision-based artificial intelligence brings awareness to surveillance

Recent events showcase both the tantalizing potential and the current underwhelming reality of automated surveillance technology. Consider, for example, the terrorist bombing at the finish line of the April 15, 2013 Boston Marathon. Boylston Street was full of cameras – those permanently installed by law enforcement organizations and businesses, and those carried by race spectators and participants. But none of them was able to detect the impending threat – the intentionally abandoned backpacks, each containing a pressure-cooker bomb – with sufficient advance notice to prevent the tragedy. And due to both the slow speed and low accuracy of the alternative computer-based image analysis algorithms, the flood of video footage was predominantly analyzed by the eyes of police department and FBI personnel attempting to identify and locate the perpetrators.

Consider, too, the military presence in Afghanistan and elsewhere, as well as the ongoing threat to U.S. embassies and other facilities around the world. Only a limited number of human surveillance personnel are available to look out for terrorist activities such as the installation of IEDs (improvised explosive devices) and other ordnance, the congregation and movement of enemy forces, and the like. These human surveillance assets are further hampered by fundamental human shortcomings such as distraction and fatigue.

Computers, on the other hand, don't get sidetracked, and they don't need sleep. More generally, an abundance of ongoing case studies, domestic and international alike, provides ideal opportunities to harness the analysis assistance that computer vision processing can deliver.

For example, automated analytics algorithms are able to sift through an abundance of security camera footage in order to pinpoint an object left at a scene and containing an explosive device, cash, contraband, or other contents of interest to investigators. After capturing facial features and other details of the person(s) who left the object, analytics algorithms can also index both public (Facebook, Google Image Search, etc.) and private (CIA, FBI, etc.) image databases in order to rapidly identify the suspect(s).

Unfortunately, left-object detection, facial recognition, and other related technologies haven't historically been sufficiently mature to be relied upon with high confidence, especially in non-ideal usage settings, such as when individuals aren't looking directly at the lens or are obscured by shadows or other challenging lighting conditions. As a result, human eyes and brains were traditionally relied upon for video analysis instead of computer code, delaying suspect identification and pursuit and increasing the possibility of error (false positives, missed detections, etc.). Such automated surveillance technology shortcomings are rapidly being surmounted, however, as cameras (and the image sensors contained within them) become more feature-rich, as the processors analyzing the video outputs increase in performance, and as the associated software becomes more robust.

As these and other key system building blocks such as memory devices decrease in cost and power consumption, opportunities for surveillance applications are rapidly expanding beyond traditional law enforcement into new markets such as business analytics and consumer-tailored surveillance systems, as well as smart building and smart city initiatives. To facilitate these trends, an alliance of hardware and software component suppliers, product manufacturers, and system integrators has emerged to accelerate the availability and adoption of intelligent surveillance systems and other embedded vision processing opportunities.

How do artificial intelligence and embedded vision processing intersect? Answering this question begins with a few definitions. Computer vision is a broad, interdisciplinary field that attempts to extract useful information from visual inputs by analyzing images and other raw sensor data.

The term “embedded vision” refers to the use of computer vision in embedded systems, mobile devices, PCs, and the cloud. Historically, image analysis techniques have only been implemented in complex and expensive, therefore niche, surveillance systems. However, the previously mentioned cost, performance, and power consumption advances are now paving the way for the proliferation of embedded vision into diverse surveillance and other applications.

Automated surveillance capabilities
In recent years, digital equipment has rapidly entered the surveillance industry, which was previously dominated by analog cameras and tape recorders. Networked digital cameras, video recorders, and servers have not only improved in quality and utility, but they have also become more affordable. Vision processing has added artificial intelligence to surveillance networks, enabling “aware” systems that help protect property, manage the flow of traffic, and even improve operational efficiency in retail stores. In fact, vision processing is helping to fundamentally change how the industry operates, allowing it to deploy people and other resources more intelligently while expanding and enhancing situational awareness. At the heart of these capabilities are vision algorithms and applications, commonly referred to as video analytics, which vary broadly in definition, sophistication, and implementation (Figure 1).




Figure 1: Video analytics is a broad application category referencing numerous image analysis functions, varying in definition, sophistication, and implementation.

Motion detection, as its name implies, allows surveillance equipment to automatically signal an alert when frame-to-frame video changes are noted. As one of the most useful automated surveillance capabilities, motion detection is widely available, even in entry-level digital cameras and video recorders. A historically popular technique to detect motion relies on codecs' motion vectors, a byproduct of the motion estimation employed by video compression standards such as MPEG-2 and H.264.

Because these standards are frequently hardware-accelerated, scene change detection using motion vectors can be efficiently implemented even on modest IP camera processors, needing no additional computing power. However, this technique is susceptible to generating false alarms, because motion vector changes do not always coincide with motion from objects of interest. Using only the motion vector technique, it can be difficult or impossible to ignore unimportant changes such as trees moving in the wind or casting shifting shadows, or to adapt to changing lighting conditions.
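
For illustration, here is a minimal pixel-difference sketch of scene-change detection written against OpenCV's Python API. It is not the compressed-domain motion-vector approach described above, just a stand-in showing why naive change detection false-alarms so readily: any sufficiently large pixel change, whether from an intruder or from foliage swaying in the wind, trips the same rule. The input file name and both thresholds are placeholder values.

import cv2

cap = cv2.VideoCapture("surveillance.mp4")   # hypothetical input clip
ok, prev = cap.read()
if not ok:
    raise SystemExit("could not open clip")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Absolute per-pixel difference between consecutive frames
    diff = cv2.absdiff(gray, prev_gray)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    # Naive rule: flag motion if enough pixels changed. Swaying trees,
    # shifting shadows, and lighting changes trip this just as easily
    # as an object of genuine interest.
    if cv2.countNonZero(mask) > 0.01 * mask.size:
        print("motion detected")
    prev_gray = gray
cap.release()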

Such “false positives” have contributed to the perception that motion detection algorithms are unreliable. To prevent vision systems from undermining their own utility, installers often insist on observing fewer than five false alarms per day. Nowadays, however, an increasing percentage of systems are adopting intelligent motion detection algorithms that apply adaptive background modeling along with other techniques to help identify objects with much higher accuracy levels, while ignoring meaningless motion artifacts.
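
Adaptive background modeling of the kind these smarter algorithms employ is available off the shelf. The sketch below, assuming OpenCV's MOG2 background subtractor as a stand-in for a production implementation, shows how shadow pixels can be flagged and discarded rather than reported as motion; the file name and trigger threshold are placeholders.

import cv2

cap = cv2.VideoCapture("surveillance.mp4")   # hypothetical input clip
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)            # background model adapts every frame
    mask[mask == 127] = 0                     # MOG2 marks shadow pixels as 127; drop them
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # suppress speckle noise
    if cv2.countNonZero(mask) > 0.005 * mask.size:
        print("object-level motion detected")
cap.release()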

While there are no universal industry standards regulating accuracy, systems using these more sophisticated methods regularly achieve detection precision approaching 90 percent for typical surveillance scenes, i.e., those with adequate lighting and limited background clutter. Even under more challenging environmental conditions, such as poor or wildly fluctuating lighting, substantial precipitation-induced image degradation, or heavy camera vibration, accuracy can still be near 70 percent. The more advanced 3-D cameras discussed later in this article can boost accuracy higher still.

The capacity to accurately detect motion has spawned several event-based applications, such as object counting and trip zone. As the names imply, ‘counting’ tallies the number of moving objects crossing a user-defined imaginary line, while ‘tripping’ flags an event each time an object moves from a defined zone to an adjacent zone. Other common applications include loitering, which identifies when objects linger too long, and object left-behind/removed, which searches for the appearance of unknown articles or the disappearance of designated items.
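
As a rough sketch of how such a trip-line rule can be layered on top of motion detection, the fragment below assumes object centroids are already being tracked from frame to frame (for example, by a background-subtraction stage); the line endpoints and sample track are illustrative values only.

def side_of_line(pt, a, b):
    """Signed side of point pt relative to the directed line a->b."""
    return (b[0] - a[0]) * (pt[1] - a[1]) - (b[1] - a[1]) * (pt[0] - a[0])

def count_crossings(track, a, b):
    """Count how many times a centroid track crosses the line through a and b
    (treated as an infinite line for simplicity)."""
    crossings = 0
    prev_side = side_of_line(track[0], a, b)
    for pt in track[1:]:
        side = side_of_line(pt, a, b)
        if side * prev_side < 0:      # sign change => the line was crossed
            crossings += 1
        prev_side = side
    return crossings

# Example: a vehicle centroid moving left to right across a vertical trip line at x=100
trip_a, trip_b = (100, 0), (100, 480)
track = [(60, 200), (80, 202), (120, 205), (150, 207)]
print(count_crossings(track, trip_a, trip_b))   # -> 1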

Robust artificial intelligence often requires layers of advanced vision know-how, from low-level image processing to high-level behavioral or domain models. As an example, consider a demanding application such as traffic and parking lot monitoring, which maintains a record of vehicles passing through a scene. It is often necessary to first deploy image stabilization and other compensation techniques to mitigate the effects of extreme environmental conditions such as dynamic lighting and weather. Compute-intensive pixel-level processing is also required to perform background modeling and foreground segmentation.

To equip systems with scene understanding sufficient to identify vehicles in addition to traffic lanes and direction, additional system competencies handle feature extraction, object detection, object classification (car, truck, pedestrian, etc.), and long-term tracking. LPR (license plate recognition) algorithms and other techniques locate license plates on vehicles and discern individual license plate characters. Some systems also collect metadata about vehicles, such as color, speed, direction, and size, which can then be streamed or archived in order to enhance subsequent forensic searches.
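
A hedged, compressed sketch of a few of these stages follows: OpenCV background modeling, contour-based foreground segmentation, crude size-based classification, and simple color metadata. Real deployments substitute stabilization, trained classifiers, long-term trackers, and LPR engines; the input file name and every threshold below are placeholder values.

import cv2
import numpy as np

cap = cv2.VideoCapture("traffic.mp4")            # hypothetical traffic clip
bg = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = bg.apply(frame)
    fg[fg == 127] = 0                            # discard shadow pixels
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, np.ones((7, 7), np.uint8))
    # OpenCV 4.x: findContours returns (contours, hierarchy)
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        area = cv2.contourArea(c)
        if area < 500:                           # ignore small blobs (noise)
            continue
        x, y, w, h = cv2.boundingRect(c)
        # Crude size-based labels; real systems use trained classifiers
        label = "truck" if area > 8000 else "car" if area > 2000 else "pedestrian"
        # Simple color metadata: mean BGR over the blob's bounding box
        b_mean, g_mean, r_mean, _ = cv2.mean(frame[y:y+h, x:x+w])
        print(label, (x, y, w, h), (int(b_mean), int(g_mean), int(r_mean)))
cap.release()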

Algorithm implementation options
Traditionally, analytics systems were based on PC servers, with surveillance algorithms running on x86 CPUs. However, with the introduction of high-end vision processors, all image analysis steps (including those of the previously mentioned traffic monitoring systems) can now optionally be performed entirely in dedicated-function equipment.

Embedded systems based on DSPs (digital signal processors), application SoCs (systems-on-chip), GPUs (graphics processors), FPGAs (field programmable gate arrays), and other processor types are now entering the mainstream, primarily driven by their ability to achieve vision processing performance comparable to that of x86-based systems, at lower cost and power consumption.

Standalone cameras and analytics DVRs (digital video recorders) and NVRs (networked video recorders) increasingly rely on embedded vision processing. Large remote monitoring systems, on the other hand, are still fundamentally based on one or more cloud servers that can aggregate and simultaneously analyze numerous video feeds. However, even emerging ‘cloud’ infrastructure systems are beginning to adopt embedded solutions in order to more easily address performance, power consumption, cost, and other requirements. Embedded vision coprocessors can assist in building scalable systems, offering higher net performance, in part by redistributing processing capabilities away from the central server core and toward cameras at the edge of the network.

Semiconductor vendors offer numerous devices for different segments of the embedded cloud analytics market. These ICs can be used on vision processing acceleration cards that go into the PCI Express slot of a desktop server, for example, or to build standalone embedded products.

Many infrastructure systems receive compressed H.264 video from IP cameras and decompress the image streams before analyzing them. Repeated “lossy” video compression and decompression discards information, sometimes enough to reduce the accuracy of certain video analytics algorithms. Networked cameras with local vision processing “intelligence,” on the other hand, have direct access to raw video data and can analyze and respond to events with low latency (Figure 2).

Figure 2: In distributed intelligence surveillance systems, networked cameras with local vision processing capabilities have direct access to raw video data and can rapidly analyze and respond to events.
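
For comparison with the in-camera path shown in Figure 2, the sketch below shows the server-side path in minimal form: pulling a compressed H.264 stream from an IP camera over RTSP and decoding it frame by frame before any analytics run, so the algorithms only ever see pixels that have already been through a lossy compress/decompress cycle. The URL and credentials are placeholders.

import cv2

stream = cv2.VideoCapture("rtsp://user:pass@192.0.2.10/stream1")  # hypothetical camera URL
while True:
    ok, frame = stream.read()   # decode happens here, on the server side
    if not ok:
        break
    # ... run analytics on the decoded `frame` ...
stream.release()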

Although the evolution to an architecture based on distributed intelligence is driving the proliferation of increasingly autonomous networked cameras, complex algorithms often still run on infrastructure servers. Networked cameras are commonly powered by Power Over Ethernet (PoE) and therefore have a very limited power budget. Further, the lower the power consumption, the smaller and less conspicuous the camera can be. To quantify the capabilities of modern semiconductor devices, consider that an ARM Cortex-A9-based camera consumes only 1.8W in its entirety, while compressing H.264 video at 1080p30 (1920×1080 pixels per frame, 30 frames per second) resolution.

It's relatively easy to recompile PC-originated analytics software to run on an ARM processor, for example. However, as the clock frequency of a host CPU increases, the resultant camera power consumption also increases significantly, as compared to running some or all of the algorithm on a more efficient DSP, FPGA, or GPU. Harnessing a dedicated vision coprocessor will reduce the power consumption even more. And further assisting software development, a variety of computer vision software libraries is available.

Some algorithms, such as those found in OpenCV (the Open Source Computer Vision Library), are cross-platform, while others, such as Texas Instruments' IMGLIB (the Image and Video Processing Library), VLIB (the Video Analytics and Vision Library) and VICP (the Video and Imaging Coprocessor Signal Processing Library), are vendor-proprietary.

Leveraging pre-existing code speeds time to market, and to the extent that it exploits on-chip vision acceleration resources, it can also produce much higher performance results than those attainable with generic software (Figure 3).

Figure 3: Vision software libraries can speed a surveillance system's time to market as well as notably boost its frame rate and other attributes.

Historical trends and future forecasts
As previously mentioned, embedded vision processing is one of the key technologies responsible for evolving surveillance systems beyond their archaic CCTV (closed-circuit television) origins and into the modern realm of enhanced situational awareness and intelligent analytics. For most of the last century, surveillance required people, sometimes lots of them, to effectively patrol property and monitor screens and access controls.

In the 1990’s, DSPs and image processing ASICs (application-specific integrated circuits) helped the surveillance industry capture image content in digital form using frame grabbers and video cards. Coinciding with the emergence of high-speed networks for distributing and archiving data at scales that had been impossible before, surveillance providers embraced computer vision technology as a means of helping manage and interpret the deluge of video content now being collected.

Initial vision applications such as motion detection sought to draw the attention of on-duty surveillance personnel, or to trigger recording for later forensic analysis. Early in-camera implementations were usually elementary, using simple DSP algorithms to detect gross changes in grayscale video, while those relying on PC servers for processing generally deployed more sophisticated detection and tracking algorithms.

Over the years, however, embedded vision applications have substantially narrowed the performance gap with servers, benefiting from more capable function-tailored processors. Each processor generation has integrated more potent on-chip resources, including multiple powerful general-purpose computing cores as well as dedicated image and vision accelerators.

As a result of these innovations, the modern portfolio of embedded vision capabilities is constantly expanding. And these expanded capabilities are appearing in an ever-wider assortment of cameras, featuring multi-megapixel CMOS sensors with wide dynamic range and/or thermal imagers, and designed for every imaginable installation requirement, including dome, bullet, hidden/concealed, vandal-proof, night vision, pan-tilt-zoom, low light, and wirelessly networked devices.

Installing vision-enabled cameras at the ‘edge’ has reduced the need for expensive centralized PCs and backend equipment, lowering the implementation cost sufficiently to place these systems within reach of broader market segments, including retail, small business, and residential.

The future is bright for embedded vision systems. Sensors capable of discerning and recovering 3-D depth data, such as stereo vision, TOF (time-of-flight), and structured light technologies, are increasingly appearing in surveillance applications, promising significantly more reliable and detailed analytics.

3-D techniques can be extremely useful when classifying or modeling detected objects while ignoring shadows and illumination artifacts, addressing a problem that has long plagued conventional 2-D vision systems. In fact, systems leveraging 3-D information can deliver detection accuracies above 90 percent, even for highly complex scenes, while maintaining a minimal false detection rate (Figure 4).

Figure 4: 3-D cameras are effective in optimizing detection accuracy, by enabling algorithms to filter out shadows and other traditional interference sources.

However, these 3-D technology advantages come with associated tradeoffs that also must be considered. For example, stereo vision, which uses geometric “triangulation” to estimate scene depth, is a passive, low-power approach to depth recovery which is generally less expensive than other techniques and can be used at longer camera-to-object distances, at the tradeoff of reduced accuracy (Figure 5).

(a)

(b)

(c)

Figure 5: The stereo vision technique uses a pair of cameras, reminiscent of a human's left- (a) and right-eye perspectives (b), to estimate the depths of various objects in a scene (c).
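
As a concrete illustration of the triangulation step behind panel (c), the sketch below computes a disparity map from a rectified left/right image pair using OpenCV's classic block matcher, then converts disparity to depth. The image file names, focal length, and baseline are placeholder values for a hypothetical calibrated rig.

import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder rectified pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# numDisparities must be a multiple of 16; compute() returns 16.4 fixed-point values
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# With a calibrated rig, depth follows from triangulation:
#   depth = focal_length_px * baseline_m / disparity
focal_px, baseline_m = 700.0, 0.12                      # placeholder calibration values
depth_m = (focal_px * baseline_m) / np.maximum(disparity, 0.1)
print("median scene depth (m):", float(np.median(depth_m)))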

TOF, on the other hand, is an active, higher-power sensor that generally offers more detail, but at higher cost and with a shorter operating range. Both approaches, along with structured light and other candidates, can be used for detection. But the optimum technology for a particular application can only be fully understood after prototyping (Figure 6).

Figure 6: Although the depth map generated by a TOF (time-of-flight) 3-D sensor is more dense than its stereo vision-created disparity map counterpart, with virtually no coverage “holes” and therefore greater accuracy in the TOF case, stereo vision systems tend to be lower power, lower cost, and usable over longer distances.

As new video compression standards such as H.265 become established, embedded vision surveillance systems will need to process even larger video formats (4k x 2k and beyond), which will compel designers to harness hardware processor combinations that may include CPUs, multi-core DSPs, FPGAs, GPUs, and dedicated accelerators.

Addressing often-contending embedded system complexity, cost, power, and performance requirements will likely lead to more distributed vision processing, whereby rich object and feature metadata extracted at the edge can be further processed, modeled, and shared in the cloud. The prospect of more advanced compute engines will enable state-of-the-art vision algorithms, including optical flow and machine learning.
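
As a small illustration of one such algorithm, the sketch below computes dense optical flow between consecutive frames using OpenCV's Farneback implementation and reports the average motion magnitude. The input file name is a placeholder, and a production system would typically offload this workload to a vision accelerator rather than a general-purpose core.

import cv2
import numpy as np

cap = cv2.VideoCapture("surveillance.mp4")     # placeholder clip
ok, prev = cap.read()
if not ok:
    raise SystemExit("could not open clip")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense flow: one 2-D motion vector per pixel between consecutive frames
    # (arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    print("mean motion magnitude (pixels/frame):", float(np.mean(magnitude)))
    prev_gray = gray
cap.release()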

Embedded vision technology has the potential to enable a wide range of electronic products, such as the surveillance systems discussed in this article, that are more intelligent and responsive than before, and thus more valuable to users. It can add helpful features to existing products. And it can provide significant new markets for hardware, software, and semiconductor manufacturers.

Editor’s Note: On Thursday, May 29, 2014, in Santa Clara, California, the Alliance will hold its fourth Embedded Vision Summit, part of an ongoing series of technical educational forums for hardware and software product creators interested in incorporating visual intelligence into electronic systems. The Embedded Vision Summit West will be co-located with the Augmented World Expo, a three-day event covering augmented reality, wearable computing, and the Internet of Things.

Brian Dipert is Editor-In-Chief of the Embedded Vision Alliance. He is also a Senior Analyst at BDTI (Berkeley Design Technology, Inc.), and Editor-In-Chief of InsideDSP, the company's online newsletter dedicated to digital signal processing technology. He has a B.S. degree in Electrical Engineering from Purdue University in West Lafayette, IN. His professional career began at Magnavox Electronics Systems in Fort Wayne, IN; Brian subsequently spent eight years at Intel Corporation in Folsom, CA. He then spent 14 years at EDN Magazine.

Jacob Jose is a Product Marketing Manager with Texas Instruments’ IP camera business. He joined Texas Instruments in 2001. He has engineering and business expertise in the imaging, video, and analytics markets, and has worked at locations in China, Taiwan, South Korea, Japan, India, and the USA. He has a Bachelor's degree in computer science and engineering from the National Institute of Technology at Calicut, India and is currently enrolled in the executive MBA program at the Kellogg School of Business, Chicago, Illinois.

Darnell Moore, Ph.D., is a Senior Member of the Technical Staff with Texas Instruments’ Embedded Processing Systems Lab. As an expert in vision, video, imaging, and optimization, his body of work includes Smart Analytics, a suite of vision applications that spawned TI’s DMVA processor family, as well as advanced vision prototypes, such as TI’s first stereo IP surveillance camera. He received a BSEE from Northwestern University and a Ph.D. from the Georgia Institute of Technology.
