Lag-time in video games and video conferencing is annoying. Lag-time in avionics, medical devices, and industrial video systems is mission-critical. That’s why low latency video systems are proliferating in applications where live video feeds need to be processed and analyzed in real time. This article discusses some of the various contributors to latency in video systems, and ways of minimizing their impact at the video source and playback ends.
The need for speed
For consumers and business users, lag time is commonly experienced in video games and video conferencing. Lag time in video games leads to being overrun by enemies, eliminated by other players, or, in the case of massive StarCraft tournaments, a stadium full of angry fans. Low-latency video conferencing matters just as much in business: without it, mismatched voice and video cause confusion and frustration. High latency in a video conferencing system can disrupt a conversation to the extent that it defeats the purpose of using video conferencing to increase productivity. In mission-critical applications such as Unmanned Aerial Vehicles (UAVs), video-assisted surgery, and mid-air refueling, the consequences of high latency are multiplied.
UAVs used in tactical strikes on enemy targets and video-assisted surgeries such as endoscopy and laparoscopy appear to be unrelated, yet both rely on accurate information – accuracy that only a low latency video system can provide. If the latency is too great, the consequences can be catastrophic, such as hitting the wrong target with the UAV's payload or missing a crucial element during surgery.
During mid-air refueling, the pilot must align the aircraft's fuel inlet with the fuel pipe of the tanker aircraft. Again, video systems play a critical role here, capturing live video of the inlet and sending it to the pilot in real time. The success of the entire operation depends on the latency between the video captured at its source and the video displayed on the screen in the cockpit.
When making mission-critical decisions based on video feeds, low latency is essential. In my examples, higher latency could lead to the UAV targeting an unintended area, doctors making misplaced incisions, and military aircraft running out of fuel. This could ultimately lead to serious property damage, failed missions, and loss of life.
Latency contributors and counter measures
The anatomy of a typical video system consists of two parts: the video source end where video is captured, compressed, and streamed, and the video playback end, where video is received, decompressed, and displayed.
The diagram in Figure 1 shows the components of a typical video system. Each of the modules below adds a delay to the end-to-end latency of the system. Let us investigate where and how these delays are introduced:
Consider a basic 60 frames/sec video, where each frame period is roughly 17 ms. Video capture adds 17 ms of delay, and video display adds another 17 ms. Depending on the encoder and decoder (CODECs) available, the compression and decompression modules add another 15 ms to 17 ms each. The encapsulation and decapsulation modules, depending on their container formats, add between 15 ms and 17 ms each. The transport protocol (RTP, UDP, TCP) adds 5 ms to 10 ms each in the streamer and stream receiver modules.
When you put these numbers together, you arrive at a delay of somewhere around 104 ms to 122 ms. Note that an additional delay is incurred for the data to travel from the streamer to the stream receiver. This network delay is a further contributor to the end-to-end latency of the system.
Use of innovative techniques for minimizing these delays may be possible though not always practical. For example, instead of allowing complete frame capture to add 17ms of delay, we can process videos by capturing slices of input and encoding them. This way, depending on the number of slices, video capture delays can be reduced. However, this requires tampering with the video capture interface, which is not easy to do, and can affect system stability.
Similarly, throwing faster hardware at the problem can reduce processing delays. For example, deploying components capable of 1080p60 video processing and using them for 720p60 video will reduce the processing latency from approximately 17 ms to 9 ms. While this works, it is an inefficient solution that leaves capacity underutilized, and such workarounds only add to the cost of the system.
However, there are other methods for reducing the inter-module delays, and thus the overall end-to-end latency of video systems. This article describes various methods used to control latency, the issues encountered, and the solutions that worked for correcting those issues.
Approaches to lowering video latency
One of the common issues encountered in low latency video systems is jerky playback (the jitter effect) of video feeds. In these cases a jitter buffer is required at the video playback end to smooth out playback. Jitter buffers store some encapsulated video data so that the video playback end always has data to process and does not starve, which is what produces the jitter. But jitter buffers add their own latency, proportional to their size, and thereby add to the overall latency of the video system. Avoiding jitter buffers can help reduce latency; however, this can only be achieved by controlling the video encoding options at the video source end. Some of these options are discussed below.
Avoiding encoding mode frame size variations. There are two encoding modes: variable bitrate (VBR) mode, where the quantization parameter is held fixed and the frame sizes vary; and constant bitrate (CBR) mode, where the quantization parameter is varied to hold the bitrate steady.
At the video source end, configuring the encoder for VBR mode gives good video quality but also leads to large variations in video frame sizes. These variations cause jerkiness at the video playback end, and smoothing them out requires a jitter buffer, which adds to the latency of the video system. It is therefore strongly advised to configure the encoder for CBR mode instead of VBR, so that the video playback end can do without a jitter buffer.
Avoiding coding type frame size variations. There are two commonly used coding types: intra-coding, where compression is achieved by removing redundancy within a single video frame, and inter-coding, where compression is achieved by removing redundancy between successive video frames.
Depending on the coding technique used, there are intra-coded frames (I frames) and inter-coded frames (P frames). I frames carry more bits, as only intra-coding is used. P frames carry fewer bits, as both intra- and inter-coding are used. The fluctuation in size between I frames and P frames adds to fluctuations in their reception times at the video playback end, again requiring a jitter buffer to smooth out the display. The larger these fluctuations, the higher the latency.
Figure 2 shows an analysis of the maximum and average size variations of I and P frames for two streams, one with bad rate control (left) and one with good rate control (right). The badly controlled stream on the left has wide gaps between the average and maximum frame sizes, as well as large differences between the two frame types, severely impacting the latency of the video system. The well-controlled stream on the right has frames that are much smaller, with less spread between average and maximum values and less difference between I and P frames.
Apart from I and P frames, there is a third type known as the B frame, which uses bidirectional prediction in inter-coding. Bidirectional prediction requires multiple frames to be decoded before a B frame can be decoded, which adds latency at both the video source and video playback ends. It is strongly advised to disable B frames when doing video compression.

Frames/second streaming vs bits/second. At the video source end, when the streamer is configured to send data at a steady bitrate, even small variations in the video frame sizes add jerkiness to video playback, again inviting a jitter buffer: when video leaves at a fixed X bits per second, the arrival instants of the individual frames cannot adhere to the frame time. For example, at a frame rate of 60 fps, a new frame is expected at the video playback end every 17 ms. But even when the variations in frame size are minimal, as low as 5%, the 17 ms boundary is consistently crossed. Hence it is beneficial to stream on a frames-per-second policy instead of a bits-per-second policy. That way frames arrive on or before each 17 ms window, and no jitter buffer is required. This method cannot be used in environments with strict bandwidth constraints; even there, however, if the video can be compressed at a bitrate 10% below the available bandwidth, the method remains useful.
The diagram in Figure 3 is a snapshot of a Wireshark capture from a 30 fps video system. Here each frame should have been received at the video playback end in less than 33 ms. However, this is not the case. As you can see, frame 1052 is an I frame and frame 1158 is a P frame.

The start of the I frame was received at reference time 0.6662 and the start of the P frame at 0.7553, so the complete I frame took 0.0891 sec, or 89.1 milliseconds, to arrive. These anomalies arise from a streamer configuration that adheres to bits per second rather than frames per second. This behavior is often seen with open source streamer applications like VLC.
Generic approaches for reducing processing delays
Apart from the approaches discussed above, other options are available to minimize processing delays. These are some of the options we have used.
Reducing inter-module delays. A strong buffer sharing policy should be in place to avoid inter-module delays. When the capture and encode modules share buffers, use no more than three shared buffers: while the capture driver is working on one buffer, the capture module owns the second, and the encoder module processes the third. Any more buffers will add a worst case of 17 ms of additional latency in a 60 fps video system. The same holds for all inter-module buffer sharing, such as encode-encapsulate, encapsulate-streamer, and decode-display.
Multiple operations per thread. Whenever the processing time for two consecutive operations is less than 17 ms in a 60 fps video system, it is recommended to do these operations in a single thread. This way the inter-module delay is cut down by a few milliseconds. Note that every thread added for parallel processing, in the absence of strong buffer sharing and thread scheduling policies, adds to the latency of the system.
The foregoing assumes that the cumulative processing time for these threads is above 17 ms. To summarize, the lowest latency video system captures/displays video in one thread and manages the rest of the processing serially in another thread. This way the video source end latency is 17 ms and the video playback end latency is 17 ms, bringing the glass-to-glass latency of the video system down to as low as 50 ms. However, this requires very high processing capability in the hardware, and will usually result in under-utilization of that hardware.
Other aspects of minimizing latency
Apart from the latencies introduced by the video source end and the video playback end, there are other factors that contribute to latency. For example, the video feed sent over a network does not arrive instantly at the video playback end.
The difference between the time it is sent out from the video source end and the time it is received at the video playback end is termed network latency. Also, even in the absence of jitter buffers as discussed earlier, the FIFOs or queues at the kernel level add to the latency of the system. This additional latency, due to buffering in the kernel-level socket implementation, affects the video source end as well as the video playback end. These delays can be reduced by using the following methods:
Reduce delays with socket buffering. Buffering of incoming/outgoing data starts at the socket layer in kernel space. So even when there is no jitter buffer, some buffering is still in place. This buffering helps compensate for the absence of a jitter buffer, but without due consideration it can also drastically add to the latency. The amount of buffering done is usually in the range of a few hundred kilobytes, and varies across operating systems. Linux-based operating systems provide options to configure the socket layer and network stack to manage these buffers. These options should be used to achieve optimal buffering, with consideration for the range of supported bitrates of the video system and the expected variations in frame sizes. Linux-based systems provide the SO_RCVBUF option to adjust socket-level buffering at the stream receiver end.
Note that the kernel usually doubles this buffer size to allow for bookkeeping overhead. At the network stack level, features for tuning the receive buffer sizes are available under 'sysctl'. For example, the command below sets the maximum receive buffer size in the network layer to 8 MB.
sysctl -w net.core.rmem_max=8388608
These options must be evaluated with a trial-and-error approach before finalizing the values, and indeed the options themselves. Use of these options can affect the stability of other applications running on the system, so caution is advisable when employing them.
Differentiated services. By using the differentiated services field, video data can be forwarded with higher priority. Setting these options on the socket can improve latency performance, subject to the network configuration as well as support from the underlying operating system.
Expedited forwarding is an example of a differentiated service for voice, video, and other real-time traffic. Setting this field when configuring the socket can help reduce network latency. Linux-based systems provide the IP_TOS socket option to enable and configure differentiated services. The code snippet below demonstrates the use of IP_TOS to request expedited forwarding.
int tos_local = 0x2e << 2;  /* DSCP EF (46) occupies the upper six bits of the TOS byte, giving 0xb8 */
sockfd = socket(AF_INET, SOCK_DGRAM, 0);
setsockopt(sockfd, IPPROTO_IP, IP_TOS, &tos_local, sizeof(tos_local));
By adjusting the video encoder properties, the application code can be tuned to produce a video feed that in itself helps reduce video playback end latency. At the same time, choosing a good buffer sharing architecture and using the best thread scheduling policies at both the video source and video playback ends is equally important. Furthermore, by using advanced socket options and exploiting network layer configurations, network latency can be controlled to a large extent. All of this helps reduce the glass-to-glass latency of the video system.
Krishna Prabhakaran, senior technical lead at eInfochips, has worked in the video processing field for over 11 years, with hands-on experience in video and image processing algorithms and applications. He has worked on multiple mission-critical product designs based on real-time streaming, including applications in aerospace and medical equipment. He has also designed video analytics algorithms for eInfochips.