Reducing VoIP quality degradation when network conditions are unstable

Delay and packet loss can significantly affect the perceived quality of voice transmitted over packet networks. Packets travelling from source to destination may suffer from delay variation, may arrive out of order, or may even be lost. To compensate for delay variation, de-jitter buffers (more commonly called jitter buffers) are used at the receive side of packet-based systems. The role of a de-jitter buffer is to restore the correct order of packets and to allow the “slower” packets time to arrive.

There are two major classes of jitter buffers: static and adaptive. A static jitter buffer has a fixed size, and packets leaving it experience a constant delay (measured from the moment a packet is produced until the moment it is consumed), whereas an adaptive jitter buffer has a variable size and a variable delay.

The management and implementation of jitter buffers is not specified by any standard, resulting in different implementations of both static and adaptive algorithms. There is also no general recipe for “a good” jitter buffer; it depends very much on the target application and the environment where the application is used (e.g. a complex and/or memory-hungry jitter buffer implementation is unnecessary if the delay variation on the target network is very low).

We have focused our attention on developing an adaptive jitter buffer algorithm for a Voice over IP embedded application with restrictions in both memory consumption and processing power. Our objective was to build a base algorithm that uses little processing power while the network is “behaving properly”, plus additional mechanisms that minimize the quality degradation of the output stream when the network behavior becomes unpredictable.

Previous approaches

Different approaches to playout buffer algorithms have been studied in the literature. We have classified the algorithms we studied into four categories (a similar classification is proposed by Narbutt et al. in [1]):

  • algorithms that establish the playout delay based on a continuous estimation of the network parameters
  • statistics-based algorithms
  • algorithms that maximize user satisfaction
  • algorithms that use various heuristics and monitor certain parameters (e.g. late packet fraction, buffer occupancy, etc.).

The best-known algorithm in the first category is the one presented in [3]. It uses an autoregressive method to estimate the average network delay and its variance.

The algorithm maintains two running estimates (the delay and its variance) and uses them to calculate the playout time:

di = α·di-1 + (1 − α)·ni
vi = α·vi-1 + (1 − α)·|di − ni|

where di and vi are the i-th estimates of the delay and its variance respectively, while ni is the one-way delay of the i-th packet (as defined by RFC 2679 [19]).

α is a parameter that controls the jitter buffer adaptation speed. A lower value makes the jitter buffer sensitive to small variations in delay; a higher value makes it less sensitive to small delay variations, but slower to adapt to sudden changes in network delay. We propose a value of 0.998002, which corresponds to an exponential moving average over about 500 samples.

Based on the above estimates, the playout time is computed as:

pi = ti + di + β·vi

where pi is the playout time and ti is the send time of packet i. β is a factor that weights the delay variance in the computation of the playout time; it is determined empirically, and the authors of [3] proposed a value of 4.

The values di and vi are computed for each packet received, but they are used to calculate the playout time only for the first packet in a talkspurt. (During silence periods an application may send occasional comfort-noise packets or may not send packets at all. The first packet of a talkspurt is the first packet following a silence period; the notion of a talkspurt and talkspurt identification is defined and discussed in RFC 3551 [22].)
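As a minimal sketch of this estimator (the constants follow the values discussed in the text; the function and variable names are our own, not from [3]):

```python
ALPHA = 0.998002  # adaptation speed; roughly a 500-sample exponential moving average
BETA = 4          # weight of the delay variance in the playout time

def update_estimates(est_delay, est_var, n_i):
    """Update the running delay/variance estimates with the one-way
    delay n_i of the newly received packet."""
    est_delay = ALPHA * est_delay + (1 - ALPHA) * n_i
    est_var = ALPHA * est_var + (1 - ALPHA) * abs(est_delay - n_i)
    return est_delay, est_var

def playout_time(t_i, est_delay, est_var):
    """Playout time for the first packet of a talkspurt:
    p_i = t_i + d_i + BETA * v_i."""
    return t_i + est_delay + BETA * est_var
```

The estimates are updated for every packet, but `playout_time` is applied only at talkspurt boundaries, as described above.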

Several modifications to this algorithm have been proposed. Some were proposed by the original authors in [3], introducing detection of spikes in network delay and different adaptation speeds for delay increases and decreases. Another suggested improvement is a dynamic value for α ([4], [10]) dependent on the network conditions (higher values during stable periods and lower values during instability).

The algorithm presented above is based on an exponential moving average filter; other types of filters have also been proposed (e.g. the NLMS filter in [11]).

In the second category there are a significant number of algorithms that build the delay distribution of the received packets. In [7] and [8] a histogram of previous delays is computed and maintained. In [9] the parameters of a Pareto distribution are continuously updated. The playout time is then computed so that the packet loss stays below a defined threshold.

The third category contains algorithms that build functions measuring user satisfaction. In [8], the authors established a relationship between the MOS value, the packet loss ratio and the playout delay, assuming a Pareto distribution of network delays; the playout delay was then calculated to maximize the MOS function.

The algorithms in the fourth category monitor parameters like buffer occupancy, loss percentage, late packet fraction, etc. In [12], the buffer occupancy is monitored over time, and the delay is reduced when the buffer consistently contains more than N frames for a defined period of time.

In [13], occupancy watermarks are introduced to define the buffer occupancy thresholds at which the risk of overflow or underflow is reached. The area between the two limits (underflow and overflow) is considered normal operation. When the buffer occupancy falls outside this area, the jitter buffer enters an adaptation phase with the aim of returning the occupancy to the targeted area.

In the jitter buffer literature, algorithms focus primarily on optimizing playout but ignore some important details that can greatly influence algorithm selection: the target platform (general-purpose computers, embedded systems), restrictions (the amount of memory and processing power available for the jitter buffer) and the network type (e.g. a lighter algorithm may be used when the network conditions are well known).

We have focused our study on finding an algorithm that minimizes both the memory and the processing power used and that targets a VoIP media gateway running on a DSP. At the same time, the algorithm has to offer good quality for “well-behaving networks” (i.e. well-managed and partially-managed IP networks as defined in ITU-T G.1050) and graceful degradation for unmanaged IP networks.

The previously studied algorithms perform delay adjustment either per packet or per talkspurt. The talkspurt-based algorithms compress or expand the silence periods between talkspurts to avoid dropping packets and inserting gaps. For voice applications, such an algorithm is the preferred solution because it eliminates the unpleasant effect of dropping packets or inserting gaps in the middle of a talkspurt.

Unfortunately, it is not always easy to detect the start of a talkspurt: some streams carry this information in the marker bit of the RTP header, others do not. The information may also be computed outside the jitter buffer as a packet type classification (silence or voice), but not all voice applications can obtain it easily prior to inserting the packet into the jitter buffer.

The proposed algorithm has two flavors:

  • An algorithm for streams with talkspurt information (we will refer to it as TAA – Talkspurt Adaptation Algorithm)
  • An algorithm for streams without talkspurt information (we will refer to it as NTAA – Non-Talkspurt Adaptation Algorithm).

The jitter buffer uses one of the above algorithms depending on the availability of talkspurt information: if it can identify the start of a talkspurt, it uses TAA; otherwise it uses NTAA, the switch between the two being automatic.

The TAA algorithm

The algorithm combines a proactive approach with a reactive one. The proactive part continuously estimates the average delay and its variance using the algorithm proposed by Ramjee et al. in [3]. We selected this algorithm for its simplicity and low requirements in both memory and processing power, but other algorithms may also be used.

The jitter buffer is configured to perform optimally for certain network conditions in terms of delay variation, packet reordering and packet loss. The reactive part is activated when the jitter buffer reaches certain situations such as overflow or late packet arrival.

Thus, if the network conditions are stable and within the supported limits, the estimator will in general be accurate. If these conditions are not met, the jitter buffer will enter one of two states: late packet arrival or overflow.

Let Sn be the send time of packet n and Slast the send time of the last packet played. A late packet with Sn < Slast is dropped, because a packet with newer information has already been played. However, if a late packet has Sn > Slast, it can still be played at the cost of a delay increase. If the packet is within a reasonable range relative to the jitter buffer’s current time (e.g. 1 second), it is accepted and the delay is increased.

At this point, two options are available:

  • Drop the packet and keep the current delay. If the packet is an isolated one, this solution works well.
  • Keep the packet and increase the delay. If the delay increases significantly, the buffer will start filling up and may reach an overflow situation, which then has to be handled carefully.

The TAA algorithm uses the second approach: it keeps the packet and handles overflow situations when they occur.


Let an be the delay increase required to accept late packet n, and let F(n) be 1 if packet n is a late packet that was accepted and 0 otherwise; the delay increase caused by received packet n is therefore F(n)·an.

Let TDIL be the “Total Delay Increase” due to accepting late packets. For N received packets, TDIL is:

TDIL = Σ(n=1..N) F(n)·an
In an overflow situation, a packet must be dropped because there is not enough memory to accommodate the new one. We have chosen not to drop the incoming packet, but rather a packet whose removal allows a rapid delay decrease, to avoid entering the overflow situation again on the next incoming packet.

Let bm be the delay decrease obtained by resolving overflow m.

Let us assume that over a period T, N packets were received and M overflows occurred. In this situation TDIL is:

TDIL = Σ(n=1..N) F(n)·an − Σ(m=1..M) bm
Late packet arrivals and overflows are handled as follows. In the case of overflow, if TDIL is 0 (there was no delay increase due to late packet arrivals in the recent past), the choice of which packet to drop depends on the location of the received packet inside the buffer.

If the delay variation is within the acceptable range, this approach positions the jitter buffer time better on the delay variation distribution; if the delay exceeds the supported range, packets have to be dropped anyway.
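The reactive behavior described above can be sketched as follows. This is an illustration of the mechanism rather than the original listing: the names (MAX_LATE, on_late_packet, on_overflow) and the overflow drop policy (pick the candidate giving the largest delay decrease) are simplified assumptions.

```python
MAX_LATE = 1000.0  # assumed acceptance window relative to the buffer's current time

def on_late_packet(state, s_n, s_last, lateness):
    """Handle a late packet: drop it if older than the last played packet,
    reject it if unreasonably far behind the buffer time, otherwise accept
    it and record the delay increase a_n in TDIL."""
    if s_n < s_last:
        return "drop"             # newer information has already been played
    if lateness > MAX_LATE:
        return "reject"           # too far from the buffer's current time
    state["tdil"] += lateness     # a_n: delay increase paid to play this packet
    return "accept"

def on_overflow(state, candidate_decreases):
    """Handle an overflow: drop the packet whose removal yields the delay
    decrease b_m, and consume the accumulated TDIL accordingly."""
    b_m = max(candidate_decreases)  # favor a rapid delay decrease
    state["tdil"] = max(0.0, state["tdil"] - b_m)
    return b_m
```

When TDIL is 0 at overflow time, a real implementation would instead choose the victim packet based on the received packet’s position in the buffer, as noted above.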


The TDIL value reflects the delay increase due to the acceptance of late frames; it is maintained to indicate how much the delay should be decreased when an overflow occurs. However, if the late packets were accepted some time in the past, they should not have the same influence on current overflows, so an aging mechanism for TDIL is employed (we chose to decrease TDIL by a factor after every N packets received without an overflow).
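A minimal sketch of this aging mechanism (the decay factor and window size are illustrative assumptions, not values from our implementation):

```python
AGING_FACTOR = 0.5   # assumed decay factor applied at each aging step
AGING_WINDOW = 100   # assumed N: overflow-free packets between aging steps

def age_tdil(tdil, packets_since_overflow):
    """Decay TDIL once every AGING_WINDOW overflow-free packets, so that
    late-packet acceptances far in the past stop driving delay decreases
    at the next overflow."""
    if packets_since_overflow > 0 and packets_since_overflow % AGING_WINDOW == 0:
        tdil *= AGING_FACTOR
    return tdil
```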

Another important aspect is the type of the late packet: if the late packet is silence, a good decision is to drop it (it does not contain relevant information anyway).

The NTAA algorithm

The NTAA algorithm uses the same reactive part as TAA. However, the proactive algorithm cannot be used, because the stream carries no talkspurt information. The delay increases naturally, as gaps are inserted into the stream whenever no packet is available for playing. There are also situations when packets are delayed for a period of time (e.g. when congestion occurs).

During such a period the jitter buffer has no packet to play, but eventually the delayed packets arrive (almost) all at once, causing the jitter buffer to fill up. If, after the congestion period, the network conditions become stable again, the jitter buffer is left holding a considerable number of packets, so all newly received packets will be delayed even though the network delay is low. In these situations it is better to reduce the delay, even at the cost of dropping some packets.

Let [T1, Tn] be the analysis interval (the interval over which the jitter buffer runs). The algorithm splits it into equally sized intervals of size Ts; let M = {I1, I2, …, In/Ts} be the set of those intervals, and let Min(Ik) be the minimum buffering delay observed during interval Ik.


The algorithm computes the minimum buffering delay for each interval. If Min(Ik) is above a certain threshold (MIN_THRESHOLD), it indicates that the packets were kept in the buffer longer than necessary, because no “late packets” (packets with a buffering delay under the threshold) were received inside Ik. In this situation the algorithm decides to decrease the delay by one frame.
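This decision can be sketched as below; for simplicity the sketch uses windows of ts packets rather than time intervals of length Ts, and the threshold value and names are assumptions for illustration.

```python
MIN_THRESHOLD = 120  # assumed threshold on the minimum buffering delay (samples)

def delay_decrease_decisions(buffering_delays, ts):
    """Split the per-packet buffering delays into windows of ts packets
    and flag each window whose minimum delay Min(Ik) stays above
    MIN_THRESHOLD: every packet in such a window waited longer than
    necessary, so the delay can be decreased by one frame."""
    decisions = []
    for k in range(0, len(buffering_delays) - ts + 1, ts):
        decisions.append(min(buffering_delays[k:k + ts]) > MIN_THRESHOLD)
    return decisions
```

For example, a window whose smallest observed buffering delay is 180 samples is flagged for a one-frame decrease, while a window containing a 100-sample packet is not.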

Figure 1: Packet buffering delay

An example is depicted in Figure 1 above: after a period, the delay is decreased. The buffering delay can be considerably reduced in the second half of the trace, a buffer of 200 samples being more than enough.

Different metrics may be used to evaluate jitter buffer performance (e.g. the packet loss vs. buffering delay curve, or user satisfaction via the E-model [1]).

A jitter buffer may affect the stream quality through packet loss, voice gaps (periods when no packets are available for playing) and buffering delay; we have chosen these three metrics to evaluate the jitter buffer performance.

We have focused our attention not only on the presence of these flaws, but also on their distribution (the output stream quality is affected differently by the loss of 10 consecutive packets than by the loss of 10 packets spread out as one in every 100). Generally, voice applications have mechanisms to compensate for packet loss (i.e. packet loss concealment), and these algorithms tend to work better when the number of consecutively dropped packets is low.

We evaluated the algorithm proposed by Ramjee et al. in [3] (referred to as Algorithm 1) against the proposed algorithm (referred to as Algorithm 2). For the evaluation we selected several classes of input streams, ranging from low delay variation to higher delay variation and packet reordering (Table 1 below).


Table 1: Input streams

The proposed algorithm behaves better in almost all cases regarding packet loss and gap insertion. The buffering delay tends to be higher for Algorithm 2, the reason being a more rapid adaptation to changing network conditions and therefore less packet dropping and gap insertion.

In spite of this, the maximum buffering delay is never reached. (The jitter buffer is configured with two limits on buffering delay: a minimum threshold and a maximum threshold. The minimum threshold may be used to impose a fixed delay on all received packets, and the maximum threshold limits the amount of time a packet spends in the jitter buffer.)

As shown in Figure 2 to Figure 7 below, for low delay variation streams (e.g. class 1) the difference between the two algorithms is insignificant, an indication that Algorithm 2 does not use the reactive mechanisms (accepting late packets and handling overflow) for such streams; these mechanisms are activated only when the input stream exceeds certain thresholds in terms of delay variation and packet reordering.


Figure 2: Input stream delay variation (PDV = Packet Delay Variation as defined by RFC 5481 [21], the reference being the packet with the minimum delay in the stream)


Figure 3: Packet loss


Figure 4: Voice gaps


Figure 5: Average buffer delay


Figure 6: Average buffer delay (class 2)


Figure 7: Average buffer delay (class 5)

As presented in the previous sections, when no talkspurt information is available the jitter buffer has an additional mechanism to reduce the delay. Figure 8 below presents the jitter buffer behavior for a stream without talkspurt information when only the base TAA algorithm is used (i.e. without the additional delay reduction).


Figure 8: No talkspurt information (no delay reduction)

Packet Loss: 0.012%

Average buffering delay: 927

Max buffering delay: 1120

Voice gaps: 0

If the additional delay reduction mechanism is used (Figure 9 below), the delay is reduced by dropping some frames at regular time intervals (the packets shown with negative delay on the graph are actually dropped).


Figure 9: No talkspurt information (delay reduction)

Packet Loss: 0.013%

Average buffering delay: 289

Max buffering delay: 800

Voice gaps: 1

The jitter buffer was configured not to decrease the delay below 160 samples, which is why the delay is not reduced further. Note also that after the target buffering delay has been reached (the jitter buffer started with a high buffering delay), no further adjustments are made unless the input stream requires them.

Some final thoughts

A jitter buffer implementation depends on the target application’s resources and the ecosystem where the application is deployed. In general, the better the algorithm (at estimating the playout time for a packet), the more resources it needs.

A simpler algorithm may be used for a network with low delay variation and little packet reordering, or at least a network with predictable behavior. However, applications are not deployed on a single system, and different systems may have different specifications in terms of network behavior.

We have implemented a jitter buffer algorithm that uses fewer resources when the network is predictable (using the simple filter proposed by Ramjee et al. in [3]) and additional mechanisms when certain situations are encountered: the jitter buffer delay is too high, causing the buffer to fill and overflow, or the delay is too low, causing late packet arrivals.

When either situation is encountered, the jitter buffer adjusts the delay in order to minimize packet loss and gap insertion. The advantage of combining a proactive approach ([3]) with a reactive one is that packet loss and gap insertion are minimized by a proactive algorithm working on talkspurts and a reactive algorithm limiting the loss in case of overflow or underflow.

Unfortunately, it is not always simple to detect the start of a talkspurt in a given stream, and voice applications may not have this information prior to inserting the packet into the jitter buffer. For such cases a different scheme was used: monitoring the buffering delay and reducing it from time to time by dropping some packets.

An extension of this algorithm would be to receive the packet type information from the caller after the packet has been processed. This would help the jitter buffer predict the type of the next packet with higher probability (if the current packet is silence, there is a high probability that the next one is also silence, and therefore a good candidate for dropping).

All in all, there are different aspects to be taken into account when implementing a jitter buffer, and no general solution exists for all kinds of applications and environments.

Adrian Răileanu is a DSP software engineer in the Packet Telephony Systems and Applications team in Freescale Semiconductor, developing and integrating media processing components. He received an MS degree in electronic and telecom engineering from Politehnica University of Bucharest.

Diana Crăciun is a DSP software engineer in the Packet Telephony Department in Freescale Semiconductor, developing and integrating media processing components. She received an MS degree in computer science from Politehnica University of Bucharest.


[1] Miroslaw Narbutt, Mark Davis, “Assessing the quality of VoIP transmission affected by playout buffer scheme”

[2] ITU-T G.107 Recommendation, “The E-model, a computational model for use in transmission planning”

[3] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne, “Adaptive Playout Mechanisms for Packetized Audio Applications in Wide-Area Networks”

[4] J. Bolot, Andres Vega Garcia, “Control Mechanisms for Packet Audio in the Internet”

[5] A. Kansal, A. Karandikar, “Adaptive Delay Estimation for Low Jitter Audio over Internet”

[6] Sue B. Moon, Jim Kurose, Don Towsley, “Packet Audio Playout Delay Adjustment: Performance Bounds and Algorithms”

[7] N. Shivakumar, C. J. Sreenan, B. Narendran, P. Agrawal, “The Concord Algorithm for Synchronization of Networked Multimedia Streams”

[8] Kouhei Fujimoto, Shingo Ata, Masayuki Murata, “Adaptive Playout Buffer Algorithm for Enhancing Perceived Quality of Streaming Applications”

[9] Kouhei Fujimoto, Shingo Ata, Masayuki Murata, “Playout Control for Streaming Applications by Statistical Delay Analysis”

[10] Minkyong Kim, Brian Noble, “SANE: Stable Agile Network Estimation”

[11] Philip DeLeon, Cormac J. Sreenan, “An Adaptive Predictor for Media Playout Buffering”

[12] Donald L. Stone, Kevin Jeffay, “An Empirical Study of Delay Jitter Management Policies”

[13] Kurt Rothermel, Tobias Helbig, “An Adaptive Stream Synchronization Protocol”

[14] A. Kansal, A. Karandikar, “An Overview of Delay Jitter Control for Packet Audio in IP Telephony”

[15] Nikolaos Laoutaris, Ioannis Stavrakakis, “Intrastream Synchronization for Continuous Media Streams: A Survey of Playout Schedulers”

[16] M. Narbutt, A. Kelly, L. Murphy, “Adaptive VoIP Playout Scheduling: Assessing User Satisfaction”

[17] Cormac J. Sreenan, Jyh-Cheng Chen, “Delay Reduction Techniques for Playout Buffering”

[18] ITU-T G.1020 Recommendation, “Performance parameter definitions for quality of speech and other voiceband applications utilizing IP networks”

[19] RFC 2679, “A One-way Delay Metric for IPPM”

[20] RFC 3393, “IP Packet Delay Variation Metric for IP Performance Metrics (IPPM)”

[21] RFC 5481, “Packet Delay Variation Applicability Statement”

[22] RFC 3551, “RTP Profile for Audio and Video Conferences with Minimal Control”
