The challenges of multi-touch gesture interfaces


The key to understanding multi-touch touch panels is to realize that a touch is not the same thing as a mouse click.

With a multi-touch projected capacitive touch panel, the user interface of an embedded application can be enhanced with gestures such as pinch, zoom, and rotate. True multi-touch panels, that is, panels that return actual coordinates for each individual touch, can support even more advanced features like multiple-person collaboration and gestures made of a combination of touches (for example, one finger touching while another swipes). The different combinations of gestures are limited only by the designer's imagination and the amount of code space. As multi-touch projected capacitive touch panels continue to replace single-touch resistive touch panels in embedded systems, designers of those systems must develop expertise in interfacing with these new panels and in using multi-touch features to enhance applications.

When implementing a touch interface, the most important thing is to keep in mind the way the user will interact with the application. The fastest, most elegant gesture-recognition system will not be appreciated if the user finds the application difficult to understand. The biggest mistake made in designing a touch interface is using the same techniques you would use for a mouse. While a touch panel and a mouse have some similarities, they're very different kinds of input devices. For example, you can move a mouse around the screen and track its position before taking any action. With tracking, it's possible to position the mouse pointer precisely before clicking a button. With a touch interface, the touch itself causes the action.

The touch location isn't as precise as a mouse click. One complication with touch is that it can be difficult to tell exactly where the touch is being reported, since the finger obscures the screen during the touch. Another difference is in the adjacency of active areas: because a mouse is so precise, click targets can be fairly small and immediately adjacent to each other.

With a touch interface, it's helpful to leave space between touch areas to allow for the ambiguity of the touch position. Figure 1 shows some recommended minimum sizes and distances.

Feedback mechanisms need to be tailored to a touch interface to help the user understand what action was taken and why. For example, if the user is trying to touch a button on the screen and the touch position is reported at a location just outside the active button area, the user won't know why the expected action did not occur. In the book Brave NUI World: Designing Natural User Interfaces for Touch and Gesture, Daniel Wigdor and Dennis Wixon suggest several ways to provide feedback so the user can adjust the position and generate the expected action [1]. One example is a translucent ring that appears around the user's finger. When the finger is over an active touch area, the ring might contract, wiggle, change color, or indicate in some other way that the reported finger position is over an active element (Figure 2a). Another option is that the element itself changes when the finger is over it (Figure 2b).

The authors describe several other strategies for designing touch interfaces, including adaptive positioning (which activates the nearest active area to the touch), various feedback mechanisms, and modeling the algorithmic flow of a gesture.
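
Adaptive positioning is simple enough to sketch in a few lines. The code below is my own minimal illustration, not something taken from the book: the TouchTarget type, the FindNearestTarget name, and the snap distance are all assumptions. It picks the active area closest to the reported touch, as long as the touch lands no more than a tunable number of pixels away.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one active touch area (illustration only). */
typedef struct {
    int32_t left, top, right, bottom;   /* inclusive bounds in pixels */
} TouchTarget;

/* Distance from v to the interval [lo, hi]; zero when v lies inside it. */
static int32_t AxisGap(int32_t v, int32_t lo, int32_t hi)
{
    if (v < lo) return lo - v;
    if (v > hi) return v - hi;
    return 0;
}

/* Returns the index of the target nearest to (x, y), or -1 if every target
   is more than maxSnapPixels away. A touch inside a target has distance 0.
   On a tie, the target that appears first in the array wins. */
int FindNearestTarget(const TouchTarget *targets, size_t count,
                      int32_t x, int32_t y, int32_t maxSnapPixels)
{
    int best = -1;
    int32_t bestDistSq = maxSnapPixels * maxSnapPixels + 1;

    for (size_t i = 0; i < count; i++) {
        int32_t dx = AxisGap(x, targets[i].left, targets[i].right);
        int32_t dy = AxisGap(y, targets[i].top, targets[i].bottom);
        int32_t distSq = dx * dx + dy * dy;   /* compare squares, no sqrt needed */
        if (distSq < bestDistSq) {
            bestDistSq = distSq;
            best = (int)i;
        }
    }
    return best;
}
```

A snap distance of roughly half the gap between neighboring targets keeps a touch from ever being pulled toward the wrong one.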

You'll need to consider the capabilities of the touch controller when designing the gestures that will be recognized by the user interface. Some multi-touch controllers report gesture information without coordinates. For example, the controller might send a message saying that a rotation gesture is in progress and the current angle of the rotation is 48°, but it won't reveal the center of the rotation or the location of the touches that are generating the gesture. Other controllers provide gesture messages as well as the actual coordinates, and some controllers provide only the touch coordinates without any gesture information. These last two types are considered “true” multi-touch because they provide the physical coordinates of every touch on the panel regardless of whether a gesture is occurring.

Even if the controller provides gesture information, its interpretation of the gestures may not match the requirements of the user interface. The controller might support only one gesture at a time while the application requires support for three or four simultaneous gestures; or it may define the center of rotation differently from the way you want it defined. Of course no controller is going to automatically recognize gestures that have been invented for an application such as the “one finger touching while another swipes” example given above. As a result, you will often need to implement your own gesture-recognition engine.

A gesture-recognition engine can be a collection of fairly simple algorithms that generates events for touches, drags, and flicks, or it can be a complicated processing system that uses predictive analysis to identify gestures in real time. Gesture engines have been implemented using straight algorithmic processing, fuzzy logic, and even neural networks. The type of gesture-recognition engine is driven by the user interface requirements, available code space, processor speed, and real-time responsiveness. For example, the Canonical Multitouch library for Linux analyzes multiple gesture frames to determine what kinds of gesture patterns are being executed [2]. In the rest of this article I'll focus on a few simple gesture-recognition algorithms that can be implemented with limited resources.

Common gestures
The simplest and most common gestures are touch (and double touch), drag, flick, rotate, and zoom. A single touch, analogous to a click event with a mouse, is defined by the amount of time a touch is active and the amount of movement during the touch. Typical values might be that the touch-down and touch-up events must be less than a half second apart, and the finger cannot move by more than five pixels.

A double touch is a simple extension of the single touch where the second touch must occur within a certain amount of time after the first touch, and the second touch must also follow the same timing and positional requirements as the first touch. Keep in mind that if you are implementing both a single touch and a double touch, the single touch will need an additional timeout to ensure that the user isn't executing a double touch.
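
The single-touch and double-touch rules above fit into a small state machine. What follows is a minimal sketch under stated assumptions: the half-second and five-pixel limits come from the text, while the 300 ms double-touch window, the type names, and the function names are mine. For brevity it measures movement only between the touch-down and touch-up positions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define TAP_MAX_DURATION_MS   500   /* touch-down to touch-up limit, from the text */
#define TAP_MAX_MOVEMENT_PX   5     /* allowed finger movement, from the text */
#define DOUBLE_TAP_WINDOW_MS  300   /* assumed gap allowed between two touches */

typedef enum { GESTURE_NONE, GESTURE_SINGLE_TAP, GESTURE_DOUBLE_TAP } TapGesture;

typedef struct {
    uint32_t downTimeMs;
    int32_t  downX, downY;
    bool     tapPending;      /* one valid tap seen, a second may still follow */
    uint32_t pendingTimeMs;   /* touch-up time of the pending tap */
} TapRecognizer;

void TapOnTouchDown(TapRecognizer *r, uint32_t nowMs, int32_t x, int32_t y)
{
    r->downTimeMs = nowMs;
    r->downX = x;
    r->downY = y;
}

/* Call on touch-up. A completed double tap is reported immediately; a first
   tap is only marked as pending so that TapOnTimer can report it later. */
TapGesture TapOnTouchUp(TapRecognizer *r, uint32_t nowMs, int32_t x, int32_t y)
{
    bool isTap = (nowMs - r->downTimeMs) < TAP_MAX_DURATION_MS &&
                 abs(x - r->downX) <= TAP_MAX_MOVEMENT_PX &&
                 abs(y - r->downY) <= TAP_MAX_MOVEMENT_PX;
    if (!isTap)
        return GESTURE_NONE;

    if (r->tapPending &&
        (r->downTimeMs - r->pendingTimeMs) <= DOUBLE_TAP_WINDOW_MS) {
        r->tapPending = false;
        return GESTURE_DOUBLE_TAP;
    }
    r->tapPending = true;
    r->pendingTimeMs = nowMs;
    return GESTURE_NONE;
}

/* Call periodically (more often than the double-tap window). This is the
   additional timeout: the single tap is reported only after the window for
   a second tap has expired. */
TapGesture TapOnTimer(TapRecognizer *r, uint32_t nowMs)
{
    if (r->tapPending && (nowMs - r->pendingTimeMs) > DOUBLE_TAP_WINDOW_MS) {
        r->tapPending = false;
        return GESTURE_SINGLE_TAP;
    }
    return GESTURE_NONE;
}
```

The cost of supporting both gestures is visible here: every single touch is delayed by the length of the double-touch window. If the application never uses a double touch, report the single touch directly from the touch-up handler.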

While a drag gesture is fairly simple to implement, it's often not needed at the gesture-recognition level. Since the touch controller only reports coordinates when a finger is touching the panel, the application can treat those coordinate reports as a drag. Implementing this at the application level has the added benefit of knowing if the drag occurred over an element that can be dragged. If not, then the touch reports can be ignored or continually analyzed for other events (for example, passing over an element may result in some specific behavior).

A flick is similar to a drag but with a different purpose. A drag event begins when the finger touches the panel and ends when the finger is removed. A flick can continue to generate events after the finger is removed. This can be used to implement the kinds of fast scrolling features common on many cell phones where a list continues to scroll even after the finger is lifted. A flick can be implemented in several ways, with the responsibilities divided between the gesture-recognition layer and the application layer. Before we discuss the different ways to implement flick gestures, let's first focus on how to define a flick.

A flick is generally a fast swipe of the finger across the surface of the touch panel in a single direction. The actual point locations during the flick do not typically matter to the application. The relevant parameters are velocity and direction. To identify a flick, the gesture-recognition layer first needs to determine the velocity of the finger movement. This can be as simple as dividing the distance traveled by the amount of time between the finger-down report and the finger-up report. However, this can slow the response time since the velocity is not determined until after the gesture has finished.

Flick decay
This flick algorithm is based on the standard formula for calculating distance traveled based on velocity and acceleration: $d = vt + \tfrac{1}{2}at^2$. The velocity is the initial velocity of the flick gesture. The acceleration and time are tuned to meet the application's requirements. In this case we use a constant time and determine the rate of deceleration required to complete the decay in that time. A timer is used to generate each of the flick deceleration events. The timer should fire every 10 to 200 milliseconds depending on processor utilization and the required smoothness of the flick. On each timer tick we need to move a certain number of pixels, which decreases with each step. The number of pixels to move on each step is calculated by taking the delta between the total distance of the previous step and the total distance of the next step:

$\Delta = \left(v t_{n+1} + \tfrac{1}{2} a t_{n+1}^2\right) - \left(v t_n + \tfrac{1}{2} a t_n^2\right)$

After simplifying and putting in terms of code variables, we get:

PixelsToMoveThisStep = InitialFlickVelocityInPixelsPerMs * StepTimeInMs
    - Deceleration * (CurrentStepTimeInMs² - PreviousStepTimeInMs²)

Here StepTimeInMs is the timer period ($t_{n+1} - t_n$), and Deceleration stands for $-\tfrac{1}{2}a$, which is a positive number because the acceleration is negative.

The flick processing is divided among three areas of the code:
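
As a rough illustration only, here is a sketch of a constant-time flick decay in floating point. One plausible division, assumed here rather than taken from the original listing, is a start function called when the flick is recognized, a timer-tick function that computes the pixels for each step, and the application code that consumes those pixel counts (not shown). All names and both constants are hypothetical.

```c
#include <stdbool.h>

#define FLICK_DECAY_TIME_MS  1000.0f   /* assumed total decay time */
#define FLICK_STEP_TIME_MS   20.0f     /* assumed timer period */

typedef struct {
    float initialVelocityPxPerMs;   /* speed of the flick; direction is kept elsewhere */
    float deceleration;             /* half the magnitude of the acceleration */
    float previousStepTimeMs;
    bool  active;
} FlickDecay;

/* Area 1: called once when the gesture layer recognizes a flick. */
void FlickDecayStart(FlickDecay *f, float velocityPxPerMs)
{
    f->initialVelocityPxPerMs = velocityPxPerMs;
    /* The velocity reaches zero at time T when a = -v / T, so -a/2 = v / (2T). */
    f->deceleration = velocityPxPerMs / (2.0f * FLICK_DECAY_TIME_MS);
    f->previousStepTimeMs = 0.0f;
    f->active = true;
}

/* Area 2: called on every timer tick. Returns the pixels to move this step,
   or 0 once the decay has finished. */
float FlickDecayStep(FlickDecay *f)
{
    if (!f->active)
        return 0.0f;

    float current = f->previousStepTimeMs + FLICK_STEP_TIME_MS;
    if (current >= FLICK_DECAY_TIME_MS) {
        current = FLICK_DECAY_TIME_MS;   /* clamp the final step */
        f->active = false;
    }

    float stepTime = current - f->previousStepTimeMs;
    float pixels = f->initialVelocityPxPerMs * stepTime
                 - f->deceleration * (current * current -
                       f->previousStepTimeMs * f->previousStepTimeMs);

    f->previousStepTimeMs = current;
    return pixels;
}
```

With these constants a flick of 2 pixels per millisecond moves 39.6 pixels on the first tick, a little less on each tick after that, and about 1,000 pixels in total (half of velocity times decay time).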

This code can be modified to use a constant deceleration (instead of constant time) and to use fixed point instead of floating point. When porting and modifying the code (especially when converting to fixed point), be sure that the variable types are large enough to hold the resulting calculations.

For greater responsiveness, track the velocity as the gesture is occurring. If the velocity is above a certain threshold, a flick has occurred. The direction of the flick can also be determined from the touch-down and touch-up positions using the arctangent function (refer to the sidebar on rotation angles for details). However, this simplistic approach can result in false flicks. The user may draw a very quick circle, which meets the velocity requirements but most certainly should not be interpreted as a flick. To prevent this type of false report, the gesture-recognition engine should determine if all the reported points are in a relatively straight line. If not, the gesture is not a flick. If the algorithm is reporting the velocity in real time, it must include some kind of prequalification of the gesture direction before it starts reporting flick events to the application.

The gesture-recognition engine also needs to decide how to report the direction of the flick. This is driven by the user-interface requirements and the amount of decoupling desired between the gesture-recognition layer and the application. In the most generic version, a flick might report the direction as an angle. Or it could report it in a more application-friendly way such as up, right, left, or down.
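
The velocity threshold, the straight-line check, and the four-way direction report can all be done in integer math. The following is a minimal sketch, assuming a buffer of timestamped points for one touch. The thresholds, type names, and DetectFlick itself are hypothetical, and the speed is measured on the net displacement from the first point to the last.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int32_t x, y; uint32_t timeMs; } TouchPoint;
typedef enum { FLICK_NONE, FLICK_UP, FLICK_RIGHT, FLICK_DOWN, FLICK_LEFT } FlickDirection;

#define FLICK_MIN_SPEED_PX_PER_S  500   /* assumed velocity threshold */
#define FLICK_MAX_DEVIATION_PX    10    /* assumed allowed distance from a straight line */

FlickDirection DetectFlick(const TouchPoint *pts, size_t count)
{
    if (count < 2)
        return FLICK_NONE;

    const TouchPoint *a = &pts[0];
    const TouchPoint *b = &pts[count - 1];
    int32_t dx = b->x - a->x;
    int32_t dy = b->y - a->y;
    uint32_t elapsedMs = b->timeMs - a->timeMs;
    int64_t lenSq = (int64_t)dx * dx + (int64_t)dy * dy;
    if (elapsedMs == 0 || lenSq == 0)
        return FLICK_NONE;

    /* speed >= minimum  <=>  length * 1000 >= minimum * elapsedMs.
       Both sides are squared so no square root is needed. */
    int64_t minTravel = (int64_t)FLICK_MIN_SPEED_PX_PER_S * elapsedMs;
    if (lenSq * 1000000 < minTravel * minTravel)
        return FLICK_NONE;

    /* Distance of a point from the line a-b is |cross| / length, so compare
       cross^2 against deviation^2 * length^2. */
    for (size_t i = 1; i + 1 < count; i++) {
        int64_t cross = (int64_t)dx * (pts[i].y - a->y)
                      - (int64_t)dy * (pts[i].x - a->x);
        if (cross * cross >
            (int64_t)FLICK_MAX_DEVIATION_PX * FLICK_MAX_DEVIATION_PX * lenSq)
            return FLICK_NONE;   /* too curved, for example a quick circle */
    }

    /* Screen Y is assumed to grow downward. */
    if (abs(dx) >= abs(dy))
        return dx > 0 ? FLICK_RIGHT : FLICK_LEFT;
    return dy > 0 ? FLICK_DOWN : FLICK_UP;
}
```

To report flicks while the finger is still moving, the same checks could be run on a sliding window of the most recent points rather than on the whole gesture.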

Rotation angle
The rotation angle is based on the position of the two fingers defining the rotation. The coordinates of the two touches can be used in conjunction with the arctangent function to return an angle in radians:

AngleInRadians = atan((SecondTouch.Y - FirstTouch.Y) / (SecondTouch.X - FirstTouch.X));

The angle can be easily converted to degrees if needed:

AngleInDegrees = AngleInRadians * 180.0 / 3.14159;

There are several issues with the atan function as used above. First, if the two X coordinates are the same then a divide-by-zero error occurs. Second, the atan function cannot determine which quadrant the angle is in and so returns a value between +π/2 (+90°) and -π/2 (-90°). It's better to use the atan2 function, which handles both of these problems and returns a value between +π (+180°) and -π (-180°):

AngleInRadians = atan2(SecondTouch.Y - FirstTouch.Y, SecondTouch.X - FirstTouch.X);

The use of floating-point libraries can be avoided by creating a table of angles based on the ratio of Y and X. This has the added advantage of avoiding the radians-to-degrees conversion. The Y value will need to be multiplied by some factor so that the resulting Y-to-X ratio can be expressed as an integer. The size of the table and the Y multiplier used depends on the required resolution of the angle. A four-entry table is enough to return up, right, down, or left. A table with 360 entries can return the exact angle in whole degrees. Just be sure that the Y multiplier used to calculate the ratio is large enough to give the desired resolution.
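
As one possible shape for such a table, the sketch below resolves the angle to the nearest 15 degrees using six thresholds and a Y multiplier of 1,000. The function name, the table name, and the choice of 15-degree steps are assumptions for illustration. The thresholds are 1000 × tan of 7.5°, 22.5°, 37.5°, 52.5°, 67.5°, and 82.5°, rounded to integers. Instead of dividing Y by X, the code multiplies the threshold by X, which gives the same comparison and sidesteps the divide-by-zero case.

```c
#include <stdint.h>

#define Y_MULTIPLIER 1000   /* scales the Y-to-X ratio so it can be an integer */

/* Lower limit of (|dy| * Y_MULTIPLIER / |dx|) for leaving each 15-degree
   bucket: 1000 * tan(7.5), tan(22.5), ... tan(82.5 degrees), rounded. */
static const int32_t kRatioLimit[6] = { 132, 414, 767, 1303, 2414, 7596 };

/* Returns the angle of the vector (dx, dy) rounded to the nearest 15 degrees,
   in the range 0..345, measured from +X toward +Y. On a panel whose Y axis
   grows downward, that means the angle increases clockwise on screen.
   Assumes coordinate deltas small enough that the products fit in 32 bits. */
int AngleToNearest15Degrees(int32_t dx, int32_t dy)
{
    int32_t ax = dx < 0 ? -dx : dx;
    int32_t ay = dy < 0 ? -dy : dy;
    int bucket = 0;
    int angle;

    if (ax == 0 && ay == 0)
        return 0;   /* no direction; 0 is an arbitrary choice */

    /* Same test as (ay * Y_MULTIPLIER / ax >= limit), without the division. */
    while (bucket < 6 && ay * Y_MULTIPLIER >= kRatioLimit[bucket] * ax)
        bucket++;

    angle = bucket * 15;             /* first-quadrant angle, 0..90 */
    if (dx < 0) angle = 180 - angle; /* mirror into the left half */
    if (dy < 0) angle = 360 - angle; /* mirror into the lower half */
    return angle % 360;
}
```

Finer resolution only needs more thresholds and, if neighboring thresholds start to collide after rounding, a larger multiplier.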

The rotation angle is typically implemented as an offset. Wherever the user first places their fingers becomes the reference angle. As the fingers rotate around the panel, the actual rotation angle is the difference between the reference angle and the current angle.
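
One detail to watch when computing that difference is wraparound: an angle source that runs from -180° to +180° jumps by 360° when the fingers cross the boundary. A minimal sketch of the offset calculation in whole degrees might look like this; the RotationTracker type and the function names are hypothetical.

```c
typedef struct {
    int referenceDegrees;   /* angle between the fingers when the gesture began */
} RotationTracker;

/* Wraps any angle in degrees into the range -179..180. */
static int WrapDegrees(int d)
{
    d %= 360;   /* in C99 the result keeps the sign of d: -359..359 */
    if (d > 180)   d -= 360;
    if (d <= -180) d += 360;
    return d;
}

/* Call when the second finger goes down. */
void RotationBegin(RotationTracker *t, int currentDegrees)
{
    t->referenceDegrees = currentDegrees;
}

/* Call on each report; returns the rotation relative to the starting angle. */
int RotationUpdate(const RotationTracker *t, int currentDegrees)
{
    return WrapDegrees(currentDegrees - t->referenceDegrees);
}
```

For example, if the gesture starts at 170° and the fingers move to -170°, the raw difference is -340°, but the reported rotation is the expected 20°.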

Another typical feature of a flick is inertia. Inertia provides a natural end to a flick operation by, for example, slowing down the scrolling of a list until it comes to a stop. Inertia can be implemented at the application level or the gesture-recognition level. At the gesture level, inertia can be implemented as a series of events with decreasing velocity. This could be implemented with specific inertia events, but it's typically easier to reuse the flick event for the same purpose. The initial flick event includes the direction and velocity of the actual flick gesture. A timer in the gesture-recognition engine then continues to generate flick events using the original direction and a decreasing velocity. The decreasing velocity is generated by an exponential decay function that is tuned to create whatever deceleration profile works best for the application. You can even dynamically control the deceleration profile based on the application state (for example, larger objects might decelerate more quickly than smaller ones, or longer lists might scroll for a longer period of time before decelerating). See the sidebar for one example of a flick-decay algorithm.
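
Reusing the flick event for inertia could look something like the sketch below, which applies a simple exponential decay in fixed point: on every timer tick the velocity is multiplied by 7/8 until it drops below a cutoff. The event layout, the decay ratio, the cutoff, and all names are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define INERTIA_DECAY_NUM     7    /* velocity is multiplied by 7/8 per tick (assumed) */
#define INERTIA_DECAY_DEN     8
#define INERTIA_MIN_VELOCITY  16   /* stop below this, in 1/256 pixel per ms (assumed) */

/* Hypothetical flick event shared by real flicks and inertia. */
typedef struct {
    int32_t velocityQ8;         /* pixels per ms in 24.8 fixed point */
    int16_t directionDegrees;
} FlickEvent;

typedef struct {
    FlickEvent last;
    bool       active;
} InertiaGenerator;

/* Call with the flick event produced by the actual gesture. */
void InertiaStart(InertiaGenerator *g, FlickEvent initial)
{
    g->last = initial;
    g->active = true;
}

/* Timer tick. Writes the next flick event to *out and returns true while
   inertia continues; returns false once the velocity has decayed away. */
bool InertiaTick(InertiaGenerator *g, FlickEvent *out)
{
    if (!g->active)
        return false;

    g->last.velocityQ8 = g->last.velocityQ8 * INERTIA_DECAY_NUM / INERTIA_DECAY_DEN;
    if (g->last.velocityQ8 < INERTIA_MIN_VELOCITY) {
        g->active = false;
        return false;
    }
    *out = g->last;   /* same direction, lower velocity */
    return true;
}
```

Changing the ratio at run time is one way to get the state-dependent profiles mentioned above: a ratio closer to 1 lets a long list coast further, and a smaller ratio stops a heavy object quickly.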

A rotation gesture can occur anytime two or more fingers are touching the panel. In its simplest form, a rotation gesture begins once the second finger is down and ends when there is only one finger still on the touch panel. The center of rotation is typically defined as the midpoint between the fingers. The rotation angle can be reported as a relative angle referenced to the original positions of the two fingers, or it can be reported as an absolute angle referenced to the panel orientation. The gesture engine might even provide both angles and let the application decide which one to use. Some applications might need the actual coordinates of the fingers instead of the angle. This can be useful for sticky tracking of the finger positions. In this mode, the coordinates of each finger act as anchor points for manipulating elements on the user interface. A typical example would be an image-viewing application. When users place their fingers on the image and then move their fingers around, the application modifies the shape and position of the image as necessary to keep the initially-touched pixels directly under the user's fingers. For more information, refer to the sidebar, which discusses how to determine the angle of rotation.

A zoom gesture occurs when the user places two or more fingers on the touch panel and then moves those fingers closer together or further apart. A zoom event might report a positive or negative distance in pixels. As with the drag gesture, it may be easier to implement a zoom gesture directly in the application using the reported touch positions. The application would enter a zoom state whenever two or more fingers are present. Once in the zoom state, the application uses the coordinates of the touches to implement the zoom. This has the added benefit of following the same sticky tracking mentioned above where the pixels under the user's fingers stay with the fingers as they move around the panel.
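
If the gesture engine does report zoom events, the distance in pixels is just the change in finger spread since the gesture began. Here is a minimal two-finger sketch that stays in integer math by using a bitwise integer square root. The names are hypothetical, and the code assumes coordinate deltas small enough that their squares fit in 32 bits.

```c
#include <stdint.h>

typedef struct { int32_t x, y; } Point;

/* Integer square root (rounds down), one result bit per iteration. */
static uint32_t IntSqrt(uint32_t n)
{
    uint32_t root = 0;
    uint32_t bit = 1u << 30;   /* highest power of four that fits in 32 bits */

    while (bit > n)
        bit >>= 2;
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}

/* Distance between the two fingers in whole pixels. */
static int32_t FingerSpread(Point a, Point b)
{
    int32_t dx = b.x - a.x;
    int32_t dy = b.y - a.y;
    return (int32_t)IntSqrt((uint32_t)(dx * dx + dy * dy));
}

typedef struct { int32_t startSpread; } ZoomTracker;

/* Call when the second finger goes down. */
void ZoomBegin(ZoomTracker *z, Point a, Point b)
{
    z->startSpread = FingerSpread(a, b);
}

/* Positive when the fingers have moved apart, negative when closer together. */
int32_t ZoomUpdate(const ZoomTracker *z, Point a, Point b)
{
    return FingerSpread(a, b) - z->startSpread;
}

/* Midpoint between the fingers, usable as the center of the zoom. */
Point ZoomCenter(Point a, Point b)
{
    Point c = { (a.x + b.x) / 2, (a.y + b.y) / 2 };
    return c;
}
```

An application that wants a scale factor rather than a pixel distance can divide the current spread by the starting spread instead of subtracting.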

Combining gestures

In addition to the simple gestures mentioned so far, the application may need to recognize combinations of simple gestures. The most common example is simultaneous rotate and zoom gestures where a user element (like an image or a map) sticks to the user's fingers (Figures 3a and 3b). If this is the most common kind of gesture manipulation used in the application, a gesture-recognition engine might not be needed since it may be easier for the application to just receive the individual touch coordinates and implement the desired functionality.
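
To show how little code the application-side approach can take, here is a sketch that derives a combined move, rotate, and zoom transform directly from the starting and current positions of two fingers. It treats each point as a complex number, so no trigonometric functions are needed. The types and function names are hypothetical, and floating point is assumed to be available.

```c
typedef struct { float x, y; } Vec2;

/* Similarity transform: p -> (a*p.x - b*p.y + tx, b*p.x + a*p.y + ty),
   where a = scale * cos(angle) and b = scale * sin(angle). */
typedef struct { float a, b, tx, ty; } StickyTransform;

/* Computes the transform that moves the two starting touch points exactly
   onto the two current touch points. Returns 0 on success, or -1 if the
   starting points coincide. */
int StickyTransformFromTouches(Vec2 start1, Vec2 start2,
                               Vec2 now1, Vec2 now2,
                               StickyTransform *out)
{
    float sx = start2.x - start1.x, sy = start2.y - start1.y;
    float nx = now2.x - now1.x,     ny = now2.y - now1.y;
    float lenSq = sx * sx + sy * sy;

    if (lenSq == 0.0f)
        return -1;

    /* Complex division (nx + i*ny) / (sx + i*sy) gives rotation and scale. */
    out->a = (nx * sx + ny * sy) / lenSq;
    out->b = (ny * sx - nx * sy) / lenSq;

    /* Translation that puts start1 exactly under now1. */
    out->tx = now1.x - (out->a * start1.x - out->b * start1.y);
    out->ty = now1.y - (out->b * start1.x + out->a * start1.y);
    return 0;
}

/* Maps a point of the element (for example, an image corner). */
Vec2 StickyApply(const StickyTransform *t, Vec2 p)
{
    Vec2 r = { t->a * p.x - t->b * p.y + t->tx,
               t->b * p.x + t->a * p.y + t->ty };
    return r;
}
```

Applying the transform to the element's corners on every touch report keeps the pixels that were first touched under the fingers for as long as both fingers stay down.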

Other combinations can be more complicated, like a zoom with flick. In this scenario, the user is trying to zoom out or in very quickly. Just like a flick, a zoom can have an initial velocity and a deceleration curve, either positive for zooming out or negative for zooming in. Again, the gesture engine requirements are entirely driven by the application.

Another added complication is the need to track multiple gestures simultaneously. This is a typical requirement in collaboration applications where more than one user may be interacting with the application. When tracking multiple gestures, the gesture-recognition engine will need to report some kind of gesture identifier for each event. It also needs to keep track of which physical touches are assigned to each gesture. This information must remain constant until all the gestures have ended.

Tracking multiple gestures raises some difficult issues when determining which touch is associated with which gesture. For example, assume that three rotation gestures are occurring simultaneously. Gesture 1 is using physical touches A and B, gesture 2 is using touches C and D, and gesture 3 is using touches E and F. First touch D is lifted, then touch E is lifted, then a new touch is applied. Should this new touch be associated with gesture 2 or gesture 3, or does it start a new gesture? Is it possible to assign the new touch to a gesture based on proximity (for example, only if it is within 20 pixels of the other touch in a gesture)? The answers to these questions can be simple, or can be based on complicated predictive algorithms tailored to the application. Either way, the gesture engine must explicitly deal with these issues to avoid inconsistent and confusing results for the users.
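
As an example of one of the simple answers, the sketch below assigns each new touch by proximity: it joins the nearest gesture that currently has exactly one finger down within a given radius, and otherwise starts a new gesture. This policy, the two-touch limit per gesture, the table size, and every name here are assumptions for illustration. The gesture's index in the table serves as its identifier.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_GESTURES        4
#define TOUCHES_PER_GESTURE 2

typedef struct { int32_t x, y; bool down; } TouchSlot;

typedef struct {
    TouchSlot touch[TOUCHES_PER_GESTURE];
    bool      inUse;
} Gesture;

static int DownCount(const Gesture *g)
{
    return (int)g->touch[0].down + (int)g->touch[1].down;
}

/* Assigns a new touch at (x, y) and returns its gesture ID, or -1 if the
   table is full. joinRadiusPx is the proximity limit for joining a gesture. */
int AssignTouch(Gesture gestures[MAX_GESTURES], int32_t x, int32_t y,
                int32_t joinRadiusPx)
{
    int best = -1;
    int64_t bestDistSq = (int64_t)joinRadiusPx * joinRadiusPx + 1;

    /* Look for the nearest gesture that is waiting for a second finger. */
    for (int g = 0; g < MAX_GESTURES; g++) {
        if (!gestures[g].inUse || DownCount(&gestures[g]) != 1)
            continue;
        const TouchSlot *other = gestures[g].touch[0].down
                               ? &gestures[g].touch[0] : &gestures[g].touch[1];
        int64_t dx = x - other->x;
        int64_t dy = y - other->y;
        int64_t distSq = dx * dx + dy * dy;
        if (distSq < bestDistSq) {
            bestDistSq = distSq;
            best = g;
        }
    }

    /* Nothing close enough: start a new gesture in the first free entry. */
    if (best < 0) {
        for (int g = 0; g < MAX_GESTURES && best < 0; g++) {
            if (!gestures[g].inUse)
                best = g;
        }
        if (best < 0)
            return -1;
        gestures[best].inUse = true;
        gestures[best].touch[0].down = false;
        gestures[best].touch[1].down = false;
    }

    TouchSlot *slot = gestures[best].touch[0].down
                    ? &gestures[best].touch[1] : &gestures[best].touch[0];
    slot->x = x;
    slot->y = y;
    slot->down = true;
    return best;
}

/* Call when a finger lifts. The gesture keeps its ID until both are up. */
void ReleaseTouch(Gesture *g, int slotIndex)
{
    g->touch[slotIndex].down = false;
    if (DownCount(g) == 0)
        g->inUse = false;
}
```

In the A-through-F scenario, the new touch would join gesture 2 or gesture 3 depending on which remaining finger (C or F) is closer, and would start a fourth gesture if neither is within the radius. The radius has to be at least as large as a comfortable two-finger spread, or a second finger will never join the first.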

Event systems and mediators

Once the required gesture events have been generated, the next issue is how to efficiently get this information to the application. You can set up your own event system, but the difficulty is in deciding what information to pass up to the application. Good software design dictates that there should be very little coupling between the application and the gesture-recognition system. However, some types of gesture-recognition engines can be greatly enhanced by allowing information to flow in both directions across the application/gesture engine boundary.

In the paper “A Framework for Robust and Flexible Handling of Inputs with Uncertainty,” authors Julia Schwarz and Andrew Wilson describe a method of disambiguating gestures based on the various probabilities associated with specific user elements [3]. When the user begins a gesture, the gesture engine begins reporting information to the application. Each element in the user interface determines the probability that the gesture is intended for it based on relative position, the apparent gesture being executed, and the kinds of gestures that the element supports.

For example, if the user is dragging a finger up the screen and the nearest elements are a push button and a slider, then the user is most likely trying to slide the slider. As each element continues to monitor the gesture, it adjusts its own probability. This continues until either one element has a probability of 100% and all the rest are near zero, or until a certain amount of time has elapsed without a winner, in which case the ambiguity is resolved by a mediator. To facilitate this process, some information from the application needs to be communicated to the mediator so that it can resolve the dispute. This mediator would typically be part of the gesture engine so that the same mediator can be used by multiple applications, thus increasing the coupling between the application and the gesture engine.
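
The decision step at the end of that process can be very small. The sketch below is a toy illustration of the idea, not the framework from the paper: each element reports its own probability as a percentage, a clear winner is accepted immediately, and after a timeout a trivial mediator simply picks the most likely element. The thresholds, the tie-breaking rule, and all names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define WIN_PERCENT       90    /* assumed: a clear winner needs at least this */
#define MEDIATE_AFTER_MS  250   /* assumed time before the mediator steps in */

typedef struct {
    int     elementId;
    uint8_t probabilityPercent;   /* the element's own estimate, 0..100 */
} Candidate;

/* Returns the ID of the element that should receive the gesture, or -1 to
   keep monitoring. On a tie, the earliest candidate in the array wins. */
int ResolveGesture(const Candidate *c, size_t count, uint32_t elapsedMs)
{
    size_t best = 0;

    if (count == 0)
        return -1;

    for (size_t i = 1; i < count; i++) {
        if (c[i].probabilityPercent > c[best].probabilityPercent)
            best = i;
    }

    if (c[best].probabilityPercent >= WIN_PERCENT)
        return c[best].elementId;   /* one element is clearly the target */

    if (elapsedMs >= MEDIATE_AFTER_MS)
        return c[best].elementId;   /* mediator: most likely element wins */

    return -1;                      /* still ambiguous, keep waiting */
}
```

A real mediator would likely use more than the raw probabilities, which is exactly the application information that has to cross the boundary into the gesture engine.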

Not a mouse click
As multi-touch touch panels become more common in embedded systems, engineers need to become familiar with how they work, what kinds of information they provide, and how best to integrate them into the application's architecture and user interface. Since the requirements of the gesture-recognition engine heavily depend on the size of the system (such as memory or processor speed) and the user interface requirements, there is no one-size-fits-all gesture-recognition engine. The most important thing to remember when designing the gesture engine is that a touch is not the same thing as a mouse click. While this might seem limiting at first, once you begin to truly analyze the user interface from a touch perspective, you'll discover an entirely new set of exciting ways to communicate with your users. You might even invent a new gesture that has everybody asking, “Why didn't I think of that!”

Tony Gray is principal engineer at Ocular, a provider of capacitive touch screen and display solutions. He has over 20 years of experience designing a variety of embedded systems and has published several articles in Embedded Systems Design. He has a degree in computer engineering from Lehigh University in Bethlehem, PA.

1. Wigdor, Daniel and Dennis Wixon. Brave NUI World: Designing Natural User Interfaces for Touch and Gesture. ISBN 0123822319.
2. Canonical Multitouch,
3. Schwarz, Julia (et al.) and Andrew Wilson. “A Framework for Robust and Flexible Handling of Inputs with Uncertainty,” HCII/Carnegie Mellon and Microsoft. UIST'10, October 3-6, 2010, New York, New York, USA.

Suggested reading:
Cooper, Alan, Robert Reimann, and David Cronin. About Face 3: The Essentials of User Interaction Design. ISBN 0470084111.

Lea, John. “Unity Gesture UI Guidelines,” draft, Google Docs Online,

This article is provided courtesy of Embedded Systems Design Magazine. Copyright © 2011 UBM. All rights reserved.
