Using Java to deal with multicore programming complexity: Part 2 – Migrating legacy C/C++ code to Java

Editor’s Note: In Part 2 of a three-part series on how embedded developers can more effectively exploit multicore Java, Kelvin Nilsen details the factors to consider when deciding to shift from C and C++ to Java, and provides guidelines for migrating legacy code to a Java-based multicore development environment.

Over the past several decades, companies have built up huge legacies of software. Many embedded system industries have seen application size double every 18 to 36 months. Now, many of the companies responsible for maintaining these large legacies face the difficult challenge of modernizing the code to run in multiprocessor environments.

One option: Keep your legacy code as is
Clearly, the path of least resistance is to keep large code legacies as is, and not worry about porting the code to new multiprocessor architectures. Most code originally developed to run as a single non-concurrent thread will run just fine on today’s multiprocessors as long as there is no need to scale performance to more than one processor at a time.

Sometimes you can scale to multiprocessor deployment by running multiple instances of the original application, with each instance running on a different processor. This improves bandwidth by enabling a larger number of transactions to be processed in any given unit of time, but it does not generally improve latency (the time required to process each transaction).

Even if it is possible to preserve large portions of an existing legacy application by replicating the application’s logic across multiple processors, some parallel processing infrastructure usually must be developed to manage the independent replicas. For example, a supervisor activity might need to decide which of the N replicas is responsible for each transaction to be processed. And at a lower level, it may be necessary for all N replicas to share access to a common database to assure that the independent processing activities do not create integrity conflicts.

In general, it is good to reuse time-proven legacy code as much as is feasible. However, in most cases, significant amounts of code must be rewritten and/or added in order to fully exploit the parallel processing capabilities of the more modern hardware. We recommend the use of Java for new code development for several reasons:

  1. In general, Java developers are about twice as productive as C and C++ developers when creating new (sequential) functionality.
  2. During software maintenance and reuse activities, systems implemented in the Java language are maintained at one-fifth to one-tenth the cost of systems implemented in C or C++.
  3. Java has much stronger support than C and C++ for concurrent and parallel processing.
  4. There is a greater abundance of readily reusable, off-the-shelf, multiprocessor-aware software components written in the Java language.
  5. In recent years, it has become much easier to recruit competent Java developers than developers expert in the use of C and C++.

Bug-free Java code
Since Java is designed for concurrent and parallel programming, many off-the-shelf software components and many legacy Java applications are already written to exploit concurrency. If you are lucky enough to inherit a legacy of Java application software, migration to a multiprocessor Java platform may be straightforward.

A word of caution, however: even though Java code may have been written with concurrency in mind, if the code has never been tested on a true multiprocessor platform, it may be that the code is not properly synchronized. This means certain race conditions may manifest on the multiprocessor system even though they did not appear when the same code ran as concurrent threads on a uniprocessor.
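To see how such a latent race can hide on a uniprocessor, consider the following hypothetical fragment (illustrative only, not drawn from any particular application). Without the volatile qualifier or synchronized access, the Java memory model does not guarantee that the writer’s update ever becomes visible to a reader running on another processor:

  // Hypothetical illustration: this code often "works" on a uniprocessor,
  // where preemption eventually lets the reader observe the update, but it
  // is not properly synchronized. On a multiprocessor, the reader may spin
  // forever because 'done' should be declared volatile.
  class StopFlag {
    private boolean done = false;

    void requestStop() {        // invoked by one thread
      done = true;
    }

    void workLoop() {           // run by another thread
      while (!done) {
        // ... perform one unit of work ...
      }
    }
  }

Declaring the field as private volatile boolean done restores correct visibility semantics on both platforms.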

Therefore, budgeting time to test and fix bugs is advisable. As Java code is migrated from uniprocessor to multiprocessor platforms, peer code reviews or inspections may identify and address shortcomings more efficiently than extensive testing. Training your entire staff of uniprocessor-savvy Java programmers on multiprocessor issues may be an investment that pays for itself many times over.

The remainder of this section discusses issues relevant to creating Java code for, or porting existing code to, a multiprocessor environment.

Finding opportunities for parallel execution
To exploit the full bandwidth capabilities of a multiprocessor system, it is necessary to divide the total effort associated with a particular application among multiple, largely independent threads, with at least as many threads as there are processors. Ideally, the application should be divided into independent tasks, each of which only rarely needs to coordinate with the others. For efficient parallel operation, an independent task would execute hundreds of thousands of instructions before synchronizing with other tasks, and each synchronization activity would be simple and short, consuming no more than a thousand instructions at a time. Software engineers should seek to divide the application into as many independent tasks as possible, while maximizing the ratio between independent processing and synchronization activities for each task. To the extent these objectives can be satisfied, the restructured system will scale more effectively to larger numbers of processors.

Serial code is code that cannot be executed in parallel by multiple tasks. As software engineers work to divide an application into independent tasks, they must be careful not to introduce serial behavior among their independent tasks through contention for shared hardware resources. Suppose, for example, that a legacy application has a single thread that reads values from 1,000 input sensors. A programmer might try to parallelize this code by dividing the original task into ten tasks, each reading values from 100 sensors.

Unfortunately, the new program will probably experience very little speedup because the original thread was most likely I/O bound. There is a limit on how fast sensor values can be fetched over the I/O bus, and this limit is almost certainly much slower than the speed of a modern processor. Asking ten processors to do the work previously done by one results in ineffective use of all ten processors rather than just one. Similar I/O bottlenecks might be encountered with file subsystem operations, network communication, interaction with a database server, or even a shared memory bus.

Explore pipelining as a mechanism to introduce parallel behavior into applications that might appear to be highly sequential. See Figure 2 for an example of an application that appears inherently serial.



Figure 2. Soft PLC Implementation of Proportional Integral Derivative (PID) Control

This same algorithm can be effectively scheduled on two processors, as shown in Figure 3.



Figure 3. Two-Processor Pipelined Implementation of Soft PLC

Pipelined execution does not reduce the time required to process any particular set of input values. However, it increases the frequency with which inputs can be processed. In the original sequential version of this code, the control loop ran once every 20 ms. In the two-processor version, the control loop runs every 10 ms.

This particular example is motivated by discussions with an actual customer who found it difficult to maintain a particular update frequency using a sequential version of their software Programmable Logic Controller (PLC). In the two-processor version of this code, one processor repeatedly reads all input values and then calculates new outputs, while the other processor repeatedly logs all input and output values and then writes all output values. This discussion helped the customer understand how they could use multiprocessing to improve update frequencies even though the algorithm itself was inherently sequential.

The two-processor pipeline was suggested on the assumption that the same I/O bus is fully utilized by the input and output activities. If inputs are channeled through a different bus than outputs, or if a common I/O bus has the capacity to handle all inputs and outputs within a 5 ms time slice, then the same algorithm can run as a four-processor pipeline, yielding a 5 ms update frequency, as shown in Figure 4.



Figure 4. Four-Processor Pipelined Implementation of Soft PLC

Remember that a goal of parallel restructuring is to allow each thread to execute independently, with minimal coordination between threads. Note that the independent threads are required to communicate intermediate computational results (the 1,000 input values and 1,000 output values) between the pipeline stages. A naive design might pass each value as an independent Java object, requiring thousands of synchronization activities between each pipeline stage. A much more efficient design would save all of the entries within a large array, and synchronize only once at the end of each stage to pass this array of values to the next stage.
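A minimal sketch of this batched hand-off, using standard Java monitors and hypothetical stage-buffer and array names, might look like the following. Each stage synchronizes only once per cycle, when the full array of values changes hands:

  // Hypothetical single-slot hand-off between adjacent pipeline stages.
  // All 1,000 values travel together in one array, so each stage incurs
  // one synchronization per cycle rather than one per value.
  class StageBuffer {
    private double[] batch;    // null when the slot is empty

    synchronized void put(double[] values) throws InterruptedException {
      while (batch != null) {
        wait();                // previous batch not yet consumed
      }
      batch = values;
      notifyAll();
    }

    synchronized double[] take() throws InterruptedException {
      while (batch == null) {
        wait();                // no batch available yet
      }
      double[] result = batch;
      batch = null;
      notifyAll();
      return result;
    }
  }

The producing stage fills a double[1000], calls put() once, and immediately begins gathering its next batch; the consuming stage blocks in take() until the entire batch is ready.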



Implementing parallel activities

Java’s built-in support for concurrent programming makes implementation of parallel code straightforward. However, there are various issues that require appropriate attention at architecture and design time in order to assure robust and scalable operation of the modernized application.

Thread pools
Many real-time workloads are much more dynamic than the rigidly scheduled software PLC application described above. Events are signaled asynchronously, and a prescribed response must be provided for each event within a specific deadline associated with that event.

Because starting and stopping threads are relatively costly operations, it is preferable to start all threads during application boot time. Each thread iterates, waiting for an appropriate event to occur, providing a response to that event, and repeating the sequence.

In some cases, certain threads may be dedicated as servers capable of responding to a variety of different events, depending on the application’s most urgent requirements. A typical server thread implementation might be represented by the following run() method:

  public void run() {
    WorkAssignment a_task;

    while (true) {
      synchronized (this) {
        while (my_work_assignment == null) {
          try {
            wait();
          } catch (InterruptedException e) {
            // Ignore the interrupt and resume waiting for an assignment.
          }
        }
        a_task = my_work_assignment;
        my_work_assignment = null;
        // Since the server thread is running, the only possible waiters are other
        // threads wanting to assign work to me.
        notify();
      }
      // Do the work after releasing the lock, allowing other threads to offer a new
      // work assignment while I’m completing the previous assignment.
      a_task.doWork();
    }
  }

This code assumes the existence of a WorkAssignment interface with a doWork() method. Elsewhere within this server thread’s class definition, a different method would enable other threads to make work assignments to this server thread. That code might resemble the following:

  public synchronized void assignWork(WorkAssignment wa) {
    while (my_work_assignment != null) {
      try {
        wait();
      } catch (InterruptedException e) {
        // Ignore the interrupt and resume waiting for the slot to empty.
      }
    }
    my_work_assignment = wa;
    // I have just assigned work. It may be that the condition queue holds other threads
    // wanting to assign work in addition to the server thread waiting to receive work. I'll
    // wake them all to be sure that the server gets a notification. This is reasonably
    // efficient under the assumption that the system is balanced so that the server is
    // generally keeping up with all requests; there would not normally be other threads
    // waiting to assign work that would be unproductively notified. Note that any thread
    // that is unproductively notified will simply place itself back on the wait queue.
    notifyAll();
  }
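For completeness, the WorkAssignment interface assumed by this code could be as simple as the following sketch; each task offered to a server encapsulates its work in doWork():

  // Minimal sketch of the assumed WorkAssignment interface.
  public interface WorkAssignment {
    void doWork();    // performs one complete unit of server work
  }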

In order to effectively manage response-time constraints, the recommendation is that threads with shorter deadlines are assigned higher priority. This is known as rate-monotonic assignment of priorities. Analysis of schedulability to satisfy real-time constraints is discussed below.

Note that if a server thread’s priority is assigned in rate-monotonic order, then it is necessary for all of the work tasks assigned to a given thread to have the same relative deadline. Note also that exploiting the full capacity of an N-processor computer may require N running instances of each server thread.
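The following boot-time sketch, assuming a hypothetical ServerThread class implementing the run() and assignWork() pattern shown above, starts one server instance per available processor and assigns each the priority chosen for that server’s deadline class:

  // Hypothetical boot-time setup: one server instance per processor.
  // The priority argument reflects rate-monotonic ordering (shorter
  // deadline, higher priority) and must lie within Thread.MIN_PRIORITY
  // and Thread.MAX_PRIORITY.
  public final class ServerPool {
    public static Thread[] startServers(int priority) {
      int n = Runtime.getRuntime().availableProcessors();
      Thread[] servers = new Thread[n];
      for (int i = 0; i < n; i++) {
        servers[i] = new Thread(new ServerThread(), "server-" + i);
        servers[i].setPriority(priority);
        servers[i].start();
      }
      return servers;
    }
  }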

Deadlock prevention
Deadlock occurs if one thread is waiting for another thread to accomplish a critical task while that other thread is unable to accomplish the critical task because it is waiting (possibly indirectly) for the first thread to perform some other critical activity. It is the software equivalent of two overly polite individuals stepping up to a door, with the first person saying “You first,” and the second responding, “No, after you.” People are able to break this deadlock only by observing other visual cues.

Unfortunately, software tends to focus on one objective at a time, and is totally oblivious to the bigger picture. Thus, two software threads encountering a similar situation might both wait forever for the conflict to be resolved. A symptom of this behavior is that the computer system becomes catatonic, and eventually some human operator decides to reboot it.

One risk when moving code from a uniprocessor to a multiprocessor environment is that deadlocks never seen in the uniprocessor code begin to manifest in the multiprocessor version. This is because certain coordination algorithms that work fine on a single processor do not work reliably in the multiprocessor context.

An example of one such algorithm is a non-blocking priority ceiling emulation (PCE) algorithm for mutual exclusion. This protocol specifies that when a thread enters a particular critical section of code, it immediately raises its priority to the ceiling priority associated with that code. By design, any thread requiring mutually exclusive access to this same critical region of code must have a priority no greater than the ceiling priority. On a uniprocessor system, no semaphores or queues are required to implement the PCE protocol as long as threads holding a priority ceiling emulation “lock” do not suspend execution while holding the lock. This is because a thread holding the lock is known to run at a higher priority than all other threads that may attempt to acquire the lock.

The PCE mechanism also works on a multiprocessor, but a lock and queue may be required in the implementation because there is no guarantee that other threads will not be running in parallel at the same or even lower priorities than this thread. Even though PCE generalizes to a multiprocessor environment, there is a risk that using this protocol on a multiprocessor will result in deadlock even though the same code works reliably on a uniprocessor without deadlock.

Deadlock may occur if one thread uses PCE to lock resource A and a different thread uses PCE to lock resource B. If the first thread subsequently attempts to lock resource B using PCE while still holding its PCE lock on resource A, and the second thread attempts to lock resource A using PCE while still holding its PCE lock on resource B, deadlock results. It would not be possible for this sequence of events to occur on a uniprocessor system, so the uniprocessor implementation is deadlock free.

A common algorithm to avoid deadlock is to require that all threads acquire nested locks in the same order. A PCE practice that assures absence of deadlock even on multiprocessors is to require that inner-nested PCE locks have a higher ceiling than outer-nested PCE locks. Adopting this practice for all PCE code will assure that the code is deadlock-free both on uniprocessor and multiprocessor platforms.
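A minimal sketch of this ordering discipline, using ordinary Java monitors and hypothetical resources A and B, appears below. Every thread that needs both locks acquires them in the same global order, so no cycle of waiting threads can form:

  // Hypothetical illustration of deadlock avoidance through lock ordering.
  // By convention, lockA is always acquired before lockB; under PCE, the
  // inner lock (lockB) would be assigned the higher ceiling priority.
  class OrderedLocks {
    private final Object lockA = new Object();
    private final Object lockB = new Object();

    void updateBoth() {
      synchronized (lockA) {      // outer lock first, always
        synchronized (lockB) {    // inner lock second, always
          // ... manipulate resources A and B together ...
        }
      }
    }
  }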

Data structure considerations
Often, mutual exclusion locks associated with shared data structures surface as the bottleneck that prevents effective parallel execution of multiple threads executing on different processors. If only one processor at a time needs access to a complex data structure, then a single lock at the outermost interfaces to this data structure may be an effective way to ensure the data structure’s integrity. However, if many concurrent threads seek to read and even modify the data structure in parallel, alternative locking mechanisms are necessary. Consider the following common approaches:

1. Reorganize the data structures so that physical proximity correlates with temporal proximity for any given thread. In other words, if a thread’s visit to a particular part of the data structure predicts with high probability that another part of the data structure will also be visited in the very near future, then it is desirable that these two parts of the data structure be near to each other within the in-memory data structure.

2. Restructure the locking protocols associated with the data structure.

  • Rather than a single global lock on the entire data structure, introduce multiple locks, each guarding a particular portion of the data structure. Data that has close physical proximity would be guarded by the same synchronization lock. The objective is to allow different processors to manipulate different parts of the data structure in parallel.
  • To lock each portion of the data structure, consider a locking mechanism that allows many concurrent readers but only one writer at a time (a sketch combining both of these ideas follows this list).

3. Consider replicating certain data structures so that particular threads can access local copies of certain data without any synchronization at all. This might require that algorithms be adjusted to tolerate the possibility that localized data structure copies may hold stale information.
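As a concrete illustration of the second approach, the following sketch (with hypothetical names and sizes) guards each portion, or stripe, of a shared table with its own java.util.concurrent ReadWriteLock, permitting many concurrent readers within a stripe and parallel writers in different stripes:

  import java.util.concurrent.locks.ReadWriteLock;
  import java.util.concurrent.locks.ReentrantReadWriteLock;

  // Hypothetical striped table: each stripe has its own reader/writer
  // lock, so readers proceed concurrently and writers to different
  // stripes never contend with one another.
  class StripedTable {
    private static final int STRIPES = 16;
    private final double[][] data = new double[STRIPES][1024];
    private final ReadWriteLock[] locks = new ReadWriteLock[STRIPES];

    StripedTable() {
      for (int i = 0; i < STRIPES; i++) {
        locks[i] = new ReentrantReadWriteLock();
      }
    }

    double read(int stripe, int index) {
      locks[stripe].readLock().lock();    // shared: many readers at once
      try {
        return data[stripe][index];
      } finally {
        locks[stripe].readLock().unlock();
      }
    }

    void write(int stripe, int index, double value) {
      locks[stripe].writeLock().lock();   // exclusive within this stripe
      try {
        data[stripe][index] = value;
      } finally {
        locks[stripe].writeLock().unlock();
      }
    }
  }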

Part 1: How Java eases multicore hardware demands on software

Part 3: Using Java with C and C++ for real-time multicore designs

Dr. Kelvin Nilsen is Chief Technology Officer over Java at Atego Systems, a mission- and safety-critical solutions provider, where he oversees the design and implementation of the PERC virtual machine and other Atego embedded/real-time oriented products. Prior to joining Atego, Dr. Nilsen served on the faculty of Iowa State University, where he performed seminal research on real-time Java that led to the PERC family of virtual machine products. Dr. Nilsen participates in the Java Community Process as a member of the JSR 282 and JSR 302 expert groups. He holds a B.S. in Physics and M.S. and Ph.D. degrees in Computer Science.
