Using Java to deal with multicore programming complexity: Part 2 - Migrating legacy C/C++ code to Java
Editor’s Note: In Part 2 of a three-part series on how embedded developers can more effectively exploit the use of multicore Java, Kelvin Nilsen details the factors to consider in deciding to shift from C and C++ to Java, and provides some guidelines for migrating legacy code to a Java-based multicore development environment.
For the past several decades, huge legacies of software have been created. Many embedded system industries have experienced doubling of application size every 18 to 36 months. Now, many of the companies responsible for maintaining these large legacies are faced with the difficult challenge of modernizing the code to run in multiprocessor environments.
One option: Keep your legacy code as is
Clearly, the path of least resistance is to keep large code legacies as is, and not worry about porting the code to new multiprocessor architectures. Most code originally developed to run as a single non-concurrent thread will run just fine on today’s multiprocessors as long as there is no need to scale performance to more than one processor at a time.
Sometimes you can scale to multiprocessor deployment by running multiple instances of the original application, with each instance running on a different processor. This improves bandwidth by enabling a larger number of transactions to be processed in any given unit of time, but it does not generally improve latency (the time required to process each transaction).
Even if it is possible to preserve large portions of an existing legacy application by replicating the application’s logic across multiple processors, some parallel processing infrastructure usually must be developed to manage the independent replicas. For example, a supervisor activity might need to decide which of the N replicas is responsible for each transaction to be processed. And at a lower level, it may be necessary for all N replicas to share access to a common database to assure that the independent processing activities do not create integrity conflicts.
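The supervisor role described above can be sketched in a few lines of Java. This is a minimal illustration, not a recommended architecture: the class name `ReplicaSupervisor`, the method `legacyProcess`, and the routing policy (the same transaction key always goes to the same replica) are all hypothetical. Each single-threaded executor stands in for one replica of the unchanged sequential code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical supervisor that routes each transaction to one of N replicas.
class ReplicaSupervisor {
    private final List<ExecutorService> replicas = new ArrayList<>();

    ReplicaSupervisor(int replicaCount) {
        for (int i = 0; i < replicaCount; i++) {
            // One single-threaded executor stands in for one replica of the legacy logic.
            replicas.add(Executors.newSingleThreadExecutor());
        }
    }

    // Assumed routing policy: a given key is always served by the same replica.
    int replicaFor(int transactionKey) {
        return Math.floorMod(transactionKey, replicas.size());
    }

    // Hands the transaction to its replica and returns immediately.
    CompletableFuture<String> submit(int transactionKey) {
        int index = replicaFor(transactionKey);
        return CompletableFuture.supplyAsync(
                () -> legacyProcess(transactionKey, index), replicas.get(index));
    }

    // Placeholder for the unchanged, sequential legacy code.
    private static String legacyProcess(int transactionKey, int replicaIndex) {
        return "txn " + transactionKey + " handled by replica " + replicaIndex;
    }

    void shutdown() {
        for (ExecutorService replica : replicas) {
            replica.shutdown();
        }
    }

    public static void main(String[] args) {
        ReplicaSupervisor supervisor = new ReplicaSupervisor(4);
        List<CompletableFuture<String>> pending = new ArrayList<>();
        for (int key = 0; key < 8; key++) {
            pending.add(supervisor.submit(key));
        }
        for (CompletableFuture<String> result : pending) {
            System.out.println(result.join());
        }
        supervisor.shutdown();
    }
}
```

Routing by key keeps all work on a given key inside one replica, which may reduce integrity conflicts, but any state that spans keys would still need the shared, properly coordinated storage mentioned above.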
In general, it is good to reuse time-proven legacy code as much as is feasible. However, in most cases, significant amounts of code must be rewritten and/or added in order to fully exploit the parallel processing capabilities of the more modern hardware. We recommend the use of Java for new code development for several reasons:
- In general, Java developers find themselves to be about twice as productive as C and C++ developers during the creation of new (sequential) functionality.
- During software maintenance and reuse activities, systems implemented with the Java language are maintained at one-fifth to one-tenth the cost of systems implemented with C or C++.
- Java has much stronger support than C and C++ for concurrent and parallel processing.
- There is a greater abundance of readily reusable, off-the-shelf multiprocessor aware software components written in the Java language.
- In recent years, it has become much easier to recruit competent Java developers than developers expert in the use of C and C++.
Since Java is designed for concurrent and parallel programming, many off-the-shelf software components and many legacy Java applications are already written to exploit concurrency. If you are lucky enough to inherit a legacy of Java application software, migration to a multiprocessor Java platform may be straightforward.
A word of caution, however: even though Java code may have been written with concurrency in mind, if the code has never been tested on a true multiprocessor platform, it may be that the code is not properly synchronized. This means certain race conditions may manifest on the multiprocessor system even though they did not appear when the same code ran as concurrent threads on a uniprocessor.
Therefore, budgeting time to test and fix bugs is advisable. As Java code is migrated from uniprocessor to multiprocessor platforms, code review or inspections by peers may help identify and address shortcomings more efficiently than extensive testing. Enrolling your entire staff of uniprocessor-savvy Java programmers for training on multiprocessor issues may be an investment that pays for itself many times over as part of the preparation for migrating existing Java code from uniprocessor to multiprocessor platforms.
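As an illustration of the kind of latent race described above, the following sketch increments a shared counter from several threads, once without synchronization and once with `java.util.concurrent.atomic.AtomicInteger`. The class and method names are hypothetical. On a uniprocessor the unsynchronized version may appear to work for a long time; on a multiprocessor it is likely, though not guaranteed, to lose updates.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demonstration of a lost-update race and one way to fix it.
class CounterRace {
    static int unsafeCount = 0;                                  // shared, unsynchronized
    static final AtomicInteger safeCount = new AtomicInteger();  // shared, atomic

    // Starts the given number of threads on the same work and waits for all of them.
    static void runThreads(int threads, Runnable work) {
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(work);
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    // unsafeCount++ is a read-modify-write sequence, so concurrent updates can be lost.
    static int countUnsafely(int threads, int incrementsPerThread) {
        unsafeCount = 0;
        runThreads(threads, () -> {
            for (int i = 0; i < incrementsPerThread; i++) unsafeCount++;
        });
        return unsafeCount;
    }

    // incrementAndGet() performs the whole update atomically.
    static int countSafely(int threads, int incrementsPerThread) {
        safeCount.set(0);
        runThreads(threads, () -> {
            for (int i = 0; i < incrementsPerThread; i++) safeCount.incrementAndGet();
        });
        return safeCount.get();
    }

    public static void main(String[] args) {
        System.out.println("expected: " + (4 * 1_000_000));
        System.out.println("unsafe:   " + countUnsafely(4, 1_000_000));
        System.out.println("safe:     " + countSafely(4, 1_000_000));
    }
}
```

The unsafe total varies from run to run and from machine to machine, which is exactly why such defects tend to escape testing and are often caught more reliably by review.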
The remainder of this section discusses issues relevant to creating Java code for or porting code into a multiprocessor environment.
Finding opportunities for parallel execution
To exploit the full bandwidth capabilities of a multiprocessor system, it is necessary to divide the total effort associated with a particular application between multiple, largely independent threads, with at least as many threads as there are processors. Ideally, the application should be divided into independent tasks, each having the relatively rare need to coordinate with other tasks. For efficient parallel operation, an independent task would execute hundreds of thousands of instructions before synchronizing with other tasks, and each synchronization activity would be simple and short, consuming no more than a thousand instructions at a time. Software engineers should seek to divide the application into as many independent tasks as possible, while maximizing the ratio between independent processing and synchronization activities for each task. To the extent these objectives can be satisfied, the restructured system will scale more effectively to larger numbers of processors.
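The following sketch shows the shape of such a decomposition under simple assumptions: the work is summing an array, each task owns a disjoint slice, and the only coordination is a single join at the end. The class name `PartitionedSum` and the choice of workload are hypothetical; a real application would substitute its own independent units of work.

```java
// Hypothetical example: many independent instructions per task, one synchronization at the end.
class PartitionedSum {
    static long parallelSum(int[] data, int taskCount) {
        long[] partial = new long[taskCount];      // one result slot per task, never shared
        Thread[] workers = new Thread[taskCount];
        int chunk = (data.length + taskCount - 1) / taskCount;

        for (int t = 0; t < taskCount; t++) {
            final int id = t;
            final int from = Math.min(data.length, t * chunk);
            final int to = Math.min(data.length, from + chunk);
            workers[t] = new Thread(() -> {
                long local = 0;                     // thread-private accumulator
                for (int i = from; i < to; i++) {
                    local += data[i];
                }
                partial[id] = local;                // published to the caller by join() below
            });
            workers[t].start();
        }

        long total = 0;
        for (int t = 0; t < taskCount; t++) {
            try {
                workers[t].join();                  // the only synchronization point per task
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i % 10;
        int tasks = Runtime.getRuntime().availableProcessors();
        System.out.println(parallelSum(data, tasks));
    }
}
```

Accumulating into a thread-private variable and writing the shared slot once keeps the ratio of independent work to coordination high, which is the property the paragraph above asks for.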
Serial code is code that multiple tasks cannot execute in parallel. As software engineers work to divide an application into independent tasks, they must be careful not to introduce serial behavior among their independent tasks because of contention for shared hardware resources. Suppose, for example, that a legacy application has a single thread that reads values from 1,000 input sensors. A programmer might try to parallelize this code by dividing the original task into ten, each one reading values from 100 sensors.
Unfortunately, the new program will probably experience very little speedup because the original thread was most likely I/O bound. There is a limit on how fast sensor values can be fetched over the I/O bus, and this limit is almost certainly much slower than the speed of a modern processor. Asking ten processors to do the work previously done by one will result in ineffective use of all ten processors rather than just one. Similar I/O bottlenecks might be encountered with file subsystem operations, network communication, interaction with a database server, or even a shared memory bus.
Explore pipelining as a mechanism to introduce parallel behavior into applications that might appear to be highly sequential. See Figure 2 for an example of an application that appears inherently serial.
This same algorithm can be effectively scheduled on two processors as shown in Figure 3.
Pipelined execution does not reduce the time required to process any particular set of input values. However, it allows inputs to be processed more frequently. In the original sequential version of this code, the control loop ran once every 20 ms. In the two-processor version, the control loop runs every 10 ms.
This particular example is motivated by discussions with an actual customer who was finding it difficult to maintain a particular update frequency using a sequential version of their software Programmable Logic Controller (PLC). In the two-processor version of this code, one processor repeatedly reads all input values and then calculates new outputs, while the other processor repeatedly logs all input and output values and then writes all output values. This discussion helped the customer understand how they could use multiprocessing to improve update frequencies even though the algorithm itself was inherently sequential.
The two-processor pipeline was suggested based on an assumption that the same I/O bus is fully utilized by the input and output activities. If inputs are channeled through a different bus than outputs, or if a common I/O bus has capacity to handle all inputs and outputs within a 5 ms time slice, then the same algorithm can run with a four-processor pipeline, yielding a 5 ms update period, as shown in Figure 4.
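The timing figures quoted for the one-, two-, and four-processor arrangements can be checked with a small calculation. The sketch below assumes that the four activities (read inputs, compute outputs, log, write outputs) each take 5 ms, which is consistent with the 20 ms, 10 ms, and 5 ms figures but is not stated explicitly in the text, and it ignores the cost of handing data between stages. The class and method names are hypothetical.

```java
// Hypothetical timing model: latency is the sum of all stages,
// while the update period is set by the busiest processor.
class PipelineTiming {
    // Time for one set of inputs to pass through every stage.
    static int latencyMs(int[][] stagesPerProcessor) {
        int total = 0;
        for (int[] stages : stagesPerProcessor) {
            for (int ms : stages) total += ms;
        }
        return total;
    }

    // Interval between successive control cycles: the slowest processor limits the rate.
    static int periodMs(int[][] stagesPerProcessor) {
        int slowest = 0;
        for (int[] stages : stagesPerProcessor) {
            int busy = 0;
            for (int ms : stages) busy += ms;
            slowest = Math.max(slowest, busy);
        }
        return slowest;
    }

    public static void main(String[] args) {
        // Assumed stage durations in ms: read, compute, log, write.
        int[][] one = {{5, 5, 5, 5}};
        int[][] two = {{5, 5}, {5, 5}};
        int[][] four = {{5}, {5}, {5}, {5}};
        System.out.println("1 processor:  latency " + latencyMs(one) + " ms, period " + periodMs(one) + " ms");
        System.out.println("2 processors: latency " + latencyMs(two) + " ms, period " + periodMs(two) + " ms");
        System.out.println("4 processors: latency " + latencyMs(four) + " ms, period " + periodMs(four) + " ms");
    }
}
```

Under these assumptions the latency stays at 20 ms in all three cases while the period falls from 20 ms to 10 ms to 5 ms. An unevenly divided pipeline would be limited by its slowest stage, so balancing the stages matters as much as adding processors.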
Remember that a goal of parallel restructuring is to allow each thread to execute independently, with minimal coordination between threads. Note that the independent threads are required to communicate intermediate computational results (the 1,000 input values and 1,000 output values) between the pipeline stages. A naive design might pass each value as an independent Java object, requiring thousands of synchronization activities between each pipeline stage. A much more efficient design would save all of the entries within a large array, and synchronize only once at the end of each stage to pass this array of values to the next stage.
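One way to realize the array-based handoff is with `java.util.concurrent.Exchanger`, which lets two threads swap buffers at a single rendezvous point. The sketch below is an assumption-laden illustration rather than the customer's design: the class name, the placeholder "read and compute" and "log and write" loops, and the use of `double` values are all hypothetical. Each stage owns one 1,000-element array at a time, and the two stages synchronize once per control cycle instead of once per value.

```java
import java.util.concurrent.Exchanger;

// Hypothetical two-stage pipeline that passes a whole array per cycle.
class ArrayHandoffPipeline {
    static final int SENSOR_COUNT = 1000;   // matches the 1,000 values in the example

    // Runs the pipeline for the given number of cycles and returns the sum of
    // all values seen by the second stage (a stand-in for logging and output).
    static double run(int cycles) {
        Exchanger<double[]> handoff = new Exchanger<>();
        double[] result = new double[1];

        Thread stage1 = new Thread(() -> {
            double[] buffer = new double[SENSOR_COUNT];
            try {
                for (int c = 0; c < cycles; c++) {
                    for (int i = 0; i < SENSOR_COUNT; i++) {
                        buffer[i] = c + 1;              // placeholder for read + compute
                    }
                    buffer = handoff.exchange(buffer);  // one synchronization per cycle
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread stage2 = new Thread(() -> {
            double[] buffer = new double[SENSOR_COUNT];
            double sum = 0;
            try {
                for (int c = 0; c < cycles; c++) {
                    buffer = handoff.exchange(buffer);  // take the full array, return the spare
                    for (double v : buffer) sum += v;   // placeholder for log + write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            result[0] = sum;
        });

        stage1.start();
        stage2.start();
        try {
            stage1.join();
            stage2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(run(3));   // prints 6000.0
    }
}
```

Because the buffers are swapped rather than copied, the first stage can fill one array while the second stage drains the other, and no new objects are allocated after startup.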