Using Java to deal with multicore programming complexity: Part 3 – Using Java with C and C++ for real-time multicore designs

Editor’s Note: In this final part in a three-part series, Atego’s Kelvin Nilsen provides guidelines on how to use the combination of the Java parallel programming language and traditional sequential C and C++ methodologies to achieve real-time multicore software operation.

When multiple processors work together to accomplish a particular work assignment, the goal of minimizing coordination activities may conflict with the need to assure that each processor has plenty of worthwhile work to perform. Coordination is often required to check the status of each processor’s ongoing activities, as the work completed by one processor may affect the work assignments given to others.

Individual processors are most efficient if they can focus their efforts on a small set of tasks that remain resident in that processor’s local memory and cache. However, aligning the efforts of each processor with globally important objectives may require frequent redirection of each individual processor’s efforts. Trading off between the productivity of individual processors and the satisfaction of global objectives is an intricate balancing act. This section discusses several alternative approaches to striking this balance, and ends with a discussion of how to use Java in combination with the C and C++ programming languages.

Throughout this series, we have assumed real-time operation is an important consideration. Thus, there is an expectation that priorities are honored by the Java virtual machine (JVM) and underlying operating system. Note that non-real-time virtual machines do not necessarily honor thread priorities and will sometimes schedule a lower priority thread to run even though higher priority threads are ready. Real-time virtual machines are generally configured to schedule in full accordance with thread priorities.

One JVM for each processor
With a four-way multiprocessor, you can start up four independent Java virtual machines (JVMs), each one bound to a different processor. One advantage of this approach is that each JVM sees itself as running on a uniprocessor, and can thus use simpler synchronization protocols. Another advantage is that there is strong isolation of behavior between the code running on the independent processors. Each instance of the JVM has independent memory and CPU budgets, with very little potential for interference from one JVM to the next.
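As a minimal sketch of how such a configuration might be launched on Linux, the following uses the standard taskset(1) utility to pin each JVM to one core. The per-processor entry point Worker is a hypothetical placeholder for your application's main class, and the approach assumes a Linux host:

import java.io.IOException;

public class LaunchPinnedJvms {
    public static void main(String[] args) throws IOException {
        int processors = Runtime.getRuntime().availableProcessors();
        for (int cpu = 0; cpu < processors; cpu++) {
            new ProcessBuilder(
                    "taskset", "-c", Integer.toString(cpu),  // bind this JVM to one core
                    "java", "Worker", Integer.toString(cpu)) // hypothetical per-core main class
                .inheritIO()
                .start();
        }
    }
}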

The key disadvantage of this approach is that it is much more difficult to share information between threads running in distinct JVMs. The typical communication mechanism would use Java Remote Method Invocation (RMI), orders of magnitude more costly than synchronized access to shared objects residing in the same process space. This high cost for information sharing makes it especially difficult to balance load, because this would typically require copying of large amounts of data, and possibly code as well, from one JVM to another.
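To make the cost concrete, here is a minimal sketch of what an RMI-based status query between two JVMs looks like. The names SharedStatus, StatusClient, and the registry key "status" are illustrative, not taken from any particular product; the point is that every call crosses a process boundary and serializes its arguments and results:

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

public interface SharedStatus extends Remote {
    int pendingWorkItems() throws RemoteException;  // marshalled on every call
}

class StatusClient {
    static int query(String host) throws Exception {
        // Look up the remote object exported by another JVM's registry.
        Registry registry = LocateRegistry.getRegistry(host, Registry.REGISTRY_PORT);
        SharedStatus status = (SharedStatus) registry.lookup("status");
        return status.pendingWorkItems();           // a remote call, not a field read
    }
}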

Single JVM with global scheduling of Java threads
An alternative configuration runs a multiprocessor-aware JVM across all – or some subset of – available processors. Individual threads are free to migrate automatically between processors. The JVM uses a global scheduling technique, assuring at all times that the N highest priority threads in the Java application are running on the N available processors.

This approach makes information sharing between threads (and processors) very easy because all objects reside in the same global address space. Load balancing is automatically performed by the global scheduler. The disadvantages of this approach are that (1) the isolation barriers around independent components are less rigid, making the application vulnerable to CPU or memory starvation from misbehaving components; (2) frequent global synchronizations are required to maintain the data structures required to support global scheduling decisions; and (3) scheduling the N highest priority ready threads in the system requires frequent migration of threads from one processor to the next, resulting in poor cache locality.

Single JVM with threads bound to individual processors
Standard edition Java does not offer an API to set the affinity of particular threads to particular processors. However, some virtual machines offer special support for this capability. The idea is that certain threads can be bound to particular processors, thereby improving the cache locality and reducing the burden on a global scheduler. Binding threads to particular processors also improves predictability of scheduling, enabling analysis of schedulability to guarantee that threads will satisfy their real-time constraints.
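Because this support is vendor specific, the following minimal sketch invents a hypothetical ProcessorAffinity class (with an empty stub so the example compiles) purely to show the shape such an API typically takes; consult your JVM vendor's documentation for the actual entry point:

public class BoundWorker {
    public static void main(String[] args) {
        Thread sensorThread = new Thread(() -> {
            // Hypothetical vendor call: pin the current thread to processor 1
            // so its working set stays resident in that processor's cache.
            ProcessorAffinity.bindCurrentThread(1);
            while (true) {
                pollSensors();
            }
        });
        sensorThread.setPriority(Thread.MAX_PRIORITY);
        sensorThread.start();
    }

    private static void pollSensors() { /* application work */ }
}

// Stub standing in for a vendor-specific API; a real implementation
// would make a native call into the JVM or operating system.
final class ProcessorAffinity {
    static void bindCurrentThread(int processor) { }
}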

Since all of the threads are running within the same virtual machine, they all share access to the same objects. This makes information sharing and load balancing straightforward. Note that load balancing must be performed by application software logic and does not happen automatically, unlike the globally scheduled approach.

One difficulty with assigning thread affinity to processors is that this complicates the implementation of priority inheritance. Suppose, for example, that thread α, bound to processor 1 and running at priority 6, holds a lock on resource R. Suppose further that there is a higher priority thread β, also bound to processor 1, running at priority 10. Now suppose that thread γ, running on processor 2 at priority 8, attempts to lock resource R. Since the resource is already locked by a lower priority thread, thread γ endows its priority to thread α, which currently holds the lock. However, raising thread α to priority 8 does not allow it to run, because it is bound to processor 1 and processor 1 is consumed by the demands of thread β, which is running at priority 10.

If you are going to pursue an approach that binds individual threads running within a single JVM to different processors, it is important to know how the virtual machine deals with priority inheritance and take this into consideration when performing schedulability analysis. One JVM approach is to ignore a thread’s specified affinity whenever the thread is running with inherited priority. Another approach is to inherit not only thread priority but also thread affinity. A third approach is for the JVM to strictly honor thread affinity and simply document that priority inheritance does not work reliably when shared synchronization objects are accessed from threads bound to different processors.
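The scenario above can be restated as compilable Java, as in the following minimal sketch. The priorities match the text (alpha = 6, beta = 10, gamma = 8); the processor bindings appear only as comments because standard Java cannot express them, and whether gamma ever acquires the lock in bounded time depends on which of the three inheritance policies the JVM implements:

public class InheritanceScenario {
    static final Object resourceR = new Object();

    public static void main(String[] args) {
        Thread alpha = new Thread(() -> {
            synchronized (resourceR) {           // alpha holds R...
                spin();                          // ...for a long time
            }
        }, "alpha");                             // bound to processor 1 via vendor API
        alpha.setPriority(6);

        Thread beta = new Thread(InheritanceScenario::spin, "beta");
        beta.setPriority(10);                    // also bound to processor 1

        Thread gamma = new Thread(() -> {
            synchronized (resourceR) {           // blocks; alpha should inherit priority 8
                // gamma's critical section
            }
        }, "gamma");
        gamma.setPriority(8);                    // bound to processor 2

        alpha.start();
        beta.start();
        gamma.start();
    }

    private static void spin() {
        for (long i = 0; i < 1_000_000_000L; i++) { }  // stand-in for busy work
    }
}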

Hybrid approaches
Various hybrid approaches are possible, each providing different tradeoffs between strong isolation of resources, ease of information sharing, and support for dynamic load balancing. A uniprocessor JVM may be bound to one processor, while a multiprocessor JVM may be bound to certain other processors. Each multiprocessor JVM may support a combination of global scheduling for certain threads and localized scheduling for threads bound to particular processors. In this sort of hybrid configuration, an unbound thread would be scheduled on any processor for which the highest priority bound ready thread has lower priority than the unbound thread’s priority. Note that allowing unbound threads to freely migrate between processors that are also scheduling bound threads adds complexity to the scheduling analysis performed on each processor. Address this complexity by limiting which processors an unbound thread is allowed to run on, or by assigning lower priorities to unbound threads.

Analysis of real-time constraints

An important activity in the development of any real-time system is analysis of compliance with real-time constraints. This activity, often described as schedulability analysis, requires a good understanding of the CPU time consumed by every real-time task, the frequency with which each task is triggered for execution, the maximum amount of time each task might be blocked contending for access to shared resources, and the deadline for completion of the task’s work measured from the moment it is triggered.

Many thousands of research papers have reported on various aspects of real-time schedulability analysis. One of the most pragmatic and common approaches, especially for software components that are independently developed and maintained, is rate monotonic analysis. Other scheduling techniques are able to extract higher utilization from the CPU, but these other techniques often incur higher run-time overheads, may require greater degrees of integration between the independently developed software components, and most do not deal as gracefully with transient work overloads as does the rate monotonic technique.

In hard real-time systems, which are typically characterized by an expectation that the system provably meets all timing constraints, with no tolerance for the slightest deviation, programmers are expected to theoretically analyze all of the worst-case behaviors of each task and feed these worst-case numbers in as parameters to the schedulability analysis. In soft real-time systems, the expectation is less rigorous. Approximations of task behaviors based on various measurements are accepted as inputs to the schedulability analysis. The expectation is that scheduling analysis based on approximations of task behavior will usually satisfy all deadlines, but may occasionally miss certain deadlines by relatively small amounts of time.

RMA: uniprocessor vs. globally scheduled multiprocessing
Rate monotonic analysis (RMA) is much simpler, and provides higher guaranteed utilization of available CPU resources, on a uniprocessor. The basic approach of rate monotonic analysis is to assign task priorities in inverse order of deadline, so the tasks with the shortest deadlines get the highest priorities.

In the absence of blocking behaviors, rate monotonic analysis promises that if the total CPU utilization of a workload for which task priorities have been assigned in rate monotonic order is less than 69%, all of the tasks will always satisfy their real-time constraints. Utilization is computed by multiplying every task’s worst-case execution time by its maximum execution frequency and summing the results. The analysis is more complex when tasks contend for exclusive access to shared resources. For a full description, see reference [13].
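As an illustration, the following sketch (with invented task figures) computes utilization exactly as described and compares it against the classic bound n(2^(1/n) - 1), which falls toward ln 2, about 69%, as the number of tasks grows:

public class RmaUtilizationCheck {
    public static void main(String[] args) {
        double[] executionTimeMs = { 2.0, 5.0, 10.0 };    // worst-case time per release
        double[] periodMs        = { 10.0, 40.0, 100.0 }; // period = 1 / max frequency

        // Utilization: each task's worst-case execution time times its frequency.
        double utilization = 0.0;
        for (int i = 0; i < periodMs.length; i++) {
            utilization += executionTimeMs[i] / periodMs[i];
        }

        int n = periodMs.length;
        double bound = n * (Math.pow(2.0, 1.0 / n) - 1.0); // about 0.780 for n = 3

        System.out.printf("utilization=%.3f bound=%.3f schedulable=%b%n",
                utilization, bound, utilization <= bound);
    }
}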

When an N-way multiprocessor is configured to always run the N highest priority ready threads in parallel on the N available processors, we call this global scheduling. Rate monotonic theory has been generalized to globally scheduled multiprocessors [14]. An interesting result is that the globally scheduled multiprocessor system needs much more slack than a uniprocessor in order to guarantee compliance with real-time constraints. One form of the result for globally scheduled multiprocessors is given by the following:

Corollary 1. Any periodic task system in which each task’s utilization is no more than one-third and the sum of the utilization of all the tasks is no more than N/3 is successfully scheduled by the Rate Monotonic Scheduling Algorithm upon N unit-capacity processors.

Note that the rate monotonic scheduling guarantee in this context applies only if the total utilization is no more than one-third (33.3%) of the total capacity of all N processors. This is much lower than the 69% utilization that can be effectively scheduled on a uniprocessor. When applying this approach, it is also important to consider that the frequent migration of tasks from one processor to another is likely to increase the expected execution times of individual tasks, compared to repeated execution of the tasks on the same processor, because of the loss of cache locality benefits.

Rate monotonic analysis for locally scheduled multiprocessors

When threads are bound to particular processors, the schedulability analysis for each processor is the same as the uniprocessor version of rate monotonic analysis. The greatest difficulty with this approach is deciding how best to partition threads between the available processors, as this is a form of the bin packing problem, known to be NP-hard. A 1998 paper describes a simple first-fit-decreasing partitioning heuristic and includes an analysis demonstrating that using this heuristic to assign threads to particular processors results in a system that is rate-monotonic schedulable as long as

U ≤ N(√2 − 1)

where U is the total workload utilization and N is the number of processors [15]. Simplifying the math, this approach guarantees to satisfy real-time constraints as long as the total utilization is no greater than 41% of the multiprocessor capacity.
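The heuristic itself is straightforward to sketch: sort tasks by decreasing utilization, then place each task on the first processor where the uniprocessor rate monotonic bound still holds. The following is a minimal sketch of that first-fit-decreasing idea, not the exact analysis in [15]; the utilizations are invented, and a production version would also account for blocking terms:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FirstFitDecreasing {
    public static void main(String[] args) {
        double[] taskUtilizations = { 0.35, 0.30, 0.25, 0.20, 0.15, 0.10 };
        int processors = 3;

        List<List<Double>> bins = new ArrayList<>();
        for (int p = 0; p < processors; p++) bins.add(new ArrayList<>());

        Arrays.sort(taskUtilizations);                            // ascending order...
        for (int i = taskUtilizations.length - 1; i >= 0; i--) {  // ...visited descending
            double u = taskUtilizations[i];
            boolean placed = false;
            for (List<Double> bin : bins) {
                if (fitsRateMonotonic(bin, u)) { bin.add(u); placed = true; break; }
            }
            if (!placed) throw new IllegalStateException("cannot place task: " + u);
        }
        System.out.println(bins);   // per-processor task assignments
    }

    // True if adding u keeps this processor within n(2^(1/n) - 1) for its new task count n.
    static boolean fitsRateMonotonic(List<Double> bin, double u) {
        int n = bin.size() + 1;
        double total = u;
        for (double t : bin) total += t;
        return total <= n * (Math.pow(2.0, 1.0 / n) - 1.0);
    }
}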

Real-time multicore garbage collection
As with uniprocessor real-time garbage collection, the key requirements for robust and reliable operation of real-time garbage collection are that it be

  • preemptible, allowing higher priority tasks to run even when garbage collection is active;
  • incremental, making appropriate forward progress during available increments of CPU time so that it does not need to restart itself whenever it is resumed following preemption;
  • accurate, guaranteeing to find and reclaim all garbage in the system;
  • defragmenting, relocating live objects to contiguous memory locations and coalescing contiguous free segments into larger free segments to improve reliability and allocation efficiency; and
  • paced, to ensure that the garbage collector replenishes the free pool at a rate consistent with the application’s appetite for new memory allocation.

The objective of garbage collection pacing is to ensure that the garbage collector receives enough increments of CPU time to consistently replenish the allocation pool before the available supply of memory is exhausted.
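The arithmetic behind pacing can be sketched in a few lines. This is not the pacing agent itself, and the figures are invented; the point is simply that collection must begin while the time to memory exhaustion still exceeds the worst-case collection time:

public class PacingCheck {
    public static void main(String[] args) {
        double freeBytes        = 64e6; // memory remaining in the allocation pool
        double allocBytesPerSec = 8e6;  // measured application allocation rate
        double gcCycleSeconds   = 5.0;  // worst-case time for a full collection cycle

        double secondsToExhaustion = freeBytes / allocBytesPerSec; // 8 seconds here
        boolean mustStartGc = secondsToExhaustion <= gcCycleSeconds;

        System.out.printf("time to exhaustion: %.1f s, start GC now: %b%n",
                secondsToExhaustion, mustStartGc);
    }
}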


Figure 5: Allocatable Memory vs. Time

Figure 5 shows a simulated air traffic control workload with real-time garbage collection running under the direction of a real-time pacing agent. This particular application is running in a fairly predictable steady state, as characterized by the following observations: First, the slope of the yellow chart, which represents the amount of memory available for allocation, is roughly constant whenever garbage collection is idle. This means the application’s allocation rate is approximately constant. Second, the heights of the yellow chart’s peaks are roughly identical. This means the amount of live memory retained by the application is roughly constant. In other words, the application is allocating new objects at approximately the same rate it is discarding old objects. Finally, the percentage of CPU time required for application processing is well behaved, ranging from about 20% to 50%.

Note that garbage collection is idle most of the time. As memory becomes more scarce, garbage collection begins to run. When garbage collection runs, it interferes with some, but not all, of the real-time application processing. For this reason, you’ll note a slight dip in application processing each time the garbage collector, represented by the occasional red upward spikes, runs. You’ll also note a tendency for application processing to briefly increase following each burst of garbage collection. This is because the preempted application needs to perform a small amount of extra work to catch up with real-time scheduling constraints following each preemption. If properly configured, the pacing agent will carefully avoid delaying the application threads by any more than the allowed scheduling jitter.

With multiprocessor real-time garbage collection, certain additional issues arise. Consider the following:

1. An N-processor system allows N times as much garbage to be created in a given unit of time. Thus, garbage collection for an N-processor system must perform N times as much work as the uniprocessor version.

  • An N-way multiprocessor can accomplish N times as much garbage collection work as an equivalent uniprocessor in the same amount of time only if the garbage collection algorithms scale well to multiprocessor architectures. When migrating from a uniprocessor to a multiprocessor Java virtual machine, it is important to understand the performance characteristics of the multiprocessor garbage collector, as its behavior may not scale linearly in the number of processors.
  • System architects might be tempted to dedicate one processor to garbage collection and the other N-1 processors to application processing. This may work for small numbers of processors but does not scale to large numbers of processors. Suppose, for example, that your uniprocessor application spends 15% of CPU time in garbage collection. On a 4-way multiprocessor, 25% of total CPU time would be available to garbage collection, which should be sufficient. However, on an 8-way multiprocessor, only 12.5% of total CPU time would be available to garbage collection, which is inadequate. Another disadvantage of this approach is that it does not make effective use of the Nth processor during times when garbage collection is idle.

2. Incremental parallel but nonconcurrent real-time garbage collection offers the same benefits on a multiprocessor system as on a uniprocessor. During certain time spans, all N processors are dedicated to application code. During other time spans, all N processors are dedicated to small increments of garbage collection.

3. Fully parallel and concurrent garbage collection offers greater scheduling flexibility than parallel but nonconcurrent real-time garbage collection. In particular, certain processors can be running increments of garbage collection in parallel with execution of application code on other processors. This form of real-time garbage collection is more complex and may impose a larger overhead on the execution of the application threads, because those threads must coordinate at finer granularity with garbage collection activities.

Example: Why Calix migrated from C to multiprocessor Java
Calix is a supplier of simplified services platforms designed to facilitate all aspects of voice, data, and video service delivery to business and residential subscribers for local exchange carriers of all sizes. The company was founded in 1999. By 2007, they had become the number one supplier of DSL broadband loop carriers, accounting for over 41% of DSL/voice ports shipped on broadband loop carriers in North America.

In 2001, Calix faced difficult challenges. Their management plane software, which had been implemented in C, lacked key features required by the market, and the product had a number of significant but elusive bugs. One reason the C management plane had been so difficult to maintain was that the problem domain needed multiple threads, but the C language did not provide support for threading. Calix engineers had emulated multiple threads using cooperative coroutines implemented in C. This code was difficult to understand and maintain. Engineering management struggled to deal with these challenges. Eventually, they chose to rewrite the entire management plane as concurrent Java code.

The Java implementation of the Calix C7 multiservice access platform is designed to support asymmetric multiprocessing, with one copy of the JVM running multiple concurrent threads on each processor. The benefit of multiprocessing in this application is scalability to large numbers of I/O channels. The uniprocessor configuration supports a combined total of 480 plain old telephone system (POTS) and digital subscriber line (DSL) ports. Up to five Calix C7 processors can be combined to support a total of 2,400 ports. Each interconnection of Calix C7 processors serves the needs of one local exchange.

Basic management activities such as provisioning, fault and performance monitoring, and security enforcement require a unified framework capable of providing seamless communication between the management software and the services on the Calix C7 platform. By moving the application framework of its management technology to an object-oriented platform, Calix sped application development and improved software quality. By using Java as the implementation language, Calix simplified and centralized error handling, avoided many memory management issues, and took advantage of a standardized remote debugging interface.

At the start of the migration effort, Calix engineers were not familiar with Java. They learned Java as part of this engineering activity. Specific metrics provided by Calix regarding the conversion are:

  1. Two-fold improvement in software developer productivity compared to the previous implementation in C, even counting the time required to learn Java
  2. Five-fold reduction in code size compared to the previous implementation in C
  3. Fewer bugs in the Java code than in the original C code
  4. More flexibility and generality in the new system, as the Java implementation of the management plane made it easier to add key capabilities required by evolving market requirements

In evaluating their experience with this migration, Calix identified several key factors that eased the migration from C to Java.

1. Portability of Java yielded several distinct benefits

  • Developers could test and debug most of their code on larger, faster desktop workstations (running Windows, Solaris, or Linux) even though final deployment was on a single-board computer running a real-time operating system
  • A breadth of off-the-shelf library code that had never before been tested on their specialized hardware worked out of the box with no porting effort
  • Simplified coordination with outside consultants, who guided development but did not have access to any of their specialized hardware

2. Quality development tools

  • The symbolic cross debugger was essential, with the user interface running on a desktop workstation and the debugger agent running on their single-board computer
  • The remote performance profiler was essential during system tuning and optimization; profiler feedback was especially valuable as it helped educate Calix engineers new to Java development regarding the costs of particular coding practices
  • Calix summarized the importance of access to these development tools by stating that the effort would have failed if they had not had tools to match the capabilities they had used during C development


Conclusion

Multiprocessor architectures offer superior performance and more efficient and scalable use of electric power. The shift by chip vendors to abandon uniprocessor designs in favor of multicore chips is forcing changes to the ways that embedded software systems are architected and designed.

In light of the challenges associated with implementing and maintaining multiprocessor software, the C and C++ languages are showing their age. Since the Java language was designed to simplify engineering of multiprocessor software, it is often preferred for the major software rewrites that are required to take full advantage of modern multiprocessing capabilities. The two-fold improvement in developer productivity and the five- to ten-fold improvement in ease of software reuse and integration are other reasons to prefer Java for major software renovation activities.

Software engineers who are responsible for modernization of existing uniprocessor software legacies need to become familiar with various multiprocessing issues in order to effectively manage the modernization efforts. These issues include topics in information sharing and synchronization, parallel algorithms and data structures, resource partitioning and load balancing, real-time schedulability analysis, and choice of programming language.

Faced with the need to migrate existing legacy software to multiprocessor platforms, in-house engineering teams may benefit from the involvement of external experts who can provide essential training and consulting relating to multiprocessor software.

Part 1: How Java eases multicore hardware demands on software
Part 2: Migrating legacy C/C++ code to Java

Dr. Kelvin Nilsen is Chief Technology Officer over Java at Atego Systems, a mission- and safety-critical solutions provider, where he oversees the design and implementation of the PERC virtual machine and other Atego embedded/real-time oriented products. Prior to joining Atego, Dr. Nilsen served on the faculty of Iowa State University, where he performed seminal research on real-time Java that led to the PERC family of virtual machine products. Dr. Nilsen participates in the Java Community Process as a member of the JSR 282 and JSR 302 expert groups. He holds a B.S. in Physics and M.S. and Ph.D. degrees in Computer Science.

References

[1] Moore, G., “Cramming more components onto integrated circuits”, Electronics, vol. 38, no. 8, April 19, 1965.

[2] “Transistor Count”, Wikipedia. (http://en.wikipedia.org/wiki/Transistor_count)

[3] “Design Considerations for Size, Weight, and Power (SWAP) Constrained Radios”, Presented at the 2006 Software Defined Radio Technical Conference and Product Exposition, Nov. 14, 2006. (http://www.thalescomminc.com/media/DesignforSWAPRadios-SDRTechConference-Nov2006-s.pdf)

[4] Barroso, L. A., “The Price of Performance”, ACM Queue, pp. 48-53, Sept. 2005.

[5] Qi, X., Zhu, D., “Power Management for Real-Time Embedded Systems on Block-Partitioned Multicore Platforms”, Proceedings of the 2008 International Conference on Embedded Software and Systems, pp. 110-117, IEEE, 2008.

[6] “Multicore for Embedded Devices”, TechOnLine, Apr. 30, 2007. (www.techonline.com)

[7] Merritt, R., “Parallel Software Plays Catch-Up with Multicore”, EE Times, June 22, 2009.

[8] Suess, M., “An Interview with Dr. Jay Hoeflinger about Automatic Parallelization”, Aug. 14, 2007. (http://www.thinkingparallel.com/2007/08/14/an-interview-with-dr-jay-hoeflinger-about-automatic-parallelization/)

[9] Amdahl, G., “Validity of the single-processor approach to achieving large scale computing capabilities”, Proceedings of AFIPS Conference, 1967, pp. 483-485.

[10] Krishnaprasad, S., “Uses and Abuses of Amdahl’s Law”, Journal of Computing Sciences in Colleges, Vol. 17, No. 2, Dec. 2001, pp. 288-293.

[11] Goetz, B., Peierls, T., Bloch, J., Bowbeer, J., Holmes, D., and Lea, D., Java Concurrency in Practice, Addison-Wesley, 2006, 403 pages.

[12] IEEE Std 1003.1, 2004 Edition. (http://www.opengroup.org/onlinepubs/009695399/)

[13] Klein, M., Ralya, T., Pollak, B., Obenza, R., Gonzalez Harbour, M., A Practitioner’s Handbook for Real-Time Analysis, Kluwer Academic Publishers, 1993.

[14] Baruah, S. K., Goossens, J., “Rate-Monotonic Scheduling on Uniform Multiprocessors”, IEEE Transactions on Computers, Vol. 52, No. 7, July 2003, pp. 966-970.

[15] Oh, D., Baker, T., “Utilization Bounds for N-Processor Rate Monotone Scheduling with Static Processor Assignment”, Real-Time Systems: The International Journal on Time-Critical Computing, vol. 15, pp. 183-192, 1998.
