LINX: an open source IPC for distributed, multicore embedded designsA fact of life in today's embedded systems is the increasingly distributed nature and complexity of the designs: multiple processor nodes on a network, multiple servers in a cluster, multiple processor boards or blades in a system, multiple processors on a board, and multiple CPU cores on a single SOC.
To simplify the programming, resource management and debugging of applications, there has been a recent movement to extend the Inter Process Communications (IPC) protocol—a common element in most operating systems—to better meet the requirements of this new environment. But as yet there is no common IPC standard. Each OS and many hardware platforms incorporate proprietary IPC mechanisms adapted to those environments. Enea, for example, has used the LINKHandler message passing protocol in its OSE and OSEsk since 1993.
To meet the growing need for an open source IPC mechanism, Enea has developed LINX, an extension and enhancement of its proprietary LINKHandler protocol to meet the needs of the many new distributed computing challenges.
LINX is a direct message passing protocol that is offered as an open-source, processor-, OS- and interconnect-independent IPC. LINX is currently available for Linux and Enea's OSE and OSEsk. Support for other platforms will be available later this year. As an open source IPC, vendors can quickly adapt LINX to a variety of hardware and software platforms.
In the Linux implementation there will be two parts of the distribution that will be licensed differently. The API and some of the session layer implementation are linked with user applications, so therefore will be ascribed a modified BSD license " user application code therefore does NOT fall under the restrictions of the Gnu General Public License (GPL). However, the transport protocols part of LINX is linked with the Linux kernel (as a kernel module) and therefore is licensed under the GPL. As a result, some of the implementation of the LINX part in the Linux kernel uses GPL code.
The OSE implementations of LINX do not use GPL code, and therefore do not fall under the GPL. A BSD licensed portable version of LINX is planned.
A key element in the new LINX open source protocol is the use of a unique address mapping model, in which the IPC nodes store only the addresses needed for local connections, and as a result require minimal memory for code and data storage, and allowing for easier and faster on-the-fly reconfiguration. This enables LINX to scale to very large networks with complex cluster topologies (i.e., clusters connected by bridges and gateways), including those containing small-footprint DSP and microcontroller nodes.
Some other IPCs, such as TIPC, use a bit-mapped address model in which the complete system address map must be stored on every node in the system. This approach is memory intensive and complicates reconfiguration (i.e., after a failure, or when nodes are added or deleted), making it difficult to support devices such as DSPs and microcontrollers and to scale beyond simple clusters.
Because it employs a reliable, deterministic, high speed transport for control and data planes over both reliable and unreliable media, and supports encapsulation of other bearer protocols (TCP, UDP, SCTP, and TIPC) for data transport as well, LINX interoperates easily with most other IPC mechanisms.
Direct versus indirect message passing
Aside from its availability as an open source platform independent IPC mechanism, what makes LINC unique is its use of direct message passing. Direct message passing means that tasks running concurrently anywhere in a distributed system can send messages directly to one another, without going through intermedium mechanisms such as mailboxes along the way.
This differs radically from the communication methods used in most traditional RTOSes, where intervening mechanisms such as semaphores, mutexes, event flags, Unix-style signals must be used and messages must be sent to queues or mailboxes that are independent RTOS objects. In such systems, one task sends its message to a message queue; later the receiving task checks the message queue and fetches the message. Fetching can only be performed from the head of the message queue, so if a task needs to work with five different kinds of messages, it will need to work with five different message queues " one per message category.
Many message queues are needed since different kinds of messages must be passed through different queues. These queues are separate RTOS entities from tasks, and there may be a many-to-many relationship between tasks and queues.
Such mechanisms for inter-task communication require much more processing power than direct message passing and are error-prone. They become unwieldy when extended into the realm of distributed and multi-core systems, since synchronizing all the elements of communication between a task on one CPU and a task on another CPU can lead to long delays.
Direct message passing, on the other hand, allows a task to send messages directly to another task. There is no need for semaphores, mutexes, event flags, or Unix-style signals, no need for intervening message queues or mailboxes or for separate message queue IDs. Rather, a message is addressed to the recipient task by referring to it only by its Task ID.
If messages arrive at the recipient task more quickly than they can be handled, the RTOS automatically creates a queue for the recipient task and enqueues the arriving messages in this queue. However, only one queue is needed per task, and the queue is not an independent RTOS object, but a part of the recipient task. So it does not need its own message queue ID, but is referenced by the owner task's task ID. Messages can be fetched from the queue by their owner task either in traditional FIFO order or by message category.
In distributed multi-CPU and parallel processing applications, there are multiple benefits in passing data between tasks on different CPUs without requiring software on each CPU to synchronize with one another. In particular, this ability to categorize messages and fetch them by category is a major strength of the direct message passing, single-queue-per-task model of intertask communication.
The importance of IPC transparency
Transparency is essential for today's distributed systems. It enables applications and application processes running on multiple operating systems and devices to communicate with each other seamlessly, as if they were running on the same processor under the same operating system.
When direct message passing is the primary or only vehicle for inter-process communication, transparent operations are easy to implement because processes can be completely and logically separated from each other - a crucial requirement for high availability or reliability. You need only find the process location in order to send and receive messages. In other models where an external queue is used, you need to find both the process and its associated queue(s) in order to effect transparent operations.
The direct message passing model supports any type of physical interconnect and functions as a common interconnect between multiple nodes of a distributed system; supports and adapts to any protocol; and supports both reliable and unreliable communication links and redundant nodes and links. Thus it is a single software model for inter-CPU communications.
The choice of a direct message passing model is crucial for achieving high availability, reliability, failure isolation and recovery, and delivery transparency. In addition, direct message passing RTOSes provide very high performance and are easy to program, debug and maintain. Multi-CPU systems are easily expanded and distributed if they are initially designed using the direct message passing model, and direct message passing works exceptionally well with hardware memory protection between nodes, ensuring that faults in one node cannot corrupt other nodes.
LINX should be seen as a layer between clients and a data communications interconnect, such as ethernet, RapidIO, shared memory, or even a packet transport protocol such as TCP, UDP, SCTP, etc. LINX provides for reliable transport of messages between LINX clients. An LINX client is called a "Communicator". A Communicator serves as endpoints for all LINX communication.
The architecture of LINX may be described as a layered model that is roughly equivalent in standard terminology, to the Session, Transport, and Link layers of the OSI model, as seen in Figure 1 below. However, any transport level protocol may be "plugged in" to LINX " examples of this are shown in the block diagram, but at the Link Layer. The architecture is a functional model and does not necessarily reflect the detailed implementation.
|Figure 1. Architecture of the LINX layered IPC|
LINX Address Model
All LINX communication endpoints are called 'communicators'. A communicator has both a name and an ID or binary identifier, called a 'communicator ID' or CID. The LINX addressing model at the lowest level is 'flat'. That is, every communicator in a distributed system connected by LINX may be represented on any node in the system. LINX does not make any assumptions concerning the topology of the connected system " the addressing model adapts to any topology.
As shown in Figure 2 below, communicators' names and IDs on a given node are defined at create time; the user specifies the name as input to communicator creation, and the system automatically assigns the CID. Physical links (interconnects like Ethernet, SRIO, etc.) on each node are also given a user-defined name at link establishment time. Remote node communicator names appear as a concatenation of the links or 'link names' that represent the physical path to that communicator.
|Figure 2: LINX address model|
All remote communicators appear as paths relative to that node. Thus there is no master node for address resolution, and all addresses are peers in the network. There may be redundant paths via different physical links for remote communicators " in the LINX addressing model such redundant paths are represented by separate remote CIDs. Additionally, all remote communicators have a private CID on each node, but these are dynamically assigned at connection time.
Link Management Model
In LINX, links are established between peer nodes either automatically at start-up or by manual control. Manual control is especially useful in cluster environments where all possible connections are not necessarily desired. Each physical link must have a unique name across the entire distributed network for the addressing model described above to operate properly. Local link names are established at start-up.
There is a link establishment negotiation protocol by which two peers establish communication with each other. Upon link establishment, the peer link name is published in the local node " i.e., the fact that a peer exists is now available for users, and is an important part of the connection process described below and that may also be used to dynamically discover the complete physical network topology.
A binary 'Link ID' is created that is also associated with remote connection IDs, again described below. The link ID is needed to support multiple or redundant physical links between nodes. At link establishment time, there is a feature negotiation protocol that identifies the supported features of the peer node/link " revision for compatibility, endian-ness, and various other parameters of interest.
Once the link is established, LINX supports a user configurable heart-beat protocol for peer node/link health monitoring; this is called 'Supervision'. If the hardware supports it, there are up-calls into the Link Management service to report link failures. Should a failure occur, there are two reporting mechanisms:
* The failure is reported to the connection management system for tear-down of all connections, and possible recovery actions. The connection management system then reports this failure to the user " see connection model above.
* The failure may be reported directly to a LINX user, if the user registers for this.
Link failure results in a link teardown, which removes all the link registration information from the system. Links may be manually torn down under user control.
LINX Connection Model
The connection model of LINX is completely dynamic " i.e., connections are established at run time. Each communicator publishes its address locally, and may then seek another communicator, either local to its node or remote, by subscribing to the other communicator name. LINX will 'hunt' for the name across the network, and upon finding the other communicator, will notify the searching communicator that it has found the destination, and supplies the full path name and CID.
At this time, if the hunted or 'subscription' communicator is remote, the communicator name and the CID are published in the local node and internally associated with the binary Link ID of the first physical Link towards the remote communicator as described earlier. This is done so that during send operations to the remote CID, the appropriate link for the first hop is selected. In the local communicator, names and CIDs are already published and need no Link ID.
The address subscription process is non-blocking and persistent. That is, the subscription service will reply to the subscriber communicator should the subscription communicator appear in the future " either if that communicator is newly created in the network, or if it appears as a result of a new node appearing in the network.
Connections may be torn down either by the system or under user control. If the connection is remote, then the remote communicator name and CID are removed. Supervision of connections is an important and powerful aspect of the connection model. Users may request supervision of any local or remote connection that has been established.
LINX will notify the requestor if a failure occurs anywhere in the path to the subscribed communicator " either communicator deletion, node failure, or link failure. In addition, LINX provides fault isolation information that locates the exact source of the failure. This in itself provides a powerful fault management feature that works for the entire distributed network. Supervision notifications are in-band, meaning they are integrated into normal message traffic flow for efficiency and performance.
In this model, routes between communicators are established and de-established as part of the subscription and supervision service respectively " there is no separate requirement for updating of routing information in the system. Thus, by definition, routing information in the system always reflects the current state of the system, and is a feature that contributes greatly to the overall fault management and recover support provided in a LINX connected network.
LINX Messaging Model
The LINX 'send' message model employs only two parameters: message (pointer), and destination CID. Each message has a non-unique message ID defined by the user. Since CIDs are binary and are established in the same manner for both local and remote destination communicator addresses, application or user level transparency is achieved. Messages passed between local communicators (communicators on same CPU node) use pointer reference for zero copy transactions, unless the communicators are in separate memory protected segments.
The 'receive' message model supports a concept of 'receiver select', which allows receivers to select the message IDs they wish to receive at a given time. This model provides for a flexible priority messaging schema at the receiver end, and greatly contributes to finite state-machine designs at the receiver. Receivers do NOT have to process messages of a certain type (ID) until ready to do so. Time-outs are available on receive calls as well.
LINX Transport Model
LINX supports multiple pluggable transports under the session layer. In other words, multiple heterogeneous links, both reliable and unreliable, are supported in one framework. The link layer drivers for each physical link are bound to the transport protocol that supports it. LINX also supports both low-level transport protocols such as Ethernet, SRIO, HDLC, and ATM, as well as full transport protocols such as TCP, SCTP, and UDP. Shared memory is a special case.
For transports that have maximum transmission unit sizes (MTU), LINX supports fragmentation and de-fragmentation of large messages. For such transports it also supports congestion management, wherein small messages may be packed into the MTU frame " this is often called 'bundling'.
For unreliable transports, a guaranteed delivery protocol is supported using an efficient and user configurable 'sliding window' protocol. The protocol packet headers contain packet sequence information. Only messages that are not received, as triggered by out-of-sequence receipt, are requested for re-transmit, by supplying a NACK (non-acknowledge) for the missing packet. NACKs are part of the protocol packet headers and so they can be returned to the sender 'in band' for performance and efficiency.
LINX also features a traffic adaptation algorithm that enables optimization of performance for hardware that supports interrupt coalescing. Interrupt coalescing means that the hardware may be set to combine multiple interrupts in high traffic scenarios, since the hardware queues the packets anyway " no need to service an interrupt for every packet.
However, interrupt coalescing is not the optimal setting for low traffic scenarios because it tends to increase latency, i.e., time to respond to the arrival of a packet. LINX automatically conditions the hardware for maximum throughput in high traffic scenarios and minimum latency in low traffic scenarios.
LINX naming service
The LINX naming service extends the basic low-level connection model of LINX to provide name spaces and hierarchical naming domains that are overlaid on the basic flat addressing model. In other words, the naming service uses the underlying basic connection and addressing model. The user may then partition the network in any manner desired to create logical groupings of named addresses for publication and subscription. Again, the LINX naming service maps to any network topology.
LINX application level flow control
It is the philosophy of LINX that flow control is best effected between individual communicators at the application level, rather than at the link layer level. Users may use either standard messaging APIs or the flow control messaging APIs. A simple sent/received token system is used to moderate flow between applications. This model is highly tunable, and has the advantage of working well across multiple hops.
LINX operating environments
The Linux implementation of LINX features an API mapped to a standard socket model. Users may use either the simple and direct LINX API, or the more familiar standard socket model. The following paragraphs describe some details of the Linux implementation.
Both the API model and parts of the session layer model for LINX are born of standard OSE semantics, syntax, and implementation " i.e., OSE and OSEck operating systems natively support most of these services.
However, there is not such a messaging model in Linux, so the basic API and session layer model for LINX have a different implementation on Linux. This section provides an overview of this implementation. This layer is implemented in user space " i.e., not linked with the Linux kernel. Some of the session layer and all of the transport layer protocols (both Transport and Link layers) are linked with the Linux kernel for performance.
LINX interconnect support
LINX supports any interconnect or transport protocol. Important interconnects that are part of the first release of LINX include:
* Gigabit Ethernet. The Gigabit Ethernet transport for the first release of LINX will feature raw Ethernet (dedicated ETH packet number), which supports message fragmentation, bundling, and the sequence/retransmission protocol services of LINX for reliable message delivery. Also, the Ethernet channel will be able to be shared with other non-LINX traffic such as standard IP.
*RapidIO. The RIO transport uses type 10 (doorbell) packets for management and control, and for messaging either type 11 packets (RIO message passing) or the direct I/O (memory access) subsystem. LINX is but one client of the RIO device/switch " other clients may share it as well.
All LINX semantics and syntax are part of the standard model for OSE operating systems and are detailed in the standard OSE documentation for system implementation. In OSE and OSEck, a 'communicator' is an OSE process, and a CID is the process ID (PID). In OSEck there is no supervision service available for connections that it has established. Instead, a non-OSEck external node may supervise connections to communicators on OSEck nodes.
Enea will be adding support for new transport media and hardware platforms throughout 2006, including a higher-level naming service for discovery that enhances scalability; security features such as encryption; system congestion control; support for redundant link pairs; and endian conversion.
Michael Christofferson is director of product management at Enea Embedded Technology, Inc.