Saving System States

If the average, non-technical Joe and Josephine are to accept and depend on the Internet, the World Wide Web, and the networked home, these structures must achieve a level of reliability that is now only associated with the physical infrastructure of the freeway system.

That means the technologies that will become important are those that offer engineers the ability to save the entire state of the system and allow it to be retrieved quickly in cases where power is lost, bad code cripples a system, or there is some interference in its operation.

Currently, most of the solutions being considered, while quite sophisticated and effective, are complicated. They must be replaced by a better, simpler solution. I believe that the principle of Occam's Razor applies to engineering solutions as well as scientific theories. That is, just as the most likely scientific explanation for a phenomenon is often the simplest, the best engineering solutions are often the least complex. But none of the current approaches to ensuring reliable operation of connected computers or smart devices is simple, either in concept or implementation.

Let's start with desktop computers. The PC has never been especially reliable; it is subject to regular freeze-ups and crashes. Lost data and lost functionality have been a constant concern in organizations using large numbers of PCs. That concern has grown as 24/7 connectivity becomes the norm.

A number of solutions have been developed, first used within large corporations and just now being deployed by service bureaus to support PC users at home or in remote locations. For example, one vendor has developed a proprietary set of algorithms that allows a desktop connected to an enterprise network to be monitored and information to be collected on its working state. This software identifies and quantifies the state information of every item associated with an application. This information is stored at a safe location out on the network, and if a desktop connected to the network goes down, the software returns it to the last safe operating state. But this saves only the state of the program environment, and data in any file may be lost unless additional software backup mechanisms are used.
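The save-and-restore cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the proprietary algorithm itself: the names `SafeStateStore`, `record`, `recover`, and `crash_and_recover` are hypothetical, and a plain dictionary stands in for an application's state information. The sketch also shows the limitation just noted: file data changed after the last snapshot does not come back.

```python
import copy


class SafeStateStore:
    """Hypothetical stand-in for the safe location out on the network."""

    def __init__(self):
        self._last_safe = None

    def record(self, environment):
        # Deep-copy so later changes on the desktop cannot alter the snapshot.
        self._last_safe = copy.deepcopy(environment)

    def recover(self):
        if self._last_safe is None:
            raise RuntimeError("no safe state has been recorded")
        return copy.deepcopy(self._last_safe)


def crash_and_recover(store, files):
    """Simulate a crash: the environment is restored, unsaved file data is not."""
    environment = store.recover()
    # Assumption: only data already flushed to disk survives the crash.
    surviving_files = {name: f["on_disk"] for name, f in files.items()}
    return environment, surviving_files


store = SafeStateStore()
environment = {"app": "editor", "window": "maximized", "open_file": "report.txt"}
store.record(environment)

# Work continues after the snapshot: the environment drifts and a file is edited.
environment["window"] = "corrupted"
files = {"report.txt": {"on_disk": "draft 1", "in_memory": "draft 2"}}

restored, surviving = crash_and_recover(store, files)
print(restored["window"])       # maximized
print(surviving["report.txt"])  # draft 1
```

The program environment returns to its last recorded state, but the in-memory edit ("draft 2") is gone, which is why a separate backup mechanism for file data is still needed.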

In the servers, routers, and switches that constitute the infrastructure of our new connected computing environment, the situation is even more complex. A lot of attention has been devoted to high availability and fault-tolerant design, at least at the hardware level. Embedded CompactPCI board vendors have targeted telecom and networking with a variety of solutions that provide mechanisms for much more reliable operation of the servers and routers in which they are used.

Operating system vendors now incorporate features that support fault-tolerant operation of hardware. But redundancy is complex and expensive to install and maintain. In the past, it has been confined mostly to applications where safety is critical, such as air traffic control and onboard aircraft electronics, and to the power industry, to ensure that the grid stays in operation.

A variety of hardware and software redundancy techniques are used, but at their core the concept is the same: the use of primary systems and virtually identical backups. If the primary fails, the backup can be called into service. Switching back and forth with few interruptions or hiccups is easy if the data never changes. But in net-centric computing the name of the game is dynamism and change. Moving to a backup without interruption is difficult if data or files change constantly. Even when there is no real problem, just a communications delay, redundant systems can often fail, especially in a complex distributed environment of servers and clients and their associated files, data, and programs. If the system assumes a non-existent computer problem and sends out an update to the backup, the servers and the backups are now no longer identical, and the system goes down anyway.
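The false-alarm scenario in the paragraph above can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not a model of any particular product: `Replica`, `suspected_failed`, and `simulate` are hypothetical names, failure detection is reduced to a single heartbeat timeout, and a one-entry dictionary stands in for the replicated data.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {"balance": 100}  # primary and backup start identical


def suspected_failed(heartbeat_delay, timeout):
    # A timeout cannot tell a crashed primary from a slow network.
    return heartbeat_delay > timeout


def simulate(heartbeat_delay, timeout=2.0):
    primary, backup = Replica("primary"), Replica("backup")
    failed_over = suspected_failed(heartbeat_delay, timeout)
    if failed_over:
        # The backup is promoted and starts accepting updates on its own...
        backup.data["balance"] -= 10
        # ...while the primary, which never crashed, keeps serving other clients.
        primary.data["balance"] += 20
    else:
        # Normal operation: every update on the primary is mirrored to the backup.
        primary.data["balance"] += 20
        backup.data["balance"] = primary.data["balance"]
    return failed_over, primary.data == backup.data


print(simulate(heartbeat_delay=0.5))  # (False, True)
print(simulate(heartbeat_delay=3.0))  # (True, False)
```

In the second run nothing actually failed; a delayed heartbeat alone was enough to leave the two copies out of step.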

New techniques for fault-tolerant operation in a network environment have been developed, such as active replication. In this approach, a distributed system's software establishes redundant copies of vital programs on servers through the use of process groups. These are closely linked sets of programs that cooperate over the network. The most important function of a process group, with regard to a distributed computer system, is that each process group provides a means of sending messages to each of its members. Active replication enables a system to tolerate faults because any group member can handle any request, so if any one machine crashes, work can be redirected to an operational site.
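A toy version of the idea, in Python, may make the mechanism concrete. It is only a sketch: `Member` and `ProcessGroup` are hypothetical names, an in-process loop stands in for the network, and the ordered, reliable message delivery that a real process-group system must guarantee is simply assumed.

```python
class Member:
    """One replica of a vital program; every replica applies every request."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.total = 0

    def handle(self, amount):
        # Deterministic update, so all live replicas stay in the same state.
        self.total += amount
        return self.total


class ProcessGroup:
    def __init__(self, members):
        self.members = list(members)

    def request(self, amount):
        # Send the request to every live member of the group.
        replies = [m.handle(amount) for m in self.members if m.alive]
        if not replies:
            raise RuntimeError("every member of the group has failed")
        # Live replicas are identical, so any one reply can answer the client.
        return replies[0]


group = ProcessGroup([Member("a"), Member("b"), Member("c")])
print(group.request(5))  # 5

group.members[0].alive = False  # machine "a" crashes
print(group.request(2))  # 7
```

The client never sees the crash, because the surviving members already hold the same state. The cost is that every request is processed by every replica, which bears directly on how well the approach scales.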

However, it is not clear to me how robust active replication is or how scalable. It works well on research networks with 50 or so users. But how well does it do on systems with 5,000 users accessing at a time, or, as in the Web environment, hundreds of thousands? And how well does it do in environments where the user load can scale up and down by the tens of thousands over an hour or so?

In terms of smart devices such as embedded Internet devices and information appliances, there is less of a problem of interdependence, except for distributed control network confederation schemes such as Sun's Jini and Microsoft's Universal Plug and Play. Most current approaches involve the use of nonvolatile technologies such as flash memory in combination with SRAM and DRAM to save information when a smart device fails, as well as the use of advanced CMOS technology to decrease power drain, and better batteries to extend operation. More recently, companies such as Microsoft, with its .NET and C# proposals, have sought to shift the application from the device to the server. Instead of selling software for use on a device, such companies would be selling services. But now we are back to the reliability of the network and of the now-critical servers.

I believe it is time for an Occam's Razor solution to emerge. My candidate is ferroelectric devices, based on a nonvolatile semiconductor fabrication technology that saves the entire state of the system, both logic and memory, when a problem occurs and a reboot is necessary. Although it was put on the back burner years ago because it was deemed impractical, it has fast read and write times and low power consumption, which is what allows it to store logic configuration data along with the complete state of a system in cases of system failure.

This technology has not received much of a hearing for a number of reasons. First, traditional nonvolatile technologies such as flash memory have seemed adequate so far and do not require any changes in the fabrication status quo, as ferroelectric would. Second, ferroelectric requires changing the dielectric material, and with it the dielectric constant, in the underlying process. But now, as semiconductor makers move to such things as copper instead of aluminum interconnect, which also requires a change in dielectric, this objection does not seem insurmountable.

While the computer industry may resist such changes, economics and the nature of the net-centric market will dictate the transition. We are facing the same dilemma the scientific and technical community in Germany faced at the end of the 19th and the beginning of the 20th century. Its lead in such areas as metallurgy and chemistry had made it one of the leading industrial powers of the era. But to move further, a better understanding of the underlying physics of metals was needed, especially under temperature stress. However, classical physics, which treated heat and all other radiation as wavelike in nature, was not able to predict or explain something as simple as how the color of a metal changed with increasing temperature. Attempting to solve this problem, physicists such as Max Planck had to reject classical physics for a more modern and accurate model based on the idea of discrete quanta, an idea that had been considered and abandoned earlier as unnecessary and inadequate.

That conceptual shift changed physics. A similar shift in the computer industry's thinking about the most appropriate memory and logic fabrication technologies will have a profound influence on the emergence of the net-centric computing paradigm.

Ferroelectric is not the only possible Occam's Razor solution. I will be talking about others in later columns. I would like to hear your proposals as well. But something must change. The market will demand it.

Reader Feedback

Intel, the world's largest semiconductor company and the largest maker of flash memory, seems to be committing to OUM as the replacement nonvolatile memory technology. STM, probably third in flash, seems to concur, expecting to introduce OUM products in 2004.

Why not OUM rather than FRAM?

Allen Benn
