Paul Clayton


Paul A. Clayton's contributions

Comments
    • Presumably you meant "segmented", not "paged". Most recent systems with address translation use paging. (PA-RISC [and Itanium] and PowerPC are a bit unusual in using segments to extend the address space, which is then translated at the page level; but bringing up these architectures would increase complexity with little benefit.) While I can understand excluding such from a short introductory article, Single Address Space OSes do not provide separate address spaces to each process, but still provide protection. In addition, even without address translation, a memory protection unit (which is simpler than an MMU--though you may be using MMU in a more generic sense that includes MPUs) can provide permission isolation.

    • Thanks for the information. (I wonder if bit-band regions are more C-friendly. It might be possible for a compiler to infer bit operations [e.g., from "device |= 0x04"], but it would be dangerous to rely on compiler optimization when an I/O device access might have side effects.) Again, thanks for the teaching.
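(For concreteness, a rough sketch of what a peripheral bit-band access looks like on Cortex-M parts that implement bit-banding; the register address and the enable_function() wrapper are made up for illustration, while the region bases and the alias-address formula are the architecturally defined ones.)

```c
#include <stdint.h>

#define PERIPH_BASE     0x40000000UL  /* peripheral region base               */
#define PERIPH_BB_BASE  0x42000000UL  /* peripheral bit-band alias region base */

/* Alias word that maps onto a single bit of a peripheral register:
 * alias = alias_base + (byte offset * 32) + (bit number * 4).        */
#define BITBAND_PERIPH(addr, bit) \
    (*(volatile uint32_t *)(PERIPH_BB_BASE + \
        (((uint32_t)(addr) - PERIPH_BASE) * 32u) + ((uint32_t)(bit) * 4u)))

/* Hypothetical device register address. */
#define DEVICE_REG_ADDR  0x40001000UL

void enable_function(void)
{
    /* A single store to the alias word sets bit 2 (the 0x04 bit);
     * no read-modify-write of the full register is performed.        */
    BITBAND_PERIPH(DEVICE_REG_ADDR, 2) = 1u;

    /* Conventional equivalent, which the compiler would normally emit
     * as a volatile load, OR, volatile store sequence:
     *     *(volatile uint32_t *)DEVICE_REG_ADDR |= 0x04u;
     */
}
```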

    • "The inline keyword is also implemented in many modern C compilers as a language extension," C99 includes 'inline', so for C99 it is not a language extension. Non-inlining of single call site static functions has been used to extract uncommonly executed code from the main path. A better solution would be something like gcc's __builtin_expect()--likely hidden behind a macro--(assuming using actual profile data is impractical), but if the compiler does not support such avoiding inlining would be the next best option.

    • I think "specialdevice.bit[3] = 1;" would be more friendly than "specialdevice |= 0x0004;"; but "specialdevice.function_active = true;" would be better (where "function_active" describes the role of the bit in a way that "true/false" would fit as values). In addition to the gray area of overloading based on appearance (as in the stream interface), there are also cases where an operation might be applied to a single member (or even a member selected based on the type of the other operand). While such may abstract the object in a way that simplifies the code, it may make code after future modifications less clear ("Why do I use 'object += more_foo;' to increase foo-ness but 'object.bar += more_bar' to increase bar-ness?") or confuse the programmer ("What is 'object'? There is 'object += more_foo' but also 'object += more_bar'. Why do I get a compiler error for 'object += 3'?" or worse: "What is the difference between 'object += more_bar', 'object.bar += more_bar', and 'object.later_bar += more_bar'?"). My guess would be that "when in doubt, leave it out" applies in most cases. (I am not a programmer, much less an embedded systems programmer.) If better Unicode support were common, appearance-based overloading would be less useful. E.g., something like "⇐" and "⇒" might have been used for the stream interface.
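(Purely as an illustration of the named-bit idea--the register layout, address, and names below are hypothetical, and since bit-field order and padding are implementation-defined, real register headers pin those details down per compiler.)

```c
#include <stdbool.h>
#include <stdint.h>

typedef union {
    uint16_t raw;                        /* whole-register access                  */
    struct {
        uint16_t reserved_low    : 2;
        uint16_t function_active : 1;    /* named for its role; bit 2, i.e. 0x0004 */
        uint16_t reserved_high   : 13;
    } bits;
} special_device_reg;

/* Hypothetical memory-mapped register. */
#define SPECIALDEVICE (*(volatile special_device_reg *)0x40002000UL)

void start_function(void)
{
    /* Reads as intent rather than as a magic mask value...   */
    SPECIALDEVICE.bits.function_active = true;

    /* ...compared with the mask-based form:
     *     SPECIALDEVICE.raw |= 0x0004;
     */
}
```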

    • Computers in Spaceflight: The NASA Experience is also available for web browsing at http://history.nasa.gov/computers/Compspace.html (I ran into this from searching for information about Voyager's use of data compression, inspired by this question on the new Space Exploration Stack Exchange site: http://space.stackexchange.com/questions/317/what-data-compression-techniques-have-been-successfully-used-in-spacecraft [This site is currently in "private Beta", which means that for a couple of weeks or so only those initially committed to the site can post.].)

    • Moving to simpler cores will tend to run into the parallelism wall. In addition, more cores will mean more interconnect overhead--not as much overhead as the aggressive execution engines in a great big OoO core, but non-zero overhead. (Some multicore chips do have private L2 caches [sharing in L3]. Also, sharing would mainly add a small arbitration delay. With replicated tags, the arbitration delay in accessing data could be overlapped with tag checking. With preferential placement, a local partition of a shared L2 could typically supply data; less timing-critical [e.g., prefetch-friendly] data could be placed in a more remote partition.) It should also be noted that memory bandwidth is a problem. Pin (ball) count has not been scaling as rapidly as transistor count. (Tighter integration can help, but such can also make thermal issues harder to deal with.) While "necessity is the mother of invention", the problems are very difficult (arguably increasingly difficult--earlier, software-defined parallelism was not necessary for a single processor chip to achieve good performance; now, SIMD and multithreading are increasingly important). Even the cost of following the actual Moore's Law (doubling transistor count in a "manufacturable" chip) has been increasing, so economic limits might slow growth in compute performance. In any case, this can be viewed as an opportunity for clever design.

    • Expecting training would also tend to weed out the sub-mediocre employees (either by making them better or by recognizing that they are not willing to learn), which tends to improve productivity and morale. Longer employee retention while excluding those unwilling to learn would tend to increase productivity and morale by increasing cultural coherence and trust. Knowing who to ask, when to ask, and how to ask are important factors in communication (but cannot be covered in a pamphlet or even an organizational wiki). Trust also substantially reduces the need for communication and improves morale (reducing stress, increasing feelings of being valued, and increasing feelings of community).

    • The only way "intrinsically-serial code" can be converted into parallel code is with algorithmic changes--such a drastic rewrite is generally not considered converting. There is much more hope for converting *implicitly* serial code into explicitly parallel code, at least for limited degrees of parallelism. However, for personal computers, it seems doubtful that the benefit of programming for greater parallelism would justify the cost when most programs have "good enough" performance (or are limited by memory or I/O performance). If most code is not aggressively optimized for single-threaded operation, why would the programmers seek to optimize the code for parallel operation? By the way, "Number 4: ARM's BIG.little heterogeneous cores" should, of course, be "Number 4: ARM's big.LITTLE heterogeneous cores"; ARM seems to be emphasizing the smaller core.
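(To make the implicit-versus-intrinsic distinction concrete, a loop such as the one below is only implicitly serial as written; OpenMP is just one convenient way to make the parallelism explicit, and the function and data are invented for illustration.)

```c
#include <stddef.h>

/* Sum of squares: each iteration is independent, so the loop is only
 * implicitly serial; the pragma makes the parallelism explicit.
 * Compiled without OpenMP support, the pragma is ignored and the
 * function runs serially, unchanged.                                 */
double sum_squares(const double *data, size_t n)
{
    double total = 0.0;

#pragma omp parallel for reduction(+ : total)
    for (size_t i = 0; i < n; i++) {
        total += data[i] * data[i];
    }
    return total;
}
```

Of course, this only pays off when the iteration count is large and the program is actually compute-limited--which is exactly the "good enough" performance economics question above.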