Expand your 8051 memory
Suppose for a moment that you're an engineer designing a new 8051-based product. Not unexpectedly, the application's code size will greatly exceed the 64KB architectural limit of the 8051's program memory. Which of the following two methods would you choose to get the needed extra program memory space?
Do you expand the program memory space using a method that wastes a large amount of physical memory; makes some of the 8051 I/O pins unusable; segments the memory into disjointed pages; slows down the execution of the program; and uses extra bytes in the 8051's internal data memory?
Or do you expand the program memory space using a method that employs all available physical memory up to 512KB; uses no 8051 I/O pins; provides a linear, nonsegmented 512KB of program memory; executes the program at full speed with no performance penalty; and uses no internal data memory resources?
Given these two choices you may be surprised to learn that every previous engineer facing this situation has selected the first alternative, which is the conventional bank-switching method. This article describes an approach that shows how the second choice is now possible.
The usual method—bank switching
Figure 1 shows a typical 8051 bank-switching application. Three I/O pins are used as bank select bits BS[2..0] and are gated by address pin A15. A15 is the most significant bit of the 8051's 16-bit Program Counter (PC) as it is output on the address bus. The resulting memory organization is shown in Figure 2.
Figure 1: Typical 8051 bank-switching hardware configuration
Figure 2: Typical 8051 bank-switching memory configuration
Whenever the 8051's PC is between addresses 0x0000 and 0x7FFF, BS[2..0] are forced low and a common area for the program is accessed. In addition to holding the reset and interrupt vectors, this common area contains the extra code that does the special handling to switch banks. Because of the control A15 has over BS[2..0], the common area is always accessed as the lowest 32KB of the physical memory regardless of which bank is selected by the BS pins. Therefore, the lower 32KB of Banks 1 through 7 are unused and unavailable.
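The address mapping just described can be modeled in a few lines. This is only an illustrative sketch: the function names are invented here, and it assumes the BS bits drive A[18..16] of a 512KB memory exactly as in the example above.

```python
BANKS = 8               # selected by BS[2..0]
COMMON_SIZE = 0x8000    # lower 32KB, accessed whenever A15 = 0


def physical_address(bank: int, pc: int) -> int:
    """Map a (bank, 16-bit PC) pair to a 19-bit physical address."""
    if not 0 <= bank < BANKS or not 0 <= pc <= 0xFFFF:
        raise ValueError("bank must be 0..7 and pc must fit in 16 bits")
    a15 = (pc >> 15) & 1
    bs = bank if a15 else 0      # A15 low forces BS[2..0] low
    return (bs << 16) | pc


def unreachable_bytes() -> int:
    """Lower 32KB of banks 1..7 can never be addressed."""
    return (BANKS - 1) * COMMON_SIZE
```

With bank 5 selected, a PC of 0x1234 still lands in the common area at physical 0x01234, while a PC of 0x9234 reaches 0x59234. The unreachable total comes to 224KB.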
When code in one bank calls a routine in a different bank, the call is routed through this common area. The common area is where all the housekeeping tasks needed to switch banks are done. These tasks include:
- Saving the current bank. The I/O port used for bank switching is read and the value on the BS pins is isolated and saved. This is done to capture the current bank so that it can be restored upon returning to the calling routine.
- Switching the BS pins to the bank of the target routine. The BS pins are cleared and then the new bank value is logically OR'd into them.
- Creating a call to a restoration routine. The purpose of this routine is to reverse the actions of the first step and restore the BS bits to the bank of the calling routine. The address of this restoration routine is pushed onto the top of the stack so that it's executed as part of the return sequence.
- And finally, transferring program execution to the target routine.
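The four housekeeping steps can be mimicked with a small software model. The class and method names below are hypothetical, the BS bits are assumed to sit on the low three bits of one port, and a Python `finally` clause stands in for the restoration routine whose address real bank-switching code pushes onto the stack.

```python
class BankedCpu:
    BS_MASK = 0b00000111   # assumed position of BS[2..0] in the port latch

    def __init__(self):
        self.port = 0      # I/O port latch that holds the BS bits
        self.stack = []    # stand-in for the 8051's internal stack

    def set_bank(self, bank):
        self.port &= ~self.BS_MASK & 0xFF   # clear the BS bits
        self.port |= bank & self.BS_MASK    # OR in the new bank value

    def banked_call(self, target_bank, routine):
        saved = self.port & self.BS_MASK    # 1. save the current bank
        self.stack.append(saved)            #    (this costs internal RAM)
        self.set_bank(target_bank)          # 2. switch to the target bank
        try:                                # 3. arrange for restoration
            return routine(self)            # 4. run the target routine
        finally:
            self.set_bank(self.stack.pop()) # restore the caller's bank
```

Every inter-bank call runs all of this extra work, and each level of nesting leaves another saved entry on the stack until it returns.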
While most compilers handle all the convoluted details of switching and restoring banks, the programmer still has to be acutely aware of these details in order to understand the program flow, be able to appreciate the associated performance issues, and especially be able to debug the code. These details are especially daunting for assembly language programmers.
The serious shortcomings of bank switching should be clear from this brief explanation:
- Bank switching wastes a large amount of physical memory. As the example shows, 224KB (seven banks multiplied by 32KB of unavailable memory per bank) of the 512KB memory is unusable for the application's program. Thus nearly 44% of the memory is wasted, which in some applications may force the engineer to design in a larger, more expensive memory chip.
- Bank switching makes some of the 8051 I/O pins unusable. In the example three I/O pins are used for address bits. These become dedicated pins and are taken out of the pool of available I/O pins that can be used by the application. For a typical 40-pin 8051 part using external memory there are only 16 pins available for I/O, many of which have important alternate functions such as external interrupt, serial I/O and read and write strobes for data memory. Having to dedicate three pins for address can force the engineer to reduce the features of the application or choose a more expensive chip with more I/O.
- Bank switching segments the memory into blocks that the programmer has to be extremely sensitive to. In addition to understanding the indirect process of switching banks, the programmer must be aware of the block size and work within its constraints. For example, no program segment can straddle a bank's 32KB boundary. If the program were to execute past the 32KB boundary it would "wrap around" to the bottom of the bank causing the program to malfunction. Additionally, the programmer must be careful of table sizes. Large tables that exceed the bank size must be broken into smaller tables and extra code has to be written to access them properly.
- Bank switching slows down the execution of the program. Every call to another bank penalizes the performance due to the overhead associated with switching banks.
- Bank switching consumes bytes of internal data memory. The 8051 has a maximum of only 256 bytes of internal data memory. This memory includes the four register banks, the 16 bytes of bit-addressable memory, the stack, and variable memory. Bank switching uses up to 4 bytes of this internal data memory for every call that switches banks. Programmers have to craft their programs to limit the depth of any nested inter-bank calls in order to minimize this memory consumption.
Now that we've seen bank switching and the problems associated with it, let's look at a new method for expanding the 8051's program memory that provides a linear, nonsegmented memory without any wasted memory. This method allows direct jumps and calls over the whole expanded memory space without any performance penalty and without consuming any internal data memory. And, it uses no I/O pins.
This technique can be used with any 8051-based chip that accesses external memory, from the very first 8051s produced by Intel in 1980 to the latest 8051-compatible chips just off the fab line.
I'll call this new method Program Memory Expander for 8051, or PME-51 for short. The PME-51 eliminates all the disadvantages of bank switching while expanding the 8051 program memory space up to 512KB. As Figure 3 shows, the PME-51 is a logic block that sits between the 8051 and program memory. It internally generates the three extra address bits, A[18..16] that are directly applied to the program memory.
Figure 3: PME-51 system diagram
The PME-51 provides a true linear, unsegmented 512KB program memory space without using any I/O pins or internal memory resources and without any performance penalty. In order for the PME-51 to accomplish this it must be able to satisfy the following conditions:
- The three extra address bits, A[18..16], and the 8051's PC must be directly loadable by a single instruction so that the program can directly jump to any address or call a subroutine at any address in the 512KB from any other address in the 512KB.
- A[18..16] must be saved on subroutine calls and interrupts and restored on returns.
- Any byte of a multibyte instruction must be allowed to be located anywhere in the 512KB without regard to any 64KB boundary. Any instruction straddling a 64KB boundary must execute normally.
- All relative jump instructions must be able to execute both forward and backward jumps across any 64KB boundary in the 512KB space without affecting program execution.
So how does the PME-51 generate a 19-bit address to allow a program to directly jump or call any address in the 512KB memory in one instruction? We'll save that detail for later, but first a little background information on the 8051 jump and subroutine call instructions will help you fully understand the techniques involved.
The 8051 instruction set includes two types of direct-jump instructions. One type is called the Long Jump instruction, which has the mnemonic LJMP. As Figure 4 shows, the LJMP instruction is a three-byte instruction with the first byte being the opcode byte. The second byte is the high byte of the target address, and the third byte is the low byte of the target address. Thus, the LJMP instruction provides a 16-bit target address. During execution of an LJMP, the 8051 replaces the contents of its 16-bit PC with the second and third bytes of the LJMP instruction. This provides the ability to jump anywhere in a 64KB program memory.
Figure 4: LJMP instruction format
Figure 5: AJMP instruction format
Figure 6: FJMP instruction format
The second type of direct-jump instruction is called the Absolute Jump, which has the mnemonic AJMP. As shown in Figure 5, AJMP is a two-byte instruction with the first byte being the opcode byte and the second byte being the lower byte of the target address. An unusual aspect of the AJMP instruction is that three bits of the target address are embedded as the most significant bits of the opcode byte. The 8051 only uses the lower five bits (00001) to decode the instruction as an AJMP.
The AJMP instruction therefore provides an 11-bit target address with three bits taken from the first byte and eight bits from the second byte. During execution of an AJMP, the 8051 replaces the 11 least significant bits of its PC with the 11-bit target address from the AJMP instruction. This allows AJMP to jump anywhere within a 2KB "page" of program memory.
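To make the two address formats concrete, here is a short sketch of how each instruction's target could be computed. The function names are illustrative only. For AJMP, `pc_after` is taken to be the PC value after the two instruction bytes have been fetched, which is the standard 8051 behavior.

```python
LJMP_OPCODE = 0x02


def ljmp_target(instr: bytes) -> int:
    """16-bit target: second byte is the high byte, third is the low byte."""
    opcode, high, low = instr
    if opcode != LJMP_OPCODE:
        raise ValueError("not an LJMP")
    return (high << 8) | low


def ajmp_target(instr: bytes, pc_after: int) -> int:
    """11-bit target: three bits from the opcode, eight from the operand."""
    opcode, low = instr
    if opcode & 0x1F != 0x01:       # lower five bits 00001 identify AJMP
        raise ValueError("not an AJMP")
    addr11 = ((opcode >> 5) << 8) | low
    return (pc_after & 0xF800) | addr11   # upper five PC bits are kept
```

For instance, the bytes 0xA1, 0x23 fetched from 0x1800 (so `pc_after` is 0x1802) jump to 0x1D23, which stays inside the same 2KB page.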
To enable jumps across the full 512KB memory space, PME-51 creates a new hybrid instruction that combines AJMP and LJMP. It uses the AJMP opcode with the LJMP's two bytes of address. This hybrid instruction is called Far Jump and is given the mnemonic of FJMP. Figure 6 shows the layout of the FJMP instruction.
Of course, this new hybrid FJMP instruction is an undefined instruction to any real 8051. If it were allowed to pass through to the 8051, the program would malfunction by jumping to an unintended address. So the PME-51 translates the instruction on behalf of the 8051. When the first byte of the FJMP instruction is read from the program memory, the PME-51 detects that it's an AJMP opcode and takes the following three actions:
- It takes the three bits of embedded address from the AJMP opcode and loads them into a holding register.
- It blocks the AJMP opcode and instead outputs an LJMP opcode (0x02) onto the 8051 data bus. The 8051, upon reading and decoding this LJMP opcode, will read the remaining two bytes of the instruction and load them into its 16-bit PC.
- The PME-51, upon detecting the end of the instruction, will transfer the contents of the holding register to its eXtended Address Register (XAR) and output these three bits as address bits A[18..16]. These address bits form the three most significant bits of the desired 19-bit target address, while the 8051 outputs the 16 least significant bits.
For example, given the source line
FJMP 54321h ;jump to address 0x54321
the assembler would take the specified 19-bit target address and put the three most significant bits in an AJMP opcode byte, the next eight most significant bits in the second byte of the instruction, and the eight least significant bits in the third byte. Figure 7 shows how the resulting instruction would appear in memory.
Figure 7: FJMP 0x54321 instruction encoding
Figure 8: FCALL instruction format
When this FJMP instruction is read from memory the three bus cycles will fetch the bytes 0xA1, 0x43, 0x21. The PME-51 detects that the 0xA1 opcode byte is an AJMP opcode and translates it into a 0x02 LJMP opcode byte before outputting it to the 8051. Thus after passing through the PME-51, the byte sequence received by the 8051 is 0x02, 0x43, 0x21. Upon decoding the LJMP instruction, the 8051 reads the next two bytes, 0x43 and 0x21, and loads them into its PC. The PME-51 uses the three embedded address bits, 0x5, in the 0xA1 opcode byte and loads them into its XAR. The next instruction is fetched from the address 0x54321—a direct jump to a 19-bit address in one instruction!
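The assembler-side encoding in this example can be checked with a small sketch. The helper names are made up for illustration and do not correspond to any existing assembler or tool.

```python
def encode_fjmp(target: int) -> bytes:
    """Pack a 19-bit target into AJMP-style opcode plus two address bytes."""
    if not 0 <= target <= 0x7FFFF:
        raise ValueError("target must fit in 19 bits")
    opcode = ((target >> 16) << 5) | 0x01    # A[18..16] in the top 3 bits
    return bytes([opcode, (target >> 8) & 0xFF, target & 0xFF])


def decode_fjmp(instr: bytes) -> int:
    """Recover the 19-bit target from the three instruction bytes."""
    opcode, high, low = instr
    if opcode & 0x1F != 0x01:
        raise ValueError("not an FJMP (AJMP-style) opcode")
    return ((opcode >> 5) << 16) | (high << 8) | low
```

Encoding 0x54321 this way gives the bytes 0xA1, 0x43, 0x21, matching the sequence described above, and decoding them returns the original address.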
Since we can now directly jump to any address in the 512KB, it should come as no surprise that direct 512KB subroutine calls are done similarly. As in the direct jump case, the 8051 contains two direct call instructions: the 16-bit long call (LCALL) and the 11-bit absolute call (ACALL). When one of these instructions is executed, the 8051 pushes its 16-bit PC onto its hardware stack and loads the PC with either a new 16-bit value (LCALL) or 11-bit value (ACALL).
Here again the PME-51 combines these two instructions into a Far Call (FCALL) hybrid instruction that creates a 19-bit target address so direct calls can be made to anywhere in the 512KB memory space. The FCALL instruction format is shown in Figure 8.
Additionally, the PME-51 contains a small amount of RAM (192 bits) as its own extended-address hardware stack to store the values of the XAR for all executed FCALL instructions. As in the case of the FJMP instruction, the FCALL instruction is translated by the PME-51. When the processor reads the first byte of the FCALL instruction from program memory, the PME-51 detects that it is an ACALL opcode and takes the following four actions:
- It takes the three bits of embedded address from the ACALL opcode byte and loads them into a holding register.
- It blocks the ACALL opcode and instead outputs an LCALL opcode (0x12) onto the 8051's data bus. The 8051, upon reading and decoding the LCALL opcode, will read the remaining two bytes of the instruction and load them into its 16-bit PC after pushing the current contents of the PC onto its hardware stack.
- After the 8051 has read the remaining two bytes, the PME-51 will take the contents of its XAR and push these three bits onto its own hardware stack.
- The PME-51, upon detecting the end of the instruction, will transfer the contents of the holding register to its XAR and output it as A[18..16]. These three extended address bits form the three most significant bits, and the 8051 outputs the 16 least significant bits of the new 19-bit subroutine address.
All callable subroutines end with a return instruction, whose mnemonic is RET. Similarly, all interrupt routines return from interrupt with the RETI instruction. The 8051 executes these return instructions by popping two bytes off its hardware stack and placing them in its 16-bit PC. When the PME-51 detects that one of these return instructions is being executed, it pops its own hardware stack and loads the three bits into its XAR at the end of the instruction.
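A simplified behavioral model of the XAR and its private stack might look like the following. The class is hypothetical. The depth of 64 entries assumes the 192 bits are organized as 64 three-bit words, which is not stated here, and boundary-crossing adjustments are left out.

```python
class XarModel:
    STACK_DEPTH = 64   # assumption: 192 bits / 3 bits per saved XAR

    def __init__(self):
        self.xar = 0       # drives A[18..16]
        self.stack = []    # extended-address hardware stack

    def fcall(self, opcode: int) -> None:
        """Model an FCALL: save the current XAR, then load the new one."""
        if opcode & 0x1F != 0x11:             # lower five bits of ACALL
            raise ValueError("not an ACALL-style opcode")
        holding = (opcode >> 5) & 0x7         # embedded address bits
        if len(self.stack) >= self.STACK_DEPTH:
            raise OverflowError("XAR stack full")
        self.stack.append(self.xar)
        self.xar = holding                    # loaded at end of instruction

    def ret(self) -> None:
        """Model RET or RETI: restore the caller's XAR."""
        self.xar = self.stack.pop()
```

Starting with the XAR at 2, an FCALL whose opcode byte is 0xD1 moves execution into block 6, and the matching RET brings the XAR back to 2.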
Across 64KB boundaries
It's great that we now can directly call or jump to any address in the newly expanded 512KB program memory space in only one instruction. But since the 8051's PC is only 16 bits wide, don't we now have to worry about 64KB boundaries during normal program execution? The two cases where 64KB boundaries could be a concern are when the program naturally flows across a 64KB boundary and when a relative jump instruction jumps from one 64KB block to an adjacent 64KB block. PME-51 handles both cases transparently to the 8051, preventing any 64KB segmenting or boundary effects.
For the program to flow smoothly over any 64KB boundary, the PME-51 must be able to detect that the 8051's 16-bit PC has overflowed. When the PME-51 detects an overflow it needs to increment the XAR before outputting it as A[18..16].
For example, take the case where the current program is fetching a two-byte instruction at address 0x3FFFF (XAR=3, PC=0xFFFF). The first byte of the instruction is fetched with the 8051 outputting 0xFFFF and the PME-51 outputting 0x03 on A[18..16]. Once the first byte has been read, the 8051 increments its PC in order to read the second byte of the instruction. As a result of this increment the 8051 PC will overflow and be output as 0x0000. The PME-51 needs to detect this overflow, increment its XAR and output a 0x04 on A[18..16] for a 19-bit address of 0x40000. Such operation allows the program to flow straight ahead and instructions to straddle 64KB boundaries without affecting program execution.
As is probably pretty obvious, the PME-51 detects the PC overflow condition simply by monitoring the most significant address bit output by the 8051. This bit, A15, is output by the 8051 on its P2.7 pin. Whenever the PME-51 sees A15 transition from a "1" to a "0" between two consecutive bus cycles, it increments the contents of its XAR before it's output as A[18..16].
Note that this overflow detection is shut off during specific bus cycles of the MOVX and MOVC instructions since a transition on A15 during these cycles is not related to the program flow. Additionally, the overflow detection is shut off during the first bus cycle of the instruction after the execution of LJMP, LCALL, RET, and RETI instructions since normal operation of these instructions can mimic an overflow condition when none exists. Finally, detection is also shut off during interrupt vectoring.
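The sequential-flow rule can be expressed as a tiny function. This is a sketch with invented names. The `detect_enabled` flag stands in for all the cases listed above in which detection is shut off, and the wrap of the three-bit XAR at 7 is an assumption.

```python
def a15(addr: int) -> int:
    """Most significant bit of the 16-bit address from the 8051."""
    return (addr >> 15) & 1


def step(xar: int, prev_pc: int, pc: int, detect_enabled: bool = True):
    """Return (new XAR, 19-bit address) for the bus cycle fetching pc."""
    if detect_enabled and a15(prev_pc) == 1 and a15(pc) == 0:
        xar = (xar + 1) & 0x7      # PC rolled over: advance to next 64KB
    return xar, (xar << 16) | pc
```

Replaying the example, `step(3, 0xFFFF, 0x0000)` yields an XAR of 4 and a fetch address of 0x40000, whereas the same transition with detection disabled leaves the XAR at 3.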
To handle relative jump instructions the PME-51 again monitors the PC address. But in this case the PME-51 must be able to detect both overflow and underflow of the PC since the relative jump instructions can go either forward or backward across a 64KB boundary. Also, A15 alone is not sufficient to detect these conditions because relative jumps can be well inside a 64KB block and still cause A15 to toggle. However, realizing that all relative jumps in the 8051 instruction set have a possible range of only -128 to +127 addresses, all PC overflow and underflow conditions can be detected with the addition of monitoring only one extra address bit, specifically A7.
To detect underflow, the PME-51 first looks at A7 (P0.7) to see if underflow is possible. Since the maximum backwards jump is 128 addresses, the 16-bit PC can only underflow if the value of its low byte is less than 128 (0x80). A jump backwards across a 64KB boundary is therefore only possible when A7 is 0.
The PME-51 samples A7 during the last bus cycle of all relative jump instructions. If A7 is 1, no underflow is possible. But if A7 is a 0 the PME-51 compares the value of A15 during the last bus cycle of the relative jump instruction with the value being output during the first bus cycle of the next instruction. If A15 was a 0 and is now a 1, then an underflow condition exists and the PME-51 decrements the contents of the XAR before outputting it as A[18..16].
A similar approach detects overflow on forward relative jumps. The PC can only overflow if its low byte overflows. Since the maximum forward jump is 127 addresses, the low byte can only overflow if its value is greater than 128 (0x80).
The PME-51 samples A7 during the last bus cycle of the relative jump instruction. If A7 is a 0, no overflow is possible. If A7 is a 1, the PME-51 compares the value of A15 during the last bus cycle of the relative jump instruction with the value being output during the first cycle of the next instruction. If A15 was a 1 and is now a 0, an overflow condition exists and the PME-51 increments its XAR before outputting it as A[18..16].
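Both relative-jump checks fit into one short function. As before, this is an illustrative model rather than the actual logic design. The argument names are hypothetical, and the three-bit wrap of the XAR is assumed.

```python
def xar_after_relative_jump(xar: int, last_cycle_addr: int,
                            next_fetch_addr: int) -> int:
    """Adjust XAR using only A7 and A15, as seen on the address bus."""
    a7 = (last_cycle_addr >> 7) & 1
    a15_before = (last_cycle_addr >> 15) & 1
    a15_after = (next_fetch_addr >> 15) & 1

    if a7 == 0 and a15_before == 0 and a15_after == 1:
        return (xar - 1) & 0x7     # backward jump crossed a 64KB boundary
    if a7 == 1 and a15_before == 1 and a15_after == 0:
        return (xar + 1) & 0x7     # forward jump crossed a 64KB boundary
    return xar                     # jump stayed inside the 64KB block
```

A backward jump from 0x0010 to 0xFFF0 therefore decrements the XAR, a forward jump from 0xFFF0 to 0x0010 increments it, and a jump from 0x7FF0 to 0x8010 leaves it alone even though A15 toggles.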
PME-51 block diagram
We've satisfied all the conditions for a completely linear, nonsegmented 512KB program memory space for the 8051. All this was accomplished with some simple techniques and creative use of the instruction set. And it was done with no performance penalty, no internal data memory requirements, and no I/O pins used. Now let's see what it takes to accomplish this little sleight of hand.
Figure 9: PME-51 block diagram
Figure 9 shows a block diagram of the PME-51 logic function. It consists of an 8051 instruction sequencer and decoder, the three-bit XAR and associated logic, a local stack to save and restore the XAR on calls/interrupts and returns, an instruction translation circuit in the data path, synchronization circuits, address latches, and various input and output buffers.
The XAR is loaded at the beginning of every bus cycle. It is normally loaded with the output of the Inc/Dec circuit, but it can also be loaded from the holding register when FJMP and FCALL instructions are executed and from the top of the XAR stack when RET and RETI instructions are executed.
The Inc/Dec circuit increments the three bits of the XAR when a forward relative jump or normal program flow crosses a 64KB boundary. It decrements the XAR when there is a backward jump across a 64KB boundary. It is the output of the Inc/Dec circuit that is actually output as A[18..16]. It is also pushed onto the XAR stack at the end of FCALL instructions or during interrupt vectoring.
The PME-51 sequencer keeps track of which program memory reads are opcode fetches and then decodes the opcodes to determine if they are FCALL, RET, RETI or any of the relative jump instructions. It also decodes the interrupt sequence and outputs various instruction timing signals, such as the first and last bus cycles of an instruction. The XAR block uses this information to select what is loaded into the XAR and to control the XAR stack.
The P0[7..0] and P2[7..0] interface inputs the current 16-bit address from the 8051, latches it, and outputs it as the address bus A[15..0]. The data from the program memory, D[7..0], is input and flows through a Detect and Force circuit before being output on the P0[7..0] bus, which the 8051 reads for its data.
The Detect and Force circuit handles the translation that is needed for the FJMP and FCALL instructions to work properly. If it is the first bus cycle of an instruction and the data from the memory is either an AJMP or ACALL opcode, this circuit will instead output either an LJMP or LCALL opcode on the P0[7..0] bus.
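In software terms, the Detect and Force behavior amounts to a five-bit compare on opcode fetches. The sketch below uses invented names and treats "first bus cycle of an instruction" as a flag supplied by the sequencer.

```python
AJMP_LOW5 = 0x01    # aaa00001
ACALL_LOW5 = 0x11   # aaa10001
LJMP_OPCODE = 0x02
LCALL_OPCODE = 0x12


def detect_and_force(byte: int, first_cycle: bool) -> int:
    """Return the byte the 8051 actually sees on P0 for this bus cycle."""
    if first_cycle:
        low5 = byte & 0x1F
        if low5 == AJMP_LOW5:
            return LJMP_OPCODE
        if low5 == ACALL_LOW5:
            return LCALL_OPCODE
    return byte     # operand bytes and all other opcodes pass unchanged
```

Note that the operand byte 0x21 from the earlier FJMP example has the same lower five bits as an AJMP opcode. It passes through untouched only because it is not fetched in the first bus cycle, which is why the sequencer's timing signals matter.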
PME-51 system applications
Since the PME-51 is a relatively simple logic block with standard 8051 and memory interfaces, the concept can be used in a number of ways. In the simplest approach it could be produced as a standalone component and used in applications as a memory management peripheral chip placed between the 8051 and its program memory.
A better approach would be to combine the PME-51 logic with 512KB of flash memory. As shown in Figure 10, such a component could be realized in only a 20-pin package. It simply connects to the 8051's Port 0, Port 2, ALE, and /PSEN. While such a component could be used with any 8051 chip that accesses external memory, it would be the perfect memory complement to many of the current 8051 offerings from such companies as Atmel, Dallas/Maxim, and Philips that include additional on-chip RAM. Many applications' complete memory requirements would be met with the extra on-chip RAM of these 8051s and the 512KB of program memory in this small 20-pin component.
Figure 10: Pin configuration of a 20-pin component that combines 512KB flash memory and the PME-51 logic
Figure 11: PME-51 and 512KB flash components that are drop-in replacements for standard memories
To take the idea of combining the PME-51 logic with 512KB of flash memory one step further, Figure 11 shows potential PME-51-based components that could be drop-in, pin-compatible replacements for standard 28-pin EPROMs and 32-lead PLCC flash/EPROM memory components. For the thousands of existing 8051 applications, such a component would allow instant system upgrades through firmware enhancements without any changes at all to the hardware design. Just remove the existing memory and plop in this pin-compatible component for an immediate 64KB-to-512KB upgrade.
Other possible uses for the PME-51 include adding it as a block to an FPGA design that has an embedded 8051. Or use the PME-51 and a 512KB flash memory with a standard 8051 in a multichip module. In this approach one silicon die would be a standard 8051 and the other die would contain up to 512KB of flash memory and the PME-51 logic. This combined module would expand the program memory space of the 8051 up to 512KB without any design changes to the 8051 die.
Such a multichip module, in addition to solving the 8051 memory-expansion problems, would allow extremely quick response to emerging market opportunities while saving significant design resources. Any available 8051 die from any manufacturer could be combined with the PME-51/flash die and an expanded memory product would be available in a fraction of the time and effort that normal product development takes.
The 8051 microcontroller will continue to be used in increasingly complex applications that will demand ever more memory. Bank switching offers a memory solution to the designer, but with its obtuse programming model, wasteful use of resources, and performance-robbing overhead, it's far from an ideal solution. A better approach to expanding the 8051's memory space would extend the longevity of the 8051 architecture and its vast investment in designs, tools, and expertise. This certainly makes the PME-51 concept worth considering.
Martin Pawloski has been involved with the 8051 since his early days as a circuit designer at Intel working on the very first 8051 chip. He is currently president of MetaLink Corp., the 8051 emulator company. You can reach him at email@example.com.
You’ve reinvented a square wheel. The article "Expand your 8051 memory" is not very useful:
1) The Philips Mx/669 and, I believe, some Dallas offerings have this in hardware.
2) Code banking today with sub $10 ARMs is ridiculous.
3) The devices in 1) work with the current tools, what is described in the article does not.
So, once more we have an article "see how brilliant I am," not "here is something useful."
- Erik Malund
Let me answer each of your comments (paraphrased a bit) one by one:
1) Your point--Philips and Dallas already have 8051 chips that have native >64K program memory addressability.
My comment--At last count there were over 400 different versions of the 8051 from more than 30 vendors and more varieties are coming every month. The number of 8051s with native >64K program memory addressability from the likes of Philips and Dallas is less than 10. Not very good odds that the Philips/Dallas-type chips are appropriate for any given application. What if you're using an 8051 with USB--out of luck. What about an 8051 with high-performance A/D--out of luck. I could go on and on, but my point is that you're suggesting a very specific solution whereas the PME-51 is a general solution that can be used with any 8051 from any vendor that can access external memory. It doesn't depend on a particular feature set provided by a couple of vendors on a very small number of chips.
2) Your point--Code banking today with sub $10 ARMs is ridiculous.
My comment--You sound like an engineer that has the luxury of using the latest widgets for all your new projects. You're very fortunate. But many, many engineers are not as lucky as you. They're “stuck” with legacy products, decisions, tools, investments, etc. that dictate what chips they can use for their next project. It's an engineer’s job to design the best product possible given a set of constraints. In the embedded world, those constraints often include using existing microcontrollers, tools, code, and expertise that are already in the company in order to reduce the project’s risk, expedite the schedule, and lower the development costs. I think you would be surprised at how many 8051 designs are out there going through an upgrade cycle. I've gotten e-mail in response to this article from engineers that want to buy products based on this concept right away because they're in a situation where all they need is an additional 64K or 128K to put together existing, debugged code to get a product out. They don’t want to redesign the hardware or trim the code or both and have to go through another debug cycle. The last thing they want to do is use a different microcontroller architecture.
3) Your point--Tools already exist for those chips that natively address >64K.
My comment--As I said under item 1 the few chips that natively address >64K don’t solve the general problem. But your point addresses the broader issue of tools. You may not have noticed, but I am president of a company that develops and markets 8051 in-circuit emulators. As such, consider the hardware tool problem solved. As for software tools, we would make available a post-processor for Keil and other popular software suites that would support this concept. So tools are a non-issue. I apologize for not specifically mentioning that in the article.
A final point I'd like to make is that this article wasn’t written as a “mental exercise.” I truly believe, and the e-mails that I’ve received confirm it, that this concept can and would solve a real problem many engineers face designing 8051-based products in today’s complex environment. But we're a small company. We don’t have the silicon design resources, semiconductor marketing expertise, nor the component distribution channels to make this concept a reality at this time--but we are working on finding a way.
- Martin Pawloski
This is a very clever idea but not one I'd favor. For legacy products, compilers and tools support the ugly bank switching approach. They don't support the hacked instruction set of the article. For new products, given the clean, fast, well-supported and cheap alternatives, this looks like an idea whose time has passed.
- Peter Camilleri
Senior Software Engineer
This is a clever idea. Some thoughts...
- Yes, there are cheap 32-bit parts out there. But 8051 use is still expanding. And tools like Keil make them so easy to implement. The 25-100 MIPS 8051s from SiLabs let you do things with them that you cannot with lots of larger ARM parts. Try doing a 1MBit UART on an ARM.
- No one ought to be writing programs exceeding 64K in assembly language. Hence what is really needed is for C Compiler vendors (like Keil) to adopt such a scheme to make it useful. This would happen if silicon vendors like SiLabs who ship 8051s with 128K flash would implement the new instructions natively.
- By the way, with a little programmable logic a traditional code-banking scheme can be implemented with no wasted memory.
- Daraius Hathiram
Sr. Design Engineer
Ten X Technology Inc.
While I loathe bank switching as much as anyone, the statement that it "wastes a large amount of physical memory" is not necessarily true. In fact, I have never encountered a bank-switching implementation that did not use the "bottoms" of all the memory chips.
- Charlie Carothers
Check out the ST 8051's or PSD IC's with PLD, FLASH(256KB), 32KB bootFlash and SRAM. It is very configurable. In a few minutes, you can change system design. This together with Keil compiler makes coding simpler.
- Sameer Cholayil