Editor's Note: You may also be interested in my Rounding Algorithms 101 Redux article...
Your first thought as a logic designer might be to tell the person asking to go away and come back with a different question. This is because good techniques for FPGA implementations of the mod operation have not been widely known.
It's really not that bad! Maybe some of the older generation remember "casting out nines." A related technique turns modular arithmetic, for a constant modulus, into a straightforward exercise in combinational logic. You could do it the hard way: divide your variable x by constant N and take the remainder. Or you could think about the problem differently...
The basic approach: one bit at a time
Let's start with a familiar piece of arithmetic:
In other words, you can compute the modulus of a sum term-by-term; you don't have to add everything up and do the mod operation at the end. When you let a and b each be sums in their own right, this expands recursively to sums of any number of terms.
As a concrete example, let's take the modulus N to be 5, and compute some variable x mod 5. In order to use Equation 1, we start by treating the value of unsigned x as the sum of its binary-weighted digits:
In other words, xj contributes the value j when the corresponding bit is one and contributes zero when the bit value is zero. So, using Equation 1, we can rewrite x mod 5 one binary digit at a time:
Conveniently, bits have two possible values: 0 and 1. That means that (xk mod 5) has two possible values: 0 when xk = 0, or some Nk when xk = 1. Thus...

So, when x is a 32-bit number, computing "x mod 5" means adding up small numbers that total only 80, at most. Then, when you have a sum in the 0-80 range, you can repeat the process or do a table lookup to get the final sum-mod-five value.
Any experienced logic designer will already have noticed lots of clever optimizations that can be applied to that sum of bit-by-bit modulus values. Instead, let's look for bigger efficiency gains by taking another look at x.
By the way, x mod some power of two is the easy case. You already know to take the few least significant bits and discard the rest. If you actually work through Equation 2 for a power-of-two modulus, you'll see that it ends up with the same result.
To demonstrate the approach, group bits of x into pairs. In practice, you'd probably use groups of four bits (hexadecimal), eight bits (base 256 representation), or some other technology-appropriate size. Bit-pairs are shown here only in order to demonstrate the technique; you can rework the approach to whatever grouping makes sense in your application and FPGA fabric.
Here's how Equation 2 looks when bits of x are taken two at a time:

Instead of thirty-two terms to add up for a 32-bit input x, this produces only 16 terms with a maximum total of 56, and with fewer terms to deal with as the groups of bits get larger.
As before, the designer is sure to find many additional optimizations once the solution has gone this far. In particular, when N has some power of two as a factor, modulus values for all xj with j>N have that power of two as a factor also. When you add up set of even numbers, for example, you can skip the least significant bit – it will always be zero. The same applies to intermediate sums of large enough xj in this approach, so there are gates to be saved when N is some odd multiple of 2, 4, 8, etc.
You can also configure the block RAMs in an FPGA as lookup tables. In this case, the mod-5 operation requires a table with at least three-bit outputs. Taking the M9K memories in Altera FPGAs as an example, one RAM can be configured to 2K x 4 bits – a twelve-bit lookup table that holds three-bit values, with a little left over.
Three of these tables would cover x up to 36 bits wide, with an intermediate sum of only three terms. That sum, which never exceeds 12, can be converted to a mod-5 answer using just a little combinational logic. An Altera Stratix III EPS3SL340 contains over a thousand M9K blocks; three of them could be a worthwhile trade off of block RAMs against gate count and logic depth, in some applications.
Dealing with rounding
As noted in the introduction, rounding can be rephrased as a problem in modular arithmetic. Using the example of rounding to the nearest hundred, it can be written like this:

This almost solves the problem, and with very straightforward logic. However, one issue remains; this always rounds 50 up to the hundred, introducing a small bias. Unbiased rounding uses a more complicated rule. If the decimal number ends in 50, then round to the nearest even hundred. In other words, 1150 and 1250 both round to 1200, one rounds up, the other rounds down.
A slight modification of Equation 4 easily takes care of this. The even-hundreds logic mean that we need to treat even and odd hundreds differently, so we must examine a wider range than in Equation 4 to give us Equation 5 as follows:

Conclusion
Even experienced logic designers can run into trouble with x mod N calculations, even when the constant N is known in advance. There are widely known ways to handle division by constants, such as multiplying by a fraction in fixed point. Modular arithmetic, even when disguised as a rounding problem, can be just as simple, if not simpler.
The technique discussed here works for any constant N, delivers solutions even when x has 32 or more bits, and can be adapted for efficient use of many different FPGA fabrics.
Tom VanCourt, Ph.D., is currently a senior member of technical staff at Altera Corporation. He develops system-building tools and champions performance computing on FPGAs. Tom has spent over 25 years in industry with DEC, HP, and other companies. He has also taught at Boston University, where he earned his Ph.D. in computer systems engineering.
Tom's interests include FPGA-based computing in finance, life science, medical imagining, and other application areas. His ongoing efforts include adapting existing FPGA tools and creating new ones to move beyond logic synthesis and into the very different world of FGPA-based performance computing.