SIMD Within a Register: How I Doubled Hash Table Lookup Performance

It started with a simple thought: four bytes in a hash table bucket look just like an integer. Luckily, this one idea led to a deep dive into bit-twiddling and a 2x performance boost.

While working on a Cuckoo Filter implementation in C#, I created an array-like structure for the underlying hash table. I chose an 8-bit fingerprint: it aligns nicely on a byte boundary and still keeps the false-positive rate around 3%.

The layout looked straightforward—just a byte array where the start of each bucket is calculated as bucketIdx * bucketSize. Each bucket holds 4 slots, a solid, standard choice for a Cuckoo Filter.

Table layout: Bucket0 = [3A 00 B7 F2] | Bucket1 = [4C 91 00 DE] | … | Bucketn = [AA 00 5F C8]

But those four bytes in a bucket reminded me of something. They feel like … an integer!

I wasn’t chasing ultra-low latency—after all, this is C#—but I couldn’t resist experimenting. Could I speed up lookups in my Cuckoo Filter by replacing the 4-byte bucket with a plain old 32-bit integer?

Time to find out. 💪

A Flexible and Simple Implementation

Here’s the naïve storage I began with. To keep the rest of the code clean, all table logic lives in its own class, whose heart is a pre-allocated byte array:

private readonly byte[] _table;

Each bucket has 4 slots, so there was no need for a second dimension; mapping a key to its bucket is obvious:

var offset = bucketIdx * 4;

Checking whether a particular fingerprint (a single byte) sits in a bucket is just as obvious:

public bool Contains(byte fingerprint, uint bucketIdx)
{
    var s = bucketIdx * BucketSize;
    for (var i = s; i < s + BucketSize; i++)
    {
        if (_table[i] == fingerprint) return true;
    }

    return false;
}

(Please ignore the missing bounds checks—they’re not the point today.)

I’m no bit-twiddling wizard—I live in a cosy, GC-collected world—but those 4 bytes kept itching. A bucket lines up perfectly with a 32-bit uint; that opens the door to a future lock-free compare-and-swap. So I decided to play.

Migrating to a uint Table

Switching the backing array is trivial because the hash table is still one-dimensional:

- private readonly byte[] _table;
+ private readonly uint[] _table;

Now let’s refactor the lookup:

public bool Contains(byte fingerprint, uint bucketIdx)
{
-   var s = bucketIdx * 4;
+   var bucket = _table[bucketIdx];
-   for (var i = s; i < s + BucketSize; i++)
+   for (var i = 0; i < BucketSize; i++)
    {
-       if (_table[i] == fingerprint) return true;
+       var shift = i * 8;
+       var fp = (byte)(bucket >> shift);
+       if (fp == fingerprint) return true;
    }

    return false;
}

The bucket offset disappears because each uint is a bucket. But I’ve lost the luxury of direct indexing—unless I convert the integer to bytes first:

  Span<byte> bucketBytes = stackalloc byte[sizeof(uint)];
  BitConverter.TryWriteBytes(bucketBytes, _table[bucketIdx]);

  for (var i = 0; i < BucketSize; i++)
  {
      var fp = bucketBytes[i];
      if (fp == fingerprint) return true;
  }

I ran a quick benchmark on both uint-based lookups, and the results were revealing. The shifting version gave a nice speed boost, about 35% faster than the original byte-array loop. I also tested an unrolled loop, but it showed no significant improvement because the compiler had already unrolled it.

| Method | Operations | Mean | Ratio |
|---|---:|---:|---:|
| ByteTable_PositiveLookups | 128 | 239.9 ns | 1.00 |
| IntTable_PositiveLookups | 128 | 166.9 ns | 0.70 |
| ByteTable_NegativeLookups | 128 | 321.9 ns | 1.34 |
| IntTable_NegativeLookups | 128 | 203.9 ns | 0.85 |
| ByteTable_PositiveLookups | 1024 | 1,847.0 ns | 1.00 |
| IntTable_PositiveLookups | 1024 | 1,300.9 ns | 0.70 |
| ByteTable_NegativeLookups | 1024 | 2,523.7 ns | 1.37 |
| IntTable_NegativeLookups | 1024 | 1,597.2 ns | 0.86 |
| ByteTable_PositiveLookups | 1048576 | 1,907,503.2 ns | 1.00 |
| IntTable_PositiveLookups | 1048576 | 1,762,474.3 ns | 0.92 |
| ByteTable_NegativeLookups | 1048576 | 2,575,301.5 ns | 1.35 |
| IntTable_NegativeLookups | 1048576 | 1,637,653.6 ns | 0.86 |

The BitConverter approach, however, was a step backward. It was even slower than the original, likely due to the additional Span overhead. I’m not about to introduce complexity for negative gain, so the BitConverter version was a non-starter.

| Method | Operations | Mean | Ratio |
|---|---:|---:|---:|
| ByteTable_PositiveLookups | 128 | 239.7 ns | 1.00 |
| IntTable_PositiveLookups | 128 | 366.8 ns | 1.53 |
| ByteTable_NegativeLookups | 128 | 322.6 ns | 1.35 |
| IntTable_NegativeLookups | 128 | 429.2 ns | 1.79 |
| ByteTable_PositiveLookups | 1024 | 1,850.0 ns | 1.00 |
| IntTable_PositiveLookups | 1024 | 2,877.2 ns | 1.56 |
| ByteTable_NegativeLookups | 1024 | 2,512.9 ns | 1.36 |
| IntTable_NegativeLookups | 1024 | 3,396.8 ns | 1.84 |
| ByteTable_PositiveLookups | 1048576 | 1,909,607.9 ns | 1.00 |
| IntTable_PositiveLookups | 1048576 | 3,352,454.1 ns | 1.76 |
| ByteTable_NegativeLookups | 1048576 | 2,566,696.5 ns | 1.34 |
| IntTable_NegativeLookups | 1048576 | 3,536,050.6 ns | 1.85 |
So what’s next?

The shifting version is already quite performant, but can we do better? Maybe even eliminate the loop entirely?

Finding a Byte with Masking

Long ago I bookmarked Sean Anderson’s great Bit Twiddling Hacks. One gem there—Determine if a word has a zero byte—is exactly what I need. The C# version is nearly identical:

private static bool HasZero(uint v)
{
    return ((v - 0x01010101U) & ~v & 0x80808080U) != 0;
}

Admittedly opaque, so let’s unpack it.

The core of the trick is (v - 0x01010101U) & ~v. This expression has a special property:

  • For any non-zero byte b, the most significant bit of (b - 1) & ~b will always be 0.
  • For a zero-byte b = 0x00, the expression becomes (0x00 - 1) & ~0x00, which is 0xFF & 0xFF = 0xFF. The most significant bit is 1.

So, this operation creates a “marker” bit (it sets the most significant bit to 1) in any byte position that was originally 0x00.
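If you want to convince yourself of both bullet points, a throwaway brute-force over all 256 byte values does the job (my own quick check, not part of the filter):

```csharp
using System;

class ZeroBytePropertyCheck
{
    static void Main()
    {
        for (int b = 0; b <= 255; b++)
        {
            // Per-byte version of (v - 0x01010101) & ~v, evaluated in isolation
            // (i.e., ignoring cross-byte borrows for now).
            int marker = ((b - 1) & ~b) & 0x80;

            // The MSB marker fires exactly when the byte itself is zero.
            if ((marker != 0) != (b == 0))
                throw new Exception($"Property fails for byte 0x{b:X2}");
        }
        Console.WriteLine("MSB marker is set only for 0x00 across all 256 byte values");
    }
}
```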

Let’s apply it to our v, e.g., 0x4462002E:

First, we subtract 0x01010101U.

$$
\begin{array}{r}
\texttt{01000100\;01100010\;00000000\;00101110}\;(\texttt{0x4462002E}) \\
-\;\texttt{00000001\;00000001\;00000001\;00000001}\;(\texttt{0x01010101}) \\ \hline
\texttt{01000011\;01100000\;11111111\;00101101}\;(\texttt{0x4360FF2D})
\end{array}
$$
Note

This is a single 32-bit subtraction, so borrows can cross byte boundaries: the 0x00 byte borrows from 0x62, resulting in 0x...60FF....

Second, we apply a bitwise NOT to the original value:

$$
\begin{array}{r}
\neg\;\texttt{01000100\;01100010\;00000000\;00101110}\;(\texttt{0x4462002E}) \\ \hline
\texttt{10111011\;10011101\;11111111\;11010001}\;(\texttt{0xBB9DFFD1})
\end{array}
$$

Now, & them together:

$$
\begin{array}{r}
\texttt{01000011\;01100000\;11111111\;00101101}\;(\texttt{0x4360FF2D}) \\
\mathbin{\&}\;\texttt{10111011\;10011101\;11111111\;11010001}\;(\texttt{0xBB9DFFD1}) \\ \hline
\texttt{00000011\;00000000\;11111111\;00000001}\;(\texttt{0x0300FF01})
\end{array}
$$

Notice that the third byte from the left is 0xFF. Its most significant bit is 1, indicating that this was our zero-byte position.

The final & 0x80808080U is a mask that removes all other bits, leaving only the most significant bit of each byte.

$$
\begin{array}{r}
\texttt{00000011\;00000000\;11111111\;00000001}\;(\texttt{0x0300FF01}) \\
\mathbin{\&}\;\texttt{10000000\;10000000\;10000000\;10000000}\;(\texttt{0x80808080}) \\ \hline
\texttt{00000000\;00000000\;10000000\;00000000}\;(\texttt{0x00008000})
\end{array}
$$

The method returns the result of 0x00008000 != 0. Since 0x8000 is not zero, the expression is true, and the method correctly reports that the zero byte was found.
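To sanity-check the walkthrough, here is HasZero from above dropped into a tiny console program (the extra sample values are mine):

```csharp
using System;

class HasZeroDemo
{
    // Same method as in the post: detects whether any byte of v is 0x00.
    private static bool HasZero(uint v)
    {
        return ((v - 0x01010101U) & ~v & 0x80808080U) != 0;
    }

    static void Main()
    {
        Console.WriteLine(HasZero(0x4462002E)); // True: third byte is 0x00
        Console.WriteLine(HasZero(0x12345678)); // False: no zero byte
        Console.WriteLine(HasZero(0xFFFFFF00)); // True: zero in the lowest byte
    }
}
```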

Great, now I can detect a zero byte without branches. All that remains is to turn the byte I’m searching for into zero.

XOR to the Rescue

Let’s assume our integer, aka bucket, is 0x12345678 and I’m looking for byte 0x56. Without shifts, this seems tough. Luckily, all we need to do is transform the 0x56 byte to 0x00.

Note

What happens to the other bytes is irrelevant, because our HasZero trick only cares if any byte is zero.

The XOR (^) operator has a useful property: A ^ B = 0 if and only if A == B. So, if we XOR the entire bucket with a repeating mask of our target byte, only the matching byte will become zero.

$$
\begin{array}{r}
\texttt{00010010\;00110100\;01010110\;01111000}\;(\texttt{0x12345678}) \\
\oplus\;\texttt{01010110\;01010110\;01010110\;01010110}\;(\texttt{0x56565656}) \\ \hline
\texttt{01000100\;01100010\;00000000\;00101110}\;(\texttt{0x4462002E})
\end{array}
$$

So the remaining piece of the puzzle is building the repeating mask, which is simple arithmetic: multiply the target byte by 0x01010101U:

$$\text{mask} = \texttt{0x56} \times \texttt{0x01010101} = \texttt{0x56565656}$$
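A quick sketch of the mask construction plus the XOR from the earlier example (variable names are illustrative):

```csharp
using System;

class MaskDemo
{
    static void Main()
    {
        byte fingerprint = 0x56;

        // Multiplying by 0x01010101 replicates the byte into all four lanes.
        uint mask = fingerprint * 0x01010101U;
        Console.WriteLine($"0x{mask:X8}"); // 0x56565656

        // XOR-ing the bucket with the mask zeroes out the matching byte.
        uint bucket = 0x12345678;
        Console.WriteLine($"0x{bucket ^ mask:X8}"); // 0x4462002E
    }
}
```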

What About Existing Zeros?

A careful reader might ask: what happens if the bucket already has a zero byte in it? Does that mess up the logic?

Let’s say our bucket is 0xAA00CCDD and we’re searching for a non-zero fingerprint like 0xBB. The XOR operation transforms the original zero into a non-zero value, so HasZero correctly returns false.

$$
\begin{array}{r}
\texttt{10101010\;00000000\;11001100\;11011101}\;(\texttt{0xAA00CCDD}) \\
\oplus\;\texttt{10111011\;10111011\;10111011\;10111011}\;(\texttt{0xBBBBBBBB}) \\ \hline
\texttt{00010001\;10111011\;01110111\;01100110}\;(\texttt{0x11BB7766})
\end{array}
$$

Now, what if we search for 0x00 itself (an empty slot in my case)? The mask is 0x00000000, so the XOR leaves the bucket unchanged. HasZero is then applied to the result, which correctly finds the pre-existing zero and returns true.

$$
\begin{array}{r}
\texttt{10101010\;00000000\;11001100\;11011101}\;(\texttt{0xAA00CCDD}) \\
\oplus\;\texttt{00000000\;00000000\;00000000\;00000000}\;(\texttt{0x00000000}) \\ \hline
\texttt{10101010\;00000000\;11001100\;11011101}\;(\texttt{0xAA00CCDD})
\end{array}
$$

So no, nothing breaks; the algorithm is still robust. HasZero only gives a positive result if a zero byte exists after the XOR, which happens only if our target fingerprint was in the bucket to begin with.
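Both edge cases can be verified with the same XOR-plus-HasZero combination; the BucketContains helper below is my own shorthand for that pairing:

```csharp
using System;

class ZeroSlotDemo
{
    private static bool HasZero(uint v) =>
        ((v - 0x01010101U) & ~v & 0x80808080U) != 0;

    // XOR the bucket with the replicated fingerprint, then look for a zero byte.
    private static bool BucketContains(uint bucket, byte fingerprint) =>
        HasZero(bucket ^ (fingerprint * 0x01010101U));

    static void Main()
    {
        uint bucket = 0xAA00CCDD; // second byte is an empty (zero) slot

        Console.WriteLine(BucketContains(bucket, 0xBB)); // False: 0xBB is absent
        Console.WriteLine(BucketContains(bucket, 0x00)); // True: finds the empty slot
        Console.WriteLine(BucketContains(bucket, 0xCC)); // True: an ordinary hit
    }
}
```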

Putting It All Together

Here’s the final, branch-free lookup:

public bool Contains(byte fingerprint, uint bucketIdx)
{
    uint bucket = _table[bucketIdx];
    uint mask = fingerprint * 0x01010101U;
    uint xored = bucket ^ mask;
    return ((xored - 0x01010101U) & ~xored & 0x80808080U) != 0;
}

We XOR to zero-out matching bytes, then use the bit-twiddling trick to see if any byte is zero.
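As a usage sketch, here is the final Contains wrapped in a minimal stand-alone table; the class shape and sample data are illustrative, not the post's actual implementation:

```csharp
using System;

class UintTable
{
    private readonly uint[] _table;

    public UintTable(uint[] buckets) => _table = buckets;

    public bool Contains(byte fingerprint, uint bucketIdx)
    {
        uint bucket = _table[bucketIdx];
        uint mask = fingerprint * 0x01010101U;
        uint xored = bucket ^ mask;
        return ((xored - 0x01010101U) & ~xored & 0x80808080U) != 0;
    }

    static void Main()
    {
        // Two buckets packed as uints: { 3A 00 B7 F2 } and { 4C 91 00 DE }.
        var table = new UintTable(new uint[] { 0x3A00B7F2, 0x4C9100DE });

        Console.WriteLine(table.Contains(0xB7, 0)); // True
        Console.WriteLine(table.Contains(0xB7, 1)); // False
        Console.WriteLine(table.Contains(0x00, 1)); // True: bucket 1 has an empty slot
    }
}
```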

The benchmarks confirmed this bit-twiddling exercise was well worth the effort. Positive lookups became over 60% faster, while negative lookups became more than twice as fast compared to the original byte-array implementation. It’s a significant leap over the shifting version, too. While readability certainly took a hit, the raw performance gain is a trade-off I’m ok with.

| Method | Operations | Mean | Ratio |
|---|---:|---:|---:|
| ByteTable_PositiveLookups | 128 | 245.2 ns | 1.00 |
| IntTable_PositiveLookups | 128 | 147.7 ns | 0.60 |
| ByteTable_NegativeLookups | 128 | 324.1 ns | 1.32 |
| IntTable_NegativeLookups | 128 | 147.1 ns | 0.60 |
| ByteTable_PositiveLookups | 1024 | 1,845.6 ns | 1.00 |
| IntTable_PositiveLookups | 1024 | 1,139.4 ns | 0.62 |
| ByteTable_NegativeLookups | 1024 | 2,561.2 ns | 1.39 |
| IntTable_NegativeLookups | 1024 | 1,136.3 ns | 0.62 |
| ByteTable_PositiveLookups | 1048576 | 1,908,031.9 ns | 1.00 |
| IntTable_PositiveLookups | 1048576 | 1,170,627.6 ns | 0.61 |
| ByteTable_NegativeLookups | 1048576 | 2,574,882.8 ns | 1.35 |
| IntTable_NegativeLookups | 1048576 | 1,172,802.0 ns | 0.61 |

Final Thoughts

I’m still not a huge fan of stuffing dense bit tricks into production C#—they can be hard to read and even harder to maintain if something goes wrong. But I think this little adventure has been worth it: the lookup path is now twice as fast, and the codebase is still compact enough to keep the trick well-commented. I hope these notes save someone else a detour—or at the very least that you enjoyed this little optimization trip with me.

© 2025 | Aleksey Maltsev