While working on a Cuckoo Filter implementation in C#, I created an array-like structure for the underlying hash table. I chose an 8-bit fingerprint: it aligns nicely on a byte boundary and still keeps the false-positive rate around 3 %.
The layout looked straightforward—just a byte array where the start of each bucket is calculated as bucketIdx * bucketSize
. The size of each bucket is 4 slots, which is a solid choice for Cuckoo Filter.
But those four bytes in a bucket reminded me of something. They feel like … an integer!
I wasn’t chasing ultra-low latency—after all, this is C#—but I couldn’t resist experimenting. Could I speed up lookups in my Cuckoo Filter by replacing the 4-byte bucket with a plain old 32-bit integer?
Time to find out. 💪
Contents
- A Flexible and Simple Implementation
- Migrating to a
uint
Table - Finding a Byte with Masking
- XOR to the Rescue
- What About Existing Zeros?
- Putting It All Together
- Final Thoughts
A Flexible and Simple Implementation
Here’s the naïve storage I began with. To keep the rest of the code clean, all table logic lives in its own class, whose heart is a pre-allocated byte array:
private readonly byte[] _table;
Each bucket has 4 slots, so there was no need for a second dimension; mapping a key to its bucket is obvious:
var offset = bucketIdx * 4;
Checking whether a particular fingerprint (a single byte) sits in a bucket is just as obvious:
public bool Contains(byte fingerprint, uint bucketIdx)
{
var s = bucketIdx * 4;
for (var i = s; i < s + BucketSize; i++)
{
if (_table[i] == fingerprint) return true;
}
return false;
}
(Please ignore the missing bounds checks—they’re not the point today.)
I’m no bit-twiddling wizard—I live in a cosy, GC-collected world—but those 4 bytes kept itching. A bucket lines up perfectly with a 32-bit uint
; that opens the door to a future lock-free compare-and-swap. So I decided to play.
uint
Table
Migrating to a Switching the backing array is trivial because the hash table is still one-dimensional:
- private readonly byte[] _table;
+ private readonly uint[] _table;
Now let’s refactor the lookup:
public bool Contains(byte fingerprint, uint bucketIdx)
{
- var s = bucketIdx * 4;
+ var bucket = _table[bucketIdx];
- for (var i = s; i < s + BucketSize; i++)
+ for (var i = 0; i < BucketSize; i++)
{
- if (_table[i] == fingerprint) return true;
+ var shift = i * 8;
+ var fp = (byte)(bucket >> shift);
+ if (fp == fingerprint) return true;
}
return false;
}
The bucket offset disappears because each uint
is a bucket. But I’ve lost the luxury of direct indexing—unless I convert the integer to bytes first:
Span<byte> bucketBytes = stackalloc byte[sizeof(uint)];
BitConverter.TryWriteBytes(bucketBytes, _table[bucketIdx]);
for (var i = 0; i < BucketSize; i++)
{
var fp = bucketBytes[i];
if (fp == fingerprint) return true;
}
I ran a quick benchmark on both uint
-based lookups, and the results were revealing. The shifting version gave a nice speed boost, about 35% faster than the original byte-array loop. I’ve also tested the unrolled loop, but it showed no significant improvement because the compiler has already unrolled it.
Method | Operations | Mean | Ratio |
---|---|---|---|
ByteTable_PositiveLookups | 128 | 239.9 ns | 1.00 |
IntTable_PositiveLookups | 128 | 166.9 ns | 0.70 |
ByteTable_NegativeLookups | 128 | 321.9 ns | 1.34 |
IntTable_NegativeLookups | 128 | 203.9 ns | 0.85 |
ByteTable_PositiveLookups | 1024 | 1,847.0 ns | 1.00 |
IntTable_PositiveLookups | 1024 | 1,300.9 ns | 0.70 |
ByteTable_NegativeLookups | 1024 | 2,523.7 ns | 1.37 |
IntTable_NegativeLookups | 1024 | 1,597.2 ns | 0.86 |
ByteTable_PositiveLookups | 1048576 | 1,907,503.2 ns | 1.00 |
IntTable_PositiveLookups | 1048576 | 1,762,474.3 ns | 0.92 |
ByteTable_NegativeLookups | 1048576 | 2,575,301.5 ns | 1.35 |
IntTable_NegativeLookups | 1048576 | 1,637,653.6 ns | 0.86 |
The BitConverter
approach, however, was a step backward. It was even slower than the original, likely due to the additional Span
overhead. I’m not about to introduce complexity for negative gain, so the BitConverter
version was a non-starter.
Method | Operations | Mean | Ratio |
---|---|---|---|
ByteTable_PositiveLookups | 128 | 239.7 ns | 1.00 |
IntTable_PositiveLookups | 128 | 366.8 ns | 1.53 |
ByteTable_NegativeLookups | 128 | 322.6 ns | 1.35 |
IntTable_NegativeLookups | 128 | 429.2 ns | 1.79 |
ByteTable_PositiveLookups | 1024 | 1,850.0 ns | 1.00 |
IntTable_PositiveLookups | 1024 | 2,877.2 ns | 1.56 |
ByteTable_NegativeLookups | 1024 | 2,512.9 ns | 1.36 |
IntTable_NegativeLookups | 1024 | 3,396.8 ns | 1.84 |
ByteTable_PositiveLookups | 1048576 | 1,909,607.9 ns | 1.00 |
IntTable_PositiveLookups | 1048576 | 3,352,454.1 ns | 1.76 |
ByteTable_NegativeLookups | 1048576 | 2,566,696.5 ns | 1.34 |
IntTable_NegativeLookups | 1048576 | 3,536,050.6 ns | 1.85 |
Even the shifting version is quite performant, can we do better? Maybe just eliminate the loop entirely?
Finding a Byte with Masking
Long ago I bookmarked Sean Anderson’s great Bit Twiddling Hacks. One gem there—Determine if a word has a zero byte—is exactly what I need. The C# version is nearly identical:
private static bool HasZero(uint v)
{
return ((v - 0x01010101U) & ~v & 0x80808080U) != 0;
}
Admittedly opaque, so let’s unpack it.
The core of the trick is (v - 0x01010101U) & ~v
. This expression has a special property:
- For any non-zero byte b, the most significant bit of
(b - 1) & ~b
will always be0
. - For a zero-byte b = 0x00, the expression becomes
(0x00 - 1) & ~0x00
, which is0xFF & 0xFF = 0xFF
. The most significant bit is1
.
So, this operation creates a “marker” bit (it sets the most significant bit to 1
) in any byte position that was originally 0x00
.
Let’s apply it to our v
, e.g., 0x4462002E
:
First, we subtract 0x01010101U
.
Note that this is a single 32-bit subtraction, so borrows can cross byte boundaries. The 0x00
byte borrows from 0x62
, resulting in 0x...60FF...
.
Second, we apply a bitwise NOT to the original value:
Now, &
them together:
Notice that the third byte from the left is 0xFF
. Its most significant bit is 1
, indicating that this was our zero-byte position.
The final & 0x80808080U
is a mask that removes all other bits, leaving only the most significant bit of each byte.
The method returns the result of 0x00008000 != 0
. Since 0x8000
is not zero, the expression is true
, and the method correctly reports that the zero byte was found.
Great, now I can detect a zero byte without branches. All that remains is to turn the byte I’m searching for into zero.
XOR to the Rescue
Let’s assume our integer, aka bucket, is 0x12345678
and I’m looking for byte 0x56
. Without shifts, this seems tough. Luckily, all we need to do is transform the 0x56
byte to 0x00
.
What happens to the other bytes is irrelevant, because our HasZero
trick only cares if any byte is zero.
The XOR (^
) operator has a useful property: A ^ B = 0
if and only if A == B
. So, if we XOR the entire bucket with a repeating mask of our target byte, only the matching byte will become zero.
So the remaining part of this puzzle is to make a repetitive mask, which is pretty arithmetic: I should multiply target byte by 0x01010101U
:
What About Existing Zeros?
A careful reader might ask: what happens if the bucket already has a zero byte in it? Does that mess up the logic?
Let’s say our bucket is 0xAA00CCDD
and we’re searching for a non-zero fingerprint like 0xBB
. The XOR operation transforms the original zero into a non-zero value, so HasZero
correctly returns false
.
Now, what if we search for 0x00
itself (an empty slot in my case)? The mask is 0x00000000
, so the XOR leaves the bucket unchanged. HasZero
is then applied to the result, which correctly finds the pre-existing zero and returns true
.
So no, nothing breaks, the algorithm is still robust: HasZero
only gives a positive result if a zero byte exists after the XOR, which only happens if our target fingerprint was in the bucket to begin with.
Putting It All Together
Here’s the final, branch-free lookup:
public bool Contains(byte fingerprint, uint bucketIdx)
{
uint bucket = _table[bucketIdx];
uint mask = fingerprint * 0x01010101U;
uint xored = bucket ^ mask;
return ((xored - 0x01010101U) & ~xored & 0x80808080U) != 0;
}
We XOR to zero-out matching bytes, then use the bit-twiddling trick to see if any byte is zero.
The benchmarks confirmed this bit-twiddling exercise was well worth the effort. Positive lookups became over 60% faster, while negative lookups became more than twice as fast compared to the original byte-array implementation. It’s a significant leap over the shifting version, too. While readability certainly took a hit, the raw performance gain is a trade-off I’m ok with.
Method | Operations | Mean | Ratio |
---|---|---|---|
ByteTable_PositiveLookups | 128 | 245.2 ns | 1.00 |
IntTable_PositiveLookups | 128 | 147.7 ns | 0.60 |
ByteTable_NegativeLookups | 128 | 324.1 ns | 1.32 |
IntTable_NegativeLookups | 128 | 147.1 ns | 0.60 |
ByteTable_PositiveLookups | 1024 | 1,845.6 ns | 1.00 |
IntTable_PositiveLookups | 1024 | 1,139.4 ns | 0.62 |
ByteTable_NegativeLookups | 1024 | 2,561.2 ns | 1.39 |
IntTable_NegativeLookups | 1024 | 1,136.3 ns | 0.62 |
ByteTable_PositiveLookups | 1048576 | 1,908,031.9 ns | 1.00 |
IntTable_PositiveLookups | 1048576 | 1,170,627.6 ns | 0.61 |
ByteTable_NegativeLookups | 1048576 | 2,574,882.8 ns | 1.35 |
IntTable_NegativeLookups | 1048576 | 1,172,802.0 ns | 0.61 |
Final Thoughts
I’m still not a huge fan of stuffing dense bit tricks into production C#—they can be hard to read and even harder to maintain if something goes wrong. But I think this little adventure has been worth it: the lookup path is now twice as fast, and the codebase is still compact enough to keep the trick well-commented. I hope these notes save someone else a detour—or at the very least that you enjoyed this little optimization trip with me.