Chapter 01
Where It Started
My Analysis of Algorithms class had a practical component. Build a search algorithm, benchmark it, explain the results. I built something I called HyperBloom Search — a two-level bucket structure with a Bloom filter in front.
The first version used three levels of nested Java collections. I thought the Bloom filter would save it.
// Version 1 — this is genuinely what I started with
static ArrayList<ArrayList<ArrayList<Integer>>> hyperBloomBuckets;
static BitSet bloomFilter = new BitSet(BLOOM_SIZE); // BLOOM_SIZE: filter width in bits, defined elsewhere
The benchmark was embarrassing. Not because the results were wrong — they were correct. But because the algorithm was slower than a plain sorted list on most workloads.
The Question That Started Everything
I spent a week tuning parameters — bucket sizes, Bloom filter size, hash function count. Nothing helped. Then I asked the right question: why is the correct algorithm still slow?
Chapter 02
The Problem Was Never the Algorithm
The profiler gave me three numbers that changed how I think about software performance:
HashMap.get(): 111ns/op
BinarySearch on bucket: 840ns/op ← supposed to be fast
Direct array read (ideal): 5ns/op
The binary search was taking 840 nanoseconds — not because O(log n) is slow, but because every comparison chased a pointer to a heap-allocated Integer object. The CPU had no idea what to prefetch. Every comparison was a cache miss.
Diagram 01 — Memory Layout: Pointer Chasing vs Flat Array
❌ ArrayList<Integer> (old): the list header holds an array of pointers (ptr → ptr → ptr → …), and each pointer leads to a separate heap object — Integer(42), Integer(7), each with a 16-byte header. Each read = cache miss = ~100ns.
✓ flat int[] (new): values packed contiguously, so the prefetcher loads the whole block = ~5ns per read.
The algorithm was O(log n). The memory layout made it 170× slower than theory.
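The layout difference is easy to see in plain Java: both searches below are the same O(log n) algorithm, but one walks boxed heap objects while the other walks a contiguous int[]. A minimal illustrative sketch (class and variable names are mine, not the project's):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;

public class LayoutDemo {
    public static void main(String[] args) {
        int n = 1 << 20;
        int[] flat = new int[n];                       // contiguous ints
        ArrayList<Integer> boxed = new ArrayList<>(n); // pointer-chased Integers
        for (int i = 0; i < n; i++) {
            flat[i] = i * 2;
            boxed.add(i * 2);
        }

        // Same algorithm, same O(log n) — only the memory layout differs.
        int a = Arrays.binarySearch(flat, 1234 * 2);
        int b = Collections.binarySearch(boxed, 1234 * 2);
        System.out.println(a == b && a == 1234);
    }
}
```

The two calls return identical indices; the difference shows up only in the profiler, where every comparison on the boxed list can miss the cache.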
Chapter 03
Six Versions, One Algorithm
Diagram 02 — Version Evolution Timeline
v1 — ArrayList³ + Bloom Filter
Three nested ArrayList<Integer> levels. Bloom filter pre-reject. Correct results, catastrophically slow. Every lookup paid 840ns in pointer chasing.
v2 — int[] Buckets + MurmurHash3
Replaced ArrayList<Integer> with flat int[] arrays. Fixed hash functions (FPR: 2.24% → 0.90%). Sorted insert instead of sort-on-every-insert. Real improvement, still lost on high hit rates.
v3 — FlatBits + Range Index
Abolished buckets entirely. One sorted int[] plus a flat rangeStart[] jump index. Eliminated the HashMap and the binary search over scattered heap objects. First version to consistently beat HashSet.
v4 — Single Bit Array + Direct-Mapped Cache
The key insight: one bit per possible value. No sorted array, no range index. Plus a 512-slot direct-mapped cache. Dense workload: 6ns. All distributions now winning.
v5 — Full Feature Set
Added negatives (offset remapping), deletion (clear bit), dynamic growth (amortised doubling), sparse mode (two-level bitmap for large ranges), and an adaptive cache (disables itself on uniform traffic).
v6 — Production-Ready (final)
Removed VarHandle and ThreadLocal from the hot path (225ns → 8ns). Fixed the Integer.MIN_VALUE sentinel collision. Thread-safe CAS writes with plain-array reads. 219 lines, 124 tests, zero failures.
Chapter 04
The Final Architecture
Diagram 03 — System Architecture
search(target) — public API entry point
  ↓
Direct-Mapped Cache — 512 slots · slot = target & 511 · ~2ns on hit
  ↓ hit → RETURN cached result (true / false · ~2ns total)
  ↓ miss → check sparseMode flag
  ↓
Dense Read — flatBits[m >>> 6] · 1 array read   |   Sparse Read — L1 reject · L2 fine bit
Diagram 04 — FlatBits: One Bit Per Value
Dataset: {3, 7, 12, 19, 25}. Range: 0–31. Each long holds 64 bits. Bit[v]=1 means v is present.
long[0] — values 0 to 63 (indices 0–31 shown): bits 3, 7, 12, 19 and 25 are 1; all other bits are 0.
Lookup: search(25)
word = 25 >>> 6 = 0 · bit = 25 & 63 = 25 · (long[0] >>> 25) & 1L = 1 → true
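That lookup compresses to a few lines of Java. A minimal sketch of the one-bit-per-value idea — my own naming, and only the core; the real structure adds offset remapping, growth, and the cache:

```java
// Minimal one-bit-per-value membership set (illustrative, not the full FlashSet).
public class FlatBitsSketch {
    private final long[] bits;

    public FlatBitsSketch(int maxValue) {
        bits = new long[(maxValue >>> 6) + 1];            // each long holds 64 bits
    }

    public void insert(int v)  { bits[v >>> 6] |=  (1L << (v & 63)); }
    public void delete(int v)  { bits[v >>> 6] &= ~(1L << (v & 63)); }
    public boolean search(int v) {
        // word index = v >>> 6, bit index = v & 63 — exactly the diagram's math
        return ((bits[v >>> 6] >>> (v & 63)) & 1L) != 0;
    }

    public static void main(String[] args) {
        FlatBitsSketch s = new FlatBitsSketch(31);
        for (int v : new int[]{3, 7, 12, 19, 25}) s.insert(v);
        System.out.println(s.search(25) + " " + s.search(24));
    }
}
```

Every operation is one array read or write plus shift-and-mask — no comparisons, no pointers.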
Diagram 05 — Sparse Mode: Two-Level Bitmap
For large ranges (e.g., 120K values in 1B range). L1 coarse index eliminates 99.98% of lookups instantly.
L1 — Coarse (~29KB): 1 bit per 4096-value chunk. A 1B range = 244K chunks = ~29KB.
L2 — Fine (lazy, per-chunk): a chunk with no values stays null — its fine bitmap is never allocated.
L1 fast-reject: if chunk bit = 0, return false immediately. For 99.98% of queries on sparse data, the search ends in 1 array read.
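The two-level idea fits in a short sketch. Names and the lazy-allocation details are mine; the 4096-value chunk size follows the diagram:

```java
// Two-level bitmap sketch: coarse L1 bit per 4096-value chunk, lazy L2 fine bitmaps.
public class SparseBitsSketch {
    private static final int CHUNK_SHIFT = 12;   // 4096 values per chunk
    private final long[] l1;                     // 1 bit per chunk
    private final long[][] l2;                   // fine bitmaps, allocated on demand

    public SparseBitsSketch(long range) {
        int chunks = (int) ((range >>> CHUNK_SHIFT) + 1);
        l1 = new long[(chunks >>> 6) + 1];
        l2 = new long[chunks][];
    }

    public void insert(int v) {
        int chunk = v >>> CHUNK_SHIFT;
        l1[chunk >>> 6] |= 1L << (chunk & 63);             // mark chunk occupied
        if (l2[chunk] == null) l2[chunk] = new long[64];   // 4096 bits = 64 longs
        int fine = v & 4095;
        l2[chunk][fine >>> 6] |= 1L << (fine & 63);
    }

    public boolean search(int v) {
        int chunk = v >>> CHUNK_SHIFT;
        if (((l1[chunk >>> 6] >>> (chunk & 63)) & 1L) == 0) return false; // L1 fast-reject
        int fine = v & 4095;
        return ((l2[chunk][fine >>> 6] >>> (fine & 63)) & 1L) != 0;       // L2 fine bit
    }

    public static void main(String[] args) {
        SparseBitsSketch s = new SparseBitsSketch(1_000_000_000L);
        s.insert(123_456_789);
        System.out.println(s.search(123_456_789) + " " + s.search(42));
    }
}
```

On sparse data, almost every miss ends at the L1 check: one array read, no L2 allocation ever touched.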
Chapter 05
The Search Path, Step by Step
Diagram 06 — Search() Data Flow
1 — Cache Check
slot = target & 511. If cacheValid[slot] && cacheKeys[slot] == target → return the cached result immediately.
~2ns on hit · ~0ns overhead on miss
2 — Mode Branch
Check the sparseMode flag (set once at construction, never changed). Branch prediction makes this essentially free.
~0ns
3 — Dense: Single Bit Read
mapped = target − offset. Read flatBits[mapped >>> 6]. Shift and mask. Return.
~5ns · 1 array read · no pointer chasing
3′ — Sparse: L1 Reject → L2 Fine
chunk = mapped >>> 12. Check the L1 bit — it rejects 99.98% of misses in 1 read. If the chunk exists, read the L2 fine bit.
~25ns on miss · ~35ns on hit
4 — Cache Write + Adaptive Tick
Store the result in the cache slot. Every 1024 queries, evaluate the hit rate. If it is below 1.5%, disable the cache; re-enable after the next 1024 queries.
~3ns · fully automatic · no configuration
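Stitched together, the dense path looks roughly like this. A sketch under the diagram's naming — the sparse branch and the adaptive tick are omitted, and the field names are my reconstruction:

```java
// Dense search path sketch: direct-mapped cache in front of a flat bit array.
public class SearchPathSketch {
    private static final int CACHE_SIZE = 512;
    private final int[] cacheKeys = new int[CACHE_SIZE];
    private final boolean[] cacheVals = new boolean[CACHE_SIZE];
    private final boolean[] cacheValid = new boolean[CACHE_SIZE];
    private final long[] flatBits;
    private final int offset;                    // remaps negatives so index 0 = min value

    public SearchPathSketch(int min, int max) {
        offset = min;
        flatBits = new long[((max - min) >>> 6) + 1];
    }

    public void insert(int v) {
        int m = v - offset;
        flatBits[m >>> 6] |= 1L << (m & 63);
        cacheValid[v & (CACHE_SIZE - 1)] = false;          // invalidate stale cache slot
    }

    public boolean search(int target) {
        int slot = target & (CACHE_SIZE - 1);                           // step 1: cache check
        if (cacheValid[slot] && cacheKeys[slot] == target) return cacheVals[slot];
        int m = target - offset;                                        // step 3: single bit read
        boolean hit = ((flatBits[m >>> 6] >>> (m & 63)) & 1L) != 0;
        cacheKeys[slot] = target;                                       // step 4: cache write
        cacheVals[slot] = hit;
        cacheValid[slot] = true;
        return hit;
    }

    public static void main(String[] args) {
        SearchPathSketch s = new SearchPathSketch(-100, 100);
        s.insert(-42);
        System.out.println(s.search(-42) + " " + s.search(7) + " " + s.search(-42));
    }
}
```

The third call in main is served from the cache slot filled by the first — the bit array is never touched again for a repeated key.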
Chapter 06
Memory Footprint
For 1M values in a 10M range (dense mode). All figures are for the live structure, not the input array.
Diagram 07 — Memory Comparison (dense: 1M values, 10M range · sparse: 100 values, 1B range)
HashSet boxes each Integer: a 16-byte object header plus the 4-byte value (padded to a 16-byte object), plus the hash-table entry and its reference — roughly 36 bytes per element, versus 1 bit per element in FlashSet.
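The dense-mode arithmetic, using the diagram's own figures (~36 bytes per boxed HashSet element, one bit per value of range for FlashSet):

```java
public class FootprintMath {
    public static void main(String[] args) {
        long values = 1_000_000L;             // elements stored
        long range  = 10_000_000L;            // span of possible values (dense mode)

        long hashSetBytes  = values * 36;     // ~36 bytes per boxed element
        long flashSetBytes = range / 8;       // 1 bit per value in the range

        System.out.println(hashSetBytes / 1_000_000 + "MB vs "
                         + flashSetBytes / 1_000_000 + "MB");
    }
}
```

Even though the bit array covers the whole 10M range rather than just the 1M stored values, it is still more than an order of magnitude smaller.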
Chapter 07
The Bug That Hid in Plain Sight
With correctness tests passing and performance solid, I nearly missed a bug that would have caused silent wrong answers in production.
⚠ Sentinel Collision Bug
The cache used Integer.MIN_VALUE as an "empty slot" marker. But Integer.MIN_VALUE is a valid integer that can be inserted and searched.
// The broken code
int EMPTY = Integer.MIN_VALUE;
Arrays.fill(cacheKeys, EMPTY);
// When target = Integer.MIN_VALUE:
int slot = target & 511; // → slot 0
// cacheKeys[0] = EMPTY = Integer.MIN_VALUE
// cacheKeys[0] == target → TRUE (false match!)
// returns cacheVals[0] → false (default)
// Result: search(MIN_VALUE) returns false even after insert
// The fix: a separate validity array
private final boolean[] cacheValid = new boolean[CACHE_SIZE];
// Check becomes:
if (cacheValid[slot] && cacheKeys[slot] == target) { ... }
// Invalidation sets cacheValid[slot] to false — no sentinel value needed
What Caught It
Not a benchmark. Not a profiler. A unit test that specifically called search(Integer.MIN_VALUE). This bug would have been invisible to every performance test ever written. Tests matter.
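The collision is easy to reproduce in isolation. A standalone demo of the broken check (not the project's code):

```java
import java.util.Arrays;

// Demonstrates the sentinel collision: an empty slot "matches" MIN_VALUE.
public class SentinelBugDemo {
    public static void main(String[] args) {
        final int CACHE_SIZE = 512;
        final int EMPTY = Integer.MIN_VALUE;        // the broken sentinel
        int[] cacheKeys = new int[CACHE_SIZE];
        Arrays.fill(cacheKeys, EMPTY);              // cache starts "empty"

        int target = Integer.MIN_VALUE;             // a perfectly valid key
        int slot = target & (CACHE_SIZE - 1);       // MIN_VALUE & 511 → slot 0

        // Broken equality check: the never-written slot looks like a cache hit,
        // so search(MIN_VALUE) would return the slot's default value (false).
        System.out.println(cacheKeys[slot] == target);
    }
}
```

Printing true here is exactly the false match: no insert ever happened, yet the cache claims it has an answer.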
Chapter 08
The Numbers
Fair benchmark: all structures built first, 10 full equalised warmup passes, median of 5 timed runs. No JIT order bias.
Diagram 08 — Search Latency by Scenario (ns/query)
Fraud Blocklist — dense · 1M values · 6M range → 5.9× faster
IoT Dedup — sparse · 120K values · 1B range → competitive
API Hotset — 800K values · 16 hot keys · 92% repeat → 1.4× faster
| Scenario | FlashSet | HashSet | BitSet* | Result |
|---|---|---|---|---|
| Fraud blocklist (dense) | 4.7ns | 28ns | 0.8ns | 5.9× faster |
| IoT dedup (sparse) | 28ns | 35ns | n/a | Competitive |
| API abuse hotset | 7ns | 10ns | 0.8ns | 1.4× faster |
*BitSet is read-only, no insert/delete, no negatives, no dynamic growth. Not a fair comparison for mutable use cases.
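The methodology — build everything first, equalised warmup, median of timed runs — can be sketched as a generic harness. This is my reconstruction of the approach, not the project's actual benchmark code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.function.IntPredicate;

public class MedianBench {
    // Median ns/query over `runs` timed passes, after `warmup` untimed passes.
    static long medianNanosPerQuery(IntPredicate search, int[] queries, int warmup, int runs) {
        for (int w = 0; w < warmup; w++) pass(search, queries);   // equalised warmup, untimed
        long[] times = new long[runs];
        for (int r = 0; r < runs; r++) {
            long t0 = System.nanoTime();
            pass(search, queries);
            times[r] = (System.nanoTime() - t0) / queries.length;
        }
        Arrays.sort(times);
        return times[runs / 2];                                   // median, not mean
    }

    static boolean pass(IntPredicate search, int[] queries) {
        boolean sink = false;                  // consume results to block dead-code elimination
        for (int q : queries) sink ^= search.test(q);
        return sink;
    }

    public static void main(String[] args) {
        HashSet<Integer> set = new HashSet<>();
        int[] queries = new int[100_000];
        for (int i = 0; i < queries.length; i++) {
            set.add(i * 3);                    // ~33% hit rate on the query stream
            queries[i] = i;
        }
        long ns = medianNanosPerQuery(set::contains, queries, 10, 5);
        System.out.println(ns >= 0);
    }
}
```

Because every structure goes through the same warmup before any timing starts, no structure gets a JIT advantage from running first.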
Chapter 09
What I Actually Learned
01
Profile before you optimise
I spent a week tuning bucket sizes before running a real profiler. One profiling session told me more than a week of guessing. The bottleneck was memory layout, not algorithmic complexity.
02
Your benchmark is probably lying
If you designed the benchmark while designing the algorithm, they are optimised for each other. My original benchmark had a 10% hit rate — right in the algorithm's sweet spot. A fair benchmark includes cases you didn't design for.
03
O(1) is not a complete description
HashSet is O(1). So is a direct array read. One takes 150ns and the other takes 5ns. Asymptotic complexity tells you how an algorithm scales, not how fast it actually runs.
04
Tests catch what benchmarks miss
The Integer.MIN_VALUE sentinel bug would have passed every performance test. It was invisible to benchmarks. A unit test caught it in five seconds. Write tests for edge cases that seem unlikely.
05
Simplicity is a feature, not a shortcut
The final code is 219 meaningful lines. An earlier version was 400+. The simplification pass didn't change a single behaviour. It made every bug easier to see and every future change safer to make.