Algorithm Engineering — Java Performance

I Built a Search Algorithm That Beats HashSet

From a college assignment gone wrong to a data structure that outperforms Java's standard library by 4–10×. The complete journey, every bug, every profiler output.

10× faster than HashSet
124 tests, 0 failures
7ns per lookup
v1→v6 six versions

Where It Started

My Analysis of Algorithms class had a practical component. Build a search algorithm, benchmark it, explain the results. I built something I called HyperBloom Search — a two-level bucket structure with a Bloom filter in front.

The first version used three levels of nested Java collections. I thought the Bloom filter would save it.

// Version 1 — this is genuinely what I started with
static ArrayList<ArrayList<ArrayList<Integer>>> hyperBloomBuckets;
static BitSet bloomFilter = new BitSet(BLOOM_SIZE);

The benchmark was embarrassing. Not because the results were wrong — they were correct. But because the algorithm was slower than a plain sorted list on most workloads.

The Question That Started Everything

I spent a week tuning parameters — bucket sizes, Bloom filter size, hash function count. Nothing helped. Then I asked the right question: why is the correct algorithm still slow?

The Problem Was Never the Algorithm

The profiler gave me three numbers that changed how I think about software performance:

HashMap.get():              111ns/op
BinarySearch on bucket:     840ns/op  ← supposed to be fast
Direct array read (ideal):    5ns/op

The binary search was taking 840 nanoseconds — not because O(log n) is slow, but because every comparison chased a pointer to a heap-allocated Integer object. The CPU had no idea what to prefetch. Every comparison was a cache miss.
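The layout difference is easy to demonstrate: the same binary search runs over boxed and primitive arrays. This is an illustrative sketch, not the original benchmark code; the class name and values are made up.

```java
import java.util.Arrays;

public class LayoutDemo {
    public static void main(String[] args) {
        // Same algorithm, two memory layouts. Integer[] stores pointers to
        // heap objects scattered across memory; int[] keeps the values
        // contiguous, so the prefetcher can stream them into cache.
        int n = 1_000_000;
        int[] flat = new int[n];
        Integer[] boxed = new Integer[n];
        for (int i = 0; i < n; i++) { flat[i] = i * 2; boxed[i] = i * 2; }

        // Both searches are O(log n); only the memory layout differs.
        int a = Arrays.binarySearch(flat, 777_776);
        int b = Arrays.binarySearch(boxed, 777_776);
        System.out.println(a + " " + b); // prints "388888 388888"
    }
}
```

Identical results, identical complexity; the boxed version simply pays a pointer dereference (and usually a cache miss) per comparison.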

Diagram 01 — Memory Layout: Pointer Chasing vs Flat Array
❌ ArrayList<Integer> (old): the backing array holds pointers (ptr → ptr → ptr → …), each pointing at a separate heap object such as Integer(42) or Integer(7), each carrying a 16-byte header. Each read = cache miss = ~100ns.
✓ flat int[] (new): the values (42, 7, 99, 3, …) sit in contiguous memory. Prefetcher loads whole block = ~5ns.
The algorithm was O(log n). The memory layout made it 170× slower than theory.

Six Versions, One Algorithm

Diagram 02 — Version Evolution Timeline
v1 — ArrayList³ + Bloom Filter
Three nested ArrayList<Integer> levels. Bloom filter pre-reject. Correct results, catastrophically slow: every lookup paid 840ns in pointer chasing.
v2 — int[] Buckets + MurmurHash3
Replaced ArrayList<Integer> with flat int[] arrays. Fixed hash functions (FPR: 2.24% → 0.90%). Sorted insert instead of sort-on-every-insert. A real improvement, but still lost on high hit rates.
v3 — FlatBits + Range Index
Abolished buckets entirely. One sorted int[] plus a flat rangeStart[] jump index. Eliminated the HashMap and the binary search over scattered heap objects. First version to consistently beat HashSet.
v4 — Single Bit Array + Direct-Mapped Cache
The key insight: one bit per possible value. No sorted array, no range index. Plus a 512-slot direct-mapped cache. Dense workload: 6ns. All distributions now winning.
v5 — Full Feature Set
Added negatives (offset remapping), deletion (clear bit), dynamic growth (amortised doubling), sparse mode (two-level bitmap for large ranges), and an adaptive cache (disables itself on uniform traffic).
v6 (final) — Production-Ready
Removed VarHandle and ThreadLocal from the hot path (225ns → 8ns). Fixed the Integer.MIN_VALUE sentinel collision. Thread-safe CAS writes with plain-array reads. 219 lines, 124 tests, zero failures.

The Final Architecture

Diagram 03 — System Architecture
search(target) — public API entry point
→ Direct-Mapped Cache: 512 slots · slot = target & 511 · ~2ns on hit
  CACHE HIT → return cached result: true / false · ~2ns total
  CACHE MISS → check sparseMode flag, then one of:
    Dense Read: flatBits[m >>> 6] · 1 array read
    Sparse Read: L1 reject · L2 fine bit
Diagram 04 — FlatBits: One Bit Per Value

Dataset: {3, 7, 12, 19, 25}. Range: 0–31. Each long holds 64 bits. Bit[v]=1 means v is present.

long[0] — values 0 to 63 (bits 0–31 shown): bits 3, 7, 12, 19, and 25 are set (inserted); all other bits are clear (not present). Bit 25 is the current search target.
Lookup: search(25)
word = 25 >>> 6 = 0  ·  bit = 25 & 63 = 25  ·  (long[0] >>> 25) & 1L = 1 → true
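The dense structure described above can be sketched in a few lines. This is a minimal reconstruction from the article's description; the class and method names (FlatBits, insert, search) are assumptions, not the project's actual API.

```java
public class FlatBits {
    private final long[] bits;  // one bit per possible value

    FlatBits(int range) {
        // each long covers 64 values; round the range up
        bits = new long[(range + 63) >>> 6];
    }

    void insert(int v) {
        bits[v >>> 6] |= 1L << (v & 63);  // set bit v
    }

    boolean search(int v) {
        // word index = v >>> 6, bit index = v & 63:
        // one array read, a shift, and a mask — no pointer chasing
        return ((bits[v >>> 6] >>> (v & 63)) & 1L) != 0;
    }

    public static void main(String[] args) {
        FlatBits s = new FlatBits(32);
        for (int v : new int[]{3, 7, 12, 19, 25}) s.insert(v);
        System.out.println(s.search(25)); // true
        System.out.println(s.search(24)); // false
    }
}
```

The main method replays the diagram's worked example: search(25) lands in word 0, bit 25.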
Diagram 05 — Sparse Mode: Two-Level Bitmap

For large ranges (e.g., 120K values in 1B range). L1 coarse index eliminates 99.98% of lookups instantly.

L1 — Coarse (29KB): 1 bit per 4096-value chunk · 1B range = 244K chunks = 29KB
L2 — Fine (lazy, per-chunk):
  Chunk 0 (values 0–4095): allocated · 64 longs = 512 bytes
  Chunk 1: null — not allocated (no values in this range)
L1 fast-reject: if the chunk bit = 0, return false immediately. For 99.98% of queries on sparse data, the search ends in 1 array read.
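The two-level scheme can be sketched as below. This is a hedged reconstruction from the description above; field names, the 4096-value chunk size, and the lazy allocation strategy are taken from the article, everything else is illustrative.

```java
public class SparseBits {
    private static final int CHUNK_SHIFT = 12;  // 4096 values per chunk
    private final long[] l1;                    // L1: 1 bit per chunk
    private final long[][] l2;                  // L2: fine bitmaps, allocated lazily

    SparseBits(long range) {
        int chunks = (int) ((range + 4095) >>> CHUNK_SHIFT);
        l1 = new long[(chunks + 63) >>> 6];
        l2 = new long[chunks][];                // all null until first insert
    }

    void insert(int v) {
        int chunk = v >>> CHUNK_SHIFT;
        l1[chunk >>> 6] |= 1L << (chunk & 63);            // mark chunk non-empty
        if (l2[chunk] == null) l2[chunk] = new long[64];  // 4096 bits = 512 bytes
        int fine = v & 4095;
        l2[chunk][fine >>> 6] |= 1L << (fine & 63);
    }

    boolean search(int v) {
        int chunk = v >>> CHUNK_SHIFT;
        // L1 fast-reject: an empty chunk answers false in a single read
        if (((l1[chunk >>> 6] >>> (chunk & 63)) & 1L) == 0) return false;
        int fine = v & 4095;
        return ((l2[chunk][fine >>> 6] >>> (fine & 63)) & 1L) != 0;
    }
}
```

Only chunks that actually hold values ever allocate their 512-byte fine bitmap, which is what keeps the sparse mode small.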

The Search Path, Step by Step

Diagram 06 — Search() Data Flow
1 — Cache Check
slot = target & 511. If cacheValid[slot] && cacheKeys[slot] == target → return cached result immediately.
~2ns on hit · ~0ns overhead on miss
2 — Mode Branch
Check the sparseMode flag (set once at construction, never changed). Branch prediction makes this essentially free.
~0ns
3 — Dense: Single Bit Read
mapped = target − offset. Read flatBits[mapped >>> 6]. Shift and mask. Return.
~5ns · 1 array read · no pointer chasing
3′ — Sparse: L1 Reject → L2 Fine
chunk = mapped >>> 12. Check the L1 bit — rejects 99.98% of misses in 1 read. If the chunk exists, read the L2 fine bit.
~25ns on miss · ~35ns on hit
4 — Cache Write + Adaptive Tick
Store the result in the cache slot. Every 1024 queries, evaluate the hit rate; if below 1.5%, disable the cache. Re-enable after the next 1024 queries.
~3ns · fully automatic · no configuration
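Steps 1, 3, and 4 above fit in a short sketch. This is an illustrative dense-mode reconstruction, not the project's source: the names, the invalidate-on-insert policy, and the omission of the adaptive tick are my assumptions.

```java
public class CachedSearch {
    private static final int CACHE_SIZE = 512;
    private final int[] cacheKeys = new int[CACHE_SIZE];
    private final boolean[] cacheVals = new boolean[CACHE_SIZE];
    private final boolean[] cacheValid = new boolean[CACHE_SIZE];
    private final long[] bits;  // dense FlatBits storage

    CachedSearch(int range) { bits = new long[(range + 63) >>> 6]; }

    void insert(int v) {
        bits[v >>> 6] |= 1L << (v & 63);
        cacheValid[v & (CACHE_SIZE - 1)] = false;  // keep cache consistent
    }

    boolean search(int target) {
        // Step 1 — cache check: slot = target & 511
        int slot = target & (CACHE_SIZE - 1);
        if (cacheValid[slot] && cacheKeys[slot] == target) return cacheVals[slot];

        // Step 3 — dense read: one array access, shift, mask
        boolean hit = ((bits[target >>> 6] >>> (target & 63)) & 1L) != 0;

        // Step 4 — cache write (adaptive hit-rate tick omitted here)
        cacheKeys[slot] = target;
        cacheVals[slot] = hit;
        cacheValid[slot] = true;
        return hit;
    }
}
```

Note the validity array instead of a sentinel key; the reason for that choice is the subject of the bug story below.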

Memory Footprint

For 1M values in a 10M range (dense mode). All figures are for the live structure, not the input array.

Diagram 07 — Memory Comparison (1M values, 10M range)
HashSet<Integer>:              32.6 MB
FlashSet dense:                1.25 MB (+ 2.5 KB cache overhead)
Sparse (100 vals, 1B range):   ~82 KB
Sparse flat alternative:       119 MB
HashSet boxes each Integer: a 16-byte Integer object plus a hash-table node and table reference, roughly 36 bytes per element in practice, vs 1 bit per element in FlashSet.
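The bit-array figures above follow from simple arithmetic. A quick back-of-envelope check (pure arithmetic in bytes, not measured heap usage; the 100-chunks-for-100-values sparse case is a worst-case assumption):

```java
public class Footprint {
    public static void main(String[] args) {
        // Dense: 1 bit per value across a 10M range
        long denseBytes = 10_000_000L / 8;          // 1_250_000 → 1.25 MB

        // Sparse L1: 1 bit per 4096-value chunk across a 1B range
        long chunks = 1_000_000_000L >> 12;         // 244_140 chunks
        long l1Bytes = chunks / 8;                  // 30_517 → ~29 KiB

        // Sparse worst case: 100 values, each in its own 512-byte L2 chunk
        long sparseBytes = l1Bytes + 100 * 512;     // 81_717 → ~82 KB

        System.out.println(denseBytes + " " + l1Bytes + " " + sparseBytes);
        // prints "1250000 30517 81717"
    }
}
```

So the headline numbers (1.25 MB dense, ~29 KB coarse index, ~82 KB sparse) are structural, not tuning-dependent.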

The Bug That Hid in Plain Sight

With correctness tests passing and performance solid, I nearly missed a bug that would have caused silent wrong answers in production.

⚠ Sentinel Collision Bug

The cache used Integer.MIN_VALUE as an "empty slot" marker. But Integer.MIN_VALUE is a valid integer that can be inserted and searched.

// The broken code
int EMPTY = Integer.MIN_VALUE;
Arrays.fill(cacheKeys, EMPTY);

// When target = Integer.MIN_VALUE:
int slot = target & 511;        // → slot 0
// cacheKeys[0] = EMPTY = Integer.MIN_VALUE
// cacheKeys[0] == target        → TRUE  (false match!)
// returns cacheVals[0]          → false (default)
// Result: search(MIN_VALUE) returns false even after insert
// The fix: a separate validity array
private final boolean[] cacheValid = new boolean[CACHE_SIZE];

// The check becomes:
if (cacheValid[slot] && cacheKeys[slot] == target) { ... }

// Invalidation sets cacheValid[slot] = false — no sentinel needed
What Caught It

Not a benchmark. Not a profiler. A unit test that specifically called search(Integer.MIN_VALUE). This bug would have been invisible to every performance test ever written. Tests matter.
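The collision itself is reproducible with the raw cache arrays. A minimal demonstration (illustrative, using the same constants as the broken code above):

```java
import java.util.Arrays;

public class SentinelBugDemo {
    public static void main(String[] args) {
        final int SIZE = 512;
        int[] cacheKeys = new int[SIZE];
        Arrays.fill(cacheKeys, Integer.MIN_VALUE);  // "empty" sentinel

        int target = Integer.MIN_VALUE;
        int slot = target & (SIZE - 1);             // MIN_VALUE & 511 = 0

        // Broken check: an empty slot is indistinguishable from a
        // cached entry for MIN_VALUE, so this is a false match
        boolean brokenHit = cacheKeys[slot] == target;
        System.out.println(brokenHit);              // true — the bug

        // Fixed check: a validity bit makes empty slots unambiguous
        boolean[] cacheValid = new boolean[SIZE];
        boolean fixedHit = cacheValid[slot] && cacheKeys[slot] == target;
        System.out.println(fixedHit);               // false — correct
    }
}
```

A unit test asserting search(Integer.MIN_VALUE) after insert is exactly the kind of edge case the article credits with catching this.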

The Numbers

Fair benchmark: all structures built first, 10 full equalised warmup passes, median of 5 timed runs. No JIT order bias.
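The shape of that methodology looks roughly like this. A real harness should use JMH; this sketch (names and structure are mine, not the project's) only shows the build-first, warm-up-equally, median-of-runs discipline:

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

public class MedianBench {
    // Runs the same query workload `runs` times and reports the
    // median ns/op — the median resists one-off GC or JIT spikes.
    static double medianNsPerOp(IntPredicate search, int[] queries, int runs) {
        double[] samples = new double[runs];
        for (int r = 0; r < runs; r++) {
            long t0 = System.nanoTime();
            long hits = 0;
            for (int q : queries) if (search.test(q)) hits++;
            samples[r] = (System.nanoTime() - t0) / (double) queries.length;
            if (hits < 0) throw new AssertionError(); // keep `hits` observable
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }
}
```

Running the identical warmup workload through every contender before timing any of them is what removes the "whoever JIT-compiles first wins" bias the article mentions.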

Diagram 08 — Search Latency by Scenario (ns/query)
Fraud Blocklist (dense · 1M values · 6M range): FlashSet 4.7ns vs HashSet 28ns → 5.9× faster
IoT Dedup (sparse · 120K values · 1B range): FlashSet 28ns vs HashSet 35ns → competitive
API Hotset (800K values · 16 hot keys · 92% repeat): FlashSet 7ns vs HashSet 10ns → 1.4× faster

Scenario                  FlashSet   HashSet   BitSet*   Result
Fraud blocklist (dense)   4.7ns      28ns      0.8ns     5.9× faster
IoT dedup (sparse)        28ns       35ns      n/a       Competitive
API abuse hotset          7ns        10ns      0.8ns     1.4× faster

*BitSet is read-only, no insert/delete, no negatives, no dynamic growth. Not a fair comparison for mutable use cases.

What I Actually Learned

01
Profile before you optimise
I spent a week tuning bucket sizes before running a real profiler. One profiling session told me more than a week of guessing. The bottleneck was memory layout, not algorithmic complexity.
02
Your benchmark is probably lying
If you designed the benchmark while designing the algorithm, they are optimised for each other. My original benchmark had a 10% hit rate — right in the algorithm's sweet spot. A fair benchmark includes cases you didn't design for.
03
O(1) is not a complete description
HashSet is O(1). So is a direct array read. One takes on the order of 100ns under cache pressure; the other takes 5ns. Asymptotic complexity tells you how an algorithm scales, not how fast it actually runs.
04
Tests catch what benchmarks miss
The Integer.MIN_VALUE sentinel bug would have passed every performance test. It was invisible to benchmarks. A unit test caught it in five seconds. Write tests for edge cases that seem unlikely.
05
Simplicity is a feature, not a shortcut
The final code is 219 meaningful lines. An earlier version was 400+. The simplification pass didn't change a single behaviour. It made every bug easier to see and every future change safer to make.