Algorithm Engineering — Java Performance

I Built a Search Algorithm That Beats HashSet

From a college assignment gone wrong to a data structure that outperforms Java's standard library by 4–10×. The complete journey, every bug, every profiler output.

10× faster than HashSet
124 tests, 0 failures
7ns per lookup
v1→v6 six versions

Where It Started

My Analysis of Algorithms class had a practical component. Build a search algorithm, benchmark it, explain the results. I built something I called HyperBloom Search — a two-level bucket structure with a Bloom filter in front.

The first version used three levels of nested Java collections. I thought the Bloom filter would save it.

// Version 1 — this is genuinely what I started with
static ArrayList<ArrayList<ArrayList<Integer>>> hyperBloomBuckets;
static BitSet bloomFilter = new BitSet(BLOOM_SIZE);

The benchmark was embarrassing. Not because the results were wrong — they were correct. But because the algorithm was slower than a plain sorted list on most workloads.

The Question That Started Everything

I spent a week tuning parameters — bucket sizes, Bloom filter size, hash function count. Nothing helped. Then I asked the right question: why is the correct algorithm still slow?

The Problem Was Never the Algorithm

The profiler gave me three numbers that changed how I think about software performance:

HashMap.get():              111ns/op
BinarySearch on bucket:     840ns/op  ← supposed to be fast
Direct array read (ideal):    5ns/op

The binary search was taking 840 nanoseconds — not because O(log n) is slow, but because every comparison chased a pointer to a heap-allocated Integer object. The CPU had no idea what to prefetch. Every comparison was a cache miss.
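The layout difference is easy to demonstrate: the same binary search runs over boxed and primitive arrays. This is an illustrative sketch, not the original benchmark code; the class name and values are made up.

```java
import java.util.Arrays;

public class LayoutDemo {
    public static void main(String[] args) {
        // Same algorithm, two memory layouts. Integer[] stores pointers to
        // heap objects scattered across memory; int[] keeps the values
        // contiguous, so the prefetcher can stream them into cache.
        int n = 1_000_000;
        int[] flat = new int[n];
        Integer[] boxed = new Integer[n];
        for (int i = 0; i < n; i++) { flat[i] = i * 2; boxed[i] = i * 2; }

        // Both searches are O(log n); only the memory layout differs.
        int a = Arrays.binarySearch(flat, 777_776);
        int b = Arrays.binarySearch(boxed, 777_776);
        System.out.println(a + " " + b); // prints "388888 388888"
    }
}
```

Identical results, identical complexity; the boxed version simply pays a pointer dereference (and usually a cache miss) per comparison.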

Diagram 01 — Memory Layout: Pointer Chasing vs Flat Array
❌ ArrayList<Integer> (old): the backing array holds pointers (ptr → ptr → ptr → …), each pointing at a separate heap object such as Integer(42) or Integer(7), each carrying a 16-byte header. Each read = cache miss = ~100ns.
✓ flat int[] (new): the values (42, 7, 99, 3, …) sit in contiguous memory. Prefetcher loads whole block = ~5ns.
The algorithm was O(log n). The memory layout made it 170× slower than theory.

Six Versions, One Algorithm

Diagram 02 — Version Evolution Timeline
v1 — ArrayList³ + Bloom Filter
Three nested ArrayList<Integer> levels. Bloom filter pre-reject. Correct results, catastrophically slow: every lookup paid 840ns in pointer chasing.
v2 — int[] Buckets + MurmurHash3
Replaced ArrayList<Integer> with flat int[] arrays. Fixed hash functions (FPR: 2.24% → 0.90%). Sorted insert instead of sort-on-every-insert. A real improvement, but still lost on high hit rates.
v3 — FlatBits + Range Index
Abolished buckets entirely. One sorted int[] plus a flat rangeStart[] jump index. Eliminated the HashMap and the binary search over scattered heap objects. First version to consistently beat HashSet.
v4 — Single Bit Array + Direct-Mapped Cache
The key insight: one bit per possible value. No sorted array, no range index. Plus a 512-slot direct-mapped cache. Dense workload: 6ns. All distributions now winning.
v5 — Full Feature Set
Added negatives (offset remapping), deletion (clear bit), dynamic growth (amortised doubling), sparse mode (two-level bitmap for large ranges), and an adaptive cache (disables itself on uniform traffic).
v6 (final) — Production-Ready
Removed VarHandle and ThreadLocal from the hot path (225ns → 8ns). Fixed the Integer.MIN_VALUE sentinel collision. Thread-safe CAS writes with plain-array reads. 219 lines, 124 tests, zero failures.

The Final Architecture

Diagram 03 — System Architecture
search(target) — public API entry point
→ Direct-Mapped Cache: 512 slots · slot = target & 511 · ~2ns on hit
  CACHE HIT → return cached result: true / false · ~2ns total
  CACHE MISS → check sparseMode flag, then one of:
    Dense Read: flatBits[m >>> 6] · 1 array read
    Sparse Read: L1 reject · L2 fine bit
Diagram 04 — FlatBits: One Bit Per Value

Dataset: {3, 7, 12, 19, 25}. Range: 0–31. Each long holds 64 bits. Bit[v]=1 means v is present.

long[0] — values 0 to 63 (bits 0–31 shown): bits 3, 7, 12, 19, and 25 are set (inserted); all other bits are clear (not present). Bit 25 is the current search target.
Lookup: search(25)
word = 25 >>> 6 = 0  ·  bit = 25 & 63 = 25  ·  (long[0] >>> 25) & 1L = 1 → true
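The dense structure described above can be sketched in a few lines. This is a minimal reconstruction from the article's description; the class and method names (FlatBits, insert, search) are assumptions, not the project's actual API.

```java
public class FlatBits {
    private final long[] bits;  // one bit per possible value

    FlatBits(int range) {
        // each long covers 64 values; round the range up
        bits = new long[(range + 63) >>> 6];
    }

    void insert(int v) {
        bits[v >>> 6] |= 1L << (v & 63);  // set bit v
    }

    boolean search(int v) {
        // word index = v >>> 6, bit index = v & 63:
        // one array read, a shift, and a mask — no pointer chasing
        return ((bits[v >>> 6] >>> (v & 63)) & 1L) != 0;
    }

    public static void main(String[] args) {
        FlatBits s = new FlatBits(32);
        for (int v : new int[]{3, 7, 12, 19, 25}) s.insert(v);
        System.out.println(s.search(25)); // true
        System.out.println(s.search(24)); // false
    }
}
```

The main method replays the diagram's worked example: search(25) lands in word 0, bit 25.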
Diagram 05 — Sparse Mode: Two-Level Bitmap

For large ranges (e.g., 120K values in 1B range). L1 coarse index eliminates 99.98% of lookups instantly.

L1 — Coarse (29KB): 1 bit per 4096-value chunk · 1B range = 244K chunks = 29KB
L2 — Fine (lazy, per-chunk):
  Chunk 0 (values 0–4095): allocated · 64 longs = 512 bytes
  Chunk 1: null — not allocated (no values in this range)
L1 fast-reject: if the chunk bit = 0, return false immediately. For 99.98% of queries on sparse data, the search ends in 1 array read.
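The two-level scheme can be sketched as below. This is a hedged reconstruction from the description above; field names, the 4096-value chunk size, and the lazy allocation strategy are taken from the article, everything else is illustrative.

```java
public class SparseBits {
    private static final int CHUNK_SHIFT = 12;  // 4096 values per chunk
    private final long[] l1;                    // L1: 1 bit per chunk
    private final long[][] l2;                  // L2: fine bitmaps, allocated lazily

    SparseBits(long range) {
        int chunks = (int) ((range + 4095) >>> CHUNK_SHIFT);
        l1 = new long[(chunks + 63) >>> 6];
        l2 = new long[chunks][];                // all null until first insert
    }

    void insert(int v) {
        int chunk = v >>> CHUNK_SHIFT;
        l1[chunk >>> 6] |= 1L << (chunk & 63);            // mark chunk non-empty
        if (l2[chunk] == null) l2[chunk] = new long[64];  // 4096 bits = 512 bytes
        int fine = v & 4095;
        l2[chunk][fine >>> 6] |= 1L << (fine & 63);
    }

    boolean search(int v) {
        int chunk = v >>> CHUNK_SHIFT;
        // L1 fast-reject: an empty chunk answers false in a single read
        if (((l1[chunk >>> 6] >>> (chunk & 63)) & 1L) == 0) return false;
        int fine = v & 4095;
        return ((l2[chunk][fine >>> 6] >>> (fine & 63)) & 1L) != 0;
    }
}
```

Only chunks that actually hold values ever allocate their 512-byte fine bitmap, which is what keeps the sparse mode small.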

The Search Path, Step by Step

Diagram 06 — Search() Data Flow
1 — Cache Check
slot = target & 511. If cacheValid[slot] && cacheKeys[slot] == target → return cached result immediately.
~2ns on hit · ~0ns overhead on miss
2 — Mode Branch
Check the sparseMode flag (set once at construction, never changed). Branch prediction makes this essentially free.
~0ns
3 — Dense: Single Bit Read
mapped = target − offset. Read flatBits[mapped >>> 6]. Shift and mask. Return.
~5ns · 1 array read · no pointer chasing
3′ — Sparse: L1 Reject → L2 Fine
chunk = mapped >>> 12. Check the L1 bit — rejects 99.98% of misses in 1 read. If the chunk exists, read the L2 fine bit.
~25ns on miss · ~35ns on hit
4 — Cache Write + Adaptive Tick
Store the result in the cache slot. Every 1024 queries, evaluate the hit rate; if below 1.5%, disable the cache. Re-enable after the next 1024 queries.
~3ns · fully automatic · no configuration
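Steps 1, 3, and 4 above fit in a short sketch. This is an illustrative dense-mode reconstruction, not the project's source: the names, the invalidate-on-insert policy, and the omission of the adaptive tick are my assumptions.

```java
public class CachedSearch {
    private static final int CACHE_SIZE = 512;
    private final int[] cacheKeys = new int[CACHE_SIZE];
    private final boolean[] cacheVals = new boolean[CACHE_SIZE];
    private final boolean[] cacheValid = new boolean[CACHE_SIZE];
    private final long[] bits;  // dense FlatBits storage

    CachedSearch(int range) { bits = new long[(range + 63) >>> 6]; }

    void insert(int v) {
        bits[v >>> 6] |= 1L << (v & 63);
        cacheValid[v & (CACHE_SIZE - 1)] = false;  // keep cache consistent
    }

    boolean search(int target) {
        // Step 1 — cache check: slot = target & 511
        int slot = target & (CACHE_SIZE - 1);
        if (cacheValid[slot] && cacheKeys[slot] == target) return cacheVals[slot];

        // Step 3 — dense read: one array access, shift, mask
        boolean hit = ((bits[target >>> 6] >>> (target & 63)) & 1L) != 0;

        // Step 4 — cache write (adaptive hit-rate tick omitted here)
        cacheKeys[slot] = target;
        cacheVals[slot] = hit;
        cacheValid[slot] = true;
        return hit;
    }
}
```

Note the validity array instead of a sentinel key; the reason for that choice is the subject of the bug story below.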

Memory Footprint

For 1M values in a 10M range (dense mode). All figures are for the live structure, not the input array.

Diagram 07 — Memory Comparison (1M values, 10M range)
HashSet<Integer>:              32.6 MB
FlashSet dense:                1.25 MB (+ 2.5 KB cache overhead)
Sparse (100 vals, 1B range):   ~82 KB
Sparse flat alternative:       119 MB
HashSet boxes each Integer: a 16-byte Integer object plus a hash-table node and table reference, roughly 36 bytes per element in practice, vs 1 bit per element in FlashSet.
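The bit-array figures above follow from simple arithmetic. A quick back-of-envelope check (pure arithmetic in bytes, not measured heap usage; the 100-chunks-for-100-values sparse case is a worst-case assumption):

```java
public class Footprint {
    public static void main(String[] args) {
        // Dense: 1 bit per value across a 10M range
        long denseBytes = 10_000_000L / 8;          // 1_250_000 → 1.25 MB

        // Sparse L1: 1 bit per 4096-value chunk across a 1B range
        long chunks = 1_000_000_000L >> 12;         // 244_140 chunks
        long l1Bytes = chunks / 8;                  // 30_517 → ~29 KiB

        // Sparse worst case: 100 values, each in its own 512-byte L2 chunk
        long sparseBytes = l1Bytes + 100 * 512;     // 81_717 → ~82 KB

        System.out.println(denseBytes + " " + l1Bytes + " " + sparseBytes);
        // prints "1250000 30517 81717"
    }
}
```

So the headline numbers (1.25 MB dense, ~29 KB coarse index, ~82 KB sparse) are structural, not tuning-dependent.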

The Bug That Hid in Plain Sight

With correctness tests passing and performance solid, I nearly missed a bug that would have caused silent wrong answers in production.

⚠ Sentinel Collision Bug

The cache used Integer.MIN_VALUE as an "empty slot" marker. But Integer.MIN_VALUE is a valid integer that can be inserted and searched.

// The broken code
int EMPTY = Integer.MIN_VALUE;
Arrays.fill(cacheKeys, EMPTY);

// When target = Integer.MIN_VALUE:
int slot = target & 511;        // → slot 0
// cacheKeys[0] = EMPTY = Integer.MIN_VALUE
// cacheKeys[0] == target        → TRUE  (false match!)
// returns cacheVals[0]          → false (default)
// Result: search(MIN_VALUE) returns false even after insert
// The fix: a separate validity array
private final boolean[] cacheValid = new boolean[CACHE_SIZE];

// The check becomes:
if (cacheValid[slot] && cacheKeys[slot] == target) { ... }

// Invalidation sets cacheValid[slot] = false — no sentinel needed
What Caught It

Not a benchmark. Not a profiler. A unit test that specifically called search(Integer.MIN_VALUE). This bug would have been invisible to every performance test ever written. Tests matter.
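The collision itself is reproducible with the raw cache arrays. A minimal demonstration (illustrative, using the same constants as the broken code above):

```java
import java.util.Arrays;

public class SentinelBugDemo {
    public static void main(String[] args) {
        final int SIZE = 512;
        int[] cacheKeys = new int[SIZE];
        Arrays.fill(cacheKeys, Integer.MIN_VALUE);  // "empty" sentinel

        int target = Integer.MIN_VALUE;
        int slot = target & (SIZE - 1);             // MIN_VALUE & 511 = 0

        // Broken check: an empty slot is indistinguishable from a
        // cached entry for MIN_VALUE, so this is a false match
        boolean brokenHit = cacheKeys[slot] == target;
        System.out.println(brokenHit);              // true — the bug

        // Fixed check: a validity bit makes empty slots unambiguous
        boolean[] cacheValid = new boolean[SIZE];
        boolean fixedHit = cacheValid[slot] && cacheKeys[slot] == target;
        System.out.println(fixedHit);               // false — correct
    }
}
```

A unit test asserting search(Integer.MIN_VALUE) after insert is exactly the kind of edge case the article credits with catching this.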

The Numbers

Fair benchmark: all structures built first, 10 full equalised warmup passes, median of 5 timed runs. No JIT order bias.
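The shape of that methodology looks roughly like this. A real harness should use JMH; this sketch (names and structure are mine, not the project's) only shows the build-first, warm-up-equally, median-of-runs discipline:

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

public class MedianBench {
    // Runs the same query workload `runs` times and reports the
    // median ns/op — the median resists one-off GC or JIT spikes.
    static double medianNsPerOp(IntPredicate search, int[] queries, int runs) {
        double[] samples = new double[runs];
        for (int r = 0; r < runs; r++) {
            long t0 = System.nanoTime();
            long hits = 0;
            for (int q : queries) if (search.test(q)) hits++;
            samples[r] = (System.nanoTime() - t0) / (double) queries.length;
            if (hits < 0) throw new AssertionError(); // keep `hits` observable
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }
}
```

Running the identical warmup workload through every contender before timing any of them is what removes the "whoever JIT-compiles first wins" bias the article mentions.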

Diagram 08 — Search Latency by Scenario (ns/query)
Fraud Blocklist (dense · 1M values · 6M range): FlashSet 4.7ns vs HashSet 28ns → 5.9× faster
IoT Dedup (sparse · 120K values · 1B range): FlashSet 28ns vs HashSet 35ns → competitive
API Hotset (800K values · 16 hot keys · 92% repeat): FlashSet 7ns vs HashSet 10ns → 1.4× faster

Scenario                  FlashSet   HashSet   BitSet*   Result
Fraud blocklist (dense)   4.7ns      28ns      0.8ns     5.9× faster
IoT dedup (sparse)        28ns       35ns      n/a       Competitive
API abuse hotset          7ns        10ns      0.8ns     1.4× faster

*BitSet is read-only, no insert/delete, no negatives, no dynamic growth. Not a fair comparison for mutable use cases.

What I Actually Learned

01
Profile before you optimise
I spent a week tuning bucket sizes before running a real profiler. One profiling session told me more than a week of guessing. The bottleneck was memory layout, not algorithmic complexity.
02
Your benchmark is probably lying
If you designed the benchmark while designing the algorithm, they are optimised for each other. My original benchmark had a 10% hit rate — right in the algorithm's sweet spot. A fair benchmark includes cases you didn't design for.
03
O(1) is not a complete description
HashSet is O(1). So is a direct array read. One takes on the order of 100ns under cache pressure; the other takes 5ns. Asymptotic complexity tells you how an algorithm scales, not how fast it actually runs.
04
Tests catch what benchmarks miss
The Integer.MIN_VALUE sentinel bug would have passed every performance test. It was invisible to benchmarks. A unit test caught it in five seconds. Write tests for edge cases that seem unlikely.
05
Simplicity is a feature, not a shortcut
The final code is 219 meaningful lines. An earlier version was 400+. The simplification pass didn't change a single behaviour. It made every bug easier to see and every future change safer to make.