Benchmarks
Performance
All figures are single-threaded runs on one file (7.3 GB Illumina WES). Pipelines typically parallelise across files via an external scheduler (Nextflow, Snakemake), so single-file throughput is the relevant metric.
Wall clock time
Peak memory (RSS)
Where the speed comes from
Three specific optimisations, each identified through CPU profiling:
SIMD adapter search
The memchr crate uses ARM NEON / SSE to compare 16 bytes per instruction instead of one byte at a time.
Adapter search drops from 35% to 6% of runtime.
Lookup-table base counting
A 256-byte compile-time table replaces unpredictable branch chains, halving the time spent in per-base modules.
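A minimal sketch of the lookup-table idea, not the project's actual code: the names `BASE_INDEX`, `build_table` and `count_bases`, and the choice of five buckets (A, C, G, T, everything else), are assumptions made for illustration. The table is built by a `const fn`, so it is computed at compile time, and the hot loop contains one indexed load per base and no data-dependent branch.

```rust
/// Bucket used for any byte that is not A, C, G or T (for example N).
const OTHER: u8 = 4;

/// Build a 256-entry table mapping every possible byte to a bucket index.
/// Evaluated at compile time because it is used to initialise a `const`.
const fn build_table() -> [u8; 256] {
    let mut table = [OTHER; 256];
    table[b'A' as usize] = 0;
    table[b'C' as usize] = 1;
    table[b'G' as usize] = 2;
    table[b'T' as usize] = 3;
    // Accept lower-case bases as well (an assumption of this sketch).
    table[b'a' as usize] = 0;
    table[b'c' as usize] = 1;
    table[b'g' as usize] = 2;
    table[b't' as usize] = 3;
    table
}

/// 256 bytes, one per possible input byte.
const BASE_INDEX: [u8; 256] = build_table();

/// Count bases in a read: [A, C, G, T, other].
/// The loop body is a table load and an increment, with no if/else chain.
pub fn count_bases(seq: &[u8]) -> [u64; 5] {
    let mut counts = [0u64; 5];
    for &b in seq {
        counts[BASE_INDEX[b as usize] as usize] += 1;
    }
    counts
}
```

Because a `u8` can never index outside a 256-entry array, the compiler can drop the bounds check on the table lookup, which is part of why this pattern tends to beat a chain of comparisons on unpredictable input.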
Zero per-read allocation
In-place uppercase conversion, zero-copy tile parsing, byte-slice tracking. No garbage collection overhead.
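The following sketch shows what allocation-free handling of a read can look like, under stated assumptions: the function names are hypothetical, and the header layout assumed is the common Illumina form `@instrument:run:flowcell:lane:tile:x:y`, where the tile is the fifth colon-separated field. Everything operates on borrowed byte slices, so no `String` or `Vec` is created per read.

```rust
/// Upper-case a read in place; the caller's buffer is modified directly.
pub fn uppercase_in_place(seq: &mut [u8]) {
    seq.make_ascii_uppercase();
}

/// Borrow the tile field from an Illumina-style header without copying.
/// Returns None if the header has fewer than five colon-separated fields.
pub fn tile_field(header: &[u8]) -> Option<&[u8]> {
    header.split(|&b| b == b':').nth(4)
}

/// Parse ASCII digits straight from the byte slice, avoiding a String.
/// Returns None on an empty field, a non-digit byte, or overflow.
pub fn parse_tile(field: &[u8]) -> Option<u32> {
    if field.is_empty() {
        return None;
    }
    let mut value: u32 = 0;
    for &b in field {
        if !b.is_ascii_digit() {
            return None;
        }
        value = value.checked_mul(10)?.checked_add((b - b'0') as u32)?;
    }
    Some(value)
}
```

A caller would typically reuse one line buffer across reads and pass slices of it to these functions, so the steady-state loop performs no heap allocation at all.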
Test dataset
119 GB across 11 files: Illumina WES, Element AVITI, ONT PromethION long-read, and BAM. Apple Silicon (aarch64-apple-darwin). The full report has per-file results, memory profiles, a CPU breakdown, and analysis of whether these optimisations could be back-ported to Java.