FastQC: Rust vs Java Benchmark Report

Date: 9 April 2026
Machine: Apple Silicon Mac (aarch64-apple-darwin)
Implementations:

Java — FastQC 0.12.2.devel, OpenJDK, G1GC, 4GB max heap
Rust — fastqc-rs 0.12.1 with native zlib, LTO, lookup-table base counting

All runs are single-threaded, one file at a time. This reflects the typical pipeline use case where an external scheduler (Nextflow, Snakemake) handles parallelism across files.

Test Dataset

File	Size	Platform	Type
SRR7890918_tumor_1.fastq.gz	7.3 GB	Illumina	WES, short-read
SRR7890918_tumor_2.fastq.gz	7.5 GB	Illumina	WES, short-read
SRR7890919_normal_1.fastq.gz	6.2 GB	Illumina	WES, short-read
SRR7890919_normal_2.fastq.gz	6.3 GB	Illumina	WES, short-read
ERR16944282_1.fastq.gz	487 MB	Element AVITI	Short-read
ERR16944282_2.fastq.gz	630 MB	Element AVITI	Short-read
ERR16944299_1.fastq.gz	454 MB	Element AVITI	Short-read
ERR16944299_2.fastq.gz	597 MB	Element AVITI	Short-read
ERR16962265.fastq.gz	21 GB	ONT PromethION	Long-read (wheat WGS)
SRR37915503.fastq.gz	58 GB	ONT PromethION	Long-read (snake WGS)
GM12878_REP1.markdup.sorted.bam	11 GB	Illumina	BAM
Total	119 GB	11 files across 3 platforms + BAM

Output Correctness

All output files (fastqc_data.txt, summary.txt) are byte-identical between Java and Rust for all 11 test files, except in the Sequence Duplication Levels module, where floating-point values differ at the 14th-15th decimal place. The cause is HashMap iteration order, which changes the order of floating-point accumulation in the statistical correction algorithm. The difference has no practical significance.
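The effect of accumulation order can be reproduced in isolation. The sketch below is not taken from either implementation. It only shows that adding the same three f64 values in two different orders gives results that differ in the last decimal places, which is the kind of difference that a different HashMap iteration order introduces. The function name `sum_in_order` is hypothetical.

```rust
/// Sums the values left to right, in the order given.
fn sum_in_order(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, v| acc + v)
}

fn main() {
    // The same three values, visited in two different orders.
    let forward = sum_in_order(&[0.1, 0.2, 0.3]);
    let reversed = sum_in_order(&[0.3, 0.2, 0.1]);

    // Floating-point addition is not associative, so the two sums differ
    // by a tiny amount even though the inputs are identical.
    println!("{:.17} vs {:.17}", forward, reversed);
    println!("difference: {:e}", (forward - reversed).abs());
}
```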

Per-File Results

Wall Clock Time

File	Size	Java	Rust	Speedup
Illumina WES (short-read, 150bp)
tumor_1.fastq.gz	7.3 GB	9m 46s	3m 13s	3.0x
tumor_2.fastq.gz	7.5 GB	9m 09s	3m 18s	2.8x
normal_1.fastq.gz	6.2 GB	7m 08s	2m 45s	2.6x
normal_2.fastq.gz	6.3 GB	7m 05s	2m 54s	2.4x
Element AVITI (short-read)
ERR16944282_1.fastq.gz	487 MB	35s	14s	2.5x
ERR16944282_2.fastq.gz	630 MB	36s	16s	2.3x
ERR16944299_1.fastq.gz	454 MB	35s	14s	2.5x
ERR16944299_2.fastq.gz	597 MB	41s	15s	2.7x
ONT PromethION (long-read)
ERR16962265.fastq.gz	21 GB	20m 28s	10m 16s	2.0x
SRR37915503.fastq.gz	58 GB	49m 37s	30m 21s	1.6x
BAM
GM12878.bam	11 GB	8m 00s	4m 51s	1.6x

Peak Memory (RSS)

File	Java	Rust	Ratio
Illumina WES (short-read)
tumor_1.fastq.gz	444 MB	28 MB	15.9x less
tumor_2.fastq.gz	438 MB	29 MB	15.1x less
normal_1.fastq.gz	439 MB	25 MB	17.6x less
normal_2.fastq.gz	405 MB	28 MB	14.5x less
Element AVITI (short-read)
ERR16944282_1.fastq.gz	412 MB	53 MB	7.8x less
ERR16944282_2.fastq.gz	483 MB	54 MB	8.9x less
ERR16944299_1.fastq.gz	388 MB	55 MB	7.1x less
ERR16944299_2.fastq.gz	361 MB	54 MB	6.7x less
ONT PromethION (long-read)
ERR16962265.fastq.gz	1550 MB	1170 MB	1.3x less
SRR37915503.fastq.gz	2209 MB	1028 MB	2.1x less
BAM
GM12878.bam	543 MB	30 MB	18.1x less

Memory and long reads: ONT PromethION data uses significantly more memory in both implementations because reads can be tens of thousands of bases long, requiring proportionally larger per-position arrays. For short-read data (Illumina, AVITI), Rust uses 7-18x less memory. For long-read data the ratio narrows to 1.3-2.1x because the per-position arrays dominate in both.

Speedup by Platform

Platform	Read Type	Avg Speedup	Java Throughput	Rust Throughput
Illumina	Short-read (150bp)	2.7x	14 MB/s	38 MB/s
Element AVITI	Short-read	2.5x	15 MB/s	37 MB/s
ONT PromethION	Long-read (1-20kb)	1.8x	19 MB/s	34 MB/s
BAM	Mapped reads	1.6x	23 MB/s	39 MB/s

Throughput = compressed file size / wall-clock time, single-threaded.
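As a worked example of the two formulas, the sketch below recomputes throughput and speedup for tumor_1.fastq.gz from the per-file table. It assumes 1 GB = 1024 MB, which is an assumption of this example and not stated in the report. The function names are hypothetical.

```rust
/// Throughput in MB/s: compressed size divided by wall-clock time.
fn throughput_mb_per_s(size_mb: f64, seconds: f64) -> f64 {
    size_mb / seconds
}

/// Speedup: Java wall-clock time divided by Rust wall-clock time.
fn speedup(java_seconds: f64, rust_seconds: f64) -> f64 {
    java_seconds / rust_seconds
}

fn main() {
    // tumor_1.fastq.gz: 7.3 GB, Java 9m 46s, Rust 3m 13s.
    // Assumption: 1 GB = 1024 MB.
    let size_mb = 7.3 * 1024.0;
    let java_seconds = 9.0 * 60.0 + 46.0;
    let rust_seconds = 3.0 * 60.0 + 13.0;

    println!("Java: {:.1} MB/s", throughput_mb_per_s(size_mb, java_seconds));
    println!("Rust: {:.1} MB/s", throughput_mb_per_s(size_mb, rust_seconds));
    println!("Speedup: {:.1}x", speedup(java_seconds, rust_seconds));
}
```

Per-file values differ slightly from the platform averages in the table because the averages are taken across all files of that platform.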

Short-read FASTQ benefits most (2.3-3.0x per file) because the three optimisations (SIMD adapter search, lookup-table base counting, zero per-read allocation) all target the per-read processing loop, which dominates for short reads. Long-read data shows a smaller improvement (1.6-2.0x) because gzip decompression and per-base processing of very long sequences become the dominant costs, diluting the relative impact of the adapter and allocation optimisations.

Why is the BAM Speedup Lower than FASTQ?

The BAM speedup (1.6x) is lower than for short-read FASTQ (2.3-3.0x) because record parsing takes a larger share of the runtime. Java uses htsjdk, a mature and heavily optimised BAM library. Rust uses noodles, a pure Rust implementation. BAM requires BGZF decompression, CIGAR parsing, flag decoding, and reverse-complement reconstruction for every record. noodles handles all of this solidly, but it does not yet match the years of optimisation behind htsjdk. This narrows the gap that the other optimisations create.

Where the Time Goes (Rust)

CPU profile of the Rust implementation processing 2M short reads (macOS sample tool, 3,300 active samples). This shows where the Rust rewrite spends its time:

Component	%
Gzip decompression (native libz)	25%
Adapter search (memchr SIMD)	18%
PerBaseSequenceContent	12%
NContent	11%
BasicStats	9%
FASTQ I/O + parsing	8%
PerBaseQuality + PerTileQuality	6%
PerSequenceQuality + GCContent	3%
Memory, hashing, CRC, other	8%

A quarter of the time is spent in gzip decompression (native C libz, shared with Java). The remaining 75% is application code where the three optimisations described below have their effect. An equivalent profile of Java is not available because Java Flight Recorder cannot sample native C code or JVM intrinsics, making a direct side-by-side comparison misleading.

What Makes the Rust Version Faster

The speed improvement comes from three specific optimisations, not from Rust being inherently faster than Java. Each was identified through CPU profiling and contributes a measurable share of the overall gain.

1. SIMD-accelerated adapter search (biggest single factor)

FastQC searches every read for 6 adapter sequences. In Java, String.indexOf() compares one character at a time. The Rust version uses the memchr crate, which uses ARM NEON SIMD instructions to compare 16 bytes at a time. The adapter search patterns are also pre-compiled once at startup rather than being reconstructed on every call. This single change reduces adapter searching from 35% of total runtime to 6%.

2. Lookup-table base counting (halves module processing time)

Three QC modules (BasicStats, PerBaseSequenceContent, NContent) independently iterate over every byte of every read to classify bases as A, C, G, T, or N. In Java, each module uses a multi-way switch or if/else chain. With effectively random DNA data, the CPU branch predictor guesses the next base correctly only about 25% of the time, and every misprediction stalls the pipeline.

The Rust version replaces these branches with a 256-byte compile-time lookup table that maps each ASCII byte directly to a base index. This turns an unpredictable branch into a single indexed memory load. Combined with storing per-position counts in a contiguous [u64; 4] array (instead of 4 separate arrays), this halves the time spent in module processing.
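A minimal sketch of the lookup-table idea follows. It is an illustration under stated assumptions, not the fastqc-rs source. The names `BASE_INDEX`, `build_base_index` and `add_read` are hypothetical, and a fifth slot for "anything else" (including N) is added here so that the inner loop needs no branch at all, whereas the report describes a [u64; 4] layout.

```rust
/// Slot used for any byte that is not A, C, G or T (for example N).
/// Assumption of this sketch: a fifth slot keeps the loop branch-free.
const OTHER: u8 = 4;

/// Builds the 256-entry table at compile time.
const fn build_base_index() -> [u8; 256] {
    let mut table = [OTHER; 256];
    table[b'A' as usize] = 0;
    table[b'C' as usize] = 1;
    table[b'G' as usize] = 2;
    table[b'T' as usize] = 3;
    table
}

/// Maps every possible byte value to a base index.
static BASE_INDEX: [u8; 256] = build_base_index();

/// Adds one read to the per-position counts, growing the vector if needed.
/// Each position holds its counts contiguously in one small array.
fn add_read(per_position: &mut Vec<[u64; 5]>, read: &[u8]) {
    if per_position.len() < read.len() {
        per_position.resize(read.len(), [0; 5]);
    }
    for (slot, &byte) in per_position.iter_mut().zip(read) {
        // One indexed load replaces the switch or if/else chain.
        slot[BASE_INDEX[byte as usize] as usize] += 1;
    }
}

fn main() {
    let mut counts = Vec::new();
    add_read(&mut counts, b"ACGTN");
    add_read(&mut counts, b"AAG");
    for (position, slot) in counts.iter().enumerate() {
        println!("position {}: {:?}", position, slot);
    }
}
```

Keeping the counts for one position next to each other means a single cache line is touched per base, which is the point of the contiguous layout described above.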

3. Zero per-read allocation

Java creates several new objects for every read: a Sequence object, a new String from toUpperCase(), String[] arrays from split(":") for tile parsing, and substring() calls for overrepresented sequence tracking. Each allocation contributes to heap pressure and eventually triggers garbage collection. Over a typical run, the JVM performs tens of thousands of GC events.

The Rust version avoids all of this. Sequences are uppercase-converted in place (a single pass over the existing byte buffer, no new allocation). Tile IDs are parsed by scanning for colons without splitting. Overrepresented sequences are tracked using byte slices that reference the existing data. With no garbage collector, there are no GC pauses and no stop-the-world events.
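The two techniques can be sketched as follows. This is an illustration, not the fastqc-rs source. The function names are hypothetical, and treating the tile as the fifth colon-separated header field is an assumption made for this example.

```rust
/// Uppercases a read in place. The existing buffer is reused.
fn uppercase_in_place(read: &mut [u8]) {
    read.make_ascii_uppercase();
}

/// Returns the tile number from a read header without allocating.
/// Assumption: the tile is the fifth colon-separated field.
fn tile_id(header: &[u8]) -> Option<u32> {
    // `split` on a slice lazily yields borrowed sub-slices, so no array
    // of fields is built and no bytes are copied.
    let field = header.split(|&b| b == b':').nth(4)?;
    if field.is_empty() {
        return None;
    }
    let mut value: u32 = 0;
    for &b in field {
        if !b.is_ascii_digit() {
            return None;
        }
        value = value.checked_mul(10)?.checked_add(u32::from(b - b'0'))?;
    }
    Some(value)
}

fn main() {
    // The Vec stands in for a reader's reusable line buffer; it is
    // allocated once here only to make the example self-contained.
    let mut read = b"acgtNacgt".to_vec();
    uppercase_in_place(&mut read);
    println!("{}", String::from_utf8_lossy(&read));

    println!("{:?}", tile_id(b"@SIM:1:FCX:1:2106:15337:1063"));
}
```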

Could these improvements be back-ported to Java?

We tested each optimisation in Java to see whether the same gains could be achieved without a full rewrite.

SIMD adapter search: No. Java has no way to invoke NEON or SSE intrinsics for substring matching. String.indexOf() already uses HotSpot's best internal intrinsic. We tested replacing it with a manual char[] loop and it was actually slower (16.0s vs 12.9s) because HotSpot's built-in implementation is more optimised than hand-written Java. The JVM Vector API (JEP 338) exists as an incubator module but does not provide substring search.

Lookup-table base counting: Marginal. Java's JIT compiler already converts small switch statements on byte values into jump tables internally. Replacing the switch with an explicit array lookup in Java yielded only a ~3% improvement. The remaining gap exists because Rust's compile-time constant lookup table, combined with a register-allocated local accumulator, compiles to tighter machine code than HotSpot generates at runtime.

Zero-allocation processing: Structurally limited. Java's String is immutable and heap-allocated by design. toUpperCase() must return a new String. split(":") must allocate a String[]. We tested workarounds (check-before-uppercase, manual colon scanning, HashMap to array conversion) and measured negligible improvement in each case. HotSpot's escape analysis and generational GC already handle short-lived allocations efficiently, but not for free: the JVM still recorded tens of thousands of GC events per run, each causing brief pauses.

The Java version is already close to its performance ceiling. HotSpot's JIT does a good job with the patterns that exist in the codebase, and the remaining gap comes from structural properties of the JVM: heap-allocated immutable strings, no SIMD substring search, and garbage collection overhead that cannot be fully eliminated.

Build Variants

The Rust binary defaults to native zlib for gzip decompression (available on all Linux/macOS). A fully static pure-Rust build is available for WASM, Windows, or environments without libz:

Build	Command	Dependencies
Default (native zlib)	`cargo build --release`	libz (system)
Pure Rust	`cargo build --release --no-default-features`	None

Overall Summary

Metric	Java	Rust
Short-read speed (avg)	14 MB/s	38 MB/s (2.7x faster)
Long-read speed (avg)	19 MB/s	34 MB/s (1.8x faster)
BAM speed	23 MB/s	39 MB/s (1.6x faster)
Peak memory (short-read)	361-483 MB	25-55 MB (7-18x less)
Peak memory (long-read)	1.5-2.2 GB	1.0-1.2 GB (1.3-2.1x less)
Binary size	~100 MB (JRE + Perl)	4.8 MB
Runtime dependencies	JRE + Perl	libz only (or none for pure build)

Real-World Impact

To put these performance gains in context, we analysed FastQC task execution data from the Seqera Platform Cloud, which runs Nextflow pipelines at scale for bioinformatics teams worldwide.

Scale of FastQC usage (Seqera Platform Cloud only)

Metric	Value
Period analysed	Jan 2025 - Mar 2026 (15 months)
Total FastQC tasks	3.6 million
Total CPU hours consumed	1.1 million hours
Average per month	~242,000 tasks / ~73,000 CPU hours
Annualised	~2.9 million tasks / ~877,000 CPU hours

Projected savings with the Rust rewrite

With a 2.7x average short-read speedup (the dominant use case), each CPU hour of FastQC work would complete in 22 minutes instead of 60. Applied to the annualised Seqera Cloud usage alone:

Metric	Current (Java)	Projected (Rust)	Saved
Annual CPU hours	877,000	325,000	552,000 hours
Annual cloud compute cost*	$26,300	$9,700	$16,600
Annual CO2 emissions**	61 tonnes	23 tonnes	39 tonnes

* Estimated at $0.03/CPU-hour (typical cloud spot pricing for compute-optimised instances).
** Estimated using global average grid carbon intensity (0.35 kgCO2/kWh) and ~200W per CPU core. 39 tonnes of CO2 is roughly equivalent to what 1,600 mature trees absorb in a year.
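The projection is simple arithmetic and can be reproduced from the stated inputs. The sketch below uses only figures given in the tables and footnotes above. The constant and function names are hypothetical.

```rust
// Inputs taken from the tables and footnotes above.
const ANNUAL_CPU_HOURS_JAVA: f64 = 877_000.0;
const SPEEDUP: f64 = 2.7;
const USD_PER_CPU_HOUR: f64 = 0.03;
const KW_PER_CORE: f64 = 0.2; // ~200 W per CPU core
const KG_CO2_PER_KWH: f64 = 0.35;

/// Cloud compute cost in USD for a number of CPU hours.
fn cost_usd(cpu_hours: f64) -> f64 {
    cpu_hours * USD_PER_CPU_HOUR
}

/// CO2 emissions in tonnes for a number of CPU hours.
fn co2_tonnes(cpu_hours: f64) -> f64 {
    cpu_hours * KW_PER_CORE * KG_CO2_PER_KWH / 1000.0
}

fn main() {
    let java_hours = ANNUAL_CPU_HOURS_JAVA;
    let rust_hours = java_hours / SPEEDUP;

    for (name, hours) in [("Java", java_hours), ("Rust", rust_hours)] {
        println!(
            "{}: {:.0} CPU hours, ${:.0}, {:.1} t CO2",
            name,
            hours,
            cost_usd(hours),
            co2_tonnes(hours)
        );
    }
}
```

The table rounds these results, so the printed values match it only to the precision shown there.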

Beyond Seqera Platform

These figures represent a single cloud platform. FastQC is one of the most widely used bioinformatics tools in the world, running in university clusters, hospital genomics labs, national sequencing centres, and cloud pipelines globally. The total compute spent on FastQC across all platforms is likely orders of magnitude larger.

The memory reduction also has practical consequences: cloud instances can be significantly downsized (a 512 MB task becomes a 30 MB task for short reads), reducing the minimum instance size required and improving bin-packing efficiency in schedulers like Kubernetes and AWS Batch.