FastQC: Rust vs Java Benchmark Report
Date: 9 April 2026
Machine: Apple Silicon Mac (aarch64-apple-darwin)
Implementations:
- Java — FastQC 0.12.2.devel, OpenJDK, G1GC, 4GB max heap
- Rust — fastqc-rs 0.12.1 with native zlib, LTO, lookup-table base counting
All runs are single-threaded, one file at a time. This reflects the typical pipeline use case where an external scheduler (Nextflow, Snakemake) handles parallelism across files.
Test Dataset
| File | Size | Platform | Type |
|---|---|---|---|
| SRR7890918_tumor_1.fastq.gz | 7.3 GB | Illumina | WES, short-read |
| SRR7890918_tumor_2.fastq.gz | 7.5 GB | Illumina | WES, short-read |
| SRR7890919_normal_1.fastq.gz | 6.2 GB | Illumina | WES, short-read |
| SRR7890919_normal_2.fastq.gz | 6.3 GB | Illumina | WES, short-read |
| ERR16944282_1.fastq.gz | 487 MB | Element AVITI | Short-read |
| ERR16944282_2.fastq.gz | 630 MB | Element AVITI | Short-read |
| ERR16944299_1.fastq.gz | 454 MB | Element AVITI | Short-read |
| ERR16944299_2.fastq.gz | 597 MB | Element AVITI | Short-read |
| ERR16962265.fastq.gz | 21 GB | ONT PromethION | Long-read (wheat WGS) |
| SRR37915503.fastq.gz | 58 GB | ONT PromethION | Long-read (snake WGS) |
| GM12878_REP1.markdup.sorted.bam | 11 GB | Illumina | BAM |
| Total | 119 GB | 11 files across 3 platforms + BAM | |
Output Correctness
All output files (fastqc_data.txt, summary.txt) are byte-identical between Java and Rust for all 11 test files, with one exception: in the Sequence Duplication Levels module, floating-point values differ at the 14th-15th decimal place. The cause is HashMap iteration order, which changes the order in which floating-point values are accumulated in the statistical correction algorithm. The difference has no practical effect on the results.
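The effect of accumulation order is easy to reproduce in isolation. The following is a minimal sketch, not code from either implementation: the function name sum_in_order and the input values are chosen purely for illustration. It sums the same three numbers in two orders and shows that the results differ only in the last bits.

```rust
/// Sum values left to right, exactly in the order given.
fn sum_in_order(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, v| acc + v)
}

fn main() {
    // Same three values, two iteration orders (as a hash map might produce).
    let a = sum_in_order(&[0.1, 0.2, 0.3]);
    let b = sum_in_order(&[0.3, 0.2, 0.1]);

    // The sums are not bit-identical, but the gap is far below any
    // threshold that matters for a QC report.
    println!("a = {a:?}, b = {b:?}, |a - b| = {:e}", (a - b).abs());
}
```

Floating-point addition is not associative, so any reduction over an unordered container can show this kind of last-digit difference.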
Per-File Results
Wall Clock Time
| File | Size | Java | Rust | Speedup | |
|---|---|---|---|---|---|
| Illumina WES (short-read, 150bp) | |||||
| tumor_1.fastq.gz | 7.3 GB | 9m 46s | 3m 13s | 3.0x | |
| tumor_2.fastq.gz | 7.5 GB | 9m 09s | 3m 18s | 2.8x | |
| normal_1.fastq.gz | 6.2 GB | 7m 08s | 2m 45s | 2.6x | |
| normal_2.fastq.gz | 6.3 GB | 7m 05s | 2m 54s | 2.4x | |
| Element AVITI (short-read) | |||||
| ERR16944282_1.fastq.gz | 487 MB | 35s | 14s | 2.5x | |
| ERR16944282_2.fastq.gz | 630 MB | 36s | 16s | 2.3x | |
| ERR16944299_1.fastq.gz | 454 MB | 35s | 14s | 2.5x | |
| ERR16944299_2.fastq.gz | 597 MB | 41s | 15s | 2.7x | |
| ONT PromethION (long-read) | |||||
| ERR16962265.fastq.gz | 21 GB | 20m 28s | 10m 16s | 2.0x | |
| SRR37915503.fastq.gz | 58 GB | 49m 37s | 30m 21s | 1.6x | |
| BAM | |||||
| GM12878.bam | 11 GB | 8m 00s | 4m 51s | 1.6x | |
Peak Memory (RSS)
| File | Java | Rust | Ratio |
|---|---|---|---|
| Illumina WES (short-read) | |||
| tumor_1.fastq.gz | 444 MB | 28 MB | 15.9x less |
| tumor_2.fastq.gz | 438 MB | 29 MB | 15.1x less |
| normal_1.fastq.gz | 439 MB | 25 MB | 17.6x less |
| normal_2.fastq.gz | 405 MB | 28 MB | 14.5x less |
| Element AVITI (short-read) | |||
| ERR16944282_1.fastq.gz | 412 MB | 53 MB | 7.8x less |
| ERR16944282_2.fastq.gz | 483 MB | 54 MB | 8.9x less |
| ERR16944299_1.fastq.gz | 388 MB | 55 MB | 7.1x less |
| ERR16944299_2.fastq.gz | 361 MB | 54 MB | 6.7x less |
| ONT PromethION (long-read) | |||
| ERR16962265.fastq.gz | 1550 MB | 1170 MB | 1.3x less |
| SRR37915503.fastq.gz | 2209 MB | 1028 MB | 2.1x less |
| BAM | |||
| GM12878.bam | 543 MB | 30 MB | 18.1x less |
Speedup by Platform
| Platform | Read Type | Avg Speedup | Java Throughput | Rust Throughput |
|---|---|---|---|---|
| Illumina | Short-read (150bp) | 2.7x | 14 MB/s | 38 MB/s |
| Element AVITI | Short-read | 2.5x | 15 MB/s | 37 MB/s |
| ONT PromethION | Long-read (1-20kb) | 1.8x | 19 MB/s | 34 MB/s |
| BAM | Mapped reads | 1.6x | 23 MB/s | 39 MB/s |
Throughput = compressed file size / wall-clock time, single-threaded.
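As a worked check of that definition, the sketch below recomputes the BAM row from the wall-clock table. The function names are illustrative, and the conversion of 1 GB to 1024 MB is an assumption made here because it reproduces the rounded figures in the table.

```rust
/// Throughput in MB/s: compressed file size divided by wall-clock seconds.
fn throughput_mb_per_s(size_mb: f64, wall_secs: f64) -> f64 {
    size_mb / wall_secs
}

/// Speedup of the Rust run relative to the Java run on the same file.
fn speedup(java_secs: f64, rust_secs: f64) -> f64 {
    java_secs / rust_secs
}

fn main() {
    // GM12878 BAM: 11 GB, Java 8m 00s, Rust 4m 51s.
    // Assumption: 1 GB = 1024 MB.
    let size_mb = 11.0 * 1024.0;
    let java_secs = 8.0 * 60.0;
    let rust_secs = 4.0 * 60.0 + 51.0;

    println!("Java: {:.0} MB/s", throughput_mb_per_s(size_mb, java_secs));
    println!("Rust: {:.0} MB/s", throughput_mb_per_s(size_mb, rust_secs));
    println!("Speedup: {:.2}x", speedup(java_secs, rust_secs));
}
```

Under that assumption the BAM row comes out at about 23 MB/s for Java and 39 MB/s for Rust, matching the table.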
Short-read FASTQ benefits most (2.3-3.0x per file) because the three optimisations (SIMD adapter search, lookup-table base counting, zero allocation) all target the per-read processing loop, which dominates for short reads. Long-read data shows a smaller improvement (1.6-2.0x) because gzip decompression and per-base processing of very long sequences become the dominant costs, diluting the relative impact of the adapter and allocation optimisations.
Why is the BAM Speedup Smaller than for FASTQ?
The BAM speedup (1.6x) is lower than for short-read FASTQ (2.3-3.0x) because of record parsing overhead. Java uses htsjdk, a mature and heavily optimised Java library for BAM. Rust uses noodles, a pure Rust implementation. BAM requires BGZF decompression, CIGAR parsing, and flag decoding for every record, plus reverse-complement reconstruction for reverse-strand reads. noodles does all of this in pure Rust, which is solid but has not had htsjdk's years of tuning. This narrows the gap that the other optimisations create.
Where the Time Goes (Rust)
CPU profile of the Rust implementation processing 2M short reads (macOS sample tool, 3,300 active samples). This shows where the Rust rewrite spends its time:
| Component | % | |
|---|---|---|
| Gzip decompression (native libz) | 25% | |
| Adapter search (memchr SIMD) | 18% | |
| PerBaseSequenceContent | 12% | |
| NContent | 11% | |
| BasicStats | 9% | |
| FASTQ I/O + parsing | 8% | |
| PerBaseQuality + PerTileQuality | 6% | |
| PerSequenceQuality + GCContent | 3% | |
| Memory, hashing, CRC, other | 8% | |
A quarter of the time is spent in gzip decompression (native C libz, shared with Java). The remaining 75% is application code where the three optimisations described below have their effect. An equivalent profile of Java is not available because Java Flight Recorder cannot sample native C code or JVM intrinsics, making a direct side-by-side comparison misleading.
What Makes the Rust Version Faster
The speed improvement comes from three specific optimisations, not from Rust being inherently faster than Java. Each was identified through CPU profiling and contributes a measurable share of the overall gain.
1. SIMD-accelerated adapter search (biggest single factor)
FastQC searches every read for 6 adapter sequences. In Java, String.indexOf() compares one character at a time. The Rust version uses the memchr crate, which employs ARM NEON SIMD instructions to compare 16 bytes per instruction. The adapter search patterns are also compiled once at startup rather than being rebuilt on every call. This single change reduces adapter searching from 35% of total runtime to 6%.
2. Lookup-table base counting (halves module processing time)
Three QC modules (BasicStats, PerBaseSequenceContent, NContent) independently iterate over every byte of every read to classify bases as A, C, G, T, or N. In Java, each module uses a multi-way switch or if/else chain. With effectively random DNA data, the CPU branch predictor guesses correctly only about 25% of the time, and every misprediction stalls the pipeline.
The Rust version replaces these branches with a 256-byte compile-time lookup table that maps each ASCII byte directly to a base index. This turns an unpredictable branch into a single indexed memory load. Combined with storing per-position counts in a contiguous [u64; 4] array (instead of 4 separate arrays), this halves the time spent in module processing.
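The lookup-table idea can be sketched in a few lines. This is an illustration under stated assumptions rather than the fastqc-rs source: the names BASE_INDEX, build_table, count_bases and OTHER are invented here, and a fifth "other" counter slot is added so that N and any unexpected byte have somewhere to go.

```rust
/// Slot used for N and for any byte that is not A, C, G or T (assumption).
const OTHER: usize = 4;

/// Build the 256-entry byte-to-index table at compile time.
const fn build_table() -> [u8; 256] {
    let mut table = [OTHER as u8; 256];
    table[b'A' as usize] = 0;
    table[b'C' as usize] = 1;
    table[b'G' as usize] = 2;
    table[b'T' as usize] = 3;
    table
}

static BASE_INDEX: [u8; 256] = build_table();

/// Classify each byte with one indexed load instead of a switch or if/else chain.
fn count_bases(seq: &[u8], counts: &mut [u64; 5]) {
    for &b in seq {
        counts[BASE_INDEX[b as usize] as usize] += 1;
    }
}

fn main() {
    let mut counts = [0u64; 5];
    count_bases(b"ACGTNACGA", &mut counts);
    println!(
        "A={} C={} G={} T={} other={}",
        counts[0], counts[1], counts[2], counts[3], counts[4]
    );
}
```

Because the table covers all 256 byte values, the inner loop has no data-dependent branch; the sketch assumes the read has already been converted to upper case.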
3. Zero per-read allocation
Java creates several new objects for every read: a Sequence object, a new String from toUpperCase(), String[] arrays from split(":") for tile parsing, and substring() calls for overrepresented sequence tracking. Each allocation contributes to heap pressure and eventually triggers garbage collection. Over a typical run, the JVM performs tens of thousands of GC events.
The Rust version avoids all of this. Sequences are uppercase-converted in place (a single pass over the existing byte buffer, no new allocation). Tile IDs are parsed by scanning for colons without splitting. Overrepresented sequences are tracked using byte slices that reference the existing data. With no garbage collector, there are no GC pauses and no stop-the-world events.
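Two of these techniques can be illustrated with the standard library alone. The sketch below is not taken from fastqc-rs: the function names are invented, the header string is a made-up example, and the assumption that the tile ID is the fifth colon-separated field is made here only for the demonstration.

```rust
/// Uppercase a read in place: one pass over the existing buffer, no allocation.
fn uppercase_in_place(seq: &mut [u8]) {
    seq.make_ascii_uppercase();
}

/// Return the n-th colon-separated field (0-based) as a borrowed slice,
/// found by scanning for colons rather than building an array of strings.
fn nth_field(header: &[u8], n: usize) -> Option<&[u8]> {
    let mut start = 0;
    let mut field = 0;
    for (i, &b) in header.iter().enumerate() {
        if b == b':' {
            if field == n {
                return Some(&header[start..i]);
            }
            field += 1;
            start = i + 1;
        }
    }
    // The last field has no trailing colon.
    if field == n { Some(&header[start..]) } else { None }
}

fn main() {
    let mut seq = *b"acgtNacgt";
    uppercase_in_place(&mut seq);
    println!("{}", String::from_utf8_lossy(&seq));

    // Hypothetical header; field index 4 is assumed to hold the tile ID.
    let header = b"@INSTR:42:FLOWCELL:1:1101:1234:5678";
    let tile = nth_field(header, 4)
        .and_then(|f| std::str::from_utf8(f).ok())
        .and_then(|s| s.parse::<u32>().ok());
    println!("tile = {tile:?}");
}
```

Both functions only borrow the caller's buffer, so the per-read cost is the scan itself.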
Could these improvements be back-ported to Java?
We tested each optimisation in Java to see whether the same gains could be achieved without a full rewrite.
SIMD adapter search: No. Java has no way to invoke NEON or SSE intrinsics for substring matching. String.indexOf() already uses HotSpot's best internal intrinsic. We tested replacing it with a manual char[] loop and it was actually slower (16.0s vs 12.9s) because HotSpot's built-in implementation is more optimised than hand-written Java. The JVM Vector API (JEP 338) exists as an incubator module but does not support substring search patterns.
Lookup-table base counting: Marginal. Java's JIT compiler already converts small switch statements on byte values into jump tables internally. Replacing the switch with an explicit array lookup in Java measured only ~3% improvement. The remaining gap is that Rust's compile-time constant lookup table with a register-allocated local accumulator compiles to tighter machine code than what HotSpot generates at runtime.
Zero-allocation processing: Structurally limited. Java's String is immutable and heap-allocated by design. toUpperCase() must return a new String. split(":") must allocate a String[]. We tested workarounds (check-before-uppercase, manual colon scanning, HashMap to array conversion) and measured negligible improvement in each case. HotSpot's escape analysis and generational GC already handle short-lived allocations efficiently, but not for free: the JVM still recorded tens of thousands of GC events per run, each causing brief pauses.
The Java version is already close to its performance ceiling. HotSpot's JIT does a good job with the patterns that exist in the codebase, and the remaining gap comes from structural properties of the JVM: heap-allocated immutable strings, no SIMD substring search, and garbage collection overhead that cannot be fully eliminated.
Build Variants
The Rust binary defaults to native zlib for gzip decompression (libz ships with virtually all Linux and macOS systems). A fully static pure-Rust build is available for WASM, Windows, or environments without libz:
| Build | Command | Dependencies |
|---|---|---|
| Default (native zlib) | cargo build --release | libz (system) |
| Pure Rust | cargo build --release --no-default-features | None |
Overall Summary
| Metric | Java | Rust |
|---|---|---|
| Short-read speed (avg) | 14 MB/s | 38 MB/s (2.7x faster) |
| Long-read speed (avg) | 19 MB/s | 34 MB/s (1.8x faster) |
| BAM speed | 23 MB/s | 39 MB/s (1.6x faster) |
| Peak memory (short-read) | 361-483 MB | 25-55 MB (7-18x less) |
| Peak memory (long-read) | 1.5-2.2 GB | 1.0-1.2 GB (1.3-2.1x less) |
| Binary size | ~100 MB (JRE + Perl) | 4.8 MB |
| Runtime dependencies | JRE + Perl | libz only (or none for pure build) |
Real-World Impact
To put these performance gains in context, we analysed FastQC task execution data from the Seqera Platform Cloud, which runs Nextflow pipelines at scale for bioinformatics teams worldwide.
Scale of FastQC usage (Seqera Platform Cloud only)
| Metric | Value |
|---|---|
| Period analysed | Jan 2025 - Mar 2026 (15 months) |
| Total FastQC tasks | 3.6 million |
| Total CPU hours consumed | 1.1 million hours |
| Average per month | ~242,000 tasks / ~73,000 CPU hours |
| Annualised | ~2.9 million tasks / ~877,000 CPU hours |
Projected savings with the Rust rewrite
With a 2.7x average short-read speedup (the dominant use case), each CPU hour of FastQC work would complete in 22 minutes instead of 60. Applied to the annualised Seqera Cloud usage alone:
| Metric | Current (Java) | Projected (Rust) | Saved |
|---|---|---|---|
| Annual CPU hours | 877,000 | 325,000 | 552,000 hours |
| Annual cloud compute cost* | $26,300 | $9,700 | $16,600 |
| Annual CO2 emissions** | 61 tonnes | 23 tonnes | 39 tonnes |
* Estimated at $0.03/CPU-hour (typical cloud spot pricing for compute-optimised instances).
** Estimated using global average grid carbon intensity (0.35 kgCO2/kWh) and ~200W per CPU core. 39 tonnes of CO2 is roughly equivalent to what 1,600 mature trees absorb in a year.
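The projection arithmetic can be reproduced from the stated inputs. The sketch below is an illustration only: the struct and function names are invented, and the constants are the assumptions given in the footnotes ($0.03 per CPU-hour, 200 W per core, 0.35 kgCO2/kWh) together with the 2.7x speedup.

```rust
/// Projected annual figures for a given number of CPU hours.
struct Projection {
    hours: f64,
    cost_usd: f64,
    co2_tonnes: f64,
}

fn project(cpu_hours: f64) -> Projection {
    const USD_PER_CPU_HOUR: f64 = 0.03; // footnote * assumption
    const KW_PER_CORE: f64 = 0.2; // footnote ** assumption (~200 W)
    const KG_CO2_PER_KWH: f64 = 0.35; // footnote ** assumption
    Projection {
        hours: cpu_hours,
        cost_usd: cpu_hours * USD_PER_CPU_HOUR,
        // kWh * kg/kWh gives kg; divide by 1000 for tonnes.
        co2_tonnes: cpu_hours * KW_PER_CORE * KG_CO2_PER_KWH / 1000.0,
    }
}

fn main() {
    let java = project(877_000.0);
    let rust = project(877_000.0 / 2.7);
    println!(
        "hours saved: {:.0}, cost saved: ${:.0}, CO2 saved: {:.1} t",
        java.hours - rust.hours,
        java.cost_usd - rust.cost_usd,
        java.co2_tonnes - rust.co2_tonnes
    );
}
```

The table values are these results rounded; every output scales linearly with the speedup and with the per-core power assumption.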
Beyond Seqera Platform
These figures represent a single cloud platform. FastQC is one of the most widely used bioinformatics tools in the world, running in university clusters, hospital genomics labs, national sequencing centres, and cloud pipelines globally. The total compute spent on FastQC across all platforms is likely orders of magnitude larger.
The memory reduction also has practical consequences: cloud instances can be significantly downsized (a 512 MB task becomes a 30 MB task for short reads), reducing the minimum instance size required and improving bin-packing efficiency in schedulers like Kubernetes and AWS Batch.