FastQC: Rust vs Java Benchmark Report

Date: 9 April 2026
Machine: Apple Silicon Mac (aarch64-apple-darwin)
Implementations: Java, Rust

All runs are single-threaded, one file at a time. This reflects the typical pipeline use case, where an external scheduler (Nextflow, Snakemake) handles parallelism across files.

Test Dataset

| File | Size | Platform | Type |
|---|---|---|---|
| SRR7890918_tumor_1.fastq.gz | 7.3 GB | Illumina | WES, short-read |
| SRR7890918_tumor_2.fastq.gz | 7.5 GB | Illumina | WES, short-read |
| SRR7890919_normal_1.fastq.gz | 6.2 GB | Illumina | WES, short-read |
| SRR7890919_normal_2.fastq.gz | 6.3 GB | Illumina | WES, short-read |
| ERR16944282_1.fastq.gz | 487 MB | Element AVITI | Short-read |
| ERR16944282_2.fastq.gz | 630 MB | Element AVITI | Short-read |
| ERR16944299_1.fastq.gz | 454 MB | Element AVITI | Short-read |
| ERR16944299_2.fastq.gz | 597 MB | Element AVITI | Short-read |
| ERR16962265.fastq.gz | 21 GB | ONT PromethION | Long-read (wheat WGS) |
| SRR37915503.fastq.gz | 58 GB | ONT PromethION | Long-read (snake WGS) |
| GM12878_REP1.markdup.sorted.bam | 11 GB | Illumina | BAM |
| Total | 119 GB | 11 files across 3 platforms + BAM | |

Output Correctness

All output files (fastqc_data.txt, summary.txt) are byte-identical between Java and Rust for all 11 test files, except in the Sequence Duplication Levels module, where floating-point values differ at the 14th-15th decimal place. The difference arises because HashMap iteration order changes the order in which terms are accumulated in the statistical correction algorithm, and floating-point addition is not associative. It has no practical effect on the results.
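As a minimal illustration of why accumulation order matters, the sketch below sums the same three values in two different orders. The values and the `sum_in_order` helper are hypothetical and unrelated to FastQC's duplication code; fixed arrays stand in for the two HashMap iteration orders so that the result is reproducible.

```rust
/// Adds the values strictly left to right, in the order given.
fn sum_in_order(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, v| acc + v)
}

fn main() {
    // Same three numbers, two visiting orders (as two hash maps might yield).
    let forward = [0.1, 0.2, 0.3];
    let reverse = [0.3, 0.2, 0.1];

    let a = sum_in_order(&forward);
    let b = sum_in_order(&reverse);

    println!("forward: {:.17}", a);
    println!("reverse: {:.17}", b);
    println!("bitwise equal: {}", a.to_bits() == b.to_bits());
    println!("absolute difference: {:e}", (a - b).abs());
}
```

The two sums differ only in the last bits of the mantissa, which is the same kind of discrepancy reported for the Sequence Duplication Levels output.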

Per-File Results

Wall Clock Time

| File | Size | Java | Rust | Speedup |
|---|---|---|---|---|
| Illumina WES (short-read, 150 bp) | | | | |
| tumor_1.fastq.gz | 7.3 GB | 9m 46s | 3m 13s | 3.0x |
| tumor_2.fastq.gz | 7.5 GB | 9m 09s | 3m 18s | 2.8x |
| normal_1.fastq.gz | 6.2 GB | 7m 08s | 2m 45s | 2.6x |
| normal_2.fastq.gz | 6.3 GB | 7m 05s | 2m 54s | 2.4x |
| Element AVITI (short-read) | | | | |
| ERR16944282_1.fastq.gz | 487 MB | 35s | 14s | 2.5x |
| ERR16944282_2.fastq.gz | 630 MB | 36s | 16s | 2.3x |
| ERR16944299_1.fastq.gz | 454 MB | 35s | 14s | 2.5x |
| ERR16944299_2.fastq.gz | 597 MB | 41s | 15s | 2.7x |
| ONT PromethION (long-read) | | | | |
| ERR16962265.fastq.gz | 21 GB | 20m 28s | 10m 16s | 2.0x |
| SRR37915503.fastq.gz | 58 GB | 49m 37s | 30m 21s | 1.6x |
| BAM | | | | |
| GM12878.bam | 11 GB | 8m 00s | 4m 51s | 1.6x |

Peak Memory (RSS)

| File | Java | Rust | Ratio |
|---|---|---|---|
| Illumina WES (short-read) | | | |
| tumor_1.fastq.gz | 444 MB | 28 MB | 15.9x less |
| tumor_2.fastq.gz | 438 MB | 29 MB | 15.1x less |
| normal_1.fastq.gz | 439 MB | 25 MB | 17.6x less |
| normal_2.fastq.gz | 405 MB | 28 MB | 14.5x less |
| Element AVITI (short-read) | | | |
| ERR16944282_1.fastq.gz | 412 MB | 53 MB | 7.8x less |
| ERR16944282_2.fastq.gz | 483 MB | 54 MB | 8.9x less |
| ERR16944299_1.fastq.gz | 388 MB | 55 MB | 7.1x less |
| ERR16944299_2.fastq.gz | 361 MB | 54 MB | 6.7x less |
| ONT PromethION (long-read) | | | |
| ERR16962265.fastq.gz | 1550 MB | 1170 MB | 1.3x less |
| SRR37915503.fastq.gz | 2209 MB | 1028 MB | 2.1x less |
| BAM | | | |
| GM12878.bam | 543 MB | 30 MB | 18.1x less |

Memory and long reads: ONT PromethION data uses significantly more memory in both implementations because reads can be tens of thousands of bases long, requiring proportionally larger per-position arrays. For short-read data (Illumina, AVITI), Rust uses 7-18x less memory. For long-read data the ratio narrows to 1.3-2.1x because the per-position arrays dominate in both.

Speedup by Platform

| Platform | Read Type | Avg Speedup | Java Throughput | Rust Throughput |
|---|---|---|---|---|
| Illumina | Short-read (150 bp) | 2.7x | 14 MB/s | 38 MB/s |
| Element AVITI | Short-read | 2.5x | 15 MB/s | 37 MB/s |
| ONT PromethION | Long-read (1-20 kb) | 1.8x | 19 MB/s | 34 MB/s |
| BAM | Mapped reads | 1.6x | 23 MB/s | 39 MB/s |

Throughput = compressed file size / wall-clock time, single-threaded.
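The speedup and throughput figures can be recomputed from the wall-clock table with a few lines of arithmetic. The sketch below is one way to do it; the helper names are hypothetical, and it assumes decimal megabytes (1 GB = 1000 MB). The report does not say whether sizes are decimal or binary, so recomputed throughput may differ from the table by a few percent.

```rust
/// Parses a wall-clock time written like "9m 46s" or "35s" into seconds.
fn parse_seconds(text: &str) -> Option<u64> {
    let mut total = 0u64;
    for part in text.split_whitespace() {
        if let Some(minutes) = part.strip_suffix('m') {
            total += minutes.parse::<u64>().ok()? * 60;
        } else if let Some(seconds) = part.strip_suffix('s') {
            total += seconds.parse::<u64>().ok()?;
        } else {
            return None;
        }
    }
    // Reject empty or zero durations so callers never divide by zero.
    if total == 0 { None } else { Some(total) }
}

/// Speedup = Java wall-clock time / Rust wall-clock time.
fn speedup(java: &str, rust: &str) -> Option<f64> {
    Some(parse_seconds(java)? as f64 / parse_seconds(rust)? as f64)
}

/// Throughput = compressed size (MB) / wall-clock time (s).
fn throughput_mb_per_s(size_mb: f64, wall: &str) -> Option<f64> {
    Some(size_mb / parse_seconds(wall)? as f64)
}

fn main() {
    // tumor_1.fastq.gz row: 7.3 GB, Java 9m 46s, Rust 3m 13s.
    let size_mb = 7.3 * 1000.0; // assumption: decimal units
    println!("speedup: {:.1}x", speedup("9m 46s", "3m 13s").unwrap());
    println!("java: {:.1} MB/s", throughput_mb_per_s(size_mb, "9m 46s").unwrap());
    println!("rust: {:.1} MB/s", throughput_mb_per_s(size_mb, "3m 13s").unwrap());
}
```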

Short-read FASTQ benefits most (2.3-3.0x per file) because the three optimisations (SIMD adapter search, lookup-table base counting, zero per-read allocation) all target the per-read processing loop, which dominates for short reads. Long-read data shows a smaller improvement (1.6-2.0x) because gzip decompression and per-base processing of very long sequences become the dominant costs, diluting the relative impact of the adapter and allocation optimisations.

Why is the BAM Speedup Lower than FASTQ?

The BAM speedup (1.6x) is lower than for short-read FASTQ (2.3-3.0x) because record parsing takes up a larger share of the work. Java uses htsjdk, a mature and heavily optimised BAM library written in Java. Rust uses noodles, a pure Rust implementation. BAM requires BGZF decompression, CIGAR parsing, flag decoding, and reverse-complement reconstruction for every record. noodles does all of this in pure Rust, which is solid but has not had htsjdk's years of tuning. This narrows the gap that the other optimisations create.

Where the Time Goes (Rust)

CPU profile of the Rust implementation processing 2M short reads (macOS sample tool, 3,300 active samples). This shows where the Rust rewrite spends its time:

| Component | % |
|---|---|
| Gzip decompression (native libz) | 25% |
| Adapter search (memchr SIMD) | 18% |
| PerBaseSequenceContent | 12% |
| NContent | 11% |
| BasicStats | 9% |
| FASTQ I/O + parsing | 8% |
| PerBaseQuality + PerTileQuality | 6% |
| PerSequenceQuality + GCContent | 3% |
| Memory, hashing, CRC, other | 8% |

A quarter of the time is spent in gzip decompression (native C libz, shared with Java). The remaining 75% is application code where the three optimisations described below have their effect. An equivalent profile of Java is not available because Java Flight Recorder cannot sample native C code or JVM intrinsics, making a direct side-by-side comparison misleading.

What Makes the Rust Version Faster

The speed improvement comes from three specific optimisations, not from Rust being inherently faster than Java. Each was identified through CPU profiling and contributes a measurable share of the overall gain.

1. SIMD-accelerated adapter search (biggest single factor)

FastQC searches every read for 6 adapter sequences. In Java, String.indexOf() effectively compares one character at a time. The Rust version uses the memchr crate, which uses ARM NEON SIMD instructions to compare 16 bytes per instruction. The adapter search patterns are also pre-compiled once at startup rather than being reconstructed on every call. This single change reduces adapter searching from 35% of total runtime to 6%.
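The "build once, search every read" structure can be sketched with the standard library alone. This is not the report's implementation: the scalar `find` function below is a stand-in for a SIMD searcher, and the `AdapterSet` type, its method names, and the two adapter sequences are illustrative assumptions. With the memchr crate, one option would be to store a `memchr::memmem::Finder` per adapter in `new` and call its `find` method in place of the scalar function, though the report does not say which memchr API it uses.

```rust
/// Adapter patterns prepared once at startup and reused for every read.
struct AdapterSet {
    patterns: Vec<(&'static str, &'static [u8])>,
}

impl AdapterSet {
    fn new() -> Self {
        // Illustrative entries only; a real run would load the configured list.
        AdapterSet {
            patterns: vec![
                ("illumina_universal", "AGATCGGAAGAG".as_bytes()),
                ("nextera_transposase", "CTGTCTCTTATA".as_bytes()),
            ],
        }
    }

    /// Yields (adapter name, offset of first match) for each adapter found.
    /// Nothing is allocated per read; the iterator borrows both inputs.
    fn hits<'a>(&'a self, read: &'a [u8]) -> impl Iterator<Item = (&'static str, usize)> + 'a {
        self.patterns
            .iter()
            .filter_map(move |&(name, pattern)| find(read, pattern).map(|pos| (name, pos)))
    }
}

/// Scalar substring search, standing in for a SIMD-accelerated searcher.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|window| window == needle)
}

fn main() {
    let adapters = AdapterSet::new(); // built once, before the read loop
    let reads: [&[u8]; 2] = [b"ACGTACGTAGATCGGAAGAGCACACG", b"ACGTACGTACGTACGT"];
    for read in reads {
        for (name, pos) in adapters.hits(read) {
            println!("{} found at offset {}", name, pos);
        }
    }
}
```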

2. Lookup-table base counting (halves module processing time)

Three QC modules (BasicStats, PerBaseSequenceContent, NContent) independently iterate over every byte of every read to classify bases as A, C, G, T, or N. In Java, each module uses a multi-way switch or if/else chain. When the four bases are close to uniformly distributed, the CPU branch predictor guesses correctly only about 25% of the time, and each misprediction stalls the pipeline.

The Rust version replaces these branches with a 256-byte compile-time lookup table that maps each ASCII byte directly to a base index. This turns an unpredictable branch into a single indexed memory load. Combined with storing per-position counts in a contiguous [u64; 4] array (instead of 4 separate arrays), this halves the time spent in module processing.
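A minimal sketch of the lookup-table idea follows. The names are hypothetical, and one detail differs from the description above by assumption: each position gets five slots rather than four, so that N and any other byte land in a fifth slot without a branch. The table handles upper-case bases only, on the assumption that reads have already been upper-cased.

```rust
/// Slot used for N and for any byte that is not A, C, G or T.
const OTHER: u8 = 4;

/// Builds the 256-entry table at compile time: ASCII byte -> base index.
const fn build_table() -> [u8; 256] {
    let mut table = [OTHER; 256];
    table[b'A' as usize] = 0;
    table[b'C' as usize] = 1;
    table[b'G' as usize] = 2;
    table[b'T' as usize] = 3;
    table
}

static BASE_INDEX: [u8; 256] = build_table();

/// Adds one read to the per-position counts. Each position holds one small
/// contiguous array, so all counters for a position sit next to each other.
fn count_bases(read: &[u8], counts: &mut Vec<[u64; 5]>) {
    if counts.len() < read.len() {
        counts.resize(read.len(), [0; 5]);
    }
    for (pos, &byte) in read.iter().enumerate() {
        // One indexed load replaces the switch / if-else chain.
        let slot = BASE_INDEX[byte as usize] as usize;
        counts[pos][slot] += 1;
    }
}

fn main() {
    let mut counts: Vec<[u64; 5]> = Vec::new();
    count_bases(b"ACGTN", &mut counts);
    count_bases(b"AAGT", &mut counts);
    for (pos, c) in counts.iter().enumerate() {
        println!("pos {}: A={} C={} G={} T={} other={}", pos, c[0], c[1], c[2], c[3], c[4]);
    }
}
```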

3. Zero per-read allocation

Java creates several new objects for every read: a Sequence object, a new String from toUpperCase(), String[] arrays from split(":") for tile parsing, and substring() calls for overrepresented sequence tracking. Each allocation contributes to heap pressure and eventually triggers garbage collection. Over a typical run, the JVM performs tens of thousands of GC events.

The Rust version avoids all of this. Sequences are uppercase-converted in place (a single pass over the existing byte buffer, no new allocation). Tile IDs are parsed by scanning for colons without splitting. Overrepresented sequences are tracked using byte slices that reference the existing data. With no garbage collector, there are no GC pauses and no stop-the-world events.
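Two of these techniques can be sketched with standard-library calls only. The function names are hypothetical, and the tile is assumed to be the fifth colon-separated field of the header, as in the common seven-field Illumina layout. Other header layouts would need a different index.

```rust
/// Upper-cases the sequence inside its existing buffer: no new allocation.
fn uppercase_in_place(seq: &mut [u8]) {
    seq.make_ascii_uppercase();
}

/// Extracts the tile number by walking the colon-separated fields lazily.
/// `split` on a slice yields borrowed sub-slices, so nothing is copied.
fn tile_id(header: &[u8]) -> Option<u32> {
    let field = header.split(|&b| b == b':').nth(4)?; // assumption: tile is field 4
    std::str::from_utf8(field).ok()?.parse().ok()
}

fn main() {
    // A mutable buffer standing in for the reader's reusable line buffer.
    let mut seq = *b"acgtNacgt";
    uppercase_in_place(&mut seq);
    println!("{}", String::from_utf8_lossy(&seq));

    let header = b"@SIM:1:FCX:1:2106:15343:197393 1:N:0:ATCACG";
    match tile_id(header) {
        Some(tile) => println!("tile {}", tile),
        None => println!("no tile field"),
    }
}
```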

Could these improvements be back-ported to Java?

We tested each optimisation in Java to see whether the same gains could be achieved without a full rewrite.

SIMD adapter search: No. Java has no way to invoke NEON or SSE intrinsics for substring matching. String.indexOf() already uses HotSpot's best internal intrinsic. We tested replacing it with a manual char[] loop and it was actually slower (16.0s vs 12.9s) because HotSpot's built-in implementation is more optimised than hand-written Java. The Java Vector API (JEP 338) is available as an incubator module but does not support substring search patterns.

Lookup-table base counting: Marginal. Java's JIT compiler already converts small switch statements on byte values into jump tables internally. Replacing the switch with an explicit array lookup in Java measured only ~3% improvement. The remaining gap is that Rust's compile-time constant lookup table with a register-allocated local accumulator compiles to tighter machine code than what HotSpot generates at runtime.

Zero-allocation processing: Structurally limited. Java's String is immutable and heap-allocated by design. toUpperCase() must return a new String. split(":") must allocate a String[]. We tested workarounds (check-before-uppercase, manual colon scanning, HashMap to array conversion) and measured negligible improvement in each case. HotSpot's escape analysis and generational GC already handle short-lived allocations efficiently, but not for free: the JVM still recorded tens of thousands of GC events per run, each causing brief pauses.

The Java version is already close to its performance ceiling. HotSpot's JIT does a good job with the patterns that exist in the codebase, and the remaining gap comes from structural properties of the JVM: heap-allocated immutable strings, no SIMD substring search, and garbage collection overhead that cannot be fully eliminated.

Build Variants

The Rust binary defaults to native zlib for gzip decompression (available on all Linux/macOS). A fully static pure-Rust build is available for WASM, Windows, or environments without libz:

| Build | Command | Dependencies |
|---|---|---|
| Default (native zlib) | cargo build --release | libz (system) |
| Pure Rust | cargo build --release --no-default-features | None |

Overall Summary

| Metric | Java | Rust |
|---|---|---|
| Short-read speed (avg) | 14 MB/s | 38 MB/s (2.7x faster) |
| Long-read speed (avg) | 19 MB/s | 34 MB/s (1.8x faster) |
| BAM speed | 23 MB/s | 39 MB/s (1.6x faster) |
| Peak memory (short-read) | 361-483 MB | 25-55 MB (7-18x less) |
| Peak memory (long-read) | 1.5-2.2 GB | 1.0-1.2 GB (1.3-2.1x less) |
| Binary size | ~100 MB (JRE + Perl) | 4.8 MB |
| Runtime dependencies | JRE + Perl | libz only (or none for pure build) |

Real-World Impact

To put these performance gains in context, we analysed FastQC task execution data from the Seqera Platform Cloud, which runs Nextflow pipelines at scale for bioinformatics teams worldwide.

Scale of FastQC usage (Seqera Platform Cloud only)

| Metric | Value |
|---|---|
| Period analysed | Jan 2025 - Mar 2026 (15 months) |
| Total FastQC tasks | 3.6 million |
| Total CPU hours consumed | 1.1 million hours |
| Average per month | ~242,000 tasks / ~73,000 CPU hours |
| Annualised | ~2.9 million tasks / ~877,000 CPU hours |

Projected savings with the Rust rewrite

With a 2.7x average short-read speedup (the dominant use case), each CPU hour of FastQC work would complete in 22 minutes instead of 60. Applied to the annualised Seqera Cloud usage alone:

| Metric | Current (Java) | Projected (Rust) | Saved |
|---|---|---|---|
| Annual CPU hours | 877,000 | 325,000 | 552,000 hours |
| Annual cloud compute cost* | $26,300 | $9,700 | $16,600 |
| Annual CO2 emissions** | 61 tonnes | 23 tonnes | 39 tonnes |

* Estimated at $0.03/CPU-hour (typical cloud spot pricing for compute-optimised instances).
** Estimated using global average grid carbon intensity (0.35 kgCO2/kWh) and ~200W per CPU core. 39 tonnes of CO2 is roughly equivalent to what 1,600 mature trees absorb in a year.
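The projection is straightforward arithmetic on the stated assumptions, which the sketch below reproduces. The struct and function names are hypothetical; every input value is taken from the text and footnotes above.

```rust
/// Result of applying a uniform speedup to an annual CPU-hour total.
struct Projection {
    hours_after: f64,
    hours_saved: f64,
    cost_saved_usd: f64,
    co2_saved_tonnes: f64,
}

fn project(
    annual_cpu_hours: f64,
    speedup: f64,
    usd_per_cpu_hour: f64,
    kw_per_core: f64,
    kg_co2_per_kwh: f64,
) -> Projection {
    let hours_after = annual_cpu_hours / speedup;
    let hours_saved = annual_cpu_hours - hours_after;
    Projection {
        hours_after,
        hours_saved,
        cost_saved_usd: hours_saved * usd_per_cpu_hour,
        // CPU hours -> kWh -> kg CO2 -> tonnes.
        co2_saved_tonnes: hours_saved * kw_per_core * kg_co2_per_kwh / 1000.0,
    }
}

fn main() {
    // 877,000 CPU hours, 2.7x speedup, $0.03/CPU-hour, 200 W/core, 0.35 kgCO2/kWh.
    let p = project(877_000.0, 2.7, 0.03, 0.2, 0.35);
    println!("hours after:  {:.0}", p.hours_after);
    println!("hours saved:  {:.0}", p.hours_saved);
    println!("cost saved:   ${:.0}", p.cost_saved_usd);
    println!("CO2 saved:    {:.1} tonnes", p.co2_saved_tonnes);
}
```

Rounded to the precision used in the table, these inputs give about 325,000 hours remaining, 552,000 hours saved, $16,600 saved, and 39 tonnes of CO2 avoided.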

Beyond Seqera Platform

These figures represent a single cloud platform. FastQC is one of the most widely used bioinformatics tools in the world, running in university clusters, hospital genomics labs, national sequencing centres, and cloud pipelines globally. The total compute spent on FastQC across all platforms is likely orders of magnitude larger.

The memory reduction also has practical consequences: cloud instances can be significantly downsized (a 512 MB task becomes a 30 MB task for short reads), reducing the minimum instance size required and improving bin-packing efficiency in schedulers like Kubernetes and AWS Batch.