Read headers

Many high-throughput sequencers produce FASTQ files. While they all adhere to the FASTQ format specification, different machines produce read headers in slightly different formats.

Broadly, there are five header styles:

Illumina-like: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>(:<UMI>(+<UMI2>)?)? <read_or_track>:<is_filtered>:<control_number>(:<sample_number_or_barcode>(+<barcode2>)?)?
Oxford Nanopore (v26.01 and later): @<read_id>(\t<key>:<type>:<value>)*
Oxford Nanopore (v25.09 and earlier): @<read_id>(\s<key>=<value>)*
MGI/BGI: @<instrument><lane><column><row>_<count>/<read>
Unstructured: @<read_id>

"Illumina-like" is most common, with variations as to the particular order, meaning, or allowable values across different machines and platforms. "Oxford Nanopore" (both versions) are commonly found performing custom processing on reads, when custom information needs to be stored in the header alongside the sequence. "Unstructured" is common when downloading processed FASTQ files from the Sequence Read Archive (SRA), European Nucleotide Archive (ENA), and other repositories, since the original header is replaced.

Below is a detailed, referenced list of these different formats.

Element Biosciences

Element Biosciences has machines that can produce FASTQ files both for DNA sequences and for cytoprofiling. The two FASTQs have slightly different header formats.

Sequencing

The header format depends on the sample indexing strategy.

No Index: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<control_number>:<sample_number>
Single Index: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<control_number>:<BC>
Dual Index: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<control_number>:<BC1>+<BC2>

Element	Description	Matching Regex
`<instrument>`	Instrument name	`[A-Za-z0-9_]+`
`<run>`	Run name	`[A-Za-z0-9_-]+`
`<flow_cell>`	Flow cell ID from the barcode scan. If the barcode scan fails during the run and no barcode is present, then the run ID replaces the flow cell ID.	`[A-Za-z0-9]+`
`<lane>`	Lane number	`[12]`
`<tile>`	Tile number	`\d+`
`<x>`	X coordinate of the polony	`(0+)?\d+`
`<y>`	Y coordinate of the polony	`(0+)?\d+`
`<UMI>`	UMI sequence with a plus sign that separates the Read 1 and Read 2 sequences, if applicable	`[ACGTN]+`
`<read>`	Read number	`[12]`
`<is_filtered>`	A legacy filtering value of `N` that exists only for backward compatibility and does not change	`N`
`<control_number>`	A legacy control number of `0` that exists only for backward compatibility and does not change	`0`
`<BC>`, `<BC1>`, `<BC2>`	Sample barcode sequence(s)	`[ACGT]+`
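The table above can be assembled into a single pattern with named groups. The following is a minimal sketch, assuming Python's `re` module; the names `ELEMENT_SEQ_HEADER` and `parse_element_header` are hypothetical. It also assumes, based on the `<UMI>` description, that a dual UMI appears as two sequences joined by `+`, and that the final field is either a numeric sample number or one or two barcodes.

```python
import re

# Built from the per-field regexes in the table above (hypothetical name).
ELEMENT_SEQ_HEADER = re.compile(
    r"^@(?P<instrument>[A-Za-z0-9_]+)"
    r":(?P<run>[A-Za-z0-9_-]+)"
    r":(?P<flow_cell>[A-Za-z0-9]+)"
    r":(?P<lane>[12])"
    r":(?P<tile>\d+)"
    r":(?P<x>\d+)"  # \d+ already accepts zero-padded coordinates
    r":(?P<y>\d+)"
    # Assumption: an optional "+<UMI>" part for Read 1 / Read 2 UMIs.
    r":(?P<umi>[ACGTN]+(?:\+[ACGTN]+)?)"
    r" (?P<read>[12]):(?P<is_filtered>N):(?P<control_number>0)"
    # Sample number (no index), or one or two barcodes.
    r":(?P<index>\d+|[ACGT]+(?:\+[ACGT]+)?)$"
)


def parse_element_header(header: str):
    """Return the header fields as a dict of strings, or None if no match."""
    match = ELEMENT_SEQ_HEADER.match(header.rstrip("\r\n"))
    return match.groupdict() if match else None
```

All values come back as strings, so zero-padded coordinates such as `0123` are preserved exactly as written.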

Cytoprofiling

The header format is:

@<instrument>:<run>:<flow_cell>:<well_index>:<z_tile>:<x>:<y>:<cell_id>:<nuclear_status>:<batch_id> <read>:<is_filtered>:<control_number>:<BC>

Element	Description	Matching Regex
`<instrument>`	Instrument name	`[A-Za-z0-9_-]+`
`<run>`	Run name	`[A-Za-z0-9-]+`
`<flow_cell>`	Flow cell ID, `R<run>`, or `UNKNOWN_FLOWCELL`	`[A-Za-z0-9-]+`
`<well_index>`	Well ID	`\d+`
`<z_tile>`	The tile plus the z slice in the format of `SRRCCZZ` (surface, row, row, col, col, z, z)	`\d{7}`
`<x>`	X coordinate of the polony, 0-padded	`(0+)?\d+`
`<y>`	Y coordinate of the polony, 0-padded	`(0+)?\d+`
`<cell_id>`	Unique ID of the cell	`\d+`
`<nuclear_status>`	The cellular location of a polony, whether the polony is in the nucleus or not. `0` is cellular and `1` is nuclear.	`[01]`
`<batch_id>`	The on-instrument cycling batch	`\w+`
`<read>`	Always `1` for cytoprofiling	`1`
`<is_filtered>`	A legacy filtering value of `N` that exists only for backward compatibility and does not change	`N`
`<control_number>`	A legacy control number of `0` that exists only for backward compatibility and does not change	`0`
`<BC>`	Always `1` for cytoprofiling	`1`
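The packed `<z_tile>` field can be split by position. This is a small sketch that assumes only the `SRRCCZZ` layout described in the table; the function name `split_z_tile` and the returned key names are hypothetical.

```python
import re


def split_z_tile(z_tile: str) -> dict:
    """Split a 7-digit SRRCCZZ value into its four components."""
    if not re.fullmatch(r"\d{7}", z_tile):
        raise ValueError(f"expected 7 digits, got {z_tile!r}")
    return {
        "surface": int(z_tile[0]),     # S
        "row": int(z_tile[1:3]),       # RR
        "column": int(z_tile[3:5]),    # CC
        "z_slice": int(z_tile[5:7]),   # ZZ
    }
```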

Illumina

The header format is:

@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<control_number>:<sample_number>

Element	Description	Matching Regex
`<instrument>`	Instrument ID	`[A-Za-z0-9_]+`
`<run>`	Run number on instrument	`\d+`
`<flow_cell>`	Flow cell ID	`[A-Za-z0-9]+`
`<lane>`	Lane number	`\d+`
`<tile>`	Tile number	`\d+`
`<x>`	X coordinate of cluster	`\d+`
`<y>`	Y coordinate of cluster	`\d+`
`<read>`	Read number. `1` for single-end reads or Read 1 of paired-end reads, `2` for Read 2 of paired-end reads.	`[12]`
`<is_filtered>`	`Y` if the read is filtered (did not pass), `N` otherwise	`[YN]`
`<control_number>`	0 when none of the control bits are on, otherwise it is an even number	`\d+`
`<sample_number>`	Sample number from sample sheet	`\d+`

On HiSeq X systems, control specification is not performed and this number is always 0. If the read is identified as a control, the number is greater than zero, and the value specifies what type of control it is. The value is the decimal representation of a bit-wise encoding scheme. In that scheme bit 0 has a decimal value of 1, bit 1 a value of 2, bit 2 a value of 4, and so on.
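The bit-wise encoding can be unpacked with a few integer operations. The sketch below only recovers which bit positions are set; what each bit means is not specified here, so no meanings are attached. The function name `control_bits` is hypothetical.

```python
def control_bits(control_number: int) -> list[int]:
    """Return the positions of the bits set in a control number.

    Bit 0 has decimal value 1, bit 1 has value 2, bit 2 has value 4, and so on.
    """
    if control_number < 0:
        raise ValueError("control number must be non-negative")
    return [
        bit
        for bit in range(control_number.bit_length())
        if (control_number >> bit) & 1
    ]
```

For example, a control number of 18 is 16 + 2, so bits 1 and 4 are set, while 0 yields an empty list (not a control read).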

MGI / BGI

The header format is:

@<instrument><lane><column><row>_<count>/<read>

Element	Description	Matching Regex
`<instrument>`	Instrument ID	`CL\d+`
`<lane>`	Lane number	`L\d`
`<column>`	DNA ball column index	`C\d{3}`
`<row>`	DNA ball row index	`R\d{3}`
`<count>`	Read index, starting at 0	`\d+`
`<read>`	`1` for forward read, `2` for reverse read	`[12]`
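Unlike the colon-delimited styles, the MGI/BGI header has no separators between its first four fields, so parsing relies on the `L`, `C`, and `R` prefixes. A minimal sketch using the regexes from the table follows; the names `MGI_HEADER` and `parse_mgi_header` are hypothetical, and the example header is made up for illustration.

```python
import re

# Field patterns copied from the table above (hypothetical name).
MGI_HEADER = re.compile(
    r"^@(?P<instrument>CL\d+)"
    r"(?P<lane>L\d)"
    r"(?P<column>C\d{3})"
    r"(?P<row>R\d{3})"
    r"_(?P<count>\d+)"
    r"/(?P<read>[12])$"
)


def parse_mgi_header(header: str):
    """Return the parsed fields, with numeric parts as ints, or None."""
    match = MGI_HEADER.match(header.rstrip("\r\n"))
    if match is None:
        return None
    return {
        "instrument": match["instrument"],
        "lane": int(match["lane"][1:]),      # drop the leading "L"
        "column": int(match["column"][1:]),  # drop the leading "C"
        "row": int(match["row"][1:]),        # drop the leading "R"
        "count": int(match["count"]),
        "read": int(match["read"]),
    }
```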

Oxford Nanopore

Oxford Nanopore's <read_id> is unique among sequencing platforms, in that the ID is not constructed out of instrument, run, or other metadata. The <read_id> is a universally unique identifier (UUID), as are other metadata elements in the header.

26.01 and Later

@<read_id>(\t<key>:<type>:<value>)*

Key:Type	Description	Matching Regex
`read_id`	UUID for the read	`[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`
`RG:Z`	Read group ID the read originates from	`([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})([a-z0-9@\.])+(([A-Za-z0-9@\.]+))?`
`DT:Z`	Protocol start time of the sequencing run, formatted as RFC3339.	`\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})`
`ch:i`	Number of the channel the read was acquired on.	`[1-9][0-9]*`
`st:Z`	Read start time, formatted as RFC3339. If this read is a split from a parent, this is the start time for the split read.	`\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})`
`PU:Z`	Unique identifier for the flow cell	`[A-Z0-9_-]+`
`LB:Z`	Sample library identifier, set by the user when loading samples. Absent if not set.	`[a-zA-Z0-9_\.-]+`
`SM:Z`	Barcode for the identified read. Only used when barcoding.	`barcode([0-9]+)`
`al:Z`	User-specified identifier used for the barcode, if available, otherwise the arrangement name. Included only if data is present and the arrangement is not `unclassified`.	`unclassified|[A-Za-z0-9-_.]+`
`pi:Z`	Parent read ID for a split read	`[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`
`DS:Z`	GPU device name(s) used during GPU-based calling	`.*`
`ns:i`	Number of samples in the signal	`[1-9][0-9]*`
`qs:f`	Read mean basecall Q-score	`[0-9]+(\.)?[0-9]*`
`mx:i`	Mux the read originated from	`[1-9][0-9]*`
`rn:i`	Read number	`[1-9][0-9]*`
`ts:i`	Number of samples trimmed from the start of the signal	`[1-9][0-9]*`
`sm:f`	Scaling median	`[0-9]+(\.)?[0-9]*`
`sd:f`	Scaling dispersion (also sometimes referred to as "MAD" or "spread")	`[0-9]+(\.)?[0-9]*`
`sv:Z`	Scaling method used by the base caller	`(med_mad|quantile)`
`du:f`	Duration of the read (in seconds)	`[0-9]+(\.)?[0-9]*`
`dx:i`	Duplex read indicator. `1` for duplex reads, `0` for simplex reads without duplex offspring, `-1` for simplex reads with duplex offspring.	`(-1|0|1)`
`pt:i`	Estimated number of bases in the polyA/T tail. Only used if `poly_a_tail_estimation` is set.	`[1-9][0-9]*`
`pa:B:i`	Integer array of PolyA/T tail range information. Only used if `poly_a_tail_estimation` is set.	See Section 4.2.4 "Auxiliary data encoding" in the SAM Specification for details about its encoding

This <key>:<type> format is borrowed from the SAM Specification.
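Because each tag carries its own type code, a header in this format can be converted to typed values without knowing the keys in advance. The following is a minimal sketch, assuming the SAM-style codes `i` (integer), `f` (float), and `B` (comma-separated array with a leading subtype); every other type is kept as a string. The function name `parse_tagged_header` is hypothetical.

```python
def parse_tagged_header(header: str):
    """Split '@<read_id>(\\t<key>:<type>:<value>)*' into (read_id, tags)."""
    line = header.rstrip("\r\n")
    if not line.startswith("@"):
        raise ValueError("FASTQ headers start with '@'")
    read_id, *fields = line[1:].split("\t")
    tags = {}
    for field in fields:
        # Split on the first two colons only; values such as timestamps
        # contain colons of their own.
        key, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[key] = int(value)
        elif type_code == "f":
            tags[key] = float(value)
        elif type_code == "B":
            subtype, *items = value.split(",")
            convert = float if subtype == "f" else int
            tags[key] = [convert(item) for item in items]
        else:
            tags[key] = value  # Z and any other type: keep the raw string
    return read_id, tags
```

Unknown keys are kept rather than rejected, so the sketch keeps working if new tags are added to the header.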

25.09 and Earlier

@<read_id>(\s<key>=<value>)*

Element	Description	Matching Regex
`read_id`	UUID for the read	`[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`
`runid`	UUID for the sequencing protocol	`[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`
`ch`	The number of the channel the read was acquired on. The first channel is 1.	`[0-9]+`
`start_time`	The time the read started in RFC3339 format.	`((?:(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2}(?:\.\d+)?))(Z|[+-]\d{2}:\d{2})?)`
`flow_cell_id`	The human-readable identifier for the flow cell	`[A-Z0-9_-]+`
`protocol_group_id`	Set by the user in the GUI as "Experiment ID".	`[a-zA-Z0-9_\.-]+`
`sample_id`	Set by the user in the GUI as "Sample ID".	`[a-zA-Z0-9_\.-]+`
`barcode`	The barcode assigned to this read by the basecaller (e.g. `barcode01`). `unclassified` if no barcode was detected.	`unclassified|barcode([0-9]+)`
`barcode_alias`	The user-supplied alias for the barcode. Empty if barcoding is not running. The same as `barcode` if the user did not supply an alias.	`unclassified|[A-Za-z0-9-_.]+`
`parent_read_id`	The `read_id` of the read which was the source of this FASTQ entry. This may be the same as the FASTQ entry id if no read splitting was performed for this read, or will be a new globally unique UUID value if this read was split out of another read by the basecaller.	`[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`
`basecall_model_version_id`	The unique identifier for the basecall model used to generate this FASTQ file, as supplied by the basecaller	`[a-z0-9_@\.]+`
`basecall_gpu`	A string description of the connected GPU.	`.*`

These keys may not necessarily be in the order listed and should be treated as an unordered set of values. There may be additional patterns in the read header, so pattern-matching on only these keys, exclusively, should be avoided. See Sequencing Summary documentation for details on mapping these IDs across file formats.
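In line with that advice, a parser for this format can collect every `key=value` pair into a dictionary instead of matching a fixed sequence of keys. The sketch below is one way to do that; the name `parse_keyvalue_header` is hypothetical. It assumes that a value runs until the next whitespace-separated `key=` token, which lets values such as a GPU description contain spaces.

```python
import re

# A key, "=", then the shortest value that is followed by either another
# whitespace-separated "key=" token or the end of the line.
_PAIR = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_keyvalue_header(header: str):
    """Split '@<read_id>(\\s<key>=<value>)*' into (read_id, fields)."""
    line = header.rstrip("\r\n")
    if not line.startswith("@"):
        raise ValueError("FASTQ headers start with '@'")
    parts = line[1:].split(None, 1)  # read ID, then the rest of the line
    read_id = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    # Order is irrelevant and unknown keys are kept, not rejected.
    return read_id, dict(_PAIR.findall(rest))
```

All values are returned as strings; callers can convert the ones they need, such as `ch`, after looking them up by key.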

Pacific Biosciences

PacBio's obc2fastq tool can output different read header formats, depending on the indexing strategy and command line invocation¹:

No Index, Single UMI: @<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI> <track>:<is_filtered>:<control_number>
No Index, Dual UMI: @<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI1>+<UMI2> <track>:<is_filtered>:<control_number>
Single Index, Single UMI: @<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI> <track>:<is_filtered>:<control_number>:<BC>
Single Index, Dual UMI: @<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI1>+<UMI2> <track>:<is_filtered>:<control_number>:<BC>
Dual Index, Single UMI: @<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI> <track>:<is_filtered>:<control_number>:<BC1>+<BC2>
Dual Index, Dual UMI: @<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI1>+<UMI2> <track>:<is_filtered>:<control_number>:<BC1>+<BC2>

Element	Description	Matching Regex
`<instrument>`	Instrument ID	`[A-Z0-9]+`
`<run>`	Run number	`\d+`
`<flow_cell>`	Flow cell ID	`[A-Za-z0-9-]+`
`<lane>`	Lane number	`[12]`
`<swathtile>`	Swath tile	`\d+`
`<x>`	X coordinate of the polony	`\d+`
`<y>`	Y coordinate of the polony	`\d+`
`<UMI>`, `<UMI1>`, `<UMI2>`	Unique molecular index (single UMI, or UMI pairs) for the molecule	`[ACGTN]+`
`<track>`	Track the read comes from, related to the cycle mask	`[1-4]`
`<is_filtered>`	A legacy filtering value of `N` that exists only for backward compatibility and does not change	`N`
`<control_number>`	A legacy control number of `N` that exists only for backward compatibility and does not change	`N`
`<BC>`, `<BC1>`, `<BC2>`	Sample barcode sequence(s)	`[ACGTN]+`

Singular Genomics

The header format depends on the sample indexing strategy.

No Index, No UMI: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<is_control>
No Index, Single UMI: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<is_control>
No Index, Dual UMI: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI1>+<UMI2> <read>:<is_filtered>:<is_control>
Single Index, No UMI: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<is_control>:<BC>
Single Index, Single UMI: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<is_control>:<BC>
Single Index, Dual UMI: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI1>+<UMI2> <read>:<is_filtered>:<is_control>:<BC>
Dual Index, No UMI: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<is_control>:<BC1>+<BC2>
Dual Index, Single UMI: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<is_control>:<BC1>+<BC2>
Dual Index, Dual UMI: @<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI1>+<UMI2> <read>:<is_filtered>:<is_control>:<BC1>+<BC2>

Element	Description	Matching Regex
`<instrument>`	Instrument ID	`[A-Z0-9-]+`
`<run>`	Run number	`[0-9]+`
`<flow_cell>`	Flow cell ID number from the EPROM	`[A-Z0-9]+`
`<lane>`	Lane number the molecule was sequenced on	`[1-4]`
`<tile>`	Tile number	`[0-9]+`
`<x>`	X coordinate of the sequencing cluster within the tile	`[0-9]+`
`<y>`	Y coordinate of the sequencing cluster within the tile	`[0-9]+`
`<UMI>`, `<UMI1>`, `<UMI2>`	Unique molecular identifier(s)	`[ACGT]+`
`<read>`	Read number for paired-end sequencing. If single-end, this value is `1`	`[12]`
`<is_filtered>`	Did this read fail filtering?	`[NY]`
`<is_control>`	Boolean value if this read is part of an internal control (`0` for no, `1` for yes, typically a PhiX molecule)	`[01]`
`<BC>`, `<BC1>`, `<BC2>`	Sample barcode sequence(s)	`[ACGT]+`
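Since the UMI and barcode parts are both optional, one pattern with optional groups can cover all nine Singular Genomics variants and report which one a header uses. This is a sketch built from the table's regexes; the names `SINGULAR_HEADER` and `indexing_strategy` and the example headers are hypothetical.

```python
import re

# One pattern for every index/UMI combination (hypothetical name).
SINGULAR_HEADER = re.compile(
    r"^@(?P<instrument>[A-Z0-9-]+):(?P<run>[0-9]+):(?P<flow_cell>[A-Z0-9]+)"
    r":(?P<lane>[1-4]):(?P<tile>[0-9]+):(?P<x>[0-9]+):(?P<y>[0-9]+)"
    # Optional ":<UMI>" or ":<UMI1>+<UMI2>" after the y coordinate.
    r"(?::(?P<umi1>[ACGT]+)(?:\+(?P<umi2>[ACGT]+))?)?"
    r" (?P<read>[12]):(?P<is_filtered>[NY]):(?P<is_control>[01])"
    # Optional ":<BC>" or ":<BC1>+<BC2>" at the end.
    r"(?::(?P<bc1>[ACGT]+)(?:\+(?P<bc2>[ACGT]+))?)?$"
)


def indexing_strategy(header: str):
    """Return (index strategy, UMI strategy), or None if the header does not match."""
    match = SINGULAR_HEADER.match(header.rstrip("\r\n"))
    if match is None:
        return None
    umi = "dual UMI" if match["umi2"] else "single UMI" if match["umi1"] else "no UMI"
    index = "dual index" if match["bc2"] else "single index" if match["bc1"] else "no index"
    return index, umi
```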

Ultima Genomics

Ultima machines natively output sequencing files in the CRAM format, and normally skip FASTQs. However, FASTQs can be created from the output CRAM file. The header format is:

@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<control_number>:<UMI1>+<UMI2>

Element	Description	Matching Regex
`<instrument>`	Instrument ID	`[A-Za-z0-9_-]+`
`<run>`	Run number on instrument	`\d+`
`<flow_cell>`	Flow cell ID	`[A-Z]+`
`<lane>`	Lane number	`[1-4]`
`<tile>`	Tile number	`\d+`
`<x>`	X coordinate of cluster	`\d+`
`<y>`	Y coordinate of cluster	`\d+`
`<read>`	Read number for paired-end reads	`[12]`
`<is_filtered>`	`Y` if the read is filtered (did not pass), `N` otherwise	`[YN]`
`<control_number>`	`0` when none of the control bits are on, otherwise it is an even number	`\d+`
`<UMI1>`, `<UMI2>`	Unique molecular identifiers from the 5' and 3' ends, respectively	`[ACGTN]+`

References

Element Biosciences

bases2fastq documentation

Illumina

MGI/BGI

Oxford Nanopore

Pacific Biosciences

Singular Genomics

Ultima Genomics

Qiagen CLC Genomics Workbench: Ultima Genomics Data
European Nucleotide Archive: Experiment ERX11081003. Inspecting uploaded FASTQs from the experiment.

¹ obc2fastq v6.1 documentation actually states <UMI>+<track> for single UMI reads, but this is most likely a typo. v6.0 states <UMI> <track>, and this format is consistent with the dual UMI read.