Read headers
Many high-throughput sequencers produce FASTQ files. While they all adhere to the FASTQ format specification, different machines produce read headers with slightly different formats. Here is a detailed, referenced list of these different formats.
Element Biosciences
Element Biosciences has machines that can produce FASTQ files both for DNA sequences and for cytoprofiling. The two FASTQs have slightly different header formats.
For DNA sequencing, the header format depends on the sample indexing strategy.
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<control_number>:<sample_number>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<control_number>:<BC>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<control_number>:<BC1>+<BC2>
The values are described in the table below.
| Element | Description | Matching Regex |
|---|---|---|
| <instrument> | Instrument name | [A-Za-z0-9_]+ |
| <run> | Run name | [A-Za-z0-9_-]+ |
| <flow_cell> | Flow cell ID from the barcode scan. If the barcode scan fails during the run and no barcode is present, then the run ID replaces the flow cell ID. | [A-Za-z0-9]+ |
| <lane> | Lane number | [12] |
| <tile> | Tile number | \d+ |
| <x> | X coordinate of the polony | (0+)?\d+ |
| <y> | Y coordinate of the polony | (0+)?\d+ |
| <UMI> | UMI sequence with a plus sign that separates the Read 1 and Read 2 sequences, if applicable | [ACGTN]+ |
| <read> | Read number | [12] |
| <is_filtered> | A legacy filtering value of N that exists only for backward compatibility and does not change | N |
| <control_number> | A legacy control number of 0 that exists only for backward compatibility and does not change | 0 |
| <BC>, <BC1>, <BC2> | Sample barcode sequence(s) | [ACGT]+ |
For cytoprofiling, the header format is:
@<instrument>:<run>:<flow_cell>:<well_index>:<z_tile>:<x>:<y>:<cell_id>:<nuclear_status>:<batch_id> <read>:<is_filtered>:<control_number>:<BC>
The values are described in the table below.
| Element | Description | Matching Regex |
|---|---|---|
| <instrument> | Instrument name | [A-Za-z0-9_-]+ |
| <run> | Run name | [A-Za-z0-9-]+ |
| <flow_cell> | Flow cell ID, R<run>, or UNKNOWN_FLOWCELL | [A-Za-z0-9-]+ |
| <well_index> | Well ID | \d+ |
| <z_tile> | The tile plus the z slice in the format of SRRCCZZ (surface, row, row, col, col, z, z) | \d{7} |
| <x> | X coordinate of the polony, 0-padded | (0+)?\d+ |
| <y> | Y coordinate of the polony, 0-padded | (0+)?\d+ |
| <cell_id> | Unique ID of the cell | \d+ |
| <nuclear_status> | The cellular location of a polony, whether the polony is in the nucleus or not. 0 is cellular and 1 is nuclear. | [01] |
| <batch_id> | The on-instrument cycling batch | \w+ |
| <read> | Always 1 for cytoprofiling | 1 |
| <is_filtered> | A legacy filtering value of N that exists only for backward compatibility and does not change | N |
| <control_number> | A legacy control number of 0 that exists only for backward compatibility and does not change | 0 |
| <BC> | Always 1 for cytoprofiling | 1 |
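The SRRCCZZ layout of <z_tile> can be unpacked by position. The following is a minimal sketch, not an Element Biosciences API: the function name `split_z_tile` and the returned key names are hypothetical and chosen only for illustration.

```python
import re

def split_z_tile(z_tile):
    """Split a 7-digit SRRCCZZ string into its positional parts."""
    # The table above gives \d{7}; ASCII digits are assumed here.
    if not re.fullmatch(r"[0-9]{7}", z_tile):
        raise ValueError(f"expected 7 digits (SRRCCZZ), got {z_tile!r}")
    return {
        "surface": int(z_tile[0]),   # S
        "row": int(z_tile[1:3]),     # RR
        "col": int(z_tile[3:5]),     # CC
        "z": int(z_tile[5:7]),       # ZZ
    }
```

For example, `split_z_tile("1020304")` yields surface 1, row 2, column 3, and z slice 4.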
Illumina
The header format is:
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<control_number>:<sample_number>
| Element | Description | Matching Regex |
|---|---|---|
| <instrument> | Instrument ID | [A-Za-z0-9_]+ |
| <run> | Run number on instrument | \d+ |
| <flow_cell> | Flow cell ID | [A-Za-z0-9]+ |
| <lane> | Lane number | \d+ |
| <tile> | Tile number | \d+ |
| <x> | X coordinate of cluster | \d+ |
| <y> | Y coordinate of cluster | \d+ |
| <read> | Read number. 1 can be single-end or Read 2 of paired-end. 0 for Read 1 of paired-end. | [01] |
| <is_filtered> | Y if the read is filtered (did not pass), N otherwise | [YN] |
| <control_number> | 0 when none of the control bits are on, otherwise it is an even number | \d+ |
| <sample_number> | Sample number from sample sheet | \d+ |
On HiSeq X systems, control specification is not performed and this number is always 0. If the read is identified as a control, the number is greater than zero, and the value specifies what type of control it is. The value is the decimal representation of a bit-wise encoding scheme. In that scheme bit 0 has a decimal value of 1, bit 1 a value of 2, bit 2 a value of 4, and so on.
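The bit-wise encoding described above can be unpacked as follows. This is a minimal sketch: the helper name `control_bits` is hypothetical, and because the meaning of each individual bit is not given here, the function only reports which bit positions are set.

```python
def control_bits(control_number):
    """Return the positions of the bits that are set in a control number.

    Bit 0 has a decimal value of 1, bit 1 a value of 2, bit 2 a value
    of 4, and so on.
    """
    if control_number < 0:
        raise ValueError("control number must be non-negative")
    return [
        bit
        for bit in range(control_number.bit_length())
        if (control_number >> bit) & 1
    ]
```

A control number of 0 gives an empty list, while 6 (2 + 4) gives bits 1 and 2.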
MGI / BGI
The header format is:
@<instrument><lane><column><row>_<count>/<read>
| Element | Description | Matching Regex |
|---|---|---|
| <instrument> | Instrument ID | CL\d+ |
| <lane> | Lane number | L\d |
| <column> | DNA ball column index | C\d{3} |
| <row> | DNA ball row index | R\d{3} |
| <count> | Read index, starting at 0 | \d+ |
| <read> | 1 for forward read, 2 for reverse read | [12] |
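Because the MGI / BGI fields are concatenated without separators, the letter prefixes (L, C, R) are what delimit them. A minimal sketch of a parser built from the expressions in the table above follows; the function name `parse_mgi_header` and the example header are hypothetical.

```python
import re

# The L, C, and R prefixes are matched literally; only digits are captured.
MGI_HEADER = re.compile(
    r"@(?P<instrument>CL\d+)"
    r"L(?P<lane>\d)"
    r"C(?P<column>\d{3})"
    r"R(?P<row>\d{3})"
    r"_(?P<count>\d+)"
    r"/(?P<read>[12])"
)

def parse_mgi_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = MGI_HEADER.fullmatch(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Everything except the instrument ID is numeric.
    for key in ("lane", "column", "row", "count", "read"):
        fields[key] = int(fields[key])
    return fields

# Hypothetical header, constructed to fit the format above.
example = parse_mgi_header("@CL100012345L1C001R002_104/1")
```

Here `example["instrument"]` is `"CL100012345"`, and the lane, column, row, count, and read are 1, 1, 2, 104, and 1.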
Oxford Nanopore
Oxford Nanopore's <read_id> is unique among sequencing platforms, in that the ID is not constructed out of instrument, run, or other metadata.
The <read_id> is a universally unique identifier (UUID), as are other metadata elements in the header.
@<read_id>(\t<key>:<type>:<value>)*
| Key:Type | Description | Matching Regex |
|---|---|---|
| RG:Z | Read group ID the read originates from | ([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})([a-z0-9@\.])+(([A-Za-z0-9@\.]+))? |
| DT:Z | The protocol start time of the sequencing run, formatted as RFC3339. | \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z\|\+\d{2}:\d{2}) |
| ch:i | Number of the channel the read was acquired on. | [1-9][0-9]* |
| st:Z | Read start time. If this read is a split from a parent, this is the start time for the split read, formatted as RFC3339 | \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z\|\+\d{2}:\d{2}) |
| PU:Z | Unique identifier for the flow cell | [A-Z0-9_-]+ |
| LB:Z | Sample library identifier, set by the user when loading samples. Absent if not set. | [a-zA-Z0-9_\.-]+ |
| SM:Z | Barcode for the identified read. Only used when barcoding. | barcode([0-9]+) |
| al:Z | User-specified identifier used for the barcode, if available, otherwise the arrangement name. Included only if data is present and the arrangement is not "unclassified". | unclassified\|[A-Za-z0-9-_.]+ |
| pi:Z | Parent read ID for a split read | [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} |
| DS:Z | GPU device name(s) used during GPU-based calling | .* |
| ns:i | The number of samples in the signal | [1-9][0-9]* |
| qs:f | Read mean basecall Q-score | [0-9]+(\.)?[0-9]* |
| mx:i | The mux the read originated from | [1-9][0-9]* |
| rn:i | The channel the read originated from | [1-9][0-9]* |
| ts:i | Number of samples trimmed from the start of the signal | [1-9][0-9]* |
| sm:f | Scaling median | [0-9]+(\.)?[0-9]* |
| sd:f | Scaling dispersion (also sometimes referred to as "MAD" or "spread") | [0-9]+(\.)?[0-9]* |
| sv:Z | Scaling method used by the base caller | (med_mad\|quantile) |
| du:f | Duration of the read (in seconds) | [0-9]+(\.)?[0-9]* |
| dx:i | Duplex read indicator. 1 for duplex reads, 0 for simplex reads without duplex offspring, -1 for simplex reads with duplex offspring. | (-1\|0\|1) |
| pt:i | Estimated number of bases in the polyA/T tail. Only used if poly_a_tail_estimation is set. | [1-9][0-9]* |
| pa:B:i | Integer array of PolyA/T tail range information. Only used if poly_a_tail_estimation is set. | See Section 4.2.4 "Auxiliary data encoding" in the SAM Specification for details about its encoding |
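The tab-separated key:type:value layout can be split with ordinary string operations. The sketch below is an assumption-laden illustration rather than an Oxford Nanopore tool: the function name `parse_ont_tags` is hypothetical, only the i, f, Z, and B type codes from the table above are handled, and a malformed field raises `ValueError`.

```python
def parse_ont_tags(header):
    """Split a tab-separated header line into (read_id, {key: value})."""
    fields = header.rstrip("\n").split("\t")
    read_id = fields[0][1:] if fields[0].startswith("@") else fields[0]
    tags = {}
    for field in fields[1:]:
        # maxsplit=2 keeps colons inside values, such as RFC3339 timestamps.
        key, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[key] = int(value)
        elif type_code == "f":
            tags[key] = float(value)
        elif type_code == "B":
            # Array: the first item is the element subtype, the rest are values.
            subtype, *items = value.split(",")
            convert = float if subtype == "f" else int
            tags[key] = [convert(item) for item in items]
        else:
            # Z (string) and any unrecognized type are kept as text.
            tags[key] = value
    return read_id, tags
```

Storing the tags in a dict means the order in which they appear in the header does not matter to the caller.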
@<read_id>(\s<key>=<value>)*
| Element | Description | Matching Regex |
|---|---|---|
| read_id | UUID for the read | [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} |
| runid | UUID for the sequencing protocol | [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} |
| ch | The number of the channel the read was acquired on. The first channel is 1. | [0-9]+ |
| start_time | The time the read started in RFC3339 format. | ((?:(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2}(?:.\d+)?))(Z\|[+-]\d{2}:\d{2})?) |
| flow_cell_id | The human-readable identifier for the flow cell | [A-Z0-9_-]+ |
| protocol_group_id | Set by the user in the GUI as "Experiment ID". | [a-zA-Z0-9_\.-]+ |
| sample_id | Set by the user in the GUI as "Sample ID". | [a-zA-Z0-9_\.-]+ |
| barcode | The barcode assigned to this read by the basecaller (eg: barcode01). unclassified if no barcode was detected. | unclassified\|barcode([0-9]+) |
| barcode_alias | The user-supplied alias for the barcode. Empty if barcoding is not running. The same as barcode if the user did not supply an alias. | unclassified\|[A-Za-z0-9-_.]+ |
| parent_read_id | The read_id of the read which was the source of this FASTQ entry. This may be the same as the FASTQ entry id if no read splitting was performed for this read, or will be a new globally unique UUID value if this read was split out of another read by the basecaller. | [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} |
| basecall_model_version_id | The unique identifier for the basecall model used to generate this FASTQ file, as supplied by the basecaller | [a-z0-9_@\.]+ |
| basecall_gpu | A string description of the connected GPU. | .* |
These keys may not necessarily be in the order listed and should be treated as an unordered set of values. There may be additional patterns in the read header, so pattern-matching on only these keys, exclusively, should be avoided. See Sequencing Summary documentation for details on mapping these IDs across file formats.
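One way to honor the "unordered set" guidance is to collect every key=value pair into a dictionary and look up only the keys that are needed. The sketch below is illustrative, not an Oxford Nanopore tool: the function name `parse_ont_key_values` is hypothetical, and it assumes that a value never contains whitespace immediately followed by a word and an equals sign, which lets values such as a GPU description contain spaces.

```python
import re

# A key starts after whitespace and ends at "=".
KEY = re.compile(r"\s(\w+)=")

def parse_ont_key_values(header):
    """Split a key=value header line into (read_id, {key: value})."""
    header = header.rstrip("\n")
    matches = list(KEY.finditer(header))
    id_end = matches[0].start() if matches else len(header)
    read_id = header[1:id_end] if header.startswith("@") else header[:id_end]
    fields = {}
    for index, match in enumerate(matches):
        # A value runs until the next key, or to the end of the line.
        if index + 1 < len(matches):
            value_end = matches[index + 1].start()
        else:
            value_end = len(header)
        fields[match.group(1)] = header[match.end():value_end]
    return read_id, fields
```

Unknown keys are kept rather than rejected, and a missing key can be handled with `fields.get("barcode")`, so the parser does not depend on key order or on the list above being exhaustive.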
Pacific Biosciences
PacBio's obc2fastq tool can output different read header formats, depending on the indexing strategy and command line invocation [1]:
@<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI> <track>:<is_filtered>:<control_number>
@<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI1>+<UMI2> <track>:<is_filtered>:<control_number>
@<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI> <track>:<is_filtered>:<control_number>:<BC>
@<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI1>+<UMI2> <track>:<is_filtered>:<control_number>:<BC1>+<BC2>
| Element | Description | Matching Regex |
|---|---|---|
| <instrument> | Instrument ID | [A-Z0-9]+ |
| <run> | Run number | \d+ |
| <flow_cell> | Flow cell ID | [A-Za-z0-9-]+ |
| <lane> | Lane number | [12] |
| <swathtile> | Swath tile | \d+ |
| <x> | X coordinate of the polony | \d+ |
| <y> | Y coordinate of the polony | \d+ |
| <UMI>, <UMI1>, <UMI2> | Unique molecular index (single UMI, or UMI pairs) for the molecule | [ACGTN]+ |
| <track> | Track the read comes from, related to the cycle mask | [1-4] |
| <is_filtered> | A legacy filtering value of N that exists only for backward compatibility and does not change | N |
| <control_number> | A legacy control number of N that exists only for backward compatibility and does not change | N |
| <BC>, <BC1>, <BC2> | Sample barcode sequence(s) | [ACGTN]+ |
Singular Genomics
The header format depends on the sample indexing strategy.
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<is_control>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<is_control>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI1>+<UMI2> <read>:<is_filtered>:<is_control>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<is_control>:<BC>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<is_control>:<BC>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI1>+<UMI2> <read>:<is_filtered>:<is_control>:<BC>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<is_control>:<BC1>+<BC2>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<is_control>:<BC1>+<BC2>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI1>+<UMI2> <read>:<is_filtered>:<is_control>:<BC1>+<BC2>
| Element | Description | Matching Regex |
|---|---|---|
| <instrument> | Instrument ID | [A-Z0-9-]+ |
| <run> | Run number | [0-9]+ |
| <flow_cell> | Flow cell ID number from the EPROM | [A-Z0-9]+ |
| <lane> | Lane number the molecule was sequenced on | [1-4] |
| <tile> | Tile number | [0-9]+ |
| <x> | X coordinate of the sequencing cluster within the tile | [0-9]+ |
| <y> | Y coordinate of the sequencing cluster within the tile | [0-9]+ |
| <UMI>, <UMI1>, <UMI2> | Unique molecular identifier(s) | [ACGT]+ |
| <read> | Read number for paired-end sequencing. If single-end, this value is 1 | [12] |
| <is_filtered> | Did this read fail filtering? | [NY] |
| <is_control> | Boolean value if this read is part of an internal control (0 for no, 1 for yes, typically a PhiX molecule) | [01] |
| <BC>, <BC1>, <BC2> | Sample barcode sequence(s) | [ACGT]+ |
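The nine Singular Genomics variants differ only in whether the UMI and barcode fields are absent, single, or paired, so a single expression with optional groups can cover them. The following is a sketch assembled from the expressions in the table above, not code from Singular Genomics; the function name, group names, and example headers are hypothetical.

```python
import re

SINGULAR_HEADER = re.compile(
    r"@(?P<instrument>[A-Z0-9-]+):(?P<run>[0-9]+):(?P<flow_cell>[A-Z0-9]+)"
    r":(?P<lane>[1-4]):(?P<tile>[0-9]+):(?P<x>[0-9]+):(?P<y>[0-9]+)"
    # Optional UMI, or UMI pair joined by "+".
    r"(?::(?P<umi1>[ACGT]+)(?:\+(?P<umi2>[ACGT]+))?)?"
    r" (?P<read>[12]):(?P<is_filtered>[NY]):(?P<is_control>[01])"
    # Optional sample barcode, or barcode pair joined by "+".
    r"(?::(?P<bc1>[ACGT]+)(?:\+(?P<bc2>[ACGT]+))?)?"
)

def parse_singular_header(line):
    """Return the header fields as a dict, or None if the line does not match.

    Fields that are absent from the header are returned as None.
    """
    match = SINGULAR_HEADER.fullmatch(line.strip())
    return match.groupdict() if match else None

# Hypothetical headers: no UMI or barcode, then paired UMIs and barcodes.
plain = parse_singular_header("@G4-001:12:ABC123:2:105:1000:2000 1:N:0")
dual = parse_singular_header(
    "@G4-001:12:ABC123:2:105:1000:2000:ACGT+TTAA 2:N:0:GGCC+AATT"
)
```

In `plain`, the `umi1`, `umi2`, `bc1`, and `bc2` entries are all `None`; in `dual`, they are `"ACGT"`, `"TTAA"`, `"GGCC"`, and `"AATT"`.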
Ultima Genomics
Ultima machines natively output sequencing files in the CRAM format, and normally skip FASTQs. However, FASTQs can be created from the output CRAM file. The header format is:
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<control_number>:<UMI1>+<UMI2>
| Element | Description | Matching Regex |
|---|---|---|
| <instrument> | Instrument ID | [A-Za-z0-9_-]+ |
| <run> | Run number on instrument | \d+ |
| <flow_cell> | Flow cell ID | [A-Z]+ |
| <lane> | Lane number | [1-4] |
| <tile> | Tile number | \d+ |
| <x> | X coordinate of cluster | \d+ |
| <y> | Y coordinate of cluster | \d+ |
| <read> | Read number for paired-end reads | [12] |
| <is_filtered> | Y if the read is filtered (did not pass), N otherwise | [YN] |
| <control_number> | 0 when none of the control bits are on, otherwise it is an even number | \d+ |
| <UMI1>, <UMI2> | Unique molecular identifiers from the 5' and 3' ends, respectively | [ACGTN]+ |
References
Element Biosciences
Illumina
- Illumina Connected Software
- Detect the Illumina sequencer model
- Biostars post
- 10X Genomics supernova
- NovaSeq 6000 Integrations v3
- The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants
MGI/BGI
- Qiagen CLC Genomics Workbench: MGI/BGI Data
- European Nucleotide Archive: Project PRJEB15427
- European Nucleotide Archive: Project PRJEB19426
Oxford Nanopore
- Oxford Nanopore Output Specifications: FASTQ (v26.01)
- Oxford Nanopore Output Specifications: FASTQ (v25.09)
- Oxford Nanopore Output Specifications: Patterns
- Oxford Nanopore Output Specifications: Sequencing Summary
Pacific Biosciences
Singular Genomics
- Singular Genomics FASTQ Data Format
- Singular Genomics What Are FASTQ Files?
- How are PhiX reads processed by G4?
- singular-demux
- Demultiplexing Guide for G4 Sequencing Platform
Ultima Genomics
- Qiagen CLC Genomics Workbench: Ultima Genomics Data
- European Nucleotide Archive: Experiment ERX11081003. Inspecting uploaded FASTQs from the experiment.
[1] obc2fastq v6.1 documentation actually states <UMI>+<track> for single UMI reads, but this is most likely a typo. v6.0 states <UMI> <track>, and this format is consistent with the dual UMI read.