
Read headers

Many high-throughput sequencers produce FASTQ files. While they all adhere to the FASTQ format specification, different machines produce read headers with slightly different formats. This page is a detailed, referenced list of those formats.

Element Biosciences

Element Biosciences has machines that can produce FASTQ files both for DNA sequences and for cytoprofiling. The two FASTQs have slightly different header formats.

For DNA sequencing, the header format depends on the sample indexing strategy.

@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<control_number>:<sample_number>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<control_number>:<BC>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<control_number>:<BC1>+<BC2>

The values are described in the table below.

Element Description Matching Regex
<instrument> Instrument name [A-Za-z0-9_]+
<run> Run name [A-Za-z0-9_-]+
<flow_cell> Flow cell ID from the barcode scan. If the barcode scan fails during the run and no barcode is present, then the run ID replaces the flow cell ID. [A-Za-z0-9]+
<lane> Lane number [12]
<tile> Tile number \d+
<x> X coordinate of the polony (0+)?\d+
<y> Y coordinate of the polony (0+)?\d+
<UMI> UMI sequence with a plus sign that separates the Read 1 and Read 2 sequences, if applicable [ACGTN]+
<read> Read number [12]
<is_filtered> A legacy filtering value of N that exists only for backward compatibility and does not change N
<control_number> A legacy control number of 0 that exists only for backward compatibility and does not change 0
<BC>, <BC1>, <BC2> Sample barcode sequence(s) [ACGT]+
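
As an illustration, the per-field regexes in the table can be joined into one pattern that covers all three DNA header variants. This is a minimal sketch, not vendor code: the function name, the example header, and the `\d+` pattern for `<sample_number>` are assumptions (the table does not list a regex for it), and the UMI group allows an optional `+`-separated second sequence, following the description of `<UMI>`.

```python
import re

# Pattern assembled from the table above. The \d+ alternative for
# <sample_number> is an assumption; the table gives no regex for it.
ELEMENT_DNA_HEADER = re.compile(
    r"^@(?P<instrument>[A-Za-z0-9_]+)"
    r":(?P<run>[A-Za-z0-9_-]+)"
    r":(?P<flow_cell>[A-Za-z0-9]+)"
    r":(?P<lane>[12])"
    r":(?P<tile>\d+)"
    r":(?P<x>\d+)"
    r":(?P<y>\d+)"
    r":(?P<umi>[ACGTN]+(?:\+[ACGTN]+)?)"
    r" (?P<read>[12])"
    r":(?P<is_filtered>N)"
    r":(?P<control_number>0)"
    r":(?P<sample>\d+|[ACGT]+(?:\+[ACGT]+)?)$"
)


def parse_element_dna_header(header):
    """Return the header fields as a dict of strings, or None if there is no match."""
    match = ELEMENT_DNA_HEADER.match(header.strip())
    return match.groupdict() if match else None


# Hypothetical header, made up for this example.
example = "@AV234501:RunA_01:2345678901:1:10102:0123:0456:ACGTN 1:N:0:ACGTACGT+TTGGCCAA"
fields = parse_element_dna_header(example)
print(fields["flow_cell"], fields["umi"], fields["sample"])
```

All values are kept as strings so that the zero padding of `<x>` and `<y>` is preserved.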

For cytoprofiling, the header format is:

@<instrument>:<run>:<flow_cell>:<well_index>:<z_tile>:<x>:<y>:<cell_id>:<nuclear_status>:<batch_id> <read>:<is_filtered>:<control_number>:<BC>

The values are described in the table below.

Element Description Matching Regex
<instrument> Instrument name [A-Za-z0-9_-]+
<run> Run name [A-Za-z0-9-]+
<flow_cell> Flow cell ID, R<run>, or UNKNOWN_FLOWCELL [A-Za-z0-9-]+
<well_index> Well ID \d+
<z_tile> The tile plus the z slice in the format of SRRCCZZ (surface, row, row, col, col, z, z) \d{7}
<x> X coordinate of the polony, 0-padded (0+)?\d+
<y> Y coordinate of the polony, 0-padded (0+)?\d+
<cell_id> Unique ID of the cell \d+
<nuclear_status> The cellular location of a polony, whether the polony is in the nucleus or not. 0 is cellular and 1 is nuclear. [01]
<batch_id> The on-instrument cycling batch \w+
<read> Always 1 for cytoprofiling 1
<is_filtered> A legacy filtering value of N that exists only for backward compatibility and does not change N
<control_number> A legacy control number of 0 that exists only for backward compatibility and does not change 0
<BC> Always 1 for cytoprofiling 1
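
The `<z_tile>` value packs four numbers into seven digits (SRRCCZZ). A minimal sketch of unpacking it is shown below; the function name and the returned key names are made up for this example.

```python
import re


def split_z_tile(z_tile):
    """Split a 7-digit SRRCCZZ string into surface, row, column and z slice."""
    if not re.fullmatch(r"[0-9]{7}", z_tile):
        raise ValueError(f"expected 7 digits, got {z_tile!r}")
    return {
        "surface": int(z_tile[0]),    # S
        "row": int(z_tile[1:3]),      # RR
        "column": int(z_tile[3:5]),   # CC
        "z": int(z_tile[5:7]),        # ZZ
    }


print(split_z_tile("1020304"))
```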

Illumina

The header format is:

@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<control_number>:<sample_number>
Element Description Matching Regex
<instrument> Instrument ID [A-Za-z0-9_]+
<run> Run number on instrument \d+
<flow_cell> Flow cell ID [A-Za-z0-9]+
<lane> Lane number \d+
<tile> Tile number \d+
<x> X coordinate of cluster \d+
<y> Y coordinate of cluster \d+
<read> Read number. 1 for single-end reads or Read 1 of paired-end, 2 for Read 2 of paired-end. [12]
<is_filtered> Y if the read is filtered (did not pass), N otherwise [YN]
<control_number> 0 when none of the control bits are on, otherwise it is an even number \d+
<sample_number> Sample number from sample sheet \d+

On HiSeq X systems, control specification is not performed and <control_number> is always 0. On other systems, if the read is identified as a control, the number is greater than zero and its value specifies the type of control. The value is the decimal representation of a bit-wise encoding scheme, in which bit 0 has a decimal value of 1, bit 1 a value of 2, bit 2 a value of 4, and so on.
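
The bit-wise encoding can be unpacked with ordinary integer operations. The sketch below only lists which bit positions are set; the function name is illustrative, and it makes no attempt to say what each bit means, since that is not covered here.

```python
def control_bits(control_number):
    """Return the positions of the bits that are set in control_number."""
    if control_number < 0:
        raise ValueError("control number must be non-negative")
    return [
        bit
        for bit in range(control_number.bit_length())
        if (control_number >> bit) & 1
    ]


print(control_bits(0))   # no control bits set
print(control_bits(18))  # 18 = 2 + 16, so bits 1 and 4 are set
```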

MGI / BGI

The header format is:

@<instrument><lane><column><row>_<count>/<read>
Element Description Matching Regex
<instrument> Instrument ID CL\d+
<lane> Lane number L\d
<column> DNA ball column index C\d{3}
<row> DNA ball row index R\d{3}
<count> Read index, starting at 0 \d+
<read> 1 for forward read, 2 for reverse read [12]
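
Because the MGI / BGI header has no separators between its first four elements, a regex with fixed-width column and row groups is a convenient way to split it. This is a minimal sketch built from the table above; the function name and the example header are hypothetical.

```python
import re

# Pattern assembled from the table above.
MGI_HEADER = re.compile(
    r"^@(?P<instrument>CL\d+)"
    r"(?P<lane>L\d)"
    r"(?P<column>C\d{3})"
    r"(?P<row>R\d{3})"
    r"_(?P<count>\d+)"
    r"/(?P<read>[12])$"
)


def parse_mgi_header(header):
    """Return the header fields as a dict of strings, or None if there is no match."""
    match = MGI_HEADER.match(header.strip())
    return match.groupdict() if match else None


# Hypothetical header, made up for this example.
fields = parse_mgi_header("@CL100012345L1C001R002_34/1")
print(fields)
```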

Oxford Nanopore

Oxford Nanopore's <read_id> is unique among sequencing platforms, in that the ID is not constructed out of instrument, run, or other metadata. The <read_id> is a universally unique identifier (UUID), as are other metadata elements in the header.

@<read_id>(\t<key>:<type>:<value>)*
Key:Type Description Matching Regex
RG:Z Read group ID the read originates from ([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})([a-z0-9@\.])+(([A-Za-z0-9@\.]+))?
DT:Z The protocol start time of the sequencing run, formatted as RFC3339. \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})
ch:i Number of the channel the read was acquired on. [1-9][0-9]*
st:Z Read start time, formatted as RFC3339. If this read is a split from a parent, this is the start time for the split read. \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})
PU:Z Unique identifier for the flow cell [A-Z0-9_-]+
LB:Z Sample library identifier, set by the user when loading samples. Absent if not set. [a-zA-Z0-9_\.-]+
SM:Z Barcode for the identified read. Only used when barcoding. barcode([0-9]+)
al:Z User-specified identifier used for the barcode, if available, otherwise the arrangement name. Included only if data is present and the arrangement is not "unclassified". unclassified|[A-Za-z0-9-_.]+
pi:Z Parent read ID for a split read [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
DS:Z GPU device name(s) used during GPU-based calling .*
ns:i The number of samples in the signal [1-9][0-9]*
qs:f Read mean basecall Q-score [0-9]+(\.)?[0-9]*
mx:i The mux the read originated from [1-9][0-9]*
rn:i The channel the read originated from [1-9][0-9]*
ts:i Number of samples trimmed from the start of the signal [1-9][0-9]*
sm:f Scaling median [0-9]+(\.)?[0-9]*
sd:f Scaling dispersion (also sometimes referred to as "MAD" or "spread") [0-9]+(\.)?[0-9]*
sv:Z Scaling method used by the base caller (med_mad|quantile)
du:f Duration of the read (in seconds) [0-9]+(\.)?[0-9]*
dx:i Duplex read indicator. 1 for duplex reads, 0 for simplex reads without duplex offspring, -1 for simplex reads with duplex offspring. (-1|0|1)
pt:i Estimated number of bases in the polyA/T tail. Only used if poly_a_tail_estimation is set. [1-9][0-9]*
pa:B:i Integer array of PolyA/T tail range information. Only used if poly_a_tail_estimation is set. See Section 4.2.4 "Auxiliary data encoding" in the SAM Specification for details about its encoding
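
The tab-separated `<key>:<type>:<value>` form can be split without a regex, converting each value according to its type code. This is a minimal sketch under a few assumptions: the function name and example header are made up, only the type codes that appear in the table (Z, i, f, B) are handled specially, and a B value is assumed to use the SAM text form of a subtype letter followed by comma-separated numbers.

```python
def parse_ont_tagged_header(header):
    """Split '@<read_id>\\t<key>:<type>:<value>...' into (read_id, tags)."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    read_id, *fields = header[1:].rstrip("\n").split("\t")
    tags = {}
    for field in fields:
        # maxsplit=2 keeps the colons inside timestamps intact.
        key, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[key] = int(value)
        elif type_code == "f":
            tags[key] = float(value)
        elif type_code == "B":
            subtype, *items = value.split(",")
            convert = float if subtype == "f" else int
            tags[key] = [convert(item) for item in items]
        else:
            tags[key] = value  # Z and any other type: keep the raw string
    return read_id, tags


# Hypothetical header, made up for this example.
example = (
    "@0a1b2c3d-0000-4000-8000-0123456789ab"
    "\tch:i:12"
    "\tst:Z:2024-01-01T00:00:00.000+00:00"
    "\tqs:f:14.5"
    "\tpa:B:i,10,42"
)
read_id, tags = parse_ont_tagged_header(example)
print(read_id, tags)
```

Storing the tags in a dict means the parser does not depend on the order in which they appear.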

The alternative header format uses <key>=<value> pairs:

@<read_id>(\s<key>=<value>)*
Element Description Matching Regex
read_id UUID for the read [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
runid UUID for the sequencing protocol [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
ch The number of the channel the read was acquired on. The first channel is 1. [0-9]+
start_time The time the read started in RFC3339 format. ((?:(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2}(?:.\d+)?))(Z|[+-]\d{2}:\d{2})?)
flow_cell_id The human-readable identifier for the flow cell [A-Z0-9_-]+
protocol_group_id Set by the user in the GUI as "Experiment ID". [a-zA-Z0-9_\.-]+
sample_id Set by the user in the GUI as "Sample ID". [a-zA-Z0-9_\.-]+
barcode The barcode assigned to this read by the basecaller (e.g. barcode01). unclassified if no barcode was detected. unclassified|barcode([0-9]+)
barcode_alias The user-supplied alias for the barcode. Empty if barcoding is not running. The same as barcode if the user did not supply an alias. unclassified|[A-Za-z0-9-_.]+
parent_read_id The read_id of the read which was the source of this FASTQ entry. This may be the same as the FASTQ entry id if no read splitting was performed for this read, or will be a new globally unique UUID value if this read was split out of another read by the basecaller. [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
basecall_model_version_id The unique identifier for the basecall model used to generate this FASTQ file, as supplied by the basecaller [a-z0-9_@\.]+
basecall_gpu A string description of the connected GPU. .*

These keys are not necessarily in the order listed and should be treated as an unordered set of values. The read header may contain additional keys, so avoid pattern matching that assumes only these keys are present. See Sequencing Summary documentation for details on mapping these IDs across file formats.
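
One way to follow this advice for the `<key>=<value>` form is to collect the pairs into a dictionary and look up only the keys of interest. The sketch below is illustrative: the function name and example header are made up, and it assumes values contain no whitespace, which may not hold for free-text fields such as basecall_gpu.

```python
def parse_ont_key_value_header(header):
    """Split '@<read_id> key=value ...' into (read_id, attributes)."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    read_id, *fields = header[1:].split()
    attributes = {}
    for field in fields:
        key, separator, value = field.partition("=")
        if separator:  # skip tokens that are not key=value pairs
            attributes[key] = value
    return read_id, attributes


# Hypothetical header, made up for this example. Key order is arbitrary and
# an unknown key is present, which the parser simply keeps.
example = (
    "@0a1b2c3d-0000-4000-8000-0123456789ab "
    "ch=7 barcode=barcode01 "
    "runid=11111111-2222-4333-8444-555555555555 "
    "some_future_key=x"
)
read_id, attributes = parse_ont_key_value_header(example)
print(attributes.get("barcode"), attributes.get("flow_cell_id"))
```

Using `dict.get` returns `None` for keys that are absent, so optional fields do not need special handling.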

Pacific Biosciences

PacBio's obc2fastq tool can output different read header formats, depending on the indexing strategy and command line invocation (see footnote 1):

@<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI> <track>:<is_filtered>:<control_number>
@<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI1>+<UMI2> <track>:<is_filtered>:<control_number>
@<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI> <track>:<is_filtered>:<control_number>:<BC>
@<instrument>:<run>:<flow_cell>:<lane>:<swathtile>:<x>:<y>:<UMI1>+<UMI2> <track>:<is_filtered>:<control_number>:<BC1>+<BC2>
Element Description Matching Regex
<instrument> Instrument ID [A-Z0-9]+
<run> Run number \d+
<flow_cell> Flow cell ID [A-Za-z0-9-]+
<lane> Lane number [12]
<swathtile> Swath tile \d+
<x> X coordinate of the polony \d+
<y> Y coordinate of the polony \d+
<UMI>, <UMI1>, <UMI2> Unique molecular index (single UMI, or UMI pairs) for the molecule [ACGTN]+
<track> Track the read comes from, related to the cycle mask [1-4]
<is_filtered> A legacy filtering value of N that exists only for backward compatibility and does not change N
<control_number> A legacy control number of N that exists only for backward compatibility and does not change N
<BC>, <BC1>, <BC2> Sample barcode sequence(s) [ACGTN]+

Singular Genomics

The header format depends on the sample indexing strategy.

@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<is_control>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<is_control>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI1>+<UMI2> <read>:<is_filtered>:<is_control>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<is_control>:<BC>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<is_control>:<BC>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI1>+<UMI2> <read>:<is_filtered>:<is_control>:<BC>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<is_control>:<BC1>+<BC2>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<is_filtered>:<is_control>:<BC1>+<BC2>
@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y>:<UMI1>+<UMI2> <read>:<is_filtered>:<is_control>:<BC1>+<BC2>
Element Description Matching Regex
<instrument> Instrument ID [A-Z0-9-]+
<run> Run number [0-9]+
<flow_cell> Flow cell ID number from the EPROM [A-Z0-9]+
<lane> Lane number the molecule was sequenced on [1-4]
<tile> Tile number [0-9]+
<x> X coordinate of the sequencing cluster within the tile [0-9]+
<y> Y coordinate of the sequencing cluster within the tile [0-9]+
<UMI>, <UMI1>, <UMI2> Unique molecular identifier(s) [ACGT]+
<read> Read number for paired-end sequencing. If single-end, this value is 1 [12]
<is_filtered> Did this read fail filtering? [NY]
<is_control> Boolean value indicating whether this read is part of an internal control (0 for no, 1 for yes, typically a PhiX molecule) [01]
<BC>, <BC1>, <BC2> Sample barcode sequence(s) [ACGT]+
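
The nine header variants differ only in whether a UMI field and a barcode field are present, and whether each holds one sequence or a `+`-separated pair. A single pattern with two optional groups can therefore cover all of them. This is a minimal sketch built from the table above; the function name and the example headers are hypothetical.

```python
import re

# Pattern assembled from the table above; UMI and barcode groups are optional.
SINGULAR_HEADER = re.compile(
    r"^@(?P<instrument>[A-Z0-9-]+)"
    r":(?P<run>[0-9]+)"
    r":(?P<flow_cell>[A-Z0-9]+)"
    r":(?P<lane>[1-4])"
    r":(?P<tile>[0-9]+)"
    r":(?P<x>[0-9]+)"
    r":(?P<y>[0-9]+)"
    r"(?::(?P<umis>[ACGT]+(?:\+[ACGT]+)?))?"
    r" (?P<read>[12])"
    r":(?P<is_filtered>[NY])"
    r":(?P<is_control>[01])"
    r"(?::(?P<barcodes>[ACGT]+(?:\+[ACGT]+)?))?$"
)


def parse_singular_header(header):
    """Return the header fields, with UMIs and barcodes as lists, or None."""
    match = SINGULAR_HEADER.match(header.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Absent optional groups are None; normalise them to empty lists.
    for key in ("umis", "barcodes"):
        fields[key] = fields[key].split("+") if fields[key] else []
    return fields


# Hypothetical headers, made up for this example.
plain = parse_singular_header("@G4-001:12:ABC123:2:1101:345:678 1:N:0")
dual = parse_singular_header(
    "@G4-001:12:ABC123:2:1101:345:678:ACGT+TTAA 2:N:0:AAGGCCTT+CCTTAAGG"
)
print(plain["umis"], plain["barcodes"])
print(dual["umis"], dual["barcodes"])
```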

Ultima Genomics

Ultima machines natively output sequencing files in the CRAM format, and normally skip FASTQs. However, FASTQs can be created from the output CRAM file. The header format is:

@<instrument>:<run>:<flow_cell>:<lane>:<tile>:<x>:<y> <read>:<is_filtered>:<control_number>:<UMI1>+<UMI2>
Element Description Matching Regex
<instrument> Instrument ID [A-Za-z0-9_-]+
<run> Run number on instrument \d+
<flow_cell> Flow cell ID [A-Z]+
<lane> Lane number [1-4]
<tile> Tile number \d+
<x> X coordinate of cluster \d+
<y> Y coordinate of cluster \d+
<read> Read number for paired-end reads [12]
<is_filtered> Y if the read is filtered (did not pass), N otherwise [YN]
<control_number> 0 when none of the control bits are on, otherwise it is an even number \d+
<UMI1>, <UMI2> Unique molecular identifiers from the 5' and 3' ends, respectively [ACGTN]+

References

Element Biosciences

  1. bases2fastq documentation

Illumina

  1. Illumina Connected Software
  2. Detect the Illumina sequencer model
  3. Biostars post
  4. 10X Genomics supernova
  5. NovaSeq 6000 Integrations v3
  6. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

MGI/BGI

  1. Qiagen CLC Genomics Workbench: MGI/BGI Data
  2. European Nucleotide Archive: Project PRJEB15427
  3. European Nucleotide Archive: Project PRJEB19426

Oxford Nanopore

  1. Oxford Nanopore Output Specifications: FASTQ (v26.01)
  2. Oxford Nanopore Output Specifications: FASTQ (v25.09)
  3. Oxford Nanopore Output Specifications: Patterns
  4. Oxford Nanopore Output Specifications: Sequencing Summary

Pacific Biosciences

  1. obc2fastq v6.1 documentation
  2. obc2fastq v6.0 documentation
  3. PacBio BAM format specification

Singular Genomics

  1. Singular Genomics FASTQ Data Format
  2. Singular Genomics What Are FASTQ Files?
  3. How are PhiX reads processed by G4?
  4. singular-demux
  5. Demultiplexing Guide for G4 Sequencing Platform

Ultima Genomics

  1. Qiagen CLC Genomics Workbench: Ultima Genomics Data
  2. European Nucleotide Archive: Experiment ERX11081003. Inspecting uploaded FASTQs from the experiment.

  1. The obc2fastq v6.1 documentation states <UMI>+<track> for single-UMI reads, but this is most likely a typo: v6.0 states <UMI> <track>, which is consistent with the dual-UMI format.