Quality scores
Given an assertion, \(A\), the quality score, \(Q(A)\), is a function of the probability that the true base call is different from the assertion, i.e. \(\mathbb{P}(\neg A)\). The most common relationship between \(Q(A)\) and \(\mathbb{P}(\neg A)\) is:
$$ Q(A) = -10 \log_{10} \mathbb{P}(\neg A) $$
where \(\mathbb{P}(\neg A)\) is the estimated probability of an assertion \(A\) being wrong. This is called the "Phred scale" and is sometimes denoted as \(Q_{Phred}(A)\).
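As a quick numeric check of the Phred relationship, \(Q(A) = -10 \log_{10} \mathbb{P}(\neg A)\), the following minimal Python sketch converts between error probabilities and Phred scores. The function names are illustrative and do not come from any particular library.

```python
import math


def phred_from_error_prob(p_error: float) -> float:
    """Return the Phred score Q = -10 * log10(p) for an error probability p."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_prob_from_phred(q: float) -> float:
    """Invert the Phred relationship: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# An error probability of 1 in 1000 corresponds to a score of about 30,
# and a score of 20 corresponds to an error probability of about 0.01.
q30 = phred_from_error_prob(0.001)
p20 = error_prob_from_phred(20)
```

Each increase of 10 in the score corresponds to a tenfold decrease in the estimated error probability.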
Sequencing instruments from the early 2000s used a slightly different scale that made use of the log odds:
$$ Q(A) = -10 \log_{10} \frac{ \mathbb{P}(\neg A) }{ 1 - \mathbb{P}(\neg A) } $$
This is called the "Solexa scale" and is sometimes denoted as \(Q_{Solexa}(A)\).
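The log-odds definition can be sketched in Python as follows. Eliminating \(\mathbb{P}(\neg A)\) between the Solexa and Phred definitions gives \(Q_{Phred} = 10 \log_{10}(10^{Q_{Solexa}/10} + 1)\), which the last function uses. The function names are hypothetical, chosen for illustration only.

```python
import math


def solexa_from_error_prob(p_error: float) -> float:
    """Return the Solexa score Q = -10 * log10(p / (1 - p))."""
    if not 0.0 < p_error < 1.0:
        raise ValueError("error probability must be in (0, 1)")
    return -10.0 * math.log10(p_error / (1.0 - p_error))


def error_prob_from_solexa(q: float) -> float:
    """Invert the Solexa relationship: p = 1 / (1 + 10 ** (Q / 10))."""
    return 1.0 / (1.0 + 10.0 ** (q / 10.0))


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to the Phred score with the same error probability."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


# A Solexa score of 0 means even odds (p = 0.5), which is a Phred score of
# roughly 3. At high scores the two scales nearly coincide.
phred_at_zero = solexa_to_phred(0)
phred_at_forty = solexa_to_phred(40)
```

Unlike Phred scores, Solexa scores can be negative, because the odds exceed 1 whenever the error probability is above 0.5.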
In FASTQ files, each quality score is encoded as a single-byte ASCII character, so that the quality line has the same length as the sequence line.
The Q-score is equal to the ASCII character code value minus an offset that differs across manufacturers and instruments.
A scale is named by combining the Q-score transformation name with the offset, i.e. the ASCII value that encodes a quality score of 0.
For example, \(Q_{Phred}\) = ASCII value - 33 is the "Phred33" scale.
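Decoding a FASTQ quality line then amounts to subtracting the offset from each character code. The sketch below assumes the offset is already known (33 or 64); the helper names are illustrative.

```python
from typing import Iterable, List


def decode_qualities(quality_line: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality line: Q = ASCII code - offset, one score per character."""
    return [ord(ch) - offset for ch in quality_line]


def encode_qualities(scores: Iterable[int], offset: int = 33) -> str:
    """Encode scores as ASCII characters: ASCII code = Q + offset."""
    return "".join(chr(q + offset) for q in scores)


# The same scores look different under the two offsets.
phred33 = decode_qualities("!5I", offset=33)   # [0, 20, 40]
phred64 = decode_qualities("@Th", offset=64)   # [0, 20, 40]
```

Decoding with the wrong offset shifts every score by 31, so the offset has to be established before the scores are interpreted.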
| Platform | Scale | Offset | ASCII range | Q range |
|---|---|---|---|---|
| Element Biosciences | Phred | 33 | [!, Y] | [0, 56] |
| Solexa | Solexa | 64 | [;, ~] | [-5, 62] |
| Illumina 1.2 and earlier | Solexa | 64 | [;, ~] | [-5, 62] |
| Illumina 1.3, 1.4 | Phred | 64 | [@, ~] | [0, 62] |
| Illumina 1.5, 1.6, 1.7 | Phred | 64 | [C, ~] | [3, 62] |
| Illumina 1.8 and later | Phred | 33 | [!, ~] | [0, 93] |
| MGI / BGI | Phred | 33 | [!, ~] | [0, 93] |
| NCBI / Sanger | Phred | 33 | [!, ~] | [0, 93] |
| Oxford Nanopore | Phred | 33 | [!, ~] | [0, 93] |
| Pacific Biosciences | Phred | 33 | [!, ~] | [0, 93] |
| Singular Genomics | Phred | 33 | [!, ~] | [0, 93] |
| Ultima Genomics | Phred | 33 | [!, ~] | [0, 93] |
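Because the ASCII ranges in the table overlap, the characters observed in a file can only narrow down the encoding, not always identify it. The following sketch is a simple heuristic, not a definitive detector: it lists the encodings whose full ASCII range contains every observed character. The `ENCODINGS` list is an abridged, hypothetical summary of the table, and the function name is illustrative.

```python
from typing import Iterable, List

# (name, offset, lowest character, highest character), abridged from the table.
ENCODINGS = [
    ("Phred33", 33, "!", "~"),
    ("Solexa64", 64, ";", "~"),
    ("Phred64", 64, "@", "~"),
]


def candidate_encodings(quality_lines: Iterable[str]) -> List[str]:
    """Return the names of encodings whose ASCII range covers all observed characters."""
    chars = {ch for line in quality_lines for ch in line}
    if not chars:
        return []
    lowest, highest = min(chars), max(chars)
    return [
        name
        for name, _offset, first, last in ENCODINGS
        if first <= lowest and highest <= last
    ]


# Characters below ';' rule out both offset-64 encodings.
only_phred33 = candidate_encodings(["!5I"])
# High characters alone are ambiguous: every encoding remains possible.
ambiguous = candidate_encodings(["hhh"])
```

An ambiguous result is common for short or uniformly high-quality inputs, so in practice the platform and software version remain the more reliable source of the encoding.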
References
- Peter A. J. Cock, et al., "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants", Nucleic Acids Research, 2009
- Illumina: Connected Software
- Qiagen CLC Genomics Workbench: Quality scores on the Illumina platform
- Element Biosciences: bases2fastq
- Pacific Biosciences: obc2fastq Reference Guide v6.0
- Oxford Nanopore: Data Analysis
- Singular Genomics: FASTQ Data Format