We present a specialized compressor designed for efficient storage of FASTQ files produced by high-throughput DNA sequencers. Because the method has been optimized for compression quality, it is especially suitable for long-term storage and for genome research centers that process huge amounts of data (counted in petabytes). The proposed compressor uses high-order statistical models for range encoding. These models are similar to Markov models, but the whole input is considered when building a symbol context. DNA reads are compressed in an LZ style using a 5th- to 7th-order model, while the nucleotides' quality scores are encoded with a 3rd-order model.
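To make the idea of an order-k statistical model concrete, the following minimal sketch estimates how many bits a range coder would need when each symbol is predicted from the k symbols preceding it. It is a plain adaptive context model with add-one smoothing, not the authors' compressor: the function name `estimated_bits`, the smoothing rule, and the sample read are all assumptions made for illustration, and the sketch does not reproduce the paper's use of the whole input when building a context.

```python
import math
from collections import defaultdict


def estimated_bits(symbols, order, alphabet):
    """Ideal code length in bits of `symbols` under an adaptive order-`order` model.

    Hypothetical helper for illustration only. Each symbol is predicted from
    the `order` symbols that precede it, using the counts seen so far with
    add-one smoothing. A range coder driven by the same probabilities would
    emit roughly this many bits.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> total count
    size = len(alphabet)
    bits = 0.0
    for i, sym in enumerate(symbols):
        ctx = symbols[max(0, i - order):i]          # up to `order` preceding symbols
        p = (counts[ctx][sym] + 1) / (totals[ctx] + size)
        bits -= math.log2(p)
        counts[ctx][sym] += 1                       # update the model after coding
        totals[ctx] += 1
    return bits


# Assumed toy input: a highly repetitive read, so longer contexts predict well.
read = "ACGT" * 50
for k in (0, 3):
    print(f"order {k}: about {estimated_bits(read, k, 'ACGT'):.0f} bits")
```

On repetitive input such as this toy read, the order-3 estimate is far smaller than the order-0 estimate, which is the effect that motivates using high-order models for reads and quality scores.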
DOI : 10.1051/ro/2015039
Keywords: High-throughput DNA sequencing, data compression, FASTQ files
Chlopkowski, Marek 1 ; Antczak, Maciej 1 ; Slusarczyk, Michal 1 ; Wdowinski, Aleksander 1 ; Zajaczkowski, Michal 1 ; Kasprzak, Marta 1, 2
@article{RO_2016__50_2_351_0,
author = {Chlopkowski, Marek and Antczak, Maciej and Slusarczyk, Michal and Wdowinski, Aleksander and Zajaczkowski, Michal and Kasprzak, Marta},
title = {High-order statistical compressor for long-term storage of {DNA} sequencing data},
journal = {RAIRO - Operations Research - Recherche Op\'erationnelle},
pages = {351--361},
year = {2016},
publisher = {EDP Sciences},
volume = {50},
number = {2},
doi = {10.1051/ro/2015039},
mrnumber = {3479875},
language = {en},
url = {https://www.numdam.org/articles/10.1051/ro/2015039/}
}
Chlopkowski, Marek; Antczak, Maciej; Slusarczyk, Michal; Wdowinski, Aleksander; Zajaczkowski, Michal; Kasprzak, Marta. High-order statistical compressor for long-term storage of DNA sequencing data. RAIRO - Operations Research - Recherche Opérationnelle, Special issue: Research on Optimization and Graph Theory dedicated to COSI 2013 / Special issue: Recent Advances in Operations Research in Computational Biology, Bioinformatics and Medicine, Tome 50 (2016) no. 2, pp. 351-361. doi: 10.1051/ro/2015039