Data Formats in Bioinformatics & Where to Find Them

Data format
• Definition of the structure of data within a database or file system that gives the information its meaning

Types of data formats
• Sequence formats
• Alignment formats
• Generic feature formats
• Annotation formats
• Protein structure formats
• Index formats
• ...
• Text formats (human readable)
• Binary formats
• Flat formats
• Open formats
• Vendor-lock formats

Sequence formats
• FASTA
  • Plain sequence
  • Contains a header
  • .fasta, .fa, .fna, .faa, .frn

Sequence formats
• FASTQ
  • Sequencing format
  • Contains a quality string (Phred scores)
  • .fastq, .fq

Sequence formats
• GenBank
  • Contains additional information about the sequence
  • .genbank, .gb

Alignment formats
• SAM
  • Sequence Alignment/Map
  • Contains additional information
• BAM
  • Binary version of SAM
• CRAM
  • Compressed alternative to BAM (reference-based, lossless by default)
• .sam, .bam, .cram

Alignment formats
• Clustal
  • Clustal Omega
  • Multiple sequence alignment
  • .clustal, .aln

Generic feature formats
• GTF
• GFF
• GFF3
• Describe genes and other features
• Beware of the version!
• .gtf, .gff, .gff3

Annotation formats
• VCF
  • Variant Call Format
  • Information about variants such as SNPs
  • Genotyping projects
  • .vcf
• BCF
  • Binary version of VCF

Protein structure formats
• PDB
  • Information about protein atoms
  • .pdb

Index formats
• For quicker searching in bioinformatics formats
• Software dependent
• Binary structure such as a hash table, suffix tree, k-mer composition, …
• .fai, .bai, .crai, .index, ...

Data compression
• Text formats are commonly compressed
• .gz
• .bz2
• .tar
• .zip
• ...

Text format vs. binary format
• Pros:
• Cons:

Open format vs. vendor-lock format
• Pros:
• Cons:

Why so many formats?
• Compatibility
• Speed
• Readability
• Storage efficiency
• Structuring needs
• Important metadata
• Transformers
• Versioning
• Source!

Example: Human reference genome
• GRCh38.p14
• GRCh37
• hg19
• GCA_000001405.29
• Which one to use?
• What is the difference?
• What are the implications?

Data sources
• Your own laboratory
• Publicly available databases
  • Accessible via internet browser
  • Dedicated API

NCBI
• https://www.ncbi.nlm.nih.gov/
• National Center for Biotechnology Information
• Aggregation of several data sources into one project
  • GenBank
  • RefSeq
  • SRA
  • PubMed
  • ...

EMBL-EBI
• https://www.ebi.ac.uk/
• European Molecular Biology Laboratory - European Bioinformatics Institute
• https://www.ebi.ac.uk/services/data-resources-and-tools
• ENA – European Nucleotide Archive
  • https://www.ebi.ac.uk/ena/browser/home

UCSC
• https://genome.ucsc.edu/
• University of California, Santa Cruz, Genomics Institute

UniProt
• https://www.uniprot.org/
• Several aggregated data sources about proteins
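
Worked example: reading FASTA, FASTQ and gzip-compressed text
To make the sequence-format and compression slides more concrete, here is a minimal Python sketch that uses only the standard library. It is an illustration under stated assumptions, not a replacement for an established parser: the function names read_fasta and read_fastq and the sample records are hypothetical, the FASTQ reader assumes exactly four lines per record, and the quality string is assumed to use the Phred+33 encoding (score = ASCII code minus 33).

```python
import gzip
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fastq(handle):
    """Yield (name, sequence, phred_scores) from a FASTQ text handle.

    Assumes four lines per record and Phred+33 quality encoding.
    """
    while True:
        name = handle.readline().strip()
        if not name:
            break
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().strip()
        yield name[1:], seq, [ord(char) - 33 for char in qual]


# Hypothetical sample data, made up for this illustration.
fasta_text = ">seq1 example record\nACGT\nTTGA\n>seq2\nGGCC\n"
fastq_text = "@read1\nACGT\n+\nII5!\n"

fasta_records = list(read_fasta(io.StringIO(fasta_text)))
fastq_records = list(read_fastq(io.StringIO(fastq_text)))  # 'I' -> 40, '5' -> 20, '!' -> 0

# Text formats are commonly compressed: the same parser works on a .gz stream
# once it is opened in text mode.
compressed = gzip.compress(fasta_text.encode("ascii"))
with gzip.open(io.BytesIO(compressed), "rt") as gz_handle:
    gz_records = list(read_fasta(gz_handle))

print(fasta_records)
print(fastq_records)
print(gz_records == fasta_records)
```

With a real file, gzip.open("reads.fastq.gz", "rt") would take the place of the in-memory buffer (the file name is a placeholder). The sketch keeps the parsers as generators so that large files are read one record at a time rather than loaded whole.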