BORIS JURIČ
MACHINE LEARNING APPLICATIONS ON RNA-SEQ DATA

RNA-sequencing
• Presence and quantity of RNA
• Gene expression, transcriptome assembly, differential analysis
• Disease biomarkers, diagnostics
• Cancer research

PDX (patient-derived xenograft)
• Data are contaminated
• High sequence similarity (~85%)

Standard methods
• Xenome: k-mer index table
• NGS Disambiguate: two alignments
• Pure alignment methods

Our approach
• Convolutional neural network
• One-hot encoding

A = [1,0,0,0]
T = [0,1,0,0]
C = [0,0,1,0]
G = [0,0,0,1]

ACC…TGG
[1,0,0,0]
[0,0,1,0]
[0,0,1,0]
…
[0,1,0,0]
[0,0,0,1]
[0,0,0,1]

Model assisted alignment
• Goal is gene expression quantification
• Model predicts the gene; alignment is only local

Search space and metric
• Output is an n-dimensional space
• K-mer content defines distance
• Similar genes lie closer together
• Random projection reduces the number of dimensions
• Relative distances are preserved

               AAAAAAAA   AAAAAAAT   ...   GGGGGGGG
DDX11L1            1          0      ...       1
WASH7P             0          1      ...       0
MIR6859-1          0          0      ...       1
MIR1302-2HG        1          0      ...       0
…                  …          …       …        …

High dimensional regression
• Model predicts coordinates
• Tree search of the matrix
• Alignment against the N closest neighbours

So far
• Works on a small number of genes
• Uneven gene size, insufficient metric
• Reference data structure needs improvement

Current work
• Clustering k-mers from the whole genome
• Generating the search space in a similar fashion
• Model predicts the cluster; the best alignment is chosen

Thank you for your attention
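
Appendix: illustrative sketch of the encoding and search space

The one-hot encoding, the binary k-mer matrix, the random projection and the N-closest-neighbour lookup from the slides can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the presented implementation: the gene names and sequences are made-up toy data, k = 3 is used instead of the 8-mers shown in the table to keep the matrix small, the projection is a plain Gaussian random matrix, the neighbour lookup is a linear scan rather than a tree search, and the query coordinates are computed directly from a k-mer profile as a stand-in for the output of the neural network.

```python
import itertools
import math
import random

# One-hot encoding as given on the slides (column order A, T, C, G).
ONE_HOT = {"A": [1, 0, 0, 0], "T": [0, 1, 0, 0],
           "C": [0, 0, 1, 0], "G": [0, 0, 0, 1]}

K = 3        # assumption: small k for the sketch (the slide table uses 8-mers)
DIM_OUT = 8  # assumption: size of the reduced space

# Column index for every k-mer, ordered AAA, AAT, ..., GGG like the table.
INDEX = {"".join(p): i
         for i, p in enumerate(itertools.product("ATCG", repeat=K))}


def one_hot(seq):
    """Encode a nucleotide string as a list of 4-element rows."""
    return [ONE_HOT[base] for base in seq]


def kmer_profile(seq, k, index):
    """Binary presence vector: 1 if the k-mer occurs in seq, else 0."""
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        vec[index[seq[i:i + k]]] = 1
    return vec


def random_projection(dim_in, dim_out, seed=0):
    """Gaussian random matrix (dim_out x dim_in), scaled by 1/sqrt(dim_out)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]


def project(matrix, vec):
    """Multiply the projection matrix by a profile vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def nearest(query, coords, n):
    """Names of the n reference entries closest to query (Euclidean, linear scan)."""
    return sorted(coords, key=lambda name: math.dist(query, coords[name]))[:n]


# Hypothetical toy reference; real gene sequences would go here.
genes = {
    "gene_a": "AAAAAATTTTTTAAATTT",
    "gene_b": "CCCCCCGGGGGGCCCGGG",
    "gene_c": "ACGTACGTTGCATGCAAC",
}

P = random_projection(len(INDEX), DIM_OUT)
coords = {name: project(P, kmer_profile(seq, K, INDEX))
          for name, seq in genes.items()}

# Stand-in for the model output: coordinates of a sequence with known origin.
query = project(P, kmer_profile(genes["gene_a"], K, INDEX))
candidates = nearest(query, coords, n=2)  # genes to align against
```

In this sketch `candidates` plays the role of the N closest neighbours: only those reference entries would be passed on to the local alignment step.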