Acceleration of 3D reconstruction in cryo-EM
David Střelák, Carlos Óscar Sorzano, José Maria Carazo, Jiří Filipovič
fall 2019
CERIT Scientific Cloud
European Union, European Regional Development Fund: Investment in Your Future. OP Research and Development for Innovation, 2007-13.

Outline: Introduction, 3D Reconstruction, Evaluation, Conclusion, Lessons Learned

Visualization of Molecules
Several methods can visualize molecules at the atomic level
► X-ray diffraction
► nuclear magnetic resonance (NMR)
► cryo-electron microscopy (cryo-EM)
Cryo-EM has some advantages over the other methods
► captures molecules in their natural environment (diffraction needs crystallization)
► usable for large molecules (NMR is restricted to smaller proteins)

Cryo-electron microscopy
Rapid development in recent years
► in 2012, there were only four structures at near-atomic resolution
► in 2015, 115 structures were resolved
► this progress was enabled by direct electron detectors, vitreous ice and image reconstruction methods
In 2017, the Nobel Prize in Chemistry was awarded for cryo-EM
► Jacques Dubochet, Joachim Frank, Richard Henderson
► Joachim Frank received his prize for an image processing method that obtains a 3D structure from electron microscope data

Cryo-electron microscopy
Illustration: ©Martin Högbom/The Royal Swedish Academy of Sciences

Specimens in the Ice
[Figure: specimens in the ice]

Image Analysis in Cryo-EM
Reconstruction of the 3D volume is challenging
► the electron beam damages the specimen, so it must be weak, so the signal-to-noise ratio is very low
► surrounding water adds another source of noise
► specimens are captured in random positions, possibly with conformational changes
► when captured multiple times, the image moves and deforms

Image of Specimens
[Figure: image of specimens]

Aligning Images
[Figure: aligning images]

3D Volume Reconstruction
[Figure: 2D image, 3D volume]

Our Focus
Image reconstruction is computationally very demanding
► requires at least thousands of CPU hours
► 3D reconstruction is one of the main bottlenecks
We focus on accelerating 3D volume reconstruction in the Xmipp software
► software developed at the Spanish National Center for Biotechnology (CNB-CSIC)
► in production use, not a toy prototype

Getting 3D Volume from Images?
Central slice theorem
► let i be a real-space projection image, which carries condensed information about the 3D volume v
► let I be the Fourier transform of image i and V be the Fourier transform of v
► I forms a slice of V with the same orientation as i has with respect to v; moreover, the slice I passes through the center of V
So we need to transform our images into Fourier space, build the 3D Fourier volume and transform the volume back to real space.

3D Volume Reconstruction
[Figure: 3D volume reconstruction]

3D Volume Reconstruction
We need to estimate the orientation of each 2D image
► computed iteratively
► the bottleneck is creating the 3D volume from the 2D images
We have accelerated the creation of the 3D volume on GPUs.
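
The central slice theorem stated earlier can be checked numerically one dimension lower: the 1D Fourier transform of a 2D image projected along y should equal the central row of the image's 2D Fourier transform. The following is only an illustrative sketch in plain Python; the helper names (dft1, central_row_of_dft2, project_along_y) and the toy image are made up for this example and are not part of Xmipp.

```python
import cmath

def dft1(signal):
    """Naive 1D discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n))
            for u in range(n)]

def central_row_of_dft2(img):
    """Row of the 2D DFT with zero frequency along y (the slice through the center)."""
    height, width = len(img), len(img[0])
    return [sum(img[y][x] * cmath.exp(-2j * cmath.pi * u * x / width)
                for y in range(height) for x in range(width))
            for u in range(width)]

def project_along_y(img):
    """Real-space projection: sum the image over y for every x."""
    return [sum(row[x] for row in img) for x in range(len(img[0]))]

# Toy image (assumption: any real-valued image works here).
image = [[1.0, 2.0, 0.0, 4.0],
         [0.5, 0.0, 3.0, 1.0],
         [2.0, 1.0, 1.0, 0.0]]

slice_from_projection = dft1(project_along_y(image))
central_slice = central_row_of_dft2(image)
max_error = max(abs(a - b) for a, b in zip(slice_from_projection, central_slice))
```

In 3D the same relation holds between a 2D projection and a plane through the center of the 3D Fourier volume, which is why the images can be inserted as slices.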

State-of-the-art
Multiple papers deal with GPU acceleration of 3D reconstruction, all implementing a scatter method
► GPU threads are associated with 2D pixels of the image
► each thread computes the projection of its pixel into the volume (resulting in a floating-point position)
► the pixel value is written into multiple voxels (integer positions) using interpolation

State-of-the-art
[Figure: scatter pattern]

State-of-the-art
Drawbacks of the scatter pattern
► race conditions in writing (distances within a voxel are up to √3× longer than the distance between two pixels), which requires atomic writes
► some incorrect optimizations that remove atomics have been published
► frequent writes into the 3D domain with poor spatial locality

The Gather Pattern
The image value is computed for each 3D voxel
► so each voxel is written only once
► no race conditions in reading
► image data are interpolated (we obtain a floating-point position in the image), so they are accessed multiple times
► much better memory locality (we now repeat accesses into the 2D image, not the 3D volume)

The Gather Pattern
[Figure: gather pattern]

The Gather Pattern
Projecting 3D voxels to the 2D image
► when going into image space, we get a position in the image and a z-distance from the image
► many voxels do not hit the image (the z-distance is too large, or they fall outside the image boundaries)
► we have O(n²) pixels but O(n³) voxels, so many of the voxels are not used

The Gather Pattern
Projection planes optimization
► we look at the image from the plane orthogonal to a coordinate axis (XY, XZ, YZ) that maximizes the projected image size
► the iteration space is reduced to the projection plane
► for each point of the projection plane, we compute the distance to the image and start processing voxels from there

The Gather Pattern
[Figure: projection planes]

Basic GPU Implementation
The gather pattern can be rewritten for the GPU directly
► one GPU thread is assigned to one point of the projection plane
Optimization opportunities
► the advanced interpolation method may be computationally demanding
► the GPU cache system is limited in maintaining data locality

Interpolation
Our computation is a kind of stencil, but with floating-point positions
► interpolation coefficients cannot be precomputed easily

Interpolation
We have implemented two strategies
► on-the-fly interpolation
► precomputed table for very fine steps (originally in Xmipp)
► can be cached or preloaded into shared memory

Explicit Caching of Image Data
A thread block accesses only a part of the image
► can be cached in fast shared memory
► however, its size may vary depending on image rotation
► we upper-bound the image size by ⌈√2·√3·(b + 2r)⌉, where b is the thread block size and r is the interpolation radius
► shared memory is allocated to the upper bound prior to GPU kernel execution
► for each image, the AABB (axis-aligned bounding box) is computed and a region of the proper size is preloaded into shared memory
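
The gather pattern described on the preceding slides can be sketched on the CPU as follows. This is a simplified real-space illustration under several assumptions, not the Xmipp kernel: it uses plain bilinear interpolation instead of the advanced interpolation mentioned in the slides, visits every voxel instead of using the projection planes optimization, and all names (gather_insert, bilinear, max_dist) are hypothetical.

```python
import math

def bilinear(img, x, y):
    """Bilinear interpolation of img[y][x] at a floating-point position."""
    n = len(img)
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    total = 0.0
    for dy, wy in ((0, 1.0 - fy), (1, fy)):
        for dx, wx in ((0, 1.0 - fx), (1, fx)):
            xi, yi = x0 + dx, y0 + dy
            if 0 <= xi < n and 0 <= yi < n:
                total += wx * wy * img[yi][xi]
    return total

def gather_insert(volume, img, rot, max_dist=0.5):
    """For every voxel, project it into image space and read the image there.

    rot is a 3x3 rotation matrix whose columns are the image x axis, the
    image y axis and the image normal, expressed in volume coordinates.
    Each voxel is written at most once per image, so no atomics are needed.
    """
    n = len(img)
    c = (n - 1) / 2.0  # center of both the image and the volume
    for k in range(n):
        for j in range(n):
            for i in range(n):
                p = (i - c, j - c, k - c)
                # Image-space coordinates: transpose(rot) applied to p.
                x = sum(rot[r][0] * p[r] for r in range(3))
                y = sum(rot[r][1] * p[r] for r in range(3))
                z = sum(rot[r][2] * p[r] for r in range(3))
                if abs(z) > max_dist:
                    continue  # voxel is too far from the image plane
                u, v = x + c, y + c
                if not (0.0 <= u <= n - 1 and 0.0 <= v <= n - 1):
                    continue  # voxel projects outside the image
                volume[k][j][i] += bilinear(img, u, v)

n = 3
image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
volume = [[[0.0] * n for _ in range(n)] for _ in range(n)]
gather_insert(volume, image, identity)
```

With the identity orientation the image lands in the middle slice of the volume and all other voxels are skipped, which also illustrates why most of the O(n³) voxels are not used for a single image.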

Additional Optimizations
Register consumption optimization
► many parameters are moved into templates or macros
► this allows GPU parallelism to increase
CPU-GPU load balancing
► the CPU prepares images for the GPU; one core is not powerful enough to do so
► multiple threads prepare images and share GPU time, which also allows copy and computation to overlap

Autotuning
parameter      values
BLOCK_DIM      8, 12, 16, 20, 24, 28, 32
ATOMICS        0, 1
GRID_DIM_Z     1, 4, 8, 16
PRECOMP_INT    0, 1
SHARED_INT     0, 1
SHARED_IMG     0, 1
TILE_SIZE      1, 2, 4, 8

Architecture
[Figure: Process #0 distributes tasks (batches of samples) to Processes #1 to #m. In each process, a Thread Manager distributes samples to CPU Threads #1 to #n, each paired with a GPU Kernel that updates a 3D regular grid. Process #0 then reduces the partial grids.]
Figure: Architecture of 3D Fourier Reconstruction.
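
The architecture figure can be mimicked in miniature: samples are split into batches, each worker accumulates its batch into its own partial grid, and the partial grids are summed at the end. The sketch below uses Python threads and a flat list as a stand-in for the 3D regular grid; the names (accumulate_batch, reduce_grids, reconstruct) and the round-robin batching are assumptions made for illustration, not the scheduling used in Xmipp.

```python
from concurrent.futures import ThreadPoolExecutor

def accumulate_batch(batch, grid_size):
    """Worker: insert one batch of (cell index, value) samples into a private grid."""
    partial = [0.0] * grid_size
    for index, value in batch:
        partial[index] += value
    return partial

def reduce_grids(partials):
    """Final step: sum the partial grids cell by cell."""
    return [sum(cells) for cells in zip(*partials)]

def reconstruct(samples, grid_size, workers=2):
    """Distribute samples round-robin, accumulate in parallel, then reduce."""
    batches = [samples[w::workers] for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda b: accumulate_batch(b, grid_size), batches))
    return reduce_grids(partials)

samples = [(0, 1.0), (2, 0.5), (0, 2.0), (3, 4.0), (2, 1.5)]
grid = reconstruct(samples, grid_size=4)
```

Because every worker writes only to its own grid, no synchronization is needed until the reduction.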

Evaluation
Tested on a GPGPU cluster node

Processor                      performance      memory BW
2x Xeon E5-2650 v4             845 GFlops       154 GB/s
1x Tesla P100                  9.519 TFlops     732 GB/s
4x Tesla P100                  38.076 TFlops    2928 GB/s
theoretical speedup (1 GPU)    11.3x            4.75x

Evaluation

           execution time    speedup over original
2x CPU     155m              n/a
1x GPU     13m35s            11.4x
4x GPU     4m53s             31.7x

Performance Portability
Table: Performance portability of 3D Fourier Reconstruction

              P100    GTX1070    GTX750    GTX680
Tesla P100    100%    95%        44%       96%
GTX 1070      88%     100%       31%       50%
GTX 750       65%     67%        100%      94%
GTX 680       71%     72%        71%       100%

We can gain over 3x speedup when tuning for each GPU architecture.

Performance Portability
Table: Sensitivity to input image size in 3D Fourier Reconstruction (GTX 1070)

           128x128    91x91    64x64    50x50    32x32
128x128    100%       100%     77%      70%      32%
91x91      100%       100%     76%      68%      33%
64x64      94%        94%      100%     91%      67%
50x50      79%        78%      98%      100%     86%
32x32      65%        67%      80%      92%      100%

We can gain over 3x speedup when tuning for a specific input size.
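
The speedup figures in the evaluation tables can be recomputed from the raw numbers. The short sketch below only redoes the arithmetic; the variable names are chosen for this example, and the last two values (4-GPU scaling efficiency and the gain implied by the lowest portability entry) are derived quantities that do not appear in the slides.

```python
def seconds(minutes, secs=0):
    """Convert a minutes/seconds pair to seconds."""
    return 60 * minutes + secs

# Peak numbers from the testbed table.
cpu_gflops, gpu_gflops = 845.0, 9519.0
cpu_bandwidth, gpu_bandwidth = 154.0, 732.0  # GB/s

compute_ratio = gpu_gflops / cpu_gflops        # theoretical speedup if compute-bound
bandwidth_ratio = gpu_bandwidth / cpu_bandwidth  # theoretical speedup if memory-bound

# Measured execution times from the results table.
time_cpu = seconds(155)
time_1gpu = seconds(13, 35)
time_4gpu = seconds(4, 53)

speedup_1gpu = time_cpu / time_1gpu
speedup_4gpu = time_cpu / time_4gpu

# Derived here, not stated in the slides.
scaling_efficiency = speedup_4gpu / (4 * speedup_1gpu)
gain_from_tuning = 1.0 / 0.31  # lowest entry of the portability table
```

The lowest portability entry (31%) corresponds to a gain of a little over 3x, which matches the statement under the table.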

Conclusion
We have implemented a fast, production-ready algorithm
► significant performance boost
► implemented in Xmipp from the beginning
Advantages over state-of-the-art
► the gather approach is already significantly faster (about 2x on the Pascal architecture)
► we expect it to be even faster on future architectures (bigger flops-to-memory gap, higher parallelism)
► we do not rely on a HW implementation of atomics (negligible slowdown when e.g. the result is stored in double precision)

Programming Complexity
The basic idea of the algorithm is pretty simple
► we are just putting 2D slices into 3D space, right?
Indexing hell
► Fourier space is symmetric, and we need to deal with its boundaries
► going from a 3D integer position to a 2D real position while handling 3D symmetry, 2D symmetry and space padding in both 3D and 2D is challenging
► because of the real-valued positions, it is not always clear whether we compute correctly (e.g. how to trace boundary conditions?)
► the indexed space is extremely large
► gnuplot seems to be a really good tool

Debugging with gnuplot
[Figure: gnuplot output]

Debugging with gnuplot
[Figure: gnuplot output with series "AABB_cpu" and "BBgpu"]

Finding Bugs
It is not simple to determine what is a bug in noisy data
► just by moving from scatter to gather, we already compute something else
Errors in some part of the pipeline are difficult to interpret
► e.g. bad indexing in Fourier space looks really weird in real space
► the chain is too long: 2D real → 2D Fourier → 3D Fourier → 3D real

Sphere
[Figure: reconstructed sphere]

Sphere with Indexing Bug
[Figure: slice viewer showing volume ../27.7/spere_mpi.vol (64 x 64 x 64), slices 1 to 54]
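
One source of the indexing difficulty mentioned under Programming Complexity is the symmetry of Fourier space: for real-valued input, the coefficient at frequency (-kx, -ky) is the complex conjugate of the one at (kx, ky), so only half of the space needs to be stored. The sketch below shows such a lookup in 2D; it is an assumed, simplified layout (non-negative kx stored, square image), and the names dft2 and fetch_from_half are hypothetical rather than taken from Xmipp.

```python
import cmath

def dft2(img):
    """Naive 2D DFT of a square image; result is indexed as full[ky][kx]."""
    n = len(img)
    return [[sum(img[y][x] * cmath.exp(-2j * cmath.pi * (kx * x + ky * y) / n)
                 for y in range(n) for x in range(n))
             for kx in range(n)]
            for ky in range(n)]

def fetch_from_half(half, kx, ky, n):
    """Return the coefficient at signed frequency (kx, ky) from half storage.

    half[ky][kx] holds only kx in 0..n//2. A negative kx is mapped to the
    mirrored frequency (-kx, -ky) and the stored value is conjugated.
    """
    if kx < 0:
        return half[(-ky) % n][-kx].conjugate()
    return half[ky % n][kx]

n = 4
image = [[1.0, 0.0, 2.0, 5.0],
         [3.0, 1.0, 0.0, 2.0],
         [0.0, 4.0, 1.0, 1.0],
         [2.0, 2.0, 3.0, 0.0]]

full = dft2(image)
half = [row[:n // 2 + 1] for row in full]

# Compare the half-storage lookup against the full transform.
max_error = max(abs(fetch_from_half(half, kx, ky, n) - full[ky % n][kx % n])
                for ky in range(-n // 2, n // 2 + 1)
                for kx in range(-n // 2, n // 2 + 1))
```

A mistake in either the mirroring or the conjugation is invisible in a single coefficient but distorts the whole volume after the inverse transform, which is consistent with the remark that bad indexing in Fourier space looks weird in real space.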