GPU Hardware Performance
Jiří Filipovič
Fall 2013

Outline: Global Memory Access Optimization, Matrix Transposition, Instruction Speed

Global Memory Access Optimization

Performance of the global memory easily becomes a bottleneck
• global memory bandwidth is low relative to the arithmetic performance of the GPU (a G200 can do more than 24 FLOPS per transferred float, a GF100 more than 30)
• 400-600 cycles of latency

The throughput can be significantly worse with a bad parallel access pattern
• the memory has to be accessed continuously (coalescing)
• using only a subset of the memory regions should be avoided (partition camping)

Continuous Memory Access (C. C. < 2.0)

GPU memory needs to be accessed in larger blocks for efficiency
• global memory is split into 64 B segments
• two of these segments are aggregated into 128 B segments

(figure: a half warp of threads mapped onto 64 B and 128 B aligned segments)

A half warp can transfer the data in a single transaction, or in one to two transactions when transferring a 128 B word
• it is necessary to use large words
• one memory transaction can transfer a 32 B, 64 B, or 128 B word
• on GPUs with c. c. < 1.2
  • the accessed block has to begin at an address divisible by 16x the data size
  • the k-th thread has to access the k-th element of the block
  • some threads need not participate
  • if these rules are not obeyed, each element is retrieved using a separate memory transaction

GPUs with c. c. ≥ 1.2 are less restrictive
• each transfer is split into 32 B, 64 B, or 128 B transactions in a way that serves all requests with the least number of transactions
• the order of threads can be arbitrarily permuted w.r.t. the transferred elements

Threads are aligned, the element block is continuous, the order is not permuted — continuous access on all GPUs.

Unaligned Memory Access (C. C. < 2.0)

Threads are not aligned, the element block is continuous, the order is not permuted — one transaction on GPUs with c. c. ≥ 1.2. A similar case may result in a need for two transactions.
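To make these access patterns concrete, here is a small sketch (not from the original slides; the kernel names are illustrative) contrasting a coalesced read with an offset read and a strided read:

// Minimal illustration of the access patterns discussed above (hypothetical kernels).
// Coalesced: thread k of a (half-)warp reads element k of an aligned, contiguous block.
__global__ void read_coalesced(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                  // one transaction per (half-)warp
}

// Offset: the same pattern shifted by a few elements -- unaligned access
// (in must hold offset extra elements).
__global__ void read_offset(float *out, const float *in, int offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i + offset];         // one or two transactions on c. c. >= 1.2, up to 16 on older GPUs
}

// Strided: neighbouring threads read elements that are stride apart -- interleaved access
// (in must hold stride times more elements than out).
__global__ void read_strided(float *out, const float *in, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];         // in the worst case one transaction per thread
}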
Unaligned Memory Access Performance (C. C. < 2.0)

Older GPUs perform the smallest possible transfer (32 B) for each element, thus reducing performance to 1/8. Newer GPUs (c. c. ≥ 1.2) perform two transfers.

(figure: bandwidth vs. access offset for GTX 280 and FX 5600)

Interleaved Memory Access Performance (C. C. < 2.0)

The bigger the spaces between the elements, the bigger the performance drop on GPUs with c. c. ≥ 1.2 — the effect is rather dramatic.

(figure: bandwidth vs. access stride)

Global Memory Access with Fermi (C. C. ≥ 2.0)

Fermi has an L1 and L2 cache
• L1: 256 B per cache line, 16 kB or 48 kB per multiprocessor in total
• L2: 32 B per cache line, 768 kB per GPU in total

What are the advantages?
• more efficient programs with unpredictable data locality
• unaligned access — in principle no slowdown
• interleaved access — the data need to be used before they are flushed from the cache, otherwise the problem is the same as or bigger than with c. c. < 2.0 (the L1 cache may be turned off to avoid overfetching)
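One way to act on the 16 kB / 48 kB split mentioned above is the runtime cache-configuration call; a minimal sketch follows (the kernel name and the chosen preference are illustrative, not from the slides):

#include <cuda_runtime.h>

__global__ void my_kernel(float *data) { /* ... */ }

int main() {
    // Prefer a larger L1 cache (48 kB L1 / 16 kB shared memory) for this kernel;
    // cudaFuncCachePreferShared would choose the opposite split.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
    // Global loads can also be compiled to bypass L1 altogether
    // (e.g. nvcc -Xptxas -dlcm=cg), which is one way to avoid the overfetching
    // mentioned above for interleaved access.
    return 0;
}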
Partition Camping

• relevant for c. c. 1.x
• processors based on the G80 have 6 regions of global memory, those based on the G200 have 8
• the memory is split into 256 B regions
• even access across the regions is needed for maximum performance
  • across individual thread blocks
  • blocks are usually run in the order given by their position in the grid
• if only a part of the regions is used, the resulting condition is called partition camping
• generally not as critical as continuous access
  • more tricky, problem-size dependent, and invisible from the fine-grained (per-warp) perspective

HW Organization of Shared Memory

Shared memory is organized into memory banks, which can be accessed in parallel
• c. c. 1.x: 16 banks, c. c. 2.x: 32 banks; the memory space is mapped onto the banks in an interleaved fashion with 32-bit granularity (successive 32-bit words go to successive banks)
• to use the full memory performance, we have to access data in different banks
• a broadcast is implemented — used if all threads access the same data

Bank Conflict

A bank conflict
• occurs when some threads in a warp/half-warp access data in the same memory bank (except when accessing exactly the same data)
• the memory access gets serialized
• the performance drop is proportional to the number of parallel operations that the memory has to perform to serve the request
• it makes a difference whether threads access different data in a single bank or the same data in a single bank

Access Patterns

Alignment is not needed, bank conflicts are not generated:
int x = s[threadIdx.x + offset];

Interleaving does not create conflicts if c is odd:
int x = s[threadIdx.x * c];

Access to the same variable never generates conflicts on c. c. 2.x; on c. c. 1.x it is conflict-free only if the number of threads accessing the variable is a multiple of 16:
int x = s[threadIdx.x % c];

Other Memory Types

Transfers between host and GPU memory
• need to be minimized (often at the cost of decreasing the efficiency of the computation on the GPU)
• may be accelerated using page-locked memory
• it is more efficient to transfer large blocks at once
• computations and memory transfers should be overlapped (see the sketch below)

Texture memory
• designed to reduce the number of transfers from the global memory
• works well for aligned access
• does not help if latency is the bottleneck
• may simplify addressing or add filtering

Constant memory
• as fast as registers if the same value is read
• performance decreases linearly with the number of different values read

Registers
• read-after-write latency, hidden if at least 192 threads are running for c. c. 1.x or at least 768 threads for c. c. 2.x
• bank conflicts are possible even in registers
  • the compiler tries to avoid them
  • we can make the compiler's job easier by setting the block size to a multiple of 64
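The host-transfer recommendations above (page-locked memory, large blocks, overlap) can be combined roughly as in the following sketch; the kernel process, the chunking into two pieces, and all sizes are illustrative assumptions, not part of the slides:

#include <cuda_runtime.h>

__global__ void process(float *data, int n) { /* ... */ }

int main() {
    const int N = 1 << 20;
    float *h_buf, *d_buf[2];
    cudaStream_t s[2];

    cudaMallocHost(&h_buf, 2 * N * sizeof(float));     // page-locked host memory
    for (int i = 0; i < 2; i++) {
        cudaMalloc(&d_buf[i], N * sizeof(float));
        cudaStreamCreate(&s[i]);
    }

    // Copy chunk i and run the kernel on it in stream i; the copy of one chunk
    // can overlap with the computation on the other chunk.
    for (int i = 0; i < 2; i++) {
        cudaMemcpyAsync(d_buf[i], h_buf + i * N, N * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<N / 256, 256, 0, s[i]>>>(d_buf[i], N);
    }
    cudaDeviceSynchronize();
    return 0;
}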
Matrix Transposition

From a theoretical perspective:
• a trivial problem
• trivial parallelization
• trivially limited by the memory throughput (no arithmetic is done)

__global__ void mtran(float *odata, float *idata, int n){
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    odata[x*n + y] = idata[y*n + x];
}
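For completeness, a possible host-side launch of the naive kernel (my sketch; the 16 x 16 block shape and the assumption that n is a multiple of 16 are not from the slides):

dim3 threads(16, 16);
dim3 blocks(n / 16, n / 16);   // assumes n is a multiple of 16
mtran<<<blocks, threads>>>(odata, idata, n);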
Performance

When running the code on a GeForce GTX 280 with a large enough matrix, 4000 x 4000, the throughput is 5.3 GB/s. Where is the problem?

The access to odata is interleaved. After modifying the kernel so that it only copies:
odata[y*n + x] = idata[y*n + x];
the throughput is 112.4 GB/s. If idata were accessed in an interleaved way too, the resulting throughput would be 2.7 GB/s.

On Removing Interleaving

The matrix can be processed per tile (a square sub-block)
• we read a tile into the shared memory row-wise
• we store its transposition into the global memory row-wise
• thus both reading and writing are done without interleaving

What tile size should be used?
• let us consider square tiles
• for aligned reading, the row size has to be a multiple of 16
• because of shared memory size limitations, we can consider tile sizes of 16 x 16, 32 x 32, and 48 x 48
• the best size is determined experimentally

Block Transposition

__global__ void mtran_coalesced(float *odata, float *idata, int n){
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = x + y*n;
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = x + y*n;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];
}

Performance

The highest performance was measured for a 32 x 32 tile with 32 x 8 thread blocks — 75.1 GB/s
• that is significantly better, but still far from simple copying
• the kernel is more complex and contains synchronization
• we need to figure out whether we have reached the maximum or there is still a problem somewhere
• if we only copy within the tiles, we get 94.9 GB/s (see the sketch below)
• something is still sub-optimal
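The 94.9 GB/s reference comes from a kernel that only copies data through the shared memory tile instead of transposing it. A possible reconstruction of such a copy kernel, reusing the TILE_DIM/BLOCK_ROWS setup above (this is my sketch, not code from the slides):

__global__ void mcopy_tiled(float *odata, float *idata, int n){
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int index = x + y*n;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y+i][threadIdx.x] = idata[index+i*n];

    __syncthreads();

    // Write the tile back to the same position, row-wise -- no transposition,
    // so both the global and the shared memory accesses stay conflict-free.
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index+i*n] = tile[threadIdx.y+i][threadIdx.x];
}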
Shared Memory

When reading from the global memory, we write into the shared memory row-wise:
tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];

When writing to the global memory, we read from the shared memory column-wise:
odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];

That is reading with a stride that is a multiple of 16 — the whole column lies in a single memory bank, thus creating a 16-way bank conflict.

A solution is padding:
__shared__ float tile[TILE_DIM][TILE_DIM + 1];

Performance

Now our implementation shows 93.4 GB/s
• as good as simple copying
• it seems we cannot do much better for the given matrix
• beware of different input data sizes (see partition camping)

(figure: throughput vs. matrix size)

Performance Drops

The performance drops for some sizes and the behavior is regular
• for matrices with size a multiple of 512, we only get 19 GB/s
• for other matrices with size a multiple of 256, we only get 35 GB/s
• for other matrices with size a multiple of 128, we only get 62 GB/s

One memory region is two tiles wide (256 B / 4 B per float = 64 floats, with 32 floats per tile). If we analyze the tile placement w.r.t. the matrix size, we learn that
• with a size that is a multiple of 512, the tiles of one column all lie in the same region
• with a multiple of 256, each column lies in at most two regions
• with a multiple of 128, each column lies in at most four regions

We have discovered partition camping.
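A back-of-the-envelope check of the multiple-of-512 case (my illustration; the 8 partitions x 256 B layout comes from the partition camping slide, the helper function itself is hypothetical):

// Partition index of element (r, c) of a row-major float matrix with row length n,
// assuming 8 partitions of 256 B each (G200-class GPU).
int partition_of(int r, int c, int n) {
    long byte_offset = ((long)r * n + c) * sizeof(float);
    return (int)((byte_offset / 256) % 8);
}
// For n = 512: byte_offset / 256 = 8*r + c/64, so the partition is (c/64) % 8,
// independent of r -- every tile in one column of tiles falls into a single partition,
// and all thread blocks writing that column of odata compete for it.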
How to Remove Partition Camping?

We can pad the matrix with "fake data" and avoid the bad sizes
• it makes further work on the algorithm more complicated
• it occupies more memory

We can change the mapping of thread block IDs onto matrix tiles
• a diagonal mapping ensures access to different regions:
int blockIdx_y = blockIdx.x;
int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;
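A sketch of how this remapping plugs into the tiled kernel (my reconstruction of the idea, assuming a square grid; only the first lines differ from mtran_coalesced above):

__global__ void mtran_diagonal(float *odata, float *idata, int n){
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    // diagonal reordering of the tile indices (square grid assumed)
    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;

    int x = blockIdx_x * TILE_DIM + threadIdx.x;
    int y = blockIdx_y * TILE_DIM + threadIdx.y;
    int index_in = x + y*n;
    x = blockIdx_y * TILE_DIM + threadIdx.x;
    y = blockIdx_x * TILE_DIM + threadIdx.y;
    int index_out = x + y*n;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];
}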
Performance

The new implementation gives 80 GB/s
• the performance does not drop where we saw it previously
• for matrix sizes that are a multiple of 128 it is still worse than the original implementation
  • the algorithm is more complex
  • we can use it only for the problematic data sizes

For a given problem, there may not be (and often there is not) a single ideal algorithm for the whole range of input sizes. It is necessary to benchmark, as not all problems are easily revealed just by looking at the code.

(figure: throughput vs. matrix size for the diagonal mapping)

Performance Summary

All optimizations were aimed only at better accommodating the HW properties
• still, we got a 17.6x speedup
• when creating an algorithm, it is necessary to understand the HW limitations
• otherwise we would not have to develop specifically for GPUs — developing a good sequential algorithm would have been just fine...

Optimization Effects

Beware of the interplay of optimizations
• if we took 4096 x 4096 matrices instead of 4000 x 4000, removing the memory bank conflicts would have brought only a marginal improvement
• after removing partition camping, the memory bank conflicts would have shown up
• thus it makes sense to go from the more general/substantial optimizations to the less general ones
• if some (provably correct) optimization does not result in a performance increase, we need to analyze what the algorithm is limited by

Instruction Speed

Processing of Instructions

Processing of instructions on a multiprocessor (c. c. 1.x)
• there are 8 SP cores and 2 SFU cores
• if the SP and SFU instruction processing is not overlapped, the multiprocessor can process up to 8 instructions per cycle
• one warp is thus done in 4 or more cycles
• some instructions are significantly slower
• knowledge of instruction processing helps us design optimal code

Floating Point Operations

The GPU is primarily graphics hardware
• graphics operations mostly use floating point numbers
  • efficiently implemented in GPUs
  • newer GPUs (c. c. ≥ 1.3) can work in double precision, older ones in single precision only
• some arithmetic operations used very frequently in graphics are implemented directly in HW
  • the HW implementation provides lower precision (not an issue for many applications)
  • differentiated by the __ prefix

Arithmetic Operations

Floating point operations (throughput per multiprocessor)
• addition, multiplication: 8 (1.x), 32 (2.0), 48 (2.1)
  • multiplication and addition may be combined into a single MAD instruction on c. c. 1.x
    • lower precision
    • 1 cycle on an SP
    • __fadd_rn() and __fmul_rn() may be used to prevent fusing into MAD during compilation
  • MAD is replaced by FMAD on c. c. 2.x (same speed, higher precision)
• 64-bit versions: 1/8 (1.3), 1/2 (2.0), 1/12 (2.1)
• reciprocal: 2 (1.x), 4 (2.0), 8 (2.1)
• division is relatively slow: 1.23 on average (c. c. 1.x)
  • the faster variant __fdividef(x, y): 1.6 (c. c. 1.x)
• reciprocal square root: 2 (1.x), 4 (2.0), 8 (2.1)
• type conversion: 8 (c. c. 1.x), 16 (c. c. 2.x)

• __sinf(x), __cosf(x), __expf(x): 2 (c. c. 1.x), 4 (c. c. 2.0), 8 (c. c. 2.1)
• sinf(x), cosf(x), expf(x): more precise but an order of magnitude slower
• other operations with different speed/precision trade-offs are implemented, see the CUDA manual

Integer operations
• addition as fast as the floating point one
• multiplication on c. c. 1.x: throughput of 2 per multiprocessor
  • __mul24(x, y) and __umul24(x, y): throughput of 8
• multiplication on c. c. 2.x is as fast as in floating point; the 24-bit versions are slow
• division and modulo are very slow, but if n is a power of 2, we can use
  • i/n is equivalent to i >> log2(n)
  • i%n is equivalent to i & (n-1)
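A small illustration of the power-of-two trick from the last two bullets (the concrete value of n is an arbitrary example):

__device__ void div_mod_pow2_example(unsigned int i) {
    const unsigned int n = 256;       // n = 2^8; a power of two is assumed
    unsigned int q = i >> 8;          // equivalent to i / n
    unsigned int r = i & (n - 1);     // equivalent to i % n
    // q and r would be used here instead of the slow integer division/modulo
    (void)q; (void)r;
}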
Loops

Small loops have significant overhead
• jumps need to be implemented
• the control variable has to be updated
• a significant part of the instructions may be pointer arithmetic

Loop unrolling is an option
• it may partially be done by the compiler
• we can unroll manually or use #pragma unroll

Other Instructions

Other common instructions are executed at the basic speed (i.e., corresponding to the number of SPs)
• comparison
• bit operations
• memory access instructions (given the limitations discussed earlier and the memory latency/bandwidth)
  • the offset may be a register value plus a constant
• synchronization (unless we get blocked)

Beware of Shared Memory

If bank conflicts are avoided, the shared memory is as fast as registers.

But beware
• instructions can work with only one operand in the shared memory
• if more than one operand of an instruction resides in shared memory, an explicit load/store is necessary
• MAD instructions run slower (c. c. 1.x)
  • a + s[i]: 4 cycles per warp
  • a + a * s[i]: 5 cycles per warp
  • a + b * s[i]: even more cycles per warp
• these details are not published by nVidia (they were revealed through measurements)
  • they may change with future GPU generations and are interesting only for really critical code

C for CUDA Compilation

Device code can be compiled into PTX assembler and into binary files
• PTX is an intermediate code; it does not correspond directly to the GPU instructions
  • easier to read
  • harder to figure out what really happens on the GPU
  • a native GPU code compiler is to be released

Binary files may be disassembled using the decuda tool
• a third-party product
• may not work completely reliably
• still quite useful