SIMD + NativeAOT + Dynamic PGO
Jiří Činčura (engineer, x̄ size) | Karel Zikmund (manager, tall)

A Modern Processor

SISD
• Single Instruction Single Data
• Instruction Level Parallelism (ILP)
• Parallel processing

SIMD
• Single Instruction Multiple Data

NativeAOT
• Ahead-of-Time compilation
• JIT, startup time, memory/working set
• Benefits
  • Faster startup time
  • Smaller memory footprint
  • Self-contained executable
  • Restricted environments (e.g. iOS)
• Limitations
  • No dynamic loading
  • No runtime code gen (System.Reflection.Emit, limited reflection)
  • Bigger application binary

C# -> IL -> CPU
• C# code is compiled into IL (MSIL, CIL)
  • Stack based
  • Object oriented
• C# -> IL = Roslyn compiler (C# compiler)
• IL -> CPU = RyuJIT (JIT)
• Straightforward assembly code is not the fastest one
• JIT must generate great code and do it fast

PGO
• Profile Guided Optimization
• Static
  • Collect data from a representative run and store it alongside the executable
  • Is the data up-to-date?
• Dynamic
  • In-proc, no training or special builds
  • Uses Tiered Compilation by instrumenting code in initial tiers
  • Collected data used later for better or more optimization

Benefits
• About 15% or more

Key Optimizations
• Guarded Devirtualization (GDV)
  • At virtual and interface call sites, introduce tests for specific types
  • If the test succeeds, we know exactly which method will be called
    • Also try to inline the method
    • If the method is on a value class, inline the unboxing stub and the method
    • If the source is a box, attempt to optimize away the box too
  • If the test fails, just do the normal virtual / interface call
  • .NET 8: extends GDV to handle some delegate invokes as well
  • Opt-in: Multiple guesses

GDV

    void RegisterUser(IUserService service, User user) {
        service.Register(user); // virtual call
    }

    void RegisterUser(IUserService service, User user) {
        CORINFO_HELP_CLASSPROFILE32(service.GetType()); // instrumentation: record the observed type
        service.Register(user);
    }

    void RegisterUser(IUserService service, User user) {
        if (service is UserServiceImpl impl)
            impl.Register(user);    // direct call, can be inlined
        else
            service.Register(user); // still virtual (fallback)
    }

GDV

    void RegisterUser(IUserService service, User user) {
        service.Register(user); // virtual call
    }

    void RegisterUser(IUserService service, User user) {
        CORINFO_HELP_CLASSPROFILE32(service.GetType());
        service.Register(user);
    }

    void RegisterUser(IUserService service, User user) {
        if (service is UserServiceImpl impl)
            impl.Register(user);
        else if (service is GenericUserService1 impl1)
            impl1.Register(user);
        else if (service is GenericUserService2 impl2)
            impl2.Register(user);
        else
            service.Register(user);
    }

Profile-Driven Inlining
• Use profile data to ensure key methods are inlined
• Relaxed thresholds for IL size and number of basic blocks
• Waste less energy on (semi-) cold call sites

Profile-Driven Inlining

    for (int i = 0; i < 100; i++) {
        Test<int, float>();
        Thread.Sleep(16);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static bool Test<T1, T2>() => IsPrimitiveType(typeof(T1)) && IsPrimitiveType(typeof(T2));

    static bool IsPrimitiveType(Type type) =>
        type == typeof(bool) || type == typeof(char) ||
        type == typeof(sbyte) || type == typeof(byte) ||
        type == typeof(short) || type == typeof(ushort) ||
        type == typeof(int) || type == typeof(uint) ||
        type == typeof(long) || type == typeof(ulong) ||
        type == typeof(float) || type == typeof(double) ||
        type == typeof(nint) || type == typeof(nuint);

Without PGO data:

    ; Assembly listing for method Program:Test[int,float]():bool
    ; Tier-1 compilation
    ; No PGO data
           sub      rsp, 40
           mov      rcx, 0x11B802000B8      ; 'System.Int32'
           call     [Program:IsPrimitiveType(System.Type):bool]
           test     eax, eax
           je       SHORT G_M27198_IG05
           mov      rcx, 0x11B80205090      ; 'System.Single'
           call     [Program:IsPrimitiveType(System.Type):bool]
           nop
           add      rsp, 40
           ret
    G_M27198_IG05:
           xor      eax, eax
           add      rsp, 40
           ret
    ; Total bytes of code 53

Inliner: too many IL bytes

With Dynamic PGO (same source):

    ; Assembly listing for method Program:Test[int,float]():bool
    ; Tier-1 compilation
    ; Optimized with Dynamic PGO
           mov      eax, 1
           ret
    ; Total bytes of code 6

Inliner:
• Inline candidate has 13 foldable branches.
• Inline has 28 foldable intrinsics.
• Callsite has profile data: 1.0.
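The inlining experiment above can be put into a single console program to try locally. This is a minimal sketch, assuming a .NET 8 project that uses top-level statements; `Test` and `IsPrimitiveType` are written as static local functions here (on the slides they are methods of `Program`), the type arguments `<int, float>` are taken from the assembly listing, and the `last` variable is added only so the result can be printed.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

// Call Test repeatedly, with pauses, so tiered compilation has a chance to
// instrument it in an early tier and later recompile it at Tier-1.
bool last = false;
for (int i = 0; i < 100; i++)
{
    last = Test<int, float>();
    Thread.Sleep(16);
}
Console.WriteLine($"Test<int, float>() = {last}");

// NoInlining keeps Test as a separate method so its own listing can be inspected.
[MethodImpl(MethodImplOptions.NoInlining)]
static bool Test<T1, T2>() => IsPrimitiveType(typeof(T1)) && IsPrimitiveType(typeof(T2));

// Per the slides, this is too large in IL for the inliner unless profile data is available.
static bool IsPrimitiveType(Type type) =>
    type == typeof(bool) || type == typeof(char) ||
    type == typeof(sbyte) || type == typeof(byte) ||
    type == typeof(short) || type == typeof(ushort) ||
    type == typeof(int) || type == typeof(uint) ||
    type == typeof(long) || type == typeof(ulong) ||
    type == typeof(float) || type == typeof(double) ||
    type == typeof(nint) || type == typeof(nuint);
```

Running it once with `DOTNET_TieredPGO=0` and once with the default settings, with `DOTNET_JitDisasm` set to a pattern that matches `Test`, should give listings comparable to the two above. Because the compiler mangles local-function names, a wildcard pattern such as `*Test*` is assumed to be necessary; exact addresses and label names will differ from the slides.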

Instrumentation Overhead
• Dynamic PGO startup improvements in .NET 8 · Issue #76969
• Sparse, scalable edge profiles enabled for all methods
• GDV random state now in TLS
• Scalable profile mode
• More cases where we bypass instrumentation
• Enable intrinsic expansion in Tier0

Class and Method Profiles
• Reservoir sampling used to create approximate histograms of target classes (for virtual/interface calls) and target methods (for indirect/delegate calls)
• Fixed-size table per site (currently 32 entries, previously 8)
• One global table per site
• Each call adds an entry to the table until the table is full; then
• Each call may randomly replace some table entry, with some probability
• This also keeps contention low
• When optimizing, this data is used to drive GDV, testing for the most likely outcome(s)

Profiling Blocks
Dense
• Each block gets a counter
• Quite a bit of redundancy
• Simple diamond: four blocks, two independent counts
Sparse
• Subset of edges get counters
• Need to add in pseudo-edges
• Block counts reconstructed via “simple” math

Scalable Counters
• Before .NET 8, instrumentation used shared counters for the sparse edge counts, with non-interlocked (“racy”) updates.
• When the app is heavily multithreaded:
  • Heavy contention on some counters (very slow Tier0-instr code)
  • Poor accuracy as many updates are lost due to races
• Interlocked adds fix the accuracy issue, but contention is even worse
• Not feasible to shard the counters (i.e., TLS), both because of the space required and the need to aggregate across shards
Blue and red lines show the ratio of a “racy” contended counter’s value to the true value.
Note it can lose upwards of 90% of the counts (this was on a 12-core machine).

Scalable Counters
• .NET 8 introduces scalable counters
• Use interlocked add for the first 2^N counts
• Add randomly after that…
  • Add 2 with probability ½ for count in (2^N, 2^(N+1)]
  • Then add 4 with probability ¼ for count in (2^(N+1), 2^(N+2)]
• With a suitable threshold (N=13) the count value is very likely within 2% of the true value
• Number of writes to “hot” (potentially contended) counters drops dramatically

Deviation of scalable counter from true value: counts exactly up to 8192, then randomly for higher values. 5-95 spread about +/- 2%.

Tier0, Tier0 + Instr, Tier1: impact of improvements to instrumentation on TechEmpower RPS / latency.

Randomness
• Instrumentation relies quite a bit on randomness
  • GDV profiles use randomness for reservoir sampling to build approximate histograms
  • Count profiles use randomness to improve scalability
• PGO data will likely not be the same from one run to the next
• But typically there are enough observations that the overall behavior is still stable and repeatable
• There is already a fair amount of non-determinism when running code, but now the jitted codegen depends on it in a fundamental way.
• If you suspect a bug, try running with DOTNET_TieredPGO=0

PGO, BDN & PerfView
You can get ETL traces from BDN via -p ETW (on Windows) and view these with PerfView.
Note the same method now appears as two entries. From the time chart we can see that the QuickJitted (Tier0) version ran early on, and then the Tier1 version took over.
Make sure to filter the profile to just the time that BDN is actually making measurements.
Open the events view, select BDN’s WorkloadActual events, verify the intervals show consistent times, pick one and set the time limits on your profile view. Here: 3582..3837
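
The counting scheme from the Scalable Counters slides can be illustrated with a short program. This is a sketch under stated assumptions, not the runtime's implementation: `ScalableIncrement`, `TrueCalls`, the fixed seed, and the single shared `Random` are all hypothetical choices made for the demo, and the interval boundaries are handled as half-open ranges `[2^(N+k-1), 2^(N+k))` for simplicity. It counts exactly up to 2^N, then adds 2^k with probability 2^-k, so the expected increment per call stays 1.

```csharp
using System;
using System.Numerics;
using System.Threading;

const int N = 13;                 // threshold exponent from the slides (2^13 = 8192)
const long TrueCalls = 1_000_000; // hypothetical number of events to count

var rng = new Random(42);         // fixed seed so the demo is repeatable
long counter = 0;

// First 1000 events: still below 2^N, so the counter is exact.
for (long i = 0; i < 1000; i++) ScalableIncrement(ref counter, rng, N);
long exactPhase = counter;

// Remaining events: most calls past 2^N do not write to the counter at all.
for (long i = 1000; i < TrueCalls; i++) ScalableIncrement(ref counter, rng, N);
double relativeError = Math.Abs(counter - TrueCalls) / (double)TrueCalls;

Console.WriteLine($"after 1000 calls: {exactPhase}");
Console.WriteLine($"after {TrueCalls} calls: {counter} (relative error {relativeError:P2})");

// Exact interlocked increments below 2^n; above that, add 2^k with probability 2^-k,
// where k grows by one each time the counter value doubles.
// Note: Random is not thread-safe; a multi-threaded version would need per-thread random state.
static void ScalableIncrement(ref long counter, Random rng, int n)
{
    long count = Volatile.Read(ref counter);
    if (count < (1L << n))
    {
        Interlocked.Increment(ref counter);
        return;
    }

    int k = BitOperations.Log2((ulong)count) - n + 1; // 1 in [2^n, 2^(n+1)), 2 in the next range, ...
    if (rng.NextInt64(1L << k) == 0)                  // true with probability 2^-k
        Interlocked.Add(ref counter, 1L << k);
}
```

The design trade-off is the one the slides describe: the estimate is no longer exact, but the number of writes to a hot counter falls as the count grows, which is what reduces contention.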