Fri Jul 03 21:07:50 BST 2009
Heat + Programming dont mix
I have spent the week playing with the symtab code so that ustack() can display the user stack traces. I found various issues with the hacks to get Linux process control to work without radically modifying the existing code - still more to do, but at least I can concentrate on the symtab.
I tripped over a bug in a couple of the ELF functions, where there is a Solaris v. Linux incompatibility in the error return values.
I keep finding code where it tries to open /proc/pid/pstatus which doesnt exist on Linux, and various issues in finding the DYNAMIC/PROCEDURE_LINKAGE_TABLE. At the moment, its displaying the function names fine, but the module (library) names are garbage, probably because its expecting to find the shlib name but I havent stored it anywhere and its pointing to free memory.
I just ran valgrind on dtrace and thats helped track a few uninitiatlised variables, but valgrind doesnt understand the dtrace ioctl()s so any return from an ioctl() taints the output, unless/until I teach valgrind how to interpret these.
I spoke to Adam Leventhal about SDT probes to understand some more of the internals. An interesting point he mentioned to be was how SDT works in Solaris: as the kernel boots up, it scans itself for the SDT probes and readies the breakpoints to be inserted. So there is a mapping of probes, just like for a USDT application, which makes perfect sense.
I mentioned the trickyness of doing Linux SDT probes in the absence of source code changes to the kernel, and I know it can be done, but it may require case-by-case analysis to determine how best to patch the kernel to get the probe points. When I have finished/improved user space symbol and process handling then I can go back to that to play, or, I could just use dtrace to analyse more of the kernel itself.
More, when theres more to write about.
Mon Jun 29 22:40:24 BST 2009
rtdb framework
The rtdb functions are interfaces to the runtime linker (ld.so.1), and, although very nice, rely on intimate behavior of the Solaris linker. This doesnt exist on Linux (i.e. the corresponding functions). So, copying the code into dtrace means copying more and more dependencies (avlist, linked list, msg locales and other stuff), for little benefit.
dtrace uses these functions in a very specific way: get the symtab of the target process we are tracing, along with the symtab for the loaded shared libraries.
I am going to draw a line and see how much I can do without dragging it in. (I dragged it in and have kicked it out again, as I just spend more and more time porting Solaris to Linux, which isnt the end goal).
The end goal is making the PID provider and user space stack traces "as they should be".
This will likely take a while, so will update periodically if I feel what I have is no worse than before.
Sun Jun 28 12:07:05 BST 2009
dtrace progress - symtabs
I am currently looking at this:
$ dtrace -n pidXXX::: -p XXXI tried this on my MacOS system, and was intrigued by the fact that for a sample Perl app, tens of thousands of new probes sprang into life. It looks to me that you can DOS attack a kernel with these privs, since if you do this on lots of processes, you can eat the probe memory that dtrace will set aside, and either run out, or affect performance of a system.
At the moment I am knee deep in more ELF/dynamic stuff, so that we can get the symtab of a running process so that the PID provider is more usable.
Thu Jun 25 23:16:47 BST 2009
SDT probes - what?
Thats how I read the SDT: SDT can provide a probe like "received_packet" and provide an argument which represents the packet so you can dissect it.
But, the question is - are they useful ?!
I dont really understand the probes despite staring at the code for a while. I understand lots of the technicalities, but not the rationale. Is my first paragraph spot on? Feel free to send me feedback about why they are a *must*.
Why?
Well, many of the probes in Solaris relate to Solaris internals. The concepts of scheduling on solaris dont match the Linux kernel. Solaris has a process and a lwp (lightweight kernel thread). In Linux, all threads are really processes.
So, if you have a D script written for Solaris, it wont work on Linux, unless I provide as close an emulation as possible. I have found the FBT is more than enough to keep me entertained, but I am trying to find if we need SDT.
There are a lot of values exposed in /proc such as statistic counters. And there is a lot of code in the kernel which increments those counters. But the counters on their own are not directly interesting (you can put an FBT on the functions that manipulate those counters). So, maybe I am missing something, like, with dtrace/linux today, you cannot easily inspect processes, io, vm, packets, etc.
Tue Jun 23 23:56:25 BST 2009
fixed the 32b problems?
I will have to analyse this more to see why that hook (which shouldnt fire, and we do put it back on a rmmod) causes a problem.
Tue Jun 23 22:35:39 BST 2009
32b drat
After trying a few experiments, it appears that reloading the dtrace driver will panic/crash/reboot the 32b kernel. (After 3 times for my test machine, and in vmware, a reboot occurs, indicating a likely triple-fault).
I suspect maybe on driver unload, something is not being undone which happened on a load (maybe reset/unhooking the interrupt vectors).
I am investigating.
SDT Progress
Ive done some research on how to get SDT into the kernel without touching the kernel source. I was hoping for key subsystems like the scheduler, VM, NFS, that we would find a structure containing counters which are incremented at key parts of the driver, and the ones exposed in /proc. If we did, we could modify the instruction provider to look for these increments, and auto-create the probes.
What I have found so far in looking around, is that some/all drivers have either a disconnected adhoc collection of counters or have per "instance" counters. (I found references to zones in the MM code), so it wont be as easy as I hoped, but I am continuing to look for a pattern.
Mon Jun 22 21:33:02 BST 2009
dtrace -p now works
It still concerns me that you can kill -9 the dtrace and the child can be left stuck in an indeterminate state. Whilst thinking about this, I have a possible solution, namely to let the dtrace driver know what we are doing, and should the dtrace process die, we could force a SIGCONT (PTRACE_CONT) on the debuggee, so all is not lost, and we dont need to do what Solaris does in the /proc filesystem.
So, next up is either ustack() (and user space symbol tables), or the SDT driver. I am still a little confused by SDT and the "transform" keyword in a D script which provides struct-level access to kernel and user space params, but I know what I am expecting to see/work, so I just need to play.
SDT will be interesting - I have a plan to use the Instruction Provider to disassemble the kernel and intercept ADD instructions which apply to a global memory area corresponding to a struct of interest. I hope this will work for some/most of the desired areas, and if so, we have a way to intercept processes which trigger various kernel counters.
One thing to note with dtrace -c/-p - the way dtrace works is to get the process going and then to kick off the kernel rules engine. The kernel doesnt really know whats going in user space - you can elect to monitor probes for the process or any sibling (like truss -f or strace -f) by virtue of your predicates on the probes you write. This really is very powerful, since dtrace can (in theory) do everything strace and truss can do, but via lower level primitives.
Dtrace emulating truss is available as some scripts on the internet show, but some aspects of the way this is done is a little "clunky". I will experiment at a later date to see if we can more closely emulate strace/truss so that dtrace can be a one-stop-shop for these kinds of things.
New release available today whilst I go off and do some more real work.
Sun Jun 21 21:28:00 BST 2009
One step closer - dtrace -c works
You can now do this:
$ dtrace -n 'syscall::mu*:/pid==$target/{printf("%d",pid);}' -c df
dtrace: description 'syscall::mu*:' matched 6 probes
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 5874396 5097744 478248 92% /
varrun 255460 72 255388 1% /var/run
varlock 255460 0 255460 0% /var/lock
udev 255460 44 255416 1% /dev
devshm 255460 0 255460 0% /dev/shm
dtrace: pid 7689 has exited
CPU ID FUNCTION:NAME
0 87377 munmap:entry 7689
0 87378 munmap:return 7689
0 87377 munmap:entry 7689
0 87378 munmap:return 7689
0 87377 munmap:entry 7689
0 87378 munmap:return 7689
You may see some debug printf's I have left in there, but next
thing is to tackle the symtab (stack/ustack) stuff, and consider
library probes.
The -c stuff (and -p, which I havent tested yet) may have some issues. Theres a horrible sleep(1) in the child after a fork() to let the parent catch up with the child. I found the Linux kernel seemed to be broken in some areas (I believe threads which inherit ptrace() children have problems).
The sleep can be solved easily with some form of shm mutex or maybe even a futex, but I havent tried.
What is worrying is that in Solaris /proc fs, you can signal a child process to continue on its own if you, the parent, die. On Linux, this isnt there, so, consider:
$ dtrace -n ... -p <pid>If you kill -9 the dtrace process, then the target process may be left in an indeterminate state. This is true for strace. dtrace and strace can work hard to intercept SIGINT/SIGHUP/SIGTERM/etc, but cannot do anything about SIGKILL. I can think of a not-nice to partially solve this (or maybe we could put something into the kernel to handle this), but that is a reason why /proc/pid/ctl wins on Solaris.
Fri Jun 19 19:54:57 BST 2009
Linux user threads - bug ?
I found something interesting, which had been confusing me, having spent so long inside the kernel.
In Unix, we have the ptrace() system call - which is the basis of all debuggers. You can attach to a process and do things like set breakpoints or intercept events of interest, like signals.
The way the works is in one of two ways: if you are a debugger (which dtrace, gdb, strace, etc all are), then you fork yourself. The child notifies the kernel it is happy to be traced (via ptrace(PTRACE_ME)), and then forks+exec's the target process.
The parent debugger attaches to the target pid (it knows the pid, because we just forked). It does this via ptrace(PTRACE_ATTACH), and from then on can peek/poke the target process, or continue after an event.
So, here is the bug. In order to ptrace a process you need to attach to it. Two debuggers (eg gdb + strace) cannot attach to the same process at the same time.
Now, consider this. You are a process. You create a new thread. This thread forks() + execs the target. The new thread tries to attach to the process, but fails, because the master thread is considered the 'parent' of the child, and the thread you spawned is considered to be a distinct process - not a thread of the main process.
The issue here is that in Linux, threads are implemented as if you had forked a new process, but the thread shares the address space of the parent. This is not true of a proper multithreaded and POSIX compliant system. E.g. in Solaris, a thread is really a separate 'slice' of a process, and it shares the process id of its parent.
Linux tries to pretend threads exist, but this funky emulation seems to break how ptrace() works.
This is why I have had a hard time getting userland dtrace to work properly in this area - as I have been trying to understand what dtrace is doing and why the target process was stuck in the wrong state.
Now I understand, hopefully the "-c" and "-p" switches to dtrace can be made to work, and this will be a significant feature addition to Linux/dtrace.
Mon Jun 15 23:00:18 BST 2009
Next up...
$ dtrace -n syscall:::/pid==$target/{} -c "sleep 100"
This is how to trace the syscalls for a specific process we want to
launch - one of the last major features of Linux Dtrace which is missing.
Interestingly, I seem to be hitting an issue with pthreads vs fork/waitpid semantics...Time to read more on who gets the signals on Linux, vs solaris...
Mon Jun 15 21:58:37 BST 2009
Dependencies
I am not going to respond to emails for trivial support issues.
Thank you
Fri Jun 12 19:55:29 BST 2009
dtrace and the CALL instruction .. fixed
Amazingly, I was staring at a solution in the Linux kernel, but my brain has been hazy the last few days. I had implemented the Instruction Provider which has been a great help to find lots of samples of instructions I care about and try and get a feeling for what is going on.
The issue I was seeing is that when we take the INT3 and INT1 handler - for the initial breakpoint trap and then the single step trap, we would expect the kernel RSP to have moved, because we had just stepped a CALL instruction. But I wasnt seeing this. The "regs" structure on the stack at the point of exception for the same. This didnt make sense.
I hacked it for one 64b kernel, but the others hated my hack. (My hack involved looking at the stack dumps and trying to 'find' the magic values I wanted),
It worked fine on 32b kernels. Imagine an interrupt from kernel space taking place. The cpu pushes RFLAGS, RCS, RIP, in that order onto the existing stack. At this point, our code kicks in and pushes the full register set on to the stack (giving us a "struct pt_regs" structure we can point to and manipulate before returning from the interrupt).
Just above the flags should be the stack where we interrupted. This *is* true on a 32b cpu but not on a 64b cpu. I *think* the reason is that on 64b cpus, Linux sets up a TSS task switch so that on an interrupt, we have a private kernel stack, and this would hopefully avoid stack overflows if we interrupted a deeply nested part of the kernel.
That is why the 'regs' structure is always at the same address, and what we have in the r_rsp field is a POINTER to the original stack, not the stack itself!
A quick experiment and I could run:
$ dtrace instr::*call*:to trap every call instruction in the kernel and it worked. In addition
$ dtrace fbt:::works flawlessly on all three key 64b kernels I was trying, and I hadnt even broken the 32b kernel in fixing this.
Theres still a bogus issue or two to track down. Ctrl-C-ing dtrace can cause kernel problems - not sure why. If you Ctrl-C the dtrace binary, it sends an ioctl to the kernel to ask it to pull apart your probes rather than just exiting. Dont fully understand why they do that but it maybe for when you launch a binary from dtrace and it needs to kill or detach.
So, if this done, I can hopefully return back to user space and get userspace apps to be traced as well, and then we are done....
The Instruction Provider driver is hopefully going to be useful to implement a proper set of probes for the things that avoid patching kernel source.
Mon Jun 08 23:33:43 BST 2009
Instruction Provider now works
$ dtrace -n instr::*-nop: 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86742 mutex_trylock-nop:0xffffffff8045b1b7 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86933 lock_kernel-nop:0xffffffff8045c6d6 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86933 lock_kernel-nop:0xffffffff8045c6d6 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86742 mutex_trylock-nop:0xffffffff8045b1b7 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86933 lock_kernel-nop:0xffffffff8045c6d6 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 ...And another,
$ dtrace -n instr::*-lock: 0 86883 _spin_trylock-lock:0xffffffff8045c51c 0 86883 _spin_trylock-lock:0xffffffff8045c51c 0 86883 _spin_trylock-lock:0xffffffff8045c51c 0 86883 _spin_trylock-lock:0xffffffff8045c51c 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86883 _spin_trylock-lock:0xffffffff8045c51c 0 86883 _spin_trylock-lock:0xffffffff8045c51c
Mon Jun 08 21:43:41 BST 2009
The Instruction Provider
Its hopefully useful and innovative, but my prime goal here was to provide a way to debug targetted opcodes, which are not necessarily in the first location of a function.
I ran a trial - I went from 25,000+ probes to 300,000+ probes. I quickly crashed the kernel (hey, it was a first effort), but hope to debug quickly.
I will probably make it a load-time option to enable it as it really can be destructive to the system under test with so many probes firing. But, if it works, it will also be a good stress of dtrace on linux.
More later!
Sat Jun 06 20:55:00 BST 2009
E8 again...
This is what a relative call does:
E8 nn nn nn nn CALLR offsetWe have a 32-bit relative offset from the next instruction. As a normal subroutine call, this is what should happen: decrement RSP, move return address to (RSP).
Now this is very strange: on the 2.6.9 kernel, we single step the call. The initial breakpoint hits and we look at the RSP. As we single step over the call, we expect the RSP to have decremented by 8 (64-bit return addr).
And it does.
But there is a gap between the RIP/CS/EFlags for the trap exception and the return address of the stepped over instruction. Look at the following debug output:
INT3 PC:ffffffff80110a48 REGS:0000010008eedea8 CPU:0 BEFORE: Regs @ 0000010008eedea8..0000010008eedf50 CPU:0 r15:000033526a5f59d7 r14:ffffffff804dc2a0 r13:0000010008165290 r12:ffffffff804dcc00 rbp:0000010008165290 rbx:0000010014f84d40 r11:ffffffff80110b5a r10:0000000000000038 r9:0000000001200011 r8:0000010008eec000 rax:0000010015e803b0 rcx:00000000c0000100 rdx:0000000000000000 rsi:0000010008165290 rdi:0000010015e803b0 orig_rax:0000010015e803b0 rip:ffffffff80110a49 cs:0000000000000010 eflags:0000000000000047 rsp:0000010008eedf58 ss:0000000000000018 00000000006f2840 00000000006f0a00 INT3 ffffffffa0236cff called CPU:0 good finish int1 PC:ffffffffa025aff0 regs:0000010008eedea8 CPU:0 AFTER: Regs @ 0000010008eedea8..0000010008eedf50 CPU:0 r15:000033526a5f59d7 r14:ffffffff804dc2a0 r13:0000010008165290 r12:ffffffff804dcc00 rbp:0000010008165290 rbx:0000010014f84d40 r11:ffffffff80110b5a r10:0000000000000038 r9:0000000001200011 r8:0000010008eec000 rax:0000010015e803b0 rcx:00000000c0000100 rdx:0000000000000000 rsi:0000010008165290 rdi:0000010015e803b0 orig_rax:0000010015e803b0 rip:ffffffffa025aff1 cs:0000000000000010 eflags:0000000000000047 rsp:0000010008eedf50 ss:0000000000000018 ffffffffa0236d05 00000000006f2840Here we get an INT3 trap and you can see RSP is set to 0000010008eedf58. The "Regs @" entry in the first case shows the extent of the 'struct pt_regs'. Note that between the printed rsp and the end of the regs area is a difference of 8 bytes. This shouldnt be there.
After the INT3 breakpoint trap, we single step (int1), and look again at the Regs@ and RSP field. The regs are at the same location - even although we just executed a call instruction and pushed the return address on the stack. In the INT1 register dump, RSP is correctly decremented by 8. Here we have no gap, but for INT3 we do have a gap.
I have been reading and re-reading exception handling on the web and Intels docs and there is no reason for the gap.
What is puzzling is that it works on the other kernels, but INT3 is pushing two extra words on the stack - more than I expect.
Another interesting issue is that when I look at the kernels I have and search for E8 call instructions at the first instruction of a probe, only this one seems to have one. Later kernels (or GCC's) dont seem to emit the instruction, so, if I dont understand what is going on, there is a chance that you will hit one and panic your kernel.
Strange. I am going to put out a new release (at least this fixes the compiler issues people have been complaining about, and hope no-one has an E8 in their kernel).
Fri Jun 05 22:44:39 BST 2009
0xfa and 0xfb - STI and CLI
For instance, CLI, which clears the interrupt enable flag will ignore interrupts over the following instruction (as will STI, or, maybe only STI does).
What was happening if process 1 -- init -- would die, and the kernel would scream at me.
I have solved this by pure emulation - no point in single stepping these instructions, and just handle without a single step - which is better from a performance point of view.
I am running on 3 64-bit vmware kernels. 2.6.27.8 runs beautifully. 2.6.27-7-generic - an Ubuntu kernel - runs flawlessly but strangely slowly when all probes are enabled. I would expect both to run at the same speed, so either the first is running fast when it shouldnt or maybe the latter is flawed. (I think the slowdown may be due to calls to mcount which is doubling the overhead per function in the kernel).
The other is 2.6.9 - AS4 kernel. Just shown that to hang, so I need to debug that before making a release.
(32-bit kernels appear to work fine, and the compile issues are resolved).
I have added a special flag to FBT which is interesting/useful.
$ load.pl -opcode
will prefix each probe name with the first byte of the opcode at the probe, so that it is easier to diagnose where the flaws are. Single stepping the breakpoint for a probe works, but many instructions have to be handled specially, such as jumps, calls and rets. So being able to find the offending instruction or scenario is helpful.
This relates back to a prior blog entry where I talked about how nice it would be to have an instruction prober where we could probe by instruction type, rather than function. E.g. imagine probing by virtue of every LOCK instruction. Or REP or CLI. Get the picture?
How about JMP/JMPNE/JMPEQ instructions? That could be ideal for low level kernel profiling -- how many times is a jump taken in *this* function.
This is easy to do - just need a variation of the FBT disassembler which doesnt try to instrument the entry/exit of a function, but the body.
I may try and get this in on the release after this one, just to see what it looks like. Stay tuned.
Hoping to release this weekend or tonight if I can resolve the AS4 issue.