Fri Nov 20 21:42:29 GMT 2009

Tail recursion woes


Dr. Dobbs article

This is driving me nuts. GCC does tail recursion optimisation. That is very nice, and means that if we have something like:

int func(...)
{
	do_stuff();

	return func2(...);
}

then the func2() call can be converted into a 'goto'.

The problem is that this means that if we put a breakpoint in gdb, or a printf, in func2(), we lose the stack frame for func(), and it appears that func2() is being called from the caller of func(), rather than from func() itself.
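A minimal sketch of the situation follows. The function bodies are made up for illustration; only the shape of the call matters. GCC's sibling call optimisation is normally enabled from -O2 upwards, so building with -fno-optimize-sibling-calls (or at -O0) should keep func() in the backtrace.

```c
/* Hypothetical bodies; only the shape of the call matters. */

/* noinline (a GCC extension) keeps func2() a real function, so it can
 * still carry a breakpoint. */
__attribute__((noinline)) int func2(int x)
{
	return x * 2;
}

int func(int x)
{
	x += 1;			/* stands in for do_stuff() */

	/* Tail position: nothing is left to do in func() after this call,
	 * so an optimising build may emit a jump instead of a call, and
	 * func()'s frame is gone by the time func2() runs. */
	return func2(x);
}
```

Comparing "bt" in gdb at a breakpoint on func2(), once built with -O2 and once with -O2 -fno-optimize-sibling-calls, should show the difference.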

I wish they wouldn't do these "nice" things. It makes debugging a pain, and I am tempted to go back to earlier/very old versions of GCC to stop this warfare, where "what used to work" stops working and you have to fight the toolchain.


Posted by Paul Fox | Permalink

Fri Nov 20 20:40:52 GMT 2009

Motif finishing up


After my last write up, of finishing the Motif rewrite for CRiSP, I have made more progress. This centers around the things I had forgotten.

For example, the 'Scale' widget which is used in the color selector dialog needed to be implemented (from scratch). That took about a day (nice when things go according to plan).

Then I hit some issues with the default size of a combo field. That is fixed.

Next up was the protocol manager for a shell widget. What is that you say? Think of XmAddWMProtocolCallback(). Without this, if you click on the window manager "X" at the top right of a window, then your app is quickly terminated (the TCP connection to the X server is severed).

Took me a while to figure out / remember how this all worked. But, suffice to say, unless you post an Atom/Property on the window (WM_DELETE_WINDOW, listed in the window's WM_PROTOCOLS property), the window manager will not ask politely, but just brute force the termination of the app.

But to put a property on the application shell window is not quite so easy, especially when we normally call the XmAddWMProtocolCallback() against a *widget* and not a *window*. A widget may not exist on screen at the time we call it, and that is why there is complexity in the Motif library - you are allowed to register interest in these window manager protocol messages, before the window is 'realised' (i.e. before being mapped to the screen). When the window is mapped, the appropriate property is posted on the desktop to allow the window manager to see what is going on.

Of course, doing something "later" in an application means creating some form of data structure for later use, and then making sure we don't suffer a memory leak.
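The "remember it now, post it when the window exists" idea can be sketched roughly as below. This is not the CRiSP or Motif code: struct shell, post_protocol() and the other names are hypothetical, and post_protocol() only counts calls where real code would use XInternAtom() and XSetWMProtocols() on the realised window.

```c
#include <stdlib.h>
#include <string.h>

/* One protocol request remembered until the shell is realised. */
struct pending_proto {
	char *atom_name;		/* e.g. "WM_DELETE_WINDOW" */
	struct pending_proto *next;
};

/* Hypothetical shell widget: just the state this sketch needs. */
struct shell {
	int realized;			/* non-zero once a window exists */
	int posted;			/* protocols that reached the window */
	struct pending_proto *pending;	/* requests made before realise */
};

/* Stand-in for the real work (XInternAtom() + XSetWMProtocols()). */
static void post_protocol(struct shell *s, const char *atom_name)
{
	(void) atom_name;
	s->posted++;
}

/* Register interest in a protocol; 0 on success, -1 if out of memory. */
int shell_add_protocol(struct shell *s, const char *atom_name)
{
	struct pending_proto *p;
	size_t len = strlen(atom_name) + 1;

	if (s->realized) {
		post_protocol(s, atom_name);
		return 0;
	}
	if ((p = malloc(sizeof *p)) == NULL)
		return -1;
	if ((p->atom_name = malloc(len)) == NULL) {
		free(p);
		return -1;
	}
	memcpy(p->atom_name, atom_name, len);
	p->next = s->pending;
	s->pending = p;
	return 0;
}

/* Free the queue, posting each entry first if `post` is non-zero. */
static void drain(struct shell *s, int post)
{
	while (s->pending) {
		struct pending_proto *p = s->pending;

		s->pending = p->next;
		if (post)
			post_protocol(s, p->atom_name);
		free(p->atom_name);
		free(p);
	}
}

/* The window now exists: flush everything that was queued. */
void shell_realize(struct shell *s)
{
	s->realized = 1;
	drain(s, 1);
}

/* Destroyed before being realised: drop the queue so nothing leaks. */
void shell_destroy(struct shell *s)
{
	drain(s, 0);
}
```

The point of routing both paths through drain() is that the queue is freed in exactly one place, whether the shell is eventually realised or destroyed first.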

My implementation of window manager protocols isn't perfect, but sufficient for what I need.

Why do I bother? I don't know, but what I know is that CRiSP with Motif, statically linked, occupies 3MB of code memory. With the new Motif replacement, it is about 2MB.

When CRiSP was first written, this was about the size of a large floppy disk (and 40MB - that's megabytes, not gigabytes - was huge). Now CRiSP fits in the L2/L3 cache of a CPU.



Sun Nov 15 18:51:03 GMT 2009

Menu updates


My experiments to replace the Motif library code in CRiSP with native code hit a major stumbling block. The first few days of effort were extremely good - lots of progress and deterministic behavior.

But the menuing code has taken about 3 months - I hope this is now complete. Why?

Any number of reasons:

  • I am losing my 'touch'
  • It is difficult
  • There are lots of fiddly bits to get right

You choose. The issue with menus is the way input focus moves around from one widget to another. There are lots of scenarios to get right, and fixing one of them would result in some existing feature suddenly breaking - like a see-saw as the code came together.

One problem I hit is that XtAddGrab/XtRemoveGrab doesn't handle double registration of a widget particularly well.
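One possible guard against double registration (not necessarily what CRiSP does) is to count registrations per widget and only touch the toolkit on the first add and the last remove. In this sketch the table, its size and the function names are all hypothetical; the return value tells the caller whether to call XtAddGrab() or XtRemoveGrab().

```c
#include <stddef.h>

#define MAX_GRABS 16		/* arbitrary limit for the sketch */

struct grab_entry {
	void *widget;		/* opaque widget handle */
	int count;		/* how many times it has been registered */
};

struct grab_table {
	struct grab_entry e[MAX_GRABS];
};

/* Returns 1 if the caller should now call XtAddGrab(), 0 if the widget
 * was already registered, -1 if the table is full. */
int grab_acquire(struct grab_table *t, void *widget)
{
	struct grab_entry *slot = NULL;
	int i;

	for (i = 0; i < MAX_GRABS; i++) {
		if (t->e[i].count > 0 && t->e[i].widget == widget) {
			t->e[i].count++;
			return 0;
		}
		if (t->e[i].count == 0 && slot == NULL)
			slot = &t->e[i];
	}
	if (slot == NULL)
		return -1;
	slot->widget = widget;
	slot->count = 1;
	return 1;
}

/* Returns 1 if the caller should now call XtRemoveGrab(), 0 if other
 * registrations remain, -1 if the widget was never registered. */
int grab_release(struct grab_table *t, void *widget)
{
	int i;

	for (i = 0; i < MAX_GRABS; i++) {
		if (t->e[i].count > 0 && t->e[i].widget == widget)
			return --t->e[i].count == 0 ? 1 : 0;
	}
	return -1;
}
```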

The code, whilst trying to be purist and object oriented, in the end had to be a little dirty - one class having too much intimate knowledge of what it is dealing with (a menu has menu items, which are mostly separators and buttons, for instance).

Here are some scenarios to consider:

  • Click and reclick the menu bar button (should dismiss menu)
  • Use keyboard to navigate a menu and popup sub menus, and then popdown submenus.
  • Use ESC to popdown a menu
  • Click outside the menu to dismiss it; click in another window to dismiss it too
  • Click on a menu item, and have the menu popdown, and the callback invoked

May seem like simple stuff, but getting it all working is difficult, especially when grabs are put in place - suddenly input goes to the wrong widget, and everything you had working, stops working.

Why bother? Why not just use Qt or Gtk?

Because I don't want to use them. As good as those toolkits are, they are not available everywhere, and I don't want dependencies on other toolkits - toolkits which have a very active development life.

This is akin to the dtrace problem: do you develop software for the latest and greatest kernel/distro out there, or do you go back to old releases and ensure your software works with them?



Mon Oct 19 22:33:04 BST 2009

Menus...implementing them (ipod web app)


Still busy working on my Motif replacement for CRiSP. Much of the grunt work was done a long while ago, but menus were the thing I put off until a couple of months back.

Implementing menus is frustrating. (The input field widget took about 2 days to implement from start to finish, and the result is more functional than what it replaces). By contrast, menus have taken a couple of months. Much of the original code worked up front.

What makes menus tricky are a number of things but mainly related to input focus and 'appearing to do the right thing'. Menus can popup with a mouse, and then be navigated into sub menus with the mouse or keyboard. I delayed getting the input focus problems solved until late in the implementation, which, in turn, broke much of the existing logic. (Events firing to the menu button popping up the menu or the menu or menu items, etc). Just as bad was restoring the status quo when selecting a menu item or dismissing the menu.

To make life palatable, I have a nice suicide-timer - after 25s, a child does a "kill -9" on the parent, which avoids the tedium of trying to unwedge the X server.
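A suicide timer along these lines can be sketched in a few lines of C. The function name is made up, and this is an illustration of the idea rather than the code actually used in CRiSP.

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Fork a child which kills this process after `seconds`.
 * Returns the child's pid to the parent, or -1 if fork() failed. */
pid_t start_suicide_timer(unsigned seconds)
{
	pid_t parent = getpid();
	pid_t child = fork();

	if (child != 0)
		return child;		/* parent, or fork() error (-1) */

	/* Child: wait, then shoot the parent if it is still around. */
	sleep(seconds);
	if (getppid() == parent)	/* parent has not already exited */
		kill(parent, SIGKILL);
	_exit(0);
}
```

Calling start_suicide_timer(25) early in main() of a debug build gives the 25s behaviour described above; a clean exit path can kill the returned child pid so it does not linger.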

It's all very close, and I am just busy ensuring all combinations work properly. (Mainly nested menus as they popup/get dismissed).

Many of the problems are my own misunderstandings: I spent a good few days trying to get XGrabPointer() to work for me, only to eventually realise what I needed was XtAddGrab(). [Someone on a forum mentioned that if you weren't confused by XGrabPointer/XGrabPointer, then you didn't know what you were doing!]

XGrabPointer is to do with freezing the X server, typically used for drawing type apps. For menus, you sort of need a mixture of the two. (You want to intercept a menu dismissal when you click outside of your own application).

(I was debating using the iPod Touch to fix my X server hanging problems, e.g. have a web server running, which responded to trivial HTTP requests, and on receipt of a request, could douse the rogue hung application).

Ok, here it is - an ipod touch web server to help debug X11 apps...

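#! /usr/bin/env perl

use strict;
use warnings;

use IO::Socket;

sub main
{
   # Any connection to port 8080 (e.g. from the iPod browser) triggers a kill.
   my $sock = IO::Socket::INET->new (
      LocalPort => 8080,
      Type      => SOCK_STREAM,
      ReuseAddr => 1,
      Listen    => 10) or die "Cannot listen on port 8080: $!\n";
   while (my $client = $sock->accept()) {
           print "Killing hung app...\n";
           # Drop the grep (and its shell) from the ps output, so we
           # do not pick up our own pipeline as the process to kill.
           my @lines = grep { !/grep/ }
                `ps ax | grep /home/fox/crisp_v9.5/bin.linux-x86_32/crisp`;
           # A positive signal number kills the process itself; a negative
           # one would signal a whole process group instead.
           kill(9, $1) if @lines && $lines[0] =~ m/^ *(\d+) /;
           # Send a minimal reply so the browser does not sit waiting.
           print $client "HTTP/1.0 200 OK\r\n\r\ndone\n";
           close($client);
   }
}
main();
0;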


Sat Oct 10 15:53:04 BST 2009

dtrace update


Not much to report; I have hopefully fixed compile issues on 2.6.31 kernels (haven't proved it on 2.6.32). The tweaks from one kernel to another, just to prove it compiles properly, do get tiresome.

Some people are reporting issues on GCC 4.4 and later glibc's. If you get these issues, let me know. glibc seems determined to break existing apps (GCC isn't quite so bad).

User space stack tracing is probably hosed, because getting the user space stack, depending on the trap context, is fiddly and something I hadn't finished.

Need to catch up on my crisp Motif work (it all works, but it's not quite perfect enough to release the crisp code yet).



Tue Oct 06 23:16:00 BST 2009

Boo hoo...


Hard drive on my ftp server has gone - so that means some links and crisp downloads are out of action til I have a chance to put the backup into place and populate it with something useful, like, err, maybe an operating system.

As a BTW, I decided to implement 'inertial scrolling' on fcterm and crisp. So, you may find (eventually) a new release with the feature where the faster you scroll the wheel on the mouse, the faster it scrolls. (Works nicely on fcterm, but may need to do more surgery on crisp since it only works for X windows, and needs to work for Mac+Windows too).
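As a rough sketch of the idea: the time between two wheel events can be mapped to a number of lines to scroll, so a fast spin moves further per click. The function name and the thresholds here are invented for illustration and are not the values used in fcterm or crisp.

```c
/* Hypothetical thresholds: map the gap between two wheel events (in
 * milliseconds) to the number of lines to scroll for this event. */
int wheel_scroll_lines(unsigned ms_since_last_event)
{
	if (ms_since_last_event < 20)
		return 8;	/* wheel is spinning fast */
	if (ms_since_last_event < 50)
		return 4;
	if (ms_since_last_event < 120)
		return 2;
	return 1;		/* slow, deliberate scrolling */
}
```

Because the mapping only needs a timestamp per wheel event, the same function could sit behind X11, Mac and Windows event handlers.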



Sun Oct 04 10:08:14 BST 2009

More Apple Insanity


I wrote a short while back about the iPod Touch - nice hardware, shame about the software. I will continue my rant...

Sometimes, my movie playlists will play back to back, but mostly not. It's totally erratic, and simply feels like a 1st grader programming error. It's almost like an uninitialised variable is causing it to work or not. Sometimes switching it on/off helps, or a resync, but who knows.

Then, the other area of total insanity is the app store. The app store integration is brilliant - ability to browse, and waste a few minutes seeing what's available is great. Then you can download free software or pay for software.

But...why on earth does the ipod go into a mode where *no* applications will run? The startup screens almost show but then the app aborts or exits. Who knows what - because Apple decided not to show you a reason why the app quit. Then, some time later, they will all work again. This is clearly a bug in Apple's firmware. I have downloaded maybe 15-20 apps, and I can believe some have bugs, but not on the initialisation of the app. Why it should go into "not going to run anything" mode and then recover, I don't know.

I downloaded an RSS reader - great software, but I am fearful it creates a lot of little files for the unread news - maybe that's tickling a bug; maybe the filesystem is fragmented - you cannot see the filesystem so you are SOL to know what's going on.

What is annoying is that lots of people on the web report the same thing and as far as I can tell, no one has clued up on the real causes, and Apple doesn't admit to this issue. Their 'walled garden' is full of street-corner muggers.

At least it plays movies and music - my main desire, but the app crashing is unforgivable. At the least, some form of diagnostic tool to figure out what is going on would be nice. Maybe that's a tool that needs to be written, or maybe I will find out on google somewhere what is going on.



Sat Sep 26 23:09:10 BST 2009

The year is 1992


Around about the year 1991, I acquired my first HP calculator - an HP48SX. If you have never seen or used one of these calcs, read on.

It was the equivalent of an iPod, in its day.

Now armed with an ipod touch, and visiting the ipod store and getting a feel for this new device, I came across an HP48 calculator emulator for the ipod.

Let's go back to 1992. In 1992, I wrote an emulator for the HP48 - made available over Usenet, and, it worked - it could do most HP things. It was based around X11, and the PCs of the day were very lowly, but, still, the calculator emulator was fast enough.

I still have the source code to this. I believe this code was taken and used as the basis for other calc emulators, and there are now excellent emulators for Windows (haven't checked Linux, but should be easy).

Now, let's go back to 2009. The App store version of the emulator is great - full power of an advanced RPN graphing and symbolic algebra calc. The pixels on the ipod are enough to do it, but only just.

This app is available as source code. See here for the announcement of the i48 app. The Github entry for the source is http://github.com/dparnell/i48/tree

This is interesting for two reasons. One, this is a simple iPod app, and hence, is a useful example of how to create an app. Complete with XML files and bitmaps for the app store. (Interestingly, the Objective-C code is tiny, which is a testament to a good api on the ipod touch).

The other interesting point is that the emulator code bears the hallmark of my original donation to the community. I don't know if this code was based on my original or not - I started to lose interest in the HP as CRiSP took more of my time in the early 90's, and was vaguely aware of a Windows port (bear in mind this would be Windows 3.x or Windows 95, which I detested beyond belief - you needed to reboot the system if any app crashed if you wanted a nice life).

The attributions in the code date back to 1994, but there are signs of similarities. I cannot remember how I got started on the emulation. I know I picked up a possible HP28 (predecessor to the HP48) emulator and got some internal docs from HP at the time (I probably still have the emails dating back then), but it's nice to know this excellent piece of hardware and the emulators live on - and now, I can carry on with carrying an HP + iPod together.

Ob complaint: why is the HP50 so expensive in the UK? It's more than 100 GBP vs about $90-100 in the US. Needless to say, HP haven't received any money from me because of this obscene pricing, and I suspect the number sold in the UK is pitiful, which is a shame.

So, what next? Who knows. I would like to write something for the ipod, but I don't have time, and there's a lot of ideas to wade thru from the existing app base on the app store. I just wish the ipod wasn't so locked down (see complaints in the prior blog).



Fri Sep 25 23:51:19 BST 2009

Apple are insane


I have a new ipod touch - nice device. There are lots of nice things about it, but Apple are totally insane. How they let this out of the labs I really don't know.

(1) Nested folders (playlists) don't appear to work on the ipod touch.

(2) On the touch front screen, is a "Movie" button. Search the web to find out how hard it is to get anything to appear in here. On the touch, movies need to be specified as "Music Video" or "TV Show". What? Duh?

(3a) If you make it a "Music Video", it appears in the music playlist, not the video one. Are they insane? Yes. I have 92 playlists for music and upwards of 100 playlists for films (can only fit 20-30 on the ipod). So, I have to scroll thru the music to find the films, unless I use an initial letter, e.g. "A", to group them together. Are they insane? Yes.

(3b) A music playlist can be played portrait or landscape. A TV Show can only be played portrait. This wouldn't normally matter except in portrait mode, you can see the film title when the popup volume controls appear but not in landscape mode. Are they insane? Yes

(3c) A TV Show playlist will not continue from one part to the next. I record from the tv to DVD across to a PC in 5min fragments. A typical film is 20 5min fragments. So, after each 5min fragment, my tv show skips back to the contents screen instead of continuing on to the next episode/fragment. All the parts of the film are in a playlist. So, Apple, are you insane? Yes.

(4) I knew you could get an onscreen control on the ipod screen to see position/volume controls, and it has taken me absolutely ages to know what to do - tried tapping, double tapping all parts of the screen, but it was so unobvious. Are you insane Apple? Yes.

(5) ipod touch won't charge on some alarm clock devices which let you plug in an ipod. The new 3G touch won't use my expensive and overpriced remote control from the ipod classic. Apple can be so money grabbing - they are a business, after all, but now, with so many ipods out there, and so many 3rd party devices, it is totally unclear what can work with what.

(6) App store apps which crash. I can appreciate software has bugs in it, but when I run an app, which then crashes, I really would like to know this and not have the ipod return back to the menu screen with no clue about what happened. What happened to the "bomb"?

(7) The ipod touch screen gets too smudgy; I can live with that. The screen is too dark - an OLED would be nice, but, watch a film set in a dark room, it's almost impossible to see what is happening. To change brightness requires too many actions. Without a fast-fwd or reverse button, skipping over adverts is painful - trying to get your fingers in the right place (assuming you can work out how to popup the on screen display).

(8) Why are the classic and ipod touch/phone so different - I mean the way iTunes treats them? This is insane. They had a very good GUI and linkage with iTunes, but on the touch, it's "let's be as different as we can". I could go into more details, but...

(9) My ipod touch is much louder than the classic, for which I have spent so much on trying different headphones so I could hear quiet passages of films in a noisy environment.

(10) iTunes (9.0 and 9.0.1) is *insane*. Plug in your ipod. With 2000+ 5min film fragments, visit the device 'Movies' folder. It wants to open every one to display a screen shot image so you can select what to sync. iTunes mushroomed to greater than 1GB of RAM, and spent cpu cycles like no tomorrow. No way to turn off the image display. Fortunately, the "TV Shows" tab doesn't do this. Insane.

(11) ipod classic shows up as a mountable filesystem. ipod touch doesn't. iTunes/Apple have pulled a fast one so you cannot see the device as a mountable filesystem. I presume this is where the touch/phone cracking utilities come in. They have made it hard to do certain things, and "dtrace" (yes, dtrace) isn't able to monitor some aspects of iTunes deliberately (Adam Leventhal reported on this a while back, but dtrace is still broken). This is easy enough to fix/work around, but am curious. This means fixing a broken ipod touch requires the service of an even better expert than a normal ipod has, and 3rd party tools become fewer and further between.

(12) Microsoft released the Zune HD. Let's hope the Zune really competes with the ipod, because Apple need a kick to be innovative. Apple stopped being innovative when they opened the Apple Store.

Despite my rant above, I am happy with the touch, but it has a long way to go to ensure it works for what it was intended for (music and films).

BTW there is a big difference between a classic and a touch: if your library fits into a classic or touch, life can be easy. If it doesn't, and it won't on a touch, then micromanagement of files is insanely tedious and difficult with iTunes. They have made the classic mistake of creating a GUI and touting how easy it is to use. It isn't. You are simply mortal, Apple.



Sun Sep 13 18:59:05 BST 2009

Trials, tribulations, horrors...


Horrors...

I've been on holiday (San Francisco, LA, Vegas...), and my first trial is the MGM Grand Hotel. They pulled a fast one. Their customer services leave a lot to be desired, and their web site is appallingly awful.

We booked a 3 day stay, and the price was very good. ... Til we checked out and then we found their prices had doubled and doubled again (we arrived Thu before Labor day and stayed the Fri + Sat night). Nowhere on the web booking did it say that each night was a different price. What is worse...when we arrived, they got me to sign the obligatory credit card form, and hiding (in plain sight), was the room rate for each night, going up in exponential fashion. So, partially my fault for not noticing that, but the person (might not have been human, and cannot put down the word I want to describe) didn't point this out. (It would have been too late anyhow).

Other than that - holiday was great.

Trial

I upgraded my main server today to Ubuntu 9.04 - just a short time away from the 9.10 release, but I wasted much of the day getting VMWare Server 1.0.x working on the 2.6.31 kernel. (I gave up with the Ubuntu kernel after it wasn't installed properly after the upgrade). VMWare is a pain - at least there is source code to the drivers, and eventually I got it to compile, but it generated a kernel panic when vmware was started (more than likely, my code changes were a little too dirty). Oh well....

Tribulations...

So, I decided to try switching back to VirtualBox (2.1). On startup, it told me 3.0 was available so I have upgraded to that. Interestingly, I noticed that 'rdesktop' works nicely for VirtualBox (my previous complaint was that the X GUI was horrible when dealing with high volume output across a low speed wifi network), and this may help enormously to solve that problem, and get me away from VMWare. I really didn't want to suffer the VMWare Server 2.x release, and now I can live in the freeware world for virtualisation.

DTrace for 2.6.31 Kernel

As per normal, the new kernel doesn't compile the dtrace code due to the number of changes in the kernel. Fortunately, the changes look much easier than the ones VMWare had to contend with, so will try and fix this shortly.

CRiSP without Motif

What has been occupying me for the last few weeks was migrating CRiSP away from Motif - it's nearly finished - just need to do some final touches to the menu system and menu bar, and it looks/feels much better.

iPod Touch

Had been eagerly waiting for the new ipods to come out (I own an iPod Classic, but not an iPhone - they are simply too horrendously expensive), and although the new iPod isn't technically much better than the older ones, have ordered one - so I can watch films on the way to work (the Classic screen size and volume has always caused me an issue). I am thinking of toying with writing some apps for the iPod....maybe I could port CRiSP or DTrace to it :-) (But that might be pointless tho!)

So, if things go quiet for a while...I am busy getting a high score or just hacking on the ipod or crisp ... or dtrace...



Sat Aug 15 18:43:24 BST 2009

CRiSP + Motif (no dtrace)


I am taking a short rest from dtrace - it's been doing my head in (ustack / dwarf; see previous postings).

Am on holiday from next weekend for a couple of weeks, and I want to do something more rewarding, so am switching back to CRiSP for a while to kick some tyres.

First up is finer control of file auditing - you can tell CRiSP to keep track of files you edit in an audit trail; useful for those times when you forgot where you placed a file.

I've fixed some other customer reports.

I keep on staring at ribbon bars, and before I fully tackle this (there's some pre-alpha code in CRiSP to do this, but it's not ready for primetime), I am revisiting the Motif factor. CRiSP is built on Motif and over the years, it has driven me insane. In recent weeks I have fixed some uninitialised memory refs in Motif which could cause core dumps, but I have always had a goal to remove it totally. Many of the widgets are native Xt widgets, and the few remaining just require a bit of debugging to get rid of it totally - thus making the code more supportable, and ready for other things. (And freeing up a fair amount of memory).

CRiSP has some theming support and in getting rid of Motif, it will be easier to complete that, and finally let menu items have icons in them.

People have also asked for freetype font support (which exists in CRiSP in a semi undocumented fashion). So, if the Motif removal goes well, then freetype can be made available to most of the widgets.



Sun Aug 09 00:16:25 BST 2009

Painful dwarf


Progress is slow, but positive. I've spent the last week or two trying to find the user stack and the PC. It's easy to get the user stack, but the PC proved elusive; I now have a hack to find it.

Why?

Imagine the SYSCALL instruction fires. This is a special instruction in the amd/x86 cpus which moves from user mode to system mode, *without* pushing the return address on the stack. The Linux kernel, immediately after the transition (entry_64.S), puts the user space SP into the thread task area, but the PC is hiding. On entry to the kernel side of a syscall, it is in the RCX register, but by the time we hit a probe, e.g. sys_open(), we are miles away and the pt_regs array isn't accurate. At the point of probe, we force a breakpoint trap (luckily, only our code executes at this point, so we don't have to consider nested interrupts and blowing the state areas in the thread stack).

What makes this tricky is getting everything to work at once - anything even slightly wrong just gives bogus results -- stack traces which are not accurate or totally missing.

I am better now - I seem to get the first two stack frames, but the third one is elusive (I am either miscomputing the dwarf frame info or misapplying the result to find the next frame; for a third frame, it's frustrating since we have gone thru the same looped code twice, so why the third is problematic is not clear).

The code so far is fairly horrid, with lots of experiments in there, and no 32-bit version yet done. My biggest fear is if any of this is subtly dependent on kernel releases (I think it is not), so that would be one weight off my chest.

(Kernel releases are subtly different in syscall/interrupt handling, and also structure layout for the user/process/thread, but I don't think we care too much, yet).



Wed Aug 05 23:43:51 BST 2009

slow dwarf


Been busy doing some CRiSP updates over last few days, so backed off a little on dtrace, but trying to get back into the dwarf issues.

Alas, the current Windows CRiSP release has black arrows on the scrollbars... to be fixed this weekend. Nuts.

I am trying to get this to parse properly:

$ build/dwarf /lib/libpthread.so.0
....
CIE length=00000014
  Version:              01
  Augmentation:         "zRS"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address reg:    0x10
  Augmentation Length:   len=0x01 1b
R encoding 1b (kernel)

2c38 FDE len=7c cie=001c pc=e0ff..e109 tpc=ffffffffffffffff
0000: dwarf.c: unsupported DW entry 0xf 12

I am working thru the various opcodes, being able to parse, but with no guarantee the semantics are correct (that's the next phase).

libpthread.so.0 is where the open64 syscall is located when I do my ustack() test against the perl interpreter.

In theory the parsing shouldn't matter, as in the kernel, we skip over blocks of the dwarf instructions to find the matching block, but it helps me to relax a little and better understand this stuff so I can tackle why some SYSCALL instruction blocks aren't being handled properly.

People are sending me bug reports on 2.6.30.* kernels (fixed an issue with 2.6.30.4, but now there's a 2.6.30.5 - I cannot keep up with these releases and the gratuitous kernel code changes on each release!). So, just trying to stay above water, but progress is slow.



Tue Aug 04 20:44:37 BST 2009

mail problems


For reasons I don't fully understand, some of my mail is not getting out. My mail macros and bits/pieces are breaking in some areas and I hadn't realised things were not getting out.

If you see no response from me, then this could be the issue - just remail me; if you see dup emails from me, it's me attempting to fix the issue.



Sat Aug 01 13:11:54 BST 2009

dtrace linux status - the dwarfs


I've been slowly getting the DWARF stack dumper to work. It works for some system calls/probes but not for others. At issue appears to be accuracy in the dwarf.c code - looking at the gdb source for stack walking is interesting as it highlights a number of issues, including trampolines and exception stacks.

A particular issue I am having at present is the sys_open syscall. gdb can show a stack trace but my kernel code cannot find the appropriate dwarf frames mirroring where we came from. So I need to put in more effort to work through the use case scenarios.



Sun Jul 26 12:01:02 BST 2009

Dwarf .. nearly working.


...
  0   3004              sys_nanosleep:entry
              0x7f76eab2e104: libc-2.6.1.so`sleep+0x94
              0x7f76eb55a576: libperl.so.5.8.8`Perl_pp_sleep+0x56
              0x7f76eb51d1ee: libperl.so.5.8.8`Perl_runops_standard+0xe
              0x7f76eb4c7f4a: libperl.so.5.8.8`perl_run+0x30a

  0   2482           sys_rt_sigaction:entry
              0x7f76eab2e17a: libc-2.6.1.so`sleep+0x10a
              0x7f76eb55a576: libperl.so.5.8.8`Perl_pp_sleep+0x56
              0x7f76eb51d1ee: libperl.so.5.8.8`Perl_runops_standard+0xe
              0x7f76eb4c7f4a: libperl.so.5.8.8`perl_run+0x30a
...

The above is the stack trace of Perl, which has no decent frame pointers, yet the stack trace agrees with what gdb sees. (I had to cheat, since 'main()' is missing above).

It's nearly there, but I need to resolve some more issues, and then we should have a viable ustack() call even on omit-frame-pointers applications. (Still need to do the 32-bit equivalent of the above).
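For readers unfamiliar with unwinding without frame pointers, here is a simplified sketch of one step of such a walk. It is not the dtrace driver code. It assumes the usual x86-64 layout where the DWARF info gives the CFA as "stack pointer + offset" and the return address sits at CFA-8. The struct names and the stack_snapshot stand-in for user memory are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* A copy of part of the user stack, standing in for user memory. */
struct stack_snapshot {
	uint64_t base;		/* address of words[0] */
	const uint64_t *words;
	size_t nwords;
};

struct frame {
	uint64_t pc;
	uint64_t sp;
};

/* Read one aligned 64-bit word at `addr`; 0 on success, -1 if outside. */
static int read_word(const struct stack_snapshot *s, uint64_t addr,
		     uint64_t *out)
{
	uint64_t idx;

	if (addr < s->base || (addr - s->base) % 8 != 0)
		return -1;
	idx = (addr - s->base) / 8;
	if (idx >= s->nwords)
		return -1;
	*out = s->words[idx];
	return 0;
}

/* Step from a frame to its caller. `cfa_offset` is what the DWARF
 * frame info yields for f->pc, i.e. CFA = sp + cfa_offset. */
int unwind_step(const struct stack_snapshot *s, struct frame *f,
		uint64_t cfa_offset)
{
	uint64_t cfa = f->sp + cfa_offset;
	uint64_t ret;

	if (read_word(s, cfa - 8, &ret) != 0)
		return -1;
	f->pc = ret;	/* return address: where the caller resumes */
	f->sp = cfa;	/* caller's stack pointer just before the call */
	return 0;
}
```

A full walker would look up cfa_offset from the FDE covering each new pc and repeat until the read fails or a sentinel frame is reached.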



Mon Jul 20 23:20:37 BST 2009

Say "goodbye" .. Say "hello"


I have removed the utils/eh.c file.

I have created driver/dwarf.c.

This file is both a userland binary (build/dwarf) and the dwarf decoder subroutine for kernel code to be called from dwarf_isa.c.

Next step is to modify the stack walker to invoke the subroutine and see if we get sensible results from within the dtrace driver.



Sun Jul 19 19:00:53 BST 2009

And so the gestation of a dwarf begins...


The utils/eh.c seems to be working and am now converting it from a userland dwarf dumper to a subroutine which can be called in the context of walking the stack.

I'll put out periodic releases if anyone is interested (utils/eh.c) which will become driver/dwarf.c when it's ready for compiling into the kernel (not far off).

The next step is to change the ustack() code to call this and see what happens...



Sat Jul 18 19:33:46 BST 2009

Gestation Period is up...I am pregnant with a Dwarf...


Having spent the last week or so on understanding the DWARF .eh_frame and .eh_frame_hdr sections, I now have a simple utility to dump out these sections, according to the DWARF spec. This code is analogous to what the binutils/readelf tool can do, but is the first step to making this work inside the kernel to get stack traces from user space apps.

The code is in utils/eh.c (gcc -o eh eh.c -lelf). It's nothing special, and likely to have a few bugs/quirks in it, but the code can now be copied into a kernel module and invoked as a subroutine, with various changes to handle ELF32 + ELF64 (eh.c only handles ELF64 for now).

The following is the kind of output from the tool:

FDE length=00000024 ptr=0034 pc=00402110..00402199
fde_encoding=27
  Augmentation Length: 0x00
0000: 4a          DW_CFA_advance_loc 10 to 0040211a
0001: 8f 02       DW_CFA_offset: r15 at cfa-16
0003: 86 06       DW_CFA_offset: r6 at cfa-48
0005: 66          DW_CFA_advance_loc 38 to 00402140
0006: 0e 40       DW_CFA_def_cfa_offset: 64
0008: 83 07       DW_CFA_offset: r3 at cfa-56
000a: 8e 03       DW_CFA_offset: r14 at cfa-24
000c: 8d 04       DW_CFA_offset: r13 at cfa-32
000e: 8c 05       DW_CFA_offset: r12 at cfa-40
0010: 00          DW_CFA_nop
0011: 00          DW_CFA_nop
0012: 00          DW_CFA_nop
0013: 00          DW_CFA_nop
0014: 00          DW_CFA_nop
0015: 00          DW_CFA_nop
0016: 00          DW_CFA_nop
It may not make sense without reading the specs or understanding what it is trying to do. (eh.c has various big cribbed comments taken from the DWARF spec). The above is like a virtual machine, but it is used to track what is in a register and where saved registers live (e.g. the current frame pointer) rather than to perform arithmetic or logical operations.
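
To make the "virtual machine" idea concrete, here is a minimal sketch in Python (not the eh.c code) that decodes just the handful of DW_CFA opcodes appearing in the dump above. The function names are made up for illustration, and the alignment factors (code 1, data -8) are assumptions: they normally come from the CIE, which is not shown above, but those values are consistent with the offsets printed in the dump.

```python
def read_uleb128(data, pos):
    """Decode an unsigned LEB128 value; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def decode_cfa(data, pc, code_align=1, data_align=-8):
    """Decode a small subset of DW_CFA instructions.

    code_align/data_align are assumed values; a real decoder reads
    them from the CIE. Only the opcodes seen in the dump are handled.
    """
    out = []
    pos = 0
    while pos < len(data):
        op = data[pos]
        pos += 1
        high, low = op >> 6, op & 0x3F
        if high == 1:                       # DW_CFA_advance_loc
            pc += low * code_align
            out.append(("advance_loc", low, pc))
        elif high == 2:                     # DW_CFA_offset: reg in low bits
            off, pos = read_uleb128(data, pos)
            out.append(("offset", low, off * data_align))
        elif high == 3:                     # DW_CFA_restore
            out.append(("restore", low))
        elif op == 0x00:                    # DW_CFA_nop (padding)
            out.append(("nop",))
        elif op == 0x0E:                    # DW_CFA_def_cfa_offset
            off, pos = read_uleb128(data, pos)
            out.append(("def_cfa_offset", off))
        else:
            raise ValueError("opcode 0x%02x not handled in this sketch" % op)
    return out
```

Feeding it the bytes from the dump (4a 8f 02 86 06 66 0e 40 ...) with a starting pc of 0x402110 gives back the same sequence: advance 10 bytes, r15 at cfa-16, r6 at cfa-48, and so on. A real unwinder would apply these to a table of register rules rather than just listing them.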

There's still some way to go - taking a demo program and making it into a re-entrant subroutine (and I may have some concerns about performance after looking at the DWARF frames for a sizable executable, like CRiSP, but we will see what happens).

My initial target is /usr/bin/perl - since having a programming and deterministic environment to test and retest is useful.


Posted by Paul Fox | Permalink

Tue Jul 14 21:18:44 BST 2009

DWARF, and Sun


I have a person in Sun, actively fixing dtrace to help with their work, and this is proving useful - two or more sets of eyes to pick over some of my dirty work. Already he has fed back quite a few things for the 2.6.18 kernel, which are applicable to other kernels too. Hopefully more fixes will be forthcoming, whilst I fight the Elves and Dwarves.

DWARF - one of the most complex unix areas - but a beautiful piece of work, dating back to the early 1990s by AT&T/Sun.

DWARF is the way debug info is stored in executable ELF files. Not something one normally worries about, and the GNU binutils and gdb packages, along with GCC, know how to do this without blinking.

But, hiding in DWARF is the magic for handling stack unwinding. Because -fomit-frame-pointer became popular in the 1990s as GCC was enhanced to allow use of an extra register on the x86 architecture, a way was needed to walk the stack, when the %EBP register no longer helps find the return addresses.

If you look at an ELF executable, e.g.

$ objdump -h /usr/bin/perl
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
...
 15 .eh_frame_hdr 00000034  0000000000400eb4  0000000000400eb4  00000eb4  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 16 .eh_frame     000000ac  0000000000400ee8  0000000000400ee8  00000ee8  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
...
you will see the above two sections. These are the sections for unwinding the stack, typically needed for C++ exceptions, but also for omit-frame-pointer (FPO) code. The DWARF spec, e.g. http://refspecs.freestandards.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/ehframechpt.html will tell you more than you ever wanted to know about this.

The specification, like most specifications, is opaque in many areas, and I am busy writing a disassembler to more fully understand it. (Not useful to anyone else but me). I did find this:

$ readelf -wf /usr/bin/perl
will disassemble these sections, and I found this: http://www.hpl.hp.com/research/linux/libunwind/ and http://www.nongnu.org/libunwind/ which have code to help more fully understand the spec.

It's a shame these key libs aren't a standard part of the distributions, and that the kernel itself hasn't yet stumbled on to this, so I may as well try for them.

The problem being solved here is that ustack() is useless on apps compiled without frame pointers, and many distros do exactly that.

Anyway, .eh_frame_hdr is a mini table which maps a program counter to a block of instructions in eh_frame which describes, amongst other things, what the stack looks like within a basic block of code. So, as the cpu pushes/pops things off the stack, it provides a map of where to find the return address of the function, and that is how gdb works nicely on x86_64 architectures (and many others).
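
As a rough illustration of the lookup, the sketch below treats the .eh_frame_hdr table as a sorted list of (start address, FDE offset) pairs and binary-searches it for a program counter. The function name and the list-of-tuples layout are assumptions made for the example; the real section stores encoded pointers that must be decoded first.

```python
import bisect


def find_fde(table, pc):
    """Return the (start_pc, fde_offset) entry covering pc, or None.

    table must be sorted by start_pc. The result is only a candidate:
    the caller still has to check pc against the FDE's address range.
    """
    starts = [start for start, _ in table]
    i = bisect.bisect_right(starts, pc) - 1   # last entry with start <= pc
    if i < 0:
        return None
    return table[i]
```

The point of the header table is exactly this: one binary search to find the FDE, rather than walking every FDE in .eh_frame on each stack frame.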

Of course, those libraries are significantly complicated since they support many CPU architectures and scenarios, whereas I am only currently caring about x86 32 and 64 bit machines.


Posted by Paul Fox | Permalink

Sun Jul 12 19:53:53 BST 2009

Hiiiii! Hoooo! It's off to work we go. DWARF


Whilst bumbling around in the ELF file format, and after a prompt from Nicolas at Sun, I found out how gdb does its stuff to find stack frames for omit-frame-pointer code.

When code is compiled with GCC, it creates a data structure used for exception handling. I thought this was only used for real C++ apps, but turns out this is there for non-C++ apps also, and is hiding in the ELF sections, loaded into memory:

$  objdump -h /usr/bin/perl
/usr/bin/perl:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
...
 15 .eh_frame_hdr 00000074  000000000040289c  000000000040289c  0000289c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 16 .eh_frame     0000020c  0000000000402910  0000000000402910  00002910  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
...

So, I need to find these sections in the address space of the running application to be able to walk the stack. Hopefully this gives us a workable solution for ustack().

I have some way to go, not only locating the memory regions for the current stack to find the ELF blocks, but potential issues if user space pages are paged out whilst we are walking the process's address space.

Probably at least a couple of weeks away from getting this working.


Posted by Paul Fox | Permalink

Sat Jul 11 16:26:02 BST 2009

Darnit...i must admit defeat and live my life...


Mention omit-frame-pointer to people, and if they 'get it', they will seethe at code compiled this way.

That's me after about 2 weeks of trying to improve ustack(). On the Ubuntu releases I am playing with, everything is either compiled without a frame pointer, or GCC has bastardised the stack like a drunk who has thrown up in the toilet.

I have tried various heuristics to get something to work, but I need to dig deeper. (gdb can do it, so I need to see how it's doing it).

Anyway, I got Centos 5 - 2.6.18 installed to fix some issues people had reported on the 2.6.18 kernel.

Someone in Sun has contacted me regarding getting dtrace to work on 2.6.18 for the Lustre project. I find it elating and funny that Sun have come to me for dtrace on Linux, since they want it to help debugging. There were three bugs which the person kindly reported on and he is in business, so that's a good mutual deed. (Thanks Nicolas)

I have some other contributions to fix issues with pid/tid, and I am looking at this to see what is wrong in dtrace and fix it. (Thanks Mauritz).

I need to do something in the ustack area - there are a few pent-up fixes/cleanups in my internal code, but I will look at gdb for some hints and see if I can make some progress.


Posted by Paul Fox | Permalink

Fri Jul 03 21:07:50 BST 2009

Heat + Programming don't mix


We've been having a bit of a heat wave this week in the UK, and it's partially muddled my brain - it's beginning to cool off, so dtrace is looking more attractive.

I have spent the week playing with the symtab code so that ustack() can display the user stack traces. I found various issues with the hacks to get Linux process control to work without radically modifying the existing code - still more to do, but at least I can concentrate on the symtab.

I tripped over a bug in a couple of the ELF functions, where there is a Solaris v. Linux incompatibility in the error return values.

I keep finding code where it tries to open /proc/pid/pstatus, which doesn't exist on Linux, and various issues in finding the DYNAMIC/PROCEDURE_LINKAGE_TABLE. At the moment, it's displaying the function names fine, but the module (library) names are garbage, probably because it's expecting to find the shlib name but I haven't stored it anywhere and it's pointing to freed memory.

I just ran valgrind on dtrace and that's helped track down a few uninitialised variables, but valgrind doesn't understand the dtrace ioctl()s so any return from an ioctl() taints the output, unless/until I teach valgrind how to interpret these.

I spoke to Adam Leventhal about SDT probes to understand some more of the internals. An interesting point he mentioned to me was how SDT works in Solaris: as the kernel boots up, it scans itself for the SDT probes and readies the breakpoints to be inserted. So there is a mapping of probes, just like for a USDT application, which makes perfect sense.

I mentioned the trickiness of doing Linux SDT probes in the absence of source code changes to the kernel, and I know it can be done, but it may require case-by-case analysis to determine how best to patch the kernel to get the probe points. When I have finished/improved user space symbol and process handling then I can go back to that to play, or, I could just use dtrace to analyse more of the kernel itself.

More, when there's more to write about.


Posted by Paul Fox | Permalink

Mon Jun 29 22:40:24 BST 2009

rtdb framework


I'm busy at the moment trying to get the rtld/rtdb functions to work. It's a difficult decision - do I drag in more and more Sun/Solaris code, so that there is a one-to-one mapping of functions and intent, or do I stop here, and start writing my own code?

The rtdb functions are interfaces to the runtime linker (ld.so.1), and, although very nice, rely on intimate behavior of the Solaris linker. This doesn't exist on Linux (i.e. the corresponding functions). So, copying the code into dtrace means copying more and more dependencies (avlist, linked list, msg locales and other stuff), for little benefit.

dtrace uses these functions in a very specific way: get the symtab of the target process we are tracing, along with the symtab for the loaded shared libraries.

I am going to draw a line and see how much I can do without dragging it in. (I dragged it in and have kicked it out again, as I just spend more and more time porting Solaris to Linux, which isn't the end goal).

The end goal is making the PID provider and user space stack traces "as they should be".

This will likely take a while, so will update periodically if I feel what I have is no worse than before.


Posted by Paul Fox | Permalink

Sun Jun 28 12:07:05 BST 2009

dtrace progress - symtabs


I have put out a new release which is better at handling stacks for 32+64b platforms, whether they are compiled with or without frame pointers. It's not perfect - the later your kernel, the more trustworthy the stack will be, since in the worst case, we have to examine the stack, word-by-word, to find likely looking return addresses (the same as the kernel does), since GCC over-optimises frame pointers.
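
The worst-case heuristic can be sketched as below. This is only an illustration of the idea, not the driver code: the function name is hypothetical, the stack is passed in as raw little-endian bytes, and "likely looking" is simplified to "falls inside a known executable text range".

```python
import struct


def scan_for_return_addresses(stack_bytes, text_ranges, word_size=8):
    """Walk a raw stack image word by word and collect values that
    fall inside any (low, high) text range. word_size is 8 for 64-bit
    and 4 for 32-bit; the stack is assumed to be little-endian."""
    fmt = "<Q" if word_size == 8 else "<I"
    hits = []
    for off in range(0, len(stack_bytes) - word_size + 1, word_size):
        (word,) = struct.unpack_from(fmt, stack_bytes, off)
        if any(lo <= word < hi for lo, hi in text_ranges):
            hits.append(word)
    return hits
```

This will happily report stale values left on the stack by earlier calls, which is why the result is a guess and not a trustworthy trace.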

I am currently looking at this:

$ dtrace -n pidXXX::: -p XXX
I tried this on my MacOS system, and was intrigued by the fact that for a sample Perl app, tens of thousands of new probes sprang into life. It looks to me as if you can DoS attack a kernel with these privs, since if you do this on lots of processes, you can eat the probe memory that dtrace will set aside, and either run out, or affect performance of a system.

At the moment I am knee deep in more ELF/dynamic stuff, so that we can get the symtab of a running process so that the PID provider is more usable.


Posted by Paul Fox | Permalink

Thu Jun 25 23:16:47 BST 2009

SDT probes - what?


SDT - static probes are high level probes in the kernel, in the sense that they add value compared to FBT. FBT probes can go on any function - you know the function got entered or returned. But finding key data structures, such as the current "proc" or "timer" or "packet", isn't easy without playing around with stack arguments and type casts to a known type.

That's how I read SDT: it can provide a probe like "received_packet" and provide an argument which represents the packet so you can dissect it.

But, the question is - are they useful ?!

I don't really understand the probes despite staring at the code for a while. I understand lots of the technicalities, but not the rationale. Is my first paragraph spot on? Feel free to send me feedback about why they are a *must*.

Why?

Well, many of the probes in Solaris relate to Solaris internals. The concepts of scheduling on Solaris don't match the Linux kernel. Solaris has a process and a lwp (lightweight kernel thread). In Linux, all threads are really processes.

So, if you have a D script written for Solaris, it won't work on Linux, unless I provide as close an emulation as possible. I have found that FBT is more than enough to keep me entertained, but I am trying to find out if we need SDT.

There are a lot of values exposed in /proc such as statistic counters. And there is a lot of code in the kernel which increments those counters. But the counters on their own are not directly interesting (you can put an FBT on the functions that manipulate those counters). So, maybe I am missing something, like, with dtrace/linux today, you cannot easily inspect processes, io, vm, packets, etc.


Posted by Paul Fox | Permalink

Tue Jun 23 23:56:25 BST 2009

fixed the 32b problems?


Just uploaded a new release -- which may fix the problem. Found that if I disable the GPF interrupt hook, the reliability problems disappear. I don't understand how/why - the window for any race condition should be very small... but it seems to work.

I will have to analyse this more to see why that hook (which shouldn't fire, and we do put it back on a rmmod) causes a problem.


Posted by Paul Fox | Permalink

Tue Jun 23 22:35:39 BST 2009

32b drat


I have had a bug report that builds since 20090617 for 32b kernels are failing to load. Strange, because it worked for me, but I don't have every permutation of kernel and modules.

After trying a few experiments, it appears that reloading the dtrace driver will panic/crash/reboot the 32b kernel. (After 3 times for my test machine, and in vmware, a reboot occurs, indicating a likely triple-fault).

I suspect maybe on driver unload, something is not being undone which happened on a load (maybe reset/unhooking the interrupt vectors).

I am investigating.

SDT Progress

I've done some research on how to get SDT into the kernel without touching the kernel source. I was hoping that for key subsystems like the scheduler, VM, NFS, we would find a structure containing counters which are incremented at key parts of the driver, and which are the ones exposed in /proc. If we did, we could modify the instruction provider to look for these increments, and auto-create the probes.

What I have found so far in looking around, is that some/all drivers have either a disconnected ad hoc collection of counters or have per "instance" counters. (I found references to zones in the MM code), so it won't be as easy as I hoped, but I am continuing to look for a pattern.


Posted by Paul Fox | Permalink

Mon Jun 22 21:33:02 BST 2009

dtrace -p now works


We can now attach to a running process and run dtrace on it. I hit the same kernel bug - namely, that if a process attaches to a debuggee, and the process creates a thread, the thread cannot "see" the child debuggee via ptrace(). Nuisance, but now I understand it, it's totally fine - we just attach/detach in the parent and reattach in the child thread.

It still concerns me that you can kill -9 the dtrace and the child can be left stuck in an indeterminate state. Whilst thinking about this, I have a possible solution, namely to let the dtrace driver know what we are doing, and should the dtrace process die, we could force a SIGCONT (PTRACE_CONT) on the debuggee, so all is not lost, and we don't need to do what Solaris does in the /proc filesystem.

So, next up is either ustack() (and user space symbol tables), or the SDT driver. I am still a little confused by SDT and the "translator" keyword in a D script which provides struct-level access to kernel and user space params, but I know what I am expecting to see/work, so I just need to play.

SDT will be interesting - I have a plan to use the Instruction Provider to disassemble the kernel and intercept ADD instructions which apply to a global memory area corresponding to a struct of interest. I hope this will work for some/most of the desired areas, and if so, we have a way to intercept processes which trigger various kernel counters.

One thing to note with dtrace -c/-p - the way dtrace works is to get the process going and then to kick off the kernel rules engine. The kernel doesn't really know what's going on in user space - you can elect to monitor probes for the process or any sibling (like truss -f or strace -f) by virtue of your predicates on the probes you write. This really is very powerful, since dtrace can (in theory) do everything strace and truss can do, but via lower level primitives.

Dtrace emulating truss is available, as some scripts on the internet show, but some aspects of the way this is done are a little "clunky". I will experiment at a later date to see if we can more closely emulate strace/truss so that dtrace can be a one-stop-shop for these kinds of things.

New release available today whilst I go off and do some more real work.


Posted by Paul Fox | Permalink

Sun Jun 21 21:28:00 BST 2009

One step closer - dtrace -c works


dtrace -c should now work. It took a lot of energy to understand the control flow and map the Solaris primitives to standard Unix ptrace/wait semantics, but it appears to work.

You can now do this:

$ dtrace -n 'syscall::mu*:/pid==$target/{printf("%d",pid);}' -c df
dtrace: description 'syscall::mu*:' matched 6 probes
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1              5874396   5097744    478248  92% /
varrun                  255460        72    255388   1% /var/run
varlock                 255460         0    255460   0% /var/lock
udev                    255460        44    255416   1% /dev
devshm                  255460         0    255460   0% /dev/shm
dtrace: pid 7689 has exited
CPU     ID                    FUNCTION:NAME
  0  87377                     munmap:entry 7689
  0  87378                    munmap:return 7689
  0  87377                     munmap:entry 7689
  0  87378                    munmap:return 7689
  0  87377                     munmap:entry 7689
  0  87378                    munmap:return 7689
You may see some debug printf's I have left in there, but next thing is to tackle the symtab (stack/ustack) stuff, and consider library probes.

The -c stuff (and -p, which I haven't tested yet) may have some issues. There's a horrible sleep(1) in the child after a fork() to let the parent catch up with the child. I found the Linux kernel seemed to be broken in some areas (I believe threads which inherit ptrace() children have problems).

The sleep can be solved easily with some form of shm mutex or maybe even a futex, but I haven't tried.
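
For illustration, here is one way the sleep could be replaced. Note this sketch swaps in a plain pipe handshake rather than the shm mutex or futex mentioned above, purely because it is the simplest thing to show; the function name and the parent_setup callback are hypothetical, and in the real case the child would exec the target after the read returns.

```python
import os


def spawn_synchronised(parent_setup):
    """Fork a child that blocks until the parent says it is ready.

    parent_setup(pid) stands in for whatever the parent must finish
    (e.g. attaching to the child) before the child may proceed.
    Returns the child's exit status.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: wait for the go-ahead byte instead of sleep(1).
        os.close(w)
        data = os.read(r, 1)          # blocks until parent writes or closes
        os._exit(0 if data == b"g" else 1)
    # Parent: do the setup first, then release the child.
    os.close(r)
    parent_setup(pid)
    os.write(w, b"g")
    os.close(w)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

The child cannot get ahead of the parent, however slow the parent is, and if the parent dies before writing, the read returns end-of-file and the child can bail out rather than hang.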

What is worrying is that in Solaris /proc fs, you can signal a child process to continue on its own if you, the parent, die. On Linux, this isn't there, so, consider:

$ dtrace -n ... -p <pid>
If you kill -9 the dtrace process, then the target process may be left in an indeterminate state. This is true for strace too. dtrace and strace can work hard to intercept SIGINT/SIGHUP/SIGTERM/etc, but cannot do anything about SIGKILL. I can think of a not-nice way to partially solve this (or maybe we could put something into the kernel to handle this), but that is a reason why /proc/pid/ctl wins on Solaris.

Posted by Paul Fox | Permalink

Fri Jun 19 19:54:57 BST 2009

Linux user threads - bug ?


I have been working on userland dtrace - where you can launch an app from dtrace itself so that you can trace just this app (or attach to an existing app, like strace or truss).

I found something interesting, which had been confusing me, having spent so long inside the kernel.

In Unix, we have the ptrace() system call - which is the basis of all debuggers. You can attach to a process and do things like set breakpoints or intercept events of interest, like signals.

This works in one of two ways: if you are a debugger (which dtrace, gdb, strace, etc all are), then you fork yourself. The child notifies the kernel it is happy to be traced (via ptrace(PTRACE_TRACEME)), and then forks+exec's the target process.

The parent debugger attaches to the target pid (it knows the pid, because we just forked). It does this via ptrace(PTRACE_ATTACH), and from then on can peek/poke the target process, or continue after an event.

So, here is the bug. In order to ptrace a process you need to attach to it. Two debuggers (eg gdb + strace) cannot attach to the same process at the same time.

Now, consider this. You are a process. You create a new thread. This thread forks() + execs the target. The new thread tries to attach to the process, but fails, because the master thread is considered the 'parent' of the child, and the thread you spawned is considered to be a distinct process - not a thread of the main process.

The issue here is that in Linux, threads are implemented as if you had forked a new process, but the thread shares the address space of the parent. This is not true of a proper multithreaded and POSIX compliant system. E.g. in Solaris, a thread is really a separate 'slice' of a process, and it shares the process id of its parent.

Linux tries to pretend threads exist, but this funky emulation seems to break how ptrace() works.

This is why I have had a hard time getting userland dtrace to work properly in this area - as I have been trying to understand what dtrace is doing and why the target process was stuck in the wrong state.

Now I understand, hopefully the "-c" and "-p" switches to dtrace can be made to work, and this will be a significant feature addition to Linux/dtrace.


Posted by Paul Fox | Permalink

Mon Jun 15 23:00:18 BST 2009

Next up...


$ dtrace -n 'syscall:::/pid==$target/{}' -c "sleep 100"
This is how to trace the syscalls for a specific process we want to launch - one of the last major features of Linux Dtrace which is missing.

Interestingly, I seem to be hitting an issue with pthreads vs fork/waitpid semantics...Time to read more on who gets the signals on Linux, vs solaris...


Posted by Paul Fox | Permalink

Mon Jun 15 21:58:37 BST 2009

Dependencies


Can people who download dtrace and find it fails to build, please read the README and figure out what they have missing from their systems in order to build it.

I am not going to respond to emails for trivial support issues.

Thank you


Posted by Paul Fox | Permalink

Fri Jun 12 19:55:29 BST 2009

dtrace and the CALL instruction .. fixed


After a lot of code and stack trace staring - the issue is now fixed for 64b kernels. The issue was around a call instruction. Any probe which started with a call instruction could crash the kernel.

Amazingly, I was staring at a solution in the Linux kernel, but my brain has been hazy the last few days. I had implemented the Instruction Provider which has been a great help to find lots of samples of instructions I care about and try and get a feeling for what is going on.

The issue I was seeing is that when we take the INT3 and INT1 handler - for the initial breakpoint trap and then the single step trap, we would expect the kernel RSP to have moved, because we had just stepped a CALL instruction. But I wasn't seeing this. The "regs" structure on the stack at the point of exception was the same. This didn't make sense.

I hacked it for one 64b kernel, but the others hated my hack. (My hack involved looking at the stack dumps and trying to 'find' the magic values I wanted).

It worked fine on 32b kernels. Imagine an interrupt from kernel space taking place. The cpu pushes RFLAGS, CS, RIP, in that order onto the existing stack. At this point, our code kicks in and pushes the full register set on to the stack (giving us a "struct pt_regs" structure we can point to and manipulate before returning from the interrupt).

Just above the flags should be the stack where we interrupted. This *is* true on a 32b cpu but not on a 64b cpu. I *think* the reason is that on 64b cpus, Linux sets up a TSS task switch so that on an interrupt, we have a private kernel stack, and this would hopefully avoid stack overflows if we interrupted a deeply nested part of the kernel.

That is why the 'regs' structure is always at the same address, and what we have in the r_rsp field is a POINTER to the original stack, not the stack itself!

A quick experiment and I could run:

$ dtrace -n instr::*call*:
to trap every call instruction in the kernel and it worked. In addition
$ dtrace -n fbt:::
works flawlessly on all three key 64b kernels I was trying, and I hadn't even broken the 32b kernel in fixing this.

There's still a bogus issue or two to track down. Ctrl-C-ing dtrace can cause kernel problems - not sure why. If you Ctrl-C the dtrace binary, it sends an ioctl to the kernel to ask it to pull apart your probes rather than just exiting. I don't fully understand why they do that, but it may be for when you launch a binary from dtrace and it needs to kill or detach.

So, if this is done, I can hopefully return back to user space and get userspace apps to be traced as well, and then we are done....

The Instruction Provider driver is hopefully going to be useful to implement a proper set of probes for the things that avoid patching kernel source.


Posted by Paul Fox | Permalink

Mon Jun 08 23:33:43 BST 2009

Instruction Provider now works


Here's a short example:
$ dtrace -n instr::*-nop:
  0  86851 _spin_lock-nop:0xffffffff8045c3f5
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86851 _spin_lock-nop:0xffffffff8045c3f5
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86851 _spin_lock-nop:0xffffffff8045c3f5
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86742 mutex_trylock-nop:0xffffffff8045b1b7
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86933 lock_kernel-nop:0xffffffff8045c6d6
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86851 _spin_lock-nop:0xffffffff8045c3f5
  0  86933 lock_kernel-nop:0xffffffff8045c6d6
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86851 _spin_lock-nop:0xffffffff8045c3f5
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86851 _spin_lock-nop:0xffffffff8045c3f5
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86851 _spin_lock-nop:0xffffffff8045c3f5
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86742 mutex_trylock-nop:0xffffffff8045b1b7
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  0  86933 lock_kernel-nop:0xffffffff8045c6d6
  0  86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8
  ...
And another,
$ dtrace -n instr::*-lock:
  0  86883 _spin_trylock-lock:0xffffffff8045c51c
  0  86883 _spin_trylock-lock:0xffffffff8045c51c
  0  86883 _spin_trylock-lock:0xffffffff8045c51c
  0  86883 _spin_trylock-lock:0xffffffff8045c51c
  0  86925 __reacquire_kernel_lock-lock:0xffffffff8045c660
  0  86925 __reacquire_kernel_lock-lock:0xffffffff8045c660
  0  86925 __reacquire_kernel_lock-lock:0xffffffff8045c660
  0  86925 __reacquire_kernel_lock-lock:0xffffffff8045c660
  0  86925 __reacquire_kernel_lock-lock:0xffffffff8045c660
  0  86925 __reacquire_kernel_lock-lock:0xffffffff8045c660
  0  86925 __reacquire_kernel_lock-lock:0xffffffff8045c660
  0  86925 __reacquire_kernel_lock-lock:0xffffffff8045c660
  0  86883 _spin_trylock-lock:0xffffffff8045c51c
  0  86883 _spin_trylock-lock:0xffffffff8045c51c

Posted by Paul Fox | Permalink

Mon Jun 08 21:43:41 BST 2009

The Instruction Provider


Today I played with the instruction provider - a dtrace probe provider for tracing classes of instructions. Typically these are jump or call or sti/cli instructions. This works like the FBT provider but creates probes based on opcode values. So, for example you can trace every JNE instruction, or only JNE inside a specific function.

It's hopefully useful and innovative, but my prime goal here was to provide a way to debug targeted opcodes, which are not necessarily in the first location of a function.

I ran a trial - I went from 25,000+ probes to 300,000+ probes. I quickly crashed the kernel (hey, it was a first effort), but hope to debug quickly.

I will probably make it a load-time option to enable it as it really can be destructive to the system under test with so many probes firing. But, if it works, it will also be a good stress of dtrace on linux.

More later!


Posted by Paul Fox | Permalink

Sat Jun 06 20:55:00 BST 2009

E8 again...


Dtrace is working beautifully apart from a 2.6.9 kernel (64-bit) I am testing on. One fbt probe uses the E8 instruction (relative call).

This is what a relative call does:

E8 nn nn nn nn  CALLR offset
We have a 32-bit relative offset from the next instruction. As a normal subroutine call, this is what should happen: decrement RSP, move return address to (RSP).
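
Spelled out as code, the effect of the instruction looks like the sketch below. This is an illustration of what the CPU does, not the driver's code: the function name is made up, and memory is modelled as a plain dict from address to 64-bit value.

```python
import struct

MASK64 = (1 << 64) - 1


def emulate_call_rel32(insn, rip, rsp, memory):
    """Emulate 'E8 rel32' located at address rip (64-bit mode).

    memory is a dict standing in for the stack. Returns the new
    (rip, rsp) after the call has been 'executed'.
    """
    if len(insn) != 5 or insn[0] != 0xE8:
        raise ValueError("not an E8 rel32 call")
    (rel,) = struct.unpack("<i", insn[1:5])   # signed 32-bit offset
    return_addr = (rip + 5) & MASK64          # address of the next instruction
    rsp = (rsp - 8) & MASK64                  # decrement RSP ...
    memory[rsp] = return_addr                 # ... and store the return address
    return (return_addr + rel) & MASK64, rsp
```

Note the offset is relative to the instruction after the call, so an instruction which has been copied elsewhere to be single stepped would jump to the wrong place unless the offset is adjusted or the call is emulated like this.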

Now this is very strange: on the 2.6.9 kernel, we single step the call. The initial breakpoint hits and we look at the RSP. As we single step over the call, we expect the RSP to have decremented by 8 (64-bit return addr).

And it does.

But there is a gap between the RIP/CS/EFlags for the trap exception and the return address of the stepped over instruction. Look at the following debug output:

INT3 PC:ffffffff80110a48 REGS:0000010008eedea8 CPU:0
BEFORE:
Regs @ 0000010008eedea8..0000010008eedf50 CPU:0
r15:000033526a5f59d7 r14:ffffffff804dc2a0 r13:0000010008165290 r12:ffffffff804dcc00
rbp:0000010008165290 rbx:0000010014f84d40 r11:ffffffff80110b5a r10:0000000000000038
r9:0000000001200011 r8:0000010008eec000 rax:0000010015e803b0 rcx:00000000c0000100
rdx:0000000000000000 rsi:0000010008165290 rdi:0000010015e803b0 orig_rax:0000010015e803b0
rip:ffffffff80110a49 cs:0000000000000010 eflags:0000000000000047
rsp:0000010008eedf58 ss:0000000000000018 00000000006f2840 00000000006f0a00
INT3 ffffffffa0236cff called CPU:0 good finish

int1 PC:ffffffffa025aff0 regs:0000010008eedea8 CPU:0
AFTER:
Regs @ 0000010008eedea8..0000010008eedf50 CPU:0
r15:000033526a5f59d7 r14:ffffffff804dc2a0 r13:0000010008165290 r12:ffffffff804dcc00
rbp:0000010008165290 rbx:0000010014f84d40 r11:ffffffff80110b5a r10:0000000000000038
r9:0000000001200011 r8:0000010008eec000 rax:0000010015e803b0 rcx:00000000c0000100
rdx:0000000000000000 rsi:0000010008165290 rdi:0000010015e803b0 orig_rax:0000010015e803b0
rip:ffffffffa025aff1 cs:0000000000000010 eflags:0000000000000047
rsp:0000010008eedf50 ss:0000000000000018 ffffffffa0236d05 00000000006f2840
Here we get an INT3 trap and you can see RSP is set to 0000010008eedf58. The "Regs @" entry in the first case shows the extent of the 'struct pt_regs'. Note that between the printed rsp and the end of the regs area is a difference of 8 bytes. This shouldn't be there.

After the INT3 breakpoint trap, we single step (int1), and look again at the Regs@ and RSP field. The regs are at the same location - even though we just executed a call instruction and pushed the return address on the stack. In the INT1 register dump, RSP is correctly decremented by 8. Here we have no gap, but for INT3 we do have a gap.

I have been reading and re-reading exception handling on the web and Intels docs and there is no reason for the gap.

What is puzzling is that it works on the other kernels, but INT3 is pushing two extra words on the stack - more than I expect.

Another interesting issue is that when I look at the kernels I have and search for E8 call instructions at the first instruction of a probe, only this one seems to have one. Later kernels (or GCC's) dont seem to emit the instruction, so, if I dont understand what is going on, there is a chance that you will hit one and panic your kernel.

Strange. I am going to put out a new release (at least this fixes the compiler issues people have been complaining about, and hope no-one has an E8 in their kernel).


Posted by Paul Fox | Permalink

Fri Jun 05 22:44:39 BST 2009

0xfa and 0xfb - CLI and STI


Strange. In 64-bit mode, trying to single step these instructions which enable and disable interrupts doesnt work. I'm sure its me being a little thick and there are a number of gotchas.

For instance, STI, which sets the interrupt enable flag, holds off interrupts over the following instruction (CLI, which clears the flag, has no such delay: interrupts are simply off from that point).

What was happening was that process 1 -- init -- would die, and the kernel would scream at me.

I have solved this by pure emulation - no point in single stepping these instructions, and just handle without a single step - which is better from a performance point of view.
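
As a rough illustration of what "pure emulation" can look like, here is a minimal sketch that applies CLI/STI directly to a saved flags word instead of single stepping. The function name and the way the flags are passed in are assumptions for this example, not the actual dtrace code; the only hard facts used are the opcodes (0xfa, 0xfb) and the position of the IF bit (bit 9, 0x200).

```c
#include <stdint.h>

#define X86_EFLAGS_IF 0x200UL	/* interrupt enable flag, bit 9 */

/*
 * Emulate CLI (0xfa) or STI (0xfb) against a saved flags word.
 * Returns the instruction length (1) if the opcode was emulated,
 * or 0 if the opcode is something else and needs other handling.
 * (Hypothetical helper for illustration only.)
 */
int emulate_cli_sti(uint8_t opcode, unsigned long *flags)
{
	switch (opcode) {
	case 0xfa:		/* CLI: clear IF */
		*flags &= ~X86_EFLAGS_IF;
		return 1;
	case 0xfb:		/* STI: set IF */
		*flags |= X86_EFLAGS_IF;
		return 1;
	}
	return 0;
}
```

A trap handler would call this with the opcode at the probe address and the saved flags from the register frame, then advance the saved PC by the returned length.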

I am running on 3 64-bit vmware kernels. 2.6.27.8 runs beautifully. 2.6.27-7-generic - an Ubuntu kernel - runs flawlessly but strangely slowly when all probes are enabled. I would expect both to run at the same speed, so either the first is running fast when it shouldnt or maybe the latter is flawed. (I think the slowdown may be due to calls to mcount which is doubling the overhead per function in the kernel).

The other is 2.6.9 - AS4 kernel. Just shown that to hang, so I need to debug that before making a release.

(32-bit kernels appear to work fine, and the compile issues are resolved).

I have added a special flag to FBT which is interesting/useful.

$ load.pl -opcode

will prefix each probe name with the first byte of the opcode at the probe, so that it is easier to diagnose where the flaws are. Single stepping the breakpoint for a probe works, but many instructions have to be handled specially, such as jumps, calls and rets. So being able to find the offending instruction or scenario is helpful.

This relates back to a prior blog entry where I talked about how nice it would be to have an instruction prober where we could probe by instruction type, rather than function. E.g. imagine probing by virtue of every LOCK instruction. Or REP or CLI. Get the picture?

How about JMP/JNE/JE instructions? That could be ideal for low level kernel profiling -- how many times is a jump taken in *this* function.

This is easy to do - just need a variation of the FBT disassembler which doesnt try to instrument the entry/exit of a function, but the body.

I may try and get this in on the release after this one, just to see what it looks like. Stay tuned.

Hoping to release this weekend or tonight if I can resolve the AS4 issue.


Posted by Paul Fox | Permalink

Wed Jun 03 23:48:30 BST 2009

E8 issue - now fixed


I found where on the stack my "return address" was hiding, and, having been very silly, proved to myself what I had done wrong.

Now...need to fix the compile time issues and a new release is forthcoming.

You can cat /proc/dtrace/trace to get some internal trace debug - I need to tone that down to avoid hitting performance too much. (It's not bad as it is, but I can do better).


Posted by Paul Fox | Permalink

Wed Jun 03 23:24:06 BST 2009

E8 nnnnnnnn - CALL Relative


I've been stuck on one instruction all week - the call relative instruction. One function in the kernel has this in the opening position of a function entry, and we copy the instruction to a temp buffer and single step it.

Its not rocket science, but I have been struggling with a lot of sillyness on just a few lines of code.

This instruction has two issues - (1) we need to adjust the return address since we want to return to the original instruction and not the copied one, and (2) it's a relative jump, so the target depends on where the instruction sits.

In my work, I have managed to get one or both of these stupidly wrong. (One issue looks to be not sign-extending a 32-bit displacement to a 64-bit address).
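
To make the two fixes concrete, here is a small sketch of emulating "E8 rel32" as if it had executed at its original address. The struct and function names are made up for this example, and it assumes a little-endian host (as x86 is). The return address is the original PC plus 5, and the target is that return address plus the displacement sign-extended to 64 bits.

```c
#include <stdint.h>
#include <string.h>

/* Result of emulating "E8 rel32" found at orig_pc (hypothetical type). */
struct call_fixup {
	uint64_t target;	/* where execution should continue */
	uint64_t ret_addr;	/* value to push as the return address */
};

struct call_fixup emulate_call_rel32(uint64_t orig_pc, const uint8_t *insn)
{
	struct call_fixup f;
	int32_t disp;

	/* The 4 bytes after the E8 opcode are a signed displacement. */
	memcpy(&disp, insn + 1, sizeof disp);

	/* Return to the instruction after the ORIGINAL call (5 bytes long). */
	f.ret_addr = orig_pc + 5;

	/* Sign-extend to 64 bits before adding, or backward calls break. */
	f.target = f.ret_addr + (uint64_t)(int64_t)disp;
	return f;
}
```

Dropping the (int64_t) cast would zero-extend a negative displacement and send a backward call roughly 4GB forward instead.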

Hopefully I can get this fixed and move on.

Some people have raised issues about plain compile errors due to <string.h> and memcpy. I hope to fix this too - very annoying that I did something to break what was working fine. (I replaced calls to bzero with a call to memset, and somehow the #define's conflict with glibc's string header).

I noticed a new Solaris release has come out (2009/06) and the most notable change for dtrace is the CPC profiler -- http://wikis.sun.com/display/DTrace/cpc+Provider

This looks neat and really want to get that ported, but I need to finish the current workload before taking this on board.

More in a few days.


Posted by Paul Fox | Permalink

Sun May 31 19:24:49 BST 2009

I dont want to release just yet....


Whilst enabling all valid opcodes for FBT tracing is great - it does show up issues which I had been able to ignore before in terms of stability.

There are certain paths of code which can cause probe traps from within the probe handler, and I get differing results from my differing kernels. The best kernels are those that crash on me - allowing me to see an issue, rather than hiding these latent instabilities.

"fbt:::" works, but theres more to it than this, as many probes never fire, or a probe may cause an issue when another probe fires. (The SYSCALL instruction is a prime case - it causes the kernel to enter at 'system_call', but at this time, the stack and state of the kernel is not consistent, and we cannot (yet) probe on that, so I have added it to the toxic list for now).

There are many entries on the toxic list which can be removed, but am working thru the failure scenarios I can see at present.

I've put in an interrupt handler for interrupt #13 (GPF), since when we do things wrong in dtrace, its good to get a chance to shut us up and avoid an infinite cascade of console messages, resulting in a total panic. (If we get a GPF caused by us, we disable all probes to try and give me a chance to debug what is going on; in theory this should never happen, but at this time, can do whilst I iron out some of the thorny issues).

Keep watching the ftp site - I will upload when I feel happy.


Posted by Paul Fox | Permalink

Fri May 29 21:22:55 BST 2009

dtrace progress


The 32-bit dtrace in the kernel is looking good now - the function disassembler now allows all instructions in the first slot of a function, and correctly single steps them. (Not strictly true - many instructions, such as JNZ, don't or won't occur in this slot, not for normal C code or even assembler). But for my kernel (one of my kernels) the number of probes has jumped from around 23000+ to 24000+ (which illustrates how rare many instructions are, such as a CALL or LOOP or REP instruction in the first slot).

I need to validate the 64b code - I have added a new file (and removed the older cpu_32bit.c cpu_64bit.c) to store this emulation.

Interestingly, in looking at kprobes in the kernel, for hints about what I was doing wrong, I am seeing that I now do much more -- i.e. I believe kprobes cannot handle many functions or instructions in the kernel, so they can borrow from this code if they like, or use it to educate themselves about what's missing.

I am happy to share the code and ideas, because this way, dtrace and its competitors can improve, and people who do not contribute to the code, get something-for-nothing - better quality tools. I dont look at dtrace as the competitor to annihilate all others. I am pleased with the quality and thought process Sun put into the construction, and the raw dtrace engine has given me almost zero issues.

Sun had an easy starting point - tight kernel integration and a dumb C compiler that doesnt show up the issues that inlined assembler gives to us Linux people. Maybe they can learn something too.

Someone asked me the other day about the performance differential for Linux dtrace vs Sun/Apple. My response then, as now, is that there is near zero difference. Just because Linux dtrace is not a part of the kernel, but an add-in module, doesn't deter dtrace from doing things. (Ok, SDT is going to be a challenge but much less so, I hope, than getting to where we are).

Also, I was asked about the 'alpha' nature of dtrace. I either write optimistic blogs ("It works!"), or short cryptic ("No it doesnt!") entries - depending on how positive I feel.

When dtrace works for more than a few minutes without crashing a kernel, thats good. But we dont have a scheme to handle coverage (maybe I will add that) - so we can tell, of the 25000+ probes, which ones fired and which didnt. Certainly, 20-30+% of the kernel is executing all the time, but the rest may depend, e.g. if a CD is in the drive, or if a TCP packet is dropped, or a user space app core dumps, and so on. And even if a probe fires, it may be handled incorrectly.

The truth is, if it stands up to scrutiny from people using it, and I am not writing about "Oops!" moments, then we are making progress.

Next on my todo fix list is Ctrl-C to dtrace. When I run:

$ dtrace -n fbt:::

it works fine. But Ctrl-C-ing it causes a *long* delay - sometimes 5-10s before the shell comes back. *Sometimes* a kernel GPF is raised, indicating that as the probes are being removed, an interrupt or something is creeping in, firing a probe about to be destroyed, and possibly hanging or causing disruption. For small numbers of probes, the window of opportunity is tiny. But for all probes, it's big enough to be a real problem, and the whole point is to ensure you cannot panic a running production system.

Hopefully this will be easy to fix.


Posted by Paul Fox | Permalink

Wed May 27 21:00:32 BST 2009

Nuts - i am wrong


I wrote in the last two articles that Sun's disassembler is wrong in not handling prefix instructions properly, but that is rubbish, on my part. It does handle them, and I confused myself because of the changes to fbt_linux.c I am presently working on.

Apologies to Sun - and am glad that I can trust their code!

Now to find why a few F0 instructions arent stepped properly...


Posted by Paul Fox | Permalink

Wed May 27 20:32:41 BST 2009

REPZ/REPNZ prefix


I just wrote about the semantics of the LOCK instruction being a prefix or not, and I now have proof that REPZ/REPNZ should be treated similarly.

Heres a small dump from the dmesg output after loading dtrace when it can process the F0 series of opcodes (F0..FF).

[52358.586721] fbt:F instr ptype_seq_stop:c02c5ae0 size=1 f3 c3 8d b4 26
[52358.589013] fbt:F instr neigh_stat_seq_stop:c02ccb60 size=1 f3 c3 8d b4 26
[52358.601019] fbt:F instr seq_stop:c02e3f50 size=1 f3 c3 8d b4 26
[52358.601484] fbt:F instr seq_stop:c02e41d0 size=1 f3 c3 8d b4 26
[52358.601760] fbt:F instr rt_cpu_seq_stop:c02e49f0 size=1 f3 c3 8d b4 26
[52358.601828] fbt:F instr ipv4_rt_blackhole_update_pmtu:c02e4ae0 size=1 f3 c3 8d b4 26
[52358.615720] fbt:F instr icmp_address:c030e3b0 size=1 f3 c3 8d b4 26
[52358.615743] fbt:F instr icmp_discard:c030e3c0 size=1 f3 c3 8d b4 26
[52358.623705] fbt:F instr xfrm_link_failure:c0322210 size=1 f3 c3 8d b4 26
[52358.639776] fbt:F instr __read_lock_failed:c0347a90 size=1 f0 ff 00 f3 90
[52358.641537] fbt:F instr kprobe_seq_stop:c034a9a0 size=1 f3 c3 8d b4 26
Note the size=1 which shows that Sun's disassembler has mistreated the f3 instruction (REP/REPZ).

Posted by Paul Fox | Permalink

Wed May 27 20:26:24 BST 2009

LOCK: Prefix or instruction byte?


I think I found a problem with Sun's instruction disassembler. The disassembler is needed to work out how big an instruction is, and is used by FBT, to work out entry and return instructions.

For opcode 0xF0 (LOCK prefix), the disassembler says we have an instruction length of 1 byte, i.e. it treats this as standalone, and not as a prefix.

If we plant an FBT probe on this instruction (and there are a few in the kernel), then when we single step - we will step the LOCK all on its own and not have the following instruction, leading, more than likely, to a kernel crash or bad semantics.

I am amending the disassembler to detect this, and treat LOCK properly as a prefix.

I found this whilst working thru all x86 instruction bytes, so we can enable the entire lot, and not special case "known scenarios".

The other prefixes (such as REP/REPNZ, etc.) should likely be treated similarly, but I haven't found an example in the first instruction of a function where these instructions are fetched.
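
For illustration, treating a byte "as a prefix" boils down to skipping over it before sizing the real instruction. The sketch below is not Sun's disassembler or the fbt_linux.c change; the function names are invented here. The byte values are the standard x86 legacy prefixes.

```c
#include <stddef.h>
#include <stdint.h>

/* Return non-zero if b is an x86 legacy prefix byte. */
int is_legacy_prefix(uint8_t b)
{
	switch (b) {
	case 0xf0:				/* LOCK */
	case 0xf2: case 0xf3:			/* REPNZ, REP/REPZ */
	case 0x2e: case 0x36: case 0x3e:	/* CS, SS, DS overrides */
	case 0x26: case 0x64: case 0x65:	/* ES, FS, GS overrides */
	case 0x66: case 0x67:			/* operand/address size */
		return 1;
	}
	return 0;
}

/* Count leading prefix bytes, looking at no more than max bytes. */
size_t count_prefixes(const uint8_t *insn, size_t max)
{
	size_t n = 0;

	while (n < max && is_legacy_prefix(insn[n]))
		n++;
	return n;
}
```

The full instruction length is then the prefix count plus the length of whatever follows, so "f0 ff 00" would be sized as one 3-byte instruction rather than a 1-byte LOCK on its own.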


Posted by Paul Fox | Permalink

Sat May 23 23:21:39 BST 2009

mcount and gcc -pg


Found it. Some kernels are compiled with profiling turned on (-pg), which means even our driver has this enabled. I can't find an easy way to turn it off without interposing my own gcc wrapper, so the easiest thing is to define our own 'mcount' subroutine which does nothing.

This means we wont call into the kernel, and we can now safely do:

$ dtrace -n fbt::mcount:
and see all the calls to it.

So, we are safe again - we can probe all functions.

I am seeing a funny in Ubuntu 8.10/64, whereby if I probe too many functions, I get a kernel trace in /var/log/messages like below, where it looks like as we enable all probes, we fire before we are really ready, and subsequently dont fire any probes at all. Reloading the driver fixes this, but not sure I understand how/why this happens fully to diagnose as yet:

[74745.164017] Call Trace:
[74745.164017]  [] warn_on_slowpath+0x64/0x90
[74745.164017]  [] ? dtrace_int3_handler+0x1f5/0x2f0 [dtracedrv]
[74745.164017]  [] ? dtrace_int3+0x47/0x53 [dtracedrv]
[74745.164017]  [] ? warn_on_slowpath+0x0/0x90
[74745.164017]  [] smp_call_function_mask+0x22c/0x240
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] ? dtrace_int3+0x47/0x53 [dtracedrv]
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] smp_call_function+0x20/0x30
[74745.164017]  [] on_each_cpu+0x24/0x50
[74745.164017]  [] flush_tlb_all+0x1c/0x20
[74745.164017]  [] unmap_kernel_range+0x2cd/0x2e0
[74745.164017]  [] remove_vm_area+0x84/0xa0
[74745.164017]  [] ? remove_vm_area+0x0/0xa0
[74745.164017]  [] __vunmap+0x55/0x120
[74745.164017]  [] ? __vunmap+0x0/0x120
[74745.164017]  [] vfree+0x2a/0x30
[74745.164017]  [] kmem_free+0x50/0x70 [dtracedrv]
[74745.164017]  [] dtrace_ecb_create_enable+0x16d/0x20b0 [dtracedrv]
[74745.164017]  [] ? dtrace_match_nul+0x9/0x10 [dtracedrv]
[74745.164017]  [] ? dtrace_match_probe+0xa2/0x100 [dtracedrv]
[74745.164017]  [] dtrace_match+0x1f5/0x2e0 [dtracedrv]
[74745.164017]  [] ? dtrace_ecb_create_enable+0x0/0x20b0 [dtracedrv]
[74745.164017]  [] ? error_exit+0x0/0x70
[74745.164017]  [] ? kfree+0x21/0x100
[74745.164017]  [] dtrace_probe_enable+0xb7/0x190 [dtracedrv]
[74745.164017]  [] ? dtrace_match_string+0x0/0x50 [dtracedrv]
[74745.164017]  [] ? dtrace_match_nul+0x0/0x10 [dtracedrv]
[74745.164017]  [] ? dtrace_match_nul+0x0/0x10 [dtracedrv]
[74745.164017]  [] ? dtrace_match_nul+0x0/0x10 [dtracedrv]
[74745.164017]  [] dtrace_enabling_match+0x9e/0x200 [dtracedrv]
[74745.164017]  [] dtrace_ioctl+0x214e/0x23f0 [dtracedrv]
[74745.164017]  [] ? __mod_zone_page_state+0x9/0x70
[74745.164017]  [] ? __rmqueue_smallest+0x11c/0x1b0
[74745.164017]  [] ? ext3_get_branch+0x21/0x140 [ext3]
[74745.164017]  [] ? put_page+0x20/0x110
[74745.164017]  [] ? prep_new_page+0x103/0x180
[74745.164017]  [] ? buffered_rmqueue+0x1b2/0x2a0
[74745.164017]  [] ? get_page_from_freelist+0x2a6/0x380
[74745.164017]  [] ? find_get_page+0x23/0xb0
[74745.164017]  [] ? find_lock_page+0x37/0x80
[74745.164017]  [] ? mark_page_accessed+0xe/0x70
[74745.164017]  [] ? filemap_fault+0x1a3/0x430
[74745.164017]  [] ? __wake_up_bit+0xd/0x40
[74745.164017]  [] ? page_waitqueue+0xa/0x90
[74745.164017]  [] ? unlock_page+0x32/0x40
[74745.164017]  [] ? __do_fault+0x134/0x440
[74745.164017]  [] ? __inc_zone_page_state+0x2a/0x30
[74745.164017]  [] ? handle_mm_fault+0x1ee/0x470
[74745.164017]  [] ? __up_read+0x8f/0xb0
[74745.164017]  [] ? up_read+0xe/0x10
[74745.164017]  [] ? do_page_fault+0x372/0x750
[74745.164017]  [] dtracedrv_ioctl+0x2d/0x50 [dtracedrv]
[74745.164017]  [] vfs_ioctl+0x85/0xb0
[74745.164017]  [] do_vfs_ioctl+0x283/0x2f0
[74745.164017]  [] sys_ioctl+0xa1/0xb0
[74745.164017]  [] system_call_fastpath+0x16/0x1b
[74745.164017]

Posted by Paul Fox | Permalink

Sat May 23 22:24:55 BST 2009

working again


I put out a new release earlier which seems to work across the various platforms and kernels.

I had disabled many opcodes, and have been adding them back in. RIP relative addressing (x86-64) is used in the kernel and I have been getting that to work (again!).

I've hit a hopefully minor issue with 'mcount' - which is in the 2.6.27+ kernels (ftrace facility). This starts with a RIP relative instruction, e.g.

mcount:
       cmpq $ftrace_stub, ftrace_graph_return
       jnz ftrace_graph_caller
       cmpq $ftrace_graph_entry_stub, ftrace_graph_entry
       jnz ftrace_graph_caller       
but I dont believe its single stepping over that initial CMPQ which is causing the issue, but, possibly whoever is calling it, e.g. the interrupt handlers themselves. Hopefully will get to fix this shortly, as that would open up the possibility to enable any instruction at the start of a function in fbt_linux.c.

I've also fixed a couple of crisp bugs/issues today, and I have one more before I put out an update for that.


Posted by Paul Fox | Permalink

Sat May 23 09:58:37 BST 2009

some progress


I *think* I have just put out a stable release -- works on 32b+64b kernels. Some silly re-entrancy issues not being handled properly. So, I need to test it more, but full-fbt tracing seems to be working.

What does this mean? Well, if this stands the test of running on my various kernels and my real non-VMware hardware, I need to start moving along.

Next up is maybe to look at the userland tracing or look at kernel stack trace operations, since that's a mess - due to the fact that kernels may be compiled with or without frame pointers, and if you don't have frame pointers, then a stack trace can only ever be a guess of where you are. (The kernel uses '?' to indicate a stack trace entry is not necessarily valid - it walks the stack, word by word, to see if anything looks like a kernel text address).
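
The word-by-word walk is simple enough to sketch. This is not the kernel's code; the function name and the idea of passing the text range in as two bounds are assumptions for the example. Every hit is only a candidate return address, which is exactly why such entries deserve a '?'.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scan nwords of stack and collect every word that falls inside
 * [text_start, text_end). Returns how many candidates were stored
 * in out (at most max_out). Hypothetical helper for illustration.
 */
size_t scan_stack(const uintptr_t *stack, size_t nwords,
		  uintptr_t text_start, uintptr_t text_end,
		  uintptr_t *out, size_t max_out)
{
	size_t i, n = 0;

	for (i = 0; i < nwords && n < max_out; i++) {
		if (stack[i] >= text_start && stack[i] < text_end)
			out[n++] = stack[i];	/* looks like a text address */
	}
	return n;
}
```

A stale value left on the stack by an earlier call passes this test just as well as a live return address, so the result can only ever be a guess.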

I really need to get some CRiSP fixes done this weekend, along with a new driver to provide low overhead TCP port to PID enumeration. I may go on to describe that in more detail later.


Posted by Paul Fox | Permalink

Wed May 20 22:22:34 BST 2009

some progress


I think my recent instability is being caused by a sillyism in code which I have not released yet (to do with a nested interrupt trap).

I rewrote the trap handlers to clean them up and put in a more powerful state machine, and its dtracing beautifully on the 64b kernel (I need to revalidate 32b and more kernels).

I am moving the debug output in /dev/dtrace to /proc/dtrace:

/home/fox/src/dtrace@vmubuntu: ls /proc/dtrace
total 0
0 ./  0 ../  0 debug  0 security  0 stats  0 trace
The key one is /proc/dtrace/trace - I am trying to move away from printk for kernel debugging, and using an internal printf-to-a-buffer mechanism (like FreeBSD), because debugging the trap handlers is painful if, by virtue of invoking printk, we invoke a recursive fault.

So, /proc/dtrace/trace is like /proc/kmsg - a private internal memory buffer to log trace info.
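
A printf-to-a-buffer mechanism can be sketched in a few lines. This is a userland illustration only: the names (trace_printf, trace_read) and sizes are invented, the buffer is a simple linear one that drops messages when full rather than wrapping, and a real kernel version would need to worry about concurrency between CPUs.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define TRACE_BUF_SIZE 4096	/* arbitrary size for this sketch */

static char trace_buf[TRACE_BUF_SIZE];
static size_t trace_len;

/* Append a formatted message; returns bytes stored, or -1 if dropped. */
int trace_printf(const char *fmt, ...)
{
	char tmp[256];
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(tmp, sizeof tmp, fmt, ap);
	va_end(ap);
	if (n < 0)
		return -1;
	if ((size_t)n >= sizeof tmp)
		n = (int)sizeof tmp - 1;	/* message was truncated */
	if (trace_len + (size_t)n > TRACE_BUF_SIZE)
		return -1;			/* buffer full: drop it */
	memcpy(trace_buf + trace_len, tmp, (size_t)n);
	trace_len += (size_t)n;
	return n;
}

/* Copy out up to len bytes starting at offset off; returns bytes copied. */
size_t trace_read(char *dst, size_t off, size_t len)
{
	if (off >= trace_len)
		return 0;
	if (len > trace_len - off)
		len = trace_len - off;
	memcpy(dst, trace_buf + off, len);
	return len;
}
```

The point of the design is that trace_printf touches nothing but its own memory, so calling it from a trap handler cannot recurse into the console or logging paths the way printk might. A read handler for the proc entry would sit on top of something like trace_read.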

More in a while when I feel happier.


Posted by Paul Fox | Permalink

Tue May 19 22:18:20 BST 2009

reliability issues


Its strange. For most of the last year the 64-bit kernel has 'just worked', even where it shouldnt have, and the nice interrupt handling of the kernel shielded me from issues.

Fixing the 32-bit kernel issues, and redoing the INT3 handling via raw interrupt patching, has caused the 64-bit kernel to be unreliable. (Unreliable means within a few seconds of fbt::: probing, we crash the kernel).

I *think* at this point its due to a page fault firing whilst handling the breakpoint and single step trap.

I am therefore revamping this core code to allow a nested page fault, and tidying up the code which had started to become a bit untidy.

Hopefully have an answer in the next day or so...


Posted by Paul Fox | Permalink

Sun May 17 21:08:10 BST 2009

64-bit issues


At present, dtrace cannot trace around the irq_return (IRETQ instruction) in the kernel. I am attempting to fix this, so for now, fbt::: will hang or panic the kernel.

The IRETQ (return from interrupt) can be returning to user mode or kernel mode, but the interrupt handler in my code doesn't/didn't handle this.

More news in a while when I have a fix. (And I need to reverify 32-bit as well).


Posted by Paul Fox | Permalink

Sat May 16 17:34:03 BST 2009

modrm - RIP relative addressing


Oops. I forgot something. 64-bit instructions can use position independent code, via the %RIP register (PC). When we single step such an instruction, we need to ensure we handle the relocation inside the single step buffer.
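
The relocation itself is just arithmetic on the 32-bit displacement. Here is a hedged sketch, assuming the caller already knows where the displacement sits inside the copied instruction (disp_off) and that the host is little-endian; the function name is made up. Since the copy has the same length as the original, the displacement only needs adjusting by the distance between the two addresses.

```c
#include <stdint.h>
#include <string.h>

/*
 * Adjust the rel32 displacement of a RIP-relative instruction that was
 * copied from orig_addr to copy_addr, so it still refers to the same
 * target. Returns 0 on success, or -1 if the adjusted displacement no
 * longer fits in 32 bits. (Hypothetical helper for illustration.)
 */
int fixup_rip_relative(uint8_t *copy, size_t disp_off,
		       uint64_t orig_addr, uint64_t copy_addr)
{
	int32_t disp;
	int64_t new_disp;

	memcpy(&disp, copy + disp_off, sizeof disp);

	/* orig_next + disp == copy_next + new_disp */
	new_disp = (int64_t)disp + (int64_t)(orig_addr - copy_addr);
	if (new_disp < INT32_MIN || new_disp > INT32_MAX)
		return -1;

	disp = (int32_t)new_disp;
	memcpy(copy + disp_off, &disp, sizeof disp);
	return 0;
}
```

The -1 case matters in practice: if the single step buffer is more than 2GB away from the original code, the displacement cannot be patched and the instruction has to be handled some other way.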

Looking at kprobes reminded me of this, so we need to do that. (I could do with collecting stats of where such an instruction exists in the first instruction of a function; I could be worrying unnecessarily). I am seeing some 64-bit instability (especially as I bring more probes into scope by enabling more instruction opcodes).

Let me see if this improves things....


Posted by Paul Fox | Permalink

Fri May 15 23:04:04 BST 2009

a problem...


Theres a problem with dtrace which appears if you do this:
$ dtrace -n fbt:::

immediately after loading the driver. If you give the driver a chance to 'breathe', and do this:

$ dtrace -n fbt::sys_chdir:
control-c it, and then redo the big test, then all seems well. I suspect first time page fault or something triggering this, but still diagnosing.

Posted by Paul Fox | Permalink

Fri May 15 22:15:34 BST 2009

Security policy for dtrace now working


I've written a security policy mechanism for dtrace, which supports the documented mechanisms for Solaris dtrace. Since, under Linux, things work differently, this is achieved by loading the security rules into the kernel just after driver load time. This is done via read/write to /dev/dtrace.

Heres an example:

/home/fox/src/dtrace@vmubuntu: cat /dev/dtrace
here=0
cpuid=1
all  priv_kernel
uid  200 priv_proc
gid  201 priv_kernel
all  priv_owner
(The first two entries are just debug). The other entries form a table which is scanned. The first entry -- "all priv_kernel" enables everyone to do everything. This would not normally be put into a system, unless you want to allow anyone to do things.

The next example shows user id 200, has priv_proc (dtrace_proc) privilege which allows users to monitor only their own processes.

Enumerating uid's may be too much, so you can also specify a gid, as in the third example.

So, the format is:

[uid | gid] <nnn> [priv_user | priv_kernel | priv_proc | priv_owner]
or
all [priv_user | priv_kernel | priv_proc | priv_owner]
Multiple privs can be specified on the same line, e.g.
uid 200 priv_proc priv_owner
(Not sure this is necessary since the perms form a hierarchy, each one embracing more privilege than the prior, but am mirroring the Solaris model).

By default, /etc/dtrace.conf is read and loaded into the kernel. You can cat /dev/dtrace to verify what it has is what you think it has.
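
To show how such a table might be scanned, here is a minimal sketch. The struct layout, the flag values and the "first matching entry wins" rule are all assumptions made for this example (the real flags are based on the DTRACE_PRIV_xxxx definitions, and the driver may combine entries differently).

```c
enum sec_type { SEC_ALL, SEC_UID, SEC_GID };

/* Flag values are hypothetical, chosen only for this sketch. */
#define PRIV_USER	0x1
#define PRIV_KERNEL	0x2
#define PRIV_PROC	0x4
#define PRIV_OWNER	0x8

struct sec_entry {
	enum sec_type type;
	unsigned int id;	/* uid or gid; ignored for SEC_ALL */
	unsigned int privs;	/* OR of PRIV_xxx flags */
};

/* Return the privs of the first entry matching uid/gid, or 0 if none. */
unsigned int sec_lookup(const struct sec_entry *tab, int n,
			unsigned int uid, unsigned int gid)
{
	int i;

	for (i = 0; i < n; i++) {
		if (tab[i].type == SEC_ALL ||
		    (tab[i].type == SEC_UID && tab[i].id == uid) ||
		    (tab[i].type == SEC_GID && tab[i].id == gid))
			return tab[i].privs;
	}
	return 0;
}
```

With a first-match rule, an "all" line only works as a fallback if it comes last; placed first, it would shadow every uid and gid line after it.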


Posted by Paul Fox | Permalink

Wed May 13 23:07:44 BST 2009

security time


Now that things seem to have stabilised in dtrace, I am looking at the security model. (I dont claim dtrace is perfect - I have seen some stability issues, such as a kernel panic when putting a real CPU to sleep (32-bit kernel), even when nothing was being traced).

However, a defined security model is needed - mostly based on the Solaris equivalent, but we don't do things that way on Linux. Here is a comment added to dtrace_linux.h defining the mechanism.

If people object or have extra wish-list items, feel free to suggest them.

/**********************************************************************/
/*   The following implements the security model for dtrace. We have  */
/*   a list of items which are used to match process attributes.      */
/*                                                                    */
/*   We  want  to be generic, and allow the user to customise to the  */
/*   local   security   regime.   We  allow  specific  ids  to  have  */
/*   priviledges and also to do the same against group ids.           */
/*                                                                    */
/*   For  each security item, we can assign a distinct set of dtrace  */
/*   priviledges   (set   of   flags).   These   are  based  on  the  */
/*   DTRACE_PRIV_xxxx  definitions.  On  Solaris, these would be set  */
/*   via  policy  files accessed as the driver is loaded. For Linux,  */
/*   we  try  to generalise the mechanism to provide finer levels of  */
/*   granularity  and  allow  group  ids  or  groups  of ids to have  */
/*   similar settings.                                                */
/*                                                                    */
/*   The  load.pl script will load up the security settings from the  */
/*   file  /etc/dtrace.conf  (if  available). See the etc/ directory  */
/*   for an example config file.                                      */
/*                                                                    */
/*   The format of a command line is as follows:                      */
/*                                                                    */
/*   clear                                                            */
/*   uid 1 priv_user priv_kernel                                      */
/*   gid 23 priv_proc                                                 */
/*   all priv_owner                                                   */
/*                                                                    */
/*   Multiple  priviledge  flags  can be set for an id. The array of  */
/*   descriptors   is  searched  linearly,  so  you  can  specify  a  */
/*   fallback, as in the example above.                               */
/*                                                                    */
/*   There  is  a limit to the max number of descriptors. This could  */
/*   change  to  be  based  on  a dynamic array rather than a static  */
/*   array,  but  it  is  expected  that a typical scenario will use  */
/*   group  mappings,  and  at  most, a handful. (Current array size  */
/*   limited to 64).                                                  */
/**********************************************************************/
BTW, if people wonder how those comments are neatly boxed, it is because I use the CRiSP editor to do my work, and a single highlight followed by Ctrl-F reformats in that mode. (It handles re-formatting also, so it's not tedious to make changes to the comments). In addition, spell checking inside the comment let me notice that I have misspelled privilege in the above comment (but, fixed in the source).

Posted by Paul Fox | Permalink

Mon May 11 23:25:47 BST 2009

fixed predicates


Thanks to mauritz.sundell for pointing out predicates were broken. I had seen evidence of this myself, where some entry probes didnt have matching return probes, and a simple test, using:
dd if=/dev/zero of=/dev/null
showed this up with a simple dtrace call like:
$ dtrace -n 'syscall::read: /pid == 7331/ {printf("pid=%d", pid);}'

The process shadowing code in dtrace_linux.c (par_alloc) had been set to only monitor a single process during earlier debug sessions. I unset that and fixed some other sillynesses, and now it will create a new proc structure as it sees them. (Theres still no garbage collection for the structs, and the data structure is a horrid linked-list, so this can cause some slowdowns as each probe will go on a search for the process to see if we have seen it before).

That code needs a hash table, or, at least, move the new proc to the front of the chain, or something, but thats relatively minor.
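
The cheaper of the two options, moving a found entry to the front of the chain, might look like the sketch below. The struct here is a stand-in with just a pid and a next pointer, not the real proc structure from dtrace_linux.c, and the function name is invented.

```c
#include <stddef.h>

/* Stand-in for the shadow proc structure (hypothetical layout). */
struct proc_node {
	int pid;
	struct proc_node *next;
};

/* Find pid in the list; on a hit, move the node to the front. */
struct proc_node *proc_find_mtf(struct proc_node **head, int pid)
{
	struct proc_node *prev = NULL, *p;

	for (p = *head; p != NULL; prev = p, p = p->next) {
		if (p->pid != pid)
			continue;
		if (prev != NULL) {	/* not already at the front */
			prev->next = p->next;
			p->next = *head;
			*head = p;
		}
		return p;
	}
	return NULL;
}
```

Since a busy process tends to fire many probes in a row, the repeat lookups become a single comparison, although the worst case is still a full walk of the list.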

Of course, theres likely to be more bugs, and I hope my recent fix here wont destabilise.

I put more debug trace hooks (which are visible in /dev/dtrace), but will likely remove them or move them around, so dont depend on them for anything consistent.


Posted by Paul Fox | Permalink

Mon May 11 20:51:50 BST 2009

And so the theory comes to pass...


Hopefully fixed the 32-bit instability issue. I modified dtrace_invop to let the underlying callbacks (notably fbt_invop) know we shouldn't execute a probe if called from a re-entrant INT3 handler. This seems to fix the problem.
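
The general shape of such a guard is a per-CPU "already in a probe" flag. The sketch below is not the dtrace_invop change itself; the names and the fixed-size array are assumptions for illustration, and real kernel code would use proper per-cpu data.

```c
#define MAX_CPUS 64	/* hypothetical limit for this sketch */

static int in_probe[MAX_CPUS];

/*
 * Returns 1 if the caller may run the probe on this cpu, or 0 if we
 * are already inside a probe there (a re-entrant trap).
 */
int probe_enter(int cpu)
{
	if (in_probe[cpu])
		return 0;
	in_probe[cpu] = 1;
	return 1;
}

/* Mark the probe on this cpu as finished. */
void probe_exit(int cpu)
{
	in_probe[cpu] = 0;
}
```

A handler would call probe_enter() first and, on a 0 return, skip firing the probe and just step over the original instruction.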

If, for any reason, we hit a probe whilst processing a probe, all bets are off. This is most likely because of the way dtrace folds around the kernel, rather than being a part of the kernel source, but its frustrating to have a hung kernel or the CPU taking a triple fault (which simply causes a reboot) and having no way to diagnose.

I note in the FreeBSD version of dtrace, they have an in-memory version of printf() to log trace info, which can then be seen in a stable context. I was going to do that, but havent gotten around to it. (dtrace_linux.c sort of has that - but a minimal "this is what happened when I was blind/deaf").

What it uncovers is that occasionally we get a page fault whilst processing the probe, which was causing the recursion error.

Not sure why -- Linux allows kernel page faults -- but I thought they wouldnt/couldnt happen inside a probe - since the kernel should be entirely memory resident (but maybe I am misreading).

Ideally, if this is the only reentrancy issue, then I need to fix/allow for a page fault to be handled, because this is a very valuable probe. We may or may not be interested in kernel page faults, but we definitely want user page faults (I am not sure we can get one of these in an INT3 probe handler, since the probe will have fired from kernel space, and even if another cpu is doing user-land stuff, that other cpu will see the fault).

It leads to potentially interesting scenarios, such as a probe firing BECAUSE of a userland page fault, which in turn causes the kernel page fault. Only time and a clear head will tell.

Someone has kindly reported a bug where calls to the read() syscall dont always have matching entry/return probes. Every read should return, unless some form of exception happens.

So, lets see if this is easy to fix, and then I can go back to the security wrapper.

By the way, "cat /dev/dtrace" will give some stats - not many, but a key one is "dcnt" which tells you how many times dtrace_probe() was called. Useful when trying to see if "anything is happening". There's a simple script -- utils/dmon -- which monitors this every second (so I can tell if my kernel has hung or is just 'busy').


Posted by Paul Fox | Permalink

Sun May 10 22:01:35 BST 2009

mutexes and semaphores


I've updated the code to use semaphores instead of mutexes as previously noted. Fortunately, this was pretty easy with a bit of work in the linux_types.h include file.

To be honest, I am seeing the same behaviour - 64-bit 2 cpu seems to work fine, but 32-bit 2 cpu, seems to be flaky.

Next .. off to look at the places where I *dont* think the problem is, because, chances are, thats where the problem is. (In my experience, when you have poked at a piece of code and tried everything, chances are high the problem is not in the area you thought it was).

The main difference between 32 and 64 bit dtrace is really in the interrupt handler. Most other code differences are due to the size of registers or odd portability issues in the kernel...


Posted by Paul Fox | Permalink

Sun May 10 20:59:35 BST 2009

That Coyote moment....


In Roadrunner vs Coyote, we always have that scene where coyote runs over the edge of a cliff, and not until he looks down, does he realise the dire situation he is in, and then *KAPOW* !

That's just happened to me with mutexes. I have transliterated the Solaris kmutex and mutex types into Linux mutexes, but this is illegal. A kernel mutex is part of a process context - it can block and the caller process sleeps, allowing another process to run.

The documentation specifically states you cannot call the mutex API from an interrupt.

Guess what my code does ?! .... *KAPOW*

I believe I should be using semaphores and not mutexes, so am in the process of converting over to see what happens.

I believe this is a part of the puzzle of SMP kernels where under load the kernel will panic, hang or triple-fault the CPU.

More in a few days when I feel happier.

Meep-meep.


Posted by Paul Fox | Permalink

Sat May 09 23:25:55 BST 2009

glibc: I take some of that back


Hmm... my last blog complained about glibc. That's unfair - for this particular problem.

This is GCC's fault - it compiles in a stack protection mechanism, by default, and it looks like it is using the %GS register as part of this mechanism.

Here's an example:

int main()
{
	char	buf[10];
	xreadlink("/tmp/x", buf, sizeof buf);
}
and the assembler generated by compiling:
$ gcc -m32 -S x.c
$ cat x.s
        .file   "x.c"
        .section        .rodata
.LC0:
        .string "/tmp/x"
        .text
.globl main
        .type   main, @function
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        subl    $36, %esp
        movl    %gs:20, %eax  // HERE
        movl    %eax, -8(%ebp)
        xorl    %eax, %eax
        movl    $10, 8(%esp)
        leal    -18(%ebp), %eax
        movl    %eax, 4(%esp)
        movl    $.LC0, (%esp)
        call    xreadlink
        movl    -8(%ebp), %edx
        xorl    %gs:20, %edx
        je      .L3
        call    __stack_chk_fail
.L3:
        addl    $36, %esp
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .ident  "GCC: (Ubuntu 4.3.2-1ubuntu12) 4.3.2"
        .section        .note.GNU-stack,"",@progbits

This means a binary compiled with this gcc won't run on an AS4 (glibc 2.3) system.

The solution is to turn off this 'feature' (which I thought I had done).

Now off to recheck the gcc man page...


Posted by Paul Fox | Permalink

Sat May 09 23:04:02 BST 2009

glibc is brain dead


I really cannot believe what has happened to glibc - it is a real shame. Sun and AT&T had it right, from the start - backwards compatibility. No binary shall be compiled with a dependency on the operating system. That's why we have shared libraries - the application ensures compliance with an ABI.

But glibc thinks different. glibc thinks it can break applications and cause a nightmare for application maintainers.

We're talking CRiSP here, and not dtrace.

What has gotten my goat ?

"readlink()" - a relatively rarely used syscall, but yet, glibc developers decided that when you call readlink, it should "optimise" the call and use the %GS register (used for threads - not sure what else.. yet), rather than simply do a function call, exactly as you would expect.

Why do I care? Because glibc stops you creating a binary on one release of Linux and having it run on any other (upwards or downwards).

Recently, I wrote a tool to remove the stupidity that glibc has when it attacks your object and executable files. It works nicely. An application, like crisp, which is compiled on, say, glibc 2.7, can run on 2.3 as well.

Unless we call readlink(), in which case we get a segmentation violation.

Now I know what the dastardly /usr/include files are up to, I shall put a wrapper, and allow my terminal emulator (fcterm) and CRiSP, to be portable.
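A wrapper of the kind described could look something like the sketch below. The name xreadlink is borrowed from the test program in the follow-up post; going through syscall(2) directly is my assumption about how to sidestep whatever the headers substitute for readlink(), not a record of what was actually written, and whether it helps depends on what the headers were really doing.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>        /* AT_FDCWD */
#include <sys/syscall.h>  /* SYS_readlink / SYS_readlinkat */
#include <sys/types.h>    /* ssize_t */
#include <unistd.h>       /* syscall() */

/* Hypothetical wrapper: same contract as readlink(2), but it enters the
 * kernel via syscall(2) instead of the readlink() declaration that the
 * system headers provide. */
ssize_t xreadlink(const char *path, char *buf, size_t bufsiz)
{
	assert(path != NULL && buf != NULL);
#ifdef SYS_readlink
	return (ssize_t) syscall(SYS_readlink, path, buf, bufsiz);
#else
	/* Some architectures only have readlinkat. */
	return (ssize_t) syscall(SYS_readlinkat, AT_FDCWD, path, buf, bufsiz);
#endif
}
```

As with readlink(2), the result is the number of bytes placed in buf (not NUL terminated), or -1 with errno set.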

There...thats off my chest.


Meanwhile...back in dtrace land...

There appears to be a bug in multicpu environments, where cross-cpu function calls are broken - leading to crashes or hangs. The function we call isn't safe from an interrupt routine, so I need to do some engineering work to allow the inner parts of dtrace_probe to work, without the crash.

Not sure why I don't see this on the 64-bit platform, but, you never know what bad things I am doing, which is why we all need to test and prove the validity of the dtrace releases.


Posted by Paul Fox | Permalink

Sat May 09 18:33:48 BST 2009

32-bit problems .. again


I put out a new release today, but 32-bit dtrace seems to have a problem on multicpu boxes. Looks ok on a single cpu.

Interesting that I have been running vmware Ubuntu 32-bit on a single virtual cpu - and didnt detect issues, but now I have it set to 2 cpus, they show up (just like my real machines).

Hope to fix very soon....


Posted by Paul Fox | Permalink

Fri May 08 21:58:26 BST 2009

The SDT Provider


The SDT provider, which presently, does pretty much nothing in Linux dtrace, is potentially one of the more interesting providers to get working.

In reading the recent FreeBSD 7.2 release notes, and looking at other people's dtrace blogs, mention is made of the various providers which hide in this single/simple source file.

SDT is a provider which lets you compile into the kernel, calls to dtrace. Probes at certain points will compile to a NOP, unless the probe is enabled.

Remember, we are not going to break the GPL terms or the CDDL licensing terms, and, anyway, we are not a part of the kernel source tree, and patching kernel source code is too tedious and error prone. (See previous posting on issues with simply validating compiles).

I have a new, novel, and GPL patent-free idea to resolve this.

SDT is mostly a set of probes on key parts of the various drivers which increment statistics. For example, the NFS driver has a lot of stats exposed via the "nfsstat" tool.

Patching the source code is painful, but that is only one way to get the desired effect.

Whilst working on FBT - an idea struck me, and I think we can use this in SDT. FBT disassembles the kernel, looking for function entry and exit points. What if we looked for something else? How about all "INC" instructions, or "ADD" instructions?

That would make for a very boring set of probes - probes based on instruction types!

But, say we looked at the relocation info whilst we disassembled - we could intercept the memory reads/writes to certain areas of memory - specifically, anywhere that falls inside a struct full of counters. Now we could instrument every/any struct reference in the kernel (easier if we talk about a single static struct, not dynamically allocated structs, such as the process tree). We can see these symbols in /proc/kallsyms, and we can create a table of ones we want to intercept, and create on-the-fly providers.
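The first step of that plan, pulling candidate symbols out of /proc/kallsyms, can be sketched in a few lines. Each line there is "address type name" (optionally followed by a module name), with the type letter following nm conventions. The struct and function names below are made up for illustration, and "example_stats" in the usage is not a real kernel symbol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/kallsyms (hypothetical layout for this sketch). */
struct ksym {
	unsigned long long addr;
	char type;          /* nm-style letter: T/t text, D/d data, B/b bss ... */
	char name[128];
};

/* Parse "address type name"; returns false if the line is malformed. */
bool parse_kallsyms_line(const char *line, struct ksym *out)
{
	assert(line != NULL && out != NULL);
	return sscanf(line, "%llx %c %127s",
		      &out->addr, &out->type, out->name) == 3;
}

/* Counters live in data or bss, so those are the symbols worth a look. */
bool is_data_symbol(const struct ksym *s)
{
	return s->type != '\0' && strchr("bBdD", s->type) != NULL;
}
```

Feeding it "ffffffff81a2b3c0 D example_stats" yields a data symbol at that address; a "T" line parses but is rejected by is_data_symbol(). Matching the surviving names against a table of wanted structs would be the next step.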

That is my plan - and see what falls out.

Ideally we could do it with any/every struct like this in the kernel - maybe a piece of Perl code to find them, and some other glue (we may not have kernel source, but we will have the headers).

Anyway, need to finish the cross-compile stuff (2.6.22 kernel is on my bug report list), and then to write the security wrapper for dtrace, so you can lock down or liberate the use of dtrace for use in commercial environments.

Then, we play...


Posted by Paul Fox | Permalink

Fri May 08 21:55:13 BST 2009

cross-compiling


With so many issues to worry about, one issue is validating compilation on 32 and 64 bit kernels. There are a huge number of kernels, and I was hoping that with 2.6.2[6789], 2.6.30 and 2.6.9, that this would be it.

Alas, I get build reports of issues on 2.6.19 and 2.6.22 and probably every other release out there. C'est la vie.

Although the included makefile includes a target to build all the kernels in /lib/modules, I have extended it to do cross compilation for 32bit versions of your 64-bit kernels. This is mainly for my benefit - the output of this cross-built driver is almost certainly useless, but at least I get to see some of the compile errors and config issues.

Unlike most systems and tools, dtrace can be a better product because it supports the old systems. By their very nature these will never get "fixed": the answer is always that a later kernel is available, which doesn't help you if you are stuck with an ancient distro.


Posted by Paul Fox | Permalink

Wed May 06 20:31:04 BST 2009

Success ! dtrace_gethrtime - you naughty boy.


Spent the last few days trying to figure out why the AS4 (2.6.9) kernel was so close, yet so far. "dtrace -n fbt:::" (probe on all probe-able functions in the kernel) would crash.

I tried my normal debug tricks, e.g. divide-by-two search, to find the problematic functions. When this failed, I had to re-engineer the interrupt hooks for INT1+INT3, so we attach directly to the vectors. The beneficial side effect here is we are more decoupled from the kernel and can trace int1/int3 traps from the user or kprobes.
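For anyone who hasn't used the divide-by-two trick: enable half the probes, see if the machine dies, and recurse into whichever half is guilty. A minimal sketch follows, assuming exactly one deterministic culprit (which, as it turned out, was not the situation here). All the names are invented for the example; in real life the callback is "reboot and try it by hand".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if enabling probes [lo, hi) brings the machine down. */
typedef bool (*range_crashes_fn)(size_t lo, size_t hi, void *ctx);

/* Narrow [lo, hi) down to the single index that causes the crash. */
size_t bisect_culprit(size_t lo, size_t hi, range_crashes_fn crashes, void *ctx)
{
	assert(lo < hi && crashes != NULL);
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;
		if (crashes(lo, mid, ctx))
			hi = mid;	/* culprit is in the lower half */
		else
			lo = mid;	/* otherwise it must be in the upper half */
	}
	return lo;
}

/* Stand-in for the real experiment: ctx points at the index of the
 * one bad function. */
bool demo_crashes(size_t lo, size_t hi, void *ctx)
{
	size_t bad = *(const size_t *) ctx;
	return bad >= lo && bad < hi;
}
```

With around 15,000 probes that is about 14 reboots, which is bearable. It stops being useful the moment the crash depends on timing rather than on which function is probed.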

But this just failed.

Annoying!

Very annoying!

Annoying because it's so difficult to debug - no way to print out where we got to. This required a lot of "thought-experiments".

I knew it must be something like "a probe inside an interrupt routine", and/or "a probe interrupt invoking a recursive INT3 probepoint", but some test cases showed it not to be so simple. It might run for a few seconds or crash straightaway.

Next up was to deliberately force kernel panics to prove which bits of code were reached.

None of this was really conclusive, except to tell me "something is wrong".

I even read up on the Intel Architecture manual to refresh my mind on the finer points of kernel traps. (I now know that when a trap is taken, the caller's SS:ESP is pushed onto the stack, but only on the transition from user -> kernel. A kernel trap whilst in kernel mode does not do this. It all makes perfect sense if you do a "thought experiment".)

Anyhow, I hit on the perfect test case theory: let's assume it's something in dtrace_probe() which is causing the issue. My first reaction was, well:

$ nm -u build/driver/dtracedrv.ko
and stare at the list of functions which are undefined (linked in at module load time). There's quite a few, but most of the hairy and simple ones (like strchr, memchr, memcmp, sprintf) are not called (or are they?) during probe time, only during the module startup.

Ok, next step, let's short-circuit dtrace_probe(). Success! No kernel hang. Making it return without doing anything proves the interrupt vector assignments work, and normal kernel interrupts and workloads are unaffected.

I added a counter so I could see we got there (cat /dev/dtrace).

Next, I started letting more and more of the code run.

Then I hit on the call to dtrace_gethrtime(). This is an interesting function. dtrace_probe() keeps track of the time since the userland program was started. If the kernel is looking unresponsive, e.g. a bug in dtrace, but, more likely, too complex probing, then it will disassociate the D program and you get a message from dtrace saying that the system appears to be unresponsive: an air bag! Very nice design point.

So, now I knew it was dtrace_gethrtime(). Why? Well, how do we get accurate (nanosecond level) times out of the kernel? Well, in userland we typically use the time() system call or gettimeofday(), and other variants. Internally, the kernel is driven by timer interrupts, and increments an internal counter.

This internal counter is then exposed to other parts of the system via a lock (mutex like) mechanism. Since time is recorded in nanosecond units, a 32-bit entity will overflow in about 4 seconds, so a 64-bit quantity is used (typically, seconds + nanoseconds). So, there is no atomic increment, or atomic read. You might read the seconds just as the nanoseconds roll over. From an application point of view, the clock may be seen to go backwards (3.97, 3.98, 3.99, 4.99, 4.00, 4.01).
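To make the rollover problem concrete: 2^32 nanoseconds is roughly 4.29 seconds, hence the two-part value, and a two-part value needs some way to be read consistently. One common technique is a sequence counter: the writer bumps it to an odd value before updating and to an even value afterwards, and a reader retries if it saw an odd value or the value changed underneath it. The sketch below is a userland illustration with C11 atomics and a single writer; the names (hr_clock and friends) are invented, and it is not meant as a copy of what any particular kernel does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct hr_clock {
	atomic_uint seq;        /* odd while the writer is mid-update */
	_Atomic uint64_t sec;
	_Atomic uint64_t nsec;
};

/* Single writer only (think: the timer tick). */
void hr_clock_write(struct hr_clock *c, uint64_t sec, uint64_t nsec)
{
	assert(nsec < 1000000000ULL);
	unsigned s = atomic_load_explicit(&c->seq, memory_order_relaxed);
	atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&c->sec, sec, memory_order_relaxed);
	atomic_store_explicit(&c->nsec, nsec, memory_order_relaxed);
	atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Retry until both halves come from the same update. */
uint64_t hr_clock_read_ns(struct hr_clock *c)
{
	unsigned s1, s2;
	uint64_t sec, nsec;

	do {
		s1 = atomic_load_explicit(&c->seq, memory_order_acquire);
		sec = atomic_load_explicit(&c->sec, memory_order_relaxed);
		nsec = atomic_load_explicit(&c->nsec, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&c->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);

	return sec * 1000000000ULL + nsec;
}
```

Note the catch: the reader spins for as long as the writer is mid-update. A reader that has interrupted the writer on the same CPU would spin forever, so a scheme like this is only safe where that cannot happen.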

This is no issue for the later 2.6.20+ kernels - we have code to directly read the counter (but am missing the dtrace inspired code to handle correct atomic reads; that's for another day). Anyhow, on 2.6.9, the interface/naming conventions are different.

I cheated. I call do_gettimeofday(), which is effectively the userland system call interface. Well, ... it worked, the day I implemented it ! But it's horrible - we cannot call this from an interrupt routine - if the clock is ticking and we hit a probe (very high probability if we plant probes on every function in the kernel), then we deadlock inside the interrupt routine.

So, that explains why we became unresponsive - no panic. Nothing. Niente.

After reviewing the dtrace_gethrtime() code, and do_gettimeofday in the earlier kernel, there's a simple workaround.

And voila:

/home/fox/src/dtrace@vmas4: dtrace -n fbt::: | head -50
dtrace: description 'fbt:::' matched 15754 probes
CPU     ID                    FUNCTION:NAME
  0  12452                   schedule:entry
  0    793           recalc_task_prio:entry
  0    794          recalc_task_prio:return
  0     33                __switch_to:entry
  0    878          remove_wait_queue:entry
  0    879         remove_wait_queue:return
  0  14313        e1000_watchdog_task:entry
  0  14442         e1000_update_stats:entry
  0  14491         e1000_read_phy_reg:entry
  0  14337    e1000_swfw_sync_acquire:entry
  0  14335 e1000_get_hw_eeprom_semaphore:entry
  0  14336 e1000_get_hw_eeprom_semaphore:return
  0  14339    e1000_swfw_sync_release:entry
  0  14492        e1000_read_phy_reg:return
  0  14491         e1000_read_phy_reg:entry
  0  14337    e1000_swfw_sync_acquire:entry
  0  14335 e1000_get_hw_eeprom_semaphore:entry
  0  14336 e1000_get_hw_eeprom_semaphore:return
  0  14339    e1000_swfw_sync_release:entry
  0  14492        e1000_read_phy_reg:return
  0    403  smp_local_timer_interrupt:entry
  0    950               profile_tick:entry
  0    404 smp_local_timer_interrupt:return
  0    125                     do_IRQ:entry
  0    117           handle_IRQ_event:entry
  0    189            timer_interrupt:entry
  0   1122       update_process_times:entry
  ....
and this seems to work nicely.

For those of you with eager eyes, look at the functions we call. (I am ssh'ed into the machine, which is why the ethernet driver is busy).

I just got a kernel hang when piping the dtrace into 'head', so I will fix that, and hope I will do a new release tonight...


Posted by Paul Fox | Permalink

Mon May 04 22:27:15 BST 2009

nearly there...


I'm still fixing some regressions in 32-bit and 64-bit kernels. The 2.6.9 kernel has proven problematic, because, quite simply, the later kernels are so good, they are doing stuff for me I never realised.

The current problem *looks like* double-fault handling. In the context of an INT3 interrupt, we may hit another one, and the kernel panics with a double-fault Oops message.

Later kernels seem to perform some magic with TSS states for processes and kernels, which protects from this. Again, this is difficult to debug, and am looking to understand/fix the latest issues.

More in a few days.


Posted by Paul Fox | Permalink

Sun May 03 23:32:25 BST 2009

why did it work before?


Finally found why 64-bit dtrace worked, using instruction emulation, when it never should have done, and why on a 2.6.9 kernel, it didnt.

Turns out the later kernels are nice and clever. When a kernel mode interrupt/exception occurs, a new stack is given to the interrupt. This new stack does not abut the callers area, so, when the instruction emulator wrote below %RSP, we never clobbered anything.

Kernels 2.6.28 upwards work real nicely, but 2.6.9 lacks this nested kernel facility.

Of course, we need it on the earlier kernels even if they dont support it, but we can emulate that, now we know what is going on.

The magic is in entry_32.S and entry_64.S of the kernel, where it judiciously manages the stack frame entry/exit.

Lets see if I can be as good a citizen...


Posted by Paul Fox | Permalink

Sun May 03 18:39:59 BST 2009

more problems...


The new int1/int3 direct trap handlers nearly work, but not quite...something is causing a panic or kernel GPFs sporadically and until I fix that, the new release is not better than the old one.

I *think* this is due to random interrupts - either interrupts during an INT3 handler or vice versa. What's strange is that it does work for a bit, but then crashes.

I think this may be due to stack switching - my int handler doesn't do anything special for the stack switch (or touching the %GS register), and maybe we end up with a miscommunication with the kernel - it assumes we haven't switched to a debug stack, but in fact we have...sort of.

Difficult to debug because there's not much to print and with the kernel crashing, difficult to pick up the issue.

I am currently debugging on a 32bit 2.6.27 kernel (lot easier than the 64bit 2.6.9 kernel which just goes silent when things die).

Hope to get this fixed very shortly...


Posted by Paul Fox | Permalink

Sat May 02 13:47:00 BST 2009

depression..or why is it so difficult ?!


Got home last night to get cracking on fixing the issues, and was making good progress. I decided to switch to direct INT3 vector interception, since relying on the kernel notifier call chain means we hit lots of issues on older kernels with non-reentrant trap handlers, and I end up using a binary search to find the functions for which "fbt:::" hangs/crashes the kernel.

Minutes to get this working. Then...nothing.

I kept staring at the code, wondering why it didnt work. (Side note: I was doing this on a 2.6.27 kernel; luckily the panics in the trap handler were recoverable, so I didnt need to reboot or restore my snapshots; I *do* like that in the later kernels - they try really hard to help you with bad drivers. Nice).

Anyway, having gone a few hours with no forward progress, I stumbled on it today (after a good night's sleep):

	call dtrace_int3_handler
	cmp 1,%rax
	je int3_handled
Call the function, check the return code. Couldnt be easier.

Took me a long time to realise what is wrong in those lines of code above. (See bottom of update for the answer!).

Once I fixed that - it worked - enabling all probes on 2.6.27 was fine. I checked for userland INT3 traps, i.e. a debugger's normal workload, and made a fix to avoid this interception for userland - we can just let the kernel handle userland so that all registers, contexts, and stacks are correct. For kernel trapping, we can short cut some of the complexity in the kernel, with no loss/effect on anyone else.

Next up is to address a compile issue in patching the debug interrupt gate on 2.6.9, since the structures changed a bit. Then I need to do the same assembler for 32-bit cpus, and this phase should be complete.


Answer: "1" means address 0x0000000000000001, not constant 1. It needs to be "$1" in the compare instruction.
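For the record, here is the corrected comparison wrapped in a bit of C so it can be compiled and poked at. This is only a sketch: the function name is invented, it uses GCC-style inline assembler, and on anything other than x86-64 it falls back to plain C. The point is the "$1": with the dollar sign the operand is the immediate value 1, without it the assembler treats 1 as a memory address.

```c
#include <assert.h>

/* Returns 1 if v == 1, using the fixed form of the compare. */
int rax_is_one(long v)
{
#if defined(__x86_64__)
	unsigned char eq;

	__asm__("cmpq $1, %1\n\t"	/* $1 = immediate constant 1 */
		"sete %0"		/* eq = 1 if the operands were equal */
		: "=q"(eq)
		: "r"(v)
		: "cc");
	assert(eq == 0 || eq == 1);	/* sete only ever writes 0 or 1 */
	return eq;
#else
	return v == 1;
#endif
}
```

Writing "cmpq 1, %1" instead would still assemble, which is exactly why the mistake is so easy to stare straight past.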

Posted by Paul Fox | Permalink

Fri May 01 19:10:40 BST 2009

The INT1/3 traps, re-entrancy, and reachability


I am late in putting out a new release, but am being pedantic to check I am happy before people crash their kernels with known issues.

The key difference I note in 2.6.9 and much later kernels is re-entrancy. Think about an INT3 trap - we are in the kernel, we step on a breakpoint instruction, and trap into the kernel.

Now, whilst processing that INT3 trap, we step on another function - the path to the int3 handler in dtrace and the path out of that handler can double and triple trap. As I mentioned in a prior article, this is handled by detecting the fact we hit a critical path probe, and auto-disable it.

Now, back to 2.6.9 and AS4. The kernel interrupt code doesn't seem to understand double/triple nested faults, and screws up the stack, causing a crash. This means I am having to validate toxic.c to ensure these unanticipated functions/probes are on the blacklist.

I don't like this - so this is an intermediate solution - ensure:

$ dtrace -n fbt:::
is stable for the kernels I have, and then fix this properly.

My next solution is to overwrite the int3/int1 trap handler vectors so we take control, save all registers, do our stuff, and get out quick.

This means we can put probes on the real int1/int3 handlers, along with the die_chain notifier list, and all the kprobes functions.

The only downside is a little more assembler in dtrace to save the stack and "do the right thing", but means we are more kernel independent.

Hope to release over the weekend.


Posted by Paul Fox | Permalink

Wed Apr 29 22:54:27 BST 2009

toxic ranges


A toxic range is an area we cannot place a probe, such as the internals of dtrace itself, since this can cause a recursive probe issue and crash the kernel.
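In its simplest form a toxic range check is just a table of address ranges and a lookup done before a probe is planted. The sketch below shows the idea in plain C; the struct and function names are invented for illustration, and the addresses in any usage are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open address range [start, end) where probes must not go. */
struct toxic_range {
	uintptr_t start;
	uintptr_t end;
};

/* True if addr falls inside any range in the table. */
bool is_toxic(const struct toxic_range *tab, size_t n, uintptr_t addr)
{
	assert(tab != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (addr >= tab[i].start && addr < tab[i].end)
			return true;
	return false;
}
```

A linear scan is fine for a handful of ranges checked at probe creation time; nothing here runs in the trap path.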

With the switch to single-stepping in the kernel - this is a problem, because now we have to be more careful about reachability - for instance, calling printk() from inside the trap handler in dtrace, means we must mark as toxic every function printk calls.

This is not nice. It potentially wipes out a lot of useful low level probes (spin locks, mutexes, interrupt disablers, etc).

With instruction emulation, this wasn't so bad - we didn't rely on recursive trap handlers and it worked.

kprobes works by marking unprobable functions with a GCC function attribute, and storing the info in a special ELF section of the kernel (the __kprobes macro). I was going to rely on this to help me out, but (a) that's cheating, and (b) it means I may inherit more toxicity than we actually need.

"printk" as an example is only a problem if trap debugging is left turned on.

I have solved this problem in a novel way. I hereby condemn my solution as GPL/CDDL compatible.

In the event we have a nested single-step trap, we auto remove the nested trap probe point. The kernel remains fully functional - we just disable those problem areas for probing. (Something is written to the kernel logs, e.g. dmesg and /var/log/messages to help analyse this). So, initially the kernel gets into a tailspin and then corrects itself - with no hand tuning or per-kernel worries.
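The auto-removal idea can be simulated in userland to show the control flow. In the sketch below a depth counter stands in for "we are already inside the trap handler" (in a kernel this would have to be per-CPU), and a probe that fires while the counter is non-zero gets switched off instead of handled. Everything here, including the names and the choice of printk as the victim, is illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct probe {
	const char *name;
	bool enabled;
	unsigned long nested_hits;   /* how often it fired inside the handler */
};

static int trap_depth;   /* would be per-CPU state in a real kernel */

/* Called when the breakpoint for p is hit; returns true if the probe ran. */
bool probe_trap(struct probe *p, void (*handler)(void))
{
	assert(p != NULL);
	if (!p->enabled)
		return false;
	if (trap_depth > 0) {
		/* Nested trap: this probe sits on the handler's own path,
		 * so take it out of service (and log it, in real life). */
		p->enabled = false;
		p->nested_hits++;
		return false;
	}
	trap_depth++;
	if (handler != NULL)
		handler();
	trap_depth--;
	return true;
}

/* Demo: a handler that itself steps on a probed function. */
struct probe printk_probe = { "printk", true, 0 };

void noisy_handler(void)
{
	probe_trap(&printk_probe, NULL);
}
```

The first trap through noisy_handler() disables printk_probe; every later trap runs cleanly because the offending probe no longer fires. That is the "tailspin, then corrects itself" behaviour in miniature.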

$ dtrace -n fbt:::
now works real well - I've tested on 2.6.29/64bit. Next to test AS4 (the golden nirvana), and then to 32bit.

Assuming all goes well, I will put out a release.

At a later date, I hope to blog about potential new providers we can write - it will be nice to get out of the 'get it working mode' into 'adding value mode'.


Posted by Paul Fox | Permalink

Mon Apr 27 21:06:55 BST 2009

I think I found you Mr. RBP


I wrote in the previous few blogs about issues on AS4 and having trouble finding the RBP register. I was right - sort of. The act of pushing a register involves touching parts of the stack which haven't been allocated yet. I knew this because I solved the problem for 32bit kernels by jumping out of the kernel direct to user space and bypassing the last part of the kernel entry/exit points. Once we are in a trap, the word at %ESP isn't available to us because we just took an interrupt and our return address (or EFLAGS) is sitting there.

I don't know why this ever worked on 64b kernels - or why Solaris works - it shouldn't. (Well, normally there's a trap-type field which we overwrite and we get away with it; not for some traps, alas).

I converted dtrace to use single-stepping (TF mask in the CPU flags register), and this works a treat. This means the 32 and 64 bit cpu emulators can be jettisoned and the horror assembly in cpu_32bit.c can evaporate - making us a much cleaner 'citizen'.

I am just finalising the quality of experience for the various kernels, so may be a little while doing this - may have a new release if it hasnt regressed on Ubuntu kernels tonight, but otherwise may need to delay til later in the week.


Posted by Paul Fox | Permalink

Sun Apr 26 22:19:30 BST 2009

Where in the world is RSP ?


Following on from my previous blog - where is the RSP register?!

With FBT, we are in the kernel, we hit a breakpoint (INT3) trap. So thats a trap-within-a-trap.

In 2.6.9, the original syscall that got us into the kernel didnt save a few registers (RBP), but does save RSP in the task structure (%gs:0x18). But on a nested trap, we cannot overwrite the %gs:0x18 pointer.

So, when the INT3 callback is called we dont have a linkage to RSP. My previous blog wondered about where RBP is hanging out. But I think the problem is where is the RSP register from the interrupted stack frame.

Intriguing the mess 2.6.9 got into, but later kernels resolved by ensuring all callbacks had a fully populated pt_regs structure.


Posted by Paul Fox | Permalink

Sun Apr 26 20:29:16 BST 2009

The Story of PUSH %RBP


Over the last two weeks, I have been trying to get FBT to work with 2.6.9 (AS4) kernels. It's interesting doing so - what appears to work in later kernels, doesn't in 2.6.9.

I have some experimental code which works - but not reliably.

One issue is the PUSH %RBP instruction. If I do this:

$ dtrace -n fbt::sys_chdir:entry
the kernel will crash. (fbt::sys_chdir:return doesnt crash - in fact fbt:::return works fine for all functions!)

Ive been scratching my head and trying things out, and I think I understand the story now.

When a system call is executed, the user mode app switches to kernel mode and *nearly* all registers are saved. Not %RBP. RBP is usually the frame pointer - saved as the first instruction in every routine.

In the 2.6.9 kernel, a system call (SYSCALL instruction) doesnt save all registers (but does in later kernels, which is why I didnt see this in the 2.6.19+ kernels).

So, from the entry point at system_call:, we save registers (except RBP and a few others) and dispatch to the first instruction of the syscall handler, which kindly does:

PUSH %RBP
for us which is great.

Problem is, we modified that PUSH instruction into an INT3 (bkpt trap). So now, the distance from the original userland syscall is further - we now take an INT3 trap (which does save all the registers), then dive into C code to handle the notifier chain for INT3 handlers (including us, kprobes, and whoever else cares).

But now, our handler is being passed a struct pt_regs which is effectively two hops away from userland.

I don't think this matters, except, remember the original syscall didn't save %RBP? Well, that means it won't bother restoring it either. Which means that somewhere between SYSCALL -> INT3 handler, we lose/corrupt the RBP register. When dtrace cpu_64bit.c gets to emulate the PUSH RBP instruction, we are using a bogus value.

Now, this is a problem, because we dont know how we got to where we are -- using FBT on a syscall function is nice and easy and really shows the problem. Doing so from an inner C function in the kernel wont necessarily show the problem.

So, the illusion of not perturbing the C/assembler virtual machine is shattered.

What to do?

Well, now I understand it, I can hopefully fix it. kprobes doesn't worry about this (seemingly) since it uses single-stepping in the kernel to avoid this, but it too must be careful not to affect what the kernel thinks is going on. Because it steps the PUSH RBP, it will do it in the context of the caller (where RBP will be its original value).

This may mean I too need to migrate to single step rather than instruction emulation. (If I do this, the code will actually be shorter/simpler than with emulation, and I wouldnt need to worry about all the instructions we have yet to implement - which means more probes against more functions).

Its interesting how much work for a legacy operating system there is, but more important is the education that we are doing everything for a reason.

More when I have news...


Posted by Paul Fox | Permalink

Sat Apr 25 15:39:17 BST 2009

to single step or not to single step...


For AS4 with the 2.6.9 kernel, fbt works for :::return probes, but not for :::entry. Very strange as there is little difference here except for the specific instruction we hit.

FBT works in the Sun code by patching the target entry and exit points of a function with breakpoint traps. When the breakpoint is hit, they emulate the overwritten instruction.

In Linux/kprobes, they single step the overwritten instruction, just like a normal debugger. When I originally looked at kprobes I thought it was too complex and would be forever trying to debug strange scenarios. (FYI, I have written Z80, and 80186 and 80386 debuggers before, so I shouldnt be averse to doing this). Anyway, I continued with the Sun approach.

On Linux, things are much more complex than Sun, because the instruction sequences emitted by GCC are much more varied, and the handful of use cases Sun have in dtrace were no good. (Look at fbt_linux.c, cpu_32bit.c and cpu_64bit.c).

With a handful of lines of change to cpu_64bit.c, I am experimenting with CPU single step tracing to see if this is easier and smaller.

One potential issue here is deciding what to single step. If we single step the original patched location, we need to remove the INT3 breakpoint trap, which can lead (on an SMP system) to races where other cpus may skip the trap. I think kprobes works by creating an instruction buffer and single stepping that. Maybe thats a better approach.

At the moment, on AS4, it seems like the :::entry is failing for some non-dtrace reason (i.e. my understanding of what the stack looks like is probably lacking). I will continue to dig.

BTW, for those following the blog, as far as I am concerned, both 32 and 64 bit kernels are supported and working - its purely the legacy 2.6.9 kernel (64b) which I am getting to work at present.

Theres still functionality lacking, but the key ones -- syscall::: and fbt::: seem to work.

BTW#2: scripts/dt.pl is a new script to provide a simple getting started way to use dtrace, so you dont have to cobble your own D scripts together. I hope to evolve it to embed all my (and your) knowledge into it so that people dont have the learning pains I have had with dtrace.

More .. when theres something to report.


Posted by Paul Fox | Permalink

Tue Apr 21 17:23:28 BST 2009

blog.pl update


Whilst my connection is down, and before getting dirty with FBT, made some minor changes to the blogger software - mainly to fix the archives/ links which were double-barrelled, and to ensure the random pictures are on each page.

In case anyone is interested (I know you are not !), the pictures are random bad images from various walking trips. I am not a photographer and couldnt take a good photo if my life depended on it, but pictures which look "pretty" can add a little bit of life to the web site.


Posted by Paul Fox | Permalink

Tue Apr 21 16:20:08 BST 2009

2.6.9 syscalls working now


As I write this, my internet connection is down. Hopefully back in a short while. Happens very occasionally with VirginMedia (previously NTL), and usually when the weather changes (probably affecting the cabling).

Spent the last few days getting the 2.6.9 syscalls working, since the assembler code is subtly different in later kernels, and the way I had done this in systrace.c was a little too delicate. The current way is better - using pattern matching to find the code and avoid duplicating it unnecessarily, along with a little error checking to avoid falling over on the wrong kernels. (I haven't validated 32-bit 2.6.9 and won't bother unless there's a call to support that).

I've verified the 64 + 32 bit builds work across the many kernels I have (I dont have all ready at hand, and its a hog to build every 2.6 kernel if no one is really using those kernels anymore).

If the current code wont work, I may have to go further and generate code dynamically at runtime, since the glue of assembler to C that the kernel uses makes it difficult to ensure that under all kernels, compilers and flag optimisations, that things will work.

Next up is to validate the issues with FBT on 2.6.9 since I know that can crash the kernel.

Annoyingly, on the older kernel, it can take 1 minute of kernel cpu time to load the dtrace driver (or rather, for the initial pass at mapping kernel addresses to traceable functions in fbt_linux.c). It takes less than 2s on the later kernels. I assume the later kernels optimised the symtab handling functions.

Also, very annoyingly, under 2.6.9, the kernel clock is almost stopped. A google reference implicates issues with host clock and guest clocks using differing mechanisms or HZ values, which is fair enough. I may need to play with the grub boot options to fix this - it's more an annoyance than anything. (I wonder if the extra cpu time is clock related; can't think why, but then, one never knows when it comes to computers).


Posted by Paul Fox | Permalink

Sun Apr 19 22:25:35 BST 2009

execve() syscall for 2.6.9 kernel


I have a version of dtrace which works for execve() on 2.6.9 (64-bit) kernel, but its kinda ugly, and it breaks the code in systrace.c. I've put up a private build (dtrace-tmp.tar.bz2) which people should ignore. These private builds are works-in-progress for my own benefit, and should await a proper dated release.

Why is it ugly? Well, some syscalls, like execve(), want a copy of the "struct pt_regs" on the stack as an arg (not a pointer, but the actual invoking struct). This is different from the later kernels, where pt_regs *is* a pointer.
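To make the by-value versus by-pointer difference concrete, here is a small user-space sketch. The struct regs_sketch and both function names are made up for illustration; the real struct pt_regs has many more fields, and the real syscall entry is assembler glue rather than a plain C call.

```c
/* Hypothetical stand-in for struct pt_regs; the real one has many more fields. */
struct regs_sketch {
	unsigned long ip;
	unsigned long sp;
	unsigned long ax;
};

/* Old style (as described for 2.6.9): the whole struct is passed by value,
 * so the callee works on its own copy, laid out as an argument. */
unsigned long by_value_sketch(struct regs_sketch regs)
{
	regs.ax = 0;	/* only changes the callee's copy */
	return regs.ip;
}

/* Later style: the callee gets a pointer to the caller's struct. */
unsigned long by_pointer_sketch(struct regs_sketch *regs)
{
	regs->ax = 0;	/* changes the caller's struct */
	return regs->ip;
}
```

A wrapper around the by-value form has to get the entire struct into the right place for the callee, whereas a wrapper around the by-pointer form only has to pass one word along.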

This shouldnt be a problem, but the C language (even with asm()) makes it difficult to get the stack in the right place, reasonably portably.

The big area of difficulty is knowing what happened from 2.6.9 to 2.6.27 or so - at some point the calling sequence (and syscall assembler wrapper) changed, and I can only really test/validate with the kernels to hand.

I'll look at my code more to determine how to normalise/factorise it so that I dont have to have lots of code for each kernel release.

(I havent looked at 32-bit 2.6.9 kernels yet).

Strange how some things are cleaner in 32-bit kernels, and others are cleaner in 64-bit ones.


Posted by Paul Fox | Permalink

Thu Apr 16 21:00:01 BST 2009

dtrace progress for 2.6.9 (AS4 kernel)


Although AS4 is an old release of the system, its instructive to build for this platform. This was always the driving force for dtrace for linux.

Taking the more-or-less working dtrace for 2.6.27 Ubuntu and trying it on this old kernel led to a lot of head scratching, kernel hangs, and even exposed horrible bugs in VMware Server (1.0.8).

First, VMWares bugs: occasionally, when reverting a snapshot, I would find one or more of my virtual disk slices as owned by 'root', and not me. Bizarre, but have to chown them to allow my reverts to work.

Also, occasionally the kernel would power off the VM and even give rise to situations where I cannot revert a snapshot.

Worse and strange. Having done a revert, I would occasionally find my ssh login session, complete with CRiSP edit session "unreverted", i.e. I could carry on editing despite having reverted a snapshot. This is really horrible and exposes potential issues with VMware.

I don't mind - VMware has boosted my productivity enormously and avoided many long winded reboots, and I still love it.

Back to dtrace on 2.6.9: 2.6.9 is a strange world. Some kernel calls are missing, and the kernel is more delicate when the API contracts are broken, leading to panics and other strangeness. This is good - it's helping to refine the source code, helping me track some memleaks and device leaks when loading/unloading the driver, and generally giving the code a good cook-in (or kick-in).

I have some issues to resolve, e.g. FBT not quite there (hope to fix that this evening), and the odd syscalls (clone/fork/execve/etc) have different calling arrangements compared to 2.6.2*, but then, thats my fault because of the way I had to do this, but at least I know whats involved. Plus timers need to be made to work (no hrtimer's in 2.6.9).

So, for those of you who have tried/failed on 2.6.10+ kernels, these issues may explain the peculiarities.

I'm not tracking the major technical changes from one release to another - only my memory serves to help remind me that things have changed, but theres too many kernels to keep track of and easier to work with extremes - very old and very new kernels - to avoid bad coding or lack of portability.

More later.


Posted by Paul Fox | Permalink

Mon Apr 13 11:03:51 BST 2009

fbt now fixed


$ dtrace -n fbt:::

now works, after guaranteeing I am at the head of the INT3 notifier chain. The kernel API won't let us do this, but hand-manipulating the list means we are at the front. (New 20090413 release for download available).
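For anyone wondering what "hand-manipulating the list" amounts to, here is a minimal user-space sketch, not the driver code. It assumes a singly linked chain kept sorted by descending priority, which is the general shape of the kernel's notifier chains; struct nb_sketch and both function names are hypothetical, and there is no locking here at all.

```c
#include <stddef.h>

/* Hypothetical, simplified notifier block (callback, next, priority). */
struct nb_sketch {
	int (*call)(int trap);	/* returns 1 if the trap was handled */
	struct nb_sketch *next;
	int priority;
};

/* Normal registration: keep the list sorted by descending priority. */
void chain_register_sketch(struct nb_sketch **head, struct nb_sketch *n)
{
	while (*head != NULL && (*head)->priority >= n->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Hand manipulation: unlink the block if it is already on the list,
 * then put it first, regardless of its priority. */
void chain_force_head_sketch(struct nb_sketch **head, struct nb_sketch *n)
{
	struct nb_sketch **pp;

	for (pp = head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			break;
		}
	}
	n->next = *head;
	*head = n;
}
```

With ordinary registration, a block can only get ahead of others by having a higher priority; forcing the head sidesteps that ordering entirely, which is why it has to be done behind the API's back.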

Just dont turn on driver printk() tracing if you are looking at fbt as you could hit the same issue. May need to provide a mechanism to semi-toxic the functions we rely on to debug stuff in the future.

Made a submission to http://freshmeat.net for dtrace, to see if we can get more interest and pick up on the project.


Posted by Paul Fox | Permalink

Mon Apr 13 09:44:11 BST 2009

FBT and the double fault


I wrote recently about some issues if you enable all fbt probes at once:
$ dtrace -n fbt:::

This appears to work on 32-bit but on 64-bit we hit an issue - most likely due to reentrancy. The notifier chain we sit on for the INT3 traps is also used by kprobes (and some other kernel code). This means when a trap occurs, we all get called, with a chance to handle it if it is ours, or pass it on if it is not.

The problem here is "who gets called first?". kprobes wants to go first, but thats a problem since if we are tracing a function kprobes uses, then we get infinite recursion.

Actually, we dont get infinite recursion, because there is a lock in the notifier chain code, and it blocks on itself: result == hung kernel.

Ideally, we should go first, and am experimenting with that approach. We shouldnt be calling anything from probe context which in turn is being probed, else we have the reentrancy issue.
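One common way to keep a nested firing from recursing is a simple "already in a probe" flag. The sketch below is user-space C with hypothetical names, and I am not claiming it is what the driver does; a real driver would need one flag per CPU rather than a single global.

```c
/* Hypothetical state; a real driver would keep one of each per CPU. */
static int in_probe_sketch;
static int dropped_sketch;

/* Runs 'body' as the probe action unless we are already inside a probe,
 * in which case the nested firing is counted and ignored.
 * Returns 1 if the body ran, 0 if the firing was dropped. */
int probe_enter_sketch(void (*body)(void))
{
	if (in_probe_sketch) {
		dropped_sketch++;
		return 0;
	}
	in_probe_sketch = 1;
	body();
	in_probe_sketch = 0;
	return 1;
}

/* A probe action that does nothing. */
void leaf_body_sketch(void)
{
}

/* Simulates probe context calling a function which is itself probed. */
void nested_body_sketch(void)
{
	probe_enter_sketch(leaf_body_sketch);
}
```

Calling probe_enter_sketch(nested_body_sketch) runs the outer body once and drops the inner firing instead of recursing. A guard like this only helps if the guard itself is reached before anything that blocks, which is the ordering problem described above.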

One has to be careful marking certain functions as toxic, e.g. printk() (which I use for debugging) since that would preclude putting useful probes on these kinds of functions.

The alternate solution I am trying is to take over the INT3 interrupt vector and avoid some kernel code aspects. This is potentially problematic in getting the interrupt code to work so it plays nicely with other citizens (and maybe subject to kernel-isms too).

Stay tuned.

(At the same time as this, I am getting the freetype code in CRiSP to work - it works, but its not visible in the setup menus; I'll detail this more later).


Posted by Paul Fox | Permalink

Sun Apr 12 08:59:01 BST 2009

CRiSP v9.4.1


I'm going to release a new version of CRiSP this weekend, and update the version number. (People will need to get a new license - its been 18+ months since 9.3 came out).

Although the changes arent major at this point, its a good time to cut over to a new version. The major feature for this release will be FreeType font support for the edit window.

Not sure why people want this - as FreeType/TrueType font display involves blurring to create a clearer (?!) image. It does look nice at large/huge fonts.

I need to work on the other GUI controls to let them support the technology too.

Given how diverse Linux platforms are in terms of shared libs and kernels, I have had to use runtime dynamic linking to avoid a startup dependency on something not installed on your system - e.g. "ldd crisp" wont show the dependency on libXft.so and friends, but its there. (As of now, its an environment variable to turn it on, but am about to change that).
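As an illustration of the runtime dynamic linking idea (a sketch, not CRiSP's code), the usual tools on Linux are dlopen() and dlsym(). The function name optional_symbol_sketch is made up for this example.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Tries to load an optional library and look up one symbol in it.
 * Returns the symbol address, or NULL if the library or the symbol is
 * missing, in which case the caller carries on without the feature. */
void *optional_symbol_sketch(const char *lib, const char *sym)
{
	void *handle = dlopen(lib, RTLD_LAZY);

	if (handle == NULL)
		return NULL;
	/* The handle is deliberately kept open for the life of the process. */
	return dlsym(handle, sym);
}
```

A caller might try optional_symbol_sketch("libXft.so.2", "XftFontOpenName") at startup and fall back to the old font code on NULL; the ".2" soname is my assumption about a typical install. Because nothing is linked at build time, ldd shows no dependency. On older glibc versions this needs -ldl at link time.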


Posted by Paul Fox | Permalink

Tue Apr 07 23:58:17 BST 2009

iopl / Gnome desktop fixed


After locating the difference in systrace.c, the gnome desktop started flawlessly whilst running scripts/syscalls3 (which polls all syscalls, and dumps out the stats every 3 seconds). Nicely, performance was almost unnoticeably different from an undtraced startup.

Need to do more testing and repeat the exercise with all fbt probes in place.

(This is 64-bit only; need to also repeat for 32-bit Ubuntu).


Posted by Paul Fox | Permalink

Tue Apr 07 23:45:09 BST 2009

iopl kernel segmentation violation


Got it...it seems to happen when iopl() is called from a thread in a process, not in a standalone non-thread app. I have modified the test (utils/iopl.c) to reproduce this.

Next, is to fix it !


Posted by Paul Fox | Permalink

Tue Apr 07 23:14:20 BST 2009

x86-64 running 32-bit binaries


Looks like this doesnt work, or rather, syscall tracing doesnt see them, because there are two syscall tables in the kernel, and currently its not intercepting those.

Should be easy enough.

The issue I am debugging is why:

$ sudo gdm
fails if we trace the iopl() syscall - individual calls to iopl() work, but when called from the Xorg server, it causes a kernel fault. This is a nuisance, since iopl() takes a single argument, but otherwise is not that interesting. systrace.c has special assembler code because of the SYSCALL wrapper in the kernel, but other than that, it's nothing special - either it should fail always, or not at all.

But, that, is the nature of debugging - the unexpected happens, until you understand it, and then its blindingly obvious.


Posted by Paul Fox | Permalink

Sun Apr 05 21:30:48 BST 2009

more 64-bit syscall issues


Found that we have issues with sigreturn() and execve() (the exec family), so these may fail strangely (they dont seem to panic the machine, just core dump the caller).

Similar issue to the stuff I just fixed for fork and friends, so hopefully this will only take a short while to fix.


Posted by Paul Fox | Permalink

Sun Apr 05 20:50:39 BST 2009

New release of dtrace


I've hopefully fixed the 64-bit syscalls issues now, as I mentioned in the prior blog. There were some issues with execvp and friends, but looks like I forgot to do something in systrace.c, and thats now fixed.

So - we should be at parity for 32 + 64 bit dtrace.

What of the future? I need to track down an FBT issue on 64-bit, since tracing all functions seems to crash the kernel (should be easy to fix...will try later on).

I thought I might mention the impending IBM/Sun takeover - if IBM takes over Sun, then what happens to dtrace? No idea, but am hoping that since IBM is very GPL friendly, the license can change, and if that happens, then this port can become a GPL licensed derivative.

BTW, had a scare this morning on turning on the machine, my 750GB root partition had filled up. I knew I had about 300GB free, but, I found a "wget -r" of some Ubuntu kernels was running infinitely and had taken a week to download a ton of rubbish (git-links). I quickly killed the rubbish files, and have a bunch of kernels for compiling against now (fixed one portability issue by doing 'make kernels').

BTW#2, I setup a twitter account - http://www.twitter.com/crispeditor. I dont know if I will use it much - I may automate changes to the source to feed twitter, but its worth having just to learn a little bit more about this Interweb thing people keep talking about.

BTW#3: HAPPY BIRTHDAY TO DTRACE FOR LINUX !

Its just about 1 year old now !


Posted by Paul Fox | Permalink

Sat Apr 04 00:29:09 BST 2009

forking crack


Ok, I am on the home straight now. After staring at the assembler hooks for SYSCALL and the path for the fork() code, as it turns into a kernel subroutine call to sys_clone() [fork() on linux is a call to the clone() syscall], we are just about there.

After a lot of false starts, and self-confusion, I have the fork() syscalls (i.e. clone) working, without crashing the system. The bottom line here is that for a few syscalls, the way the stack is set up is different from all other calls. I believe the issue here is the complexity of handling the SYSCALL/SYSRET CPU instructions, for which there are various internet references to the difficulty in handling this, since SYSCALL jumps to kernel mode, with no stack saving and no register saving.

Because of this, the kernel goes through hoops to set up the pt_regs struct on the stack, so that the syscalls can return.

But fork() is complicated because we give birth to a new child -- and that child doesnt return the way the parent does - it is simply created and put on the scheduling queue.
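Seen from user space, the oddity is that one call returns twice. The sketch below (the function name is made up) shows that view only; it says nothing about the kernel path the child takes, which is the part that bites a wrapper.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks; the child exits immediately with 'code', the parent waits and
 * returns the child's exit status, or -1 on any failure. */
int fork_once_sketch(int code)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(code);	/* child: never returns to our caller */
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Only the parent ever comes back out of fork_once_sketch(); the child starts life part way through it. Anything wrapped around the kernel side of the call has to cope with the same split.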

I'm amazed that this ever worked on i386, but, reading the documentation on the net shows a lot of permutations for SYSCALL and SYSENTER on 32 + 64 bit chips, along with 32-bit kernels on 64-bit chips and with variations and bugs on AMD and Intel !

Next up is to tidy the code, and handle the handful of calls which are similar to fork (clone, fork, vfork, sigaltstack, and iopl).

Look out for a release this weekend.


Posted by Paul Fox | Permalink

Mon Mar 30 21:56:58 BST 2009

linux fork and the x86-64 issue


After a week of grafting over various different assembler constructs to implement a wrapper around the fork/clone syscall, I have hit a "Wow!" moment - the way this works ... won't.

Implementing a wrapper around a normal piece of C code is easy - just create a new piece of code which has the same calling sequence, no side effects, and returns the output from the function.

On x86-64, fork() is "strange": a child is created as a clone of the parent (complete with the wrapping), but then goes off into the big wide world all on its own, without coming back to Daddy (the wrapper). So the half-clothed child ends up back in user space with corrupted stack, and then core dumps.

The parent is fine - its just these wayward kids which are a problem.

So, now I understand it better, I am off to find a way to ensnare the child proc and see if I can get out of forking issues.


Posted by Paul Fox | Permalink

Sat Mar 28 22:19:46 GMT 2009

forkin hell


dtrace for x86-64 is still held up on the clone/fork syscall. Most syscalls in Unix are read-only from user space into kernel space. A few syscalls modify the incoming arguments - but fork (which is built on top of clone()) is different because of various stack/register manipulations.

On x86-64, syscalls are executed via the SYSCALL instruction which is a special optimisation on later x86 cpus compared to INT 0x80, and the kernel glue does interesting things to implement the semantics of the syscall - especially fork/clone.
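From user space, the closest you can get to the raw entry path without writing assembler is libc's syscall() function, which takes a syscall number and arguments. A minimal sketch (raw_getpid_sketch is a made-up name; which instruction libc uses underneath depends on the architecture and libc build):

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Invokes getpid directly by syscall number through libc's syscall()
 * function, bypassing the getpid() library wrapper. */
long raw_getpid_sketch(void)
{
	return syscall(SYS_getpid);
}
```

The value matches getpid(), but the call goes by number, which is the same number used to index the kernel's syscall table on the other side.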

This is causing me problems since my assembler glue doesnt quite match the Linux code (despite poring over it for the last week in excruciating detail, but, obviously not enough).

I can get fork() to nearly work - but that is nowhere near sufficient. Only perfection is acceptable.

Interestingly, fbt calls on the fork code work nicely, e.g.

$ dtrace -n fbt::sys_clone:
will work. (I could cheat and emulate the syscall, but thats cheating on a bad scale).

The assembler glue needs to work.

An interesting item in this arena is that the extra cpu registers in the 64-bit chip cause more work for the kernel, as they need to be saved, but this, in general, is offset by the power of 64-bit addressing and computation, so, most people see it as a win. (I've benchmarked my key application - CRiSP - as 10-20% faster in 64-bit mode compared to the same binary compiled in 32-bit).

More, hopefully before the 1 year birthday is up !


Posted by Paul Fox | Permalink

Sat Mar 21 23:35:54 GMT 2009

dtrace release 20090321


I have just put out the latest dtrace release which should work for all syscalls on 32 and 64 bit kernels. It wont work for all fbt functions on 64-bit kernels, but I need to work thru where the breakage is.

Theres one nasty in this code - the code makes use of pgd_offset_k() in the kernel, but the web implies this is an unsupported function, about to disappear, and so, depending on the way the kernel was built, may cause problems (inability to load). It works fine on Ubuntu 8.10 with the stock kernels, but havent managed to uncover what the replacement idiom is.


Posted by Paul Fox | Permalink

Sat Mar 21 18:52:15 GMT 2009

success - sys_call_table in *Linux* can be modified


I have spent two weeks trying to figure out how to modify the sys_call_table on x86-64 - everything I tried turned out not to work. The Linux kernel developers have stopped you from finding or writing to the sys_call_table, and I am happy with that. We are a guest of the kernel.

But their goal can never work - once we are inside the kernel, the best they can do is make life hard. dtrace is almost a root kit inside the kernel, but it's a friendly one - providing facilities to aid the people who support or own the target system. dtrace is a monitoring facility.

The Linux kernel functions - which are subject to change from one kernel to another - simply forbid a read-only page from being made read-writable if its in the ".rodata" section.

Despite direct updating of the page table entries to bypass this, I failed, continuously. I ended up reading this: http://www.intel.com/design/processor/applnots/317080.pdf to find the sentence which told me what I was doing wrong.

The experiments of the last two weeks have resulted in a bit of mess in the dtrace driver, so now I can clean that up, and should be in business.

Anyone who really cares about how this is done, can grab the code and find the relevant sections, and, even better, criticise/improve the code.

One thing that stands out about my journey through the kernel is that dtrace may not work for a paravirtualised kernel, or UML linux, but thats relatively minor - if people want to run dtrace on these kernels, we can work out what needs to be fixed.


Posted by Paul Fox | Permalink

Mon Mar 16 23:19:19 GMT 2009

rodata and the sys_call_table in the Linux kernel


The problems I have with the 64-bit dtrace are to do with the fact, and I welcome this, that in Linux, they bundle lots of data into a read-only section. This is good practice, and helps to detect bugs, stop virus writers (not really), and makes sense.

They are good and clever too - the Linux functions for modifying page table attributes preclude you from undoing this, because they check what you are trying to do and turn off the _PAGE_RW bit of the page table entry.
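The bit in question is easy to model in user space. On x86, bit 0 of a page table entry is "present" and bit 1 is read/write; the sketch below uses made-up names and only mimics the filtering idea, it is not the kernel's static_protection() code.

```c
#include <stdint.h>

/* x86 page table entry flag bits (architectural). */
#define PTE_PRESENT_SKETCH	(1ULL << 0)
#define PTE_RW_SKETCH		(1ULL << 1)	/* the kernel calls this _PAGE_RW */

/* Models a protection filter: whatever the caller asks for, entries that
 * fall in a read-only region come back with the write bit cleared. */
uint64_t filter_prot_sketch(uint64_t pte, int in_rodata)
{
	if (in_rodata)
		pte &= ~PTE_RW_SKETCH;
	return pte;
}

/* What a direct edit of the entry would do instead: set the bit. */
uint64_t force_writable_sketch(uint64_t pte)
{
	return pte | PTE_RW_SKETCH;
}
```

On real hardware, flipping the bit in memory is not the whole job: stale translations cached in the TLB have to be flushed too, and a large page can cover far more than the one table you care about.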

Again, I applaud this - this makes the challenge more interesting. There are a few places on the web and newsgroups which talk about this and how to undo this, e.g. by rebuilding the kernel. (Magic code in the function mark_rodata_ro() and the function static_protection() in pageattr.c). Again, this shows a good plan and one can see the evolution in the kernels.

Of course, that makes the 'game' more challenging, and is forcing me to learn more and understand more about why dtrace works/doesn't work. The earlier Ubuntu release must have predated these kernel changes, and now we just need to find a way around it.

Remember, we are running in kernel space, so anything goes (its akin to running in MSDOS real mode - we can do what we like). We are not malicious or a virus writer, we simply want a tool which plays nice and in a black box fashion, so that the kernel can be probed, even in the face of systemtap or kprobes.

I have tried numerous experiments - all failing, but thats because I am trying the hard way - no debugger. The issue is that in modifying the page table entry to allow read-writability, the kernel hangs. It looks like a kernel "bug" - a page fault to a read-only page causes an infinite trap loop - we keep returning to the faulting instruction.

More when I get a solution. For now - dtrace either works for you, or it doesnt (because the kernel is a later one).


Posted by Paul Fox | Permalink

Sun Mar 15 20:12:21 GMT 2009

dtrace progress for linux 64-bit .. or lack thereof


Very strange week. Having spent a lot of effort on the 32-bit version of dtrace and fixed the many silly things in the port, the 64-bit is proving problematical.

The 64-bit version works fine on Ubuntu 7, but on Ubuntu 8 (and Fedora Core 8), with a variety of kernels, it hangs.

Its not clear what/why - whether its kernel specific, GCC specific or whatever. I've taken to doing some radical surgery just to get something to print out and point at the problem area. I've redone the assembler code to tidy up potential compiler dependencies.

The same code still works on Ubuntu 7, but on Ubuntu 8, it hangs. I've been poring thru the Linux kernel code, and have some ideas of what could be doing this (e.g. the 2M page size for protected kernel data may be a part of it).

Another strange thing on the 2.6.27-7-generic kernel, is that even if I get it to not hang the system, it runs out of memory with 512MB of space. (I've installed Ubuntu 8 inside VMware, so at least I am seeing the same inside as outside of VMware, and dont have to keep crashing my main machine).

The strange thing is that it hangs on dtrace_casptr - which translates into a cmpxchg() function call (no need for inline assembler, since Linux headers provide this from the C level).
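For reference, the operation itself is tiny. Here is a user-space sketch of a pointer compare-and-swap in the spirit of dtrace_casptr(), using a GCC/Clang builtin rather than the kernel's cmpxchg(); casptr_sketch is a made-up name.

```c
#include <stddef.h>

/* If *target equals 'expected', atomically store 'desired' there.
 * Either way, return the value that was in *target beforehand. */
void *casptr_sketch(void **target, void *expected, void *desired)
{
	return __sync_val_compare_and_swap(target, expected, desired);
}
```

The caller knows the swap happened when the returned value equals 'expected'. Note that the instruction writes to memory, so if the target lives on a page the CPU considers read-only, a write fault is what you would expect.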

I will keep digging, hopefully something will point to the bad coding issue here.


Posted by Paul Fox | Permalink

Sat Mar 14 22:59:09 GMT 2009

64bit troubles


The 64-bit dtrace doesnt work on a real CPU. Works inside vmware....but first, apologies to vmware. I have enjoyed it since its inception, and have been perplexed by why it works inside but not out.

What is annoying is that most of the development of dtrace occurred in 64bit mode; I thought that porting to 32bit would be simple, and was oblivious to many issues on 32bit. So, over the last few months I have been perfecting 32-bit dtrace.

Now, I go back to 64-bit dtrace, and it works inside vmware but not outside.

Looks like something strange happens in the kernel - the kernel supports large pages -- 2M and 1G page sizes. The system call table is sitting inside a 2M page group (x86-64 has a four level page table, and a 2M page is mapped at the third level).
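To see where a 2M page sits in the walk: under x86-64 4-level paging each table is indexed by 9 bits of the virtual address, and a 2M page is mapped by a page-directory entry, so only three indices are used and the low 21 bits are the offset inside the page. A sketch with made-up names:

```c
#include <stdint.h>

/* Index fields of an x86-64 virtual address when it maps to a 2M page. */
struct va_2m_sketch {
	unsigned pml4;		/* bits 47..39 */
	unsigned pdpt;		/* bits 38..30 */
	unsigned pd;		/* bits 29..21 */
	uint32_t offset;	/* bits 20..0: offset inside the 2M page */
};

struct va_2m_sketch split_va_2m_sketch(uint64_t va)
{
	struct va_2m_sketch r;

	r.pml4 = (unsigned)((va >> 39) & 0x1ff);
	r.pdpt = (unsigned)((va >> 30) & 0x1ff);
	r.pd = (unsigned)((va >> 21) & 0x1ff);
	r.offset = (uint32_t)(va & 0x1fffff);
	return r;
}
```

The practical upshot is that one entry controls the protection of a whole 2M of address space, so making "just the syscall table" writable is not a single small-page operation.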

So, the hacks I did for 32-bit (not strictly hacks), may or may not work for 64-bit Linux - because I was taking a shortcut, and it turns out bits of the kernel are protected by a 2M page which is marked read-only.

Now I know what/where the issue is, I can go and see if I can fix it.

Annoying, because I have had to repeatedly crash my master machine to test this out, but am hopefully close.

More when I have a fix, but, for now, dont use the 64-bit dtrace. The 32-bit dtrace seems to be fine tho.


Posted by Paul Fox | Permalink

Thu Mar 12 23:35:18 GMT 2009

64bit dtrace -- still not quite there.


I am a little sick of vmware - my workhorse. It fails to emulate a real cpu enough that when I try native dtrace/64 - the machine locks up. I thought it was a gcc issue - thats partially true, but I need to debug these issues. (It loads, dtrace -l works, but any probe placements are dying in dtrace_casptr; if I comment that out, I dont panic my machine).

I am trying out the latest release of VirtualBox to see if I am better placed to virtualise...lets see what happens.


Posted by Paul Fox | Permalink

Wed Mar 11 23:31:38 GMT 2009

64-bit linux dtrace working (I hope)


I've fixed the compiler dependent assembler stuff in dtrace_asm.c so it appears to work in my vmware session. Am about to test on my non-vmware 64-bit hardware, so give it a try...but be careful!

Posted by Paul Fox | Permalink

Tue Mar 10 21:30:39 GMT 2009

i give up..fedora..goodbye


I liked Fedora - I liked the desktop.

I absolutely loathe and detest YUM - the package updater which is broken beyond belief. So I am downloading Ubuntu 8.10/64 to give me a decent platform.

Nothing has installed properly from Fedora. Even downloading the FC10 release was going to take hours, but Ubuntu is a single CD which can bootstrap me, and then I can start using apt-get -- which is what I have been using on my two laptops and the vmware sessions to help debug dtrace.

I hate to give up on fedora - diversity is what counts, to check out configs and other software development tools, but it hurts that my main machine no longer even has GCC on it. I removed the package hoping to be able to rebootstrap myself, but yum refuses to play. Did I say I hate it?

Pirut - and python. <laughs /> No thanks

I want my software to have as few dependencies as possible, not take 10+ minutes to tell me that everything is broke.

I need to get my gcc working again so I can prove the dtrace works for 64-bit, but I fear I may be in a walled garden where "it just works on Ubuntu, and nothing else". I know what I did wrong in the __asm__ sections of dtrace - I just need a 64-bit machine and compiler to validate my changes.

So, tonights episode is loading ubuntu on top of fedora and see how much I learn to cry :-)

For now, dont touch 64-bit dtrace until I next write up that all is safe. You will very likely panic your system. (32-bit should be good).


Posted by Paul Fox | Permalink

Mon Mar 09 23:13:54 GMT 2009

64-bit oops....


The 64-bit dtrace is broken, or rather, whether it crashes your kernel or not may depend on the version of gcc you use. The __asm__ constructs dont handle the frame pointer which is setup.

I have fixed (some of) them in my release, but am having difficulty upgrading my Redhat FC8 real machine with vmware to validate my testing.

My yum/pirut package updater is broken and building gcc from source requires things which wont build properly, despite upgrading my binutils.

Oh well, need to upgrade stuff as this is an old system now and I hate my main line system not having the tools to get the job done. (If need be I'll revamp with an FC10 at the weekend).


Posted by Paul Fox | Permalink

Sun Mar 08 11:54:33 GMT 2009

thats better...but strange


dtrace -n fbt:::
dtrace: description 'fbt:::' matched 25753 probes
...

This is on a real CPU - seems to work now.

Theres been lots of issues and discoveries to get here in the last day or so.

Firstly, vmware seems to have failed me: fbt::: would work fine on my vmware session, but fail on the real cpu. Some of this may be due to kernel or compiler (vmware running 2.6.28.5, real cpu on 2.6.28.2) but I don't think it's kernel related - I don't believe that many differences in the areas I am looking at have happened.

It could be compiler and/or kernel options: GCC and Linux attempt to inline lots of stuff.

Here is what I found: despite initially thinking I should have 42,000+ probes, its down to 25,753, as above. Some of this is that many functions are in non-".text" sections. The kernel makes liberal use of attributes to put functions in the following sections:

.init
.init.text
.fini.text

and others. So I added code to restrict functions only in the ".text" section. This means we may miss the chance to probe some module exit code, but this isn't really much of a loss.

Additionally, because .init sections can be jettisoned after module loading, I found numerous cases where two or more functions sat at the same address. This is bad: if FBT latches on to this, we destroy the hash table for FBT and cannot patch/unpatch when dtrace is executed. I added code to detect/disallow two or more probes at the same address. (This isnt an issue for the vmlinux kernel, but is for modules).
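The two checks described above (".text" only, and no two probes at one address) can be sketched in user space like this. The struct and function names are hypothetical, and the linear scan stands in for whatever lookup the driver really uses.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol record; field names are made up for this sketch. */
struct sym_sketch {
	const char *name;
	const char *section;
	unsigned long addr;
};

/* Copies into 'out' the symbols that live in ".text" and whose address has
 * not been accepted already. Returns the number accepted. The scan is
 * quadratic for clarity; a hash keyed by address would scale better. */
size_t pick_probes_sketch(const struct sym_sketch *in, size_t n,
			  struct sym_sketch *out)
{
	size_t i, j, kept = 0;

	for (i = 0; i < n; i++) {
		if (strcmp(in[i].section, ".text") != 0)
			continue;
		for (j = 0; j < kept; j++)
			if (out[j].addr == in[i].addr)
				break;
		if (j == kept)
			out[kept++] = in[i];
	}
	return kept;
}
```

The first symbol seen at an address wins and later ones are skipped, so the table that maps a patched address back to its probe never holds two entries for the same address.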

Another sanity check is for functions on the notifier chains used by dtrace itself. A probe on one of those would cause infinite trapping/recursion/reboot.

Finally, the __mutex_ functions are put in the toxic.c file, since the mutex functions will call those and dtrace needs these to work. We could inline our own implementation of mutex_lock/mutex_unlock, but we can live without this for now.
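A toxic list boils down to a name check before a probe is created. The sketch below is user-space C with made-up table names; the real toxic.c may be organised quite differently, and the entries here are just examples taken from these posts.

```c
#include <stddef.h>
#include <string.h>

#define COUNT_SKETCH(a) (sizeof(a) / sizeof((a)[0]))

/* Exact names that must never be probed (example entry only). */
static const char *const toxic_names_sketch[] = { "do_int3" };
/* Prefixes: any function whose name starts with one of these is skipped. */
static const char *const toxic_prefixes_sketch[] = { "__mutex_" };

/* Returns 1 if a function must not be probed, else 0. */
int is_toxic_sketch(const char *name)
{
	size_t i;

	for (i = 0; i < COUNT_SKETCH(toxic_names_sketch); i++)
		if (strcmp(name, toxic_names_sketch[i]) == 0)
			return 1;
	for (i = 0; i < COUNT_SKETCH(toxic_prefixes_sketch); i++) {
		const char *p = toxic_prefixes_sketch[i];

		if (strncmp(name, p, strlen(p)) == 0)
			return 1;
	}
	return 0;
}
```

The prefix test is what lets one entry cover the whole __mutex_ family while leaving mutex_lock() itself probeable.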

I had to reboot my notebook numerous times to get here, and this seems to be stable. I will run it more and see what happens. Theres still a chance that slowpaths or exceptional paths thru some critical functions could cause a kernel crash, so be careful.

TODO items: add more instruction emulations, revisit 64-bit support, and fix the process monitoring code (dtrace -c)


Posted by Paul Fox | Permalink

Sat Mar 07 09:21:26 GMT 2009

Ooops


I just wrote that 'dtrace -n fbt::*:' works. I tried it on my other (real cpu) machine, and it didnt. It definitely works for many/most functions, but some will cause problems - panics or reboots. (This is a machine showing 43,000+ available fbt probes).

More when I think this is resolved.


Posted by Paul Fox | Permalink

Sat Mar 07 09:04:17 GMT 2009

26,000+ probes


I finally managed to allow:
$ dtrace -n fbt::*:

to work without crashing the kernel. Of the many functions in the kernel, some cannot be probed (not too many), since dtrace is relying on them. Some, like do_int3(), are part of the assembler glue that handles a breakpoint/trace trap.

Others were causing problems because my dtrace, unlike Solaris', calls kmem_alloc() from probe context. I have put in a temporary workaround for this by doing a static allocation in par_setup_thread(). The problem is that in Solaris, we have callbacks for process creation and/or the proc structure contains extra fields for dtrace. In Linux dtrace, we don't touch the kernel code, so we need to shadow each process created (or, probed), and this requires a little extra work.
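The general idea of shadowing processes without allocating in probe context can be sketched as a fixed, statically allocated table keyed by pid. This is user-space C with hypothetical names and no locking; it is not the code in par_setup_thread().

```c
#include <stddef.h>

#define SHADOW_MAX_SKETCH 4	/* tiny on purpose; a real table would be larger */

/* Hypothetical per-process shadow record. */
struct shadow_sketch {
	int pid;		/* 0 means the slot is free */
	unsigned long hits;
};

/* Allocated statically, so nothing is allocated in probe context. */
static struct shadow_sketch shadow_tab_sketch[SHADOW_MAX_SKETCH];

/* Finds the record for 'pid' (which must be > 0), claiming a free slot on
 * first sight. Returns NULL when the table is full instead of allocating. */
struct shadow_sketch *shadow_get_sketch(int pid)
{
	struct shadow_sketch *free_slot = NULL;
	size_t i;

	for (i = 0; i < SHADOW_MAX_SKETCH; i++) {
		if (shadow_tab_sketch[i].pid == pid)
			return &shadow_tab_sketch[i];
		if (shadow_tab_sketch[i].pid == 0 && free_slot == NULL)
			free_slot = &shadow_tab_sketch[i];
	}
	if (free_slot != NULL)
		free_slot->pid = pid;
	return free_slot;
}
```

The trade-off is the usual one for static tables: no allocator calls where they are unsafe, at the price of a hard limit and the need to reclaim slots when processes exit.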

This works now. I haven't heavily torture tested it with the final pieces in place, but now I can.

Theres still some instruction emulations not implemented yet, but these form a very small part of the kernel, e.g. a few thousand functions are unprobable until I implement those instructions.

I'll look to adding more emulations, and then switch back to 64-bit dtrace as that too is missing quite a few emulations.

Once again, I implore the community to give this a try. I know of at least one successfully documented user seeing dtrace actually work, and I am hoping from hereon in, it is reliable/stable. (Famous last words...I still consider this an alpha or pre-alpha release).


Posted by Paul Fox | Permalink

Tue Mar 03 15:54:14 GMT 2009

timer probing issue fixed


$ dtrace -n fbt::nr_active:
dtrace: description 'fbt::nr_active:' matched 2 probes

CPU     ID                    FUNCTION:NAME
  0   1005                  nr_active:entry
  0   1006                 nr_active:return
...

This was previously locking the machine up, but is now fixed. In dtrace_probe(), we call dtrace_gethrtime() which eventually calls the code to read the system clock. This is protected by a lock to avoid fluctuation as we read the multiple words of a clock.

We inline some of the code, to bypass the kernel lock (but we will suffer very occasional clock drift until this is fixed).
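To show why skipping the lock trades safety for the occasional wrong reading, here is a user-space model. It assumes the lock is a sequence counter of the kind often used for clocks (the post does not say which kind of lock it is), and all names are made up. There are no memory barriers, so treat it as illustration only.

```c
#include <stdint.h>

/* Hypothetical two-word clock guarded by a sequence counter: a writer makes
 * seq odd before updating the words and even again afterwards. */
struct clock_sketch {
	volatile unsigned seq;
	volatile uint32_t hi;
	volatile uint32_t lo;
};

/* Careful read: retry until both words were read under one even seq.
 * This waits for a writer, which is what must not happen in probe context. */
uint64_t clock_read_careful_sketch(const struct clock_sketch *c)
{
	unsigned s;
	uint32_t hi, lo;

	do {
		s = c->seq;
		hi = c->hi;
		lo = c->lo;
	} while ((s & 1) || s != c->seq);
	return ((uint64_t)hi << 32) | lo;
}

/* Shortcut read: never waits, so it can observe a half-updated value. */
uint64_t clock_read_fast_sketch(const struct clock_sketch *c)
{
	return ((uint64_t)c->hi << 32) | c->lo;
}
```

If a probe fires while the writer is mid-update, the careful reader spins forever on the same CPU, whereas the shortcut returns immediately with a value that may mix old and new words.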

Now I can proceed getting to the point where:

$ dtrace -n fbt:::
works.

Stay tuned.


Posted by Paul Fox | Permalink

Tue Mar 03 10:37:15 GMT 2009

one down....


The problem of not running on a real cpu for FBT probes was simply the same old problem I had for syscalls: the kernel is write protected for an i386 kernel. So we just make the kernel writeable as we set the probes, and now it works perfectly (well, as good as the vmware version does).

This still leaves one major bad-ism -- some FBT probes crash the kernel - hard. I thought it was due to the derivatives of timer interrupts, but thats not true; I can place probes on some timer interrupt code, and this works, but others do not. E.g. nr_active() is a crashing probe, but the callers to this function do not cause a problem.

Lets see what a bit of mind-thinking brings. (Some bugs seem to only be worked out by thinking about the code paths the kernel, cpu and dtrace take; sometimes, no amount of printing will help, especially when we go blind in some areas and just hard lockup the kernel).


Posted by Paul Fox | Permalink

Mon Mar 02 22:58:51 GMT 2009

problems..problems


As I try to ramp up the kernel probes (as distinct to the module probes), I have found an issue: it looks like any probe called from timer interrupt context will hose the kernel.

Why?

Most likely reason is the short-cut code necessary to handle the INT3 traps. This can be seen by planting a probe on nr_active() which is called from the timer interrupt to recompute the loadavg. Hopefully I can work out how to exit the probe interrupt handler properly, and this will avoid a need to blacklist lots of kernel functions.

Am also still seeing differences between a real CPU and VMware. On a real cpu, if we place lots of probes, we stand a chance of getting a timer interrupt whilst laying out the probes, which causes a kernel GPF. It's possible this is related to the above - i.e. not doing the correct job in the first place, so hopefully one will give insight into the other.

Syscalls work and many many fbt kernel/module probes work - even under heavy load, but some don't - so don't try and lay out everything just yet...


Posted by Paul Fox | Permalink

Sun Mar 01 21:01:05 GMT 2009

dtrace - the next phase


Now that dtrace is working nicely - various stress tests are working well, e.g. "dtrace -n fbt:::" - it's time to go one step further.

Here is what today's release can do:

/home/fox/src/dtrace@vmub32: load
Syncing...
Loading: build/driver/dtracedrv.ko
Preparing symbols...
Probes available: 23075

Note that figure - we just moved from 5000+ probes in the modules, to now include everything in the kernel.

Here's an example:

/home/fox/src/dtrace@vmub32: more /tmp/probes.current
   ID   PROVIDER            MODULE                          FUNCTION NAME
    1     dtrace                                                     BEGIN
    2     dtrace                                                     END
    3     dtrace                                                     ERROR
    4        fbt            kernel                                   entry
    5        fbt            kernel                                   entry
    6        fbt            kernel                                   entry
    7        fbt            kernel                                   entry
    8        fbt            kernel                                   entry
    9        fbt            kernel                                   entry
   10        fbt            kernel                         init_post entry
   11        fbt            kernel                     name_to_dev_t entry
   12        fbt            kernel                     name_to_dev_t return
   13        fbt            kernel                   calibrate_delay entry
   14        fbt            kernel                   calibrate_delay return
   15        fbt            kernel                    dump_task_regs entry
   16        fbt            kernel                    dump_task_regs return
   17        fbt            kernel               select_idle_routine entry
...
/home/fox/src/dtrace@vmub32: dtrace -n fbt:kernel:generic_file_mmap:
dtrace: description 'fbt:kernel:generic_file_mmap:' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0   3222          generic_file_mmap:entry
  0   3223         generic_file_mmap:return
  0   3222          generic_file_mmap:entry
  0   3223         generic_file_mmap:return
  0   3222          generic_file_mmap:entry

Don't try and do:

$ dtrace -n fbt:::

as I now need to strip from the kernel symbol table the functions which dtrace itself relies on, since probing those would cause re-entrancy problems (my system rebooted itself when I did that!)


Posted by Paul Fox | Permalink

Sun Mar 01 12:59:35 GMT 2009

probes...and more probes


My Ubuntu vmware session seems to have about 5000 probes available in fbt. I upgraded to 2.6.28.5 kernel last night to try and figure out the crash on the real non-vmware machine (see prior blog).

My real Ubuntu has 16000+ probes available.

But the real machine seems to be unstable at more than 8000 probes. The error strikes if you ctrl-c dtrace - a GPF occurs somewhere during the close code.

On my vmware machine, I have modprobe'd all available drivers, and can get to nearly 14000 probes (this is a 2.3GHz dual core cpu vs 1.2GHz for the real machine).

Looks like something in locking or interrupt disabling may be causing the instability.

The good thing is it does work well, but just not reliably enough for a real machine where you care about crashing or locking up the driver.

More when I know what the deal is here.

/home/fox/src/dtrace@vmub32: dtrace -n fbt:::
dtrace: description 'fbt:::' matched 13752 probes
CPU     ID                    FUNCTION:NAME
  0  11858                   epcapoll:entry
  0  11859                  epcapoll:return
  0   8587               ia_led_timer:entry
  0  12474               ipmi_timeout:entry
  0  12475              ipmi_timeout:return
  0  11858                   epcapoll:entry
  ...

Posted by Paul Fox | Permalink

Sun Mar 01 00:31:05 GMT 2009

idiot. me.


dtrace was looking so good, yet it actually failed on a real cpu again.

Spent a bit of time to trace this down to my bad changes to the interrupt enable/disable code, not working well with gcc.

This is fixed and so, dtrace should work much better now.

Still more to do to add probes, but at least the basics should work well again.


Posted by Paul Fox | Permalink

Sat Feb 28 08:45:35 GMT 2009

dtrace bug of the day


dtrace is working well, but just found a serious bug which can hang your system. On running dtrace with just a few probes on my real (non-vmware) system with 1GB RAM, from inside Gnome desktop, I get this kernel message:
[ 4336.530339] vmap allocation failed: use vmalloc= to increase size.
[ 4336.640674] vmap allocation failed: use vmalloc= to increase size.
[ 4336.758243] vmap allocation failed: use vmalloc= to increase size.
[ 4336.843121] BUG: unable to handle kernel paging request at 0020027e
[ 4336.843139] IP: [] 0xe354db6d
[ 4336.843151] *pde = 00000000
[ 4336.843158] Oops: 0002 [#1] SMP

At this point, dtrace cannot be killed, although the rest of the system is fine (I am typing this logged in to another machine). Looks like I need to research memory allocation more on Linux.

In a previous blog I had mentioned how use of kmalloc() would cause frantic heavy I/O as the freemem got fragmented. Use of vmalloc() solved that problem, but we may be sensitive to how much contiguous kernel RAM is available.


Posted by Paul Fox | Permalink

Fri Feb 27 20:19:13 GMT 2009

dtrace torture


I haven't released anything this week (until now), because I have been trying to find out why my methodical FBT debugging was causing random lockups.

There are missing function probes in FBT, but the 5000+ available in my vmware Ubuntu (as seen via 'dtrace -l'), would cause the machine to panic or hang. I tried various techniques to track down which function was causing the problem, but no specific one was.

This work was made much more difficult since when we die in the interrupt handler for INT3, we have very few ways to communicate with the outside world (without putting in more effort; it can be done).

After a lot of experimentation I tracked it down to badness on my behalf in dtrace_interrupt_enable()/dtrace_interrupt_disable().

When dtrace_probe() is called, it can happen from interrupt context - this is how FBT works, by planting INT3 instructions at the probe points. What was happening was nested interrupts - very rare - but within seconds to minutes of heavy probing.

After studying the trivial routines in dtrace_asm.c (which I had already had to rewrite once), I realised the silliness. The silliness is that these two routines allow nested disable/enable - but my implementation didn't do that, and so we prematurely enabled interrupts inside dtrace_probe() and the INT3 trap handler.
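
To make the nesting point concrete, here is a minimal user-space sketch of the save-and-restore idea: disable hands back the previous state, and enable restores that state instead of blindly turning interrupts on. The names and the boolean stand-in for the CPU interrupt flag are made up for illustration; the real routines are a few lines of assembler.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the CPU interrupt flag; real code would use pushf/cli/popf. */
static bool irq_enabled = true;

typedef bool irq_cookie_t;   /* hypothetical cookie type: the prior state */

/* Disable interrupts and hand back what the state was beforehand. */
static irq_cookie_t sketch_interrupt_disable(void)
{
    irq_cookie_t was_enabled = irq_enabled;
    irq_enabled = false;
    return was_enabled;
}

/* Restore the saved state rather than unconditionally enabling. */
static void sketch_interrupt_enable(irq_cookie_t cookie)
{
    assert(!irq_enabled);    /* we must still be inside a disabled region */
    irq_enabled = cookie;
}
```

With this shape, an inner disable/enable pair called while interrupts are already off leaves them off; only the outermost enable turns them back on. An implementation that always enables on the way out is exactly the premature-enable bug described above.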

A slight complication is that even though these routines are a few lines of trivial assembler, they depend on the GCC calling conventions as specified by the kernel build parameters. Fortunately, GCC's __asm__() allows us to not have to worry where calling parameters are (in the stack or in a register).

What does this mean?

The following example illustrates how to profile the functions in the kernel. Here, I did a kernel build (2.6.28.5). The kernel had been built already, so the make just spent a bit of time working out that nothing needed rebuilding (but it did do a relink).

/home/fox/src/dtrace@vmub32: dtrace -n 'fbt:::entry{ @[probefunc] = count(); }'
dtrace: description 'fbt:::entry' matched 2593 probes
....
  ext3_reserve_inode_write                                     288067
  __ext3_get_inode_loc                                         313572
  walk_page_buffers                                            368757
  __ext3_journal_stop                                          384168
  ext3_journal_start_sb                                        384168
  journal_start                                                384168
  journal_stop                                                 384168
  __ext3_journal_get_write_access                              470892
  journal_get_write_access                                     470892
  __ext3_journal_dirty_metadata                                557924
  journal_dirty_metadata                                       557924
  do_get_write_access                                          559167
  journal_cancel_revoke                                        559270
  ext3_htree_store_dirent                                      614816
  call_filldir                                                 617133
  ext3fs_dirhash                                               630767
  str2hashbuf                                                  655621
  journal_add_journal_head                                     747785
  ext3_check_dir_entry                                         754274
  journal_put_journal_head                                     786996

Posted by Paul Fox | Permalink

Wed Feb 25 21:46:22 GMT 2009

dtrace torture


I made good progress up until a few days ago. The FBT code probes the kernel for functions you can trace. You can trace as much or as little as you like. The dtrace probe syntax is great.

Inside fbt, we disassemble each function to find the instructions to put breakpoints on. Because GCC emits different code to Solaris CC, we have to do more work. Each possible instruction in a probe target must be emulated. Solaris worries about 5-6 instructions, but we must worry about nearly the entire x86 instruction set. (It's fun, and I am sad, so it's no big deal).

cpu_emulate.c has this code in.

What I am doing is picking off the majority and working through them. Many instructions aren't mapped (yet) - which is fine - they just become non-probe-able functions. (BTW FBT doesn't handle non-module code, i.e. the kernel itself, yet, because I didn't know what I was doing at the time, and I am grateful I never wrote the code, since I have to solve the "how to trace printk, whilst using printk inside dtrace to debug itself!" problem).

Anyway, I took the output of 'dtrace -l' and made a script to try probing each function individually, to find where dtrace is broken.

This worked well - you can really torture Linux with dtrace.

However, *something* is wrong. I enable 1300+ probes, and it can run for ages, and die (no panic, no message - just a hang) randomly. I tried bisecting the probe list to find the culprit, but the problem goes away.

My fear is that either we have something like a double interrupt - a rare and non-deterministic event, or the end-of-interrupt code is missing something (it is missing something, but I haven't proved to myself if it matters, as garnered by looking at Linux's normal end of interrupt code where it needs to know if it's returning to kernel mode or user mode, and restore IRQ triggers).

Alas, I am debugging blind, which doesn't help.

I need kgdb to help, but after my last kernel build with kgdb, Ubuntu refused to recognize the root filesystem.

I *do feel* it's random interrupt 'edginess' rather than bad code in cpu_emulate.c. If it is bad code, it should be deterministic and I can track down which probe is the faulty one.

In any case, for mild dtrace attacks, it works. Just don't use it on a kernel you care about :-)


Posted by Paul Fox | Permalink

Sun Feb 22 13:11:17 GMT 2009

fixed syscall tracing


I mentioned in the previous note that syscall tracing could do strange things (core dump apps) if we tried to trace them all, showing something was wrong for at least one call.

I tracked it down to sys_clone - not copying enough of the stack. This works now, and you can do this:

 dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'
and other variants, showing the execname and/or pid, etc.

The beauty of this one liner is that you can see the whole system in terms of who is doing what, and with a few small variations, you can get more info. The output is exactly as you would want to home in on a candidate process, e.g. sorted by frequency.

This is exactly what dtrace is all about.

Next up is to track down more fbt issues.


Posted by Paul Fox | Permalink

Sun Feb 22 11:33:34 GMT 2009

fixes for syscall tracing


Found out that not all syscalls were safe, e.g. sigreturn() needs to patch the return register set. The call to the underlying syscall wasn't faking the stack properly, and this is fixed (for i386).

This now works (nearly):

/home/fox/src/dtrace@vmub32: dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 321 probes


  clone                                                             2
  munmap                                                            2
  pipe                                                              2
  setpgid                                                           2
  socketcall                                                        2
  sigreturn                                                         3
  fstat64                                                           4
  waitpid                                                           4
  stat64                                                            5
  mmap2                                                             6
  writev                                                            8
  close                                                            11
  _newselect                                                       15
  open                                                             18
  gettimeofday                                                     22
  poll                                                             27
  rt_sigaction                                                     54
  rt_sigprocmask                                                   74
  ioctl                                                            81
  write                                                           141
  read                                                            172
  time                                                           1414
  futex                                                          1424
  clock_gettime                                                  7047

I say 'nearly', because whilst it was running, some apps were core dumping, so obviously it's not 100% correct, but at least the kernel didn't panic or complain about nastiness.


Posted by Paul Fox | Permalink

Sat Feb 21 21:40:37 GMT 2009

memory allocation issue resolved (hopefully)


Found that I should call vmalloc() instead of kmalloc() for big allocations. I've set it up to use that if we ask for more than 100K of memory, and that looks much better.
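
The decision itself is tiny. Here is a sketch of it, written as plain C so it can be tried outside the kernel. The enum and function names are invented, and I am assuming "100K" means 100 * 1024 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* The "100K" cut-over from the text; 100 * 1024 is an assumption. */
#define BIG_ALLOC_THRESHOLD (100 * 1024)

enum alloc_kind { USE_KMALLOC, USE_VMALLOC };

/* Small requests stay with kmalloc(); anything larger goes to vmalloc(). */
static enum alloc_kind pick_allocator(size_t size)
{
    assert(size > 0);
    return size > BIG_ALLOC_THRESHOLD ? USE_VMALLOC : USE_KMALLOC;
}
```

The free side has to make the same choice, since memory from vmalloc() must go back through vfree() and memory from kmalloc() through kfree(), so whoever frees a block needs to know its size or which allocator produced it.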

Let's see how it goes.


Posted by Paul Fox | Permalink

Sat Feb 21 19:16:28 GMT 2009

linux kernel memory allocation


I reported the other day that after running dtrace on my 2GB RAM machine, the box started paging like crazy - so much so, I had to reboot.

I tried again today and I fear Linux kernel memory allocation is broken, again. dtrace uses the kmalloc() calls to allocate memory. It allocates quite a lot of memory in tiny pieces for things like the FBT and other probes, but when you run dtrace for real, it allocates some big buffers to store the probe-time records. dtrace must preallocate this space because it's not safe to allocate at probe time.

dtrace seems to try around 2-4MB per buffer, but needs one per CPU. kmalloc nearly has a heart attack when it sees this - such large contiguous requests are painful once memory is fragmented, and I suspect it then goes crazy trying to swap everything out in an attempt to 'defrag' memory.

All malloc() libraries have this problem sooner or later.

It makes me wonder if Sun tuned dtrace for big RAM machines, e.g. 16+GB RAM. (I need to look at the default).

I have temporarily modified the memory allocator to avoid allocating anything bigger than 260K, and this works well - no frantic paging (even after dtrace is removed from the kernel).

The next step is to look at the buffer allocation code and see if we can allocate a (scattered) array of pages, which would avoid any such fragmentation problems.
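
Here is roughly what I have in mind by a scattered array of pages, as a user-space sketch using calloc() in place of the kernel page allocator. The type and function names are invented and the 4096-byte page size is an assumption. The point is only that no single allocation is bigger than one page, and a flat offset is split into a page index plus an offset within that page.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

/* A buffer built from separately allocated pages (hypothetical type). */
struct page_buf {
    size_t npages;
    unsigned char **pages;
};

/* Allocate enough pages to cover 'size' bytes; returns 0 on success. */
static int page_buf_init(struct page_buf *b, size_t size)
{
    size_t i;

    b->npages = (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    b->pages = calloc(b->npages, sizeof(*b->pages));
    if (b->pages == NULL)
        return -1;
    for (i = 0; i < b->npages; i++) {
        b->pages[i] = calloc(1, SKETCH_PAGE_SIZE);
        if (b->pages[i] == NULL) {
            /* Undo the pages allocated so far before giving up. */
            while (i-- > 0)
                free(b->pages[i]);
            free(b->pages);
            return -1;
        }
    }
    return 0;
}

/* Translate a flat byte offset into a pointer within the right page. */
static unsigned char *page_buf_at(struct page_buf *b, size_t off)
{
    assert(off / SKETCH_PAGE_SIZE < b->npages);
    return &b->pages[off / SKETCH_PAGE_SIZE][off % SKETCH_PAGE_SIZE];
}
```

The cost is that a record can no longer be assumed to sit in one contiguous run of bytes, so any code that copies a record into the buffer would have to cope with it straddling a page boundary.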

There's still more work to do in FBT instruction sequences; I am slowly working my way through the 32-bit list (we only trap about 4000 of 40,000 available functions at present). I noticed the 64bit kernel has a similar issue - namely, that GCC generates more distinct assembler, and on my ubuntu, where there are around 60,000+ functions available, we only allow trapping on about 2,000. So, hopefully this will be easy to address, simply going through piece by piece.

(FYI, tests/mkprobeall is useful to iterate over every probe to validate safety).


Posted by Paul Fox | Permalink

Fri Feb 20 21:13:37 GMT 2009

dtrace and mail


Sorry, but if people want to correspond with me, it's going to be plain text or html or both, but not mime-only encoding. My emailer is CRiSP - it's homebrew and is virus resistant because I won't run anything that will autoexecute foreign code or plugins - Windows or Linux. I don't want to spend my life patching other people's security holes.

I will likely attract more spam than I do already because I freely publicise my name, email and web site, and that's fine - I can learn something from the spam, and I can recognize it without opening files.

Sorry to be a pedant !


Posted by Paul Fox | Permalink

Thu Feb 19 23:34:31 GMT 2009

dtrace progress


Things are looking good in dtrace - it seems to work as I try out more test cases. However, there are quirks and some missing stuff, which is slowly being worked on.

The quirks are annoying - maybe not repeatable. E.g. I crashed my laptop tonight; after a reboot dtrace and the same test was fine. (Maybe kmalloc memory pressure causes an issue?) The heavy paging I experienced last night hasn't happened. (Maybe a fight with vmware?)

Added a special case handler for "SUB $12,ESP".

The goal is that just about everything in /proc/kallsyms is suitable for FBT probing. (49869 on Fedora Core 8; 58699 in my Ubuntu laptop). Presently only 3000+ entries are probe-able because many are disallowed until safe instruction emulations are in place.
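
Each line of /proc/kallsyms is an address, a one-letter symbol type and a name (module symbols carry a trailing module name, which this ignores). As a small sketch of pulling out the candidates, with invented struct and function names:

```c
#include <assert.h>
#include <stdio.h>

/* One parsed /proc/kallsyms entry (module suffix, if any, is ignored). */
struct ksym {
    unsigned long addr;
    char type;
    char name[128];
};

/* Parse "address type name"; returns 1 on success, 0 on a malformed line. */
static int parse_kallsyms_line(const char *line, struct ksym *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%lx %c %127s", &out->addr, &out->type, out->name) == 3;
}

/* Text-section symbols ('t' or 'T') are the candidates for function probes. */
static int is_text_symbol(const struct ksym *s)
{
    return s->type == 't' || s->type == 'T';
}
```

Feeding it a line such as "c01a2b30 T ext3_mkdir" (the address is made up) gives a text symbol named ext3_mkdir. Whether that function is then actually probe-able still depends on its first instruction having a safe emulation.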

So, we are looking at quality and polish, and the user community trying it out, feeding back.

Reliability is an issue: if I am doing anything bad, I will be caught out, and only after a lot of hard work to find out where that problem area lies.

I left dtrace doing something boring (intercepting mkdir() syscalls all night). It was fine.

But it's torture and stress testing that matters.

I need at some point to go back to 64-bit testing/debugging. At present I am just doing 32-bit testing.


Posted by Paul Fox | Permalink

Wed Feb 18 23:03:25 GMT 2009

dtrace .. more fixes .. strange VM behavior


What I thought I had fixed yesterday, I hadn't..quite.

The issue about /bin/ld and binutils is true - it didn't like what dtrace generates for a USDT application, but I only applied a fix for 32-bit ELF files, and not also for 64-bit ones.

This is fixed, and hopefully "make all" should now run to completion.

I loaded dtrace into a non-VMware machine - my main 2GB RAM laptop (32bit Ubuntu). It worked well - I could trace failed open() system calls (scripts/nonexist.d), but what is strange is that when Ctrl-C-ing dtrace, the kernel went funny forcibly evicting every page from the RAM cache.

Maybe I am using a bogus kmalloc/kfree call which is triggering that. (My vmware sessions tend to only have 256MB of RAM, so it's not so noticeable that this may have occurred).

Certainly makes the machine unusable until the page flushing has completed.


Posted by Paul Fox | Permalink

Tue Feb 17 20:17:43 GMT 2009

dtrace build failures


I've been compiling dtrace on various platforms, but on AS4 and some user reported platforms there are build errors.

An annoying error for me to track down is why the USDT userland application fails to build. If you do another make, progress is made and the driver is then built.

There seems to be an issue with an invalid ELF file being generated. Hopefully I can track it down tonight -- it could be a problem with versions of the development kit (/bin/ld, gcc, libelf etc) or maybe a sillyism in the porting of the code.

To those that are trying - the usdt piece doesn't matter, so you can skip over it (e.g. 'make -i'), but hopefully I will get this done.

I am also adding a bug report script to help on some of these - not sure if it's that useful, but it will avoid delays in getting consistent info from people about platforms and bit-ness.


Posted by Paul Fox | Permalink

Mon Feb 16 23:22:30 GMT 2009

more dtrace asm


There are now 3 cpu instructions emulated in assembler. This means we have access to hundreds or maybe thousands of probes. Three is a relative number - there are lots of instructions possible in the first part of a function using FBT, but this is a good start/cross-section.

I have a little script in the tests directory:

$ cat tests/mkprobeall
# Simple script to try out each probe in turn so we can 
# know where we may die
grep fbt.*entry /tmp/probes.current | 
while read id provider module func startstop
do
	echo $func $startstop
	dtrace -n "fbt::$func:entry  { @[probefunc] = count(); }
		profile:::tick-1sec { exit(0); }
		"
done
which is useful to try each function in turn to see where we crash, so I can start doing more work on the others.

If I am lucky, in the future I may be able to create decent regressions on these instruction sequences. (The latest - MOV $nnn,%ESP - needs some fixing since it may not cope with a small stack frame - the registers may overwrite themselves, but my initial test was a $0x48 size frame, so this works -- ext3_free_inode()).

To date, I am only doing i386 rather than x86-64, since this is where the most interest is.


Posted by Paul Fox | Permalink

Sun Feb 15 10:56:22 GMT 2009

more dtrace and fbt


Just in case... I wrote yesterday that we had reached a milestone with FBT. It's not complete - it will still cause a GPF, but the major aspect of how it works is resolved.

In Linux and GCC, we have a more difficult life than Solaris with their C compiler. This means more instruction sequences need to be patched and emulated. It works for ext3_mkdir:entry, and I am just getting ext3_mkdir:return to work, and then I can hopefully blitz lots more functions.

Interestingly Solaris uses the LOCK prefix to intercept instructions for the i386, but we are going to use INT3, because LOCK is very dangerous: we tripped up on the following instruction byte. I don't know why they did it that way, but I kind of get the feeling that Solaris dtrace just-about works, and a change in their compiler technology could cause a fair bit of work to get it working again. (I haven't verified what Apple do, since they use GCC but they are restricting themselves to 64-bit only kernels so they may have gotten away with it too).

Again, will update here if I feel the dtrace is usable on an i386 for FBT.


Posted by Paul Fox | Permalink

Sat Feb 14 23:50:52 GMT 2009

dtrace and fbt and i386


Gaaaak! Took a week, but it appears to work now - or at least the first part works. Here's the code that I spent all week poring over:
                __asm(
                        "mov %0, %%esp\n"

                        "pop %%ebx\n"
                        "pop %%ecx\n"
                        "pop %%edx\n"
                        "pop %%esi\n"
                        "pop %%edi\n"
                        "pop %%ebp\n"
                        "pop %%eax\n"
                        "pop %%ds\n"

                        "pop %%es\n"
                        "pop %%fs\n"

                        "push %%eax\n"
                        "mov 8(%%esp), %%eax\n"  // EIP
                        "inc %%eax\n"
                        "mov %%eax,4(%%esp)\n"
                        "mov 12(%%esp),%%eax\n"  // CS
                        "mov %%eax,8(%%esp)\n"
                        "mov 16(%%esp),%%eax\n"  // Flags
                        "mov %%eax,12(%%esp)\n"
                        "mov %%ebp,16(%%esp)\n"  // emulated push EBP
                        "pop %%eax\n"
                        "iret\n"
                        :
                        : "a" (regs)
                        );

I've put a comment in the code (cpu_emulate.c) to describe what's going on.

This has been interesting and frustrating. Half of the code is lifted from opensolaris - or - heavily borrowed. I had written most of the above in plain C, and did all sorts of silly things and experiments to figure out where I was going wrong and none of them worked. Almost certainly, I was close at all times, but this stuff requires more accuracy than normal C code, since we cannot single step the assembler.

What's almost hilariously funny is that I don't use kgdb. Mainly because I don't have time to set it up - but I should. I relented. I grabbed the latest linux kernel - 2.6.28.5, and built it for kgdb use, but my kernel failed to boot. So I quickly gave up.

Exactly at that point, I spotted the fatal flaw in my emulation code, and doing a FBT on ext3_mkdir works - when a new directory is created, I can see the calls fire, and I *DON'T* panic the kernel.

Now, hopefully I can clean up FBT for i386 and reverify 64bit kernel works.

This is it: almost functionally complete. I have more tidyups to do and I need to ensure we cannot probe a dtrace function, and then we are done, with cosmetics left to do.


Posted by Paul Fox | Permalink

Thu Feb 12 22:04:40 GMT 2009

dtrace stuff


My inbox is starting to bloat with dtrace inquiries. I will try to respond when I can to some people, but I may need to setup a mailing list or something to try and offset the issues/support/query load.

Someone has set up a dtrace git repository based on my nightly tarballs - this is great stuff. Thanks Pete.

http://github.com/pmccormick/dtrace-for-linux/tree/master

This could be great news - people contributing in areas I don't have time for at the moment. (dtrace is a part time hobby and it's interfering with other things I need to get on with).

Just some background here: people should download and build dtrace. The more kernels and distros the better. Then changes can be made so it can compile better everywhere. I have a few kernels, but not every kernel and not every distro, so the makefile will need some fine tuning to fix these issues.

At some point in the future I may release dtrace v1.00, but until then, treat all my hype as exactly that. When I say it 'works' it means it works better than it did yesterday for *me*. It may not work for you - or it may break - or it may cause a plague of penguins to visit you.

I remain optimistic (else I would give up). You can remain watching and monitoring and/or helping out, but I need to get the featureset complete first before I can think of what comes next.

As always, if it's quiet, I am fighting a battle with bugs.


Posted by Paul Fox | Permalink

Wed Feb 11 21:34:18 GMT 2009

dtrace and the curse of the interrupt


I was making good progress over the weekend but slowed down this week: 15-20 lines of assembler are the root cause - the backend of a fbt interrupt when a trace trap is hit.

The problem is that to emulate the intercepted instruction we cannot 'return' from the C code back to the caller because we need to mangle the stack.

On Solaris, code on the backend of an interrupt does this, but we aren't patching Linux, remember, so we have to pretend we are running under MS-DOS, do the nasty manipulations ourselves and never return, but emulate what would have happened if we had returned (pop the regs, iret).

The problem is that diagnosing what I am doing wrong is tricky (no more printk()'s to trace what happened!). I could call in the troops, e.g. kgdb, but that's too easy :-)

Nope, trial and error so I understand backwards what's going on. (I am surprised the 64bit dtrace works at all in this area, because it shouldn't; I'll diagnose why it works or why it might fail when I have finished the 32bit version).

Mostly in the kernel we deal with a struct pt_regs structure but this structure lies about what's really happening on an invalid opcode trap (and many other traps).

Fortunately, VMware's 'revert to snapshot' is a real time saver: about 3 seconds between a kernel panic and being back in action to try the next test.

More when I know more!


Posted by Paul Fox | Permalink

Sun Feb 08 11:29:53 GMT 2009

Dtrace - fbt under i386 + Purify


FBT for a 32-bit mode kernel won't work at present. It works under 64-bit kernels, but the key issue is that the existing Sun supplied dtrace code doesn't understand GCC or Linux.

A normal C function has an entry prologue and an exit epilogue (e.g. PUSHL %EBP on entry). This may be true for the Sun compiler with no __inline assembler code, but it is not true for Linux. Many Linux functions in the kernel do not start with that prologue. In addition, use of -mregparm=3 when compiling the kernel means there are more permutations.

When FBT patches a function's entry prologue, it notes what the opcode underneath the patch was and then emulates it (since we cannot easily single step in kernel mode).
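
The bookkeeping side of that can be sketched in a few lines of C, operating on an ordinary byte array rather than live kernel text. The struct and function names are invented; only the 0xCC opcode for INT3 is a hard fact.

```c
#include <assert.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC   /* x86 one-byte breakpoint instruction */

/* Hypothetical per-probe record: where we patched and what was there. */
struct sketch_probe {
    uint8_t *addr;
    uint8_t  saved_opcode;
};

/* Remember the original first byte, then overwrite it with INT3. */
static void sketch_probe_enable(struct sketch_probe *p, uint8_t *addr)
{
    assert(*addr != INT3_OPCODE);   /* refuse to patch twice */
    p->addr = addr;
    p->saved_opcode = *addr;
    *addr = INT3_OPCODE;
}

/* Put the original byte back when the probe is switched off. */
static void sketch_probe_disable(struct sketch_probe *p)
{
    *p->addr = p->saved_opcode;
}
```

When the trap fires, the handler looks up the record for the faulting address and uses saved_opcode to decide which instruction it has to emulate before resuming.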

Sun's dtrace supports 5 scenarios -- common to 32/64 bit, but a real kernel has tens of permutations.

I am modifying the code to support the required permutations.

I mention Purify in the title. Why?

Many many years ago (about 10 years), I started writing a Purify emulator (prior to valgrind being available). To avoid the key 1992 patent that Purify uses, I wrote an i386 CPU emulator. It nearly worked - I gave up, because debugging the exceptions was proving time consuming - but I did manage to run a crisp binary under it at one point (albeit at a 200x performance hit, this on a Pentium-233 processor if my memory holds true).

But, that old code is useful because I can lift some of the instruction emulations out for dtrace, so hopefully it won't take too long to get probes in.

In the prior releases of dtrace, it showed many probes under i386, but it shouldn't have: if you enabled FBT tracing, you would panic the kernel, because I allowed through unsafe translations. This is fixed - it will only allow safe translations through, but I have to treat each new instruction as a new case, to be emulated, so, as these get done, 'dtrace -l' will show many more fbt probes.


Posted by Paul Fox | Permalink

Sat Feb 07 18:57:52 GMT 2009

Dtrace progress 20090207


Been working on the FBT provider issue: it doesn't work, which is not surprising. The fbt code was one of the first bits of code to get compiled, and 'work' but I had to quickly move on to other bits of dtrace to flesh out the workings.

Now that I am returning to it, life is much easier. The existing driver, along with kernel source, and OpenSolaris code is like having a google-map to wade around, comparing/understanding pieces of code.

Today I understood how fbt works - when a trap is set in kernel space, it works by planting a breakpoint in the kernel code. The trap interrupt happens, but rather than what would normally happen in user space, where we unpatch the code, single step, then plant the breakpoint back in, instead we have to emulate the patched over instruction, which is not too difficult since there's only a handful of potential instructions we can patch (push %ebp,%esp; leave/ret, nop).

Interestingly, this code is not in the Dtrace code, but in the interrupt trap handler.

Linux - especially the later kernels - provides good notifier call chain mechanisms to intercept the traps; I don't have to patch the kernel code - I can just exist as a 'nice' citizen.

So - this bit nearly works - hopefully get this to work later.

A problem is that dtrace makes it easy to place a probe on every function in the kernel in one line. I need to detect which external dependencies we have (like printk()) and avoid us trying to call a patched function from inside dtrace, else we will have a recursion issue or a double fault trap and panic the kernel. On the other hand, the ability to patch each function is good - because it's easy to prove the exercise is finished.
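One way to picture the dependency check is a simple exclusion list consulted before a probe is created. This is only a sketch under assumptions: the function name fbt_can_instrument() and the list contents (apart from printk, mentioned above) are made up for illustration, and a real list would have to be derived from what the probe path actually calls.

```c
#include <stddef.h>
#include <string.h>

/*
 * Illustrative list of functions the probe-handling path might call.
 * Only printk comes from the text; the other entries are placeholders.
 */
static const char *const probe_unsafe[] = {
    "printk",
    "vprintk",
    "panic",
};

/* Returns 1 if it looks safe to plant a probe on 'name', 0 otherwise. */
int fbt_can_instrument(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof probe_unsafe / sizeof probe_unsafe[0]; i++) {
        if (strcmp(name, probe_unsafe[i]) == 0)
            return 0;   /* dtrace would recurse into its own probe */
    }
    return 1;
}
```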


Posted by Paul Fox | Permalink

Fri Feb 06 22:54:07 GMT 2009

Dtrace progress 20090205


I've put out a new release of dtrace. Progress has been pretty good this week, but that's *my* opinion. A certain dtrace author[1] has had troubles with the release, and I need to resolve that.

I've revamped the directory layout slightly to allow a single src tree to build against all valid kernels on the machine (as found in /lib/modules, but this can be overridden to use any tree you have on the system).

I had spent some effort seeing whether I could compile for RedHat AS4 (2.6.9 kernel), and a few others too, and the kernel differences, although annoying, showed me where I needed to clean up. This includes things like the pt_regs register layouts, which are now set up in <sys/privregs.h> to handle 64+32 bit and differing kernel releases.

Builds now go to the build-2.X.Y directory with build/ being a symlink to the default running kernel.

The bottom line is that I can at least validate compilation issues across kernels before putting out a release, even if I am not yet doing any sanity checking across all kernels.

A number of issues were fixed in USDT (I have a crisp binary with a USDT probe in, which is useful, since it was triggering kernel issues when I had forgotten the probes were even present). Also, the issues with a real 32-bit cpu are fixed.

I'm now looking at FBT - which fails because it's relying on something which I haven't yet plugged together. I am just researching whether this is an interrupt trap issue or bad parsing and placement of the patch instruction. (The kernel panics with an unimplemented instruction if you use fbt.)


[1] The dtrace author will remain nameless, unless he's happy for me to use his name.

Posted by Paul Fox | Permalink

Tue Feb 03 13:44:22 GMT 2009

dtrace progress 32-bit cpus


Weird! Dtrace works differently in a VMware session vs on a real cpu. I always thought it would be difficult to tell if you were in a VM, but dtrace found a chink.

Linux protects the syscall table (read-only). I put in code to let us patch it so that syscall tracing would work. This worked fine under a VMware guest OS, but on a real x86 CPU, the kernel would trap the fault.

After a lot of printk()ing and resolving it to very basic function calls, I ended up with a solution which is good - direct manipulation of the page table entries (followed by a TLB cache flush!)

32-bit dtrace can now trace syscalls again, on real hardware as well as under VMware.

Of course, there are more bugs for me to fix and some other unfinished business, but it's rapidly taking final shape.

New release today with these fixes in.


Posted by Paul Fox | Permalink

Sat Jan 31 00:23:24 GMT 2009

Dtrace - thats all forks ! (just kidding)


A sillyism in the code, and now USDT is working beautifully - the target app no longer core dumps after the first trap.

What does this mean?

It means the dtrace experiment is over. It works on Linux.

Yes, there are cleanups to do and some missing code to handle forks and garbage collection of shadow procs and stuff and stuff.

But we have now exercised pretty much all of the code, and I need to do some more USDT exercises (like strings and stack dumps; I need to re-research the ruby stuff on Adam's blog).

Of course, the code is very kernel specific - the kernel changes often from one release to the next, and it may be possible to get smarter about handling forwards/backwards compatibility.

At some point in the future, I want to write D scripts for Linux and not be debugging dtrace. Time will tell what we can do, and there are lots of existing D scripts to learn from.

I'll continue writing up progress on dtrace, and hopefully more people can try it out and report back on kernel build issues.


Posted by Paul Fox | Permalink

Thu Jan 29 23:29:27 GMT 2009

dtrace progress - USDT works almost


After some head scratching, here's an example of USDT on Linux:
$ dtrace -n :::saw-line
dtrace: description ':::saw-line' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0   1899                    main:saw-line

This example is taken from the simple-c example app bundled in the distribution. At this point, the target app died with a SIGTRAP since I haven't finished testing.

What made this work?

The code I have ported (my fault) confuses Sun's 'regs' array with Linux's 'pt_regs' array. I've done some mappings so we get the correct interrupt level context, but had to comment out a few references to unsupported registers on Linux (eg %gs, %fs, etc). I assume the references are needed in probe context for D apps that want them.

Shame that the target user space binary died, but now hopefully I can make even more progress, e.g. for the other trap types (which I don't fully understand yet, but then, I am being thick).

I'll release new code with these changes in.


Posted by Paul Fox | Permalink

Mon Jan 26 20:54:40 GMT 2009

Linux is depressing...Very much so


I find Linux depressing. It really is depressing. The whole lot is out of control, all in the interests of kindness.

I upgraded one laptop last week to Ubuntu 8.10. Worked like a dream.

Until I tried to suspend the laptop. Sometimes it wouldn't recover, and most of the time the stupid, very very stupid NetworkManager ... didn't. After some research, I found a kit to replace this.

But, still the same issues ... not working reliably after suspending. And definitely not reconnecting to the WPA wifi.

I can't believe how horrible and complicated this has become. In the old days there were just config files and /etc/rc.d files to worry about. Now there are unnamed daemons controlling everything with layers of too much complexity.

I updated the kernel to 2.6.28.2. I lost my sound. I lose my sound every time I build a stock kernel, and I still don't know why or where it goes.

Then there is my master server - Fedora Core 8. I tried to upgrade that to 9 or 10, and now it, too, won't restore (non-WIFI) ethernet on start, without me going to the other room to poke it with a sharp and hot stick ('ifconfig eth0 down; ifconfig eth0 up; route add default gw ....')

Honestly, I am ready to retire and give up on this.

At this rate, Windows 7 will be my prime operating system, or I am going to live in a cave where people don't upgrade things that weren't broken.

Of course, it's all my fault. I thought it was time to get 'real'. Silly me.


Posted by Paul Fox | Permalink

Thu Jan 22 22:33:17 GMT 2009

CRiSP - Unified Linux binary 9.3.6a


Job done: From now on, only two Linux releases of CRiSP will be produced - linux-x86_32 and linux-x86_64.

This is presently being produced on a Fedora Core (FC8) box running glibc2.7, but runs on AS2.1, AS3, AS4, Ubuntu 7/8 (and probably 9+ as well).

Having tracked down what was causing such dependencies, and the arms-race to keep up with distros, I found that really only a couple of things caused this. Strangely, the ancient <ctype.h> functions (isalpha(), isdigit(), etc) were the biggest nuisance, since later glibcs use GCC smarts and libc versioning to disallow a new binary running on an earlier release.

I wrote a tool to patch the ELF section headers to remove the enforced GLIBC dependencies, and it works.

(I wasted a lot of time, because my FC8 box got updated and the X11 libs/headers moved around and I thought my unification was triggering the bizarre errors I was getting).

(Along the way, my dynamic IP address also changed, which I only found out when I tried to download from my own site).

And to make matters worse, one of my laptops suffered an "I am going to update apt-get and break your system badly". This was an old Knoppix release - which was nice since it was an old GLIBC release, but an upgrade pushed me into glibc2.7 territory, violating the first rule of software development: don't update things because it seems like a good idea. It isn't :-)

So, now that laptop gets the Ubuntu treatment. I'm feeling a lot happier with apt-get, and now two systems (plus 1 vmware) are all on Ubuntu, with the master being RedHat (which is now a black sheep because it has no easy upgrade without the pain of a big download or Ubuntu). Still, diversity is good.

What else would I do if I didn't have to fight silly issues?! Dtrace maybe ....

Yes, but now I need to fix fcterm - the terminal emulator - which added to the chores above by core dumping whilst running gdb (a UTF-8 issue).


Posted by Paul Fox | Permalink

Sun Jan 18 21:54:03 GMT 2009

Inches away .. dtrace progress 20090118


I can now intercept application INT3 breakpoint traps, and pass them into dtrace. It's not quite right yet (and, if you load dtrace into your kernel, it will presently break gdb and single step / breakpoints), but I hope to fix that.

So now, we can have a USDT app tell the kernel it has probes, have /usr/bin/dtrace monitor the probes, have the app hit INT3 to jump into the kernel, and the next bit is to have the dtrace engine talk back to the application.

I peeked at FreeBSD again, only to find all this is commented out over there, so we are ahead in this area compared to FreeBSD. Next is to work out some details in dtrace_user_probe(), and just use it for a bit.


Posted by Paul Fox | Permalink

Sun Jan 18 21:43:44 GMT 2009

CRiSP and a universal Linux binary


For years - since the day CRiSP for Linux was built, I have been plagued with Linux ABI binary portability, meaning that CRiSP has had to be built for every combination of glibc (and now, 32+64 bit) platforms.

Why? Because, if you run a later crisp on an earlier system, the binaries will refuse to run, complaining about glibc mismatches.

This drives me nuts. For years I had been meaning to see what the cause was, and I was surprised. Very surprised how the glibc maintainers could do this.

No other platform - Windows, Mac, or any other Unix - has this problem. (Well, Mac can be nearly as bad, but definitely not Windows, or any SVR4/BSD derivative - to my knowledge).

Take the standard C library header <ctype.h>. It's existed since practically day 1 of the C language, providing useful functions like isalpha(), isdigit(), etc. Did you realise that this family can cause binary compatibility problems? Well, it does. Somewhere in glibc 2.[567] they made these functions Unicode-aware and robust against obscure cases (e.g., isalpha(EOF) should not cause an array bounds indexing violation). So, the simple #defines or array lookups of old are replaced with calls to a function in libc.so, which may not exist in older libc.so's. Yuk. This isn't an option that you turn on because you want it, and it's almost undocumented.

So, one of the most trivial functions in libc.so is being replaced by a private implementation.
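For illustration only, this is the kind of private, ASCII-only replacement an application could carry so that it never references the libc classification tables at all. The my_ names are invented, and this is not necessarily what CRiSP does; it deliberately ignores locales, which is only acceptable if the application handles non-ASCII text by other means.

```c
/*
 * ASCII-only character classifiers. They are plain comparisons, so the
 * compiled code contains no call into libc and therefore no versioned
 * symbol reference. Any int is accepted, including EOF (-1).
 */
static inline int my_isdigit(int c)
{
    return c >= '0' && c <= '9';
}

static inline int my_isalpha(int c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static inline int my_isalnum(int c)
{
    return my_isalpha(c) || my_isdigit(c);
}
```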

pthreads is another issue - I am aware that at some point in the past, the size of structures for pthreads changed, and this caused portability issues for apps. Instead of hiding this in the implementation, they use versioning of symbols.

GCC 4.x supports functions for detecting buffer overflows and stack frame smashing, and this can be turned on by default. If you compile with -D_FORTIFY_SOURCE=0, then these API compatibility issues are removed. (I am not advising others to do that; I test my apps with valgrind and my own builtin memory corruption detector).

I had to do lots of stuff to find this out, e.g.

objdump -T binary | grep GLIBC

Will tell some of the story.

objdump -p binary | grep VER
will tell the rest of the story. The definitions for VERNEED, VERNEEDNUM and VERSYM stop a later binary running on an old system. When I have finished writing a tool to strip this out of a binary, then I can run a glibc2.7 application on an AS2.1 (glibc 2.3 or glibc2.2).
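As a rough sketch of the first step such a tool needs, the function below walks an in-memory copy of a 64-bit .dynamic section and counts the versioning entries that 'objdump -p' reports. It assumes the Linux <elf.h> header; the function name is made up, and locating the section in the file and actually rewriting it are left out.

```c
#include <elf.h>
#include <stddef.h>

/*
 * Count the symbol-versioning tags in an array of dynamic entries.
 * Scanning stops at DT_NULL, which terminates a .dynamic section,
 * or after 'n' entries, whichever comes first.
 */
size_t count_version_tags(const Elf64_Dyn *dyn, size_t n)
{
    size_t i;
    size_t hits = 0;

    for (i = 0; i < n && dyn[i].d_tag != DT_NULL; i++) {
        switch (dyn[i].d_tag) {
        case DT_VERNEED:        /* address of the version-needed table */
        case DT_VERNEEDNUM:     /* number of entries in that table */
        case DT_VERSYM:         /* per-symbol version index table */
            hits++;
            break;
        default:
            break;
        }
    }
    return hits;
}
```

A count of zero would suggest the loader has no version requirements to enforce for that binary.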

I will then be able to build just two Linux releases: 32 and 64 bit, and use my latest development system to create a binary compatible release.

I have to say that doing this means the onus is on me to work around why such symbol versioning occurs, but it's a nuisance.

I have lots of vmware and systems running a variety of Linux releases, but it's an annoyance to have customers tell me that Ubuntu 8.10 isn't supported, even though I use it myself (for dtrace work).


Posted by Paul Fox | Permalink

Sat Jan 17 22:52:51 GMT 2009

dtrace for OpenBSD ?


Just reading on the openbsd mailing list about ZFS for OpenBSD, and someone saying wouldn't dtrace be better. Was wondering about that comment. Yes, porting dtrace to OpenBSD should be easier than for Linux, given that OpenBSD and FreeBSD are close relatives in the BSD family. I don't know the relative maturity of one vs the other, although I think FreeBSD has a bigger user base, but, in theory, it is doable.

Would I do it? Maybe, if someone asked. But before then ... Linux needs to get a little bit further forward.

With regard to Linux dtrace, I have a piece of glue to place on the interrupt vector which handles a user-space breakpoint trap. I can see the code in Solaris, and now need to work out the best place to put this in Linux, and that should handle the full cycle from user-to-kernel-to-user-to-kernel which is needed for USDT. Let me see how I can get on with this, and then some cleanups can start to happen....


Posted by Paul Fox | Permalink

Fri Jan 16 23:39:02 GMT 2009

dtrace on windows


I wonder if that grabs your attention :-)

I was wondering if it was doable/viable/workable. To be honest, I don't see why not.

I am not proposing to attempt this (not unless I am really bored and Linux dtrace is 'finished').

But technically, most of the dtrace code is just plain-ol-C. There are bits to hook into the kernel and userspace, but the dtrace code is so modular and segregated that the Unix specific pieces are actually relatively small.

For anyone who has tackled Windows device drivers (and they are not that difficult, although they operate in a more complex way than Unix), it should be doable.

There are more layers in Windows (core kernel, ntdll.dll, win32, user, gdi, ...), but the fundamentals of reading/writing memory are what is crucial.

Of course, Windows doesn't support ELF, and I would hate to run a 'dtrace -l' inside a CMD.EXE window.


Posted by Paul Fox | Permalink

Fri Jan 16 20:29:21 GMT 2009

CRiSP and Large Files


Just wanted to take a detour away from dtrace for a moment. I rarely comment or write on CRiSP, even though it is a mature baby.

Someone asked me about editing/viewing large files in CRiSP. I thought I would crib some of the mail I sent.

Here's a question: What is the largest file you could edit on a 16-bit machine? 32-bit? 64-bit? (CRiSP has survived these CPU architectural changes over the years).

The answer is the same for all: how big is your hard drive. Naive coding would lead to just loading the file into memory and hence you would be limited to the size of RAM and addressability of the CPU. This has never been a good thing: if you spend all your time in the same editor for small files, you almost certainly want to use that tool for large files too, e.g. >4GB files.

The largest file I tried to test in CRiSP is around 16 GB. I didn't go much further (this was a 32-bit cpu), because it got boring waiting for the file to page in via the O/S, but it works.

Of course, you can find a weak spot in this: just try taking a huge file and do a search and replace of every character in the file. CRiSP will attempt to save the undo information and you will wait a long time for the I/O. At least CRiSP tries - and tries to be efficient.

So, the answer to the question is: How long do you want to wait?

CRiSP can support almost infinitely large files (up to the size of your hard disk or filesystem), but what you do next will really depend.

It's worth reiterating this point: what matters is whether your tool of choice can survive being pushed to extremes, and whether its performance degrades linearly, exponentially, or catastrophically. That is an interesting topic for technically interested people. Maybe not for everyone.


Posted by Paul Fox | Permalink

Thu Jan 15 23:29:19 GMT 2009

dtrace progress 20090115


Some degree of success! We can now run a USDT enabled process, run dtrace on the probes of that process, and I can see the INT 0x3 (0xcc) instruction being written to the probe points of that proc. The kernel writes a breakpoint instruction with the goal of /usr/bin/dtrace monitoring the child for SIGTRAP signals. (And, presumably, to fire the callback for the process .. not sure what happens next).

I know the kernel isn't logging the triggered probe (or maybe my example simple.c is too simple!)

Alas, the proc falls over when it hits the SIGTRAP, since the ptrace parent isn't doing the right thing.

To see this happen, I modified simple.c to checksum its own code (very simple hack) and could see the checksum change, immediately followed by the SIGTRAP abort.
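The checksum trick is easy to reproduce. Below is a minimal sketch of such a checksum over a byte range; the rotate-and-xor scheme and the function name are my choice for illustration, not necessarily what the modified simple.c used. In the real hack the range would be the program's own code; here it is just a buffer.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Cheap checksum: rotate the running sum left by one bit, then xor in the
 * next byte. Changing any single byte in the range changes the result, so a
 * breakpoint (0xcc) written over an instruction byte shows up immediately.
 */
uint32_t code_checksum(const uint8_t *start, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < len; i++)
        sum = ((sum << 1) | (sum >> 31)) ^ start[i];
    return sum;
}
```

Calling this periodically and printing the value whenever it differs from the first reading is enough to see the moment the probe point gets patched.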

Next step is to get /usr/bin/dtrace to trace the child properly. Let's see what happens.

As always, latest code on my dtrace download site.


Posted by Paul Fox | Permalink

Tue Jan 13 23:14:39 GMT 2009

STUPID STUPID ME ! dtrace progress


Found it! After days/weeks of perusing source code, trying to understand the PID provider and fasttrap code, and pulling (what little) hair I have out, I found it.

When a user space app registers itself as a provider, it would not show up in 'dtrace -l'. Why?

Because I am stupid and missed the blindingly obvious.

Fasttrap.c has a limit on how many user space providers can be created - to avoid crashing or DoSing the kernel. But I forgot (or rather, didn't realise) that the variable was not set. (In Sun land, they read the attributes from kernel config variables, but I had commented that out).

Stupid me! Now I can see the provider. Here's an example:
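The failure mode is worth spelling out, because it is silent. The sketch below uses invented names (provider_table, provider_admit) rather than the real fasttrap.c variables: a limit that was never initialised is zero, so every registration is refused and nothing ever shows up in the listing.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping; not the actual fasttrap.c data structures. */
struct provider_table {
    unsigned max;     /* configured ceiling; 0 if nobody ever set it */
    unsigned total;   /* how many have been admitted so far */
};

/* Admit one more user space provider unless the ceiling has been reached. */
bool provider_admit(struct provider_table *t)
{
    if (t->total >= t->max)
        return false;   /* with max == 0 this rejects the very first one */
    t->total++;
    return true;
}
```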

/home/fox/src/dtrace/drivers/dtrace@vmubuntu: dtrace -l | tail
 1859        fbt              fuse                         fuse_iget entry
 1860        fbt              fuse                         fuse_iget return
 1861        fbt              fuse                  fuse_set_nowrite entry
 1862        fbt              fuse                  fuse_set_nowrite return
 1863        fbt              fuse                   fuse_abort_conn entry
 1864        fbt              fuse                   fuse_abort_conn return
 1865        fbt              fuse             fuse_flush_writepages entry
 1866        fbt              fuse             fuse_flush_writepages return
 1867 simple5555          simple-c                              main saw-line
 1868 simple5555          simple-c                              main saw-word

Now, hopefully I can make some real progress.


Posted by Paul Fox | Permalink

Sun Jan 11 16:44:27 GMT 2009

dtrace progress


Progress is slow at the moment. In the continuing battle to get USDT to work, I am reaching some roadblocks.

The 'easy' part was getting core dtrace into the kernel - wherever something was wrong, I would crash the kernel, so, I could track down where it broke and work backwards.

With USDT it's slightly different. After getting a userland binary to have probes in it, it runs and tells the kernel that it has probes. Kernel trace messages show the probe exists, yet 'dtrace -l' doesn't list the probe. (I am using MacOS to compare what *should* happen with what *does* happen on Linux). I am obviously missing something here.

It's a bit of chicken-and-egg trying to work out the flaw, e.g. it could be the userspace implementation not being complete, or it could be a silliness in the kernel, or even something I have forgotten to do.

Interestingly, when running a USDT app, it declares the probes, and you can see them (eg on the Mac) with 'dtrace -l'.

You can run in two ways: run the app on its own, and attach to the probe with dtrace, or, do both together, launch dtrace to fire the app and monitor the probes.

Interestingly, on the Mac, gcc seems to have some enhancements to allow the inline probe declarations to work. Statically disassembling the binary and disassembling whilst the app is running shows the kernel correctly putting in "INT 3" instructions into the userspace code area.

It's possible on Linux that dtrace is too divorced from the real kernel, or I just had something stubbed out.

I also hit a problem with "dtrace -c ..." in Linux. I don't know if this is a pthreads issue or a Linux issue, but Linux doesn't allow ptrace(PTRACE_CONT) to be executed from a child thread, when the child target process is forked() from the main thread. In Linux, the target proc and the controlling thread are like siblings instead of parent-child. (I solved this temporarily by moving fork/exec creation to the monitoring thread, but it's still a bit flaky).

I am spending a lot of time statically reviewing the dtrace code to work out where the problem is. I can find lots of code I want to be executed to handle USDT, but, am missing a vital cog to make it hang together...


Posted by Paul Fox | Permalink

Sat Jan 10 09:43:47 GMT 2009

dtrace for freebsd 7.1


FreeBSD 7.1 came out this week to a mild amount of fanfare. That's a good thing. It's great that people spend a lot of effort on distros for themselves and their own communities.

I grabbed the distro and the source to see what had changed in dtrace. It looks like "not a lot" from the source snapshots I had earlier in 2008.

Alas, disappointingly, USDT dtrace doesn't work. (I couldn't get dtrace to work at all in FreeBSD from the stock download for x86-64; I guess I need to rebuild the kernel).

Searching the web reveals user land tracing is not complete. This is a shame, because I have been using the FreeBSD model of implementation for Linux. I have had a hard time, because it looks like there are subtle things wrong/broken in FreeBSD/USDT tracing (e.g. the way a process is launched and ptrace() is used to attach to the process is missing some key lines of code).

I have spent the last week poring over the subtleties of what FreeBSD does, along with Sun and Apple. I should be able to get this bit to work, however I am not sure about other aspects of the tracing, such as aborting or skipping over syscalls. (The ptrace() syscall is simply not as powerful as Sun's /proc interface).

I know most of the ELF code works for symtab lookups, so I should be able to make some new progress. I'll update the blog and put out a new source tarball when I feel happy with what I have.


Posted by Paul Fox | Permalink

Mon Jan 05 18:59:48 GMT 2009

dtrace progress 20090105


As always, things have been slow, but they sped up over the last few days. (I've been ill with flu over Xmas, which didn't help; every thought of dtrace made my head explode!)

First, the /proc/$$/ctl driver sort-of-nearly-almost-but-doesnt work. It hooks into the kernel and can respond to calls, but there's a problem/difficulty: I haven't figured out how to simply intercept syscall entry/exit on a per process/thread basis, without lots of kernel hacking or a brute force patch on entry/exit to the syscall handlers. This would be against the ideal of dtrace having a zero-impact approach to monitoring. Maybe it's doable long term (I do so love the Solaris approach to procfs; ptrace doesn't cut the mustard).

In any case, this may not matter; I have spent more time understanding the libdtrace library about how it handles:

dtrace -c prog
and how it grabs a running process. I took a new look at FreeBSD and noticed it used the only other valid alternative: ptrace, so I am grabbing ideas and code from FreeBSD to see if I can make progress.

Side note: using the Apple code is rather pointless, since it relies on the underlying Mach OS calls to do process manipulation and there's nothing similar in Linux - i.e. an uphill struggle.

The FreeBSD code is nice and simple, except it does rely on the EVENT subsystem in FreeBSD for inter-thread communication (not sure I fully follow it). I have stubbed it out for now - just so I can get something/anything working.

Hopefully when this is done, I can handle the reverse journey for USDT.

Let's hope 2009 is a better dtrace year. It will be a long slog to get dtrace reliable, and the more that people try it or comment on it, the better. I feel comfortable that key parts of dtrace just work, but I haven't addressed quality. (I am slowly trying to clean up compiler warnings, for instance, which many times obscure real silliness on my behalf).


Posted by Paul Fox | Permalink