Two new tools... well, old ones, revamped Wednesday, 10 August 2011  
"top" is a great tool - very old. And lacking in many ways, 20+years after it first appeared.

"proc" is my version of top. I wrote it many many years ago for Solaris, so it could do what *I* wanted. I've had this tool for a long time. I ported it to Linux, and have been happy with it.

Why is it better? Because it does color. It crams more data into the screen real estate. It uses a better sorting algorithm (there are lots to choose from), and highlights memory deltas.

"proc" is available from my tools area (next door to dtrace).

Why am I bothering to talk about proc?

Because I realised that as CPUs get faster - with 8 cores on an i7 - drilling down and understanding what's going on in a system requires more esoteric tools.

So, what's wrong with "proc"? For one thing, it's easy to see something of interest on screen which, on the next screen update, has disappeared. This can be annoying: as machines get bigger and run more processes, the screen real estate cannot show everything at once. Sometimes you want to go *backwards* and rewind to what you just saw.

Well, that requires a bit of re-engineering. But it's done. By default you can cycle back up to 20 minutes of history. History is stored in /tmp/proc; by the time we factor in the process table, the amount of data is quite staggering (about 1GB of data per hour). This includes the key process attributes, along with extensions for /proc/pid/wchan, /proc/pid/stack etc. (Nearly everything is kept, but not absolutely everything; e.g. signal masks are not stored). And this includes the threads.
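
To make the idea concrete, here is a minimal sketch of the recording side - this is NOT proc's actual on-disk format, just the shape of the idea: snapshot /proc/meminfo once a second into timestamped files under /tmp/proc, which a viewer can later replay.

/*
 * Minimal sketch of the history idea - not proc's real format.
 * Snapshot /proc/meminfo once a second into timestamped files
 * under /tmp/proc, so a viewer can later rewind the samples.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	char buf[65536], name[256];

	mkdir("/tmp/proc", 0755);	/* history directory */
	for (;;) {
		FILE *in = fopen("/proc/meminfo", "r");
		if (in == NULL)
			exit(1);
		size_t n = fread(buf, 1, sizeof buf, in);
		fclose(in);
		/* One file per sample; old samples would be pruned
		   to cap disk usage. */
		snprintf(name, sizeof name, "/tmp/proc/meminfo.%ld",
		    (long) time(NULL));
		FILE *out = fopen(name, "w");
		if (out) {
			fwrite(buf, 1, n, out);
			fclose(out);
		}
		sleep(1);
	}
}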

We also keep /proc/meminfo, /proc/vmstat and many more.

There's so much data that visually monitoring it is quite difficult. /proc/meminfo alone has so many fields that one cannot comprehend what is happening from one second to the next.

Even with history, it's not comprehensible.

So, the second major update to "proc" is graphics. The ability to see, in graphical format, what is happening to the various key stats is very educational and illuminating.

The implementation of graphs is interesting. Rather than creating an X11, KDE or GNOME application, I decided to implement this inside the terminal emulator. "fcterm" is my emulator of choice - and fcterm was recently enhanced to support various escape sequences to do line and rectangle drawing. By using simple printf/escape-sequences, anything can be drawn - sufficient for drawing graphs.
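
As an illustration of the approach - the escape sequence below is a made-up placeholder, not fcterm's real syntax (see the fcterm distribution for that) - a graphing client need be nothing more than printf():

#include <stdio.h>

/*
 * Hypothetical private sequence asking the emulator to draw a line
 * from (x1,y1) to (x2,y2). The "#L" form is a PLACEHOLDER, not
 * fcterm's actual syntax - the point is that plain printf() is all
 * a client needs.
 */
static void draw_line(int x1, int y1, int x2, int y2)
{
	printf("\033[%d;%d;%d;%d#L", x1, y1, x2, y2);
}

int main(void)
{
	int x;

	for (x = 0; x < 100; x++)	/* a crude ramp "graph" */
		draw_line(x, 50, x + 1, 50 - x / 2);
	fflush(stdout);
	return 0;
}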

[I have a sampler in the fcterm/ctw distribution, available on my site, written in Perl, which shows just about every /proc entry as a graph. It's crude but effective, and quickly shows that dumping all graphs onto a page just overwhelms; that is why proc was enhanced.]

I have just uploaded proc-b21 for you to play with (but you must use it inside fcterm; I haven't validated what it does in another xterm). It is still a work in progress, so don't bother reporting bugs to me yet.

I will upload a few images in the next blog post.


Posted at 22:06:16 by fox | Permalink
  Virginmedia Tivo Saturday, 06 August 2011  
Sometimes, I don't understand the web. If you read the reviews of Virginmedia's Tivo product, they all rave about how good it is.

But it isn't. The interface is buggy, and designed by people who haven't tried to read the text from across a living room, even on a big screen.

The fact that you cannot archive programs from the device without watching the same program on the main TV is, well, somewhat myopic.

The ethernet and USB interfaces do nothing (as yet). Why? It's 2011. (I think I know why: the film industry doesn't want people to watch films outside of a DRM-controlled environment.)

The on-demand and catch-up services are badly thought out and confusing.

The remote control is as bad as all other remote controls - it cannot be held in the hand and used one-handed - not if you are fast-forwarding.

The fast-forward on tivo is very badly thought out and confusing. Its idea of fast-forward is 2x or 3x. One cannot quickly scroll through without a lot of button pressing. (The jump in 10-minute intervals is broken and confusing.)

The suggestions engine does not take account of what you record and watch - it only seems to work from thumbs up/down.

Tivo does not understand a family who have different tastes and preferences.

The YouTube app is a joke. Watching 240x320 youtube videos in degraded quality on a 40" TV is about as ugly as you can get.

http://virgintivo.blogspot.com/2011/07/virgin-media-ceo-tivo-impact-will-equal.html

The above is typical of the glowing, self-satisfied reports on the product. Multi-room streaming? How will that work? If that means I can watch tivo on my PC in another room, then I am salivating.

If it means I can watch one tivo from another room, then think on! Who is going to have two Tivos in a household?

Applications on the tivo are just a real joke. On the ipad, applications are great because they allow a degree of 'context' without having everything come through the browser.

BTW, the Virgin Android TV guide app is poor. Very poor. It is welcome - it is better than nothing - but it is one of those 'why am I wasting storage space on my device' apps. (You can control your tivo device from this app, to do remote recordings; but there's a serious glitch in the way it works - you cannot record a program if it starts less than 35 minutes from now. Why?)

The tivo is 'not bad', but it's certainly not a step up from the prior Virgin+ device.


Posted at 22:10:55 by fox | Permalink
  cpu visualisation Friday, 22 July 2011  
It's quite interesting to contemplate different ways of looking at things.

I have an Intel i7 machine - it's fast (it's a laptop, so it could be faster still if I had a desktop CPU).

Linux provides a lot of raw data, but one thing that "top" lacks is more detailed info. There are display widgets for KDE and GNOME which help you visualise cpu load, but this display shows something interesting:

last pid: 4792 in: 4448 load avg: 1.28 0.71 0.43                      23:21:45
CPU: 8(HT)  @ 2.00GHz, proc:231, thr:464, zombies: 1, stopped: 5, running: 3 [t
dixxy:  7.3% usr, 0.1% nice, 1.5% sys, 84.6% idle, 6.4% iow, 0.1% sirq
RAM:7918M RSS:0K Free:303M Cached:1913M Dirty: 664K Swap:225M Free:7878M
cpu
Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
          usr   nice    sys   idle    iow    irq   sirq  steal  guest  gnice
CPU0     8.4%   0.0%   2.6%  73.6%  15.2%   0.0%   0.8%   0.0%   0.0%   0.0%
CPU1    63.6%   0.0%   1.8%  21.0%  14.2%   0.0%   0.0%   0.0%   0.0%   0.0%
CPU2     0.2%   0.0%   1.0%  99.2%   0.0%   0.0%   0.2%   0.0%   0.0%   0.0%
CPU3     2.4%   0.0%   1.0%  97.0%   0.4%   0.0%   0.0%   0.0%   0.0%   0.0%
CPU4     0.0%   0.2%   0.2% 101.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
CPU5     0.0%   0.0%   0.2% 100.2%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
CPU6     0.2%   0.0%   0.8%  99.4%   0.2%   0.0%   0.0%   0.0%   0.0%   0.0%
CPU7     0.0%   0.2%   0.6%  98.6%   1.2%   0.0%   0.0%   0.0%   0.0%   0.0%

        MHz       Cache    Bogomips
CPU0    2001.000  6144 KB  3990.88
CPU1    1400.000  6144 KB  3990.92
CPU2     800.000  6144 KB  3990.97
CPU3     800.000  6144 KB  3990.93
CPU4     800.000  6144 KB  3990.96
CPU5     800.000  6144 KB  3990.96
CPU6     800.000  6144 KB  3990.94
CPU7     800.000  6144 KB  3990.98

The info is taken from /proc/cpuinfo (this is the "proc" utility - available at my website; run it and type 'cpu' at the command line to see this display).

Note that CPU0 is running at 2GHz - to be expected, although slightly strange. It's strange because this is the cpu the proc command is instantaneously running on. It doesn't use much cpu, but the clock has been ramped up to give it speed. (Note that, as an i7, this CPU should be able to ramp up to 2.9GHz, but I haven't seen evidence in /proc/cpuinfo that this occurs.)

Note also that cpus 2-7 are idle (800MHz is the lowest speed without actually sleeping).

CPU1 is running at 1.4GHz - I have a backup job running in another window. The question is - *what is cpu1?* I presume it's a hyperthreaded cpu, and therefore should run slower than cpu0. Ideally, jobs should run on cpu0, cpu2, cpu4, cpu6, cpu1, cpu3, cpu5, cpu7, in that order.

The question in my mind - what is hyperthreading? Is it a fixed attribute of the cpu, or does it meander from one cpu to another? If the hyperthreaded sibling is purely virtual, then one can deduce that on this system we should see unequal performance as the 5th cpu is made to do work.
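
The kernel does publish the answer: sysfs exposes the sibling topology, and it is fixed at boot. A minimal sketch to dump it (the sysfs paths are the standard topology files):

#include <stdio.h>

/*
 * Print each cpu's hyperthread siblings. Siblings appear in each
 * other's thread_siblings_list, e.g. "0,4" - showing the mapping
 * is a fixed attribute, not something that meanders between cpus.
 */
int main(void)
{
	char path[256], buf[256];
	int cpu;

	for (cpu = 0; cpu < 1024; cpu++) {
		snprintf(path, sizeof path,
		    "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
		    cpu);
		FILE *fp = fopen(path, "r");
		if (fp == NULL)
			break;		/* no more cpus */
		if (fgets(buf, sizeof buf, fp))
			printf("cpu%d siblings: %s", cpu, buf);
		fclose(fp);
	}
	return 0;
}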

I just did a test (seeing how many "counts" we can do per second), and ran 5 of them in parallel. Certainly, one of them was not as busy as the other 4. [This was not a good test, since the counter-loop doesn't exercise cache-misses or hyperthread ability, but solely relies on the Linux scheduler to place the processes.]
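
For reference, a sketch of that kind of test (not the exact code I ran): fork N counting loops for a fixed interval and compare the per-process totals - markedly unequal counts suggest two workers sharing one physical core.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int i, nproc = argc > 1 ? atoi(argv[1]) : 5;

	for (i = 0; i < nproc; i++) {
		if (fork() == 0) {
			unsigned long count = 0;
			time_t end = time(NULL) + 5;	/* run ~5s */

			while (time(NULL) < end)
				count++;
			printf("worker %d: %lu counts\n", i, count);
			exit(0);
		}
	}
	while (wait(NULL) > 0)		/* reap the workers */
		;
	return 0;
}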

Definitely requires more investigation to understand the effects.


Posted at 23:21:37 by fox | Permalink
  warning! warning! warning! In the beginning was more. Then there was less. Tuesday, 19 July 2011  
In the very old days of computing, you could sit in front of a screen or a teletype and watch the output, a character at a time. 110 baud or 300 baud was eminently readable.

As output devices progressed to 9600 baud serial lines, one could fill a screen (80x24) in a second. And "cat" or "make" on its own was not good enough to read the text frantically scrolling off the screen.

Zoom forward a few years, and with today's multi-GHz cpus and fast screens, one can 'cat' a 10MB file to the screen in a few seconds.

Did you see the error on line 12,723,104? No? Didn't think so.

Tools like "more" and "less" are great for paging slowly through a file and allow searching and backwards motion.

Or, one can use an editor, such as vim/emacs/CRiSP.

These are great.

When building software, e.g. with gcc/g++, and as projects have grown bigger, it can be difficult to spot an error in the middle of a huge amount of benign output. Worse, gcc has a tendency to overdo the warnings. Scrolling in an xterm to review the output is frustrating - trying to spot the magic "error" in the midst of warnings (or other output).

There are many solutions (such as viewing the output in "more" or "less", and relying on highlighting to find the item you are after). "less" can do highlighting, but "more" cannot. CRiSP can do highlighting too.

fcterm (my own personal terminal emulator) can do this too, but you have to tell it what to search for. (I must modify it to have a default set of words - having a single search pattern is not good enough).

I wrote a simple tool called "warn". You use it like this:

$ warn make
...

and all error output lines are shown in red, with warnings in yellow. (My default console is green on black).
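
The guts of it are tiny. A sketch of the idea - not the released tool itself: run the command with stderr folded into stdout, and wrap matching lines in standard ANSI colour sequences.

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char cmd[4096], buf[4096];

	/* Fold stderr into stdout - compilers emit errors on stderr.
	   This sketch takes the command as a single argument, e.g.
	   ./warn "make -j4". */
	snprintf(cmd, sizeof cmd, "%s 2>&1", argc > 1 ? argv[1] : "make");
	FILE *fp = popen(cmd, "r");
	if (fp == NULL)
		return 1;
	while (fgets(buf, sizeof buf, fp)) {
		if (strstr(buf, "error"))
			printf("\033[31m%s\033[0m", buf);	/* red */
		else if (strstr(buf, "warning"))
			printf("\033[33m%s\033[0m", buf);	/* yellow */
		else
			fputs(buf, stdout);
	}
	return pclose(fp);
}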

Very useful for spotting the wood for the trees.

I haven't released it as a standalone tool (it has bare-minimum requirements - it's plain C code). If people are interested, I will put it out.

Next up is to fix fcterm...


Posted at 22:13:48 by fox | Permalink
  What does '1' mean? Sunday, 17 July 2011  
In the context of load average on a system, a load avg of 1 is meaningful if you are on a single-cpu system: it means the cpu is continuously busy.

Now consider multicore/multicpu machines. A load avg of 1 is not quite so meaningful. On Linux, the load average is a moving average of the number of runnable (and uninterruptibly blocked) processes. It slowly ramps up and ramps down.

Doing heavy-duty work (like parallel compilation) means that "gmake -j" doesn't have enough information to determine if the system is busy.

In the old days, when a source file compilation could take many seconds or minutes, the load average told us what the system was doing.

On an 8-core (Intel i7) cpu, doing 'gmake -j' can invoke tens of parallel compilations, yet 'top' can show the system as being idle, because the load average takes a while to ramp up.

On an 8-core system, with one cpu being busy, should we say 'the system is busy' (system usage == 100%), or should we say it is idle (system usage == 12.5%)?

The answer depends on what you are measuring and how you want to handle it. If 1 out of 8 cpus is busy (maybe the application is broken and stuck, eating cpu continuously), then that is important. The system may be busy, but noticing that rogue application is useful. Ignoring it until all 8 cores are busy may mean waiting forever.
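
One small step in that direction is simply normalising the load average by the cpu count, so "1.0" on this box reads as 12.5%. A minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	double load1;
	long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
	FILE *fp = fopen("/proc/loadavg", "r");

	if (fp == NULL || fscanf(fp, "%lf", &load1) != 1)
		return 1;
	fclose(fp);
	/* 1-minute load as a percentage of the online cpus */
	printf("load %.2f on %ld cpus = %.1f%% of capacity\n",
	    load1, ncpu, 100.0 * load1 / ncpu);
	return 0;
}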

An additional complexity: on a totally idle system, a single CPU can ramp up its clock speed; but if that cpu is not doing useful work, then a second cpu may not be able to ramp up as high, and will get worse performance.

In the end, what is useful is to notice one or more processes 'behaving badly', e.g. consuming too much cpu, or too many failed syscalls, or too much I/O.

Today top (or my application, 'proc') does not readily show that, but that needs to change.


Posted at 12:59:11 by fox | Permalink
  dtrace gripe Wednesday, 13 July 2011  
I really dislike some aspects of dtrace. It's a great tool, but the "let's pretend we are C" when it isn't is a nuisance. Macro languages should be designed to be expressive, but dtrace's D language is annoying.

Firstly, the lack of if-then-else is a problem. It leads to convoluted use of ?: (which cannot handle multiple statements). I really don't understand why if-then-else isn't there. It wouldn't harm the "Thou shalt not have loops" rule - it is loops, not branches, that can lock up a kernel.

What's annoying is that D copies C's restrictions to an extent that is... well, annoying!

Consider this: I want a probe which can exit after 5s of execution time. Here's the naive implementation:

BEGIN {
	t = timestamp;
	}
tick-1ms {
	timestamp - t > 5*1000*1000*1000 ? exit(0) : 1;
}

This isn't possible, because exit(0) is a void function.

BEGIN {
	t = timestamp;
	}
tick-1ms {
	timestamp - t > 5*1000*1000*1000 ? (int) exit(0) : 1;
}

But, oh no! You cannot cast a "void" to an "int". In C, I can (almost) understand that, but it leads to painful workarounds. In D, there is even less reason: if a (void) could be cast to "(int) 0", then the above would work. It would still be ugly, but functional.

The actual solution is:

BEGIN {
	t = timestamp;
	}
tick-1ms / timestamp - t > 5*1000*1000*1000 / {
	exit(0);
}

Which is fine - although I haven't determined whether the predicate is more expensive than putting the test in the body. What is annoying is that the predicate is a "different part of the language". What if I wanted to do this:

tick-1ms {
	do-some-stuff;
	if (var > somevalue) { printf("hello"); exit(0);}
	do-some-more-stuff;
	if (var > someOthervalue) {printf("world"); }
	...
}

This can be translated into predicate format, but the transformation can be ugly: each "if" becomes a separate clause, whose predicate must repeat (and negate) the conditions of the clauses before it - especially painful if the do-stuff lines of code are complex in themselves.

It's time to start addressing these deficiencies in dtrace (at the risk of adding non-standard extensions to the true code).


Posted at 22:30:11 by fox | Permalink
  CRiSP website updated Tuesday, 12 July 2011  
After many years of staring at abject ugliness, http://www.crisp.demon.co.uk has been given a lick of paint, more in tune with the blog site in terms of look and feel and stylesheet.

I have updated some of the very dated things, and hope to update more of it in due course.

Obligatory plug: you can now purchase CRiSP (via paypal) if you so choose.


Posted at 21:47:42 by fox | Permalink
  dtrace update Monday, 20 June 2011  
I've updated dtrace to slightly improve the xcall code. Having tried it on an AS5 kernel, I hit some other issues.

Some build issues are fixed (2.6.18 kernels confuse the syscall extraction code); it mostly works - but some warnings are present. Additionally, a 'dtrace -n syscall:::' will crash the kernel. I suspect some mismatch on the ptregs syscalls and/or 32b syscalls on this kernel. Need to debug.

I also found that on a 16-core machine, the xcall code leads to a lot of noise when things aren't the way it expects. This eventually led to an assertion failure in dtrace.c (on a buffer switch - consistent with dtrace_sync() not hitting the expected cpus, i.e. some race condition/bug), and eventually a failure from the kernel that a vm_free was invalid.

Oh dear.

To date I have been testing on dual-core cpus. I need to get an i7 so I can ramp up to 8 cores and do more heavy torture tests.

So, keep an eye out for updates (which are likely to be slow in coming over the next week or two), whilst I hopefully refine the xcall issue.


Posted at 23:05:58 by fox | Permalink
  NMI support added Sunday, 19 June 2011  
I modified the cross-cpu call code to allow use of the NMI interrupt when the IPI interrupt is not responding. Hopefully this will stop the xcall code busting a lock due to a deadlock/timeout.

It looks like the APIC allows specific interrupts to be marked as NMI - which would be great since rather than sharing the NMI with other users of the interrupt, we could just make the IPI interrupt work like an NMI and avoid the deadlock scenario.

For now, the interrupt handler tries to be careful and not trigger when it's uncalled for. It does present a problem if we need the NMI at the same time as someone else, but I can investigate what/how the APIC works a little better (or check the Solaris code to see if, indeed, that is what it does).

I also need to update the dtrace_linux.c code so that I don't just grab interrupt vector 0xea (a random vector which appears not to be used, but might be). I am a naughty programmer.

Release 20110619 contains the above fixes.


Posted at 22:14:33 by fox | Permalink
  The Final Phase of dtrace Sunday, 19 June 2011  
I have been writing about the issues of inter-cpu cross function calls (xcall) - a key part of dtrace. This feature isn't used very often, but it's an important part of SMP, ensuring consistency in accessing buffers.

After a lot of effort - writing, rewriting and rewriting yet again - the code is (nearly) finished. It looks good: it handles arbitrary cpus calling into other cpus, and allows an xcall to be interrupted by another call to xcall (effectively a mesh of NCPU * NCPU callers).

However, I have found a flaw. If I modify the dtrace_sync() function to sync 100-200 times instead of just once, then occasionally there are delays and kernel printk()s from the code - where spinlocks are taking too long.

Turns out, we could deadlock if we try to invoke an IPI on another CPU which has interrupts disabled. Not totally sure how Solaris handles this - I get a little lost in the maze of mutex_enter() and splx() code.

There is a solution to take us to the next level: the NMI. The NMI interrupt is not maskable (unless an NMI is already in progress). NMIs are typically used by Linux for a watchdog facility - making sure CPUs aren't locking up - as well as for "danger signals" (like ECC/parity memory errors).

I will experiment to see if I can run via an NMI rather than a normal interrupt and that should help reduce the problems of lock-busting significantly.

At the moment dtrace is pretty good - my ultra-torture tests really are horrible, and most people won't do anything like that in real life.

So, as always, tread carefully until *you* feel happy this is not going to panic your production system.


Posted at 12:50:48 by fox | Permalink
  Update to prior post Thursday, 16 June 2011  
My findings in the prior post are not strictly the end of the story. Subsequent to this, I found that I needed to resort to the cmpxchg instructions to beef up the resilience of dtrace_xcalls. Current results look good.

More testing to follow and I need to fix AS4 (Linux 2.6.9) kernel compilation issues.


Posted at 23:22:41 by fox | Permalink
  DTrace xcall issue -- fixed? Website of the day. Thursday, 16 June 2011  

http://forum.osdev.org/viewtopic.php?f=1&t=21768&start=0

Now, this has been driving me nuts for months. Why was my spanking new cross-cpu code hanging occasionally? I had spent ages building up the courage to write it, and was fairly proud of it. But it just wasn't reliable enough, and I disabled it in recent releases of dtrace.

Here's the problem: a cross-cpu synchronisation call is needed in dtrace. Not often, but in key components. I feel the way this was done in dtrace was almost laziness, because there are other ways to achieve this (I believe). But the single cross call (in dtrace_sync()) is a problem...

Interestingly, I was surprised it was called so often. It's called during the tear-down of /usr/bin/dtrace as the process exits. I had wondered why dtrace intercepts ^C and doesn't die immediately. It does something very curious - it intercepts ^C and asks the driver nicely to tear down the probes we may have set up. Of course, you can kill -9 the process, and that works. *But*. *But*. If you do that, the probes aren't torn down! Instead, they are left running. After about 20-30s, since nothing in user land empties the buffers, the kernel auto garbage collects; but it means that in a kill -9 scenario, whatever you were tracing may continue to take effect.

I don't like the way ^C works in dtrace and I may attempt to fix it (e.g. fork a child to tear down the probes; tear-down is done by a STOP ioctl(), btw).

Ok - so cross calls happen a lot, especially during tear-down (and also during timer/tick interrupt handling).

So... what happens? Well, on a two-cpu system, the cpu invoking the cross call deadlocks, waiting for the other cpu to acknowledge the remote procedure call.

With the original Linux smp_call_function() there were lots of issues in calling it with interrupts disabled (i.e. from the timer tick interrupt). This is not allowed - two cpus calling each other at the same time will deadlock.

The cross-call code has to run with interrupts enabled and that means being very careful with reentrancy and mutual invocation.

One day I put some debug into the code to try and spot mutual or nested invocations and I got a hit. On a real machine. But never on my VMs.

I modified the code to allow a break-out - after waiting too long, the code gives up, allowing the machine to stay intact. Without this, the machine would lock up (deadlock with interrupts disabled).

I fixed the code to handle mutual invocation and recursion.

But I could not figure out what the locked-up CPU was doing. I tried to get stack dumps from the locked CPU - but these would only appear after dtrace had given up waiting. It's as if the other CPU was asleep and wouldn't wake up until the primary CPU had given up looking (a definite Heisenbug!).

The web link at the top of this page illustrates the exact same setup I was seeing. So, I followed the page (it explains that acknowledging end-of-interrupt to the APIC prematurely may not work on a VM).

Not only had I spent a huge amount of time understanding, fixing and engineering a solution, but I almost had a working solution without realising it. I had previously moved the APIC_EOI code to the end of the interrupt routine, but because of the lack of support for mutual invocation, it hadn't worked. So I put it back again.

So I think this is looking good - much better than before. I need to do more torture testing and cleanup before I release.

Along the way, I tried, or started trying, lots of things - like using a crash dump to analyse the problem (which wasn't successful), or using NMI interrupts instead of normal interrupts. I've learnt a lot, and been frustrated by a lot, too.

Keep an eye on twitter .. I'll report a status update if I think I am not close enough.


Posted at 20:02:21 by fox | Permalink
  My blogs Sunday, 12 June 2011  
Just reading Nigel Smith's blog (http://nwsmith.blogspot.com/), which has a nice back-reference to this blog and dtrace.

People may find my blogs a bit confusing. I thought it worth detailing "why".

Originally I set up a series of blog posts, using my own Perl blog code, which was in turn, based on the nanoblogger code. (http://nanoblogger.sourceforge.net/).

The website I publish to (www.crisp.demon.co.uk) is interesting in itself. Demon was the first ISP in the UK (back in the early 1990s) to offer access to the Internet. Alas, they have never done anything useful since then, and I pay subscriptions for a near-useless service (a teeny amount of web space; no perl or cli or anything else). Because space is so tight, I tend to leave most things, including the CRiSP and Dtrace downloads, on my internet-facing machine at home. The only thing Demon usefully serves me is the email address, although I do try to get people to switch to my (numerous) gmail accounts.

I was using the Dyndns service for a DNS entry, but due to some silliness on my part, I lost the name entry, which put dtrace off the map for many people. I reinstated a new address (via crisp.dyndns-server.com).

I should just pay for a normal DNS entry, but I haven't decided what I want.

The crisp.demon.co.uk service is costly - much more costly than a decent hosted web appliance - so I do need to do something.

At home, I have two main dev machines - and when I post to the blog, I try to update both the original Demon-hosted site and blogger. It turns out to be easier to update blogspot first, and the Demon site later, when I "get around to it". ("Get around to it" means powering on my main PC, running a script, and shutting it down again.) Things got confused because I have two dev machines and have to be careful how I sync them with each other.

So, that's the feeble excuse for me appearing and disappearing in the waves.


Posted at 17:12:51 by fox | Permalink
  dtrace progress Saturday, 11 June 2011  
Been continuing work to increase the resilience of dtrace. One thing I found was that there are some syscalls which have a differing calling sequence compared to the others (fork, clone, sigreturn, execve and a few others).

Bear in mind when we think of a kernel - there are multiple views of the kernel:

  - 64b kernel running 64b apps
  - 32b kernel running 32b apps
  - 64b kernel running 32b apps

The apps get to the kernel via system calls. System calls are implemented in a variety of ways, depending on the kernel version and the CPU. (Some older cpus, such as the i386 and i486, don't support instructions like SYSCALL and SYSENTER.)

So dtrace traps the system calls by patching the system call table. The code is mostly the same, but subtly different, for a 32b and a 64b kernel.

But when a 32b app is running on a 64b kernel, the app doesn't know any different - but the kernel does. The kernel has two system call tables: a given system call, e.g. "open", has a different index in each. The two ABIs developed differently: i386 kernels have had to maintain backwards compatibility, but the amd64 kernel did not, and started afresh at the point these cpus became available.
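
A tiny illustration of the two tables - the same syscall has a different number in each ABI:

#include <stdio.h>
#include <sys/syscall.h>

int main(void)
{
	/* On x86, this prints 2 when compiled -m64 (the amd64 table)
	   and 5 when compiled -m32 (the i386 table). */
	printf("__NR_open = %d\n", __NR_open);
	return 0;
}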

Dtrace handles that.

Except it didn't handle the special syscalls: when a 32b app invokes fork(), clone(), etc., we usually ended up panicking the kernel.

Most Linux distros are "pure": a 64b distro has 64b apps, so you rarely see the effect of a 32b app.

Linux/dtrace has a nice interface for system calls. The probe name, e.g.

$ dtrace -n syscall:::

matches all system calls. But the 32b and 64b calls are different probes. So, you can intercept all 32b syscalls on a 64b system:

$ dtrace -n syscall:x32::

which is useful in many ways.

I have nearly fixed these special syscalls on the 64b kernel - I just have clone() left to fix. The symptom of not fixing them is a cascade of kernel OOPSes and panics (because the kernel stack layout is not what it should be).

I hope to release later today a fix for this problem.


Posted at 09:33:06 by fox | Permalink
  dtrace -- some updates Sunday, 05 June 2011  
After spending a lot of effort on the xcall issue, I had hit a problem where, occasionally, system calls would fail. The regression test shows this up by running a perl script which continuously opens an existing and a non-existing file, plus a variety of other things.

Very occasionally, Perl would emit a warning about a reference to a file handle belonging to a file which couldn't be opened (/etc/hosts - which always exists).

Similarly, other apps would occasionally fail to start with rtld linker errors.

This proved very hard to track down: I was pretty certain it was related to the xcall work I was doing. The errors were rare - less than 1 in a million - and almost impossible to reproduce on demand.

I moved away from xcall debugging and found that, with two simple perl scripts (on a dual-core machine) continuously opening files and nothing else, the error rate would increase whilst the two scripts ran.
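
The original scripts were perl; a C sketch of the same stress test looks like this (run two copies side by side):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long iters = 0, errors = 0;

	for (;;) {
		int fd = open("/etc/hosts", O_RDONLY);
		if (fd < 0)
			errors++;	/* should never happen */
		else
			close(fd);
		if (open("/no/such/file", O_RDONLY) >= 0)
			errors++;	/* should never succeed */
		if (++iters % 1000000 == 0)
			printf("%ld iterations, %ld errors\n",
			    iters, errors);
	}
}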

To try and get a better handle on this, I moved from debugging a 64-bit kernel to a 32-bit kernel, where the error rate was significantly higher.

After a lot of experimentation, it transpired that the error wasn't to do with xcall, but with the syscall provider. Specifically, a piece of assembler glue turned out to be rubbish. I am not sure why it ever appeared to work, but it didn't. (I had made some changes earlier on which may have broken the syscall tracing on 32-bit kernels.)

After recoding the assembler glue, things looked much better. The errors in syscall processing appeared to be gone. But a new problem surfaced - one I wasn't too surprised to see. There are a handful of 32-bit syscalls which use a differing calling convention to the others. (The 64-bit code handles this, but not the 32-bit code.)

I have nearly finished redoing the 32-bit syscall tracing, and, once done, will need to validate the 64-bit syscall tracing.

If I am lucky, hopefully in the next few days or weeks, the resiliency issues will disappear and I can put out a new release.

The syscall tracing code is horribly ugly - because we have to support different calling conventions across the two types of cpu architecture. I may split the code up into an x86 and x86_64 code file.


Posted at 21:31:11 by fox | Permalink
  Bad websites Thursday, 02 June 2011  
Bad websites. What's with them?

I have a beef with a variety of websites - nice websites, let down by the "we don't care" attitude, or the "we didn't test it" attitude.

http://www.tvguide.co.uk

I despair of this website. It's a great guide to TV channels for UK people. Nice layout. Lots of content.

So, whats wrong with it?

A number of things. One - the menu bar at the top of the screen is over-engineered. If you try to do something, like select one of the sub-menu items, navigating without losing context is near impossible. Try and select something, e.g. "New series". I leave you to find which submenu that's under (a minor annoyance).

Secondly, there is the huge amount of real estate given over to pointless banners. These aren't advertising banners, but program banners. On a small screen you have no information content on the first screen at all. On a large screen you barely get 50% of your screen for the TV grid.

The search function is badly over-engineered with javascript.

And if you turn off some ad sites via an ad-blocker, the whole page becomes non-functional.

And lastly, the page quite often forgets who you are and your channel selections.

I gave up with this site, and wrote my own TV highlighting application.

BBC RadioTimes

The BBC provides XML files containing 14 days of TV schedules. This is a great source of data (which I use in my TV planning application).

But the reviews are *awful*. No, make that, *truly awful*. When I see a film or a series of potential interest, the paragraph of review is of this form:

This film, made by XXX YYY, is a follow on to his earlier work ZZZ, AAA, BBB.
The director did blah, and the actors did bloop. The film won an award at
Cannes, and went straight to video.

Can you tell what's wrong with the above? It's totally devoid of any information about what the film or program is *about*. The reviews/write-ups on tvguide.co.uk at least tell you what the program is about.

Here's a real quote from the BBC:

One of two low-budget westerns made by Barbara Stanwyck - the other was 
1956's The Maverick Queen - before she found her glorious late-career 
stride with such titles as Forty Guns and TV's The Big Valley. 
Aided by thoughtful direction from the prolific and talented Allan Dwan, 
this movie now has great curiosity value, in that the leading man is 
former US president Ronald Reagan, a bland and colourless performer 
when pitted against screen villains Gene Evans and Jack Elam. 
The location scenery is very attractive, the action sequences 
well staged, and Stanwyck as tough as ever: it's a shame the 
script didn't give her or any of the cast more opportunities. 
Still, this will pass the time nicely, and teenage girls might 
discover a useful role model.

Gizmodo.com

My first site of the day is http://www.dailymail.co.uk. Yes, I know - that's a poor choice of a news website, but consider it bubblegum for the brain first thing in the morning. My second is Engadget. A very nice and highly fluid website with news stories of interest to me.

And the third *was* Gizmodo. But I have removed the link from my web browser.

On my ipad, I have a cached page dating back to April - I cannot get it to update. I don't know what they did. On my other devices, I don't have the caching problem. Gizmodo used to track Engadget in style and content. But recently, they have overhauled the site. And they have not done any user testing, as far as I can tell.

First, I would be redirected to the mobile site, even though I don't want that. Now, they have reformatted the website, and it's totally devoid of content on the front page.

It used to be a great site, but now I waste my monthly bandwidth quota visiting Gizmodo and hoping for something useful to browse.

So, goodbye to Gizmodo. Maybe, when others start linking to it again and it contains useful content (even if it's a rehash of other sites), I will revisit.

Slashdot

This site has been great for years. Until now. The pool of people-powered news stories they have is great. But slashdot have been playing games with their presentation, and - as my 4th read of the day - it is close to being binned as well.

For starters, the three-column format is annoying. Very annoying. When browsing on a mobile/small-screen device, the left-hand column requires you to scroll the screen to view the text. I never look at the left-hand column - because I know it never changes. So why waste prime real estate with that, *there*?

Next. Slashdot has tried to avoid slow and large home page loads. I applaud that. But they have done it by limiting the number of visible stories to about 6. Given that they seem to dribble items out at about one per hour, it's pointless visiting the site repeatedly during the day. And if you leave it too long, you lose continuity of which stories you have and haven't read. (You have to scroll to the bottom of the screen, click on "More", wait, wait, and then you see the stories you saw a few hours ago.)

Slashdot seems to have "lost it". It used to be an interesting place to read non-news stories about technology, but they have taken the Gizmodo approach - reducing the amount of useful info on the page to the point where visiting it has become boring.

BBC

BBC - what a poor website. It used to be awful. Now it is pointless. Another home page devoid of content. It's full of flash cleverness where you can edit the layout, but I don't want to do that. I want to see news. The news page is devoid of information - almost as if news were a commodity in short supply.

(Compare the BBC news defaults with the Dailymail website - there's enough information in each paragraph on Dailymail to decide if you want to read further. On BBC, you have to guess whether the news item says anything useful.)

Next, try reading BBC on a mobile device. The customisations do not work (at least, not on my android device). The site is untested in real life. I rarely look at BBC - every few years when I look, I think the same thing. A waste.

There *is* good content on the BBC site - if you spend the time to find the programme schedule and radio information. But using the BBC website is like having an unfaithful lover: things move around so much that you are never sure the site will be the same when you next visit. It would not be so bad if it got better when the changes happened. But it gets worse.

The ratio of information content to real estate is so low that it reminds me of the days of the Teletype (an ASR-33 with a paper tape punch).

Can I do better?

I don't for one moment think I can do better than these sites. I have learnt lots of interesting things (both in terms of content and in terms of presentation). But the dilution of news sites, which all feed off each other, has made the internet quite boring.

Which is a shame.


Posted at 21:28:57 by fox | Permalink