/var/log/ksymoops modprobe error flood with a Xen kernel

I have a small virtual server in Linode that I use for various public-facing things such as serving web pages, a small Debian repository, email etc. Unfortunately it has a bad habit of filling up /var/log – I did say it was small.

I noticed today that a lot of space was being used up in `/var/log/ksymoops` – new entries every few seconds, all of the form:

xen:~# tail -f /var/log/ksymoops/20180502.log
20180502 183837 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183837 probe ended
20180502 183842 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183842 probe ended
20180502 183842 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183842 probe ended
20180502 183842 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183842 probe ended
20180502 183842 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183842 probe ended
20180502 183847 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183847 probe ended
20180502 183847 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183847 probe ended
20180502 183847 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183847 probe ended
20180502 183847 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183847 probe ended

What’s more, these log files are not subject to logrotate by default, so I have them going back years, for as long as the VM has existed. While this is somewhat concerning, it does allow me to work out what happened.

The ksymoops logfiles are relatively sparse (a few lines every month or so) up until the date last year that I dist-upgraded the server from Debian 8 to 9. Then the dam broke and it has been throwing a batch of four errors like the above every five seconds since, day and night. Something somewhere is calling modprobe in a very tight loop, and it appears to be related to a package upgrade.

I managed to track down the offending process by replacing /sbin/modprobe with a script that called `sleep 100000` and running `ps axf`. It turns out that it had been called directly from one of the kworkers. If all four kworkers (2xHT) are doing the same thing, that would explain the pattern of errors.
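
A minimal stand-in along those lines looks something like this (the trace file path is just an illustration, and the real binary needs moving out of the way first):

#!/bin/sh
# temporary stand-in for /sbin/modprobe: log the arguments, then block
# so that the calling process stays visible in `ps axf`
echo "modprobe called with: $*" >> /tmp/modprobe-trace.log
sleep 100000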

Remember that this is in Linode – their VMs run under Xen paravirtualisation with non-modular kernels. Which means that modprobe, lsmod etc. have no effect – they crash out with the following error if you try to use them:

xen:~# lsmod
Module Size Used by Not tainted
lsmod: QM_MODULES: Function not implemented

This is only to be expected, as the VM’s modutils won’t talk to the Xen para-v kernel. But why is a para-v kernel worker triggering modprobe, when failure is guaranteed?

Luckily, I don’t have to care.

ksymoops is non-optional, but it can only write its logfiles if `/var/log/ksymoops` exists. The solution is simple.

rm -rf /var/log/ksymoops

Peace and quiet.

warning: connect to Milter service unix:/var/run/opendkim/opendkim.sock: No such file or directory

I currently run a postfix mailserver and have souped it up to use all the latest security features (see Hamzah Khan’s blog for a good tutorial). One thing that had been bothering me though was the appearance of the above milter connection failures in the logs – even though these seemed to fail gracefully it was a worrying sign that something was Just Not Right.

After a lot of trial and error, it seems that the culprit is my postfix chroot jail. I had originally attempted to compensate for this by defining “Socket /var/spool/postfix/var/run/opendkim/opendkim.sock” in /etc/opendkim.conf, but even so, postfix was throwing errors (and no, putting the socket in the standard location doesn’t work – I tried that!). It turns out that postfix sometimes attempts to connect to the socket from inside the jail, and sometimes from outside. The solution is to create a soft link in the standard location pointing to the real socket inside the jail.
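
With the paths above, that boils down to something like:

# recreate the standard socket directory outside the jail (if needed)
# and point it at the real socket inside the postfix chroot
mkdir -p /var/run/opendkim
ln -s /var/spool/postfix/var/run/opendkim/opendkim.sock /var/run/opendkim/opendkim.sock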

Of course I could have reconfigured it to bind to a localhost port instead, but the soft link was less work.

Reassociating old Time Machine backups

In an attempt to get myself cheap remote backups over the internet, I bought a Raspberry Pi kit and set it up as a hackintosh Time Capsule by attaching my USB backup disk to the Pi. I however wanted to keep my existing backup history, so instead of using a fresh Linux-formatted partition (like a clever boy) I tried to get the Pi to use my existing HFS+ filesystem. Anyone interested in trying this should probably read about Linux’s flaky HFS+ user mapping and lack of journaling support first, and then back away very slowly. I blame this for all my subsequent problems.

After some effort I did get my aging Macbook to write a new backup on the Pi, but I couldn’t get it to see the existing backups on the drive. Apple uses hard links for deduplication of backups, and because remote filesystems can’t be guaranteed to support them it uses a trick. Remote backups are written not directly on the remote drive, but into a sparse disk image inside it. Thinking that it would be a relatively simple matter to move the old backups from the outer filesystem into the sparsebundle, I remounted the USB drive on the Mac (as Linux doesn’t understand sparsebundles, fair enough).

The Macbook first denied me the move, saying that the case sensitivity of the target filesystem was not correct for a backup – strange, because it had just created the sparsebundle itself moments before. Remembering the journaling hack, I performed “repair disk” on both the sparsebundle and then the physical disk itself. At this point Disk Utility complained that the filesystem was unrecoverable (“invalid key length”) and the physical disk would no longer mount. In an attempt to get better debug information from the repair, I ran fsck_hfs -drfy on the filesystem in a terminal. This didn’t help much with the source of the error, but I did notice that at the end it said “filesystem modified =1”. Running it again produced slightly different output, but again “filesystem modified =1”. It was doing something, so I kept going.

In the meantime, I had been looking into ways of improving the backup transfer speed over the internet. I originally planned to use a tunnel over openvpn, but this would involve channeling all backup traffic through my rented virtual server, which might not be so good for my bank account. I did some research into NAT traversal, and although the technology exists to allow direct connections between two NATed clients (libnice), I would have to write my own application around it and at this point I was getting nervous about having no backups for an extended period. I had also been working from home and getting frustrated with the bulk transfer speed between home and work, and came to the conclusion that my domestic internet connection couldn’t satisfy Time Machine’s aggressive and inflexible hourly backup schedule.

Six iterations of fsck_hfs -drfy later, the disk repair finally succeeded and the backup disk mounted cleanly. At this point, I decided a strategic retreat was in order. I went to set up Time Machine on the old disk, but it insisted that there were no existing backups, saying “last backup: none”. Alt-clicking on the TM icon in the tray and choosing “Browse Other Backup Disks” showed however that the backups were intact. While I could make new backups and browse old ones, they would not deduplicate. As I have a large number of RAW photographs to back up, this was far from ideal. There is a way to get a Mac to recognise another computer’s backups as its own (after upgrading your hardware, for example). However, it threw “unexpectedly found no machine directories” when attempting the first step. It appeared that not only did it not recognise its own backup, it didn’t recognise it as a backup at all.

After a lot of googling at 2am, it emerged that local Time Machine backups use extended attributes on the backup folders to store information relating to (amongst other things) the identity of the computer that had made the backup. In my earlier orgy of fscking, the extended attributes on my Mac’s top backup folder had been erased. Luckily, I still had the abandoned sparsebundle backup in the trash. Inside a sparsebundle backup, the equivalent metadata is stored not as extended attributes, but in a plist file. In my case, this was in /Volumes/Backups3TB/.Trashes/501/galactica.sparsebundle/com.apple.TimeMachine.MachineID.plist, and contained amongst other bits and bobs the following nuggets:

<key>com.apple.backupd.HostUUID</key>
<string>XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX</string>
<key>com.apple.backupd.ModelID</key>
<string>MacBookPro5,1</string>

These key names were a similar format to the extended attributes on the daily subdirectories in the backup, so I applied them directly to the containing folder:

$ sudo xattr -w com.apple.backupd.HostUUID XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX /Volumes/Backups3TB/Backups.backupdb/galactica
$ sudo xattr -w com.apple.backupd.ModelID MacBookPro5,1 /Volumes/Backups3TB/Backups.backupdb/galactica

After that was fixed, I could inherit the old backups and reassociate each of the backed up volumes to their master copies:

$ sudo tmutil inheritbackup /Volumes/Backups3TB/Backups.backupdb/galactica/
$ sudo tmutil associatedisk -a / /Volumes/Backups3TB/Backups.backupdb/galactica/Latest/Macintosh\ HD/
$ sudo tmutil associatedisk -a /Volumes/WD\ 1 /Volumes/Backups3TB/Backups.backupdb/galactica/Latest/WD\ 1/

The only problem arose when I tried to reassociate the volume containing my photographs. Turns out they had never been backed up at all. They bloody well are now.

So what happened to my plan to run offsite backups? I bought a second Time Machine drive and will keep one plugged in at home and one asleep in my drawer in work, swapping them once a week. This is known as the bandwidth of FedEx.

Ubuntu upgrade hell

So I decided to upgrade my work ubuntu laptop. I had been putting it off for ages because of the hell I went through upgrading it from 6.06 to 8.10, but being stuck on old versions of pretty much everything (but especially openoffice) was becoming impossible. Strangely though, it was when I tried to install monkeysphere that I finally snapped. Time to do a dist-upgrade to 10.04 I thought.

I started using the desktop package manager – it had been prompting me with “a new version of Ubuntu is available” for quite some time, so I pushed the button and let it do its thing. After an initial false start (out of disk space) it downloaded the upgrades and went to work. About half an hour in, it decided to restart kdm, which is where the fun started.

Now I’m at a command login and the upgrade is in a bad state. A few apt-get -f installs later I find that there’s a file missing in the latex hyphenation config and one of the postinst scripts keeps failing. The postrm script fails too, so I can’t even remove the offending package. After sleeping on the problem, grep -r comes to my rescue and I find the reference to the missing file. Commenting that out just bumped me to the next missing file, so I rmed the lot. I know I now have a badly broken latex system but dammit, I just want this upgrade to finish.

A few more apt-get -f installs and apt-get dist-upgrades later, and the system is ready to reboot. But grub can’t find the root partition, and it drops me to an initramfs which can’t see any hardware. So I had to fire up a 10.04 install cd that I fortunately had to hand, and use it to chroot into the root partition and rebuild the initrd.
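
Roughly, the recovery from the install CD boils down to the following (device names are illustrative – use whatever your root partition actually is):

# from the rescue environment: mount the root partition and the virtual filesystems
mount /dev/sda1 /mnt
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt
# inside the chroot: rebuild the initrd and refresh the grub config
update-initramfs -u -k all
update-grub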

Finally, it boots. But I can’t login graphically because gnome-session can’t be found. Back into the command line and apt-get install ubuntu-desktop, which takes another half an hour because it’s all missing, the lot of it. At this point, I notice something odd – I can use my thinkpad fingerprint reader to log in on the command line, but not graphically – scanning the finger when in X gives nothing, not even an error.

Anyway, my xorg.conf file is apparently no longer valid, as it doesn’t recognise the dual screen setup. I rename it and let it run on auto-detect and the screen comes back, but the EmulateWheel on my trackball now doesn’t work. So I run X :2 -configure to get a skeleton xorg.conf file, save that in /etc/X11/xorg.conf and cut&paste my old mouse settings into it. This doesn’t work.

At this point, having used sudo several times, I accidentally discover how to make the fingerprint sensor work while inside X. When prompted, scan your finger. Wait two seconds, hit Ctrl-C and then enter. Don’t ask me why.

It turns out that in recent versions of xorg, you need to set the option “AllowEmptyInput off” in the ServerFlags section or else it ignores any mouse or keyboard configuration sections. Sure enough, this allows EmulateWheel to work again, but the mouse pointer also moves while the emulated scroll wheel is turning. I’m now running very late and this sorry saga will have to continue in the new year.
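
For reference, the relevant pieces of xorg.conf end up looking something like this (the InputDevice section is an illustration, not my exact trackball settings):

Section "ServerFlags"
    Option "AllowEmptyInput" "off"
EndSection

Section "InputDevice"
    Identifier "Trackball"
    Driver     "mouse"
    Option     "EmulateWheel"       "on"
    Option     "EmulateWheelButton" "2"
EndSection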

Merry Christmas, software developers.

How to upgrade Debian etch with kernel 2.4

The change to udev badly broke the Debian upgrade pathway from etch to lenny and above. If you are still running a 2.4 kernel when you try to upgrade, you can easily be left without any kernel at all. The “official” way to do it is to first add the etch repositories and upgrade to etch’s 2.6 kernel, then reboot into your udev-capable kernel before continuing. Now that the etch repositories have gone, this is quite difficult.

But not impossible! You just can’t do it from within a running system. Instructions follow:

  1. add the lenny (or later) repositories to /etc/apt/sources.list and run `apt-get update`
  2. reboot into a lenny (or later) install CD
  3. choose advanced -> rescue mode
  4. answer the usual install/config questions – don’t worry about networking for now
  5. execute a shell in the target environment *
  6. `ifup eth0` (or whatever you need to get networking running)
  7. `export TERM=vt100` (because bterm is badly broken)
  8. `apt-get install linux-image-2.6.26-2-686` (or whatever kernel is appropriate)
  9. reboot and do `apt-get dist-upgrade`

(*) The installer may not automatically mount your root partition – if so then you won’t be able to execute a shell in the target environment. In that case:

  1. start a shell in the installer environment
  2. mount your root partition somewhere by hand (this may be nontrivial if you’re using LVM!)
  3. cd into it
  4. `mount -t proc proc proc`
  5. `chroot .`

You now have a shell in the target environment.
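
Put together, the by-hand version looks roughly like this, assuming an LVM root called vg0-root (adjust names to your own layout):

# activate LVM volume groups, if your root lives on one
vgchange -ay
# mount the root filesystem and chroot into it
mount /dev/mapper/vg0-root /target
cd /target
mount -t proc proc proc
chroot .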

Fresh linux install not booting? Don’t get mad, get googling.

The main hard drive in my shitebox HP media centre PC died over the weekend, taking with it a significant collection of TV episodes. That didn’t bother me too much, as I rarely rewatch old TV, but it also meant that I couldn’t download any new TV, and that’s just intolerable when BSG, Dollhouse and Terminator are all airing in the States on the same night. My first reaction was to install Linux on a semi-spare USB-connected hard disk. One copy of the Lenny installer on a USB key later, and a base system was up and configured on an unclean partition.*

First problem: the bios won’t boot from USB hard drives, even though it boots from flash drives without a second thought. And no, it won’t boot from FireWire either. So, out with the screwdriver and the disk is transplanted into the main machine where the dead HD had been futilely spinning and radiating heat.

Second problem: the replacement drive is PATA, not SATA, and the mobo only has one PATA port. Not a problem, I wasn’t using the LightScribe DVDR for anything anyway…

Third problem (and this was the killer): GRUB hangs at

GRUB Loading stage 1.5
GRUB loading, please wait...

No amount of reinstalling or reconfiguring seemed to help, and I spent quite some time on google trying to track down possible reasons. I even installed grub on a USB key to see what happened; it got as far as the boot menu, but hung immediately afterwards.

Thinking it may have been something to do with installing on an unclean filesystem, I partitioned out some free space on the same drive and started from scratch. This ended up exactly the same way, but with an added “Error 22”. Much swearing was done, some of it on Twitter, and I went to bed really late and really angry two nights in a row, because I had two fresh yet broken installs of Linux and still no Galactica.

Maybe the disk wasn’t LBA? I didn’t think it was that old, but anything was worth a shot. Firing up parted, I tried to move the big partition out of the first 1024 cylinders to make space for a tiny boot partition. Parted gave me “Unable to satisfy all constraints on the partition”, which I had never seen before. When googling for this error, I stumbled across the following:
Error: Unable to satisfy all constraints on the pa: msg#00059 gnu.parted.bugs

The line about cylinder boundary alignment tickled something in the back of my brain. So, I booted up once more using the trusty usb key, ran fdisk and sure enough the old unclean partition didn’t end on a cylinder boundary. Deleting it and recreating it with the same parameters jiggled it a few kb larger and lo! both copies of Linux were magically fixed.
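
For anyone checking their own disks: older versions of fdisk flag this directly – a plain listing prints a warning for any partition that doesn’t start or end on a cylinder boundary (device name illustrative):

fdisk -l /dev/sdb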

* Can’t be deleting the old photographs now, can we?

How to manage mailman list membership using LDAP or Active Directory

Run this perl script on your mailman server once an hour using cron. Replace MY_LDAP_SERVER etc. with your own configuration. Also, depending on your LDAP implementation you may need to use group or groupOfNames instead of posixGroup.
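
For the hourly run, a cron entry along these lines will do (the script path, cron.d filename and log file are whatever you choose):

# /etc/cron.d/ldap2mailman – sync list membership at five past every hour
5 * * * * root /usr/local/sbin/ldap2mailman.pl >> /var/log/ldap2mailman.log 2>&1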

For each list you wish to manage, create an LDAP/AD group with the email attribute set to the full address of the mailing list. The script scans all groups under the BASE_DN for any with an email address ending in @MY.LIST.SERVER. It overwrites each list’s membership with that of the corresponding LDAP group (if such a group exists, otherwise it does nothing). Make sure there is only one group for each mailing list! Multiple domain names are not supported, but could be with only a little hacking.
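
As a rough illustration, a group entry for a list called announce@MY.LIST.SERVER might look like the LDIF below. The DN, objectClasses and attributes here are assumptions – Active Directory groups carry a mail attribute natively, while a stock OpenLDAP schema may need a different or auxiliary objectClass – so adjust the objectClass in the script’s search filter to whatever your directory actually uses (the script as written expects member DNs in uniqueMember):

dn: cn=announce,ou=Groups,dc=example,dc=com
objectClass: groupOfUniqueNames
cn: announce
mail: announce@MY.LIST.SERVER
uniqueMember: uid=alice,ou=People,dc=example,dc=com
uniqueMember: uid=bob,ou=People,dc=example,dc=com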

#!/usr/bin/perl -w

use Net::LDAP;

# Connect to LDAP proxy and authenticate
$ldap = Net::LDAP->new('ldaps://MY_LDAP_SERVER') || die "Can't connect to server\n";
$mesg = $ldap->bind(
  'MY_DN',
  password => 'MY_PASSWORD'
) || die "Connected to server, but couldn't bind\n";
                                 
# search for interesting AD groups
$mesg = $ldap->search(                 
  base   => "MY_BASE_DN",
  filter => "(&(objectClass=posixGroup))"
);                     
die "Search returned no interesting security groups\n" unless $mesg;
                       
foreach $group ($mesg->entries) {
  $list_email = $group->get_value("mail");
                     
  # For groups with emails of the form "*@MY.LIST.SERVER"                            
  # Try to chop off the name of our list server. If we fail, it wasn't meant to be.
  if($list_email && $list_email=~s/\@MY\.LIST\.SERVER$//) {
                                        
    # get the membership list   
    @member_list = $group->get_value("uniqueMember");
    die "Security group for list $list_email looks empty - PANIC!\n" unless @member_list;
                
    # make a list of emails to pass to mailman
    $member_emails = "";
    foreach $member_dn (@member_list) {
      $mesg2 = $ldap->search(
        base  => $member_dn,
        filter => "(&(cn=*))",
        scope => "base"
      );
      die "Couldn't locate entry $member_dn - PANIC!\n" unless $mesg2;
      $member = $mesg2->entry(0);
      $member_emails .= $member->get_value("cn") . " <" . $member->get_value("mail") . ">\n";
    };
                
    # now update the mailman list membership
    # be verbose!
    print "\nchanging $list_email\n";
    open( PIPE, "|/var/mailman/bin/sync_members -w=yes -g=yes -a=yes -f - $list_email" )
      || die "Couldn't fork process! $!\n";
    print PIPE $member_emails;
    close PIPE;
  };
};

The history meme

Spreading the meme

serenity:~ andrewg$ history | awk '{a[$2]++} END {for(i in a)print a[i] " " i}' | sort -rn | head -10
108 telnet
63 rscreen
43 ping
24 sudo
24 ssh
20 host
16 scp
16 more
14 ls
13 xdvi

Hm. I seem to be using the command line mainly as a gateway into remote systems – which reflects my average working day. The stray ‘xdvi’ is due to my recent heavy use of TextMate to write a paper in LaTeX. Not sure why I’ve been using sudo so much on my Mac though.

Similarly, on my work Linux laptop:

andgal@nbgal185:~$ history | awk '{a[$2]++} END {for(i in a)print a[i] " " i}' | sort -rn | head -10
68 rscreen
52 ping
45 host
36 sudo
35 rdesktop
32 xrandr
21 startmenu
20 ssh
18 ifconfig
15 telnet

rscreen is merely a wrapper for ssh:

function rscreen() { /usr/bin/ssh -t $1 'screen -dr || /usr/bin/screen || /bin/bash'; }

and startmenu is a cool but dodgy hack to get into my windows virtual machine:

alias startmenu='nohup rdesktop -A -s "c:\program files\seamlessrdp\seamlessrdpshell.exe explorer.exe" 192.168.185.128 -u andgal -p xxxxxxxx &'

xrandr reflects the fact that I have to configure my dual-screen setup by hand after each boot under Ubuntu 7.10, as the GUI configurator just Doesn’t Work. I had to test this many many times. Apparently the latest Ubuntu beta fixes most of these problems.
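
The by-hand incantation is a one-liner along these lines – the output names and modes are whatever xrandr reports for your hardware:

xrandr --output LVDS --auto --output VGA --mode 1280x1024 --right-of LVDS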

Galway Linux installfest, Sat 17th Nov

Galway LUG is organising a Linux installfest on Saturday 17th from 10am-noon in the DERI building, Lower Dangan (map). This is a chance for you to bring along your old laptop/desktop and give it new purpose in life! If you have thought about trying Linux, but haven’t yet summoned up the courage, here is your chance to get some hands-on help. We will have several experienced users on hand to help you select, install and configure your first Linux.

A word of warning: if you have data on your hard drive, please BACK IT UP before bringing your machine. Galway LUG and its volunteers cannot be held responsible for loss of data. It is your responsibility to have current backups.

See you there!

Avahi and dot-local addresses on Ubuntu Gutsy

I’ve noticed a problem with avahi and *.local addresses on ubuntu gutsy – this will probably have cropped up on other distributions too, or will do soon. It is related to the similar Mac *.local problem.

It is thus: if you have avahi (aka zeroconf) installed, *.local addresses are resolved via mDNS first. The default resolver configuration (the “[NOTFOUND=return]” entry in /etc/nsswitch.conf) gives up at that point if the host is not found via mDNS. This means that you cannot resolve addresses under .local which are in DNS but not in mDNS.

To fix, edit /etc/nsswitch.conf and remove the text “[NOTFOUND=return]” as follows:

hosts:          files mdns4_minimal dns mdns4
#hosts:          files mdns4_minimal [NOTFOUND=return] dns mdns4

You then need to restart the problem software. Avahi still works, but will fail over to standard DNS if the host cannot be resolved via mDNS.

Alternatively, you can change the default suffix that avahi uses for mDNS, by adding the following to the [server] section of /etc/avahi/avahi-daemon.conf:

domain-name=.alocal

(H/T Josh McIntyre)