Force the allocation of an elasticsearch index on restore

Today, I had occasion to delete an elasticsearch index and restore it from a snapshot. Normally this is a straightforward process, but this particular cluster has been in a yellow state for a while because all of its nodes are up against the low-disk watermark and many of its replica shards are thus sitting unallocated. We’re working on a fix for that, but in the meantime I needed to restore a small index urgently.

I only discovered this would be a problem after I tried and failed to restore the snapshot. I figured that a small index should be able to slip between the cracks, especially if I had deleted something of a similar size immediately beforehand. But the shard allocator had other ideas, and all the restored shards (both primary and replica) went straight into an unallocated state and sat there unblinking.

I found some old indexes that I could safely delete to free up a little space, but the shard allocator started work on the long-pending replica shards rather than the (surely more important!) primaries of my fresh restore. Even after setting the index priority to 1000 on the restored index, it still preferred to allocate old replicas.

I ended up forcing the allocation by hand, after combining the techniques here, here and here. The trick is to get a list of the shard numbers for the offending index and call the command “allocate_empty_primary” on each, which forces them into an allocated (but empty) state. Once they are allocated, we can then retry the restore from snapshot.

Defining BAD_INDEX and TARGET_NODE appropriately, we incant:

curl -q -s "http://localhost:9200/_cat/shards" | egrep "$BAD_INDEX" | \
  while read index shard type state; do
    if [ $type = "p" ]; then
      curl -X POST "http://localhost:9200/_cluster/reroute" -d "{commands\" : [ { \"allocate_empty_primary\": { \"index\": \"$index\", \"shard\": $shard, \"node\": \"$TARGET_NODE\", \"accept_data_loss\": true } } ] }"
    fi
  done

 

This produced an ungodly amount of output, as the shard allocator proceeded to restructure its entire work queue. But the offending index had indeed been allocated with a higher priority than the old replicas, and a repeat attempt at restoring from snapshot worked.

/var/log/ksymoops modprobe error flood with a Xen kernel

I have a small virtual server in Linode that I use for various public-facing things such as serving web pages, a small Debian repository, email etc. Unfortunately it has a bad habit of filling up /var/log – I did say it was small.

I noticed today a lot of space being used up in `/var/log/ksymoops`, every few seconds and all of the form:

xen:~# tail -f /var/log/ksymoops/20180502.log
20180502 183837 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183837 probe ended
20180502 183842 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183842 probe ended
20180502 183842 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183842 probe ended
20180502 183842 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183842 probe ended
20180502 183842 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183842 probe ended
20180502 183847 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183847 probe ended
20180502 183847 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183847 probe ended
20180502 183847 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183847 probe ended
20180502 183847 start /sbin/modprobe -q -- net-pf-0 safemode=1
20180502 183847 probe ended

What’s more, these log files are not subject to logrotate by default, so I have them going back years, for as long as the VM has existed. While this is somewhat concerning, it does allow me to work out what happened.

The ksymoops logfiles are relatively sparse (a few lines every month or so) up until the date last year that I dist-upgraded the server from Debian 8 to 9. Then the dam broke and it has been throwing a batch of four errors like the above every five seconds since, day and night. Something somewhere is calling modprobe in a very tight loop, and it appears to be related to a package upgrade.

I managed to track down the offending process by replacing /sbin/modprobe with a script that called `sleep 100000` and running `ps axf`. It turns out that it had been called directly from one of the kworkers. If all four kworkers (2xHT) are doing the same thing, that would explain the pattern of errors.

Remember that this is in Linode – their VMs run under Xen paravirtualisation with non-modular kernels. Which means that modprobe, lsmod etc. have no effect – they crash out with the following error if you try to use them:

xen:~# lsmod
Module Size Used by Not tainted
lsmod: QM_MODULES: Function not implemented

This is only to be expected, as the VM’s modutils won’t talk to the Xen para-v kernel. But why is a para-v kernel worker triggering modprobe, when failure is guaranteed?

Luckily, I don’t have to care.

ksymoops is non-optional, but it can only write its logfiles if `/var/log/ksymoops` exists. The solution is simple.

rm -rf /var/log/ksymoops

Peace and quiet.

A universal layout for grid-symmetrical keyboards

In a previous post, I mentioned the Keyboard.io Model 01. One thing that slightly worried me about it was the keyboard layout of the prototype – it was yet another symmetric-grid arrangement similar to, but distinct from, each of the other Kinesis/Maltron layouts already out there. We are promised of course a fully programmable model, which means I can fix it up just the way I like, but it would be really nice if modern keyboards would pick a standard and stick to it – not least that maybe the keyboard legends would be useful. 😉

So it was with great interest that I saw they are requesting feedback before finalizing the key layout. This post is my contribution.

Keymaps, arrangements and layouts

In the following I use “keymap” for an OS-level mapping of “physical” scancodes to “logical” code points, and “arrangement” for the physical location of the keys that generate particular scancodes. When taken together, these form a “layout”, which is normally represented by the labels on the physical keys. One can usually change the keymap in the OS (e.g. via a menu in the taskbar) so that the layout no longer matches the key labels. On some keyboards (such as the Kinesis Advantage and keyboard.io), one can effectively change the physical arrangement in firmware so that the same OS keymap produces a different layout.

Alternative keyboard arrangements

Most keyboards manufactured today follow the pc104 (USA) or pc105 (everywhere else) paradigm, which derives from the original Scholes typewriter, and was set out in the 1980s with the IBM XT. Many keyboards since then have either added extra keys (e.g. multimedia controls) or left out a few (e.g. laptops) but the basic arrangement remains the same. This is notably asymmetrical, with the keys located along staggered diagonals and the right hand (for touch typists) having significantly more keys to cover than the left.

By contrast, in the Maltron keyboard and its various grid-symmetrical derivatives, this basic arrangement has been changed so that the left and right hands are equal and the staggered diagonals have been made vertical to match the natural movement of the fingers. This has required a number of changes to the pc105 layout to rebalance the key arrangement. In most such grid-symmetrical keyboards, a selection of keys (usually modifiers) have been moved under the thumbs and the total number of keys under the fingers reduced to (typically) two 6-by-4 grids, sometimes with extra keys under the “z” row and/or between the hands. There are exactly 48 keys on a pc105 keyboard that produce printable, non-whitespace characters, so they would fit perfectly into 2x6x4 if all the whitespace and modifier keys could be relocated elsewhere, e.g. under the thumbs. However the shift keys in particular have often been kept under the little fingers in grid keyboards for user familiarity, requiring further compromises elsewhere. Unfortunately these are almost always made with only us_ascii in mind.

When keymaps go bad

The default Kinesis Advantage arrangement (for example) is tolerable under a pc105 us_ascii keymap, but is nasty under alternative keymaps, as it differs too much from the traditional 105-key arrangement and so breaks too many assumptions that most keymaps are designed under (e.g. that “[” is to the immediate right of “p”). It leaves the shift, tab and caps-lock keys inside the core 2x6x4 grid and moves four printables to a row below “z” along with the cursor keys. It also rearranges some of the remaining symbol keys to slightly non-standard positions. This is where the trouble starts.

Kinesis Advantage default arrangement + us_ascii keymap:

   = 1 2 3 4 5    6 7 8 9 0 -
 tab q w e r t    y u i o p \
 cap a s d f g    h j k l ; '
 lsh z x c v b    n m , . / rsh
     ` § <-->      ^--v [ ]

(NB the key I’ve labelled “§” is the “international key” that is normally located to the left of “z” outside the USA and produces a variety of symbols depending on your exact keymap)

It’s only if you speak English that these innocent rearrangements are mere “symbol keys”. In most other language keymaps, one or more of these scancodes maps to an accented letter. And if the key you were expecting to be at the right of “p” disappears to somewhere below “.” your touch typing is screwed.

Consider for example what happens if we ask the OS to apply the us_dvorak keymap instead of us_ascii:

   ] 1 2 3 4 5    6 7 8 9 0 [
 tab ' , . p y    f g c r l \
 cap a o e u i    d h t n s -
 lsh ; q j k x    b m w v z rsh
     ` § <-->      ^--v / =

A Dvorak touch typist (particularly a programmer!) expects “/” to be the key to the right of “l”. Kinesis try to get around this by recommending their users not to use their OS-supplied Dvorak keymap, but instead use a hotkey to change the keyboard’s firmware arrangement to another custom one that they provide. This however has just as many oddities as their QWERTY arrangement:

   = 1 2 3 4 5    6 7 8 9 0 -
 tab ' , . p y    f g c r l /
 cap a o e u i    d h t n s \
 lsh ; q j k x    b m w v z rsh
     ` § <-->      ^--v [ ]

Sorry, but the key to the right of “s” in Dvorak should be “-“. For touch-typing, this is even worse than losing “/”!

Dvorak is bad enough, but in other language keymaps the scancodes just to the right of “0” and “p” are even more vital. In a Scandinavian keymap, the key to the right of “p” should be “Å”. In Italian, this should be “è”, and in German it should be “Ü”. And now look at how many accented letters the Hungarian layout requires.

Most other symmetric-grid arrangements make similar errors. Here’s Maltron with us_ascii:

   1 2 3 4 5 6    7 8 9 0 [ ]
   ` q w e r t    y u i o p \
   § a s d f g    h j k l ; '
     z x c v b    n m , . / 
           -        =

Again, the key to the right of “p” has gone walkies, as have the keys to the right of “0”, which have disappeared into the extra row. On the bright side, the keys to the left of “q” and “a” are used relatively sensibly.

One commonality between all these imperfect arrangements seems to be a desire to keep the square brackets “[]” together on the keyboard. But for touch-typists, particularly those who speak a language other than English, it is much more important that “[” is to the right of “p” and above ” ‘ “. But all is not lost!

A universal keyboard arrangement

The following 2x6x4 arrangement minimizes the pain across a wide selection of standard European language keymaps:

    1 2 3 4 5 6   7 8 9 0 - = 
    ` q w e r t   y u i o p [
    \ a s d f g   h j k l ; '
    § z x c v b   n m , . / ]

This only relocates three keys when compared with pc105 — “`”, “]”and “\”. The first is only moved by one position and the second by two, but the third unfortunately has to move the whole way to the opposite side of the keyboard — we just don’t have enough spare keys under the right hand to do otherwise. Considering though that this is the key most inaccessible to touch-typists on a pc105 keyboard, its new position could be considered an improvement (and for UK keymap users, its placement beside the “international key” is entirely sensible!). Other than these three changes, no surprises lie in store for touch typists — and the only ones likely to find a letter (rather than a symbol) at the wrong side of the keyboard entirely are the Hungarians.

Sorry.

Note that the positions of “-” and “=” have been preserved by left-shifting the number row (as per Maltron). This is not as far-fetched as it seems — old-school touch typists were taught to hit “5” and “6” with their left index finger etc., because the number row on pc105 is shifted almost a whole key width to the left of the home row due to column stagger. The placement of the numbers is also particularly suited to keyboards that have the function keys in an embedded layer, as F1 can be trivially mapped onto 1, F2 onto 2 etc. without running out of keys for F11 and F12.

Also note that “[]” are not totally disassociated — they remain close together and symmetrically arranged around the home row little finger position. The slight inconvenience for us_ascii users here should be weighed against the vast increase in usability for non-English keymap users.

(BTW it is relatively easy to apply this arrangement to existing programmable keyboards such as the Kinesis Advantage, although in that particular case it is easier to keep the number row in the position that matches the labels and live with “=” being in an odd position).

A plea to keyboard.io

You don’t have to use the exact arrangement I suggest above, but if you don’t then please take into account that not everyone uses us_ascii, and being able to change keymaps in the OS and touch type is a desirable feature for many people, particularly those who work in more than one language.

And maybe if we do find a key arrangement that works acceptably for everyone, it could become a de facto standard for grid-symmetrical keyboards — and everyone will ask for a “keyboard.io layout” in the future…!

Indistinguishability Obfuscation, or how I learned to stop worrying and love DRM

There’s a lot of hype running around about this:

https://www.quantamagazine.org/20150902-indistinguishability-obfuscation-cryptographys-black-box/
https://www.quantamagazine.org/20140130-perfecting-the-art-of-sensible-nonsense/

Lots of excitable talk of “perfect security” and other stuff. One of the possible applications is supposedly quantum-resistant public-key crypto. But if you read into it, it’s actually a way of making code resistant to decompilation. So instead of creating a secret number that’s processed by a well-known algorithm, you create a secret algorithm that you can ask other people to run without fear of it being reverse engineered. So the “public-key” crypto is really shared-secret crypto with the secret sealed inside an obfuscated algorithm.

In other words, it’s bulletproof DRM. deCSS is even referenced (obliquely) in one of the articles as a use case.

Of course, this makes it in principle impossible to test code for malicious behaviour. You could insert a latent trojan into it and never be discovered, and it removes one of the most important security features of security software – auditability of the algorithm. For example, someone could write a rot13 algorithm and call it “encryption” and the only way to (dis)prove it would be to run a statistical analysis on the ciphertext.

So the question becomes – why would anyone allow IO programs to run on their systems? Virus scanners would be useless in principle. Performance, even in the most optimistic case, would be dreadful. And it doesn’t do anything for the end user that can’t be achieved by traditional crypto (barring the development of a quantum factoriser, and even that is not yet certain). No, the only people who gain are the ones who want to prevent the next deCSS.

warning: connect to Milter service unix:/var/run/opendkim/opendkim.sock: No such file or directory

I currently run a postfix mailserver and have souped it up to use all the latest security features (see Hamzah Khan’s blog for a good tutorial). One thing that had been bothering me though was the appearance of the above milter connection failures in the logs – even though these seemed to fail gracefully it was a worrying sign that something was Just Not Right.

After a lot of trial and error, it seems that the culprit is my postfix chroot jail. I had originally attempted to compensate for this by defining “Socket /var/spool/postfix/var/run/opendkim/opendkim.sock” in /etc/opendkim.conf, but even so, postfix was throwing errors (and no, putting the socket in the standard location doesn’t work – I tried that!). It turns out that postfix sometimes attempts to connect to the socket from inside the jail, and sometimes from outside. The solution is to create a soft link in the standard location pointing to the real socket inside the jail.

Of course I could have reconfigured it to bind to a localhost port instead, but the soft link was less work.

The keyboard.io Model 01 vs the Kinesis Advantage


Several years ago I took the plunge and bought a Kinesis Advantage (after even more years of lusting) and it’s still my favourite keyboard (so far) and the one that I have used exclusively in work ever since (every new work colleague at some point says “how do you type on THAT…?” in a tone of voice somewhere between suspicion and awe). Even so, I have found over time that it does have its issues. The tiny escape key is probably the most annoying, as well as the general rubbery crapness of the entire function key row. I find it very easy to accidentally trigger the embedded keypad, which is not as immediately noticeable as you might think. The number keys are also surprisingly difficult to type on without changing hand position.

I also found that I needed to do a LOT of remapping to get keys in a fully ergonomic position (the shift keys are by default under the pinkies rather than the thumbs, for example). And if I remove it from power for too long it forgets and I need to do the whole dance again (I have the steps stuck to the underneath in case I forget). It would be nice if this was scriptable, or unnecessary.

I did once disassemble an IBM model M with the intent of hotwiring it into a more sensible arrangement, but was put off by the difficulty of performing surgery on the underlying membrane circuitry. I think the bits are still in a box under the sofa on my mum’s landing…


The keyboard.io Model 01 is almost exactly the keyboard I envisioned at the time but didn’t have the patience to see through. It’s good to see someone finally build what I was dreaming of all those years ago (and after seeing how much work went into it, I’m sort of relieved it wasn’t me!). They are currently making tons of money on kickstarter, so it looks like full steam ahead. They plan to have the first shipments next year, and I plan to have one in my home office soon thereafter (the Microsoft Natural currently in there just can’t compete with the Kinesis).

In other cool keyboard news, it looks like waytools.com are nearly ready to start shipping their dinky TextBlade Bluetooth keyboard. I may have one of those on preorder too…

Openvpn “WARNING: Bad encapsulated packet length from peer”

xkcd979

I run a VPN from my Linode VM for various reasons, the most important of which is so that I and other family members can submit email over SMTP without having to worry about braindead networks that block outgoing port 587 for makey-uppey “security” reasons. Since my brother and I both have jobs that entail connecting to random corporate wireless networks, this is critical.

The problem was that I was running openvpn over the standard port 1194, which is also blocked by many networks – including my own employer’s. Openvpn uses a mock-HTTP protocol that will work over HTTP proxies, so I configured squid on the server’s port 8080 to forward packets to localhost:1194 and told the laptop openvpn client to use myserver:8080 as a proxy.

This worked well for my employer’s network, but did not agree with the guest wireless network of one of my clients, which had absolutely no problem with port 1194, but uses its own transparent proxy that doesn’t play nice with daisychained proxies. I kept having to comment and uncomment the proxy directive in my laptop’s openvpn.conf and restart, depending on location.

So I decided to do it the proper way, by connecting directly to openvpn on port 8080. My employer’s network would allow this through directly, and the client’s network should route through its transparent proxy without complaint. I don’t want to turn off port 1194 though, as this would rudely nobble all my brother’s devices, so I configured the server’s iptables to masquerade 8080->1194. I could then remove the proxy config from the laptop, change its connecting port to 8080 and restart the vpn client.

Problem solved! Except then I started getting the following error in my server logs:

Apr 28 13:02:43 xxx ovpn-server[13110]: xxx.xxx.xxx.xxx:57458 WARNING: Bad encapsulated packet length from peer (17231), which must be > 0 and <= 1560 -- please ensure that --tun-mtu or --link-mtu is equal on both peers -- this condition could also indicate a possible active attack on the TCP link -- [Attempting restart...]

It turned out this was being generated by another client which had also been configured to use the proxy, but which had slipped my mind. The error stems from the client connecting to an openvpn port directly but sending requests formatted for a web proxy. Not sure why it shows up as an MTU error, but changing the other client config to match the laptop solved it.

Web of Trust vs Certificate Authorities – a hybrid approach

The only thing in engineering worse than a single point of failure is multiple single points of failure. A SPOF is a component that the entire system depends upon, and which can thus bring down the entire system single-handedly. MSPOFs are a collection of components any one of which can bring down the entire system single-handedly. The X509 certificate architecture is such a system.

The job of a public key encryption system is to transform the task of securely sharing secret encryption keys into one of reliably verifying publicly-available certificates. Any form of reliable verification must necessarily involve out of band confirmation of the expected data – if an attacker has compromised our communications channel then any confirmation could be faked just as easily as the original communication. Out of band confirmation requires an attacker to simultaneously compromise multiple independent communications channels – this is the rationale behind sending confirmation codes to mobile phones, for example.

The competing verification models

Public key encryption systems typically rely on one of two out of band methods – an Authority or a Web of Trust. An Authority is a person or organisation that is assumed to be both well-known and trustworthy. If a chain of signatures can be established from an Authority to the certificate in question, then the trustworthiness of that certificate is assumed. By contrast, a Web of Trust requires a chain of signatures to be established from each user to the certificate in question – each user acting as his own Authority.

The out of band confirmation of an Authority relies on the shrinkwrap software model – assuming that your software has been delivered via a trustworthy method, then any Authority certificates that were included with your software are by implication verified. A Web of Trust typically relies on personal knowledge – users are assumed to only have signed the certificates of those people they either know personally or have otherwise verified the identity of offline. In this case, “out of band” means “face to face”.

In addition, some PKI systems allow for multiple signature chains to be defined – this is typical in Web of Trust models where the intermediate certificates are usually controlled by ordinary users whose reliability may be questionable. Multiple chains mitigate this risk by providing extra confirmation pathways which collectively provide greater assurance than any single path.

In a Web of Trust model, each user is responsible for cultivating and maintaining his own outgoing signature chains. This means continually assessing the reliability of the downstream certificates, which even with the help of software tools requires a nontrivial amount of work. The reward for this effort is that the reliability of a well-maintained Web of Trust can be made arbitrarily high as more independent chains are established. An Authority model also requires a maintenance overhead, but Authorities are typically large organisations with well-paid staff, and the certificate chains are much shorter. Authorities also tend to use automated verification methods (such as emails) which can be used by an attacker to escalate from one form of compromise to another.

An Authority is thus more easily attacked than a well-maintained Web of Trust, but less easily attacked than a badly-maintained one, and more convenient for the end user.

Why X509 is broken, and how to fix it

X509 has several points of design weakness:

  1. CA distribution is done via shrinkwrap software – but since almost all software is distributed over the internet this is easily subverted.
  2. Only one certificate chain may be established for any certificate – multiple incoming signatures are not supported, so every chain has a single point of failure.
  3. All CAs have the authority to sign any certificate they choose – therefore if one CA certificate is compromised the attacker can impersonate any site on the internet. Thus every certificate verification has multiple single points of failure.

The first two flaws can be addressed by incorporating features from the Web of Trust model:

  1. Web of Trust for Authorities – browser vendors and CAs should agree to sign each other’s signing certificates using an established WoT, thus mitigating against the distribution of rogue certs. Signing certificates not verifiable via the WoT should be ignored by client software, and technical end users can verify the integrity of the entire CA set using established methods.
  2. Multiple signature chains – each public site on the internet should be expected to provide multiple independently signed copies of the same certificate. A site providing only one signature should be treated as insecure by client software.

The third flaw can be addressed by limiting the validity of individual CA certificates. Each signing certificate should contain a field restricting its signatures to a subset of the internet using a b-tree method. A given CA certificate would state that it may only be used to verify leaf certificates whose names, when hashed using SHA256, match a particular N least-significant-bits, where N increases as we travel down the signature chain. Client software should invalidate any signatures made outside the stated b-tree bucket, and deprecate any signing certificate whose bucket is larger than an agreed standard.

(It has been suggested that CAs be geographically restricted, however it is unclear how this could be enforced for non-geographic domains such as .com, and may conflict with a solution to problem 2.)

With these improvements, the Authority model would organically gain many of the strengths of the Web of Trust model, but without imposing a significant burden upon end users.

Generating a random password

A tool that I wrote last year to generate random passwords, and have since found unbelievably useful. Save it in a shell script and use at will. It takes one optional parameter, which is the password length (it defaults to 12 chars), and produces a typable password without problematic characters (such as quotes) that some badly-configured websites choke on.

#!/bin/bash
if [[ -n $1 ]]; then 
  len=$1 
else 
  len=12 
fi 
< /dev/urandom tr -dc \
  _\!\@\#\$\%\^\&\*\(\)\<\>,.:\;+\-=\[\]\\/\?\|\~A-Za-z0-9 \
  | head -c$len 
echo

Reassociating old Time Machine backups

In an attempt to get myself cheap remote backups over the internet, I bought a Raspberry Pi kit and set it up as a hackintosh Time Capsule by attaching my USB backup disk to the Pi. I however wanted to keep my existing backup history, so instead of using a fresh Linux-formatted partition (like a clever boy) I tried to get the Pi to use my existing HFS+ filesystem. Anyone interested in trying this should probably read about Linux’s flaky HFS+ user mapping and lack of journaling support first, and then back away very slowly. I blame this for all my subsequent problems.

After some effort I did get my aging Macbook to write a new backup on the Pi, but I couldn’t get it to see the existing backups on the drive. Apple uses hard links for deduplication of backups, and because remote filesystems can’t be guaranteed to support them it uses a trick. Remote backups are written not directly on the remote drive, but into a sparse disk image inside it. Thinking that it would be a relatively simple matter to move the old backups from the outer filesystem into the sparsebundle, I remounted the USB drive on the Mac (as Linux doesn’t understand sparsebundles, fair enough).

The Macbook first denied me the move, saying that the case sensitivity of the target filesystem was not correct for a backup – strange, because it had just created the sparsebundle itself moments before. Remembering the journaling hack  I performed “repair disk” on both the sparsebundle and then the physical disk itself. At this point disk utility complained that the filesystem was unrecoverable (“invalid key length”) and the physical disk would no longer mount. In an attempt to get better debug information from the repair, I ran fsck_hfs -drfy on the filesystem in a terminal. This didn’t help much with the source of the error, but I did notice that at the end it said “filesystem modified =1”. Running it again produced slightly different output, but again “filesystem modified =1”. It was doing something, so I kept going.

In the meantime, I had been looking into ways of improving the backup transfer speed over the internet. I originally planned to use a tunnel over openvpn, but this would involve channeling all backup traffic through my rented virtual server, which might not be so good for my bank account. I did some research into NAT traversal, and although the technology exists to allow direct connections between two NATed clients (libnice), I would have to write my own application around it and at this point I was getting nervous about having no backups for an extended period. I had also been working from home and getting frustrated with the bulk transfer speed between home and work, and came to the conclusion that my domestic internet connection couldn’t satisfy Time Machine’s aggressive and inflexible hourly backup schedule.

Six iterations of fsck_hfs -drfy later, the disk repair finally succeeded and the backup disk mounted cleanly. At this point, I decided a strategic retreat was in order. I went to set up Time Machine on the old disk, but it insisted that there were no existing backups, saying “last backup: none”. Alt-clicking on the TM icon in the tray and choosing “Browse Other Backup Disks” showed however that the backups were intact. While I could make new backups and browse old ones, they would not deduplicate. As I have a large number of RAW photographs to back up, this was far from ideal. There is a way to get a Mac to recognise another computer’s backups as its own (after upgrading your hardware, for example) . However, it threw “unexpectedly found no machine directories” when attempting the first step. It appeared that not only did it not recognise its own backup, it didn’t recognise it as a backup at all.

After a lot of googling at 2am, it emerged that local Time Machine backups use extended attributes on the backup folders to store information relating to (amongst other things) the identity of the computer that had made the backup. In my earlier orgy of fscking, the extended attributes on my Mac’s top backup folder had been erased. Luckily, I still had the abandoned sparsebundle backup in the trash. Inside a sparsebundle backup, the equivalent metadata is stored not as extended attributes, but in a plist file. In my case, this was in /Volumes/Backups3TB/.Trashes/501/galactica.sparsebundle/com.apple.TimeMachine.MachineID.plist, and contained amongst other bits and bobs the following nuggets:

<key>com.apple.backupd.HostUUID</key>
<string>XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX</string>
<key>com.apple.backupd.ModelID</key>
<string>MacBookPro5,1</string>

These key names were a similar format to the extended attributes on the daily subdirectories in the backup, so I applied them directly to the containing folder:

$ sudo xattr -w com.apple.backupd.HostUUID XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX /Volumes/Backups3TB/Backups.backupdb/galactica
$ sudo xattr -w com.apple.backupd.ModelID MacBookPro5,1 /Volumes/Backups3TB/Backups.backupdb/galactica

After that was fixed, I could inherit the old backups and reassociate each of the backed up volumes to their master copies:

$ sudo tmutil inheritbackup /Volumes/Backups3TB/Backups.backupdb/galactica/
$ sudo tmutil associatedisk -a / /Volumes/Backups3TB/Backups.backupdb/galactica/Latest/Macintosh\ HD/
$ sudo tmutil associatedisk -a /Volumes/WD\ 1 /Volumes/Backups3TB/Backups.backupdb/galactica/Latest/WD\ 1/

The only problem arose when I tried to reassociate the volume containing my photographs. Turns out they had never been backed up at all. They bloody well are now.


 

So what happened to my plan to run offsite backups? I bought a second Time Machine drive and will keep one plugged in at home and one asleep in my drawer in work, swapping them once a week. This is known as the bandwidth of FedEx.