Archive for January, 2006

The Cardbox email archive

18 January 2006

It’s done: I’ve extracted 46,000 emails from our archives and imported them into a Cardbox database. The whole job took 8 minutes, and every word in every email is indexed for fast retrieval. The Cardbox database is 231MB in size.
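For readers wondering what "every word is indexed" means in practice, here is a minimal sketch of a word index in Python. It is only an illustration of the general idea and not how Cardbox itself stores its indexes; the function name `build_index` and the sample emails are invented for the example.

```python
import re
from collections import defaultdict


def build_index(emails):
    """Map each lower-cased word to the set of email numbers containing it."""
    index = defaultdict(set)
    for number, text in enumerate(emails):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(number)
    return index


# Hypothetical sample data, standing in for the real archive.
emails = [
    "Invoice attached for January",
    "Lunch on Friday?",
    "January invoice reminder",
]
index = build_index(emails)

# Looking up a word is now a single dictionary access,
# however many emails there are.
print(sorted(index["invoice"]))
```

The index is built once, at import time; after that a search does not need to read the emails themselves at all.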

I won’t be posting a public spam database yet, because of the trouble of going through it to make sure that nothing private has found its way in accidentally. But if anyone wants to see one, let me know: I’ve now set up a dummy email address specifically for spam and nothing else, and once the spam starts coming in I’ll look at making a Cardbox database out of it.

What goes into an email archive?

12 January 2006

If we are going to create an email archive using Cardbox, the first thing to decide is what goes in it. Should an archive be an exact copy of everything that originally arrived by email?

A quick tool for Unicode input

10 January 2006

Unicode® is one of the great achievements of the human intellect. Combining computer science with the expertise of countless academic specialists, it provides a numeric code for every possible character in every language, living or dead. Here are some examples (if your computer has the necessary fonts):

  • Ħija taġhni ħawħa u qalli: "Ħa, ħi, ħudu u ħawilla fil-ħamrija ħamra taħt il-ħitan ta'Ħararaw"
  • Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։
  • μη μαν ασπουδι γε και ακλειως απολοιμην
    αλλα μεγα ῥεξας τι και εσσομενοισι πυθεσθαι
  • Wędzony łosoś

There is just one snag. There is no general, consistent way of typing any chosen Unicode character into Windows.

Some programs do it one way, some do it another. Some programs don't do it at all (including Windows Explorer, Internet Explorer, and Mozilla Firefox). With those, you have to install a keyboard for the language that your chosen character comes from and work out by trial and error which key means what character. Imagine doing that with Janáček, Lutosławski, Beyoğlu and Mohorovičić; and of course many characters such as ✔ don't occur on any keyboard at all.

Cardbox has long had a solution for this, which extends the Alt+number convention that has been around since the days of the IBM PC. Now we've made a separate utility program that lets you use this simple convention to type any Unicode character you like into any program that accepts Unicode input. The utility is available for download here. At 100KB, including a full installation program, it must be one of the smallest Windows utilities around. The core DLL file that actually does all the work is only 7KB in size.
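The idea behind any Alt+number scheme is that the digits you type are simply the numeric code of the character you want. The Python sketch below shows that mapping only; it says nothing about how the utility hooks into the keyboard, and whether the digits are read as decimal or hexadecimal is an assumption here, so the hypothetical helper `char_from_code` accepts either.

```python
def char_from_code(digits: str, base: int = 10) -> str:
    """Return the Unicode character whose code point is written in `digits`."""
    return chr(int(digits, base))


# 269 (decimal) is U+010D, the "č" in Janáček.
print(char_from_code("269"))

# The same character can be named in decimal or in hexadecimal:
# 10004 decimal and 2714 hex are both U+2714, the check mark.
print(char_from_code("10004"))
print(char_from_code("2714", base=16))
```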


We are giving this program away free:

  1. For the greater glory of Cardbox.
  2. To make life easier for Unicode users everywhere.
  3. To spread the use of the simple "Alt+." convention for Unicode, in the hope that one day all Windows programs will (like Cardbox) support it without needing a special utility.

Download it, use it, enjoy it, and tell your friends.

Cardbox for email archives?

10 January 2006

Jack Schofield, in his Guardian blog, asks “How do you back up your e-mail?” He mentions AskSam, which I remember from more than a decade ago, and he gives an interesting reason for that choice: it had import filters for the email system he was using at the time.

We often choose programs not for what the developers think of as the “core functionality” (good indexing, for example) but for something peripheral – often literally peripheral, such as a simple interface to a source of data.

Thus, for example, in the days of the BIX bulletin board, I had a “glue program” (in modern terms, a filter) that took all the messages that I had read and translated them into a format that Cardbox-Plus could read.

Similarly, some medical researchers download abstracts from online services such as Medline and store them, indexed, as Cardbox archives. If you have Cardbox installed (the free client is here) then follow this link to look at a small Abstracts sample database.

So – what about email? We use Eudora and we now have nine years’ worth of emails that are sitting in Eudora mailboxes and accessible only through Eudora’s horrific search interface. Every time I need to search for an email I say to myself “we really must get all this into Cardbox”. But of course the moment when you have at last found the email you are looking for is not the moment to drop everything and start a development project – and the urge passes…

… until now. I am tired of not being able to browse my emails, not being able to search them instantaneously, and above all I am tired of not being able to build up searches step by step: first finding all emails to or from a certain person, then looking for key words in just those emails.
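To make "step by step" concrete, here is a small Python sketch of a search that is narrowed in two stages, each stage working only on the results of the one before. It is an illustration of the idea rather than of Cardbox's own search commands; the helper names `involving` and `mentioning`, the record layout and the sample emails are all invented for the example.

```python
def involving(emails, person):
    """Keep only the emails sent to or from `person`."""
    return [e for e in emails if person in (e["from"], e["to"])]


def mentioning(emails, word):
    """Keep only the emails whose body contains `word` (case-insensitive substring)."""
    return [e for e in emails if word.lower() in e["body"].lower()]


# Hypothetical sample records.
emails = [
    {"from": "alice@example.com", "to": "me@example.com", "body": "The invoice is attached."},
    {"from": "me@example.com", "to": "alice@example.com", "body": "Thanks, see you Friday."},
    {"from": "carol@example.com", "to": "me@example.com", "body": "Invoice overdue!"},
]

step1 = involving(emails, "alice@example.com")  # everything to or from Alice
step2 = mentioning(step1, "invoice")            # then narrow within that selection
print(len(step1), len(step2))
```

Because each step returns an ordinary list, you can look at the intermediate selection before deciding how to narrow it further.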

I am going to rise to Jack Schofield’s challenge and write an import filter that will extract emails from all those years of Eudora archives and put them into a Cardbox database. When it’s finished, the filter will be available free to all Cardbox users. As the development progresses I’ll put a sample database on our Cardbox Server so that everyone can see the archived emails (although, for reasons of confidentiality, I’ll only put our spam mailbox into that public archive).
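As a first rough idea of what such a filter has to do, here is a Python sketch that splits a mailbox file into separate messages and pulls out the fields worth indexing. It is not the filter itself, which has not been written yet. It assumes that the mailbox is plain text in the mbox style, with each message introduced by a separator line beginning with "From "; the function names, the record layout and the sample mailbox are all made up for the illustration.

```python
import email
from email.message import Message


def split_mailbox(text: str) -> list[Message]:
    """Split mbox-style text into messages at lines that start with 'From '."""
    messages = []
    current = []
    for line in text.splitlines(keepends=True):
        if line.startswith("From "):
            # Separator line: finish the previous message, if there was one.
            # The separator itself is not part of any message.
            if current:
                messages.append(email.message_from_string("".join(current)))
            current = []
        else:
            current.append(line)
    if current:
        messages.append(email.message_from_string("".join(current)))
    return messages


def to_record(msg: Message) -> dict:
    """Extract the fields to archive; multipart bodies are left empty in this sketch."""
    body = "" if msg.is_multipart() else msg.get_payload()
    return {
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "subject": msg.get("Subject", ""),
        "body": body.strip(),
    }


# Hypothetical two-message mailbox.
SAMPLE = (
    "From ???@??? Tue Jan 10 09:00:00 2006\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Hi there\n"
    "From ???@??? Wed Jan 11 10:30:00 2006\n"
    "From: bob@example.com\n"
    "To: alice@example.com\n"
    "Subject: Re: Hello\n"
    "\n"
    "Hi yourself\n"
)

records = [to_record(m) for m in split_mailbox(SAMPLE)]
print(len(records), records[0]["subject"])
```

A real filter would also have to cope with attachments, character encodings, and body lines that happen to begin with "From ", none of which this sketch attempts.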

Already, when such a filter is still only an idea, I can see other applications: our mail server could activate the filter automatically whenever an email comes in or goes out, and add it to an online copy of the archive database. The entire archive would then always be up to date and accessible from anywhere on the Internet.

I’ll post as much as I can of the design process here in the blog, so watch this space.