Using Calibre to catalogue a physical speculative fiction book collection

The problem

I have a large collection of paper books, which I am slowly entering into Calibre as empty book records, so that I can have a more easily accessible record of what I do and do not own.

I initially thought that I could make the data capture trivial by scanning the ISBN barcodes of the books with a phone app. Then I would enter a list of ISBNs into Calibre, push a button to fetch metadata for them automatically, and everything would Just Work, like magic.

I was wrong for various reasons:

  • A lot of my books predate the existence of ISBNs.

  • I want my virtual book records to match my physical books as closely as possible -- but an ISBN is not an unambiguous identifier of a particular edition of a book. Multiple editions with completely different covers can share the same ISBN.

  • Calibre wasn't really designed for this kind of pedantic cataloguing of physical objects, so it doesn't really care about these distinctions. It also helpfully smushes metadata records together if their ISBNs match, and there is no way to make it stop.

  • The default public metadata sources that Calibre uses don't even have metadata for the vast majority of old editions of SFF books.

The solution

A really good source of metadata for old SFF books is ISFDB, the Internet Speculative Fiction Database. And there is a plugin for Calibre which scrapes metadata from it! Unfortunately it hasn't been updated in several years, and ISFDB's HTML periodically changes. So I am maintaining a fork, which I tweak whenever I enter a new batch of books and discover that something has broken. I have also started a reimplementation, in which I hope to include all the little bits of data entry glue that I am about to describe.

Edit: I am now focusing on the reimplementation -- the fork is pretty much abandoned.

My current workflow

  1. I take a pile of books and look them up on ISFDB in my browser (Firefox). I usually process one author at a time, since it's the most efficient way to find multiple book titles at once. On each title page I look for the specific edition which most closely matches the physical copy I have, and open its publication page. Each publication record is uniquely identified by an ISFDB ID, which appears in the URL (www.isfdb.org/cgi-bin/pl.cgi?<ID>) and on the publication page itself.

  2. Previously, I would laboriously copy the ISFDB IDs from all the open ISFDB pages by hand into a text file, and then run a script to create entries in Calibre with these identifiers. The manual copying became very annoying very quickly, so I hacked together a Python script which automatically extracts the identifiers from the currently open tabs in a running Firefox session. Now I can pipe the output of this script straight into the script which creates the records.

  3. At this point I have some empty records with only the ISFDB ID set (and also some custom columns which are not related to the metadata). To avoid the record-smushing issue, I disable all metadata sources except ISFDB (important!) and then fetch metadata for all the records. If an ISFDB ID is present, my fork of the plugin will ignore all other data (like author and title) and use only the ID in its search, so as long as I found the correct records in step 1, the download is guaranteed to fetch the correct data.

  4. Now I do some manual cleanup, like fetching or correcting cover images which were missing from ISFDB.
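
The "ID wins over everything else" rule from step 3 can be sketched roughly like this. This is not the plugin's actual code: the function name plan_lookup and the shape of the returned dictionary are made up for illustration, and the https:// scheme is an assumption (only the www.isfdb.org/cgi-bin/pl.cgi?<ID> part comes from the URLs I scrape).

```python
from typing import Dict, Iterable, Optional

# Publication pages live at pl.cgi?<ID>; the https:// scheme is an assumption.
PUBLICATION_URL = "https://www.isfdb.org/cgi-bin/pl.cgi?{isfdb_id}"


def plan_lookup(
    identifiers: Dict[str, str],
    title: Optional[str] = None,
    authors: Optional[Iterable[str]] = None,
) -> dict:
    """Decide how to look a record up (hypothetical helper, not the plugin API).

    If an ISFDB ID is present, everything else is ignored and the
    publication page is fetched directly. Only records without an ID
    fall back to a fuzzy title/author search.
    """
    isfdb_id = identifiers.get("isfdb")
    if isfdb_id:
        return {"mode": "direct", "url": PUBLICATION_URL.format(isfdb_id=isfdb_id)}
    return {"mode": "search", "title": title, "authors": list(authors or [])}
```

The point of the design is that a direct lookup cannot be led astray by a mistyped title or by another edition which happens to share an ISBN.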

The Horrible Firefox Hack

Edit: The latest versions of Firefox store the session in an lz4-compressed JSON file rather than an uncompressed JSON file, which necessitates the update below (thanks, StackOverflow!). You will need to install the lz4 library for Python (for example with pip install lz4).

Edit: Recent versions of the lz4 library require you to import lz4.block explicitly.

isfdb_ids_from_firefox.py:

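#!/usr/bin/env python3
"""Print the ISFDB publication ID of every ISFDB tab open in Firefox."""
import json
import re
import lz4.block
# Edit this to point at the session file in your own Firefox profile.
firefox_session_path = "/home/confluence/.mozilla/firefox/mhxsxkg0.default/sessionstore-backups/recovery.jsonlz4"
with open(firefox_session_path, "rb") as f:
    # The file starts with an 8-byte magic header; the rest is an lz4 block.
    magic = f.read(8)
    if magic != b"mozLz40\0":
        raise SystemExit("Not a Firefox jsonlz4 session file: " + firefox_session_path)
    session = json.loads(lz4.block.decompress(f.read()).decode("utf-8"))
tabs = []
for w in session["windows"]:
    tabs.extend(w["tabs"])
for t in tabs:
    entries = t.get("entries")
    if not entries:
        continue  # a tab with no history has no URL to inspect
    # "index" is the 1-based position of the page currently shown in the tab;
    # fall back to the newest history entry if it is missing.
    url = entries[t.get("index", len(entries)) - 1]["url"]
    m = re.search(r"www\.isfdb\.org/cgi-bin/pl\.cgi\?(\d+)", url)
    if m:
        print(m.group(1))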
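
The hard-coded session path is specific to my machine and profile. As a rough sketch, the file could be located automatically instead. This assumes the usual Linux layout of ~/.mozilla/firefox/<profile>/sessionstore-backups/recovery.jsonlz4, and that the most recently modified recovery file belongs to the Firefox session you are actually using. The function name find_session_file is made up for illustration.

```python
from pathlib import Path
from typing import Optional


def find_session_file(
    firefox_dir: Path = Path.home() / ".mozilla" / "firefox",
) -> Optional[Path]:
    """Return the newest recovery.jsonlz4 under firefox_dir, or None.

    Assumption: each profile is a direct subdirectory of firefox_dir and
    keeps its live session in sessionstore-backups/recovery.jsonlz4.
    """
    candidates = list(firefox_dir.glob("*/sessionstore-backups/recovery.jsonlz4"))
    if not candidates:
        return None
    # With several profiles, guess that the most recently written file
    # is the one belonging to the running browser.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

On other operating systems the profile directory lives elsewhere, so firefox_dir would have to be passed in explicitly.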

The Record Creation Script

calibre-add-from-isfdb.sh (marvel at my consistent naming conventions):

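#!/bin/bash
# Reads one ISFDB publication ID per line from the file given as $1, or from stdin.
while read -r id
do
  [ -n "$id" ] || continue  # skip blank lines
  calibredb add -e -I "isfdb:$id"
  # The "=" prefixes ask for exact matches, so that ID 123 does not also match 1234.
  added_id=$(calibredb search -l 1 "identifiers:=isfdb:=$id")
  if [ -n "$added_id" ]
  then
    calibredb set_custom "shelf" "$added_id" "SFF"
    calibredb set_custom "read" "$added_id" "1"
  fi
done < "${1:-/dev/stdin}"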

In addition to creating a new record with the ISFDB ID filled into the identifier field with the appropriate prefix, I also set two custom columns to mark the book as read and file it in the correct category, which I have called a shelf. You can edit this to set whatever custom values you want, or remove it entirely.

Assuming that both scripts are executable and on your PATH, you can put them together like this:

isfdb_ids_from_firefox.py | calibre-add-from-isfdb.sh
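
For anyone who would rather avoid bash, the record creation half can be sketched in Python as well. This is an untested-against-a-real-library sketch, not something I currently run: the function names are invented, the "shelf" and "read" columns are my own custom columns, and the search query uses Calibre's exact-match "=" prefixes so that one ID cannot match as a substring of another. It assumes calibredb is on your PATH.

```python
import subprocess
from typing import Dict, Iterable, List


def add_command(isfdb_id: str) -> List[str]:
    # Create an empty book record carrying only the ISFDB identifier.
    return ["calibredb", "add", "--empty", "--identifier", f"isfdb:{isfdb_id}"]


def search_command(isfdb_id: str) -> List[str]:
    # Exact match on both the identifier type and its value.
    return ["calibredb", "search", "--limit", "1", f"identifiers:=isfdb:={isfdb_id}"]


def custom_commands(book_id: str, values: Dict[str, str]) -> List[List[str]]:
    # One set_custom call per custom column (column names are my own).
    return [["calibredb", "set_custom", col, book_id, val] for col, val in values.items()]


def add_records(ids: Iterable[str], values: Dict[str, str], run=subprocess.run) -> None:
    """Create one empty record per ISFDB ID, then fill in the custom columns.

    `run` is injectable so the command sequence can be checked without
    touching a real Calibre library.
    """
    for isfdb_id in ids:
        run(add_command(isfdb_id), check=True)
        found = run(search_command(isfdb_id), capture_output=True, text=True)
        book_id = found.stdout.strip()
        if not book_id:
            continue  # the record was not found, so there is nothing to update
        for cmd in custom_commands(book_id, values):
            run(cmd, check=True)
```

To use it as the second half of the pipeline, feed it the non-empty lines of standard input, for example add_records((line.strip() for line in sys.stdin if line.strip()), {"shelf": "SFF", "read": "1"}) after importing sys.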

Future work

In the reimplementation I hope to incorporate an entry field for ISFDB IDs and the magical Firefox session scraping directly into the plugin, so that the whole process is more streamlined, not operating system-dependent, and more usable by other people. In the meantime, if you use some flavour of Unix, you may be able to use my very messy current setup with minor modifications.

If you have comments, questions or rotten tomatoes, contact me on Twitter or file an issue against the reimplementation on GitHub (even though it currently doesn't exist).