Disk-based Caching for a Large Uniqueness Problem

Earlier this year I was tasked with creating a relational database containing all of the PubMed article data. Their web site has extensive search capabilities, but they also allow downloading the entire data set [1] for custom uses. The XML is highly redundant though – each author is listed once for every article they appear on, as is each instance of journal information, chemicals, publication types, etc. So the task involved a lot more than just parsing XML and inserting it into a database.

XMLBeans

A junior programmer on the team made an initial attempt at loading the XML using XMLBeans. That approach had severe memory issues since the DOM structure of each file was kept in memory along with many supporting objects. XMLBeans is a very powerful tool for working with XML in an OO fashion, but it quickly choked because each zipped file typically contains 50,000 citations in around 135MB of XML. I realized a SAX parser would suffice since treating the data as objects was an unnecessary luxury – we needed to read all the data, but visiting each node once was enough.
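
For illustration, a minimal SAX skeleton is roughly the following – the element names and the handleAuthor() callback are hypothetical stand-ins, since the real PubMed schema handling is more involved:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class CitationHandler extends DefaultHandler {

  private final StringBuilder _text = new StringBuilder();
  private String _lastName;

  public void startElement(String uri, String localName, String qName, Attributes attrs) {
    _text.setLength(0); // reset the character buffer for each element
  }

  public void characters(char[] ch, int start, int length) {
    _text.append(ch, start, length);
  }

  public void endElement(String uri, String localName, String qName) {
    // each node is visited once; no object graph is retained in memory
    if ("LastName".equals(qName)) {
      _lastName = _text.toString();
    }
    else if ("Author".equals(qName)) {
      handleAuthor(_lastName); // e.g. cache or insert, then discard
    }
  }

  private void handleAuthor(String lastName) { /* ... */ }

  public static void main(String[] args) throws Exception {
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new java.io.File(args[0]), new CitationHandler());
  }
}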

The biggest problem though was the redundancy:

Entity Name                   Unique Count
Citation                        14,792,864
Author                           5,852,735
Chemical                           148,082
Grant                              561,486
Journal                             16,310
Journal Info                        28,667
Journal Issue                    1,199,586
Keyword                             75,436
Mesh Term                           22,286
Mesh Term Qualifier Name                83
Publication Type                        52

Most of the entities are very manageable, but at 5.8 million, Author was a problem. Note that although there are 14.7 million Citations, these are the top-level objects – they contain or reference all the rest – and each one appears only once, so they don't need deduplication.

Hibernate

We were planning on using Hibernate for application data access, so my first pass at inserting the data into the database used our Hibernate DAOs. For each object (e.g. Author) referenced in a Citation, I'd execute a findOrCreate() method that would find the previously inserted row or create one if necessary. After running for about two full days it became apparent that it would take approximately two months on a fast multiprocessor box to finish, but that should have been obvious from the beginning. Hibernate would be fine for the runtime application but it's not the best bulk-loading solution.
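
The findOrCreate() logic was roughly the query-then-save round trip below. This is only a sketch – the AuthorDao class, the Author field names, and the HQL are hypothetical, but the shape is what made this approach so slow at 14.7 million citations:

import org.hibernate.Session;

import com.foo.bar.model.Author;

public class AuthorDao {

  private final Session _session;

  public AuthorDao(final Session session) {
    _session = session;
  }

  public Author findOrCreate(final Author author) {
    // look for a previously inserted row with the same natural key
    Author existing = (Author) _session.createQuery(
        "from Author where lastName = :lastName and foreName = :foreName")
      .setParameter("lastName", author.getLastName())
      .setParameter("foreName", author.getForeName())
      .uniqueResult();
    if (existing != null) {
      return existing;
    }
    _session.save(author); // not found, so insert a new row
    return author;
  }
}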

Caching

I played with caching since Hibernate has excellent support for it. It comes bundled with a few cache providers and is pluggable, so that seemed like the best route. The problem with caching is unique keys. Take Author as an example. A hash code alone is insufficient as a key, since two objects with the same hash code aren't necessarily equal. That's not a problem in a HashMap, but it would be for our purposes since a second author with the same cache key would be merged in with the previous one. [2]
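
As a quick illustration (using Strings rather than Authors), distinct values can share a hash code, which is exactly why a hash code can't serve as a unique key:

public class HashCollision {

  public static void main(String[] args) {
    String a = "Aa";
    String b = "BB";
    // both hash to 2112, yet the values are different
    System.out.println(a.hashCode() == b.hashCode()); // true
    System.out.println(a.equals(b));                  // false
  }
}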

My first naive approach to unique keys was to concatenate each object's fields, so for an author named "James A. Gosling" the key would be "james_a_gosling". It was rather silly in retrospect, since the keys ended up being about the same size as the data, and even with disk-based caches I still had the same crippling memory issues as before. I was running on both Windows and 32-bit Linux, so the maximum memory I could allocate to the JVM was around 2GB, and I was relatively quickly getting OutOfMemoryErrors. I could allocate a lot more on a 64-bit multiprocessor Linux box that was available, but if possible I wanted to make sure this would run on any team member's PC.
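
The key generation was essentially the following – the Author accessors here are illustrative, but they show why the keys end up roughly as large as the data they're supposed to summarize:

import com.foo.bar.model.Author;

public class NaiveKeyBuilder {

  // concatenate the (lowercased) fields with underscores, e.g. "james_a_gosling"
  public String buildKey(final Author author) {
    return (author.getForeName() + "_" + author.getInitials()
        + "_" + author.getLastName()).toLowerCase();
  }
}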

RandomAccessFile as a disk cache

I'll spare you the gory details of the few intermediate approaches that I tried and describe the eventual solution. The implementation that I settled on could certainly use more optimization, but it's a process that will run rarely (once initially, plus once a year for updates). It runs in a few hours and uses only 1-1.5GB of heap memory, so any decent developer PC should be up to the task.

What finally worked for me was an implementation modeled on java.util.HashMap. HashMap handles collisions with a private linked-list implementation: all entries with the same hash code end up in the same bucket, and equals() is called on each member of the bucket's list to find the actual keyed object. This is efficient as long as hash codes are well-distributed and the lists stay short. In the pathological case where every key has the same hash code, everything ends up in one bucket and lookups become linear time rather than constant time.

I created a helper class, DiskCache, which uses a java.io.RandomAccessFile to store serialized entities. It maintains a Map keyed by hash code (Integer) with long[] values holding the file offsets of all cached entities that share that hash code.

To store an entity, I get the array of locations for its hash code from the map, creating a new array if necessary. I move to the end of the file, write out the serialized form, and store the file offset in the location array.

To test for an entity's existence, I get the array of locations for its hash code from the map, deserialize each stored instance from the file, and call equals() on each. This is the expensive part of the process – there is a fair amount of disk thrashing here. But if the hash codes are well-distributed there won't be many instances per hash code and the number of deserializations will be manageable.

This works pretty well. The cache files get pretty big, but disk space is cheap. Memory usage is primarily the arrays of file offsets, so processing 14.7 million citations in ~50GB of XML peaks at a little over 1GB of heap. The entire process runs in four stages – caching the Citations' dependent objects, generating insert statements for those cached instances, re-reading the XML to build the Citation inserts plus the one-to-many and many-to-many inserts for the dependent objects, and finally bulk-inserting everything into the database – and takes less than 8 hours on a reasonably fast machine.

One interesting side benefit of this approach is that PubMed updates will be straightforward. The bulk loading code could be amended (I didn't implement this – I've since left the project) to read the entire database and populate the cache files as if from the raw XML. Then we'd only need to read the new XML files, cache any new entities, and at the end of the process insert only the new Citations and their dependent entities.

The code:

package com.foo.bar.loader;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import com.foo.bar.model.BaseEntity;

public class DiskCache<T extends BaseEntity>
       implements Iterable<T> {

  /*package*/ final Map<Integer, long[]> _hashcodeToLocations =
       new HashMap<Integer, long[]>();

  private static final String NULL_STRING = "___NULL___";

  private final RandomAccessFile _raf;
  private final File _file;
  private final StringConverter<T> _stringConverter;

  private int _size;

  /**
   * Constructor.
   * @param filename  path to disk cache file
   * @param stringConverter  converter to serialize an entity to string form
   */
  public DiskCache(final String filename, final StringConverter<T> stringConverter) {

    _file = new File(filename);
    if (_file.exists() && !_file.delete()) {
      throw new CacheException("can't delete " + filename);
    }
    try {
      _raf = new RandomAccessFile(_file, "rw");
    }
    catch (FileNotFoundException e) {
      throw new CacheException(e);
    }
    _stringConverter = stringConverter;
  }

  /**
   * Add the instance to the cache if an equivalent instance doesn't already exist.
   * @param t  the entity to cache
   * @return  <code>true</code> if it was stored
   */
  public boolean cache(final T t) {
    if (!exists(t)) {
      store(t);
      return true;
    }

    return false;
  }

  /**
   * Set the primary key in the 2nd pass once all the instances have been
   * cached and inserted into the database.
   * @param t  the entity
   * @param pk  the primary key
   */
  public void setPk(final T t, final Integer pk) {
    try {
      for (long location : _hashcodeToLocations.get(t.hashCode())) {
        T stored = load(location);
        if (stored.equals(t)) {
          _raf.seek(location);
          _raf.writeInt(pk);
          return;
        }
      }
    }
    catch (IOException e) {
      throw new CacheException(e);
    }
  }

  /**
   * Test for existence of an entity.
   * @param t  the entity
   * @return  <code>true</code> if it's cached on the disk
   */
  /*package*/ boolean exists(final T t) {
    return get(t) != null;
  }

  private void store(final T t) {
    try {
      _raf.seek(_raf.length());
      appendLocation(t);
      _raf.writeInt(-1); // placeholder for pk
      for (String string : _stringConverter.toStringArray(t)) {
        _raf.writeUTF(toNullSafe(string));
      }
      _size++;
    }
    catch (IOException e) {
      throw new CacheException(e);
    }
  }

  private String toNullSafe(final String string) {
    return string == null ? NULL_STRING : string;
  }

  private String fromNullSafe(final String string) {
    return NULL_STRING.equals(string) ? null : string;
  }

  private void appendLocation(final T t) {
    long location;
    try {
      location = _raf.length();
    }
    catch (IOException e) {
      throw new CacheException(e);
    }
    long[] locations = _hashcodeToLocations.get(t.hashCode());
    if (locations == null) {
      locations = new long[] { location };
    }
    else {
      long[] newLocations = new long[locations.length + 1];
      System.arraycopy(locations, 0, newLocations, 0, locations.length);
      newLocations[newLocations.length - 1] = location;
      locations = newLocations;
    }
    _hashcodeToLocations.put(t.hashCode(), locations);
  }

  private T load(final long location) {
    try {
      _raf.seek(location);
      int pk = _raf.readInt();
      String[] strings = new String[_stringConverter.getArraySize()];
      for (int i = 0; i < strings.length; i++) {
        strings[i] = fromNullSafe(_raf.readUTF());
      }
      return _stringConverter.fromStringArray(pk == -1 ? null : pk, strings);
    }
    catch (IOException e) {
      throw new CacheException(e);
    }
  }

  /**
   * Serializes and deserializes an entity to/from an array of strings.
   *
   * @author Burt
   * @param <T>  the entity type
   */
  public interface StringConverter<T extends BaseEntity> {
    int getArraySize();
    String[] toStringArray(T a);
    T fromStringArray(Integer pk, String[] strings);
  }

  /**
   * Close and delete the cache file.
   */
  public void shutdown() {
    try {
      _raf.close();
    }
    catch (IOException e) {
      throw new CacheException(e);
    }
    _file.delete();
  }

  /* (non-Javadoc)
   * @see java.lang.Iterable#iterator()
   */
  public Iterator<T> iterator() {
    return new Iterator<T>() {

      long[] locations = findAllLocations();
      int index = 0;

      public boolean hasNext() {
        return index < locations.length;
      }

      public T next() {
        return load(locations[index++]);
      }

      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }

  private long[] findAllLocations() {
    int locationCount = 0;
    for (long[] locations : _hashcodeToLocations.values()) {
      locationCount += locations.length;
    }

    int index = 0;
    long[] allLocations = new long[locationCount];
    for (long[] locations : _hashcodeToLocations.values()) {
      if (locations.length == 1) {
        allLocations[index++] = locations[0];
      }
      else {
        System.arraycopy(locations, 0, allLocations, index, locations.length);
        index += locations.length;
      }
    }

    Arrays.sort(allLocations);

    return allLocations;
  }

  /**
   * Get the number of stored entities.
   * @return  the number
   */
  public int size() {
    return _size;
  }

  /**
   * Retrieve a pre-existing instance with the same values
   * @param t  instance to compare to
   * @return  the pre-existing instance
   */
  public T get(final T t) {
    long[] locations = _hashcodeToLocations.get(t.hashCode());
    if (locations == null) {
      return null;
    }

    for (long location : locations) {
      T stored = load(location);
      if (stored.equals(t)) {
        return stored;
      }
    }

    return null;
  }

  /**
   * Wrapper for <code>IOException</code>s.
   *
   * @author Burt
   */
  public static class CacheException extends RuntimeException {

    private static final long serialVersionUID = -8390095654330098197L;

    CacheException(final String message) {
      super(message);
    }

    CacheException(final Throwable cause) {
      super(cause);
    }
  }
}
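
To give a sense of how DiskCache would be used, here's a hypothetical StringConverter for Author together with the caching loop. The Author accessors, the Author class extending BaseEntity, and the parsedAuthors source are illustrative stand-ins, not the project's actual code:

package com.foo.bar.loader;

import com.foo.bar.model.Author;

public class AuthorConverter implements DiskCache.StringConverter<Author> {

  public int getArraySize() {
    return 3; // lastName, foreName, initials
  }

  public String[] toStringArray(final Author a) {
    return new String[] { a.getLastName(), a.getForeName(), a.getInitials() };
  }

  public Author fromStringArray(final Integer pk, final String[] strings) {
    Author a = new Author();
    a.setPk(pk);
    a.setLastName(strings[0]);
    a.setForeName(strings[1]);
    a.setInitials(strings[2]);
    return a;
  }
}

and the loading loop might look like:

    DiskCache<Author> authorCache =
        new DiskCache<Author>("authors.cache", new AuthorConverter());

    // pass 1: cache() returns true only for authors not seen before
    for (Author author : parsedAuthors) {
      authorCache.cache(author);
    }

    // pass 2: iterate the unique authors and generate their insert statements
    for (Author author : authorCache) {
      // build the INSERT statement for this author ...
    }

    authorCache.shutdown();
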
[1] 48.9GB of XML files, 6.3GB zipped, as of 2005.
[2] Ironically, many authors already share the same name, and some are listed under more than one name. Disambiguating authors – both programmatically, using inferences based on what's known about authors' backgrounds, research areas, etc., and manually by us and by the authors themselves – is a significant task that is critical to having a usable database. So any bugs we introduced that made the problem worse would not be good.

