Archive for July, 2006

Using Generics to Simplify Hibernate/Spring DAOs (update)

Sunday, July 16th, 2006

Tim noted in the comments of my previous post that it’s possible to use reflection to determine the parameterized type of the entity class. It’s a great enhancement and drops simple DAOs that only use the standard base DAO methods down to zero LOC (other than the class definition and imports). Cool.

The updated base DAO implementation class:

(see getEntityClass())

package com.foo.bar.dao.hibernate;

import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.sql.Timestamp;
import java.util.HashSet;
import java.util.Set;

import org.hibernate.Criteria;
import org.hibernate.Session;
import org.hibernate.criterion.Expression;
import org.springframework.orm.hibernate3.HibernateCallback;
import org.springframework.orm.hibernate3.support.HibernateDaoSupport;

import com.foo.bar.dao.BaseDAO;
import com.foo.bar.model.DatabaseObject;
import com.foo.bar.model.User;

/**
 * Abstract base class for Hibernate DAOs.
 *
 * @author Burt
 * @param <T> the entity type
 */
public abstract class BaseDaoImpl<T extends DatabaseObject>
       extends HibernateDaoSupport
       implements BaseDAO<T> {

  private Class _entityClass;

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findAll()
   */
  public Set<T> findAll() {
    return findAll(false);
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findAll()
   */
  @SuppressWarnings("unchecked")
  public Set<T> findAll(final boolean includeDeleted) {
    return (Set<T>)getHibernateTemplate().execute(new HibernateCallback() {
        public Object doInHibernate(final Session session) {
          Criteria criteria = session.createCriteria(getEntityClass());
          if (!includeDeleted) {
            criteria.add(Expression.isNull("deleted"));
          }
          return new HashSet<T>(criteria.list());
        }
      });
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findByPrimaryKey(int)
   */
  public T findByPrimaryKey(final int pk) {
    return findByPrimaryKey(pk, false);
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findByPrimaryKey()
   */
  @SuppressWarnings("unchecked")
  public T findByPrimaryKey(final int pk, final boolean includeDeleted) {

    return (T)getHibernateTemplate().execute(new HibernateCallback() {
      public Object doInHibernate(final Session session) {
        T instance = (T)session.get(getEntityClass(), pk);
        if (instance != null && (includeDeleted || !instance.isDeleted())) {
          return initializeIfNotNull(instance);
        }
        return null;
      }
    });
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findByName()
   */
  public T findByName(final String name) {
    return findByName(name, false);
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findByName()
   */
  @SuppressWarnings("unchecked")
  public T findByName(final String name, final boolean includeDeleted) {

    return (T)getHibernateTemplate().execute(new HibernateCallback() {
      public Object doInHibernate(final Session session) {
        Criteria criteria = session.createCriteria(getEntityClass());
        criteria.add(Expression.eq(getNameFieldName(), name));
        if (!includeDeleted) {
          criteria.add(Expression.isNull("deleted"));
        }
        return initializeIfNotNull(criteria.uniqueResult());
      }
    });
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#create()
   */
  public T create(final T t) {
    getHibernateTemplate().saveOrUpdate(t);
    return t;
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#update()
   */
  public void update(final T t, final User updatedBy) {
    t.setModified(new Timestamp(System.currentTimeMillis()));
    t.setModifiedBy(updatedBy);
    getHibernateTemplate().update(t);
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#delete()
   */
  public void delete(final T t, final User deletedBy) {
    getHibernateTemplate().delete(t);
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#softDelete()
   */
  public void softDelete(final T t, final User deletedBy) {
    t.setDeleted(new Timestamp(System.currentTimeMillis()));
    update(t, deletedBy);
  }

  @SuppressWarnings("unchecked")
  private Object initializeIfNotNull(final Object instance) {
    if (instance != null) {
      initialize((T)instance);
    }
    return instance;
  }

  protected void initialize(@SuppressWarnings("unused") T instance) {
    // let subclass do special initialization
  }

  protected synchronized Class getEntityClass() {

    if (_entityClass == null) {
      Type type = getClass().getGenericSuperclass();
      loop:
      while (true) {
        if (type instanceof ParameterizedType) {
          Type[] arguments = ((ParameterizedType)type).getActualTypeArguments();
          for (Type argument : arguments) {
            if (argument instanceof Class &&
                DatabaseObject.class.isAssignableFrom(((Class)argument))) {
              _entityClass = (Class)argument;
              break loop;
            }
          }
        }
        // move up the hierarchy; unwrap a ParameterizedType to its raw class first
        if (type instanceof ParameterizedType) {
          type = ((ParameterizedType)type).getRawType();
        }
        type = ((Class)type).getGenericSuperclass();
        if (type == Object.class) {
          throw new RuntimeException(
              "Could not find a DatabaseObject subclass parameterized type");
        }
      }
    }
    return _entityClass;
  }

  /**
   * Each entity has a 'name' but not necessarily
   * called 'name'.
   * @return  the name's field name
   */
  protected String getNameFieldName() {
    return "name";
  }
}

And the updated sample implementation class:

package com.foo.bar.dao.hibernate;

import com.foo.bar.dao.hibernate.BaseDaoImpl;
import com.foo.bar.dao.ErrorAnnotationDAO;
import com.foo.bar.model.ErrorAnnotation;

/**
 * Hibernate implementation of
 * <code>ErrorAnnotationDAO</code>.
 *
 * @author Burt
 */
public class ErrorAnnotationDaoImpl
       extends BaseDaoImpl<ErrorAnnotation>
       implements ErrorAnnotationDAO {
}

Using Generics to Simplify Hibernate/Spring DAOs

Saturday, July 15th, 2006

Database access always involves a significant amount of boilerplate code, especially for standard CRUD operations. Complex relationships and queries make things interesting, but the bulk of the database code involves reading individual instances (looked up by primary key, ‘name’, etc.), storing new instances, editing instances and deleting. I’ve become a big fan of the Hibernate approach to ORM, and Spring’s helper classes significantly reduce the amount of coding required. But for each entity there’s still the same nearly identical code for each DAO implementation class, and for their interfaces too.

So I’ve developed a base interface and base implementation class that use Java5 generics to reduce the code for each DAO to almost nothing for DAOs that only perform the “standard” operations.

The DAOs manage DatabaseObjects, which extend DataObject and contain fields (with getters and setters) for the primary key, the user who created the instance and when, the user who last edited it and when, and the user and time of a “soft” delete, i.e. the instance is flagged as deleted so it’s hidden from the UI but stays in the database.
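
For reference, here’s a rough sketch of what DatabaseObject might look like. The real class isn’t shown in this post, so the names below are inferred from the DAO code that follows and should be treated as illustrative only:

package com.foo.bar.model;

import java.sql.Timestamp;

/**
 * Illustrative sketch of the entity base class (inferred, not the real code).
 * Only the accessors the base DAO relies on are shown; the created/createdBy
 * and deletedBy fields are omitted, and by convention the creating User is
 * passed to the subclass constructors.
 */
public abstract class DatabaseObject extends DataObject {

  private int _pk;
  private User _modifiedBy;
  private Timestamp _modified;
  private Timestamp _deleted;

  public int getPk() { return _pk; }
  public void setPk(final int pk) { _pk = pk; }

  public User getModifiedBy() { return _modifiedBy; }
  public void setModifiedBy(final User modifiedBy) { _modifiedBy = modifiedBy; }

  public Timestamp getModified() { return _modified; }
  public void setModified(final Timestamp modified) { _modified = modified; }

  public Timestamp getDeleted() { return _deleted; }
  public void setDeleted(final Timestamp deleted) { _deleted = deleted; }

  /** An instance is soft-deleted if its deleted timestamp is set. */
  public boolean isDeleted() { return _deleted != null; }
}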

The base interface:1

package com.foo.bar.dao;

import java.util.Set;

import com.foo.bar.model.DatabaseObject;
import com.foo.bar.model.User;

/**
 * Base DAO.
 * @author Burt
 * @param <T> the entity type
 */
public interface BaseDAO<T extends DatabaseObject> {

  /**
   * Find all.
   * @param includeDeleted
   * @return  all instances
   */
  Set<T> findAll(boolean includeDeleted);

  /**
   * Find all.
   * @return  all instances
   */
  Set<T> findAll();

  /**
   * Find an instance by primary key.
   * @param pk  the primary key
   * @param includeDeleted
   * @return  the instance
   */
  T findByPrimaryKey(int pk, boolean includeDeleted);

  /**
   * Find an instance by primary key.
   * @param pk  the primary key
   * @return  the instance
   */
  T findByPrimaryKey(int pk);

  /**
   * Find an instance by name.
   * @param name  the name
   * @param includeDeleted
   * @return  the instance
   */
  T findByName(String name, boolean includeDeleted);

  /**
   * Find an instance by name.
   * @param name  the name
   * @return  the instance
   */
  T findByName(String name);

  /**
   * Insert a new instance.
   * @param t  the new instance
   * @return  the instance updated with its primary key
   */
  T create(T t);

  /**
   * Save changes to an existing instance.
   * @param t  the instance
   * @param updatedBy
   */
  void update(T t, User updatedBy);

  /**
   * Delete an existing instance.
   * @param t  the instance
   * @param deletedBy
   */
  void delete(T t, User deletedBy);

  /**
   * Soft delete (hide) an existing instance.
   * @param t  the instance
   * @param deletedBy
   */
  void softDelete(T t, User deletedBy);
}

And the base implementation class:

package com.foo.bar.dao.hibernate;

import java.sql.Timestamp;
import java.util.HashSet;
import java.util.Set;

import org.hibernate.Criteria;
import org.hibernate.Session;
import org.hibernate.criterion.Expression;
import org.springframework.orm.hibernate3.HibernateCallback;
import org.springframework.orm.hibernate3.support.HibernateDaoSupport;

import com.foo.bar.dao.BaseDAO;
import com.foo.bar.model.DatabaseObject;
import com.foo.bar.model.User;

/**
 * Abstract base class for Hibernate DAOs.
 *
 * @author Burt
 * @param <T> the entity type
 */
public abstract class BaseDaoImpl<T extends DatabaseObject>
       extends HibernateDaoSupport
       implements BaseDAO<T> {

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findAll()
   */
  public Set<T> findAll() {
    return findAll(false);
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findAll()
   */
  @SuppressWarnings("unchecked")
  public Set<T> findAll(final boolean includeDeleted) {
    return (Set<T>)getHibernateTemplate().execute(new HibernateCallback() {
        public Object doInHibernate(final Session session) {
          Criteria criteria = session.createCriteria(getEntityClass());
          if (!includeDeleted) {
            criteria.add(Expression.isNull("deleted"));
          }
          return new HashSet<T>(criteria.list());
        }
      });
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findByPrimaryKey(int)
   */
  public T findByPrimaryKey(final int pk) {
    return findByPrimaryKey(pk, false);
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findByPrimaryKey()
   */
  @SuppressWarnings("unchecked")
  public T findByPrimaryKey(final int pk, final boolean includeDeleted) {
    return (T)getHibernateTemplate().execute(new HibernateCallback() {
      public Object doInHibernate(final Session session) {
        T instance = (T)session.get(getEntityClass(), pk);
        if (instance != null && (includeDeleted || !instance.isDeleted())) {
          return initializeIfNotNull(instance);
        }
        return null;
      }
    });
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findByName()
   */
  public T findByName(final String name) {
    return findByName(name, false);
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#findByName()
   */
  @SuppressWarnings("unchecked")
  public T findByName(final String name, final boolean includeDeleted) {
    return (T)getHibernateTemplate().execute(new HibernateCallback() {
      public Object doInHibernate(final Session session) {
        Criteria criteria = session.createCriteria(getEntityClass());
        criteria.add(Expression.eq(getNameFieldName(), name));
        if (!includeDeleted) {
          criteria.add(Expression.isNull("deleted"));
        }
        return initializeIfNotNull(criteria.uniqueResult());
      }
    });
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#create()
   */
  public T create(final T t) {
    getHibernateTemplate().saveOrUpdate(t);
    return t;
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#update()
   */
  public void update(final T t, final User updatedBy) {
    t.setModified(new Timestamp(System.currentTimeMillis()));
    t.setModifiedBy(updatedBy);
    getHibernateTemplate().update(t);
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#delete()
   */
  public void delete(final T t, final User deletedBy) {
    getHibernateTemplate().delete(t);
  }

  /** (non-Javadoc)
   * @see com.foo.bar.dao.BaseDAO#softDelete()
   */
  public void softDelete(final T t, final User deletedBy) {
    t.setDeleted(new Timestamp(System.currentTimeMillis()));
    update(t, deletedBy);
  }

  @SuppressWarnings("unchecked")
  private Object initializeIfNotNull(final Object instance) {
    if (instance != null) {
      initialize((T)instance);
    }
    return instance;
  }

  protected void initialize(@SuppressWarnings("unused") T instance) {
    // let subclass do special initialization
  }

  /**
   * Can't determine T.getClass() due to erasure,
   * so must declare the class redundantly.
   * @return  the entity class
   */
  protected abstract Class getEntityClass();

  /**
   * Each entity has a 'name' but not necessarily
   * called 'name'.
   * @return  the name's field name
   */
  protected String getNameFieldName() {
    return "name";
  }
}

And a sample DAO:

package com.foo.bar.dao;

import com.foo.bar.dao.BaseDAO;
import com.foo.bar.model.ErrorAnnotation;

/**
 * DAO for <code>ErrorAnnotation</code>.
 * @author Burt
 */
public interface ErrorAnnotationDAO extends BaseDAO<ErrorAnnotation> {
  // no other methods
}

And its implementation:

(uses only base interface methods, so the only required method is getEntityClass)

package com.foo.bar.dao.hibernate;

import com.foo.bar.dao.hibernate.BaseDaoImpl;
import com.foo.bar.dao.ErrorAnnotationDAO;
import com.foo.bar.model.ErrorAnnotation;

/**
 * Hibernate implementation of
 * <code>ErrorAnnotationDAO</code>.
 *
 * @author Burt
 */
public class ErrorAnnotationDaoImpl
       extends BaseDaoImpl<ErrorAnnotation>
       implements ErrorAnnotationDAO {

  @Override
  protected Class getEntityClass() {
    return ErrorAnnotation.class;
  }
}

Of course client usage is the same as if I’d copy/pasted the boilerplate code thanks to the generic types, e.g.:

ErrorAnnotationDAO dao = ...
ErrorAnnotation errorAnnotation = dao.findByPrimaryKey(12345); // no need to cast
...
Collection<ErrorAnnotation> allErrorAnnotations = dao.findAll(); // no need to cast
...
dao.create(new ErrorAnnotation(...));
...
ErrorAnnotation errorAnnotation = ...
errorAnnotation.setFoo("foo");
errorAnnotation.setBar("bar");
User user = ...
dao.update(errorAnnotation, user);

The one downside is the need to suppress unchecked warnings, but Hibernate guarantees that all instances are of the appropriate class so that’s just to get past the compiler.

  1. Note that the update and delete methods take a User parameter to record the user performing the action. create() doesn’t, though, because DatabaseObject subclasses by convention take a User parameter in their constructors.

A Value Object Base Class

Saturday, July 15th, 2006

Value Objects (aka Data Objects) are pretty common. They typically have little or no functionality, but rather just contain one or more data fields. They’re used to represent database entities, to represent model objects in MVC applications, etc.

To be well-behaved members of Collections they should have good hashCode() and equals() methods, and for debugging should have good toString() methods. To be stored in a Session, they must be serializable. So I’ve developed a simple abstract base class that meets these requirements.

It uses Jakarta Commons Lang utility classes for default reflection-based implementations of hashCode(), equals(), and toString(), it implements Serializable, and has a Logger1 in case concrete subclasses need it. The logger is transient because it isn’t serializable, and also shouldn’t be included in hashCode(), equals(), and toString() (Commons Lang ignores transients by default).

Of course any concrete subclass can override any of the three methods if the default implementation isn’t sufficient. But I find that’s the exception rather than the norm.

The code:

package com.foo.bar.model;

import java.io.Serializable;

import org.apache.commons.lang.builder.EqualsBuilder;
import org.apache.commons.lang.builder.HashCodeBuilder;
import org.apache.commons.lang.builder.ToStringBuilder;
import org.apache.commons.lang.builder.ToStringStyle;
import org.apache.log4j.Logger;

/**
 * Abstract base class for value objects. Contains reflection-based default
 * methods for hashCode, equals, and toString.
 *
 * @author Burt
 */
public abstract class DataObject implements Serializable {

  protected transient final Logger _log = Logger.getLogger(getClass());

  @Override
  public String toString() {
    return ToStringBuilder.reflectionToString(this, ToStringStyle.MULTI_LINE_STYLE);
  }

  @Override
  public int hashCode() {
    return HashCodeBuilder.reflectionHashCode(this);
  }

  @Override
  public boolean equals(final Object other) {
    return EqualsBuilder.reflectionEquals(this, other);
  }
}
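
As a quick illustration (mine, not from the original post), a hypothetical concrete subclass needs only its fields and accessors; equals(), hashCode(), and toString() all come from the base class:

package com.foo.bar.model;

/**
 * Hypothetical example subclass. Two Person instances with the same name and
 * email are equal, share a hash code, and print both fields in toString(),
 * with no extra code in this class.
 */
public class Person extends DataObject {

  private static final long serialVersionUID = 1L;

  private String _name;
  private String _email;

  public String getName() { return _name; }
  public void setName(final String name) { _name = name; }

  public String getEmail() { return _email; }
  public void setEmail(final String email) { _email = email; }
}
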
  1. I gave up on Commons Logging a while ago since I always use Log4j.

Disk-based Caching for a Large Uniqueness Problem

Saturday, July 15th, 2006

Earlier this year I was tasked with creating a relational database containing all of the PubMed article data. Their web site has extensive searching capabilities, but they also allow downloading the entire data set1 for custom uses. The XML is fully redundant though – each author is listed once for each article, as is each instance of Journal information, chemicals, publication type, etc. So the task involved a lot more than just parsing XML and inserting it into a database.

XMLBeans

A junior programmer on the team made an initial attempt at loading the XML using XMLBeans. This approach had severe memory issues since the DOM structure of the file was maintained in memory along with many supporting objects. XMLBeans is a very powerful tool for working with XML in an OO fashion, but it quickly choked since each zipped file typically contains 50,000 citations in around 135MB of XML. I realized that a SAX parser would work since treating the data as objects was an unnecessary luxury – we needed to read all the data, but visiting each node only once was fine.

The biggest problem though was the redundancy:

Entity Name Unique Count
Citation 14,792,864
Author 5,852,735
Chemical 148,082
Grant 561,486
Journal 16,310
Journal Info 28,667
Journal Issue 1,199,586
Keyword 75,436
Mesh Term 22,286
Mesh Term Qualifier Name 83
Publication Type 52

Most of the entities are very manageable, but at 5.8 million, Author was a problem. Note that although there are 14.7 million Citations, these are the top-level objects – they contain/reference all the rest – and are singletons.

Hibernate

We were planning on using Hibernate for application data access, so my first pass at inserting the data into the database used our Hibernate DAOs. For each object (e.g. Author) referenced in a Citation, I’d execute a findOrCreate() method that would find the previously inserted row or create one if necessary. After running for about two full days it became apparent that it would take approximately two months on a fast multiprocessor box to finish, but that should have been obvious from the beginning. Hibernate would be fine for the runtime application but not the best bulk loading solution.

Caching

I played with caching since Hibernate has excellent support for it. It comes bundled with a few cache providers and is pluggable, so that seemed like the best route. The problem with caching is unique keys. Take Author as an example. A hash code alone is insufficient since two objects with the same hash code aren’t necessarily equal. That’s not a problem in a HashMap, but it would be for our purposes, since a 2nd author with the same cache key would be merged in with the previous one.2
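
To make that concrete (my example, not from the original post), distinct values can share a hash code, so the hash alone can’t serve as a unique key:

// Two different strings with identical hash codes: a cache keyed only by
// hash code would wrongly treat these as the same entity.
System.out.println("Aa".hashCode());   // 2112
System.out.println("BB".hashCode());   // 2112
System.out.println("Aa".equals("BB")); // false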

My first naive approach to unique keys was to concatenate the object’s fields, so for example for an author “James A. Gosling” the key would be “james_a_gosling”. It was rather silly in retrospect, since the keys ended up being about the same size as the data, and even with disk-based caches I still had the same crippling memory issues as before. I was running on both Windows and 32-bit Linux, so the maximum memory I could allocate to the JVM was around 2GB, and I was relatively quickly getting OOMEs. I could allocate a lot more on a 64-bit multiprocessor Linux box that was available, but I wanted to make sure this would run on any team member’s PC.

RandomAccessFile as a disk cache

I’ll spare you the gory details of the few intermediate approaches that I tried and describe the eventual solution. The implementation that I settled on could certainly use more optimization, but it’s a process that will run rarely (once initially, plus once a year for updates). It runs in a few hours and takes only 1-1.5GB of heap memory, so any decent developer PC should be up to the task.

What finally worked for me was an implementation something like that in java.util.HashMap. HashMap handles collisions with a private implementation of a linked list. All members with the same hash code end up in the same list, and equals() is called on each member in the list to find the actual keyed object. This is efficient as long as hash codes are well-distributed and the lists stay short. In the pathological case where all members share identical hash codes, everything would be stored in the same bucket and rather than constant time lookups you’d have linear time lookups.

I created a helper class, DiskCache. DiskCache uses a java.io.RandomAccessFile to store serialized entities. I maintain a Map keyed by hash code (Integer) with long[] values representing the file offsets of all cached entities with that hash code.

To store an entity, I get the array of locations for its hash code from the map, creating a new array if necessary. I move to the end of the file, write out the serialized form, and store the file offset in the location array.

To test for an entity’s existence, I get the array of locations for its hash code from the map, deserialize each stored instance from the file, and call equals() on each. This is the expensive part of the process – there is a decent amount of disk thrashing here. But if the hash codes are well-distributed there won’t be too many instances per hash code and the number of deserializations will be manageable.

This works pretty well. The cache files get pretty big, but disk space is cheap. The memory usage is primarily the arrays of file offsets, so processing 14.7 million citations in ~50GB of XML results in a maximum memory usage of a little over 1GB. The entire process (4 stages: caching the Citations’ dependent objects, generating insert statements for these cached instances, re-reading the XML and building the Citation inserts plus the one-to-many and many-to-many inserts for dependent objects, and finally bulk insertion into the database) takes less than 8 hours on a reasonably fast machine.

One interesting side benefit of this approach is that PubMed updates will be straightforward. The bulk loading code could be amended (I didn’t implement this – I’ve since left the project) to read the entire database and populate cache files as if from the raw XML. Then we’d only need to read in the new XML files, caching new entities, and inserting at the end of the process only new Citations and their dependent entities.

The code:

package com.foo.bar.loader;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import com.foo.bar.model.BaseEntity;

public class DiskCache<T extends BaseEntity>
       implements Iterable<T> {

  /*package*/ final Map<Integer, long[]> _hashcodeToLocations =
       new HashMap<Integer, long[]>();

  private static final String NULL_STRING = "___NULL___";

  private final RandomAccessFile _raf;
  private final File _file;
  private final StringConverter<T> _stringConverter;

  private int _size;

  /**
   * Constructor.
   * @param filename  path to disk cache file
   * @param stringConverter  converter to serialize an entity to string form
   */
  public DiskCache(final String filename, final StringConverter<T> stringConverter) {

    _file = new File(filename);
    if (_file.exists() && !_file.delete()) {
      throw new CacheException("can't delete " + filename);
    }
    try {
      _raf = new RandomAccessFile(_file, "rw");
    }
    catch (FileNotFoundException e) {
      throw new CacheException(e);
    }
    _stringConverter = stringConverter;
  }

  /**
   * Add the instance to the cache if an equivalent instance doesn't already exist.
   * @param t  the entity to cache
   * @return  <code>true</code> if it was stored
   */
  public boolean cache(final T t) {
    if (!exists(t)) {
      store(t);
      return true;
    }

    return false;
  }

  /**
   * Set the primary key in the 2nd pass once all the instances have been
   * cached and inserted into the database.
   * @param t  the entity
   * @param pk  the primary key
   */
  public void setPk(final T t, final Integer pk) {
    try {
      for (long location : _hashcodeToLocations.get(t.hashCode())) {
        T stored = load(location);
        if (stored.equals(t)) {
          _raf.seek(location);
          _raf.writeInt(pk);
          return;
        }
      }
    }
    catch (IOException e) {
      throw new CacheException(e);
    }
  }

  /**
   * Test for existence of an entity.
   * @param t  the entity
   * @return  <code>true</code> if it's cached on the disk
   */
  /*package*/ boolean exists(final T t) {
    return get(t) != null;
  }

  private void store(final T t) {
    try {
      _raf.seek(_raf.length());
      appendLocation(t);
      _raf.writeInt(-1); // placeholder for pk
      for (String string : _stringConverter.toStringArray(t)) {
        _raf.writeUTF(toNullSafe(string));
      }
      _size++;
    }
    catch (IOException e) {
      throw new CacheException(e);
    }
  }

  private String toNullSafe(final String string) {
    return string == null ? NULL_STRING : string;
  }

  private String fromNullSafe(final String string) {
    return NULL_STRING.equals(string) ? null : string;
  }

  private void appendLocation(final T t) {
    long location;
    try {
      location = _raf.length();
    }
    catch (IOException e) {
      throw new CacheException(e);
    }
    long[] locations = _hashcodeToLocations.get(t.hashCode());
    if (locations == null) {
      locations = new long[] { location };
    }
    else {
      long[] newLocations = new long[locations.length + 1];
      System.arraycopy(locations, 0, newLocations, 0, locations.length);
      newLocations[newLocations.length - 1] = location;
      locations = newLocations;
    }
    _hashcodeToLocations.put(t.hashCode(), locations);
  }

  private T load(final long location) {
    try {
      _raf.seek(location);
      int pk = _raf.readInt();
      String[] strings = new String[_stringConverter.getArraySize()];
      for (int i = 0; i < strings.length; i++) {
        strings[i] = fromNullSafe(_raf.readUTF());
      }
      return _stringConverter.fromStringArray(pk == -1 ? null : pk, strings);
    }
    catch (IOException e) {
      throw new CacheException(e);
    }
  }

  /**
   * Serializes and deserializes an entity to/from an array of strings.
   *
   * @author Burt
   * @param <T>  the entity type
   */
 public static interface StringConverter<T extends BaseEntity> {
    int getArraySize();
    String[] toStringArray(T a);
    T fromStringArray(Integer pk, String[] strings);
  }

  /**
   * Close and delete the cache file.
   */
  public void shutdown() {
    try {
      _raf.close();
    }
    catch (IOException e) {
      throw new CacheException(e);
    }
    _file.delete();
  }

  /* (non-Javadoc)
   * @see java.lang.Iterable#iterator()
   */
  public Iterator<T> iterator() {
    return new Iterator<T>() {

      long[] locations = findAllLocations();
      int index = 0;

      public boolean hasNext() {
        return index < locations.length;
      }

      public T next() {
        return load(locations[index++]);
      }

      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }

  private long[] findAllLocations() {
    int locationCount = 0;
    for (long[] locations : _hashcodeToLocations.values()) {
      locationCount += locations.length;
    }

    int index = 0;
    long[] allLocations = new long[locationCount];
    for (long[] locations : _hashcodeToLocations.values()) {
      if (locations.length == 1) {
        allLocations[index++] = locations[0];
      }
      else {
        System.arraycopy(locations, 0, allLocations, index, locations.length);
        index += locations.length;
      }
    }

    Arrays.sort(allLocations);

    return allLocations;
  }

  /**
   * Get the number of stored entities.
   * @return  the number
   */
  public int size() {
    return _size;
  }

  /**
   * Retrieve a pre-existing instance with the same values
   * @param t  instance to compare to
   * @return  the pre-existing instance
   */
  public T get(final T t) {
    long[] locations = _hashcodeToLocations.get(t.hashCode());
    if (locations == null) {
      return null;
    }

    for (long location : locations) {
      T stored = load(location);
      if (stored.equals(t)) {
        return stored;
      }
    }

    return null;
  }

  /**
   * Wrapper for <code>IOException</code>s.
   *
   * @author Burt
   */
  public static class CacheException extends RuntimeException {

    private static final long serialVersionUID = -8390095654330098197L;

    CacheException(final String message) {
      super(message);
    }

    CacheException(final Throwable cause) {
      super(cause);
    }
  }
}
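
To show how this fits into the loading process, here’s a rough usage sketch of the first two stages. The Author entity, its StringConverter, and the helper calls are hypothetical stand-ins; the real loader code isn’t shown in this post:

// Author is assumed to be a BaseEntity subclass and AuthorStringConverter an
// implementation of DiskCache.StringConverter<Author>; both are made up here
// for illustration.
DiskCache<Author> authorCache =
    new DiskCache<Author>("authors.cache", new AuthorStringConverter());

// Stage 1: while SAX-parsing the citation XML, cache each dependent entity.
// cache() stores an author only the first time an equal instance is seen,
// so the cache ends up holding the de-duplicated set.
for (Author author : authorsFromCurrentCitation) {
  authorCache.cache(author);
}

// Stage 2: iterate the de-duplicated entities and generate insert statements.
for (Author author : authorCache) {
  writeAuthorInsert(author);
}

authorCache.shutdown();
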
  1. 48.9GB of XML files, 6.3GB zipped as of 2005.
  2. Ironically, many authors share the same name, and some are listed under more than one name. Disambiguation of authors, both programmatically using inferences based on what’s known about authors’ backgrounds, research areas, etc., and manually by us and the authors themselves, is a significant task that is critical to having a usable database. So any bugs we introduced that made the problem worse would not be good.

Unintentionally Squashing Exceptions

Friday, July 14th, 2006

At my last job we had a utility method that created a SysException (the one and only application exception) given a message and an optional cause exception. The primary method (there are several overloads) was:

public static SysException createSysException(String message, Throwable t) {

   SysException e = null;
   if (t == null) {
      e = new SysException(message);
   }
   else {
      if (t instanceof SysException) {
         // don't create a new exception,
         // just add message and fill stack trace
         e = (SysException) t;
         ...
         e.fillInStackTrace();
      }
      else if (t instanceof RemoteException) {
         RemoteException re = (RemoteException)t;
         e = new SysException(message, re.detail);
      }
      else {
         e = new SysException(message, t);
      }
   }

   return e;
}

and this is certainly decent code. By centralizing exception generation, we could log the errors, do validation, etc. But the crux of the problem is that the usage pattern was:

try {
   ...
}
catch (MyException e) {
   log.error("uh oh", e);
   throw ExceptionUtils.createSysException("uh oh", e);
}

Again, decent code. Except that it could have been:

try {
   ...
}
catch (MyException e) {
   log.error("uh oh", e);
   ExceptionUtils.createSysException("uh oh", e);
}

and the compiler would have been just as happy. And given the current implementation of createSysException(), there is no automatic logging, so unless the caller throws the generated exception, it gets squashed. That’s why I was mucking around in the code – there were several places where the exception wasn’t thrown and execution was allowed to proceed, with occasional unintended consequences.

This made debugging harder since it was an intermittent problem. We lost the original information from the cause exception, and because execution didn’t stop after the call to ExceptionUtils.createSysException, variables were null and we’d fail later on a NullPointerException. That led us to the problem, but it took a while to find the true cause of the original problem.

So my solution was to (in addition to solving this particular exception’s cause) change

public static SysException createSysException(String message, Throwable t) {

to

public static void throwSysException(String message, Throwable t) throws SysException {

The method body would be identical, except that it would throw the exception instead of returning it and allowing the caller to disregard it. Lucky me, I only needed to change ~500 files.
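
One way to do it, as a minimal sketch (not the code I actually changed), is a thin wrapper around the existing factory method:

public static void throwSysException(String message, Throwable t) throws SysException {
   // same logic as createSysException(), but the result is always thrown,
   // so the compiler forces callers to deal with it rather than letting
   // them silently discard it
   throw createSysException(message, t);
}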

Interesting List bug

Thursday, July 13th, 2006

I found an interesting bug in the code at work the other day. The task involved looping through a list and creating a new list of items generated from the input list’s items. The core problem was that the implementation didn’t create a new list, but added to the original, i.e. (not the real code but the same pattern):

List list = generateInitialList();
for (int i = 0; i < list.size(); i++) {
   MyClass thing = (MyClass)list.get(i);
   MyOtherClass newThing = createMyOtherClassFromMyClass(thing);
   list.add(newThing);
}

This is interesting for a few reasons. It’s bad in that it uses get(index) instead of using an iterator; in this case we’re using a (rather weak) implementation of quasi-type-safe lists that subclass ArrayList (old pre-JDK5 codebase) so that’s safe, but it’s a dangerous Java 1.1-style antipattern that would be potentially expensive if the list were a LinkedList.

The cool part is how it fails. It loops through the N elements in the list, and since it’s adding to the original list, the list keeps getting larger. It calls size() on every iteration, so if nothing threw an exception (e.g. if there were no cast) it would loop forever. Of course, at the N+1st iteration it fails, since it tries to cast the first MyOtherClass to a MyClass, and that dies with a ClassCastException.

If he’d used an iterator, it would have failed earlier since by adding to a list that’s being iterated, he’d get a ConcurrentModificationException.

And if he’d continued with the get(index) antipattern but called size() once and stored it in a temporary variable, e.g.

for (int i = 0, size = list.size(); i < size; i++) {
   MyClass thing = (MyClass)list.get(i);
   MyOtherClass newThing = createNewThingFromThing(thing);
   list.add(newThing);
}

then it would have looped through all N items, created N new MyOtherClass instances, and then failed on a ClassCastException in the calling code that iterated through this list’s items, since that code expects a list containing only MyOtherClass instances.
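
For completeness, the intended pattern was presumably to build and return a new list rather than append to the original; a minimal sketch in the same pre-generics style:

List newList = new ArrayList();
for (int i = 0; i < list.size(); i++) {
   MyClass thing = (MyClass)list.get(i);
   MyOtherClass newThing = createMyOtherClassFromMyClass(thing);
   newList.add(newThing);
}
// callers then iterate newList, which contains only MyOtherClass instances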

Of course this would have never made it into the code base if it had been even trivially unit tested (his initial lists must have been empty, I suppose) but that’s another topic.
