John pointed me to an interesting open source project, the IMS Open Corpus Workbench. This project has developed tools for “managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations.” Obviously it has a linguistics bent, but the tools seem to be well documented and usable.
You can see an example of an interesting interface to the Corpus Workbench at BwanaNet, a wizard-like front end that walks you through five steps to get results from an English, Catalan, and Spanish corpus.