Making sense of document disasters

My challenge has been to build a system that extracts structured data from PDF documents that share a somewhat consistent structure. Imagine you’ve received a copy of every contract a state agency has signed for the last year; one simple use would be capturing the vendor’s name and the contract amount from each document. It’s almost always better, of course, to ask the agency for their database directly; this sort of approach is only useful when that’s not an option.

Last fall I envisioned building a system for extracting structured data, focused primarily on scanned-in documents. Natively generated PDFs were, I reasoned, already relatively easy to process with existing tools like PDFMiner. And Tabula, an open-source project, worked well for pulling tabular data out of these documents.
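To give a sense of what that looks like in practice, here’s a minimal sketch of pulling text and tables out of a natively generated PDF. It assumes pdfminer.six and tabula-py (a Python wrapper around Tabula’s Java engine) are installed, and “contract.pdf” is just a placeholder file name.

```python
# A minimal sketch, assuming pdfminer.six and tabula-py are installed;
# "contract.pdf" is a placeholder path, not a real file.
from pdfminer.high_level import extract_text
import tabula

# Pull the embedded text layer out of a natively generated PDF.
text = extract_text("contract.pdf")

# Tabula detects tables and returns them as pandas DataFrames.
tables = tabula.read_pdf("contract.pdf", pages="all", multiple_tables=True)

print(text[:500])
for table in tables:
    print(table.head())
```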

As I worked toward this, I realized I was transforming scanned-in documents into something that was almost, but not quite, comparable to natively generated documents. I transformed each document into a series of words with locations on a page. With a few assumptions, I could create a “read-order” representation of the words. That information is available from natively generated PDFs as well. But natively generated PDFs also include font information, which can be extremely useful in locating and identifying text to extract. (Scanned-in documents sometimes include a detected font, though it doesn’t seem to be reliable.)
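Concretely, the intermediate representation looks something like the sketch below; the Word class and the read-order sort are a simplified illustration of the idea, not my actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Word:
    text: str
    page: int
    x: float                         # left edge of the word, in page coordinates
    y: float                         # top edge of the word (y increasing downward)
    fontname: Optional[str] = None   # available in native PDFs; unreliable from OCR

def read_order(words: List[Word], line_tolerance: float = 3.0) -> List[Word]:
    """Approximate read order: page, then top to bottom, then left to right.
    Words whose vertical positions differ by less than line_tolerance are
    treated as sitting on the same line."""
    return sorted(words, key=lambda w: (w.page, round(w.y / line_tolerance), w.x))
```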

What’s more, a lot of the documents I looked at were weird. Really weird. Journalists had diligently collected the same forms from dozens of agencies, but each agency had its own variant of the form. Or the form changed slightly every year. Or every few months. Or every time a different user printed it out. It made sense to expand my approach to documents that weren’t scanned (the ‘natively generated’ PDFs) just because they were often incredibly frustrating for reporters to process — especially those with no background in writing their own software.

My original approach had been a human-driven templating system that allowed values to be captured at specific locations, either at an absolute position on the page or relative to another “anchor” feature. I’m most of the way through a rough version of an interface that does this. But increasingly it seemed like I needed more fine-grained differentiation between subtle document variations. The hard part wasn’t capturing values from a known form; it was capturing values when forms changed a little bit from one document to the next.
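In rough form, a template rule boils down to something like this: capture whatever falls inside a box, where the box sits either at a fixed spot on the page or at an offset from wherever an anchor word turns up. The pared-down Word record and the function names below are illustrative, not the real interface.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Word:          # the same words-with-locations idea as above, pared down
    text: str
    x: float
    y: float

def find_anchor(words: List[Word], label: str) -> Optional[Word]:
    """Return the first word matching an anchor label like "Vendor:"."""
    for w in words:
        if w.text.strip().lower() == label.lower():
            return w
    return None

def capture(words: List[Word], box: Tuple[float, float, float, float],
            anchor: Optional[str] = None) -> str:
    """Join the words that fall inside `box`. With an anchor label, the box
    is an offset from the anchor word's position; otherwise it is absolute
    page coordinates."""
    x0, y0, x1, y1 = box
    if anchor is not None:
        a = find_anchor(words, anchor)
        if a is None:
            return ""
        x0, y0, x1, y1 = x0 + a.x, y0 + a.y, x1 + a.x, y1 + a.y
    hits = [w for w in words if x0 <= w.x <= x1 and y0 <= w.y <= y1]
    return " ".join(w.text for w in sorted(hits, key=lambda w: (w.y, w.x)))

# e.g. grab whatever sits just to the right of a "Vendor:" label:
# vendor = capture(page_words, box=(20, -5, 320, 10), anchor="Vendor:")
```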

During my fellowship year I’ve split my time between machine learning classes, political science classes that tried to apply machine learning techniques, and courses that would help me make better sense of climate change, particularly geophysics and oil economics.

The most relevant part of these classes, for my challenge, is a set of machine learning algorithms I studied that can group similar documents together without human intervention. From there it is possible to find which words always appear (and thus are part of the underlying form) and which vary (and represent an ‘answer’ to a question that appears in the boilerplate of the form).
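Here is the rough shape of that analysis, sketched with scikit-learn; the vectorizer, the clustering method, and the 90 percent threshold are illustrative stand-ins, not a particular algorithm I’ve settled on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_documents(texts, n_clusters=5):
    """Group documents with similar wording; `texts` is a list of strings."""
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

def split_boilerplate(cluster_texts, threshold=0.9):
    """Within one cluster, words appearing in nearly every document are
    probably part of the underlying form; the rest are probably answers."""
    doc_words = [set(t.lower().split()) for t in cluster_texts]
    vocabulary = set().union(*doc_words)
    boilerplate, answers = set(), set()
    for word in vocabulary:
        share = sum(word in d for d in doc_words) / len(doc_words)
        (boilerplate if share >= threshold else answers).add(word)
    return boilerplate, answers
```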

When the forms are laid out exactly the same, one can separate boilerplate from answers using word locations alone. But electronic forms allow lengthy answers that change the positions of the words on the page. When this occurs, I’ve found that using fonts can help identify which words are part of the form boilerplate.
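One way to use that signal, again as an illustrative sketch (the font names are whatever the PDF reports, and the threshold is arbitrary): count how often each text-and-font pair shows up across a cluster of documents, and treat the pairs that recur in nearly all of them as boilerplate, even when they have moved around on the page.

```python
from collections import Counter
from typing import List, Set, Tuple

def boilerplate_by_font(documents: List[List[Tuple[str, str]]],
                        threshold: float = 0.9) -> Set[Tuple[str, str]]:
    """`documents` is a list of documents, each a list of (text, fontname) pairs.
    Returns the pairs present in at least `threshold` of the documents --
    likely form boilerplate, even if their positions shift between documents."""
    seen_in = Counter()
    for doc in documents:
        for pair in set(doc):   # count each pair at most once per document
            seen_in[pair] += 1
    cutoff = threshold * len(documents)
    return {pair for pair, count in seen_in.items() if count >= cutoff}
```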

The next steps for me in this project are incorporating fonts into the extraction process and into the interface, and building a user-guided analysis system that clusters documents together and suggests to the user which parts of a document are repeated. The pieces I’ve been working on are scattered across a number of repositories, so in the coming months I’ll be joining them together and documenting how to use them.