One of the most frequent requests we get is: "I have a few PDFs and want to train a system that extracts the values I need from them." We know this is a huge topic, and trust me when I say this: we want that system too!
Without going into the technicalities, you and I need to accept that it doesn't work that way. It works for very narrow problems such as invoices, but even there it gets noisy, e.g. when line items have to be recognized.
The problem is two-fold:
- People seem to have agreed that PDFs (or emails) are a great way to store information that was once neatly organized in a database. We get it: a properly signed form is just a print away, and scanners have become a commodity that can "digitize" that information again.
- How information is displayed varies from one company or person to the next – and so does the way it should be organized. A machine cannot know that you always want a field to be called "consignee" even when it says "recipient" (well, it actually can do that, but it breaks down when highly specific terms come into play).
Nearly every time someone knocks at our door and demands extraction, we take the call nonetheless. We do so because we know that extraction is often not the main value driver: While it would be handy to receive all this information in a structured form, it is often sufficient to know the type of document in order to do something useful with it.
The checks that happen on a constant basis are often much simpler. For instance, a company might receive an email with several attached files.
A typical procedure that might follow:
- Drag & drop files into a specific folder (e.g. "Purchase order 50303y5403").
- Rename files ("invoice.pdf", "packing list.pdf", "wine tasting pictures 2011.pdf").
- Notify Angela from accounting that the files are ready.
- Check whether the information is complete and arrange the payment of the invoice total (Angela probably does that).
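The first three steps above can be sketched in a few lines. This is a hypothetical illustration, not a real integration: `classify_document` stands in for whatever classifier you use (e.g. a Levity model called via its API), and the folder layout and labels are assumptions.

```python
# Hypothetical sketch: route an incoming attachment by its document type.
import shutil
from pathlib import Path

def classify_document(path: Path) -> str:
    """Placeholder for a real classifier: return a label like 'invoice'.

    In practice this would call a trained classification model; here we
    only peek at the filename so the sketch is self-contained.
    """
    name = path.stem.lower()
    if "invoice" in name:
        return "invoice"
    if "packing" in name:
        return "packing list"
    return "other"

def route_attachment(path: Path, order_folder: Path) -> Path:
    """Move a file into the order folder, renamed after its label."""
    label = classify_document(path)
    order_folder.mkdir(parents=True, exist_ok=True)
    target = order_folder / f"{label}{path.suffix}"
    shutil.move(str(path), target)
    return target
```

The notification step would then be a webhook or Zapier action triggered once `route_attachment` returns – the point being that none of this requires extracting a single field from the document itself.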
Note that the only stage where information had to be extracted was the last one. What would have been helpful is automatically storing all files in folders, renaming them, and notifying Angela – all of which can easily be done with Levity and Zapier.
Had the company worried about extracting all data from these documents, it would have ended up working with lots of noise and double-checking the same information again.
There are cases where extraction is absolutely necessary. But rather than suggesting people wait, we like to propose a better solution: combining document classification (or really the classification of any data) with Amazon's Mechanical Turk.
I'll leave it to their website to explain the details, but what it essentially means is that humans on their end are given instructions about what to do ("I need this information in that table") and... well, that's it. They just do it whenever they receive a file.
Expanding on the above example, they would receive each identified invoice and be tasked with providing data in a standardized format:
- Invoice number
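A standardized format of this kind could be as simple as a small record type that each worker fills in per invoice. Note that only the invoice number appears in the list above; treat the shape below as an illustrative assumption, since the actual fields would be defined in the task instructions you give the workers.

```python
# Hypothetical sketch of the standardized record a worker returns for
# each identified invoice. "invoice_number" is the only field named in
# the text; the record would grow with whatever your instructions ask for.
from dataclasses import dataclass, asdict

@dataclass
class InvoiceRecord:
    invoice_number: str

record = InvoiceRecord(invoice_number="INV-0001")  # illustrative value
row = asdict(record)  # ready to append to a table or CSV
```

Because every record has the same shape, the results from many workers can be merged into one table without any cleanup.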
This is at the high end of automation and often not required for mid-sized companies. For them, it is often sufficient to have data pre-processed so that it can be routed effectively or found more easily later on.
Our hope is that, with time and broader adoption of AI technology, fewer people will think of it as "I feed the monster with data, now it shall giveth me answers" and will instead get creative within their limitations. There is plenty of room for improvement with just that.
We know how painful it is and how many people actually need to extract data. But unless there is a major breakthrough in AI research, we will keep suggesting ways like the above to rethink problems in a machine-friendly way.
But one thing I can assure you: The moment intelligent extraction from piles of data becomes a thing, we will be among the first to implement it. And boy, will we let the world know!