Optimism

One of the most frequent requests we get is "I have a few PDFs and want to train a system that extracts the values I need from them". We know that this is a huge topic, and trust me when I say this: we want that system too!

Fear

Without going into the technicalities, you and I need to accept that it doesn't work that way. It works for very narrow problems such as invoices but even there it gets noisy, e.g. when line items have to be recognized.

The problem is two-fold:

  1. People seem to have agreed that PDFs (or emails) are a great way to store information that was once neatly organized in a database. We get that a properly signed form is just a print away and scanners have become a commodity that can "digitize" that information again.
  2. How information is displayed varies from one company or person to the next, and so does the way it should be organized. A machine cannot know that you always want a field to be called "consignee" even when it says "recipient" (well, it actually can do that, but it stops working once highly specific terms come into play).
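
To make the "consignee" versus "recipient" point concrete, here is a minimal sketch of a hand-written alias table. The names `ALIASES` and `normalize_fields` are made up for illustration and are not part of any product. Common synonyms map cleanly, while anything company-specific ends up in the "unknown" pile and needs a human.

```python
# Hypothetical alias table: every entry has to be written by someone
# who knows how this particular company names its fields.
ALIASES = {
    "recipient": "consignee",
    "receiver": "consignee",
    "ship to": "consignee",
}
CANONICAL = set(ALIASES.values())


def normalize_fields(extracted: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Rename known labels to their canonical name and collect the rest."""
    normalized: dict[str, str] = {}
    unknown: list[str] = []
    for label, value in extracted.items():
        key = label.strip().lower()
        if key in ALIASES:
            normalized[ALIASES[key]] = value
        elif key in CANONICAL:
            normalized[key] = value
        else:
            # Highly specific terms land here; no table covers them all.
            unknown.append(label)
    return normalized, unknown
```

The table only grows as fast as someone maintains it, which is exactly where this approach stops scaling.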

Hope

Nearly every time someone knocks at our door and demands extraction, we take the call nonetheless. We do so because we know that extraction is often not the main value driver: While it would be handy to receive all this information in a structured form, it is often sufficient to know the type of document in order to do something useful with it.

The checks that happen on a constant basis are often much simpler. For instance, a company might receive this email:

[Image: example email with PDF attachments, captioned "Thank you for nothing"]

A typical procedure that might follow:

  1. Drag & drop files into a specific folder (e.g. "Purchase order 50303y5403").
  2. Rename files ("invoice.pdf", "packing list.pdf", "wine tasting pictures 2011.pdf").
  3. Notify Angela from accounting that the files are ready.
  4. Check that the information is complete and arrange payment of the invoice total (Angela probably does that).

[Image: Automatically download attachments & rename them according to their content]

Note that the only stage where information had to be extracted was the last. What would have been helpful are things like automatically storing all files in folders, renaming them, and notifying Angela. All of that can easily be done with Levity and Zapier.
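
For readers who think in code, the first three steps can be sketched roughly as follows. This is not how Levity or Zapier work internally: `classify` is a hypothetical keyword lookup standing in for a trained classifier, the text of each attachment is assumed to have been read already, and `route` and `notify` are illustrative names.

```python
import shutil
from pathlib import Path

# Hypothetical keyword rules standing in for a trained document classifier.
KEYWORDS = {
    "invoice": ("invoice", "amount due"),
    "packing list": ("packing list", "gross weight"),
}


def classify(text: str) -> str:
    """Return a document type, or 'other' if no rule matches."""
    lowered = text.lower()
    for label, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return label
    return "other"


def route(attachment: Path, text: str, order_dir: Path) -> Path:
    """Copy an attachment into the order folder, named after its type.

    Two files of the same type would overwrite each other here;
    a real workflow would add a counter or timestamp.
    """
    label = classify(text)
    order_dir.mkdir(parents=True, exist_ok=True)
    target = order_dir / f"{label}{attachment.suffix}"
    shutil.copy(attachment, target)
    return target


def notify(person: str, files: list[Path]) -> str:
    """Build the notification text; sending it is left to your mail or chat tool."""
    names = ", ".join(sorted(f.name for f in files))
    return f"Hi {person}, the files are ready: {names}"
```

Nothing in this sketch reads a single value out of the documents. Knowing the type is enough to file, rename, and hand over.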

Had the company worried about extracting all data from these documents, it would have ended up working with lots of noise and double-checking the same information again.

Bliss

There are cases where extraction is absolutely necessary. But rather than suggesting people wait, we like to propose a better solution: Combining document classification (or really any data) with Amazon's Mechanical Turk.

I'll leave it to their website to explain the details, but what it essentially means is that humans on their end are given some instructions about what to do ("I need this information in that table") and... well, that's it. They just do it when they receive a file.

Expanding on the above example, they would receive each identified invoice and be tasked with providing data in a standardized format:

  • Supplier
  • Invoice number
  • Date
  • Total
  • Department
  • etc...
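
A "standardized format" could look something like the sketch below. The field names follow the list above; the class name `InvoiceRecord`, the helper `parse_record`, and the choice of ISO dates and decimal totals are assumptions made for illustration, not a format prescribed by Mechanical Turk or Levity.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

REQUIRED = ("supplier", "invoice_number", "date", "total", "department")


@dataclass(frozen=True)
class InvoiceRecord:
    supplier: str
    invoice_number: str
    invoice_date: date
    total: Decimal
    department: str


def parse_record(raw: dict[str, str]) -> InvoiceRecord:
    """Validate one hand-typed row and convert it to typed fields."""
    missing = [key for key in REQUIRED if not raw.get(key, "").strip()]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return InvoiceRecord(
        supplier=raw["supplier"].strip(),
        invoice_number=raw["invoice_number"].strip(),
        # Assumes workers are asked to type dates as YYYY-MM-DD.
        invoice_date=date.fromisoformat(raw["date"].strip()),
        total=Decimal(raw["total"].strip()),
        department=raw["department"].strip(),
    )
```

Rejecting incomplete rows at this point keeps the noise out of whatever system receives the data next.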

This is at the high end of automation and often not required for mid-sized companies. For them, it is often sufficient to have data pre-processed so that it can be routed effectively or found more easily later on.

Heaven

Our hope is that with time and broader adoption of AI technology, fewer people will think of it as "I feed the monster with data, now it shall giveth me answers" but rather get creative within their limitations. There is plenty of room for improvement with just that.

We know how painful this is and how many people actually need to extract data. But unless a major breakthrough in AI research comes along, we will keep on suggesting ways like the above to rethink problems in a machine-like way.

But one thing I can assure you: The moment intelligent extraction from piles of data becomes a thing, we will be among the first to implement it. And boy, will we let the world know!

Now that you're here

Levity is a tool that allows you to train AI models on images, documents, and text data. You can rebuild manual workflows and connect everything to your existing systems without writing a single line of code. If you liked this blog post, you'll love Levity.

Sign up
