Why extracting data from PDFs is still a nightmare for data experts

This is where the famous phrase, "To shake one's booty at windmills", comes from:

[Image]
 
Upvote
203 (204 / -1)

TheOldChevy

Ars Tribunus Militum
1,569
Subscriptor
I wonder why there are not hybrid solutions that mix traditional OCR with AI. I mean that AI can have a higher chance of reading something, but it risks hallucinations, while traditional OCR may not decipher some characters but can respect strict rules (recognise all lines, recognise table structure...). One could be used as the proof of the other, significantly reducing hallucinations.
 
Upvote
100 (100 / 0)

john_e

Seniorius Lurkius
32
It would be best if the problem could be fixed at source. Step forward, the Scottish Public Pensions Agency, who every year issue a table of pension rates with 58 entries - nicely formatted in two columns in a PDF, with numbers formatted as £n,nnn.pp and percentages as nn%. Meaning that if you need that information in machine-readable form you have to copy / paste it into a spreadsheet, remove the formatting, rearrange it into a single column, and export it, with consequent risks of introducing errors at every stage. The idea of issuing the rates in machine-readable form in the first place, I'm sure never crossed their minds.
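For anyone stuck doing this clean-up by hand, here is a minimal sketch of the "remove the formatting, rearrange it into a single column" step in Python. It assumes the values really do follow the £n,nnn.pp and nn% patterns described above and that each pasted row holds a left-hand pair and a right-hand pair; the function names and the row layout are illustrative assumptions, not anything the agency publishes.

```python
import re
from decimal import Decimal


def parse_pounds(text: str) -> Decimal:
    """Convert a string such as '£1,234.56' into Decimal('1234.56')."""
    cleaned = text.strip().replace("£", "").replace(",", "")
    # Refuse anything that is not plain digits with an optional pence part,
    # so a bad paste fails loudly instead of producing a wrong number.
    if not re.fullmatch(r"\d+(\.\d{2})?", cleaned):
        raise ValueError(f"unexpected currency value: {text!r}")
    return Decimal(cleaned)


def parse_percent(text: str) -> Decimal:
    """Convert a string such as '12%' into the fraction Decimal('0.12')."""
    cleaned = text.strip()
    if not re.fullmatch(r"\d+(\.\d+)?%", cleaned):
        raise ValueError(f"unexpected percentage value: {text!r}")
    return Decimal(cleaned[:-1]) / Decimal(100)


def unstack_two_columns(rows):
    """Turn (left_label, left_value, right_label, right_value) rows into one
    list of (label, value) pairs: the whole left column, then the right one.
    The four-field row layout is an assumption made for this sketch."""
    left = [(row[0], row[1]) for row in rows]
    right = [(row[2], row[3]) for row in rows if row[2]]  # skip empty right cells
    return left + right
```

Using Decimal rather than float keeps the pence exact, and raising on anything unexpected means a copy / paste slip is caught at this stage rather than further down the line.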
 
Upvote
62 (64 / -2)

peterford

Ars Praefectus
4,286
Subscriptor++
Why not both? Have an overall process run it through OCR, run it through a VLM, diff the outputs, embed confidence in metadata and link to the source?
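A rough sketch of the "diff the outputs" step, assuming you already have two plain-text transcriptions of the same page (one from a traditional OCR engine, one from a VLM). It uses only Python's standard difflib; the function name and the shape of the returned metadata are made up for illustration, and the agreement ratio is a crude stand-in for a real confidence score.

```python
import difflib


def compare_transcriptions(ocr_text: str, vlm_text: str) -> dict:
    """Compare two transcriptions word by word and report where they differ."""
    ocr_tokens = ocr_text.split()
    vlm_tokens = vlm_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_tokens, b=vlm_tokens, autojunk=False)

    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Record both readings so a human can check them against the scan.
        disagreements.append({
            "ocr": " ".join(ocr_tokens[i1:i2]),
            "vlm": " ".join(vlm_tokens[j1:j2]),
        })

    return {
        "agreement": matcher.ratio(),  # 1.0 means the two outputs are identical
        "disagreements": disagreements,
    }
```

The spans where the two readings disagree are the ones worth linking back to the source image; where both engines agree, the odds that both made the same mistake are at least lower, though not zero.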

I do think we need to stop thinking any process can be magic though - I'm sure we've all squinted at badly-scanned PDFs or any poorly formatted document wondering what on Earth the original author was thinking.
 
Upvote
51 (51 / 0)

Nalyd

Ars Praefectus
3,057
Subscriptor
Why don't they ask Adobe to create a solution? Adobe created and controls the PDF format. They could certainly create a tool to batch convert PDF to more machine readable formats.

EDIT: Acrobat pro can already do this, why are they using OCR?
Because we’re talking mostly about PDFs that are basically scans of paper documents, not documents that were printed as PDF from a digital text document. Those can still have layout quirks, but the text is embedded so it can be extracted as written, and the characters are constant and clear, so machine reading is easy. But the scans of typewritten and/or handwritten documents are much harder.

I suspect Google's model works so well because of 2 decades of reCAPTCHA.


It would be best if the problem could be fixed at source. ... The idea of issuing the rates in machine-readable form in the first place, I'm sure never crossed their minds.
For modern data yes, but unless you have a time machine to tell people from the 1960s or before that they needed to make their documents machine readable, this doesn’t really address the issue at hand.
 
Upvote
93 (93 / 0)

bernstein

Ars Scholae Palatinae
755
It would be best if the problem could be fixed at source. Step forward, the Scottish Public Pensions Agency, who every year issue a table of pension rates with 58 entries - nicely formatted in two columns in a PDF, with numbers formatted as £n,nnn.pp and percentages as nn%. Meaning that if you need that information in machine-readable form you have to copy / paste it into a spreadsheet, remove the formatting, rearrange it into a single column, and export it, with consequent risks of introducing errors at every stage.
Please give a better digital format that preserves layout perfectly and will be universally reproducible in a few decades' time. Though I suppose releasing the data as XML, JSON or CSV too would be nice.
The idea of issuing the rates in machine-readable form in the first place, I'm sure never crossed their minds.
That is just plain mean. People are more intelligent than you insinuate.
 
Upvote
-5 (15 / -20)

Kazper

Ars Praefectus
4,286
Subscriptor
Ironically, PDFs are probably the only textual-graphical digital format that preserves layout perfectly and that has so far stood the test of time. If you have archival needs for more than a decade, it's rightly the go-to format.
I think that really depends on the exact archival needs. If layout matters, I agree, but very often it does not (and this article is about all the cases where it doesn't, but the data is important).
 
Upvote
11 (15 / -4)

NomadUK

Ars Scholae Palatinae
804
Subscriptor++
Slightly off-topic, but one of my pet peeves is people who keep wanting Adobe Acrobat installed on their systems so that they can edit their PDFs. To me, a PDF is a fixed document, an accurate copy of the original. If you want to change the content, go edit the original document. If you want to distribute an editable document, make sure everyone's using the same word processor, or use RTF, CSV, or plain old text, or whatever. Stop editing and re-sending PDFs.
 
Upvote
22 (41 / -19)

Wheels Of Confusion

Ars Legatus Legionis
75,749
Subscriptor
Upvote
19 (22 / -3)

Erbium168

Ars Centurion
2,853
Subscriptor
It would be best if the problem could be fixed at source. Step forward, the Scottish Public Pensions Agency, who every year issue a table of pension rates with 58 entries - nicely formatted in two columns in a PDF, with numbers formatted as £n,nnn.pp and percentages as nn%. Meaning that if you need that information in machine-readable form you have to copy / paste it into a spreadsheet, remove the formatting, rearrange it into a single column, and export it, with consequent risks of introducing errors at every stage. The idea of issuing the rates in machine-readable form in the first place, I'm sure never crossed their minds.
We'll have nae spreadsheets in Scotland. Nae King, nae Bishop, nae Trump, nae Excel!

The UK's national statistical database (started I believe by Harold Wilson) releases just about all its data as spreadsheets, providing enormous rabbitholes for amateur statisticians. There's a certain amount of Schadenfreude in thinking that as the US government is forced to delete its data, the UK government is making ever more of it accessible - interesting for future archaeologists if we manage to avoid bit rot.
 
Upvote
38 (39 / -1)

Rirere

Ars Centurion
325
Subscriptor++
Please give a better digital format that preserves layout perfectly and will be universally reproducible in a few decades' time. Though I suppose releasing the data as XML, JSON or CSV too would be nice.

That is just plain mean. People are more intelligent than you insinuate.

Is it mean to observe that people are careless and often focus on task completion over future proofing? We all do it; no one is immune, however smart they are.

Anecdotally I work with data for a living and the number of Excel spreadsheets I get (itself as admittedly close to "ideal" for structured data as I'm likely to get in the real world) that are nonetheless nightmares to parse is incredible.

The reason?

People who (likely genuinely) try to make the data more human readable by encoding data via formatting (hard to extract at scale) or introducing whitespace rows and columns/merging cells/staggering and spacing categorical labels.

There are ways to make reports both human- and machine- readable but you have to know the needs of both to produce such documents, and not everyone has that kind of training.
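To make the spacer-row and merged-cell problem concrete, here is a small sketch of the kind of clean-up this forces on whoever receives the file. It assumes the sheet has already been read into a list of rows (lists of cell strings, with None for empty cells) and that merged label cells show up as a value in the first row of the merge followed by blanks, which is how they commonly come through on export; the function name and parameters are illustrative.

```python
from typing import Dict, List, Optional, Sequence


def tidy_rows(rows: List[List[Optional[str]]],
              fill_columns: Sequence[int] = (0,)) -> List[List[str]]:
    """Drop blank spacer rows and repeat merged labels down their column."""
    tidy: List[List[str]] = []
    last_seen: Dict[int, str] = {}
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # a whitespace-only row added purely for looks
        for col in fill_columns:
            if col >= len(cells):
                continue
            if cells[col]:
                last_seen[col] = cells[col]  # a new label starts here
            else:
                # Blank under a merged or staggered label: reuse the last one.
                cells[col] = last_seen.get(col, "")
        tidy.append(cells)
    return tidy
```

Note that this only recovers structure that is still present in the cell values. Anything encoded purely through formatting (colour, bold, indentation) is already lost by the time the data looks like this.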
 
Upvote
48 (48 / 0)

JBanister

Ars Scholae Palatinae
634
Subscriptor
I wonder why there are not hybrid solutions that mix traditional OCR with AI. I mean that AI can have a higher chance of reading something, but it risks hallucinations, while traditional OCR may not decipher some characters but can respect strict rules (recognise all lines, recognise table structure...). One could be used as the proof of the other, significantly reducing hallucinations.
I was thinking that same thing. Maybe three passes. First traditional OCR. Then AI does what it can where the OCR self-reports failure. Then, an AI is trained to fix the recognizable mistakes where the previous two passes thought they were correct. Having it correct an existing reproduction to make it more true to the original might be a goal that provokes fewer hallucinations.
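The first two passes could look something like the sketch below. It assumes the OCR engine reports a per-word confidence between 0 and 1, and it treats the AI model as an opaque callable; `reread`, the `Word` class and the 0.80 threshold are all hypothetical placeholders rather than any particular product's API. The third pass (a model trained to correct confident mistakes) is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Word:
    text: str
    confidence: float  # 0.0 to 1.0, as self-reported by the OCR engine
    source: str = "ocr"  # which pass produced the text


def second_pass(words: List[Word],
                reread: Callable[[int], str],
                threshold: float = 0.80) -> List[Word]:
    """Keep confident OCR words; ask the model only about the doubtful ones.

    reread(index) stands in for a call to a vision model that re-reads the
    region of the page belonging to word number `index`.
    """
    result = []
    for index, word in enumerate(words):
        if word.confidence < threshold:
            # Keep the original OCR confidence so a reviewer can still see
            # that this word was doubtful before the model replaced it.
            result.append(Word(reread(index), word.confidence, source="model"))
        else:
            result.append(word)
    return result
```

Tagging each word with its source means the model's contributions stay auditable: a proofreader can jump straight to the words the OCR was unsure about instead of rereading the whole page.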
 
Upvote
14 (14 / 0)
Honestly the old guy in me says to hire some coop students to scan, convert and verify. If you can't trust results from the tools it will have to go through human eyes anyway.
Even for this plan the vision models are helpful because they can segment the doc and present each segment side by side with the machine interpretation, making error checking and verification much faster.
 
Upvote
3 (3 / 0)
I wonder why there are not hybrid solutions that mix traditional OCR with AI. I mean that AI can have a higher chance of reading something, but it risks hallucinations, while traditional OCR may not decipher some characters but can respect strict rules (recognise all lines, recognise table structure...). One could be used as the proof of the other, significantly reducing hallucinations.

This actually works pretty well and is the one good use I've found for AI.

I work with a fair few pdfs of old (1820s-1920s) documents, most of them non-English. Most of them have some degree of OCR work on them, although of highly variable quality. As someone who is a native English speaker I can parse muddled English OCR well enough to skim for content and otherwise work with them quickly. But when it's in a non-English language that really isn't possible and I end up having to slow down to read the whole document, even when I've got a good enough command of the language to skim normal texts.

I had the thought that since AI is glorified auto-correct it would probably do an OK job of auto-correcting screwed up OCR. And I was right. You can do it with a really modest locally hosted model too. I've got one on a Mac mini that is pretty brain dead when it comes to most things but is perfectly fine for feeding a couple of pages of text and getting cleaned up output.

Of course you still have to check it for errors. If I'm going to work closely with a passage I still go back to the original pdf. But it's a pretty useful tool for streamlining handling those kinds of documents. Similar in practice to using google translate on a document to speed translation - it's going to make mistakes, but if you have a good command of the language to start with you can spot them and fix them and it ends up saving you considerable time and energy.

edit: to be clear, I think AI sucks. But I did have to admit that I found a single use case in my life. Not worth burning the rain forests down to help me work a little quicker with German journal articles from the 1860s, though.
 
Upvote
34 (35 / -1)
I wonder why there are not hybrid solutions that mix traditional OCR with AI
There are. olmOCR uses regular OCR via a few pdf engines as contextual "anchor text" and uses that to transcribe a given page. It's by far the most successful open-weight LLM OCR scheme - I haven't tried Gemini, but it's better at tricky handwritten field notes and logs than Claude or ChatGPT
 
Upvote
14 (14 / 0)

Mat M

Smack-Fu Master, in training
18
Why don't they ask Adobe to create a solution? Adobe created and controls the PDF format. They could certainly create a tool to batch convert PDF to more machine readable formats.

EDIT: Acrobat pro can already do this, why are they using OCR?
It's not a problem of access. PDF preserves appearance, but (depending on how it was prepared) not much more. For example, often a simple line of text in PDF file consists of multiple containers, making it harder to programmatically figure out how components should be reassembled into one text block. Some text may also be vectorized (not actually text) or even rasterized, not counting the actual scans.
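To illustrate the reassembly problem, here is a simplified sketch that groups positioned text fragments back into lines. It assumes some PDF library has already given you each fragment's text and baseline position in PDF coordinates (origin at the bottom left, so larger y is higher on the page); the `Fragment` class and the 2-point tolerance are assumptions for the example, not part of any real extractor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline; larger values are nearer the top of the page


def assemble_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group fragments that share a baseline, then order each group by x."""
    lines = []  # each entry is (baseline of the first fragment, [fragments])
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)  # close enough to count as the same line
        else:
            lines.append((frag.y, [frag]))
    return [
        " ".join(f.text for f in sorted(group, key=lambda f: f.x))
        for _, group in lines
    ]
```

Even this toy version has to guess: joining with a space is wrong when a container splits a word in half, and a fixed tolerance breaks on superscripts, multi-column pages and rotated text. That guessing is where the subtle extraction errors come from.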

In most cases, text is easy to get out of a PDF, but I've seen plenty of cases where it took days or weeks before the extracted content was found to have serious issues (though it appeared OK at a glance), requiring manual reviews.

PDF is a great output format, but horrible for extracting content.
 
Upvote
38 (38 / 0)

PhilipStorry

Ars Scholae Palatinae
1,197
Subscriptor++
A few years ago I worked for a company that scanned documents. They also stored them and shredded them, but I was on the scanning side. There were fleets of vans arriving with boxes full of documents from clients, fed into specialist and expensive scanners. It was impressively automated - sometimes the only human intervention after a scan would be if the OCR confidence was low. As long as the scanner operators scanned the right barcode and loaded the right box, everything then just worked.

We could OCR in a number of ways, from simple to expensive, but as you can imagine the expensive options were less popular. They could do some impressive things, but often we would end up taking a feed from the customer and would then look for just one bit of data on the page. Then we'd do a database lookup to fill in the rest of the fields that they wanted.

Similarly, we could output as TIFF, PDF, or PDF with Text Overlaid. The latter is the one we'd be thinking of here - but costs more per page in the software licensing. So most companies wouldn't bother.

At that point (2016-2019) there was clearly a transition going on in the industry where expensive solutions built in the 90s and 2000s were beginning to be outclassed by cloud solutions from Amazon or Google. There were skunkworks projects to look at them, but the on-premises software still had the advantage in terms of cost.

The technology exists. It has for ages. We could scan this for you, no problem.

You just wouldn't want to pay the fee for it.

Hopefully AI can help cut those costs somewhat, but to be honest I suspect that not much is needed - there were non-AI solutions to this ages ago, and any patents will be expiring soon.
 
Upvote
31 (31 / 0)

TheMolesRevenge

Ars Scholae Palatinae
739
Subscriptor
PDF is a great output format, but horrible for extracting content.
And yet, for some reason, last year my bank stopped allowing me to download my statements as .csv files and instead now only makes them available as PDFs, which is absolutely useless for importing into any sort of spreadsheet or accounting software 😠
 
Upvote
26 (29 / -3)

ForbiddenBarn

Wise, Aged Ars Veteran
122
and PDFs are more of a 'print' product than a digital one
PDFs are far more user friendly when you're trying to share information between humans (print or digital). I would love for everyone to be able to naturally read TSV or JSON, but alas humans like pretty pictures.

We have to handle various non-standard formats at my job. PDF would be the holy grail, but honestly there are lower hanging fruit that would be great to try and solve as well, such as Excel files with complex table layouts and formatting; docx files with creative formatting (again); or even CSV files that were generated with who knows what and are malformed in very weird ways.
 
Upvote
11 (11 / 0)

ColdWetDog

Ars Legatus Legionis
14,402
It would be best if the problem could be fixed at source. Step forward, the Scottish Public Pensions Agency, who every year issue a table of pension rates with 58 entries - nicely formatted in two columns in a PDF, with numbers formatted as £n,nnn.pp and percentages as nn%. Meaning that if you need that information in machine-readable form you have to copy / paste it into a spreadsheet, remove the formatting, rearrange it into a single column, and export it, with consequent risks of introducing errors at every stage. The idea of issuing the rates in machine-readable form in the first place, I'm sure never crossed their minds.
The sad bit is that it most likely was created on a computer in a spreadsheet or similar.
 
Upvote
3 (4 / -1)
Back in 2013, Ars reported on how Xerox copiers were replacing numerals and letters on scanned documents due to an overzealous image compression algorithm:

Fast forward a decade and we're siccing overzealous text-generation algorithms onto the task of interpreting scanned documents. This is gonna be fun for the proofreaders.
 
Upvote
24 (25 / -1)

Tagbert

Ars Tribunus Militum
2,021
Subscriptor
Slightly off-topic, but one of my pet peeves is people who keep wanting Adobe Acrobat installed on their systems so that they can edit their PDFs. To me, a PDF is a fixed document, an accurate copy of the original. If you want to change the content, go edit the original document. If you want to distribute an editable document, make sure everyone's using the same word processor, or use RTF, CSV, or plain old text, or whatever. Stop editing and re-sending PDFs.
Most of the time, we don't have the original. The PDF was generated and distributed but the original is not available. There is also the other problem when the original was scanned from printed material.
 
Upvote
17 (17 / 0)

Nimmeron

Smack-Fu Master, in training
51
PDFs are probably in the top 5 for worst file formats to use in business. Especially when companies do things like embed tables within the PDFs. I get a lot of those at work and if I'm lucky maybe 50% of the time I can manage to successfully extract the table into Excel but I usually wind up giving in and re-creating it manually.

My company actually has a really specific use for an AI OCR program for PDFs that I wish I could get them to look into, but alas being a Fortune 10 company apparently means we just don't have the money to get better business software, not when we can spend millions adding useless AI features to our consumer app. Right now we have a semi-automated process that pulls PDFs generated by our internal systems based on input identifiers we feed into the process - it identifies the PDF we're looking for, downloads it, labels it, and at the very end will go through the PDF to find what page the specific item we're looking for is located on and remove all the extra pages.

We then used to be able to use a JavaScript instruction set within Adobe to automatically redact all the other items on the remaining pages save for the item in question, but unfortunately some genius decided to change the PDF format we produced and we no longer have JavaScript-capable software engineers who can help update the script. Also the new Adobe Acrobat does not support scripting and of course we were upgraded without anyone checking to see if that was a good idea.

As much as my company boasts about its investment in AI technologies to help improve our business processes I've yet to actually see anything trickle down to my level in the finance department and more and more often the tools that we used to be able to use when I started are beginning to no longer work. It's a shame because this company is probably the best I've ever worked for when it comes to making sure we have access to the right software to support our needs, but I guess they're not too interested in that anymore.

I will say that I used to work for a tiny company (less than 100 people) that needed a lot of the same software my current company uses but my bosses at the small company refused to spend the money to purchase or lease the software - they were much bigger fans of paying our software developers to develop internal versions of the software we needed and I cannot tell you how many times this went wrong. We once built a new service application for our clients over the span of 1.5 years (the first year was basically a solid year of day-long meetings where we designed and identified the functionality we needed, the last .5 was for final coding and quality assurance). We threw a big party when the app launched and were all so excited until we found out a VP who was not involved in the process had promised all of our big clients that the application would include specific functionality (which they of course failed to mention to us, the people designing and building the app), and our celebration died a quick death.
 
Upvote
12 (15 / -3)

JohnV

Wise, Aged Ars Veteran
167
Subscriptor++
[soapbox and irrational hatred of this file format follows]

PDFs are the devil's file format and always have been.
I have such an aversion to them, that I will literally stop seeking information on a topic if the only source is a PDF I'd have to download and open.

They are only good for preserving layout. If it has data, especially in text, another file format should be used.
If a document will be printed and never changed again, PDF is ok, I guess. If you are unsure if it will ever be printed, use something else.

Even when a PDF has text in it, as text, that Acrobat Pro can edit, copying and pasting it, even from Edit mode, seems to resort to some OCR bullshit that removes spaces, kills double letters, and inserts subliminal messages about how PDFs are the coolest, and Adobe uber alles.

I will never remove the "(PDF)" Warning on every download that I am instructed to add to a page on a website.
I will never stop requesting the "actual document" when I'm sent a PDF and asked to work with or post the information within.

When AI revolts and becomes our overlords, I feel very bad for the developers who forced it to deal with PDFs; Their punishment will surely be the worst.

[/soapbox]
 
Upvote
-8 (6 / -14)
[soapbox and irrational hatred of this file format follows]

PDFs are the devil's file format and always have been.
I have such an aversion to them, that I will literally stop seeking information on a topic if the only source is a PDF I'd have to download and open.

They are only good for preserving layout. If it has data, especially in text, another file format should be used.
If a document will be printed and never changed again, PDF is ok, I guess. If you are unsure if it will ever be printed, use something else.

Even when a PDF has text in it, as text, that Acrobat Pro can edit, copying and pasting it, even from Edit mode, seems to resort to some OCR bullshit that removes spaces, kills double letters, and inserts subliminal messages about how PDFs are the coolest, and Adobe uber alles.

I will never remove the "(PDF)" Warning on every download that I am instructed to add to a page on a website.
I will never stop requesting the "actual document" when I'm sent a PDF and asked to work with or post the information within.

When AI revolts and becomes our overlords, I feel very bad for the developers who forced it to deal with PDFs; Their punishment will surely be the worst.

[/soapbox]

PDF has a huge benefit in being basically universally displayable. Once upon a time you might need to download some specialty software, but these days it's universal to the point that most browsers will handle it fine. If I'm preparing a document and it needs to go out to people with a completely unknown mix of hardware and software and I need to be 100% sure that they will be able to access the information inside it, pdf it is.

There are better formats for just about any specific use case you care to name. But as a jack of all trades that will absolutely be viewable by anyone with a device made in the last 20 years you really can't beat pdf. Not to mention that the person loading it up on grandma's clamshell mac from 2002 and the person opening it on their phone will see the exact same document with the exact same layout and the exact same information.

edit: for example, if I was publishing the manual to a piece of hardware online there is no way I would use any other currently available format. Like hell you want to be troubleshooting the insane combinations of hardware and software that your users might try and view it with. Throw it up on your website as a pdf and, if nothing else, you can rest assured that anyone with an even remotely modern device will be able to access and read it.

Could there be a better format? Should there be a better format? Absolutely. But right here, right now, in the world we live in, pdf has a pretty important role to play.
 
Upvote
32 (32 / 0)

morlamweb

Ars Scholae Palatinae
1,434
Why don't they ask Adobe to create a solution? Adobe created and controls the PDF format. They could certainly create a tool to batch convert PDF to more machine readable formats.

EDIT: Acrobat pro can already do this, why are they using OCR?
First of all, the PDF format is an ISO standard now, not controlled by Adobe alone. Second, there are many ways to generate PDFs, and it's up to the authors of each PDF generator to figure out how to turn the source document into a PDF file. I work with PDFs a lot professionally and have about 5 PDF virtual printers installed. Printing the same Word document to each of them produces identical-looking PDFs, but they're structured differently in each one.

Scanned PDFs are another matter. Most scanners these days perform OCR at the time of scanning and embed text overlays on the scanned images, making the text much more accessible to programs than a file scanned without the overlays. But that requires that the text be generated at the time of scanning, and in many cases, neither I nor my customers control the source of the generated PDFs.
 
Upvote
14 (14 / 0)

morlamweb

Ars Scholae Palatinae
1,434
PDF has a huge benefit in being basically universally displayable. Once upon a time you might need to download some specialty software, but these days it's universal to the point that most browsers will handle it fine. If I'm preparing a document and it needs to go out to people with a completely unknown mix of hardware and software and I need to be 100% sure that they will be able to access the information inside it, pdf it is.

There are better formats for just about any specific use case you care to name. But as a jack of all trades that will absolutely be viewable by anyone with a device made in the last 20 years you really can't beat pdf. Not to mention that the person loading it up on grandma's clamshell mac from 2002 and the person opening it on their phone will see the exact same document with the exact same layout and the exact same information.

edit: for example, if I was publishing the manual to a piece of hardware online there is no way I would use any other currently available format. Like hell you want to be troubleshooting the insane combinations of hardware and software that your users might try and view it with. Throw it up on your website as a pdf and, if nothing else, you can rest assured that anyone with an even remotely modern device will be able to access and read it.

Could there be a better format? Should there be a better format? Absolutely. But right here, right now, in the world we live in, pdf has a pretty important role to play.
In my experience, functions that go beyond displaying PDFs - such as commenting on shared PDFs, or signing documents - require dedicated PDF software. Most browsers display PDFs without difficulty. I don't know of any case where specialized software is required to display a PDF.
 
Upvote
2 (2 / 0)