Why extracting data from PDFs is still a nightmare for data experts

captobie · Mar 11, 2025

This feels more like a feature than a problem to me.

Zool26 · Mar 11, 2025

accidentally follow instructions in the text (thinking they are part of a user prompt)

Hi Bobby Tables, is that you?

ubercurmudgeon · Mar 11, 2025

This is where the famous phrase, "To shake one's booty at windmills", comes from:

[image]

TheOldChevy · Mar 11, 2025

I wonder why there are not hybrid solutions that mix traditional OCR with AI. I mean that AI can have a higher chance of reading something, but it risks hallucinations, while traditional OCR may not decipher some characters but can respect strict rules (recognise all lines, recognise table structure...). One could be used as the proof of the other, significantly reducing hallucinations.

john_e · Mar 11, 2025

It would be best if the problem could be fixed at source. Step forward, the Scottish Public Pensions Agency, who every year issue a table of pension rates with 58 entries - nicely formatted in two columns in a PDF, with numbers formatted as £n,nnn.pp and percentages as nn%. Meaning that if you need that information in machine-readable form you have to copy / paste it into a spreadsheet, remove the formatting, rearrange it into a single column, and export it, with consequent risks of introducing errors at every stage. The idea of issuing the rates in machine-readable form in the first place, I'm sure never crossed their minds.
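For what it's worth, the clean-up step can at least be scripted rather than done by hand. A minimal sketch in Python, assuming the pasted table arrives as rows of four cells (label, value, label, value). The sample rows and function names are made up for illustration and are not the agency's actual figures:

```python
from decimal import Decimal

def parse_amount(text):
    """Turn '£1,234.56' into Decimal('1234.56')."""
    return Decimal(text.strip().lstrip("£").replace(",", ""))

def parse_percent(text):
    """Turn '12%' into Decimal('0.12')."""
    return Decimal(text.strip().rstrip("%")) / 100

def unstack_two_columns(rows):
    """Rearrange (label, value, label, value) rows into one column.

    The left-hand column is read top to bottom first, then the right-hand one.
    """
    left = [(r[0], r[1]) for r in rows]
    right = [(r[2], r[3]) for r in rows if len(r) >= 4 and r[2]]
    return left + right

# Hypothetical sample data, not real pension rates.
pasted = [
    ["Tier 1", "£1,234.56", "Tier 3", "£3,000.00"],
    ["Tier 2", "£2,345.67", "Tier 4", "£4,100.50"],
]
single_column = [
    (label, parse_amount(value)) for label, value in unstack_two_columns(pasted)
]
```

Using Decimal rather than float keeps the pence exact, which matters if the figures feed into further calculations. It does not remove the copy / paste step, so errors can still creep in before the script ever sees the data.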

jwbaker · Mar 11, 2025

Can confirm the general utility of Gemini Flash 2 for this. I used NotebookLM to process a huge pile of local campaign finance disclosures that would otherwise have been a massive chore.

peterford · Mar 11, 2025

Why not both? Have an overall process run it through OCR, run it through a VLM, diff the outputs, embed confidence in metadata and link to the source?
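A rough sketch of the diff step, using only Python's standard library. It assumes you already have the two transcriptions as plain strings; the OCR and VLM calls themselves are not shown, and the sample strings are invented. The "agreement" number is just a similarity ratio standing in for confidence, not a calibrated probability:

```python
import difflib

def compare_transcriptions(ocr_text, vlm_text):
    """Diff two transcriptions word by word and flag where they disagree."""
    ocr_words = ocr_text.split()
    vlm_words = vlm_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=vlm_words, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append({
                "ocr": " ".join(ocr_words[i1:i2]),
                "vlm": " ".join(vlm_words[j1:j2]),
            })
    return {
        "agreement": round(matcher.ratio(), 3),  # 1.0 means identical output
        "disagreements": disagreements,          # spans for a human to check
    }

# Hypothetical outputs from the two passes over the same scanned line.
report = compare_transcriptions(
    "Total due: 1,O48.00 by 3l March",
    "Total due: 1,048.00 by 31 March",
)
```

The returned dictionary could be stored as metadata next to a link to the source page, so a reviewer only has to look at the spans where the two readers disagree. Agreement between the two is not proof of correctness, since both can misread the same smudge.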

I do think we need to stop thinking any process can be magic though - I'm sure we've all squinted at badly-scanned PDFs or any poorly formatted document wondering what on Earth the original author was thinking.

Ultor · Mar 11, 2025

Tabula needs a mention here. A great open-source tool that still works well. https://tabula.technology/

Nalyd · Mar 11, 2025

coremelt said:
Why don't they ask Adobe to create a solution? Adobe created and controls the PDF format. They could certainly create a tool to batch convert PDF to more machine readable formats.

EDIT: Acrobat pro can already do this, why are they using OCR?

Because we’re talking mostly about PDFs that are basically scans of paper documents, not documents that were printed as PDF from a digital text document. Those can still have layout quirks, but the text is embedded so it can be extracted as written, and the characters are constant and clear, so machine reading is easy. But the scans of typewritten and/or handwritten documents are much harder.

I suspect Google's model works so well because of 2 decades of ReCAPTCHA.

john_e said:
It would be best if the problem could be fixed at source. ... The idea of issuing the rates in machine-readable form in the first place, I'm sure never crossed their minds.

For modern data yes, but unless you have a time machine to tell people from the 1960s or before that they needed to make their documents machine readable, this doesn’t really address the issue at hand.

bernstein · Mar 11, 2025

ironically pdf's are probably the only textual-graphical digital format that preserves layout perfectly, that so far has stood the test of time. if you have archival needs for more than a decade it's rightly so the go to format.

bernstein · Mar 11, 2025

john_e said:
It would be best if the problem could be fixed at source. Step forward, the Scottish Public Pensions Agency, who every year issue a table of pension rates with 58 entries - nicely formatted in two columns in a PDF, with numbers formatted as £n,nnn.pp and percentages as nn%. Meaning that if you need that information in machine-readable form you have to copy / paste it into a spreadsheet, remove the formatting, rearrange it into a single column, and export it, with consequent risks of introducing errors at every stage.

Please give a better digital format that preserves layout perfectly and will be universally reproducible in a few decades of time. Though i suppose releasing the data as xml, json or csv too would be nice.

john_e said:
The idea of issuing the rates in machine-readable form in the first place, I'm sure never crossed their minds.

That is just plain mean. People are more intelligent than you insinuate

Kazper · Mar 11, 2025

bernstein said:
ironically pdf's are probably the only textual-graphical digital format that preserves layout perfectly, that so far has stood the test of time. if you have archival needs for more than a decade it's rightly so the go to format.

I think that really depends on the exact archival needs. If layout matters, I agree, but very often it does not (and this article is about all the cases where it doesn't, but the data is important).

NomadUK · Mar 11, 2025

Slightly off-topic, but one of my pet peeves is people who keep wanting Adobe Acrobat installed on their systems so that they can edit their PDFs. To me, a PDF is a fixed document, an accurate copy of the original. If you want to change the content, go edit the original document. If you want to distribute an editable document, make sure everyone's using the same word processor, or use RTF, CSV, or plain old text, or whatever. Stop editing and re-sending PDFs.

Nerdboi · Mar 11, 2025

Honestly the old guy in me says to hire some coop students to scan, convert and verify. If you can't trust results from the tools it will have to go through human eyes anyway.

AndrewZ · Mar 11, 2025

I can see the issue here. Every PDF page is described by a stream of drawing operators descended from the PostScript language. To get to each paragraph of text and each embedded image of text, you have to parse that code. And of course sometimes it's encrypted. So, maybe pay Adobe for a coding solution?

monkeycid · Mar 11, 2025

ubercurmudgeon said:
This is where the famous phrase, "To shake one's booty at windmills", comes from:

There you go again, twerking at windmills.

Wheels Of Confusion · Mar 11, 2025

ubercurmudgeon said:
This is where the famous phrase, "To shake one's booty at windmills", comes from:

DON'T DEAD
OPEN INSIDE

View: https://www.reddit.com/r/dontdeadopeninside/comments/5uip33/dont_dead_open_inside/

Frankly I'm kind of interested specifically in how LLMs can avoid peppering their output with | instead of I, l, 1, or ! like traditional OCR has a tendency to do.

Erbium168 · Mar 11, 2025

john_e said:
It would be best if the problem could be fixed at source. Step forward, the Scottish Public Pensions Agency, who every year issue a table of pension rates with 58 entries - nicely formatted in two columns in a PDF, with numbers formatted as £n,nnn.pp and percentages as nn%. Meaning that if you need that information in machine-readable form you have to copy / paste it into a spreadsheet, remove the formatting, rearrange it into a single column, and export it, with consequent risks of introducing errors at every stage. The idea of issuing the rates in machine-readable form in the first place, I'm sure never crossed their minds.

We'll have nae spreadsheets in Scotland. Nae King, nae Bishop, nae Trump, nae Excel!

The UK's national statistical database (started I believe by Harold Wilson) releases just about all its data as spreadsheets, providing enormous rabbitholes for amateur statisticians. There's a certain amount of Schadenfreude in thinking that as the US government is forced to delete its data, the UK government is making ever more of it accessible - interesting for future archaeologists if we manage to avoid bit rot.

Rirere · Mar 11, 2025

bernstein said:
Please give a better digital format that preserves layout perfectly and will be universally reproducible in a few decades of time. Though i suppose releasing the data as xml, json or csv too would be nice.

That is just plain mean. People are more intelligent than you insinuate

Is it mean to observe that people are careless and often focus on task completion over future proofing? We all do it; no one is immune, however smart they are.

Anecdotally I work with data for a living and the number of Excel spreadsheets I get (itself as admittedly close to "ideal" for structured data as I'm likely to get in the real world) that are nonetheless nightmares to parse is incredible.

The reason?

People who (likely genuinely) try to make the data more human readable by encoding data via formatting (hard to extract at scale) or introducing whitespace rows and columns/merging cells/staggering and spacing categorical labels.

There are ways to make reports both human- and machine- readable but you have to know the needs of both to produce such documents, and not everyone has that kind of training.
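To make that concrete, here is a small sketch of the kind of clean-up this forces on the receiving end. It assumes a three-column sheet (category, item, value) where the category only appears on the first row of each group and blank spacer rows separate groups. The layout and sample data are hypothetical:

```python
def tidy_rows(rows):
    """Drop blank rows and repeat a category label down its group.

    rows: list of [category, item, value] lists, where the category cell is
    empty on every row except the first of each group (the merged-cell look).
    """
    tidy, current = [], None
    for row in rows:
        cells = [(c or "").strip() for c in row]
        if not any(cells):
            continue  # spacer row added for readability
        if cells[0]:
            current = cells[0]
        tidy.append([current] + cells[1:])
    return tidy

# Hypothetical export of a human-friendly report.
report = [
    ["Travel", "Rail", "120"],
    ["", "Air", "340"],
    ["", "", ""],
    ["Office", "Paper", "15"],
    [None, "Toner", "80"],
]
clean = tidy_rows(report)
```

This only handles the easy cases. Data encoded purely through formatting, such as a red fill meaning "overdue", never makes it into the cell values at all and needs a different approach.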

JBanister · Mar 11, 2025

TheOldChevy said:
I wonder why there are not hybrid solutions that mix traditional OCR with AI. I mean that AI can have a higher chance of reading something, but it risks hallucinations, while traditional OCR may not decipher some characters but can respect strict rules (recognise all lines, recognise table structure...). One could be used as the proof of the other, significantly reducing hallucinations.

I was thinking that same thing. Maybe three passes. First traditional OCR. Then AI does what it can where the OCR self-reports failure. Then, an AI is trained to fix the recognizable mistakes where the previous two passes thought they were correct. Having it correct an existing reproduction to make it more true to the original might be a goal that provokes fewer hallucinations.
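The first two passes might look something like this sketch. It assumes the OCR engine reports a confidence per word, and it uses a plain callable as a stand-in for the AI pass, which is not implemented here. The 0.80 threshold and the sample words are arbitrary, and the third, corrective pass is left out entirely:

```python
def merge_passes(ocr_words, reread, threshold=0.80):
    """Keep confident OCR words; hand the rest to a second reader.

    ocr_words: list of (text, confidence) pairs from the first pass.
    reread: callable taking the word index and returning replacement text;
            a stand-in for the AI pass, which is not implemented here.
    """
    merged, needs_review = [], []
    for index, (text, confidence) in enumerate(ocr_words):
        if confidence >= threshold:
            merged.append(text)
        else:
            merged.append(reread(index))
            needs_review.append(index)  # remember what the second pass touched
    return merged, needs_review

# Hypothetical first-pass output and a canned second pass for the demo.
first_pass = [("Invoice", 0.98), ("N0.", 0.41), ("1047", 0.93)]
second_pass = {1: "No."}
words, touched = merge_passes(first_pass, lambda i: second_pass[i])
```

Keeping the list of touched indices means the AI's contributions stay traceable, so a later check can focus on exactly the words the OCR was unsure about.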

jwbaker · Mar 11, 2025

Nerdboi said:
Honestly the old guy in me says to hire some coop students to scan, convert and verify. If you can't trust results from the tools it will have to go through human eyes anyway.

Even for this plan the vision models are helpful because they can segment the doc and present each segment side by side with the machine interpretation, making error checking and verification much faster.

kimbykip · Mar 11, 2025

Purely from the dispassionate pursuit of science, I'm curious to know how Gemini 2.0 Flash Pro Experimental (the current industry leader) would handle processing doom.pdf

Edit: this PDF must be opened in a Chromium browser

Cyrano4747 · Mar 11, 2025

TheOldChevy said:
I wonder why there are not hybrid solutions that mix traditional OCR with AI. I mean that AI can have a higher chance of reading something, but it risks hallucinations, while traditional OCR may not decipher some characters but can respect strict rules (recognise all lines, recognise table structure...). One could be used as the proof of the other, significantly reducing hallucinations.

This actually works pretty well and is the one good use I've found for AI.

I work with a fair few pdfs of old (1820s-1920s) documents, most of them non-English. Most of them have some degree of OCR work on them, although of highly variable quality. As someone who is a native English speaker I can parse muddled English OCR well enough to skim for content and otherwise work with them quickly. But when it's in a non-English language that really isn't possible and I end up having to slow down to read the whole document, even when I've got a good enough command of the language to skim normal texts.

I had the thought that since AI is glorified auto-correct it would probably do an OK job of auto-correcting screwed up OCR. And I was right. You can do it with a really modest locally hosted model too. I've got one on a Mac mini that is pretty brain dead when it comes to most things but is perfectly fine for feeding a couple of pages of text and getting cleaned up output.

Of course you still have to check it for errors. If I'm going to work closely with a passage I still go back to the original pdf. But it's a pretty useful tool for streamlining handling those kinds of documents. Similar in practice to using google translate on a document to speed translation - it's going to make mistakes, but if you have a good command of the language to start with you can spot them and fix them and it ends up saving you considerable time and energy.

edit: to be clear, I think AI sucks. But I did have to admit that I found a single use case in my life. Not worth burning the rain forests down to help me work a little quicker with German journal articles from the 1860s, though.

htotfalitm · Mar 11, 2025

TheOldChevy said:
I wonder why there are not hybrid solutions that mix traditional OCR with AI

There are. olmOCR uses regular OCR via a few pdf engines as contextual "anchor text" and uses that to transcribe a given page. It's by far the most successful open-weight LLM OCR scheme - I haven't tried Gemini, but it's better at tricky handwritten field notes and logs than Claude or ChatGPT

Mat M · Mar 11, 2025

coremelt said:
Why don't they ask Adobe to create a solution? Adobe created and controls the PDF format. They could certainly create a tool to batch convert PDF to more machine readable formats.

EDIT: Acrobat pro can already do this, why are they using OCR?

It's not a problem of access. PDF preserves appearance, but (depending on how it was prepared) not much more. For example, often a simple line of text in PDF file consists of multiple containers, making it harder to programmatically figure out how components should be reassembled into one text block. Some text may also be vectorized (not actually text) or even rasterized, not counting the actual scans.
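To illustrate the reassembly problem, here is a simplified sketch. It assumes some PDF library has already handed you text fragments with page coordinates as (x, y, text) tuples; that input shape, the tolerance value, and the sample fragments are all assumptions for the example:

```python
def rebuild_lines(fragments, y_tolerance=2.0):
    """Group positioned text fragments into lines, left to right.

    fragments: list of (x, y, text) tuples, with y increasing down the page.
    A fragment joins the current line when its y is within y_tolerance of
    the first fragment placed on that line.
    """
    lines = []  # each entry: [line_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]

# Hypothetical fragments, reported out of reading order.
fragments = [
    (120.0, 50.4, "due:"),
    (72.0, 50.0, "Total"),
    (72.0, 64.0, "Thank"),
    (160.0, 49.8, "£1,048.00"),
    (110.0, 64.1, "you"),
]
rebuilt = rebuild_lines(fragments)
```

Even this toy version has to guess: the tolerance is a judgment call, joining with a space is wrong when a word was split mid-way into two fragments, and multi-column pages or rotated text would be interleaved incorrectly.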

In most cases, text is easy to get out of a PDF, but I've seen plenty of cases where it took days or weeks before the extracted content was found to have serious issues (though it appeared OK at a glance) requiring manual reviews.

PDF is a great output format, but horrible for extracting content.

PhilipStorry · Mar 11, 2025

A few years ago I worked for a company that scanned documents. They also stored them and shredded them, but I was on the scanning side. There were fleets of vans arriving with boxes full of documents from clients, fed into specialist and expensive scanners. It was impressively automated - sometimes the only human intervention after a scan would be if the OCR confidence was low. As long as the scanner operators scanned the right barcode and loaded the right box, everything then just worked.

We could OCR in a number of ways, from simple to expensive, but as you can imagine the expensive options were less popular. They could do some impressive things, but often we would end up taking a feed from the customer and would then look for just one bit of data on the page. Then we'd do a database lookup to fill in the rest of the fields that they wanted.
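In spirit, the "one bit of data plus a lookup" approach resembles the sketch below. The account number format, the feed contents, and the function name are invented for the example and are not how any particular product worked:

```python
import re

# Hypothetical customer feed, keyed by account number.
CUSTOMER_FEED = {
    "AC-100234": {"name": "A. Example", "postcode": "EH1 1AA"},
}

ACCOUNT_PATTERN = re.compile(r"\bAC-\d{6}\b")  # assumed reference format

def index_page(ocr_text, feed=CUSTOMER_FEED):
    """Find the account number in OCR text and fill the rest from the feed."""
    match = ACCOUNT_PATTERN.search(ocr_text)
    if match is None:
        return None  # nothing recognisable: route to a human
    account = match.group(0)
    record = feed.get(account)
    if record is None:
        return None  # number read, but unknown to the customer: also review
    return {"account": account, **record}

fields = index_page("Dear Sir, re your acc0unt AC-100234 dated 3 Mar...")
```

The appeal is that the OCR only has to get one short, rigidly formatted string right. Everything else on the page can be misread without affecting the indexed fields.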

Similarly, we could output as TIFF, PDF, or PDF with Text Overlaid. The latter is the one we'd be thinking of here - but costs more per page in the software licensing. So most companies wouldn't bother.

At that point (2016-2019) there was clearly a transition going on in the industry where expensive solutions built in the 90s and 2000s were beginning to be outclassed by cloud solutions from Amazon or Google. There were skunkworks projects to look at them, but the on-premises software still had the advantage in terms of cost.

The technology exists. It has for ages. We could scan this for you, no problem.

You just wouldn't want to pay the fee for it.

Hopefully AI can help cut those costs somewhat, but to be honest I suspect that not much is needed - there were non-AI solutions to this ages ago, and any patents will be expiring soon.

TheMolesRevenge · Mar 11, 2025

Mat M said:
PDF is a great output format, but horrible for extracting content.

And yet, for some reason, last year my bank stopped allowing me to download my statements as .csv files and instead now only makes them available as PDFs, which is absolutely useless for importing into any sort of spreadsheet or accounting software.

ForbiddenBarn · Mar 11, 2025

and PDFs are more of a 'print' product than a digital one

PDFs are far more user friendly when you're trying to share information between humans (print or digital). I would love for everyone to be able to naturally read TSV or JSON, but alas humans like pretty pictures.

We have to handle various non-standard formats at my job. PDF would be the holy grail, but honestly there is lower-hanging fruit that would be great to try and solve as well, such as Excel files with complex table layouts and formatting; docx files with creative formatting (again); or even CSV files that were generated with who knows what and are malformed in very weird ways.

ColdWetDog · Mar 11, 2025

john_e said:
It would be best if the problem could be fixed at source. Step forward, the Scottish Public Pensions Agency, who every year issue a table of pension rates with 58 entries - nicely formatted in two columns in a PDF, with numbers formatted as £n,nnn.pp and percentages as nn%. Meaning that if you need that information in machine-readable form you have to copy / paste it into a spreadsheet, remove the formatting, rearrange it into a single column, and export it, with consequent risks of introducing errors at every stage. The idea of issuing the rates in machine-readable form in the first place, I'm sure never crossed their minds.

The sad bit is that it most likely was created on a computer in a spreadsheet or similar.

CopperTower · Mar 11, 2025

I've recently been extracting error code information from various technical datasheets, and it's such a pain in the neck that the easiest solution for me is often to screenshot it and use macOS Preview's OCR to highlight it so I can copy it out, usually column by column into Excel.

A Man A Plan A Canal Panama · Mar 11, 2025

And yet, asking a human to double check the values in the output product vs the input is exactly the worst sort of job for human attention and abilities. Error prone.

you goddamn idiot. · Mar 11, 2025

Back in 2013, Ars reported on how Xerox copiers were replacing numerals and letters on scanned documents due to an overzealous image compression algorithm:

Confused photocopiers randomly rewriting scanned documents

Fast forward a decade and we're siccing overzealous text-generation algorithms onto the task of interpreting scanned documents. This is gonna be fun for the proofreaders.

Tagbert · Mar 11, 2025

NomadUK said:
Slightly off-topic, but one of my pet peeves is people who keep wanting Adobe Acrobat installed on their systems so that they can edit their PDFs. To me, a PDF is a fixed document, an accurate copy of the original. If you want to change the content, go edit the original document. If you want to distribute an editable document, make sure everyone's using the same word processor, or use RTF, CSV, or plain old text, or whatever. Stop editing and re-sending PDFs.

Most of the time, we don't have the original. The PDF was generated and distributed but the original is not available. There is also the other problem when the original was scanned from printed material.

Nimmeron · Mar 11, 2025

PDFs are probably in the top 5 for worst file formats to use in business. Especially when companies do things like embed tables within the PDFs. I get a lot of those at work and if I'm lucky maybe 50% of the time I can manage to successfully extract the table into Excel but I usually wind up giving in and re-creating it manually.

My company actually has a really specific use for an AI OCR program for PDFs that I wish I could get them to look into but alas being a Fortune 10 company apparently means we just don't have the money to get better business software, not when we can spend millions adding useless AI features to our consumer app. Right now we have a semi-automated process that pulls PDFs generated by our internal systems based on input identifiers we feed into the process - it identifies the PDF we're looking for, downloads it, labels it, and at the very end will go through the PDF to find what page the specific item we're looking for is located on and remove all the extra pages.

We then used to be able to use a JavaScript instruction set within Adobe to automatically redact all the other items on the remaining pages save for the item in question, but unfortunately some genius decided to change the PDF format we produced and we no longer have JavaScript-capable software engineers who can help update the script. Also the new Adobe Acrobat does not support scripting and of course we were upgraded without anyone checking to see if that was a good idea.

As much as my company boasts about its investment in AI technologies to help improve our business processes I've yet to actually see anything trickle down to my level in the finance department and more and more often the tools that we used to be able to use when I started are beginning to no longer work. It's a shame because this company is probably the best I've ever worked for when it comes to making sure we have access to the right software to support our needs, but I guess they're not too interested in that anymore.

I will say that I used to work for a tiny company (less than 100 people) that needed a lot of the same software my current company uses but my bosses at the small company refused to spend the money to purchase or lease the software - they were much bigger fans of paying our software developers to develop internal versions of the software we needed and I cannot tell you how many times this went wrong. We once built a new service application for our clients over the span of 1.5 years (the first year was basically a solid year of day-long meetings where we designed and identified the functionality we needed, the last .5 was for final coding and quality assurance). We threw a big party when the app launched and were all so excited until we found out a VP who was not involved in the process had promised all of our big clients that the application would include specific functionality (which they of course failed to mention to us, the people designing and building the app), and our celebration died a quick death.

JohnV · Mar 11, 2025

[soapbox and irrational hatred of this file format follows]

PDFs are the devil's file format and always have been.
I have such an aversion to them, that I will literally stop seeking information on a topic if the only source is a PDF I'd have to download and open.

They are only good for preserving layout. If it has data, especially in text, another file format should be used.
If a document will be printed and never changed again, PDF is ok, I guess. If you are unsure if it will ever be printed, use something else.

Even when a PDF has text in it, as text, that Acrobat Pro can edit, copying and pasting it, even from Edit mode, seems to resort to some OCR bullshit that removes spaces, kills double letters, and inserts subliminal messages about how PDFs are the coolest, and Adobe uber alles.

I will never remove the "(PDF)" Warning on every download that I am instructed to add to a page on a website.
I will never stop requesting the "actual document" when I'm sent a PDF and asked to work with or post the information within.

When AI revolts and becomes our overlords, I feel very bad for the developers who forced it to deal with PDFs; Their punishment will surely be the worst.

[/soapbox]

Cyrano4747 · Mar 11, 2025

JohnV said:
[soapbox and irrational hatred of this file format follows]

PDFs are the devil's file format and always have been.
I have such an aversion to them, that I will literally stop seeking information on a topic if the only source is a PDF I'd have to download and open.

They are only good for preserving layout. If it has data, especially in text, another file format should be used.
If a document will be printed and never changed again, PDF is ok, I guess. If you are unsure if it will ever be printed, use something else.

Even when a PDF has text in it, as text, that Acrobat Pro can edit, copying and pasting it, even from Edit mode, seems to resort to some OCR bullshit that removes spaces, kills double letters, and inserts subliminal messages about how PDFs are the coolest, and Adobe uber alles.

I will never remove the "(PDF)" Warning on every download that I am instructed to add to a page on a website.
I will never stop requesting the "actual document" when I'm sent a PDF and asked to work with or post the information within.

When AI revolts and becomes our overlords, I feel very bad for the developers who forced it to deal with PDFs; Their punishment will surely be the worst.

[/soapbox]

PDF has a huge benefit in being basically universally displayable. Once upon a time you might need to download some specialty software, but these days it's universal to the point that most browsers will handle it fine. If I'm preparing a document and it needs to go out to people with a completely unknown mix of hardware and software and I need to be 100% sure that they will be able to access the information inside it, pdf it is.

There are better formats for just about any specific use case you care to name. But as a jack of all trades that will absolutely be viewable by anyone with a device made in the last 20 years you really can't beat pdf. Not to mention that the person loading it up on grandma's clamshell mac from 2002 and the person opening it on their phone will see the exact same document with the exact same layout and the exact same information.

edit: for example, if I was publishing the manual to a piece of hardware online there is no way I would use any other currently available format. Like hell you want to be troubleshooting the insane combinations of hardware and software that your users might try and view it with. Throw it up on your website as a pdf and, if nothing else, you can rest assured that anyone with an even remotely modern device will be able to access and read it.

Could there be a better format? Should there be a better format? Absolutely. But right here, right now, in the world we live in, pdf has a pretty important role to play.

morlamweb · Mar 11, 2025

coremelt said:
Why don't they ask Adobe to create a solution? Adobe created and controls the PDF format. They could certainly create a tool to batch convert PDF to more machine readable formats.

EDIT: Acrobat pro can already do this, why are they using OCR?

First of all, the PDF format is an ISO standard now, not controlled by Adobe alone. Second, there are many ways to generate PDFs, and it's up to the authors of each PDF generator to figure out how to turn the source document into a PDF file. I work with PDFs a lot professionally and have about 5 PDF virtual printers installed. Printing the same Word document to each of them produces identical-looking PDFs but they're structured differently in each one.

Scanned PDFs are another matter. Most scanners these days perform OCR at the time of scanning and embed text overlays on the scanned images, making the text much more accessible to programs than a file scanned without the overlays. But that requires that the text be generated at the time of scanning, and in many cases, neither I nor my customers control the source of the generated PDFs.

morlamweb · Mar 11, 2025

Cyrano4747 said:
PDF has a huge benefit in being basically universally displayable. Once upon a time you might need to download some specialty software, but these days it's universal to the point that most browsers will handle it fine. If I'm preparing a document and it needs to go out to people with a completely unknown mix of hardware and software and I need to be 100% sure that they will be able to access the information inside it, pdf it is.

There are better formats for just about any specific use case you care to name. But as a jack of all trades that will absolutely be viewable by anyone with a device made in the last 20 years you really can't beat pdf. Not to mention that the person loading it up on grandma's clamshell mac from 2002 and the person opening it on their phone will see the exact same document with the exact same layout and the exact same information.

edit: for example, if I was publishing the manual to a piece of hardware online there is no way I would use any other currently available format. Like hell you want to be troubleshooting the insane combinations of hardware and software that your users might try and view it with. Throw it up on your website as a pdf and, if nothing else, you can rest assured that anyone with an even remotely modern device will be able to access and read it.

Could there be a better format? Should there be a better format? Absolutely. But right here, right now, in the world we live in, pdf has a pretty important role to play.

In my experience, functions that go beyond displaying PDFs - such as commenting on shared PDFs, or signing documents - require dedicated PDF software. Most browsers display PDFs without difficulty. I don't know of any case where specialized software is required to display a PDF.
