How to create a PDF with scanned pages but selectable text?

Today I received a PDF from our supplier that contained several printed and scanned pages with signatures etc. I opened it in Acrobat Reader DC and, to my surprise, the text in the evidently scanned images could be selected and copied as text. See the screenshot:

[Screenshot: scanned PDF with selectable text]

There is evidently some OCR behind this, since the copied text contains mistakes. But how is this possible? I have never seen this before. How can such a PDF be created?

Answer

This has (contrary to some other answers here) most probably nothing to do with Acrobat at all.

Most (all?!) professional document scanners and most semi-professional ones will automatically perform OCR when you choose “Save as PDF” and have the “searchable” checkbox ticked in the settings. The cheaper “consumer grade” models do the OCR on the attached PC, while typical network scanners do it internally.

The word “searchable” means nothing more and nothing less than that the scanner will perform OCR, then generate a page with the scanned bitmaps within, and overlay them with invisible characters from the OCR, each placed over the respective character on the bitmap.

That way, you can search, and also select, copy, and paste the “bitmap” as if by magic. It’s no magic at all, however. In reality, you’re just copying invisible text.
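The invisible overlay described above can be sketched in a few lines. This is only an illustration of the idea, not what any particular scanner does: it builds the content stream for a single PDF page that first paints the scanned bitmap and then places the OCR words on top using text rendering mode 3 (`3 Tr`), which the PDF format defines as "neither fill nor stroke", i.e. invisible. The `OcrWord` type, the resource names `/Im0` and `/F1`, and the sample coordinates are all assumptions made for the example; a real file would also need the image, font, and page objects around this stream.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One recognised word (hypothetical structure for this sketch)."""
    text: str
    x: float      # left edge, in PDF points
    y: float      # baseline, in PDF points
    size: float   # font size, in PDF points


def escape_pdf_string(s: str) -> str:
    # Backslashes and parentheses must be escaped inside a PDF literal string.
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_layer(words) -> str:
    # "3 Tr" selects text rendering mode 3: the glyphs are positioned
    # but never painted, so the text is selectable yet invisible.
    ops = ["BT", "3 Tr"]
    for w in words:
        ops.append(f"/F1 {w.size:g} Tf")             # assumed font resource /F1
        ops.append(f"1 0 0 1 {w.x:g} {w.y:g} Tm")    # move to the word's position
        ops.append(f"({escape_pdf_string(w.text)}) Tj")
    ops.append("ET")
    return "\n".join(ops)


def page_content(page_width: float, page_height: float, words) -> str:
    # Paint the scan (assumed image resource /Im0) scaled to the full page,
    # then lay the invisible OCR text over it.
    image = f"q {page_width:g} 0 0 {page_height:g} 0 0 cm /Im0 Do Q"
    return image + "\n" + invisible_text_layer(words)


if __name__ == "__main__":
    words = [OcrWord("Invoice", 72, 700, 12), OcrWord("(copy)", 130, 700, 12)]
    print(page_content(595, 842, words))
```

A viewer that renders this page shows only the bitmap, while selection and search operate on the hidden words, OCR mistakes included.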

The scanner may also do some additional magic, such as compositing the large image from many small tiles and reusing tiles that look alike. This results in a much smaller document than would otherwise be possible, but may also lead to funny surprises (not so funny if they happen to you!) such as the “Xerox alters your bills” story, which, ironically, can happen even when no OCR is done, depending on the firmware.
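To see how tile reuse can change what a page says, here is a toy sketch. It is not any scanner's actual firmware algorithm; the 3x5 glyphs, the pixel-difference measure, and the `max_diff` threshold are all assumptions chosen to keep the example small. Each tile is stored only once, and any later tile that is "close enough" to a stored one is replaced by a reference to it.

```python
def pixel_diff(a: str, b: str) -> int:
    """Number of pixels in which two equally sized tiles differ."""
    return sum(x != y for x, y in zip(a, b))


def compress_tiles(tiles, max_diff):
    """Store each distinct-looking tile once; similar tiles reuse an entry."""
    dictionary = []   # unique tiles kept in the file
    indices = []      # one dictionary index per tile on the page
    for tile in tiles:
        for i, known in enumerate(dictionary):
            if pixel_diff(tile, known) <= max_diff:
                indices.append(i)          # reuse the look-alike tile
                break
        else:
            dictionary.append(tile)        # nothing similar yet: store it
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decompress(dictionary, indices):
    return [dictionary[i] for i in indices]


# Tiny 3x5 bitmaps of the digits 8 and 6; they differ in a single pixel.
EIGHT = "".join(["###", "#.#", "###", "#.#", "###"])
SIX = "".join(["###", "#..", "###", "#.#", "###"])

if __name__ == "__main__":
    page = [EIGHT, SIX, EIGHT]
    for max_diff in (0, 1):
        dictionary, indices = compress_tiles(page, max_diff)
        restored = decompress(dictionary, indices)
        print(max_diff, len(dictionary), restored == page)
```

With `max_diff=0` the page is restored exactly. With `max_diff=1` only one tile is stored, the file shrinks, and the 6 comes back as an 8 without any warning.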

Attribution
Source: Link, Question Author: Vojtěch Dohnal, Answer Author: Lightness Races in Orbit