Searching Scanned Documents

CJE-4D · Post by **CJE-4D** » Fri Jun 03, 2022 5:57 pm

Quite a lot of useful info on Acorns is available online as scans, but not being able to search them is a pain. I think I've found a solution.

I was checking out a spare Acorn transformer we have here so I googled the part number:

0194 012 acorn

The first result is a PDF at http://classicacorn.computinghistory.or ... 020trm.pdf

Google actually displays:

Acorn A3010/A3020/A4000 Technical Reference Manual
http://classicacorn.computinghistory.org.uk › ...
PDF
This manual describes the Acorn A3010, Acorn A3020 ... in the Network Expansion Specification (Acorn Part No. ... 0194,012 TANSFMA 25VA 240VAC 2R FX.

This was very interesting, because the PDF itself is not searchable when you view it: it is just scanned pages. So it seems Google has OCRed the PDF and indexed the text. Thank you Google.

I've even found that it works with the option "site:"

Site:http://classicacorn.computinghistory.or ... 020trm.pdf 0194,012

I hope this info might be useful, unless of course you all knew of this method or have a better one!
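For anyone who wants to script the "site:" trick above, here is a minimal sketch that builds the search URL. The function name `google_site_query` is made up for illustration, and dropping the leading `http://` is an assumption about how the operator is normally written, not something tested in this thread.

```python
from urllib.parse import urlencode, urlsplit


def google_site_query(site_or_url: str, terms: str) -> str:
    """Build a Google search URL restricted to one site or URL prefix.

    Hypothetical helper: it only assembles a query string, it does not
    fetch or parse any results.
    """
    parts = urlsplit(site_or_url)
    # Assumption: "site:" is usually written without the scheme,
    # so strip "http://" or "https://" when one is present.
    target = (parts.netloc + parts.path) if parts.scheme else site_or_url
    query = f"site:{target} {terms}"
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    # Host taken from the search result quoted above; part number as indexed.
    print(google_site_query("http://classicacorn.computinghistory.org.uk", "0194,012"))
```

Paste the printed URL into a browser; a full PDF address can be passed in place of the bare host to narrow the search to that one document.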

scruss · Post by **scruss** » Sat Jun 04, 2022 6:11 pm

That's helpful, but it's a shame these scans don't have searchable OCR data in the first place. Especially since the document you linked to had been created in OmniPage - a very capable OCR package - but seemingly set to be deliberately unsearchable!

I put it up on Internet Archive as a new item here: Acorn A3010/A3020/A4000 Technical Reference Manual. Give it a bit of time (at least until the "this item is currently being modified/updated by the task: book_op" message goes away) and it should be fully browsable.

fuzzel · Post by **fuzzel** » Sat Jun 04, 2022 9:35 pm

I've noticed that if I do a search in my work copy of Outlook (Office 365), say for a person or an amount, the search results will pick up results, not only from the email message itself, but also from within Word, Excel and pdf documents and it does it very quickly too. It must OCR and index as soon as an email arrives or is sent for future reference. Maybe I should get a home version and email all my documents, magazines, manuals etc to myself?

paulb · Post by **paulb** » Sat Jun 04, 2022 10:07 pm

fuzzel wrote: ↑Sat Jun 04, 2022 9:35 pm I've noticed that if I do a search in my work copy of Outlook (Office 365), say for a person or an amount, the search results will pick up results, not only from the email message itself, but also from within Word, Excel and pdf documents and it does it very quickly too. It must OCR and index as soon as an email arrives or is sent for future reference. Maybe I should get a home version and email all my documents, magazines, manuals etc to myself?

It probably indexes everything, yes. These cloud services are probably rather aggressive when it comes to opportunistic data aggregation, for dubious commercial purposes if not outright surveillance. The OCR bit would only apply to documents that do not already have the text within them in machine-readable form. So, unless the PDFs only contained bitmaps, there might well be no OCR performed at all: most sanely-produced PDF documents would at least include textual data. However, PDF being closer to a page description format than a structured information format (like HTML, SGML or XML), the art of obtaining the depicted text can be a tricky one.

I suppose the insight in the original post was that the PDFs were just scanned bitmaps packaged up into a PDF container, but that Google routinely performs OCR on PDFs. I assumed that the PDFs concerned already had embedded OCR data in them (that is the case for many of the Chris's Acorns scans), and that this OCR data is being exposed to search engines, but upon investigation it seemed not to be the case (these scans coming from Classic Acorn).
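A rough way to check whether a given PDF is "just scanned bitmaps" or already carries a text layer is to try extracting text page by page and see what comes back. The sketch below assumes the third-party `pypdf` package is installed; the helper names and the 20-character threshold are arbitrary choices for illustration, and a page with very little real text would be misreported as a scan.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is (nearly) empty.

    page_texts: one string (or None) per page, in page order.
    min_chars:  arbitrary threshold below which a page counts as "no text".
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(pdf_path):
    """Extract the text layer of every page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # imported here so the helper above has no dependency

    return [page.extract_text() for page in PdfReader(pdf_path).pages]


if __name__ == "__main__":
    import sys

    # Usage: python check_text_layer.py some_manual.pdf
    missing = pages_without_text(extract_page_texts(sys.argv[1]))
    print("Pages with no usable text layer:", missing or "none")
```

If every page is listed, the file is almost certainly image-only and any search hits must come from OCR done elsewhere.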

scruss · Post by **scruss** » Sun Jun 05, 2022 9:53 pm

Windows and Mac OS index everything (and I am very okay with that) but I thought they stopped short of OCR. Google put a fair bit of work into the Tesseract OCR engine so that even scanned PDFs could become search results. That's the engine behind the scanner in Google Drive on mobile devices: scan a picture, it ends up as text.

Iggypop · Post by **Iggypop** » Sun Jun 05, 2022 10:53 pm

You can make any PDF searchable with NAPS2 for Windows. Set up OCR for the correct language, then simply drop the scanned PDF onto the application window and it will be imported. Then simply save the PDF, and you can search, copy and select text in the new PDF document.

Igor

scruss · Post by **scruss** » Tue Jun 07, 2022 12:31 am

I don't use Windows, but I use OCRmyPDF on Linux and Mac. It's a very simple command line program, but it will hammer every last core on your computer by processing pages in parallel. Hence, it's extremely fast but it does make your computer very hot.

The Acorn-A3020-Technical_Reference_Manual.pdf I uploaded to Internet Archive was run through OCRmyPDF.
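For batch jobs, the OCRmyPDF command line can be driven from a small script. This is only a sketch, not the exact invocation used for the upload above: the helper names and file names are made up, and it assumes `ocrmypdf` is on the PATH. The `--jobs` option caps how many pages are processed in parallel, which helps if you would rather not cook the CPU.

```python
import shutil
import subprocess


def ocrmypdf_command(src, dst, language="eng", jobs=None, skip_text=False):
    """Build an ocrmypdf argument list (hypothetical helper, does not run anything)."""
    cmd = ["ocrmypdf", "--language", language]
    if jobs is not None:
        # Limit parallel workers instead of using every core.
        cmd += ["--jobs", str(jobs)]
    if skip_text:
        # Leave pages that already have a text layer untouched.
        cmd.append("--skip-text")
    cmd += [src, dst]
    return cmd


def run_ocr(src, dst, **options):
    """Run ocrmypdf, failing early with a clear message if it is not installed."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found on PATH")
    subprocess.run(ocrmypdf_command(src, dst, **options), check=True)


if __name__ == "__main__":
    # Placeholder file names; prints the command rather than running it.
    print(" ".join(ocrmypdf_command("scanned.pdf", "searchable.pdf", jobs=2)))
```

Call `run_ocr("scanned.pdf", "searchable.pdf", jobs=2)` to actually process a file.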

stardot.org.uk
