Scanning advice needed for A4 books

flaxcottage
Posts: 5717
Joined: Thu Dec 13, 2012 8:46 pm
Location: Derbyshire


Post by flaxcottage »

There are a number of books and magazines awaiting scanning at the Educational Archive. These are glue-bound, so they will not open out flat on my normal scanner. The items are valuable and so cannot be cut apart to feed the pages into the scanner.

In addition, a number of these items have text that starts/ends about 6mm (1/4" in real money) from the binding. This prevents the pages being completely scanned if the book is hung with one page flat on the scanner and the rest hanging down at 90 deg.

There seem to be a couple of options available.

1. Buy one of the overhead book scanners, such as the Czur ET-24 Pro or the Canon Iris 6 Business.

2. Take photos with my camera, upload the images to my PC and OCR them using appropriate software.

Is there anyone out there in forumland who uses an overhead scanner and has had success scanning an A4 book accurately?

Alternatively, does anyone use OCR software that can work accurately on camera images?

About 4 years ago I tried the Czur ET-16 overhead scanner and found it to be just an expensive toy, giving very poor output and using software that just did not work. It was returned as being not fit for purpose. I also investigated 'industry standard' OCR to PDF software and found that lacking too, hence the two questions.
- John

Check out the Educational Software Archive at www.flaxcottage.com
dr_d_gee
Posts: 29
Joined: Thu Feb 01, 2024 6:53 pm

Re: Scanning advice needed for A4 books

Post by dr_d_gee »

The main problem in dealing with camera images will be the distortion of text near the binding — assuming you can hold the camera in an appropriate position and light the work adequately.

I'm not sure what you mean by industry standard software, what you've tried or what operating system you are using. I haven't tried any recent version of OmniPage. I downloaded a trial version of ABBYY FineReader for Mac but was unimpressed with its accuracy; it was not as good as the free OCR program Tesseract. PDF OCR on the Mac uses Tesseract; there is a (free) community edition and a Professional Edition.

On the Windows platform I've tried programs that use Microsoft's OCR (which is said to be a licensed older version of OmniPage). It does not perform as well as Tesseract. It has a particular tendency to miss out entire words in otherwise correctly recognised text.

The main issue with Tesseract is that it can have difficulty with light text on a dark background.
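A common workaround for light-on-dark text is to invert the image before handing it to the OCR engine. The following is only a minimal sketch in plain Python, assuming the page is already available as a non-empty 8-bit greyscale grid (rows of values from 0 to 255). The function names and the threshold of 128 are illustrative assumptions, not part of Tesseract; in practice an imaging library would do the pixel work.

```python
def mean_brightness(pixels):
    """Average grey level of a non-empty 8-bit greyscale image (rows of ints)."""
    total = sum(sum(row) for row in pixels)
    count = sum(len(row) for row in pixels)
    return total / count


def invert(pixels):
    """Return a negative of the image: dark becomes light and vice versa."""
    return [[255 - value for value in row] for row in pixels]


def prepare_for_ocr(pixels, threshold=128):
    """Invert the page if it is mostly dark.

    Assumption: the background covers most of the page, so a low mean
    brightness suggests light text on a dark background.
    """
    if mean_brightness(pixels) < threshold:
        return invert(pixels)
    return pixels
```

This only handles a page that is dark throughout; a page mixing dark panels with normal text would need the panels treated separately.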

The best OCR I have come across, by some margin, is TextSniper on the Mac. However, it works by taking a screenshot of an image displayed on the computer screen and copying the text to the clipboard. It does not provide the PDF output you require. It can accurately OCR text that Tesseract cannot recognise at all, including light on dark and text on patterned backgrounds. However, it is completely thrown by drop caps, which Tesseract handles without issue.

Tesseract has an unusual history. It was originally developed by Hewlett-Packard in Bristol, but work stopped in about 1998. Much more recently HP released it as open source and Google took over its development. It needs to be at least version 4; version 3 is much less accurate. In its native state it is a command line program that produces a text file from an image. I'm not aware of any program on Windows that uses Tesseract and produces PDF output, though one may exist.
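For what it is worth, recent versions of the command line program can write a searchable PDF themselves when `pdf` is given as the output config, so a separate front end may not be needed. A minimal sketch of driving it from Python follows, assuming the `tesseract` executable is installed and on the PATH. The helper names and the example file name are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_pdf_command(image_path, lang="eng"):
    """Build the command: tesseract <image> <output base> -l <lang> pdf."""
    image = Path(image_path)
    output_base = image.with_suffix("")  # Tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_to_pdf(image_path, lang="eng"):
    """Run Tesseract on one image and return the path of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_pdf_command(image_path, lang), check=True)
    return Path(image_path).with_suffix(".pdf")


if __name__ == "__main__":
    # Hypothetical file name, used only to show the command that would run.
    print(tesseract_pdf_command("page_001.png"))
```

Calling `ocr_to_pdf("page_001.png")` would leave `page_001.pdf` next to the image, with the recognised text layered under the page picture so that it can be searched.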
flaxcottage
Posts: 5717
Joined: Thu Dec 13, 2012 8:46 pm
Location: Derbyshire

Re: Scanning advice needed for A4 books

Post by flaxcottage »

Thanks for the info.

The OCR stuff I tested was Adobe Acrobat, ABBYY, OmniPage and another; I think it was Carrera or something like that.

Yes, the curvature of the page does throw OCR software. With a bit of effort I can get a flat photo. I tried that with a free online OCR and it was pretty accurate. The snag is it was soooo slow.
- John

dr_d_gee
Posts: 29
Joined: Thu Feb 01, 2024 6:53 pm

Re: Scanning advice needed for A4 books

Post by dr_d_gee »

Tesseract-based solutions are "reasonably" fast, say 15 seconds for an A4 page. With the online program, you will be uploading the file to their servers and that will take time (on most home connections uploading is slower than downloading). TextSniper is almost instantaneous but it doesn't produce PDF output.
flaxcottage
Posts: 5717
Joined: Thu Dec 13, 2012 8:46 pm
Location: Derbyshire

Re: Scanning advice needed for A4 books

Post by flaxcottage »

For flat images I found onlineocr.net very useful. That enabled me to compile a searchable PDF of the Micro-Scope magazine Issue 1 that I was missing. It is pretty accurate, though sometimes thrown by diagrams.
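For anyone wanting to try the same job offline, a whole issue can be compiled in one go with Tesseract instead of an online service. This is a different tool from onlineocr.net, so treat it as a rough sketch only: it assumes Tesseract's documented behaviour of accepting a text file that lists one image per line and writing a single multi-page PDF. The helper names and file names are hypothetical.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key so that 'page_2.png' comes before 'page_10.png'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def write_page_list(image_names, list_path):
    """Write the image names, in page order, one per line; return that order."""
    ordered = sorted(image_names, key=natural_key)
    Path(list_path).write_text("\n".join(ordered) + "\n", encoding="utf-8")
    return ordered


def multipage_pdf_command(list_path, output_base, lang="eng"):
    """Build the command: tesseract <list file> <output base> -l <lang> pdf."""
    return ["tesseract", str(list_path), str(output_base), "-l", lang, "pdf"]
```

The natural sort matters because a plain alphabetical sort would put page 10 ahead of page 2 and scramble the finished PDF.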

I have researched this topic quite a lot over the past few days and have bitten the bullet and ordered a Czur ET-24 scanner. The distributors, D&H Innovations, were very helpful. They were able to offer me a used version at a substantial discount, as they were of the opinion that I had a 50:50 chance of scanning the books I had and that returning the used version would be better for them as well as me if it did not work.

I really hope it does work because it will make archiving so much easier. Fingers crossed. [-o<
- John
