OCR of magazine listings

This is where we keep track of who is scanning what, so we can avoid duplicating work and also define conventions/standards etc
Moderators: DaveH, andrew_rowland, Dave_E, bfoley, sabre150, Moderators
SimonSideburns
Posts: 656
Joined: Mon Aug 26, 2013 9:09 pm
Location: Purbrook, Hampshire

Post by SimonSideburns »

I'm in a few Facebook groups in the Speccy scene, and I was musing about the possibility of OCRing listings from magazines to avoid typing them in by hand. It may be of interest to members here too, so I thought I'd mention it.

I have tried a couple of free OCR programs for Windows but they have been totally hopeless. They assume the page being scanned is in a particular written language and not a computer listing. Also they do struggle to cope with the many styles of computer printout that these listings have been printed in.

One of my scanners, part of a multi-function device (an A3 Brother MFC-J6910DW), has a multi-page scanning facility. I separated out all the pages of the magazine (Best of Sinclair Programs '84), fed the whole lot into the multi-page feeder and tried to scan the pages into separate files (one picture per page or something), but my laptop didn't seem to be happy with that and crashed, telling me it had run out of memory without saving a single image. Looks like I might have to scan each page one at a time. So much for that!

The vocabulary of a listing (apart from variable names) is going to be limited to keywords, symbols and possibly graphical characters, so we would need to find some way of training the OCR software to recognise those, but the problem is finding (or writing) OCR software that learns as it goes along and is 'happy' working in this manner.
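
To illustrate the limited-vocabulary idea, here is a minimal sketch of a clean-up pass that could run over the text an OCR program produces, rather than training the OCR engine itself. Everything in it is an assumption for illustration: the keyword list is only a small subset of Sinclair BASIC, the digit/letter look-alike table is a guess at typical misreads, and the function names are made up.

```python
import difflib
import re

# Assumption: a small illustrative subset of Sinclair BASIC keywords.
KEYWORDS = [
    "PRINT", "INPUT", "GOTO", "GOSUB", "RETURN", "NEXT", "THEN", "STEP",
    "POKE", "PEEK", "BORDER", "PAPER", "RANDOMIZE", "CLS", "LET", "FOR",
    "DIM", "REM", "TO", "IF",
]

# Assumption: digits that OCR commonly confuses with letters.
LOOKALIKES = str.maketrans({"0": "O", "1": "I", "5": "S"})
TOKEN = re.compile(r"[A-Za-z0-9]+")


def snap_keyword(token, cutoff=0.75):
    """Return the keyword a token most likely stands for, else the token."""
    if not any(ch.isalpha() for ch in token):
        return token  # pure numbers (line numbers, constants) are left alone
    candidate = token.upper().translate(LOOKALIKES)
    if candidate in KEYWORDS:
        return candidate
    # Fall back to fuzzy matching for other single-character misreads.
    close = difflib.get_close_matches(candidate, KEYWORDS, n=1, cutoff=cutoff)
    return close[0] if close else token


def correct_line(line):
    """Fix keyword-like tokens outside string literals in one listing line."""
    parts = line.split('"')
    # Even-numbered parts lie outside double-quoted strings.
    for i in range(0, len(parts), 2):
        parts[i] = TOKEN.sub(lambda m: snap_keyword(m.group(0)), parts[i])
    return '"'.join(parts)


print(correct_line('10 PR1NT "HELL0 W0RLD"'))
```

The obvious weakness is that a variable name which happens to resemble a keyword would be "corrected" too, so the output would still need proofreading against the scan.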

I'm looking for recommendations, suggestions, solutions, etc.
Just remember kids, Beeb spelled backwards is Beeb!
sydney
Posts: 2925
Joined: Wed May 18, 2005 10:09 am
Location: Newcastle upon Tyne

Post by sydney »

Could you load the games from the cover disc images and then print them out?
sydney
Posts: 2925
Joined: Wed May 18, 2005 10:09 am
Location: Newcastle upon Tyne

Post by sydney »

That wasn't as helpful as I thought. Feel free to ignore it, as I'd just woken from a nap.
1024MAK
Posts: 12800
Joined: Mon Apr 18, 2011 5:46 pm
Location: Looking forward to summer in Somerset, UK...

Post by 1024MAK »

Erm, what cover disk?
Only the later magazines had cover disks, and a lot of home computer magazines never had them at all. True, some had cover tapes. But even where these exist, that still leaves vast numbers of magazines where only the printed listing was published.

When I try OCR on documents where either the characters are different to what it expects, or the wording is different to what it expects, it makes a complete hash of it.

As Simon indicates, the listings were printed in various typefaces and sizes, and sometimes as a dot matrix printout. Some magazines liked to jazz them up (different colours for both the type and the background). Or worse, print them on top of a picture.

All of which makes OCR recognition a nightmare :(

Even a human eye cannot always work out what has been printed :twisted:

Mark
1024MAK
Posts: 12800
Joined: Mon Apr 18, 2011 5:46 pm
Location: Looking forward to summer in Somerset, UK...

Post by 1024MAK »

sydney wrote: That wasn't as helpful as I thought- feel free to ignore it as I'd just woken from a nap.
I hope you had a better snooze than I did...

Mark
sydney
Posts: 2925
Joined: Wed May 18, 2005 10:09 am
Location: Newcastle upon Tyne

Post by sydney »

1024MAK wrote: Erm, what cover disk?
...
Mark
8bs.com has lots (ALL???) of the cover disks/tapes. Even where a cover disk or tape was not available, the listings have been typed in for you!

http://8bs.com/catalogue.htm

A&B
Beebug
Electron User
The Micro User

Whether or not this is actually useful is another matter altogether. :lol:
lurkio
Posts: 4351
Joined: Wed Apr 10, 2013 12:30 am
Location: Doomawangara

Post by lurkio »

davidb seems to have got some good results with the Tesseract OCR program recently:
davidb wrote: I ran the pages through Tesseract and cleaned up the output. It's not really designed for this kind of text. Feel free to fix any errors I've introduced ...
:?:
davidb
Posts: 3398
Joined: Sun Nov 11, 2007 10:11 pm

Post by davidb »

lurkio wrote: davidb seemed to have got some good results with the Tesseract OCR program recently:
I think it does use a dictionary sometimes, however, so the programs might get "translated" a bit. It's worth a try. If anyone wants me to try a few other OCR tools available in Debian then just let me know. :)
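
If the dictionary does turn out to be what "translates" listings, one thing that might be worth trying is switching it off. The sketch below only builds and runs a Tesseract command line from Python. The `load_system_dawg` and `load_freq_dawg` variables and the `--psm` option are real Tesseract settings, but whether disabling the word lists actually helps with program listings is an assumption, the function names and file names are made up for illustration, and older 3.x releases spell the option `-psm`.

```python
import subprocess


def build_tesseract_command(image_path, output_base):
    """Build a tesseract command line with the word lists disabled."""
    return [
        "tesseract", image_path, output_base,
        "--psm", "6",                    # treat the page as one uniform block of text
        "-c", "load_system_dawg=false",  # do not load the main dictionary
        "-c", "load_freq_dawg=false",    # do not load the frequent-words list
    ]


def run_ocr(image_path, output_base):
    """Run tesseract; the text ends up in output_base + '.txt'."""
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)


# Hypothetical file names, shown without running anything.
print(" ".join(build_tesseract_command("page01.tif", "page01")))
```

Calling `run_ocr("page01.tif", "page01")` would need the `tesseract` binary to be installed and on the path.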
Wouter Scholten
Posts: 235
Joined: Wed May 02, 2001 11:14 pm
Location: NL

Post by Wouter Scholten »

lurkio wrote: davidb seemed to have got some good results with the Tesseract OCR program recently:
davidb wrote: I ran the pages through Tesseract and cleaned up the output. It's not really designed for this kind of text. Feel free to fix any errors I've introduced ...
:?:

I used Tesseract many years ago to create, for example, the text of the Tubelink ad, which I included with the Advanced BASIC disk image to give a little information, as there was no manual (and none has surfaced as yet) nor any other information on this 6502 2p BASIC. It worked well (I saved files from the scanner as uncompressed TIFF), but only with clean, high-resolution scans. I tried it recently on some lower-resolution scans (e.g. the Beebug PDFs from 8bs.com, converted to TIFF via application-grab-window with xv) and the result was quite poor, even with some changes to the image that might help the OCR program.
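
As an example of the kind of image change that is often tried before OCR, here is a minimal sketch of binarising a greyscale page with Otsu's threshold. This is an illustration only, not what was done to the scans mentioned above. It works on plain lists of 0 to 255 grey values so that it needs no imaging library, and the function names are made up.

```python
def otsu_threshold(pixels):
    """Return the grey level that best separates dark and light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their grey values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarise(rows):
    """Map every pixel to pure black (0) or pure white (255)."""
    t = otsu_threshold([p for row in rows for p in row])
    return [[255 if p > t else 0 for p in row] for row in rows]


# Tiny made-up "scan": dark ink (20, 25) on a light page (200, 210).
print(binarise([[20, 200], [210, 25]]))
```

A global threshold like this would not cope with listings printed over a picture or on coloured backgrounds, where the ink and paper levels change across the page.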