I'm on a few Facebook groups in the Speccy scene and I was musing about the possibility of OCRing listings from magazines, so as to avoid typing them in by hand. It may be of interest to members here too, so I thought I'd mention it.
I have tried a couple of free OCR programs for Windows but they have been totally hopeless. They assume the page being scanned is in a particular written language rather than a computer listing. They also struggle to cope with the many styles of computer printout that these listings were printed in.
One of my scanners, part of a multi-function device (an A3 Brother MFC-J6910DW), has a multi-page scanning facility. I separated out all the pages of the magazine (Best of Sinclair Programs '84), fed the whole lot into the multi-page feeder and tried to scan the pages into separate files (one picture per page or something), but my laptop didn't seem to be happy with that and crashed, telling me it had run out of memory without even saving a single image. Looks like I might have to scan each page one at a time. So much for that!
The vocabulary of a listing (apart from variable names) is going to be limited to keywords, symbols and possibly graphical characters, so we would need to find some way of training the OCR software to recognise those, but the problem is finding (or writing) OCR software that learns as it goes along and is 'happy' working in this manner.
I'm looking for recommendations, suggestions, solutions, etc.
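To illustrate the limited-vocabulary idea, here is a minimal sketch of one possible approach: rather than training the OCR engine itself, post-correct its output by snapping near-miss tokens onto a known keyword list. Everything in it is an assumption for illustration: the `KEYWORDS` list is a small hand-picked subset of Sinclair BASIC, the function names are made up, and the 0.75 similarity cutoff is a guess that would need tuning. Text inside quotes and lower-case variable names is left alone, but upper-case variable names that resemble a keyword could still be wrongly "corrected".

```python
import difflib
import re

# Assumption: a small, hand-picked subset of Sinclair BASIC keywords.
# A real tool would need the full keyword set for the target machine.
KEYWORDS = [
    "PRINT", "INPUT", "RETURN", "NEXT", "THEN", "STEP", "BORDER", "PAPER",
    "FLASH", "BRIGHT", "INVERSE", "PAUSE", "RANDOMIZE", "RESTORE", "DATA",
    "READ", "POKE", "PEEK", "BEEP", "PLOT", "DRAW", "CIRCLE", "STOP",
    "POINT", "ATTR", "INKEY$", "SCREEN$",
]

# Upper-case tokens of four or more characters; shorter ones are too ambiguous.
TOKEN = re.compile(r"[A-Z0-9$]{4,}")


def snap_keyword(word, cutoff=0.75):
    """Return the closest known keyword, or the word unchanged if none is close."""
    if not any(c.isalpha() for c in word):
        return word  # pure numbers (line numbers, constants) are never keywords
    matches = difflib.get_close_matches(word, KEYWORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_line(line):
    """Fix likely OCR slips in keywords, leaving quoted strings untouched."""
    parts = re.split(r'("[^"]*")', line)  # the capture group keeps the strings
    fixed = []
    for part in parts:
        if part.startswith('"'):
            fixed.append(part)
        else:
            fixed.append(TOKEN.sub(lambda m: snap_keyword(m.group(0)), part))
    return "".join(fixed)


if __name__ == "__main__":
    for raw in ['10 PR1NT "PR1NT me"', "40 B0RDER 1: PAU5E 0"]:
        print(correct_line(raw))
```

This only catches slips inside keywords (a 1 for an I, a 0 for an O, a 5 for an S); it does nothing for misread numbers, variable names or graphics characters, which would still need checking by eye.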
OCR of magazine listings
- SimonSideburns
- Posts: 657
- Joined: Mon Aug 26, 2013 9:09 pm
- Location: Purbrook, Hampshire
- Contact:
Just remember kids, Beeb spelled backwards is Beeb!
Re: OCR of magazine listings
Could you load the games from the cover disc images and then print them out?
Re: OCR of magazine listings
That wasn't as helpful as I thought; feel free to ignore it as I'd just woken from a nap.
- 1024MAK
- Posts: 12806
- Joined: Mon Apr 18, 2011 5:46 pm
- Location: Looking forward to summer in Somerset, UK...
- Contact:
Re: OCR of magazine listings
Erm, what cover disk?
Only the later magazines had cover disks, and a lot of home computer magazines never had cover disks at all. True, some had cover tapes. But even where these exist, it still leaves vast numbers of magazines where only the printed listing was published.
When I try OCR on documents where either the characters or the wording differ from what it expects, it makes a complete hash of it.
As Simon indicates, the listings were often printed in various typefaces and sizes, and sometimes as a dot matrix printout. Some magazines liked to jazz them up (different colours, both the type and the background). Or worse, print them on top of a picture.
All of which makes OCR recognition a nightmare.
Even a human eye cannot always work out what has been printed.
Mark
For a "Complete BBC Games Archive" visit www.bbcmicro.co.uk NOW!
BeebWiki - for answers to many questions...
Fault finding index • Acorn BBC Model B minimal configuration • Logic Levels for 5V TTL Systems
- 1024MAK
- Posts: 12806
- Joined: Mon Apr 18, 2011 5:46 pm
- Location: Looking forward to summer in Somerset, UK...
- Contact:
Re: OCR of magazine listings
sydney wrote:That wasn't as helpful as I thought- feel free to ignore it as I'd just woken from a nap.
I hope you had a better snooze than I did...
Mark
Re: OCR of magazine listings
1024MAK wrote:Erm, what cover disk?
...
Mark
8bs.com has lots (ALL???) of cover disks/tapes; even when a cover disk or tape was not available, they have been typed in for you!
http://8bs.com/catalogue.htm
A&B
Beebug
Electron User
The Micro User
Whether or not this is actually useful is another matter altogether.
Re: OCR of magazine listings
davidb seemed to have got some good results with the Tesseract OCR program recently:
davidb wrote:I ran the pages through Tesseract and cleaned up the output. It's not really designed for this kind of text. Feel free to fix any errors I've introduced ...
Re: OCR of magazine listings
lurkio wrote:davidb seemed to have got some good results with the Tesseract OCR program recently:
I think it does use a dictionary sometimes, however, so the programs might get "translated" a bit. It's worth a try. If anyone wants me to try a few other OCR tools available in Debian then just let me know.
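On the dictionary point, a hedged sketch of how the word lists might be switched off: Tesseract's command line accepts `-c VAR=VALUE` settings, and `load_system_dawg` and `load_freq_dawg` control whether its dictionary word lists are loaded. The helper name `build_tesseract_command` and the file name `page.tif` are made up for illustration, and how much difference these settings make (and whether `tessedit_char_whitelist` is honoured at all) depends on the Tesseract version and engine mode, so treat this as something to experiment with rather than a known fix.

```python
import os
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, whitelist=None):
    """Build a tesseract command line with the dictionary word lists disabled."""
    cmd = [
        "tesseract", image_path, output_base,
        "-c", "load_system_dawg=0",  # do not load the main dictionary
        "-c", "load_freq_dawg=0",    # do not load the frequent-words list
    ]
    if whitelist:
        # Optionally restrict recognition to the characters a listing can contain.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


if __name__ == "__main__":
    # "page.tif" is a hypothetical scan; the text would be written to page.txt.
    cmd = build_tesseract_command("page.tif", "page")
    print(" ".join(cmd))
    # Only run it if tesseract is installed and the scan actually exists.
    if shutil.which("tesseract") and os.path.exists("page.tif"):
        subprocess.run(cmd, check=True)
```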
- Wouter Scholten
- Posts: 235
- Joined: Wed May 02, 2001 11:14 pm
- Location: NL
- Contact:
Re: OCR of magazine listings
lurkio wrote:davidb seemed to have got some good results with the Tesseract OCR program recently:
davidb wrote:I ran the pages through Tesseract and cleaned up the output. It's not really designed for this kind of text. Feel free to fix any errors I've introduced ...
I used Tesseract many years ago, e.g. to create the text of the tubelink ad, which I included with the Advanced basic diskimage to give a little information, as there was no manual (and none has surfaced as yet) nor any other information on this 6502 2p basic. It worked well (I saved files from the scanner as uncompressed TIFF), but only with clean, high-resolution scans. I tried it recently on some lower-resolution scans (e.g. the Beebug PDFs from 8bs.com, converted to TIFF by grabbing the application window with xv) and the result was quite poor, even with some changes to the image that might help the OCR program.
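The post does not say which image changes were tried. As one example of the kind of clean-up that is commonly applied before OCR, here is a small sketch of Otsu thresholding, which picks a black/white cut-off from the grey-level histogram. It is written in plain Python on a list of rows of 0-255 grey values purely for illustration; the function names are made up, and in practice the pixels would come from an image library. It will not rescue a scan whose resolution is simply too low.

```python
def otsu_threshold(rows):
    """Pick the grey level that best separates dark ink from light paper.

    rows is a list of rows of integer grey values in the range 0..255.
    Pixels with a value <= the returned threshold count as "dark".
    """
    hist = [0] * 256
    for row in rows:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count = 0  # number of pixels at or below the candidate threshold
    dark_sum = 0    # sum of their grey values
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        light_count = total - dark_count
        if dark_count == 0:
            continue
        if light_count == 0:
            break
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        # Between-class variance (up to a constant factor); larger is better.
        score = dark_count * light_count * (dark_mean - light_mean) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(rows, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if p <= threshold else 255 for p in row] for row in rows]


if __name__ == "__main__":
    page = [[30, 40, 200], [35, 210, 220]]  # tiny made-up "scan"
    t = otsu_threshold(page)
    print(t, binarize(page, t))
```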