1997-01-03 - Re: OCR and Machine Readable Text

Header Data

From: Rabid Wombat <wombat@mcfeely.bsfs.org>
To: “/\anonymous/\” <panther@iglou.com>
Message Hash: a1be8874de6aad5b870dcc9682cb4aaceb1e7da78ebf227082d014591bfef489
Message ID: <Pine.BSF.3.91.970103135903.8406A-100000@mcfeely.bsfs.org>
Reply To: <32CD435E.24DC@iglou.com>
UTC Datetime: 1997-01-03 20:19:27 UTC
Raw Date: Fri, 3 Jan 1997 12:19:27 -0800 (PST)

Raw message

From: Rabid Wombat <wombat@mcfeely.bsfs.org>
Date: Fri, 3 Jan 1997 12:19:27 -0800 (PST)
To: "/**\\anonymous/**\\" <panther@iglou.com>
Subject: Re: OCR and Machine Readable Text
In-Reply-To: <32CD435E.24DC@iglou.com>
Message-ID: <Pine.BSF.3.91.970103135903.8406A-100000@mcfeely.bsfs.org>
MIME-Version: 1.0
Content-Type: text/plain



Accuracy will depend on the quality of the original being scanned, as 
well as the capability of the OCR system; flat originals scan much better 
than the "bent open" pages of a book or magazine, heavy stock tends to 
let less "bleed" through from the reverse side, fonts with extreme 
kerning are more difficult, point size is a factor, etc.

I've seen 97%+ w/ Calera, (about 2 years ago) when using flat, first
generation high quality photocopies w/ minimal skew and courier or similar
typeface. OTOH, the same system did not scan well at all w/ badly skewed
photocopies (caused by the "bend" induced by the binding of the original).
If you are scanning medical journals, take a look at your originals and
also at where the errors are occurring.
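
To see where the errors are occurring, one option is to hand-correct a sample
page and compare it against the raw OCR output. The following is a minimal
sketch using Python's standard difflib module; the function names and the
sample strings are hypothetical, and the accuracy figure it produces is only a
rough character-level estimate, not the metric any particular OCR vendor uses.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_accuracy(truth: str, ocr: str) -> float:
    """Fraction of ground-truth characters that also appear, in order, in the OCR text."""
    if not truth:
        return 1.0
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth)


def substitutions(truth: str, ocr: str) -> Counter:
    """Count one-for-one character substitutions as (truth_char, ocr_char) pairs."""
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    counts = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements can be paired up character by character.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            counts.update(zip(truth[i1:i2], ocr[j1:j2]))
    return counts


if __name__ == "__main__":
    truth = "B38 mg dose"   # hand-corrected sample (made up)
    ocr = "838 mg dose"     # what the scanner produced (made up)
    print(round(char_accuracy(truth, ocr), 3))
    print(substitutions(truth, ocr).most_common())
```

Sorting the substitution counts shows which character pairs account for most
of the damage, which helps decide whether the originals or the OCR package is
the weak point.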

You can also use a spell checker (after building up a suitable dictionary 
for your application) to cut out some of the error.
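
As a rough illustration of the spell-checker idea, the sketch below flags any
word that is not in an application-specific word list and suggests close
matches from that list. It assumes the dictionary is just a list of words you
have built up yourself; the function name, the cutoff value, and the sample
words are all hypothetical.

```python
import re
from difflib import get_close_matches


def flag_unknown_words(text, dictionary, cutoff=0.8):
    """Return {word: [suggestions]} for words that are not in the dictionary."""
    known = {w.lower() for w in dictionary}
    vocabulary = sorted(known)
    flagged = {}
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower not in known and word not in flagged:
            # Suggest up to three dictionary words that look similar.
            flagged[word] = get_close_matches(lower, vocabulary, n=3, cutoff=cutoff)
    return flagged


if __name__ == "__main__":
    medical_words = ["the", "patient", "received", "penicillin"]
    print(flag_unknown_words("The patlent received penicillin", medical_words))
```

A check like this only catches errors that produce non-words; an OCR mistake
that yields another valid word passes straight through.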

I'd guess your results to be less satisfactory for other applications 
where extreme accuracy is a must. "3", "8", and "B" for example, are 
often confused; not a big problem w/ a medical journal, but it plays havoc 
w/ code, accounting data, etc.
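
A small sketch of why this matters for accounting data, under the assumption
that the scanned fields are whole-number amounts with a stated total: a "B"
in a numeric field can be caught by inspection, but an "8" read as a "3" is
still a valid number and only shows up if there is some redundancy, such as a
total, to check against. Both helper names are hypothetical.

```python
def suspicious_numeric_fields(fields):
    """Return fields that should be plain digits but contain other characters."""
    return [f for f in fields if not (f.isascii() and f.isdigit())]


def total_matches(items, stated_total):
    """True if the scanned line items add up to the scanned total."""
    return sum(int(x) for x in items) == int(stated_total)


if __name__ == "__main__":
    # "3B" is obviously wrong for a numeric field and is easy to flag.
    print(suspicious_numeric_fields(["120", "3B", "45"]))

    # "38" misread as "33" is still a number; only the total exposes it.
    print(total_matches(["120", "38", "45"], "203"))
    print(total_matches(["120", "33", "45"], "203"))
```

Source code has no such built-in redundancy, so a misread digit or letter
there may go unnoticed until the code is compiled or run.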

-r.w.

On Fri, 3 Jan 1997, /**\anonymous/**\ wrote:

> Alan Olsen wrote:
> > I used to work for a company that would transfer entire archives of medical
> > journals.  Much of it we would just OCR.  Some of it we would send off
> > shore.  The OCR software was about 95% reliable and this was over 5 years
> > ago.  (And we were using 286 boxes for much of the OCR work.  Not a heavy
> > technological investment.)  I am sure that things have improved a great
> > deal since then.  (My new scanner included OCR software.  I will have to
> > run a test and report the findings.)
> 
> 	I'd like to know what OCR software you were using.  All tests we
> completed at my place of employment were of very poor quality.  We
> showed a 65% accuracy rate.  Not very good when you need to transfer a
> five year backlog of medical and technical journals.  This was using a
> high resolution scanner with a package that was bundled along with it.
> About a year ago, my employer considered transferring data taken off of
> forms into a relational database using an OCR program.  Again, we found
> the results to be too inaccurate for our needs.  I may have just been
> using the wrong programs for the job, but the findings were depressing...
> 
> panther
> 
> > ---
> > |   If you're not part of the solution, You're part of the precipitate.  |
> > |"The moral PGP Diffie taught Zimmermann unites all| Disclaimer:         |
> > | mankind free in one-key-steganography-privacy!"  | Ignore the man      |
> > |`finger -l alano@teleport.com` for PGP 2.6.2 key  | behind the keyboard.|
> > |         http://www.ctrl-alt-del.com/~alan/       |alan@ctrl-alt-del.com|
> 




