1997-01-03 - Re: OCR and Machine Readable Text

Header Data

From: Alan Olsen <alan@ctrl-alt-del.com>
To: Bill Stewart <stewarts@ix.netcom.com>
Message Hash: 320aa8e2b1873dac5559cbd0acb563788d9ee5d7e9d573b65b24750768a3bec0
Message ID: <3.0.1.32.19970102225436.01072284@mail.teleport.com>
Reply To: N/A
UTC Datetime: 1997-01-03 07:00:31 UTC
Raw Date: Thu, 2 Jan 1997 23:00:31 -0800 (PST)

Raw message

From: Alan Olsen <alan@ctrl-alt-del.com>
Date: Thu, 2 Jan 1997 23:00:31 -0800 (PST)
To: Bill Stewart <stewarts@ix.netcom.com>
Subject: Re: OCR and Machine Readable Text
Message-ID: <3.0.1.32.19970102225436.01072284@mail.teleport.com>
MIME-Version: 1.0
Content-Type: text/plain


At 08:06 PM 1/2/97 -0800, Bill Stewart wrote:
>It's really embarrassing to have to pay salaries of "public employees"
>who can't come up with better arguments than the paper/magnetic/OCR nonsense
>but don't have the guts to stop trying and admit they're wrong.
>Does the President still make $200K/year salary?  You'd think he'd either
>read what he signs or tell his employees to only ask him to sign
>at least half-way credible stuff.  The old regulations used to pretend that
>foreigners were too dumb to implement computer programs from algorithms;
>now they're pretending that foreigners are too dumb to type.*
>People used to say we have the best politicians money can buy, 
>but you ought to be able to buy better politicians than that.

Or better excuses.


>At 10:31 AM 12/30/96 -0800, Tim May wrote:
>>And not only is OCR able these days to handle general fonts easily enough,
>>but almost all printed code is in fixed-width fonts, i.e., non-proportional
>>fonts. This makes OCR easy. 
>
>The basic difference between "easily OCRed source code" and "not easily
>OCRed source code" is pretty much limited to two things
>1) Half-decent print quality (black on white in Courier at 300dpi
>   should do....)
>        As Tim says, this stuff is child's play.  Back when OCRs were
>        $10,000 machines with cutting-edge 68010 processors,
>        reading Courier was pretty easy but it helped to put in checksums;
>        these days you don't really need that.  (It also didn't like
>        wet-process 240-dpi laser printing or faxes, but modern OCR software
>        can generally deal with good-quality faxes and 
>2) Bound pages vs. loose pages (printing with perforated pages
>        or selling the source code in loose-leaf might count as an
>        "attractive nuisance" :-), but a band-saw can solve that problem
>        for the OCR user unless it's printed on Tyvek or something silly :-)

Even an X-Acto knife would work.  For proportional fonts it depends on how
nasty the kerning gets and the shape of the characters.  (And a sans serif
font without too many kerning pairs should go through fine.)  The technology
for this has progressed quite a bit in the last few years.
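
The per-line checksums mentioned in the quoted text above could look
something like the rough Python sketch below.  It is only an illustration:
the choice of CRC-32, the 8-hex-digit tag format, and the function names
(tag_line, check_line, bad_lines) are assumptions made for this example, not
the scheme used by any particular printed source book.

```python
import zlib


def tag_line(line: str) -> str:
    """Prefix a source line with its CRC-32 as 8 hex digits (assumed format)."""
    return f"{zlib.crc32(line.encode('utf-8')):08x} {line}"


def check_line(tagged: str) -> bool:
    """Return True if the text after the tag still matches the tag."""
    # Split at the first space only, so indentation in the line is preserved.
    tag, _, line = tagged.partition(" ")
    return tag == f"{zlib.crc32(line.encode('utf-8')):08x}"


def bad_lines(tagged_lines):
    """Return the 1-based numbers of lines whose checksum does not match."""
    return [n for n, t in enumerate(tagged_lines, 1) if not check_line(t)]


if __name__ == "__main__":
    printed = [tag_line(s) for s in ["my $now = `date`;", "print $now;"]]
    # Simulate an OCR pass that misreads the first backtick as an apostrophe.
    scanned = [printed[0].replace("`", "'", 1), printed[1]]
    print(bad_lines(scanned))
```

A proofreader then only has to look at the flagged lines rather than
comparing every page against the original by eye.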

Next thing you know, OCR software will be export controlled as well.  (Or
they will require something silly, like having all code samples in
calligraphy fonts.)

>In the Karn case, the Feds made the silly argument that the
>floppy disk version had the files neatly separated, while the
>paper version split files between pages and had page numbers at the
>bottoms of the pages that weren't part of the source code.
>Even the $10K 68010 wonderbox could handle page headers/footers and margins, 
>and modern software can do decent translations into different
>word processor formats.

And even if it didn't, just selecting and deleting the margin areas would
not be all that difficult.  (Ooohhh...  A couple of extra hours is really
going to slow someone down.)

>>For just the amount of money we've spent (in our consulting fees) on
>>discussing just this issue of OCRing, the entire content of the MIT PGP
>>source code book AND Schneier's AC could have been manually inputted by
>>Barbadans or Botswanas, or probably even by Europeans.

I used to work for a company that would transfer entire archives of medical
journals.  Much of it we would just OCR.  Some of it we would send
offshore.  The OCR software was about 95% reliable and this was over 5 years
ago.  (And we were using 286 boxes for much of the OCR work.  Not a heavy
technological investment.)  I am sure that things have improved a great
deal since then.  (My new scanner included OCR software.  I will have to
run a test and report the findings.)

>There's one German university that OCRed the MIT PGP source code book.
>The PGP folks passed out copies of their new 3.0 Pre-Alpha and an update
>at a recent Cypherpunks meeting.  See
>http://www.pgp.com/newsroom/sourcebook.cgi 
>for ordering information.  It's been donated to some local libraries, 
>such as San Jose CA, and I hope they'll send it to the Library of Congress
>and various non-US university and other public libraries - the recent
>rules change clarifying that it's ok to export source code should make
>this much easier.   

The page listed does not contain order information.  Do you know costs
and/or order info?

>[* OK, it's not really possible to type or proofread perl code accurately :-)

Yeah, just look at what happened with Jon Orwant's _Perl 5 Interactive
Course_.  The book is being recalled due to all the typographical errors
from the publisher.  Reading some Perl code is also quite impossible.  (For
the reasons behind this, I recommend Charlie Stross's article on the topic
on page 36 of _The Perl Journal_ #4.)

>More to the point, OCRs aren't always real good about `backquotes' and other
>little blotchy marks that some languages use, and even humans don't always
>get them right.]

Many character sets are not very good at displaying "little used"
characters clearly.  (Some of the cheaper fonts do not even include them.)
Backticks are a special problem.  The latest Camel book has all sorts of
problems with hard-to-recognise backticks.

BTW, there is an article on Perl and randomness in _The Perl Journal_ #4 by
Jon Orwant.  Pretty basic for most Cypherpunks, but good reading
nonetheless...

---
|   If you're not part of the solution, You're part of the precipitate.  |
|"The moral PGP Diffie taught Zimmermann unites all| Disclaimer:         |
| mankind free in one-key-steganography-privacy!"  | Ignore the man      |
|`finger -l alano@teleport.com` for PGP 2.6.2 key  | behind the keyboard.|
|         http://www.ctrl-alt-del.com/~alan/       |alan@ctrl-alt-del.com|




