1996-10-17 - A home for stego in source code ?

Header Data

From: peter.allan@aeat.co.uk (Peter M Allan)
To: cypherpunks@toad.com
Message Hash: 04d3dc305497ae859ddc658176ff703289f260dd768baec90fb3f13f7d860795
Message ID: <9610171756.AA00490@clare.risley.aeat.co.uk>
Reply To: N/A
UTC Datetime: 1996-10-17 17:56:32 UTC
Raw Date: Thu, 17 Oct 1996 10:56:32 -0700 (PDT)

Raw message

From: peter.allan@aeat.co.uk (Peter M Allan)
Date: Thu, 17 Oct 1996 10:56:32 -0700 (PDT)
To: cypherpunks@toad.com
Subject: A home for stego in source code ?
Message-ID: <9610171756.AA00490@clare.risley.aeat.co.uk>
MIME-Version: 1.0
Content-Type: text/plain




The monarchist lightning-hater has been cracking the whip again.
> Where is highly sophisticated stego?

Adam Back suggested several areas for investigation, including English text.

Some weeks ago somebody suggested using synonyms in natural language
to express stegodata.  A synonym for a word in the message would be
chosen from a defined set, and the choice that was made would express
a bit or several.  This seems related to the clever "alternate
dictionary" idea in RFC-1938.  But a lack of true synonyms (or
programs that understand natural language) make it look pretty
hopeless as a practical proposition.

SOURCE CODE A POSSIBLE CHANNEL

Even small programs often have several hundred symbols.  The names
given to these are mostly arbitrary, and they can be replaced
(consistently) by other names by a program.  In fact there "shrouded"
source code is sometimes released to make it less clear to the human
reader than the original source.

In this application the names would be chosen randomly from a set of
words and syllables (perhaps with some rules about which are
plausible together).  There is also a choice of StyleWithCaps and
style_with_underscores.

If the symbols (in order of first appearance) all have a CRC
calculated on them I'd guess you could get 4 bits each without
difficulty.

There are also comments.  These might be natural language, debugging
code and never-really-got-this-working code.  Maybe a CRC on each
comment is worth 8 bits.   Comments of the form

   /* Thanks to P.McNicholas, M.Davies and WorpleSword
      for advising on the next section **/

could be added with random names quite well (although the user might
have to mark the complicated code fragments suitable for receiving
such attributions).  Names used could be lifted from a list of
known helpful programmers ...

And then there is layout.  The kind of thing that is standardised with indent(1).
This would be anti-standard in layout, choosing spacing and linebreaks and
spaces with the stegomessage in mind.

And some code constructions are capable of rewriting to add info to the message.

0==x , x==0
x>y  , x<y
i++  , ++i   (sometimes)
for  , while
if   , switch (sometimes)

Or variables could be added to hold intermediate values, and lengthen
the source as our desired effect.

There may be more scope for plausible stego in code that doesn't have
to work.  Next time you see a load of verbose mistyped rubbish called
"Help me with my program!!!" it could be someone who got this idea
ahead of me.

I've been dipping into "Understanding and Writing Compilers"
(Bornat, 1979).  (free, withdrawn from library)
It seems the early phases of a compiler build the symbol table as a tree
and do various bits of analysis on the source.  A good starting point for
a program might be a compiler - rip out the translation and optimisation
phases and add the stegoreader and stegogenerator.


KEYED V. UNKEYED STEGO

Assuming that the data hidden in stegofiles is random (keys or cyphertext)
the enemy is supposed to be unable to tell whether there is really a message
there.  An example of unkeyed stego is the use of LSB of each byte in an image
or audio file.  In contrast, the subliminal channel mechanisms written about by
Schneier use a key known to the intended reader.  In this application keyed
stego would seem to have an advantage over unkeyed in that the source could be
made to keep to a house style while retaining some stego capacity.  The seeds
for the CRCs and style info (resembling ./indent.pro ~/indent.pro) could be
user-defined.  I'd still wouldn't want to hide plaintext under it, but it
would give more flexibilty for the users.


Would anyone like to write a version ? Then we could test my guesses
about how many bits can be hidden.  It probably won't be close to 
1 in 8, but I think 1 in 30 should be within reach.




 -- Peter Allan    peter.allan@aeat.co.uk





Thread