1993-03-12 - Tagging data to detect thieves

Header Data

From: Marc.Ringuette@GS80.SP.CS.CMU.EDU
To: cypherpunks@toad.com
Message Hash: 835974f1c33129fbe49ff06e1ab98eb42a324d1f819a000046f4e49aec55aa4b
Message ID: <9303120305.AA09556@toad.com>
Reply To: N/A
UTC Datetime: 1993-03-12 03:05:22 UTC
Raw Date: Thu, 11 Mar 93 19:05:22 PST

Raw message

From: Marc.Ringuette@GS80.SP.CS.CMU.EDU
Date: Thu, 11 Mar 93 19:05:22 PST
To: cypherpunks@toad.com
Subject: Tagging data to detect thieves
Message-ID: <9303120305.AA09556@toad.com>
MIME-Version: 1.0
Content-Type: text/plain

I've done some further thinking on the text tagging problem,
spurred by a question on sci.crypt about tagging pictures
(under the subject line "Permanent signatures for pictures").

Here's a summary.


Let's say Dow Jones wants to sell newswire subscriptions to individuals,
but someone is anonymously forwarding their articles to a newsgroup.  Can
they succeed in tagging the text to detect the thief?  The idea is to
make some small twiddle to each subscriber's copy of the text, so that
the stolen copy can be matched with some subscriber and their
subscription cancelled.

Short answer:  the thieves win.  At first, I thought the answer
was the opposite.


There are two issues which must be addressed in order to show that
the tagger wins:

  1.  The taggee must not be able to "smooth away" all of the tag bits.
  2.  The taggee must not be able to cross-correlate multiple copies
      of the data in question in order to produce a "clean" version.

Regarding issue #1, the basic techique is to alter a few features of
your data which are important enough that your opponent can't afford to
randomize ALL such bits.  In the case of text, small changes in word
choice are a good candidate.  Two criteria are:

   A.  The changes must be "important" enough that the thief can't
       smooth them all away.
   B.  The changes shouldn't be "important" enough that the newswire
       becomes worthless!

The tagger has an advantage in this case, though.  He can change, say,
1 in 1000 of these "important, non-smoothable-away" candidate bits.  If
the thief wants to cancel them out and only has a single copy of the
picture, he must somehow canonicalize _all_ of the candidate tag bits,
or some very large proportion of them.

So if your tagging process does a little bit of "damage" to your
data, like in the map-maker case of adding an extra small street
here and there, then the opponent must either try to detect exactly
where your damage is, or must make wholesale changes to the data (such
as removing all small roads altogether).  The thief, in trying to cover
up your damage, must make a thousand times as much damage.  Choose your
damage level appropriately so that your level of damage isn't too much
but the thief's is.


Issue #2, thieves cross-correlating between multiple copies of the data,
is a bit more subtle.  

Here's the scenario:

  Dow Jones has 10,000 customers, 64 of whom are in a conspiracy to steal
  and re-sell the newswire.  Dow Jones tries various tagging strategies,
  altering whitespace and word choice individually for each subscriber.
  The thieves try to cross-correlate between their copies of the text
  in order to "cancel out" the tags from the copy which they wish to re-sell.

Can Dow Jones detect the thieves and cancel their subscriptions?

In the discussion below, when Dow Jones "twiddles a bit" of their
newswire, they do so by substituting a word's synonym at a chosen
location, using a separate (possibly biased) coin flip for each
subscriber.  Here are the strategies I've considered.  

  Dow Jones strategy:  twiddle some bits with probability 0.5.
     If the thieves use majority vote, each thief will have a reasonably
     high correlation with the output bits.  (In fact, the probability of a
     match will exceed 50% by approximately the chance of a tie vote among
     the thieves, which is about 0.8/sqrt(n) where n is the number of
     thieves.  This computation is a bit hairy.)
  Thief countermeasure:  reliably detect which bits are being twiddled
      (by cross-checking between, say, 64 different subscriptions)
      and flip a fair coin to determine the output.  There's a chance
      of only 2 in 2^64 that the thieves fail to detect the twiddle.

  Dow Jones strategy:  twiddle some bits with low probability (e.g. p=0.01).
      Reasonably often, the bit values will be the same for all thieves.
      If the thieves use the flip-a-coin strategy, we can determine which
      tag bits they've failed to detect, and identify them that way.
  Thief countermeasure:  use a majority vote.

  Dow Jones strategy:  hybrid of the two.
  Thief countermeasure:  hybrid of the two.  Flip a coin if the vote is 
       fairly even, go with the majority if the vote is uneven.  For
       example, get 64 subscriptions, go with the majority vote if 
       fewer than 16 dissenters, flip a fair coin otherwise.

This last strategy for the thieves is the one I can't beat.

Theoretical help, anyone?

-- Marc Ringuette (mnr@cs.cmu.edu)