1993-11-13 - Re:Bandwidth limitations, DNA binary coding

Header Data

From: VACCINIA@UNCVX1.OIT.UNC.EDU
To: cypherpunks@toad.com
Message Hash: 37afe1cb710af1be2f02d7953973574c36f78212a9828aaaeccf73d0b3820658
Message ID: <01H59LQZ0IG20028BH@UNCVX1.OIT.UNC.EDU>
Reply To: N/A
UTC Datetime: 1993-11-13 17:19:53 UTC
Raw Date: Sat, 13 Nov 93 09:19:53 PST

Raw message

From: VACCINIA@UNCVX1.OIT.UNC.EDU
Date: Sat, 13 Nov 93 09:19:53 PST
To: cypherpunks@toad.com
Subject: Re:Bandwidth limitations, DNA binary coding
Message-ID: <01H59LQZ0IG20028BH@UNCVX1.OIT.UNC.EDU>
MIME-Version: 1.0
Content-Type: text/plain


-----BEGIN PGP SIGNED MESSAGE-----

Perry writes:
>15 symbols, HALF a byte (actually a touch less.) One nybble can express 16
>possible symbols (or one Hex digit, or whatever.)

Oops, I stand corrected, 15 symbols half a byte. What I was trying to convey 
is that GenBank (the repository for genomic sequence data) has a specific 
format for binary representation of DNA sequence data. Most genomic analysis 
programs use GenBank sequence format now (some use EMBL which is similar) and 
probably will in the future. Thus, the half byte per GATC symbol is defined as 
convention, not by the fewest binary digits neccesary for encoding them. It 
may waste bandwidth but that's no problem for fiber optics. Which is how this 
thread started.

>plus, of course, the genome is highly compressable -- lots of repeated
>sequences, especially in interons.

This brings up an interesting topic. There are four classes of DNA: Foldback 
DNA, highly repetitive DNA, middle-repetitive DNA and single-copy DNA. 
Foldback DNA consists of palindromic sequences which form hairpin like 
structures. Highly repetitive DNA is made up of short sequences from several 
to hundreds of bases long (repeated around 5 x 10^5 times). Middle repetitive 
DNA consists of longer sequences, hundreds or thousands of bases long (these 
appear hundreds of times in the genome). Single-copy DNA sequences are usually 
genes themselves, of which (in humans) it is estimated that there are around 
1 x 10^5.

Since the genome is highly redundant (in mammals up to 60% of the genome is 
repetitive sequence), you could probably compress alot of it just by 
designating symbols for specific repetitive elements. Most of the repetitive 
nature of the genome is found as highly repetitive sequence localized as
tandem arrays (not in introns). However, a second class of element known as 
SINEs and LINEs are found in introns, gene flanking regions and intergenic 
regions. The most widely characterized SINE is the Alu sequence, which is 
approximately 300 bases long and scattered throughout the genome over 
5 x 10^5 times. This constitutes 5-6% of the genome! That's a lot of 
compressability.

I often wonder if the redundancy is a way to encrypt a species genome, thus 
keeping different species from genetic communication. The "key" being millions 
of random base pairings which allow like species to decrypt their own genetic 
code and successfully have progeny. Pairings between species that are too
dissimilar would be a refractory event because the key is not homologous. 
By the way, genes are made up of exons and introns.

Scott G. Morham              ! The First, 
Vaccinia@uncvx1.oit.unc.edu  !           Second
PGP Public Key by Request    !                  and Third Levels
                             !        of Information Storage and Retrieval
                             ! DNA,                       
                             !     Biological Neural Nets,
                             !                            Cyberspace

-----BEGIN PGP SIGNATURE-----
Version: 2.3a

iQCVAgUBLOR1gj2paOMjHHAhAQFNZwP+Lv7Xv4bityeHd2L53fgY4seWKZX/Mkrw
YmHv5hPpusiXx6jt2tVGPnPyH0TVtdFb5Cy1YVnvLydgU4FPblJAO7chWuc5EPXn
7/SQ29AuGrDnWu9gEGaQiqEUgn40idPgvDVVQPikAX8tn5OmWo8vygMwIYgicQUh
Po8BHvPSLfg=
=ek9F
-----END PGP SIGNATURE-----





Thread