4.1 Representing Sequence Data
The majority of this book deals with manipulating symbols that
represent the biological
sequences
of DNA and proteins. The symbols used in bioinformatics to represent
these sequences are the same symbols biologists have been using in
the literature for this same purpose.
As stated earlier, DNA is composed of four building blocks: the
nucleotides, also called bases (this book also loosely calls them
nucleic acids). Proteins are composed of 20 building
blocks, the amino acids, also called residues. Fragments of proteins
are called peptides. Both DNA and proteins are essentially
polymers, made from
their building blocks attached end to end. So it's possible to
summarize the structure of a DNA molecule or protein by simply giving
the sequence of bases or amino acids.
These are brief definitions; I'm assuming you are either
already familiar with them or are willing to consult an introductory
textbook on molecular biology for more specific details. Table 4-1 shows bases; add a sugar and you get the
nucleosides adenosine, guanosine, cytidine, thymidine, and uridine.
You can further add a phosphate and get the nucleotides adenylic
acid, guanylic acid, cytidylic acid, thymidylic acid, and uridylic
acid. A nucleic acid is a chemically linked
sequence of nucleotides. A peptide is a small
number of joined amino acids; a longer chain is a
polypeptide. A protein
is a biologically functional unit made of one or more polypeptides. A
residue is an amino acid in a polypeptide
chain.
For expediency, the names of the nucleic acids and the amino acids
are often represented as one- or three-letter codes, as shown in
Table 4-1 and Table 4-2. (This book mostly uses the one-letter codes for
amino acids.)
Table 4-1. Standard IUB/IUPAC nucleic acid codes
Code | Nucleic Acid(s)
A | Adenine
C | Cytosine
G | Guanine
T | Thymine
U | Uracil
M | A or C (amino)
R | A or G (purine)
W | A or T (weak)
S | C or G (strong)
Y | C or T (pyrimidine)
K | G or T (keto)
V | A or C or G
H | A or C or T
D | A or G or T
B | C or G or T
N | A or G or C or T (any)
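The codes in Table 4-1 map naturally onto a lookup table. The following is a minimal sketch in Python, used here for illustration only; the names `IUPAC_DNA` and `expand` are hypothetical and do not come from this book. It stores each code with the bases it stands for and lists the concrete sequences an ambiguous sequence could represent. The sketch assumes DNA, so U is left out.

```python
from itertools import product

# IUPAC nucleic acid codes from Table 4-1 (DNA only, so U is left out).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT",
    "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}

def expand(seq):
    """Return every unambiguous DNA sequence that `seq` could stand for."""
    # Look up the allowed bases at each position, ignoring case.
    choices = [IUPAC_DNA[code] for code in seq.upper()]
    # Take one base per position in every possible combination.
    return ["".join(bases) for bases in product(*choices)]

print(expand("GAR"))  # R is A or G, so this prints ['GAA', 'GAG']
```

The number of expansions multiplies with every ambiguous position, so this approach is only practical for short sequences.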
Table 4-2. Standard IUB/IUPAC amino acid codes
One-letter code | Amino acid | Three-letter code
A | Alanine | Ala
B | Aspartic acid or Asparagine | Asx
C | Cysteine | Cys
D | Aspartic acid | Asp
E | Glutamic acid | Glu
F | Phenylalanine | Phe
G | Glycine | Gly
H | Histidine | His
I | Isoleucine | Ile
K | Lysine | Lys
L | Leucine | Leu
M | Methionine | Met
N | Asparagine | Asn
P | Proline | Pro
Q | Glutamine | Gln
R | Arginine | Arg
S | Serine | Ser
T | Threonine | Thr
V | Valine | Val
W | Tryptophan | Trp
X | Unknown | Xxx
Y | Tyrosine | Tyr
Z | Glutamic acid or Glutamine | Glx
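As a rough illustration of how Table 4-2 might be used in a program, the sketch below converts a peptide written in one-letter codes into three-letter codes. It is written in Python for illustration only, and the names `ONE_TO_THREE` and `to_three_letter` are hypothetical; joining the codes with hyphens is simply one common way of writing them.

```python
# One-letter to three-letter amino acid codes from Table 4-2.
ONE_TO_THREE = {
    "A": "Ala", "B": "Asx", "C": "Cys", "D": "Asp", "E": "Glu",
    "F": "Phe", "G": "Gly", "H": "His", "I": "Ile", "K": "Lys",
    "L": "Leu", "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln",
    "R": "Arg", "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp",
    "X": "Xxx", "Y": "Tyr", "Z": "Glx",
}

def to_three_letter(peptide):
    """Spell out a one-letter peptide string using three-letter codes."""
    return "-".join(ONE_TO_THREE[residue] for residue in peptide.upper())

print(to_three_letter("MKV"))  # Met-Lys-Val
```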
The nucleic acid codes in Table 4-1 include
letters for the four basic
nucleic acids; they also define single
letters for all possible groups of two, three, or four nucleic acids.
In most cases in this book, I use only A, C, G, T, U, and N. The
letters A, C, G, and T represent the nucleic acids for DNA. U
replaces T when DNA is transcribed into ribonucleic acid (RNA).
N is the common
representation for "unknown," as when a sequencer
can't determine a base with certainty. Later on, in Chapter 9, we'll need the other codes, for groups
of nucleic acids, when programming restriction maps. Note that the
lowercase versions of these single-letter codes are also used on
occasion, frequently for DNA, rarely for protein.
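The conventions in this paragraph (U in place of T for RNA, N for an unknown base, and occasional lowercase) can be sketched in a few lines. This is an illustrative Python sketch, not code from this book; `transcribe` and `count_unknown` are hypothetical helper names, and treating transcription as a plain letter substitution follows the simplified description given above.

```python
def transcribe(dna):
    """Return the RNA spelling of a DNA string: uppercase, with T replaced by U."""
    return dna.upper().replace("T", "U")

def count_unknown(seq):
    """Count positions reported as N (unknown base), ignoring case."""
    return seq.upper().count("N")

print(transcribe("acgtTGCA"))     # ACGUUGCA
print(count_unknown("ACGNnTTA"))  # 2
```

Converting to uppercase first means that mixed-case input is handled the same way as uppercase input.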
The computer-science terminology is a little different from the
biology terminology for the codes in Table 4-1 and Table 4-2. In
computer-science parlance, these tables define two
alphabets, finite
sets of symbols that can make
strings. A sequence
of symbols is called a string. For instance,
this sentence is a string. A language is a (finite or infinite) set
of strings. In this book, the languages are mainly DNA and protein
sequence data. You often hear bioinformaticians referring to an
actual sequence of DNA or protein as a "string," as
opposed to its representation as sequence data. This is an example of
the terminologies of the two disciplines crossing over into one
another.
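To make the alphabet and string terminology concrete, here is a small sketch, assuming Python for illustration. The names `DNA_ALPHABET`, `PROTEIN_ALPHABET`, and `is_over` are hypothetical, and the alphabets are restricted to the four DNA bases and the 20 standard amino acids, leaving out the ambiguity codes.

```python
# Two alphabets: finite sets of symbols.
DNA_ALPHABET = set("ACGT")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")

def is_over(alphabet, string):
    """True if every symbol of `string` belongs to `alphabet`."""
    return set(string) <= alphabet

print(is_over(DNA_ALPHABET, "GATTACA"))      # True
print(is_over(DNA_ALPHABET, "MKV"))          # False
print(is_over(PROTEIN_ALPHABET, "GATTACA"))  # True
```

The last line shows that the same string can be a valid string over both alphabets, which is why a program usually has to be told whether its input is DNA or protein.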
As you've seen in the tables, we'll be representing data
as simple letters, just as written on a page. But computers actually
use additional codes to represent simple letters. You won't
have to worry much about this; just remember, when using your
text editor, to save your files as
ASCII, or plain text.
ASCII is a way for computers to store textual (and control) data in
their memory. Then when a program such as a text editor reads the
data, and it knows it's reading ASCII, it can actually draw the
letters on the screen in a recognizable fashion because it's
programmed to know that particular code. So the bottom line is: ASCII
is a code to represent text on a computer.[1]
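The idea that each letter is stored as a numeric code can be seen directly. The following short Python sketch is for illustration only; it uses the built-in functions `ord` and `chr` and the string method `encode`, and the variable names are arbitrary.

```python
seq = "ACGT"

# ord() gives the numeric code of a character; chr() goes the other way.
codes = [ord(ch) for ch in seq]
print(codes)    # [65, 67, 71, 84]
print(chr(65))  # A

# Encoding as ASCII yields one byte per letter.
data = seq.encode("ascii")
print(len(data))             # 4
print(data.decode("ascii"))  # ACGT
```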