-
Exercise 10.1
-
Go to the NCBI, EMBL, and EBI web sites and become familiar with their use.
-
Exercise 10.2
-
Read the GenBank format documentation, gbrel.txt.
-
Exercise 10.3
-
Write a subroutine that passes a hash by value. Now rewrite it to
pass the hash by reference.
-
Exercise 10.4
-
Design a module of subroutines to handle the following kinds of data:
a flat file containing records consisting of gene names on a line and
extra information of any sort on succeeding lines, followed by a
blank line. Your subroutines should be able to read in the data and
then do a fast lookup on the information associated with a gene name.
You should also be able to add new records to the flat file. Now
reuse this module to build an address book program.
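
As a starting point only, the record layout described above (a name line, any number of information lines, then a blank line) can be sketched in Python as follows. The function names `read_records` and `add_record` are hypothetical, and the sketch assumes the whole file fits in memory and that names are unique.

```python
def read_records(path):
    """Return {name: [information lines]} from a blank-line-separated flat file."""
    records = {}
    with open(path) as fh:
        # Records are separated by one blank line, so split on "\n\n".
        for block in fh.read().split("\n\n"):
            lines = [ln for ln in block.splitlines() if ln.strip()]
            if lines:
                # First line is the name; the rest is free-form information.
                records[lines[0].strip()] = lines[1:]
    return records


def add_record(path, name, info_lines):
    """Append one record, terminated by a blank line, to the flat file."""
    with open(path, "a") as fh:
        fh.write("\n".join([name, *info_lines]) + "\n\n")
```

Holding the records in a dictionary keyed by name gives the fast lookup; nothing in the sketch is specific to genes, which is what makes it reusable for an address book.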
-
Exercise 10.5
-
Descend further into the FEATURES table. Parse the features in the
table into their next level by parsing the feature names, locations,
and qualifiers. Check the document gbrel.txt for
definitions of the structures of the fields.
-
Exercise 10.6
-
Write a program that takes a long DNA sequence as input and outputs
the counts of all four-base subsequences (256 of them in all), sorted
by frequency. A four-base subsequence starts at each location 1, 2,
3, and so on. (This kind of word-frequency analysis is common to many
fields of study, including linguistics, computer science, and music.)
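
One possible shape for the counting step, sketched in Python under a few assumptions: the helper names `count_kmers` and `sorted_by_frequency` are hypothetical, input is treated case-insensitively, and windows containing anything other than A, C, G, or T are skipped.

```python
from itertools import product


def count_kmers(dna, k=4):
    """Count every overlapping k-base window; unseen words keep a count of 0."""
    # Start with all 4**k possible words (256 when k is 4) so each is reported.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    dna = dna.upper()
    # A window starts at each position 0, 1, 2, ... up to len(dna) - k.
    for i in range(len(dna) - k + 1):
        word = dna[i:i + k]
        if word in counts:  # assumption: skip windows with ambiguity codes
            counts[word] += 1
    return counts


def sorted_by_frequency(counts):
    """Most frequent first; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for word, n in sorted_by_frequency(count_kmers("ACGTACGTAC"))[:5]:
        print(word, n)
```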
-
Exercise 10.7
-
Extend the program in Exercise 10.6 to count all the sequences in a
GenBank library.
-
Exercise 10.8
-
Given an amino acid, find the frequency of occurrence of the amino
acids adjacent to it, as coded in a DNA sequence or in a GenBank library.
-
Exercise 10.9
-
Extract all the words (excluding words like "the" or
other unnecessary words) from the annotation of a library of GenBank
records. For each word found, add the offset of the GenBank record in
the library to a DBM file that has keys equal to the words, and
values that are strings with offsets separated by spaces. In other
words, one key can have a space-separated list of offsets for a
value. Then you can quickly find all records containing a word like
"fibroblast" with a simple lookup, followed by extracting
the offsets and seeking into the library with
those offsets. How big is your DBM file compared to the GenBank
library? What might be involved in constructing a search engine for
the annotations in all of GenBank? For human DNA only?
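
The word-to-offsets scheme described above might be sketched in Python with the standard `dbm` module, as below. This is a rough sketch under stated assumptions: records end with a `//` line, annotation is everything before the `ORIGIN` line, the stop-word list is illustrative, the index is built in memory before being written, and the names `record_offsets`, `build_index`, and `lookup` are hypothetical.

```python
import dbm
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "for"}  # illustrative only


def record_offsets(path):
    """Yield (byte offset, annotation text) for each record in the library."""
    with open(path, "rb") as lib:
        offset, lines, in_annotation = lib.tell(), [], True
        for line in iter(lib.readline, b""):
            if line.startswith(b"//"):  # end of record
                yield offset, b"".join(lines).decode("ascii", "replace")
                offset, lines, in_annotation = lib.tell(), [], True
            elif line.startswith(b"ORIGIN"):  # sequence data follows
                in_annotation = False
            elif in_annotation:
                lines.append(line)


def build_index(library_path, index_path):
    """Write a DBM file mapping each word to space-separated record offsets."""
    index = {}
    for offset, text in record_offsets(library_path):
        words = set(re.findall(r"[a-z]{2,}", text.lower())) - STOP_WORDS
        for word in words:
            index.setdefault(word, []).append(str(offset))
    with dbm.open(index_path, "c") as db:
        for word, offsets in index.items():
            db[word.encode()] = " ".join(offsets).encode()


def lookup(index_path, word):
    """Return the byte offsets of all records whose annotation has the word."""
    with dbm.open(index_path, "r") as db:
        key = word.lower().encode()
        return [int(s) for s in db[key].split()] if key in db else []
```

Each offset returned by `lookup` can be passed to `seek` on the library file, after which the record is read up to its `//` line.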
-
Exercise 10.10
-
Write a program to make a custom library of oncogenes from the GBPRI
division of GenBank.