
10.6 Exercises

Exercise 10.1

Go to the NCBI, EMBL, and EBI web sites and become familiar with their use.

Exercise 10.2

Read the GenBank format documentation, gbrel.txt.

Exercise 10.3

Write a subroutine that passes a hash by value. Now rewrite it to pass the hash by reference.
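As a starting point, the difference between the two calling styles can be sketched as follows. The subroutine names and the toy hash are hypothetical, chosen only for illustration; this is one possible shape for the exercise, not a model answer.

```perl
use strict;
use warnings;

# Hypothetical helper names, used only in this sketch.

# By value: the hash is flattened into @_ and copied into a new hash,
# so changes made here do not affect the caller's hash.
sub add_gene_by_value {
    my (%genes) = @_;
    $genes{new_gene} = 'added';
    return scalar keys %genes;
}

# By reference: a single scalar (the reference) is passed,
# so changes made here are visible to the caller.
sub add_gene_by_reference {
    my ($genes_ref) = @_;
    $genes_ref->{new_gene} = 'added';
    return scalar keys %$genes_ref;
}

my %genes = (gene1 => 'ACGT', gene2 => 'TTGA');   # toy data

my $count_value = add_gene_by_value(%genes);
my $after_value = scalar keys %genes;   # caller's hash is unchanged

my $count_ref = add_gene_by_reference(\%genes);
my $after_ref = scalar keys %genes;     # caller's hash has grown

print "by value:     sub saw $count_value keys, caller has $after_value\n";
print "by reference: sub saw $count_ref keys, caller has $after_ref\n";
```

Passing by reference also avoids copying the whole hash on every call, which matters when the hash is large.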

Exercise 10.4

Design a module of subroutines to handle the following kinds of data: a flat file containing records consisting of gene names on a line and extra information of any sort on succeeding lines, followed by a blank line. Your subroutines should be able to read in the data and then do a fast lookup on the information associated with a gene name. You should also be able to add new records to the flat file. Now reuse this module to build an address book program.

Exercise 10.5

Descend further into the FEATURES table. Parse the features in the table into their next level by parsing the feature names, locations, and qualifiers. Check the document gbrel.txt for definitions of the structures of the fields.

Exercise 10.6

Write a program that takes a long DNA sequence as input and outputs the counts of all four-base subsequences (256 of them in all), sorted by frequency. A four-base subsequence starts at each location 1, 2, 3, and so on. (This kind of word-frequency analysis is common to many fields of study, including linguistics, computer science, and music.)
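The overlapping-window idea can be sketched as below. The subroutine name count_kmers and the toy sequence are assumptions made for this example. The sketch reports only the subsequences that actually occur, so a complete solution would still need to list the remaining ones with a count of zero and decide how to treat case and non-ACGT characters.

```perl
use strict;
use warnings;

# Count every overlapping subsequence of length $k in $dna.
# (count_kmers is a hypothetical name used only in this sketch.)
sub count_kmers {
    my ($dna, $k) = @_;
    my %count;
    # A window starts at every position that leaves room for $k bases.
    for my $i (0 .. length($dna) - $k) {
        $count{ substr($dna, $i, $k) }++;
    }
    return %count;
}

my $dna   = 'ACGTACGTAC';   # toy data, not a real sequence
my %count = count_kmers($dna, 4);

# Sort by descending frequency, breaking ties alphabetically.
for my $kmer (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    print "$kmer $count{$kmer}\n";
}
```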

Exercise 10.7

Extend the program in Exercise 10.6 to count all the sequences in a GenBank library.

Exercise 10.8

Given an amino acid, find the frequency of occurrence of the amino acids adjacent to it in the protein coded by a DNA sequence, or by the sequences in a GenBank library.

Exercise 10.9

Extract all the words (excluding words like "the" or other unnecessary words) from the annotation of a library of GenBank records. For each word found, add the offset of the GenBank record in the library to a DBM file that has keys equal to the words, and values that are strings with offsets separated by spaces. In other words, one key can have a space-separated list of offsets for a value. Then you can quickly find all records containing a word like "fibroblast" with a simple lookup, followed by extracting the offsets and seeking into the library with those offsets. How big is your DBM file compared to the GenBank library? What might be involved in constructing a search engine for the annotations in all of GenBank? For human DNA only?
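The index-then-seek scheme described above might be sketched as follows, under several assumptions: the three-record "library" is made-up toy data, the stop list is hypothetical, and SDBM_File is used only because it ships with Perl. SDBM limits the combined size of a key and its value to roughly 1 KB, so a real index over a large library would need a different DBM implementation.

```perl
use strict;
use warnings;
use Fcntl;                     # O_RDWR, O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Toy stand-in for a GenBank library: records end with a "//" line.
my $library = "$dir/library.gb";
open(my $out, '>', $library) or die "Cannot write $library: $!";
print $out "DEFINITION  Human fibroblast growth factor\n//\n";
print $out "DEFINITION  Mouse kinase\n//\n";
print $out "DEFINITION  Rat fibroblast collagenase\n//\n";
close $out;

# Hypothetical stop list of words not worth indexing.
my %stop = map { $_ => 1 } qw(the of and a definition);

tie(my %index, 'SDBM_File', "$dir/index", O_RDWR | O_CREAT, 0666)
    or die "Cannot open DBM file: $!";

open(my $in, '<', $library) or die "Cannot read $library: $!";
{
    local $/ = "//\n";         # read one whole record at a time
    my $offset = tell($in);    # byte offset where the next record starts
    while (my $record = <$in>) {
        my %seen;              # index each word once per record
        for my $word (map { lc } $record =~ /[A-Za-z]+/g) {
            next if $stop{$word} or $seen{$word}++;
            # Values are space-separated lists of record offsets.
            $index{$word} = exists $index{$word}
                          ? "$index{$word} $offset"
                          : $offset;
        }
        $offset = tell($in);
    }
}

# Lookup: fetch the offsets for a word, then seek to each record.
my @offsets = split ' ', ($index{fibroblast} // '');
my @hits;
{
    local $/ = "//\n";
    for my $o (@offsets) {
        seek($in, $o, 0);      # 0 means "from the start of the file"
        my $record = <$in>;
        push @hits, $record;
    }
}
close $in;
untie %index;

print "fibroblast found at offsets: @offsets\n";
print @hits;
```

Storing offsets rather than the records themselves keeps the DBM file small, at the cost of one seek into the library per hit.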

Exercise 10.10

Write a program to make a custom library of oncogenes from the GBPRI division of GenBank.


© 2002, O'Reilly & Associates, Inc.