BNFO 601 
Integrated Bioinformatics
Scenarios
Fall 2007 
Identifying DNA foreign to a genome

Scientific story

In brief: Genes that provide bacteria with exotic abilities, such as pathogenesis, often arise by horizontal transfer from other organisms. You would like to identify all genes in the sequenced genome of a bacterium that have foreign origins. Current methods work well with large blocks of DNA (i.e. many tens of genes in length) but not so well with individual genes, because they do not extract enough information from a single gene for the characteristics of foreign genes to rise reliably above random variation. You would like to adapt a technique that makes greater use of the information within genes and use it to identify foreign genes.
Bioinformatic tools
Markov models
Contrary to all those disclaimers from investment advisors, past performance CAN predict future behavior.
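To make the idea concrete, here is a minimal sketch of a Markov model built from text and then used to produce pseudotext. It is written in Python rather than the course's Perl, with a nested dict playing the role of a Perl hash of hashes. The function names, the order of 2, and the short sample string are illustrative assumptions, not taken from Hamlet.pl.

```python
import random
from collections import defaultdict

def build_model(text, order=2):
    """Map each `order`-character context to counts of the character that follows it."""
    model = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context][text[i + order]] += 1
    return model

def generate(model, seed, length, rng):
    """Extend `seed` one character at a time, sampling in proportion to the counts.

    The seed must be exactly as long as the order used to build the model.
    """
    order = len(seed)
    out = seed
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:  # context never seen with a follower: stop early
            break
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out += rng.choices(chars, weights=weights)[0]
    return out

# Hypothetical training text, far shorter than a real input file would be
text = "to be or not to be that is the question"
model = build_model(text, order=2)
pseudotext = generate(model, seed="to", length=30, rng=random.Random(0))
print(pseudotext)
```

Every three-character window of the output also occurs in the training text, which is what "past performance predicts future behavior" amounts to for a second-order model.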
Molecular biology concepts: Compositional inhomogeneities in genomic sequences

Perl focus: Using hashes

Papers

Ute Hentschel and Jörg Hacker (2001). Pathogenicity islands: the tip of the iceberg [Review]. Microbes and Infection 3:545-548
A quick review of pathogenicity islands (referred to in Notes for Nov 24)
Samuel Karlin (2001). Review: Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends in Microbiology 9:335-343
Review of methods to detect pathogenicity islands (main focus of Notes for Nov 24)
Jan Mrázek, Devaki Bhaya, Arthur R. Grossman, and Samuel Karlin (2001). Highly expressed and alien genes of the Synechocystis genome. Nucleic Acids Research 29:1590-1601
An attempt to apply methods for detecting pathogenicity islands to the detection of individual foreign genes. Most of the article, however, is concerned with highly expressed genes. (referred to in first set of Notes)
Notes
Detection of anomalous regions of a genome (PDF) (Presentation)
Construction of programs to detect anomalous genes (PDF)
Basic Theory of Markov chains (ppt)
Programs
Hamlet.pl - Creates Markov model based on text in input file and uses it to create pseudotext
DATA: HamletSpeech.txt - Possible input for Hamlet.pl
DATA: Carols.txt - Possible input for Hamlet.pl
DATA: IRS-1040.doc - data for Hamlet.pl (IRS instructions)
DATA: Candide.txt - data for Hamlet.pl (1st two chapters, in French)
DATA: German.txt - data for Hamlet.pl (random spam, in German)
Display_hash.pl - Displays the contents of a hash in a logical format
FastA_module.pm - Reads in FastA-formatted files. You'll use it in MakeMarkov.pl

Storable.pm - Required for Display_hash.pl
MakeMarkov.pl - Creates Markov model based on set of DNA sequences. You'll write this based on Hamlet.pl
DATA: 6803PHX.nt - Training set of DNA sequences from bona fide genes of Synechocystis PCC 6803
UseMarkov.pl - Assesses open reading frames using Markov model
DATA: 6803Orfs.nt - All protein-encoding genes from Synechocystis PCC 6803
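The two-step workflow implied by MakeMarkov.pl and UseMarkov.pl (train a model on trusted genes, then assess other sequences against it) might be sketched as follows. This is only an illustration in Python, not the course programs: the function names, the order of 2, the pseudocount of 1, the scoring by mean log probability, and the toy sequences are all assumptions made for the example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train(sequences, order=2):
    """Count, for each `order`-nucleotide context, which nucleotide follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - order):
            counts[seq[i:i + order]][seq[i + order]] += 1
    return counts

def mean_log_prob(seq, counts, order=2, pseudocount=1.0):
    """Average log probability per transition of `seq` under the trained model.

    A pseudocount keeps unseen transitions from producing log(0).
    """
    total = 0.0
    steps = 0
    for i in range(len(seq) - order):
        followers = counts.get(seq[i:i + order], {})
        numerator = followers.get(seq[i + order], 0) + pseudocount
        denominator = sum(followers.values()) + pseudocount * len(ALPHABET)
        total += math.log(numerator / denominator)
        steps += 1
    return total / steps if steps else float("nan")

# Toy sequences invented for illustration; real input would come from FastA files
training = ["ATGGCGGCGGCGGCGTAA", "ATGGCGGCCGCGGCGTGA"]
counts = train(training)

typical = mean_log_prob("ATGGCGGCGGCG", counts)
atypical = mean_log_prob("ATGAAATTTAAATTT", counts)
print(typical, atypical)
```

A sequence whose composition resembles the training set scores higher (closer to zero) than one that does not, so unusually low scores are the candidates for foreign origin. Dividing by the number of transitions keeps long and short sequences comparable.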
Problem Set: Problem Set 8
Alternate results (used in PS8.1h): 6803orfs_codon_bias.xls
Data file (used in PS8.6): aa_info.txt

Questionnaire