Gene finding with Hidden Markov Models

Goals for today

Object oriented programming in R: S4 objects (short note)
Then back to computational biology …
Measures of performance
Markov models
Hidden Markov Models
A simple model for gene finding

Object Oriented programming

A complete exposition on object orient (OO) programming would take a complete course
Our goal here is just to introduce you to this important concept
OO is a very powerful paradigm that really helps to organize code.
In R, there are three different types of OO programming, ordered by “strictness” (from informal to formal):

Let’s not worry about why there are so many flavours of OO programming, but just focus on as a compromise.

The basic concepts of OO programming

OO programming begins with the concept of a . Here we can make new classes on demand.
For example, we can think of genomes as a class. Individual objects of this class correspond to a single genome. So there would be an object for the genome of yeast and an object for human etc.
To create a class in R, we can use the function from the R base language.
The function has several required arguments, and many optional arguments that more advanced practioners would use to ensure good programming style.
Now each object can have several attributes or properties. These are stored in of the object. For example each genome object would have a name (of the organism), and perhaps a to contain the chromsomes, plus many more attributes for annotations.

A genome class

Genome <- setClass (
  "Genome",        # Set the name of the class. Often people start with uppercase
  
  slots = c(
    organism_name = "character",
    chromosomes = "DNAStringSet",
    annotations = "data.frame",
    chromosome_names = "data.frame",
    GO = "data.frame"
  ), # end of slots
  
  prototype = list(          # set the default values for each slot (optional)
    organism_name = "",    chromosomes = DNAStringSet(), 
    annotations = data.frame(), chromosome_names = data.frame(),
    GO = data.frame()
  ), # end of prototype
  
  # This fucntion can check that the data in your object is consistent;
  validity = function(object) {}  # it doesn't do anything. 
) # end of genome class

Instantiating objects of class GENOME

Now we can create objects of class passing different arguments to fill the slots.

bland <- Genome()
(bland)                # the slots have default settings
## An object of class "Genome"
## Slot "organism_name":
## [1] ""
## 
## Slot "chromosomes":
## DNAStringSet object of length 0
## 
## Slot "annotations":
## data frame with 0 columns and 0 rows
## 
## Slot "chromosome_names":
## data frame with 0 columns and 0 rows
## 
## Slot "GO":
## data frame with 0 columns and 0 rows
bland@organism_name    # note the use of the @ operator to access slots
## [1] ""
bland@chromosomes
## DNAStringSet object of length 0

Instantiating objects of class GENOME

# data_path <- "/cloud/project/data"      #  for RStudio Cloud
anno <- read_rds( file.path( data_path, "annotations_1.0.rds" ))
sc <- read_rds( file.path( data_path, "sc_1.0.rds" ))
sc_meta <- read_rds( file.path( data_path, "sc_meta_1.0.rds" ))
go <- read_rds( file.path( data_path, "GO.rds" ))

Scerevisiae <- Genome(organism_name = "Saccharomyces cerevisiae",
                     chromosomes = sc,
                     chromosome_names = sc_meta,
                     annotations = anno,
                     GO = go)
Scerevisiae@organism_name
## [1] "Saccharomyces cerevisiae"
Scerevisiae@annotations
## # A tibble: 23,058 x 9
##    seqid source type       start    end score strand phase attributes           
##    <fct> <fct>  <fct>      <int>  <int> <dbl> <fct>  <fct> <chr>                
##  1 chrI  SGD    chromosome     1 230218    NA <NA>   <NA>  ID=chrI;dbxref=NCBI:…
##  2 chrI  SGD    telomere       1    801    NA -      <NA>  ID=TEL01L;Name=TEL01…
##  3 chrI  SGD    X_element    337    801    NA -      <NA>  ID=TEL01L_X_element;…
##  4 chrI  SGD    X_element…    63    336    NA -      <NA>  ID=TEL01L_X_element_…
##  5 chrI  SGD    telomeric…     1     62    NA -      <NA>  ID=TEL01L_telomeric_…
##  6 chrI  SGD    gene         335    649    NA +      <NA>  ID=YAL069W;Name=YAL0…
##  7 chrI  SGD    CDS          335    649    NA +      0     Parent=YAL069W_mRNA;…
##  8 chrI  SGD    mRNA         335    649    NA +      <NA>  ID=YAL069W_mRNA;Name…
##  9 chrI  SGD    gene         538    792    NA +      <NA>  ID=YAL068W-A;Name=YA…
## 10 chrI  SGD    CDS          538    792    NA +      0     Parent=YAL068W-A_mRN…
## # … with 23,048 more rows
length(Scerevisiae@chromosomes[[3]])
## [1] 316620
write_rds( Scerevisiae, file.path( data_path, "Scerevisiae_vers_1.0.rds" ))

Methods associated wtih classes

Sometimes we want to have specific methods (functions) that operate on our objects.
Recall the function from last class that takes a binding site and searchers for it in the target sequence.
We won’t explore this functionality here.

Now back to computational biology

What is a model?
Why are they important?
How do we build a model?
How do we parameterize or train a model?
How do we evaluate its performance?
How do we validate a model?

TF binding sites revisited

We left off last class with this figure describing the distribution of log-odds ratio of 4 TFs across chromosome 3 of yeast.

The question is where to set the point along the -axis to call scores to the right as putative TF bindings sites and scores to the left as random hits.

Classifiers

If you set a point along the -axis, any point to the right corresponds to a score from a position somewhere in the chromosome. This position in the chromosome would be classified as a positive hit, an actual binding site for the TF.
Any point to the left corresponds to a score from a position somewhere in the chromosome that is not strong enough to be classified as binding site. It is classified as a negative. hit.
The reality, which you do not know but which you are trying to “learn”, is that each position along the chromosome is either true binding site or false binding site.
If we predict a position in the chromosome to be a binding site, and it truly is, we call it a True Positive (TP).
If we predict it not to be a binding site and it really is not a binding site, we call it a True Negative (TN).
If we predict a position in the chromosome to be a binding site, and it is not, we call it a False Positive (FP).
If we predict a position in the chromosome not to be a binding site, and it truly is, we call it a False Negative (FN).

TP, FP, TN, FN

Measures of performance

(Many other related notions, some of which we will explore later.)
But TP, FP, TN, and FN are difficult to know for the TF binding site problem. Which are the most problematic?

Training and testing

Another issue in the approach above is that the motifs that were used to train our model (the position weight matrix) were used in the evaluation of its performance.
The required reading starts to touch on this issue and we will explore it more today.
Briefly here, whatever data you use to train your model should be kept completely separate than the data used to validate your model.
One simple approach in this direction would be to train the model using only half of the binding sites (eg 3 of 6) and use the other half to compute TP, FP, TN, FN. Why is this not quite correct either?

Goals for today

Object oriented programming in R: S4 objects (short note)
Then back to computational biology …
Measures of performance
Markov models
Hidden Markov Models
A simple model for gene finding
(We switch now to some hand written notes.)

Points of reflection

What are some of the differences between predicting genes in eukaryotic versus prokaryotic genomes?
Besides nucleotide frequency, what other genomic properties could be used to predict the location of genes?
What are the advantages of object oriented programming?

BIOL 480

Computational Biology Western

Gene finding with Hidden Markov Models

M Hallett

21/07/2020

Goals for today

Object Oriented programming

The basic concepts of OO programming

A genome class

Instantiating objects of class GENOME

Instantiating objects of class GENOME

Methods associated wtih classes

Now back to computational biology

TF binding sites revisited

Classifiers

TP, FP, TN, FN

Measures of performance

Training and testing

Goals for today

Points of reflection

BIOL 480