Midterm

The midterm must be submitted by 2:30pm on Tuesday, October 26th, 2021 to \({\tt bioinfo.western@gmail.com}\).

Submit all answers in one email.

The subject line of the email must be: Midterm, Lastname, Firstname, student ID.

You can use whichever medium you prefer to answer the questions.

This is open book, so you can use whatever resources you would like, but you must cite them. You are not allowed to speak to each other or to other experts in this field (e.g., students who previously took the course).

\({\bf 40}\) total marks.

Please answer the following in point form, using clear sentences.

Question 1 [4 points]

How do cloud-based approaches reduce the financial burden for researchers and companies?

Question 2 [6 points]

What is the difference between an algorithm and a computer program? What is the difference between a computer program and a computer language? What is the difference between a computer language and an integrated development environment (IDE)?

Question 3 [10 points]

For each item below, indicate whether you believe this is a data science, bioinformatics or computational biology issue. Your answer may name none, one, or several of these. Justify your answer in at most one sentence.

Part a. Establishing a communication system (e.g. Slack) and a cloud space for writing/sharing documents (e.g. Drive) for a group project.

Part b. Building software that takes as input thousands of RNA-sequencing samples and evaluates the technical quality (number of reads, number of reads that do not align to the genome etc.) of each sample and automatically detects those samples that are likely to be of poor quality.

Part c. In a project that aims to profile the transcriptome of 1,000 tumors, identifying genes that are highly correlated with one another, and with a clinical variable such as time to recurrence.

Part d. Your collaborator has a new assay (sometimes called a screen) that identifies the sites of ubiquitination (or some other type of modification to genomes or proteins) across a whole genome. She gives you a file of all the sites that were identified for some specific organism. Your job is to design a web portal and software to make this available for users to examine and download.

Part e. Your collaborator gives you the results of her ubiquitination screen from part d. She asks you to design software that predicts ubiquitination sites in a new organism (not the one where the original screen was performed).

Question 4 [10 points]

Suppose you have a tibble that describes the genetic code and some amino acid properties (called \({\tt genetic\_code}\)). For example, it might look something like this.

## # A tibble: 10 x 8
##    pos1  pos2  pos3  amino_acid_long class      polarity charge   mass
##    <chr> <chr> <chr> <chr>           <chr>      <chr>    <chr>   <dbl>
##  1 U     U     U     Phe             aromatic   nonpolar neutral  165.
##  2 U     U     C     Phe             aromatic   nonpolar neutral  165.
##  3 U     U     A     Leu             aliphatic  nonpolar neutral  131.
##  4 U     U     G     Leu             aliphatic  nonpolar neutral  131.
##  5 C     U     U     Leu             aliphatic  nonpolar neutral  131.
##  6 C     U     C     Leu             aliphatic  nonpolar neutral  131.
##  7 C     U     A     Leu             aliphatic  nonpolar neutral  131.
##  8 C     U     G     Leu             aliphatic  nonpolar neutral  131.
##  9 U     C     U     Ser             hydroxilic polar    neutral  105.
## 10 U     A     A     STOP            <NA>       <NA>     <NA>      NA

(Please note that I have not included the entire genetic code here. You can assume the rest of the genetic code is present in the tibble \({\tt genetic\_code}\).)

Show code for each of the following questions.

Part a. For some amino acid \(Z\), return all of the codons that code for \(Z\).

Part b. For a given codon specified by \({\tt pos1, pos2, pos3}\), report which amino acid it codes for.

Part c. Report all amino acids encoded by a codon that begins with the nucleotide C (that is, \({\tt pos1}\) is C).

Part d. Identify the amino acid which has the most codons that code for it.

Part e. Report the number of amino acids that are nonpolar, not neutral, and have a mass above 100.

Part f. Report the average mass of all aromatic amino acids.

Question 5 [10 points]

Using the \({\tt small\_brca}\) tibble, show code for the following:

Part a. What is the average age of women at time of diagnosis?

Part b. For each stage (\({\tt ajcc\_pathologic\_tumor\_stage}\)), what was the average age at death among the women who died? (You can ignore any missing values.)

Good luck!