  • Exercise 1 - entropy
    • 1.1
    • 1.2
    • 1.3
  • Exercise 2 - partition function
    • 2.1
    • 2.2
    • 2.3
    • 2.4
    • 2.5
  • Exercise 3 - Base pair and unpaired probabilities
    • 3.1
    • 3.2
    • 3.3

Exercise 1 - entropy

We want to have a short look at the concept of entropy. Let $X$ be a discrete random variable with probabilities $p(i)$, $i = 1, 2, \dots, n$. The amount of information to be associated with an outcome of probability $p(i)$ is defined as $I(p(i)) = \log_2\left(\frac{1}{p(i)}\right)$.

In information theory, the expected value of the information I(X) for the random variable X is called entropy. The entropy is given by

$$H(X) = E[I(X)] = \sum_{i=1}^{n} p(i) \log_2\left(\frac{1}{p(i)}\right).$$

Note that all logarithms in this exercise are base 2 (standard in information theory).

1.1

Compute the entropy of the probability distribution $p(1) = \frac{1}{2}$, $p(2) = \frac{1}{4}$, $p(3) = \frac{1}{8}$, $p(4) = \frac{1}{8}$.

Compute the expected value of the information of the random variable E[I(X)] or entropy H(X) as described above.

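$$H(X) = p(1)\log\left(\frac{1}{p(1)}\right) + p(2)\log\left(\frac{1}{p(2)}\right) + p(3)\log\left(\frac{1}{p(3)}\right) + p(4)\log\left(\frac{1}{p(4)}\right) = \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{3}{8} = \frac{14}{8} = 1.75$$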

1.2

Compute the entropy of the distribution $p(i) = \frac{1}{4}$, $i = 1, 2, 3, 4$.

$$H(X) = 4 \cdot p(i) \log\left(\frac{1}{p(i)}\right) = 2$$
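
Both hand calculations can be cross-checked numerically. The following is a minimal sketch; the function name `entropy` and the variable names are chosen here for illustration and are not part of the exercise.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: sum of p * log2(1/p), skipping zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

h_skewed = entropy([1/2, 1/4, 1/8, 1/8])   # distribution from 1.1
h_uniform = entropy([1/4] * 4)             # distribution from 1.2

print(h_skewed, h_uniform)  # prints: 1.75 2.0
```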

1.3

How can you explain the difference in entropy for both probability distributions?

Which of the two probability distributions carries more information in itself, i.e. tells you more about an event before it happens?

For the probability distribution in part 1.1, the information content of the distribution itself is higher than in part 1.2, because the distribution already tells us more (e.g. we know that we are more likely to observe $i = 1$). The information content of an event is higher in part 1.2, because we learn more from observing the event (see the coin example from the lecture).

This results in a higher entropy for the probability distribution in part 1.2.

Exercise 2 - partition function

Consider the RNA sequence AUCACCGC with a minimal loop length of 1. Compute the partition function of the molecule using an energy function similar to the Nussinov algorithm: $E(P) = |P|$, that is, the energy is the number of base pairs in a given non-crossing structure (only GC and AU pairs are considered).

Table of Boltzmann weights for RT=1:

2.1

Compute $Z$ via exhaustive enumeration of the structure space, using the provided Boltzmann weights.

To compute Z you need to know all possible structures.
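
One way to obtain all structures is a small recursive enumeration. The sketch below is an illustration under the assumptions stated in the exercise (AU and GC pairs only, at least one unpaired base enclosed by each pair, no crossing pairs, $RT = 1$); the function and constant names are hypothetical, and the recursion is written for clarity rather than efficiency.

```python
import math

SEQ = "AUCACCGC"
MIN_LOOP = 1  # at least one unpaired base between paired positions i and j
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def enumerate_structures(seq, min_loop=MIN_LOOP):
    """Return all non-crossing structures as tuples of 1-based base pairs (i, j)."""
    def rec(i, j):
        # All structures on the 0-based interval [i, j].
        if j - i <= min_loop:
            return [()]  # interval too short for any pair
        # Case 1: position i stays unpaired.
        result = list(rec(i + 1, j))
        # Case 2: position i pairs with some k; inside and outside are independent.
        for k in range(i + min_loop + 1, j + 1):
            if (seq[i], seq[k]) in PAIRS:
                for inner in rec(i + 1, k - 1):
                    for outer in rec(k + 1, j):
                        result.append(((i + 1, k + 1),) + inner + outer)
        return result
    return rec(0, len(seq) - 1)

structures = enumerate_structures(SEQ)
RT = 1.0
# Boltzmann weight exp(-E(P)/RT) with E(P) = |P| as in the exercise statement.
Z = sum(math.exp(-len(p) / RT) for p in structures)

for p in structures:
    print(p if p else "open chain")
print(f"Z = {Z:.3f}")
```

For this sequence the enumeration yields five structures: the open chain, the single pairs (2,4), (3,7) and (5,7), and the two-pair structure {(2,4), (5,7)}.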

2.2

Compute the structure probability for each structure.

$$Pr[P|S] = \frac{e^{-\frac{E(P)}{RT}}}{\sum_{P'} e^{-\frac{E(P')}{RT}}}$$
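
The formula can be evaluated directly once the structures are known. The sketch below assumes the five structures of AUCACCGC listed in the code (derived by hand for this illustration, not given in the exercise text) and uses $E(P) = |P|$ with $RT = 1$; all names are hypothetical.

```python
import math

RT = 1.0
# Assumed structure space for AUCACCGC (minimal loop length 1, AU/GC pairs only),
# written as lists of 1-based base pairs.
structures = {
    "open chain": [],
    "(2,4)": [(2, 4)],
    "(3,7)": [(3, 7)],
    "(5,7)": [(5, 7)],
    "(2,4)(5,7)": [(2, 4), (5, 7)],
}

def energy(pairs):
    return len(pairs)  # E(P) = |P|, the uncorrected energy function

weights = {name: math.exp(-energy(p) / RT) for name, p in structures.items()}
Z = sum(weights.values())
probs = {name: w / Z for name, w in weights.items()}

for name, pr in probs.items():
    print(f"{name:12s} {pr:.3f}")
```

With this energy function the open chain receives the largest probability (about 0.447) and the two-pair structure the smallest (about 0.060).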

2.3

Compare the structure probabilities in relation to each other. What is the structure probability of the open chain? What can you observe?

The open chain is more likely than any other structure, including the optimal structure (here $P = \{(2,4), (5,7)\}$).

2.4

Is this result expected?

Base pairs usually stabilize a structure.

This is not expected: the open chain should not have a higher probability than the optimal structure.

2.5

If not, how can you fix it? Test your idea by recalculating the structure probabilities!

How can you ensure that the open chain gets the lowest Boltzmann weight without changing the idea of scoring the number of base pairs for computing the energy of a structure?

The open chain has the highest probability because the Boltzmann weight decreases with an increasing number of base pairs. The solution is to negate the energy function, i.e. $E(P) = -|P|$.

We can now search for the optimal structure by energy minimization, and the MFE structure is the most probable structure, as expected.
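
As a worked check of this fix, the recalculation can be written out as follows. It assumes $RT = 1$ and a structure space consisting of the open chain, the three single-pair structures $\{(2,4)\}$, $\{(3,7)\}$, $\{(5,7)\}$, and $\{(2,4),(5,7)\}$; all values are rounded.

```latex
\begin{align*}
Z &= e^{0} + 3\,e^{1} + e^{2} \approx 1 + 8.155 + 7.389 = 16.544 \\
Pr[\text{open chain} \mid S] &= \frac{1}{Z} \approx 0.06 \\
Pr[\{(2,4)\} \mid S] = Pr[\{(3,7)\} \mid S] = Pr[\{(5,7)\} \mid S] &= \frac{e}{Z} \approx 0.16 \\
Pr[\{(2,4),(5,7)\} \mid S] &= \frac{e^{2}}{Z} \approx 0.45
\end{align*}
```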

Exercise 3 - Base pair and unpaired probabilities

Use the corrected probabilities of the previous exercise for the sequence AUCACCGC.

3.1

Compute and visualize the base pair probabilities in a dot-plot using your corrected energy function. (Use the probability values instead of dots.)

Compute the base pair probabilities: $Pr[(i,j)|S] = \sum_{P \ni (i,j)} Pr[P|S]$
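
The sum over all structures containing a given pair can be sketched in a few lines. The code below assumes the five-structure space of AUCACCGC listed in it (derived by hand for this illustration), the corrected energy $E(P) = -|P|$ and $RT = 1$; the text matrix is only a stand-in for a real dot-plot, and all names are hypothetical.

```python
import math
from collections import defaultdict

RT = 1.0
N = 8  # length of AUCACCGC
# Assumed structure space, as lists of 1-based base pairs.
structures = [[], [(2, 4)], [(3, 7)], [(5, 7)], [(2, 4), (5, 7)]]

# Boltzmann weight exp(-E(P)/RT) with the corrected energy E(P) = -|P|.
weights = [math.exp(len(p) / RT) for p in structures]
Z = sum(weights)

# Base pair probability: sum of Pr[P|S] over all structures P containing (i, j).
bpp = defaultdict(float)
for pairs, w in zip(structures, weights):
    for pair in pairs:
        bpp[pair] += w / Z

# Text "dot-plot": probability values where a pair occurs, '.' elsewhere.
print("    " + " ".join(f"{j:>4d}" for j in range(1, N + 1)))
for i in range(1, N + 1):
    row = " ".join(
        f"{bpp[(i, j)]:4.2f}" if (i, j) in bpp else "   ." for j in range(1, N + 1)
    )
    print(f"{i:>3d} {row}")
```

Under these assumptions the pairs (2,4) and (5,7) each come out at roughly 0.61 and the pair (3,7) at roughly 0.16.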

3.2

Compute the probabilities $P^u(4)$ and $P^u(5)$ to be unpaired at sequence positions 4 and 5, resp., and $P^u(4,5)$ that the subsequence 4..5 is not involved in any base pairing.

$$Pr^u[(i,j)] = \frac{Z^u_{i,j}}{Z}$$

$P^u(4) = 0.06 + 0.16 + 0.16 = 0.38$

$P^u(5) = 0.06 + 0.16 + 0.16 = 0.38$

$P^u(4,5) = 0.06 + 0.16 = 0.22$
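
These sums can be reproduced by adding up the weights of all structures in which the positions in question are unpaired. The sketch below assumes the five-structure space listed in the code (derived by hand for this illustration), the corrected energy $E(P) = -|P|$ and $RT = 1$; the helper `prob_unpaired` is a hypothetical name. Because it works with unrounded probabilities, its results (about 0.389 and 0.225) differ slightly from hand sums of values rounded to two decimals.

```python
import math

RT = 1.0
# Assumed structure space for AUCACCGC, as lists of 1-based base pairs.
structures = [[], [(2, 4)], [(3, 7)], [(5, 7)], [(2, 4), (5, 7)]]
weights = [math.exp(len(p) / RT) for p in structures]  # exp(-E/RT), E(P) = -|P|
Z = sum(weights)

def prob_unpaired(i, j):
    """Probability that no position in i..j (1-based, inclusive) is paired."""
    total = 0.0
    for pairs, w in zip(structures, weights):
        paired = {pos for pair in pairs for pos in pair}
        if not any(k in paired for k in range(i, j + 1)):
            total += w  # this structure contributes to Z^u_{i,j}
    return total / Z

pu4 = prob_unpaired(4, 4)
pu5 = prob_unpaired(5, 5)
pu45 = prob_unpaired(4, 5)
print(f"Pu(4)={pu4:.3f} Pu(5)={pu5:.3f} Pu(4,5)={pu45:.3f} product={pu4 * pu5:.3f}")
```

The product of the two single-position values (about 0.151) is clearly smaller than the joint value.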

3.3

Why can’t we compute $P^u(4,5)$ from $P^u(4)$ and $P^u(5)$ using their product, i.e. why is $P^u(4,5) \neq P^u(4) \cdot P^u(5)$?

Are $P^u(4)$ and $P^u(5)$ independent?

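The set of structures considered for $P^u(4,5)$ is a subset of both structure sets used to compute $P^u(4)$ and $P^u(5)$. Thus, the two events are not independent and their probabilities cannot simply be multiplied: with the values from 3.2, $P^u(4) \cdot P^u(5) = 0.38 \cdot 0.38 \approx 0.14$, whereas $P^u(4,5) = 0.22$.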