In the first step of the BLAT (Blast-like Alignment Tool) algorithm, regions that are likely to be homologous are detected. In this exercise sheet, we will investigate the search stage of the BLAT algorithm on the example of the mouse genome.
We want to determine whether a region from the human genome aligns to a part of the mouse genome. Therefore, a homologous region to our query sequence will be detected during the search stage.
We assume that:
In the following exercises, we will focus on the DNA sequences.
For the \(K\)-mer sizes \(K=7\) and \(K=14\), respectively.
What is the probability of having a perfect match between a specific \(K\)-mer in a homologous region and a \(K\)-mer in the query sequence?
\[ p_{1} = M^K \]
For \(K=7\): \[ p_{1} = 0.98^{7} = 0.8681 \] For \(K=14\): \[ p_{1} = 0.98^{14} = 0.7536 \]
What is the probability that at least one non-overlapping \(K\)-mer in the homologous region matches perfectly with the corresponding \(K\)-mer in the query sequence?
\[ P = 1 - (1 - p_{1})^{T} = 1 - (1 - M^{K})^{T},\\ \text{with}\quad T= \left\lfloor\frac{H}{K}\right\rfloor \]
For \(K=7\): \[ P = 1 - (1 - 0.98^7)^{\left\lfloor\frac{50}{7}\right\rfloor} = 0.9999... \] For \(K=14\): \[ P = 1 - (1 - 0.98^{14})^{\left\lfloor\frac{50}{14}\right\rfloor} = 0.9850... \]
Calculate the number of False Positives (FPs), i.e. the number of \(K\)-mers that are expected to match by chance.
\[\begin{align*} F &= (Q-K+1) \times \bigg(\frac{G}{K}\bigg) \times \bigg(\frac{1}{A}\bigg)^{K},\\ &A: \text{the alphabet size}\\ &Q: \text{the query length in bases}\\ &G: \text{the genome size in bases} \end{align*}\]
Warning
This only holds under the assumption that all letters in the alphabet are equally likely!
For \(K=7\): \[ F = (34-7+1) \times \frac{2500000000}{7} \times \frac{1}{4}^{7} = 610351.5625 \] For \(K=14\): \[ F = (34-14+1) \times \frac{2500000000}{14} \times \frac{1}{4}^{14} = 13.9698 \]
Observe the True positive rate (TPR) and the number of False Positives for the \(7\)-mers and \(14\)-mers that you computed in part 1B and 1C. What observation do you make?
When increasing the \(K\)-mer size both the TPR and the number of False Positives are reduced. But the number of False Positives reduces drastically, compared to the TPR.
In order to increase the True Positive Rate, we want to allow single mismatches when checking for exact matches.
For the \(K\)-mer sizes \(K=7\) and \(K=14\), respectively.
What is the probability that at least one non-overlapping \(K\)-mer in the homologous region matches perfectly with the corresponding \(K\)-mer in the query sequence? Given that we allow one mismatch.
\[ P = 1 - (1 - (M^{K}+K\times M^{K-1}\times (1-M)))^{T},\\ \text{with}\quad T= \left\lfloor\frac{H}{K}\right\rfloor \]
For \(K=7\): \[ P = 1 - (1 - (0.98^{7} + 7*0.98^{6}*0.02))^{\left\lfloor\frac{50}{7}\right\rfloor} \approx 1 \] For \(K=14\): \[ P = 1 - (1 - (0.98^{14} + 14*0.98^{13}*0.02))^{\left\lfloor\frac{50}{14}\right\rfloor} \approx 1 \]
Calculate the number of False Positives (FPs), i.e. the number of \(K\)-mers that are expected to match by chance. Given that we allow one mismatch.
\[\begin{align*} F &= (Q-K+1) \times \bigg(\frac{G}{K}\bigg) \times \Bigg[\bigg(\frac{1}{A}\bigg)^{K} + K \times \bigg(\frac{1}{A}\bigg)^{K-1} \times \bigg(1-\frac{1}{A}\bigg)\Bigg],\\ &A: \text{the alphabet size}\\ &Q: \text{the query length in bases}\\ &G: \text{the genome size in bases} \end{align*}\]
For \(K=7\): \[ F = (34-7+1) \times \frac{2500000000}{7} \times (\frac{1}{4}^{7} + 7 \times \frac{1}{4}^6 \times (1-\frac{1}{4})) = 1.3427... * 10^{7} \] For \(K=14\): \[ F = (34-14+1) \times \frac{2500000000}{14} \times (\frac{1}{4}^{14} + 14 \times \frac{1}{4}^{13} \times (1-\frac{1}{4})) = 600.7031 \]
What development do we see when observing the TPR and FPs for the updated formulae?
We observe an increase in both the TPR and the number of False positives.
Finally, we want to reduce the number of False Positive results. To that end, instead of requiring single perfect matches, we now want at least \(n\) perfect matches.
For the \(K\)-mer sizes \(K=7\) and \(K=14\), respectively.
What is the probability that at least \(2\) non-overlapping \(K\)-mer in the homologous region match perfectly with corresponding \(K\)-mers in the query sequence?
\[ P_{n} = (M^{K})^{n} \times (1-M^{K})^{T-n} \times \frac{T!}{n! \times (T-n)!}\\ \text{with}\quad T= \left\lfloor\frac{H}{K}\right\rfloor\\\\ P = P_{n} + P_{n+1} + ... + P_{T} \]
For \(K=7\): \[ P_{2} = 0.98^{7 \times 2} \times (1-0.98^{7})^{\left\lfloor\frac{50}{7}\right\rfloor-2} \times \frac{\frac{50}{7}\text{!}}{2\text{!}\times(\frac{50}{7}-2)\text{!}}\\ \text{...}\\ \text{...}\\ \] \[\begin{align*} P &= P_{2} + P_{3} + P_{4} + P_{5} + P_{6} + P_{7}\\ &= 0.000631 + 0.006925 + 0.045591 + 0.180074 + 0.395142 + 0.371601\\ &= 0.999967... \end{align*}\]
For \(K=14\): \[ P_{2} = 0.98^{14 \times 2} \times (1-0.98^{14})^{\left\lfloor\frac{50}{14}\right\rfloor-2} \times \frac{\frac{50}{14}!}{2!\times(\frac{50}{14}-2)!}\\ \text{...}\\ \text{...}\\ \] \[\begin{align*} P &= P_{2} + P_{3}\\ &= 0.419776 + 0.428050\\ &= 0.847827... \end{align*}\]
We can observe a decrease in the TPR. Does it still make sense to use this method? Why?
While it is true that this also decreases the TPR. It drastically decreases the amount of False Positives, which greatly improves the overall results. In real examples, the size of the homologous regions is typically higher than 50 nucleotides that we chose for this example. When considering our example with \(H=100\), we would end up with a TPR close to 1.0, while having nearly no False Positives.