Saturday, April 9, 2016

Bioinformatics: Gotoh Algorithm

Hi Guys!

Today we will discuss about Gotoh Algorithm. Gotoh algorithm is a global alignment algorithm. It aims to align 2 sequences in such a way that it will yield optimum score according to a scoring schema. It has the same aim as Needleman-Wunsch algorithm (in fact all global alignment algorithms have the same aim) except that Gotoh algorithm uses affine gap function (linear vs affine gap function).

Gotoh algorithm takes in 2 sequences, scoring schema, and affine gap function as its input. The scoring schema consists of scores for matches and mismatches. Affine gap function should be in this format:
$g(k) = \alpha + \beta * k$, for some parameter $\alpha$ and $\beta$.
$\alpha$ is called gap open cost and $\beta$ is called gap extend cost.
$k$ refers to the length of the gap.

For our example, we will use:
  • Sequence A: GCATGCCAT
  • Sequence B: CATGCATCGAC
  • Match Score: +2
  • Mismatch Score: -1
  • Gap open cost ($\alpha$): -2
  • Gap extend cost ($\beta$): -1

1. Initialization

In order to do Gotoh algorithm, we need 3 matrices/tables of size $(n+1) * (m+1)$, where $m$ is the length of sequence A and $n$ is the length of sequence B. For our convenience, let's call them Table A, Table B, and Table C.

Next, We fill assign initial values to each of the tables. Let's start with Table A.
(1) $V^A_{0, 0} = 0$
(2) $V^A_{0, j} = g(j)$
(3) $V^A_{i, 0} = g(i)$

Table B:
(1) $V^B_{0, 0} = 0$
(2) $V^B_{0, j} = -\infty$
(3) $V^B_{i, 0} = X$ (Not used)


Table C:
(1) $V^C_{0, 0} = 0$
(2) $V^C_{0, j} = X$ (Not used)
(3) $V^C_{i, 0} = -\infty$
$V^T_{i,j}$ denotes the value of the cell at position $i,j$ in table T. $i$ is the row number and $j$ is the column number (both starts from 0). $g(k)$ refers to the affine gap cost function above.

Note: We use $-\infty$ because of our example's scoring schema. The use of $\infty$ and $-\infty$ depends on the scoring schema.

The initialized tables would look like these:
Fig 1. Initialized Tables
Fig 1. Initialized Tables

2. Completing the Table

Now that we have the initialized table, we can start filling in values for the matrices/tables.
The formula for finding a value of a cell in table A is:
$V^A_{i, j} = \max \left\lbrace V^B_{i, j}, V^C_{i, j}, V^A_{i-1, j-1} + S(A_j, B_i)\right\rbrace$
$A_j$ refers to the $j^th$ letter of sequence A and $B_i$ refers to the $j^th$ letter of sequence B. $S(A_j, B_i)$ refers to the score of the 2 letters according to our scoring schema (match/mismatch).

The formula above tells us that we need to find out the values of the cell in Table B and C in the same position first. So, here's the formula to compute the value of a cell in Table B.
$V^B_{i, j} = \max \left\lbrace V^B_{i-1, j} + \beta, V^A_{i-1, j} + g(1) \right\rbrace$
The formula basically tells us to get the value of our top neighbour in Table A and B, add $g(1)$ and $\beta$ respectively, and choose the highest one.

The formula to compute the value of a cell in Table C:
$V^C_{i, j} = \max \left\lbrace V^C_{i, j-1} + \beta, V^A_{i, j-1} + g(1) \right\rbrace$
The formula basically tells us to get the value of our top neighbour in table A and C, add $g(1)$ and $\beta$ respectively, and choose the highest one.

Fig 2. Visualization
Fig 2. Visualization

Note: We use $\max$ here because of our example's scoring schema. This is related to the use of $\infty$ and $-\infty$. Remember that our aim to use this algorithm is to make as many matches as possible in the alignment. That is why we assign positive value as match score and negative values as mismatch and gap. Since we want to make as many matches as possible, we would want to use $\max$. The intuition is that the higher the score, the better because that means we have more matches. Since we use $\max$ as the aggregation function, we should use $-\infty$ because if we use $\infty$, the whole table will be filled with $\infty$, which is not good since we won't be able to get our solution.
One example when we use $\min$ instead of $\max$ is when we have match score = -2, mismatch score = 1, gap open cost = 2, and gap extend cost = 1. This scoring schema is like a reverse of our example's scoring schema. We want to make as many matches as possible so the lesser the score, the better. Therefore, we use $\min$ and because we use $\min$, we should use $\infty$ instead of $-\infty$.

Fig 3. Cell 1, 1
Fig 3. Cell 1, 1

Now, let's try to fill value for $V^A_{1, 1}$. We will need to find out values for $V^B_{1, 1}$ and $V^C_{1, 1}$ first.
$V^B_{1, 1} = \max\left\lbrace V^B_{0, 1} + -1, V^A_{0, 1} + (-2 + -1) \right\rbrace =  \max\left\lbrace -\infty + -1, -3 + -3 \right\rbrace = -6$
$V^C_{1, 1} = \max\left\lbrace V^C_{1, 0} + -1, V^A_{1, 0} + (-2 + -1) \right\rbrace = \max\left\lbrace -\infty + -1, -3 + -3 \right\rbrace = -6$

Now let's fill $V^A_{1, 1}$.
$V^A_{1, 1} = \max \left\lbrace -6, -6, V^A_{0, 0} + S(A_1, B_1)\right\rbrace$
$V^A_{1, 1} = \max \left\lbrace -6, -6, 0 + S(G, C)\right\rbrace$
$V^A_{1, 1} = \max \left\lbrace -6, -6, 0 + -1\right\rbrace$
$V^A_{1, 1} = \max \left\lbrace -6, -6, -1\right\rbrace = -1$

Note: It is recommended to track where a cell's value originates from (by using arrows, for example) . Be aware that the origin could be on another table (from the formula) so the arrows might get messy.

Keep repeating until you fill out all cells in Table A (which means you also need to fill all cells of Table B and C).

The completed tables will look like these (I used -999 instead of $-\infty$ because i don't know how to put inifinity in spreadsheet. It doesn't affect the solution because -999 is low enough relatve to other values.):

Fig 4. Completed Tables
Fig 4. Completed Tables

3. Traceback and Constructing the Solution

Similar to Needleman-Wunsch algorithm, we start our traceback from the buttom right corner of Table A to the top left corner of Table A (from $V^A_{n, m} to V^A_{0, 0}$). From the formulas in the previous section, we know that the value of a cell can only has a number of "origins":
  • A value of a cell in Table A ($V^A_{i, j}$) can only originate from
    • its diagonal neighbour in Table A ($V^A_{i-1, j-1}$). This case indicates that matching the 2 letters is our best action. After that, we move to the "origin" cell
    • a cell in Table B or C at the same position ($V^B_{i, j}$ or $V^C_{i, j}$). In this case, we do nothing (because the position stay the same) and we move to the "origin" cell.
    • cells containing initialized values, i.e. row 0 and column 0, will have either its top neighbour or left neighbour as its "origin".
  • A value of a cell in Table B ($V^B_{i, j}$) can only originate from:
    • its top neighbour in Table B (V^B_{i-1, j}) or in Table A (V^A_{i-1, j}). In both cases, we add a gap to the top sequence (in our case, sequence A is at the top) and match it with $i^{th}$ letter of the side sequence (in our case, sequence B is at the side) and then move to the "origin" cell
  • A value of a cell in Table C ($V^C_{i, j}$) can only originate from:
    • its left neighbour in Table C ($V^C_{i, j-1}$) or in Table A ($V^A_{i, j-1}$). In both cases, we add a gap to the side sequence and match it with $j^{th}$ letter of the top sequence and then move to the "origin" cell.
 Now let's apply what we just learn to our example. Fig 4 shows the path I choose.
Fig 4. Traceback path
Fig 4. Traceback path
The first 2 arrows are diagonal so we will match the last 2 letters of both sequences.
AT
AC
Current position: A, 9, 7

Our next arrow points to a cell in Table B at the same position. Therefore, we do nothing.
AT
AC
Current position: B, 9,7

Our next arrow points to the top neighbour in Table A. Therefore, we add a gap to top sequence (sequence A) and match it with $7^th$ letter of side sequence (sequence B).
-AT
GAC
Current position: A, 8, 7

Next arrow is diagonal so we match the 2 letters.
C-AT
CGAC
Current position: A, 7, 6

Next arrow points to a cell in Table B at the same position so we do nothing.
C-AT
CGAC
Current position: B, 7, 6

Next arrow is pointing to its top neighbour so we add gap to top sequence and match it with corresponding letter of side sequence.
-C-AT
TCGAC
Current position: B, 6, 6

Next arrow is pointing to its top neighbour in Table A. Therefore, we do the same thing.
--C-AT
ATCGAC
Current position: A, 5, 6

Next 5 arrows are diagonal so we will match the next 5 letters of both sequences.
CATCG --C-AT
CATCG ATCGAC
Current position: A, 0, 1

Last arrow are horizontal arrow so we will add gap to the side sequence and match it with corresponding letter of top sequence.
GCATCG --C-AT
-CATCG ATCGAC
Current position: A, 0, 0

And we are dont with the algortihm! You have your solution with score of 3 as indicated by the bottom, right corner of Table A.

You can find the solution in spreadsheet here.

Monday, April 4, 2016

Bioinformatics: Linear Gap Function vs Affine Gap Function

The main difference between the 2 is:
  • A gap function is called linear if it is in the form of:
    $g(k) = \beta * k$, for some parameter $\beta$.
    $k$ refers to the length of the gap.
  • A gap function is called affine if it is in the form of:
    $g(k) = \alpha + \beta * k$, for some parameter $\alpha$ and $\beta$.
    $\alpha$ is called gap open cost and $\beta$ is called gap extend cost.
    $k$ refers to the length of the gap.
Affine gap function prefers one long gap instead of many small gaps of the same length (because of the gap open cost). Linear gap function does not take into account the length of the gap (doesn't have gap open cost). Affine gap function is more common and gives more realistic result than linear gap function.

Needleman-Wunsch algorithm is used to do global alignment with linear scoring function.
Gotoh algorithm is used to do global alignment with affine gap function.

For example:We have 2 sequences: ATCG and ATGCCG and match score = 2, mismatch score = -1. If we align the 2 sequence using linear gap function, we will get 2 possible results (by using Needleman-Wunsch algorithm):
  1. AT--CG
    ATGCCG
  2. AT-C-G
    ATGCCG
However, if you try to align the 2 sequence with affine gap function with same match/mismatch score (by using Gotoh algorithm), you will only get the first result because the gaps in the second result will incur cost of $2 * (\alpha + \beta * 1)$ while the gaps in the first result will incur cost of $\alpha + \beta * 2$. Therefore, $2\alpha + 2\beta < \alpha + 2\beta$. Since the number of matches is the same (4 matches), the alignment score of the first result is strictly bigger than the second result.

Friday, April 1, 2016

Bioinformatics: Smith-Waterman Algorithm with Example

Hi Guys!

Today we will discuss Smith-Waterman algorithm. This algorithm is a local alignment algorithm i.e. it aligns a part of both sequence which yields highest score. As input, we need 2 sequences and a scoring schema.

For our example we will use:
  • Sequence A: GCATGCCAT
  • Sequence B: CATGCATCGAC
  • Match Score: +2
  • Mismatch Score: -1
  • Gap cost: -2

1. Initialization

To start, we will need to construct a $(m+1) * (n+1)$ table/grid, where $n$ is the length of sequence A and $m$ is the length of sequence B (for our example, a 12 * 10 table), and give it initial values:
(1)  $V_{0, j} = 0, \text{ for all } 0 \leq j \leq m$
(2)  $V_{i, 0} = 0, \text{ for all } 1 \leq i \leq n$
$V_{i,j}$ denotes the value of the cell at position $i,j$. $i$ is the row number and $j$ is the column number (both starts from 0). $g$ refers to the gap cost above.

In other words, we are assigning 0s to all cells in first row (row 0) and first column (column 0).

The resulting table should look like this

Fig 1. Initialized Table
Fig 1. Initialized table

2. Completing the Table

Next, we need to give values to all cells in the table. To do that, we use this formula:
$V_{i, j} = \max \left\lbrace V_{i-1, j-1} + S(A_j, B_i), V_{i, j-1} + g, V_{i-1, j} + g, 0 \right\rbrace$
where $S(A_j, B_i)$ is the function to get the match/mismatch score for $j^{th}$ letter of sequence A and $i^{th}$ letter of sequence B. For example: $S(C, C)$ will give you the match score and $S(A, C)$ will give you the mismatch score.

The formula is similar to Needleeman-Wunsch algorithm, except that we limit the value of the cells to be at least 0.

From the formula above, we know that a cell value could originate from:
  • top neighbour cell
  • left neighbour cell
  • top left diagonal cell

For example, let's take cell at 1, 1:

Fig 2. Cell 1, 1
Fig 2. Cell 1, 1
From top-left diagonal neighbour: $V_{1-1, 1-1} + S(A_1, B_1) = 0 + S(G, C) = 0 + -1 = -1$
From left neighbour: $V_{1, 1-1} + g = 0 + -2 = -2$
From top neighbour: $V_{1-1, 1} + g = 0 + -2 = -2$
$V_{1, 1} = \max \left\lbrace -1, -2, -2, 0 \right\rbrace = 0$

Proceed to fill all the cells row-by-row or column-by-column. It is recommended to keep track where a cell value originates from. Cell with value 0 has no "origin".

The completed table should look like this:

Fig 3. Completed Table
Fig 3. Completed Table

3. Traceback and Constructing the Solution

Now that we have the complete table, we can start the traceback process. We will start the traceback from the cell with highest value in the table to a cell with lowest possible value, 0.
Note: It is possible to have multiple cell containing the highest value. In that case, there are more than 1 optimum solution.

In our example, the highest value is 12 and there is only 1 cell which has that value at position 7, 9. That means that our best/optimum local alignment has a score of 12.

Therefore, we will start our traceback from cell 7, 9 because it has highest value of 12 to the first cell with value 0.

As explained above, we know that there are only 3 possible "origins" for a single cell, $V_{i,j}$: left neighbour ($V_{i,j-1}$), top neighbour ($V_{i-1,j}$), and top-left diagonal neighbour ($V_{i-1,j-1}$).
  • If our "origin" is from left neighbour, then that means inserting a gap (-) into the side sequence (in our case, sequence B) is our best action and match the gap with $j^{th}$ letter of sequence A.
  • If our "origin" is from top neighbour, that means inserting a gap into the top sequence (in our case, sequence A) is our best action and match the gap with $i^{th}$ letter of sequence B.
  • If our "origin" is from top-left diagonal neighbour, that means matching the 2 letters ($j^{th}$ letter of sequence A and $i^{th}$ letter of sequence B) is our best action.


Our first arrow is a diagonal so we match 7th letter of sequence B with 9th letter in sequence A.
T
T

Our next 2 arrows is also a diagonal so we do the same thing as the previous step.
CAT
CAT

Our fourth arrow is a horizontal arrow (i.e. originated from left neighbour cell), so we will insert a gap into the side sequence (sequence B) and match it with 6th letter of sequence A.
CCAT
-CAT

The next 4 arrows are all diagonal so we did the same thing as in the first step (matching).
CATGCCAT
CATG-CAT

There you go! A local alignment of score 12.

You can find the finished table here in a spreadsheet here.