Today we will discuss Smith-Waterman algorithm. This algorithm is a local alignment algorithm i.e. it aligns a part of both sequence which yields highest score. As input, we need 2 sequences and a scoring schema.
For our example we will use:
- Sequence A: GCATGCCAT
- Sequence B: CATGCATCGAC
- Match Score: +2
- Mismatch Score: -1
- Gap cost: -2
1. Initialization
To start, we will need to construct a $(m+1) * (n+1)$ table/grid, where $n$ is the length of sequence A and $m$ is the length of sequence B (for our example, a 12 * 10 table), and give it initial values:(1) $V_{0, j} = 0, \text{ for all } 0 \leq j \leq m$
(2) $V_{i, 0} = 0, \text{ for all } 1 \leq i \leq n$
$V_{i,j}$ denotes the value of the cell at position $i,j$. $i$ is the row number and $j$ is the column number (both starts from 0). $g$ refers to the gap cost above.
In other words, we are assigning 0s to all cells in first row (row 0) and first column (column 0).
The resulting table should look like this
Fig 1. Initialized table |
2. Completing the Table
Next, we need to give values to all cells in the table. To do that, we use this formula:$V_{i, j} = \max \left\lbrace V_{i-1, j-1} + S(A_j, B_i), V_{i, j-1} + g, V_{i-1, j} + g, 0 \right\rbrace$
where $S(A_j, B_i)$ is the function to get the match/mismatch score for $j^{th}$ letter of sequence A and $i^{th}$ letter of sequence B. For example: $S(C, C)$ will give you the match score and $S(A, C)$ will give you the mismatch score.
The formula is similar to Needleeman-Wunsch algorithm, except that we limit the value of the cells to be at least 0.
From the formula above, we know that a cell value could originate from:
- top neighbour cell
- left neighbour cell
- top left diagonal cell
For example, let's take cell at 1, 1:
Fig 2. Cell 1, 1 |
From left neighbour: $V_{1, 1-1} + g = 0 + -2 = -2$
From top neighbour: $V_{1-1, 1} + g = 0 + -2 = -2$
$V_{1, 1} = \max \left\lbrace -1, -2, -2, 0 \right\rbrace = 0$
Proceed to fill all the cells row-by-row or column-by-column. It is recommended to keep track where a cell value originates from. Cell with value 0 has no "origin".
The completed table should look like this:
Fig 3. Completed Table |
3. Traceback and Constructing the Solution
Now that we have the complete table, we can start the traceback process. We will start the traceback from the cell with highest value in the table to a cell with lowest possible value, 0.Note: It is possible to have multiple cell containing the highest value. In that case, there are more than 1 optimum solution.
In our example, the highest value is 12 and there is only 1 cell which has that value at position 7, 9. That means that our best/optimum local alignment has a score of 12.
As explained above, we know that there are only 3 possible "origins" for a single cell, $V_{i,j}$: left neighbour ($V_{i,j-1}$), top neighbour ($V_{i-1,j}$), and top-left diagonal neighbour ($V_{i-1,j-1}$).
- If our "origin" is from left neighbour, then that means inserting a gap (-) into the side sequence (in our case, sequence B) is our best action and match the gap with $j^{th}$ letter of sequence A.
- If our "origin" is from top neighbour, that means inserting a gap into the top sequence (in our case, sequence A) is our best action and match the gap with $i^{th}$ letter of sequence B.
- If our "origin" is from top-left diagonal neighbour, that means matching the 2 letters ($j^{th}$ letter of sequence A and $i^{th}$ letter of sequence B) is our best action.
Our first arrow is a diagonal so we match 7th letter of sequence B with 9th letter in sequence A.
T |
T |
Our next 2 arrows is also a diagonal so we do the same thing as the previous step.
C | A | T |
C | A | T |
Our fourth arrow is a horizontal arrow (i.e. originated from left neighbour cell), so we will insert a gap into the side sequence (sequence B) and match it with 6th letter of sequence A.
C | C | A | T |
- | C | A | T |
The next 4 arrows are all diagonal so we did the same thing as in the first step (matching).
C | A | T | G | C | C | A | T |
C | A | T | G | - | C | A | T |
There you go! A local alignment of score 12.
You can find the finished table here in a spreadsheet here.
No comments:
Post a Comment