Today we'll discussabout Needleman-Wunch algorithm. It is a pairwise alignment algorithm, which means it takes 2 sequences and measure their similarity. It is one of the simplest algorithm to align sequences. This algorithm is guaranteed to produce optimum alignment result. Let's get started!
As stated above, as input we need 2 sequences. We will also need scoring schema and gap cost, which will be used to evaluate the resulting alignment. The scoring schema consists of scores for matches and mismatches. The gap cost refers to the cost of introducing gaps (we will use '-' symbol to represent gaps) into our sequence. Needleman-Wunsch algorithm uses linear gap function (linear vs affine gap function).
For our example we will use:
- Sequence A: GCATGCCAT
- Sequence B: CATGCATCGAC
- Match Score: +2
- Mismatch Score: -1
- Gap cost: -2
1. Initialization
Needleman-Wunsch uses dynamic programming to find the optimum alignment. First, we will need to construct a $(m + 1)$ * $(n + 1)$ table/grid, where $n$ is the length of sequence A and $m$ is the length of sequence B. In our case, the length of sequence A is 9 and the length of sequence B is 11. Therefore, we need to construct a 12 * 10 table.We will also give initial values to the table as our starting point:
(1) $V_{0,0} = 0$
(2) $V_{0,j} = V_{0,j-1} + g, \text{for all j such that } 1 \leq j \leq n$
(3) $V_{i,0} = V_{i-1},0 + g, \text{for all i such that } 1 \leq i \leq m$
$V_{i,j}$ denotes the value of the cell at position $i,j$. $i$ is the row number and $j$ is the column number (both starts from 0). $g$ refers to the gap cost above.
The resulting table will be something like this:
Fig 1. Table after initialization |
2. Completing the Table
Now that we have the initial values set, we will fill the whole table with numbers. The formula for that is:$V_{i,j} = \max \left\lbrace V_{i-1,j-1} + S(A_j, B_i), V_{i,j-1} + g, V_{i-1,j} + g \right\rbrace$
where $S(A_j, B_i)$ is the function to get the match/mismatch score for $j^{th}$ letter of sequence A and $i^{th}$ letter of sequence B. For example: $S(C, C)$ will give you the match score and $S(A, C)$ will give you the mismatch score.
In other words, the value of a cell (which is still empty after initialization) is the maximum value of its:
- top-left diagonal neighbour cell + match/mismatch score,
- left neighbour cell + gap cost, and
- top neighbour cell + gap cost
Fig 2. Cell 1, 1 |
From diagonal neighbour = $V_{1-1,1-1} + S(A_1, B_1) = 0 + S(G, C) = 0 + -1 = -1$
From left neighbour = $V_{1,1-1} + g = V_{1,0} + -2 = -2 + -2 = -4$
From top neighbour = $V_{1-1,1} + g = V_{0,1} + -2 = -2 + -2 = -4$
$V_{1,1} = \max \left\lbrace -1, -4, -4 \right\rbrace = -1$
Keep filling the cells row-by-row or column-by-column until the whole table is filled.
The complete table should be the same as Fig 3.
Fig 3. Complete table |
3. Traceback and Building the Solution Alignment
Now that we have our table fully filled, we can start building the solution alignment by tracing back from $n, m$ cell or bottom-right cell all the way to 0, 0 cell (if haven't noticed already, all values originated from 0, 0). Therefore, we will build the solution from right to left.From the formula above, we know that there are only 3 possible "origins" for a single cell, $V_{i,j}$: left neighbour ($V_{i,j-1}$), top neighbour ($V_{i-1,j}$), and top-left diagonal neighbour ($V_{i-1,j-1}$).
- If our "origin" is from left neighbour, then that means inserting a gap (-) into the side sequence (in our case, sequence B) is our best action and match the gap with $j^{th}$ letter of sequence A.
- If our "origin" is from top neighbour, that means inserting a gap into the top sequence (in our case, sequence A) is our best action and match the gap with $i^{th}$ letter of sequence B.
- If our "origin" is from top-left diagonal neighbour, that means matching the 2 letters ($j^{th}$ letter of sequence A and $i^{th}$ letter of sequence B) is our best action.
Note that it is possible to have multiple optimum solution alignment. Therefore, you may have more than 1 "origin" for a single cell and therefore, multiple paths to traverse from bottom right cell to 0, 0. For our example, we will choose only one path.
Fig 4. The choosen path |
T |
C |
Our next arrow is also a diagonal so similarly, we will match A with A.
A | T |
A | C |
The next arrow is vertical/up/top. so we will insert a gap into sequence A and match it with 9th letter of sequence B
- | A | T |
G | A | C |
Next is diagonal arrow so we will match C with C
C | - | A | T |
C | G | A | C |
Next 2 arrow is vertical so we will insert 2 gaps to sequence A.
- | - | C | - | A | T |
A | T | C | G | A | C |
The next 5 arrows is diagonals so we will just match them all at once
C | A | T | G | C | - | - | C | - | A | T |
C | A | T | G | C | A | T | C | G | A | C |
The last one is a horizontal arrow so we will insert a gap into sequence B and match it with the first letter of sequence .
G | C | A | T | G | C | - | - | C | - | A | T |
- | C | A | T | G | C | A | T | C | G | A | C |
And that's it! We have our optimum alignment of score 5.
You can find the finished table here in a spreadsheet here.
If you have any questions you can post it in the comments below.