Monday, March 28, 2016

Bioinformatics: Needleman-Wunch Algorithm with Example

Hey guys!

Today we'll discussabout Needleman-Wunch algorithm. It is a pairwise alignment algorithm, which means it takes 2 sequences and measure their similarity. It is one of the simplest algorithm to align sequences. This algorithm is guaranteed to produce optimum alignment result. Let's get started!

As stated above, as input we need 2 sequences. We will also need scoring schema and gap cost, which will be used to evaluate the resulting alignment. The scoring schema consists of scores for matches and mismatches. The gap cost refers to the cost of introducing gaps (we will use '-' symbol to represent gaps) into our sequence. Needleman-Wunsch algorithm uses linear gap function (linear vs affine gap function).

For our example we will use:
  • Sequence A: GCATGCCAT
  • Sequence B: CATGCATCGAC
  • Match Score: +2
  • Mismatch Score: -1
  • Gap cost: -2

1. Initialization

Needleman-Wunsch uses dynamic programming to find the optimum alignment. First, we will need to construct a $(m + 1)$ * $(n + 1)$ table/grid, where $n$ is the length of sequence A and $m$ is the length of sequence B. In our case, the length of sequence A is 9 and the length of sequence B is 11. Therefore, we need to construct a 12 * 10 table.

We will also give initial values to the table as our starting point:
(1) $V_{0,0} = 0$
(2) $V_{0,j} = V_{0,j-1} + g, \text{for all j such that } 1 \leq j \leq n$
(3) $V_{i,0} = V_{i-1},0 + g, \text{for all i such that } 1 \leq i \leq m$
$V_{i,j}$ denotes the value of the cell at position $i,j$. $i$ is the row number and $j$ is the column number (both starts from 0). $g$ refers to the gap cost above.

The resulting table will be something like this:

Fig 1. Table after initialization

2. Completing the Table

Now that we have the initial values set, we will fill the whole table with numbers. The formula for that is:
$V_{i,j} = \max \left\lbrace V_{i-1,j-1} + S(A_j, B_i), V_{i,j-1} + g, V_{i-1,j} + g \right\rbrace$
where $S(A_j, B_i)$ is the function to get the match/mismatch score for $j^{th}$ letter of sequence A and $i^{th}$ letter of sequence B. For example: $S(C, C)$ will give you the match score and $S(A, C)$ will give you the mismatch score.

In other words, the value of a cell (which is still empty after initialization) is the maximum value of its:
  • top-left diagonal neighbour cell + match/mismatch score,
  • left neighbour cell + gap cost, and
  • top neighbour cell + gap cost
Note: We use maximum because of our scoring schema. The aim of the algorithm is to match the letters of the 2 sequences as many as possible. If the scoring schema is different, you might need to use minimum instead of maximum. For example: If we use scoring schema of match score = -2, mismatch score = 0, and gap cost = 1, then we will need to use minimum instead of maximum
It is recommended to keep track of the "origin" of a cell value since it can used in the next part: Traceback. You can do this by, for example, adding arrows to the table or keeping a seperate table of origin. Note: This is not mandatory.
Let's take cell at 1, 1 as our example. To get the value for cell 1, 1, we will use the above formula.

Fig 2. Cell 1, 1

From diagonal neighbour = $V_{1-1,1-1} + S(A_1, B_1) = 0 + S(G, C) = 0 + -1 = -1$
From left neighbour = $V_{1,1-1} + g = V_{1,0} + -2 = -2 + -2 = -4$
From top neighbour = $V_{1-1,1} + g = V_{0,1} + -2 = -2 + -2 = -4$
$V_{1,1} = \max \left\lbrace -1, -4, -4 \right\rbrace = -1$

Keep filling the cells row-by-row or column-by-column until the whole table is filled.

The complete table should be the same as Fig 3.

Fig 3. Complete table
The value of the bottom-right most cell, $V_{m, n}$ is our solution alignment's score.

3. Traceback and Building the Solution Alignment

Now that we have our table fully filled, we can start building the solution alignment by tracing back from $n, m$ cell or bottom-right cell all the way to 0, 0 cell (if haven't noticed already, all values originated from 0, 0).  Therefore, we will build the solution from right to left.

From the formula above, we know that there are only 3 possible "origins" for a single cell, $V_{i,j}$: left neighbour ($V_{i,j-1}$), top neighbour ($V_{i-1,j}$), and top-left diagonal neighbour ($V_{i-1,j-1}$).
  • If our "origin" is from left neighbour, then that means inserting a gap (-) into the side sequence (in our case, sequence B) is our best action and match the gap with $j^{th}$ letter of sequence A.
  • If our "origin" is from top neighbour, that means inserting a gap into the top sequence (in our case, sequence A) is our best action and match the gap with $i^{th}$ letter of sequence B.
  • If our "origin" is from top-left diagonal neighbour, that means matching the 2 letters ($j^{th}$ letter of sequence A and $i^{th}$ letter of sequence B) is our best action.
For those of you who choose not to keep track of the "origins" of the cells, we can use the formula to fill a cell to find its origin. For example:; for cell $n, m$, you can only get value 5 from top-left diagonal neighbour. Top neighbour + gap cost will yield 6. Same for left neighbour. Top-left diagonal neighbour + match/mismatch score (in this case, mismatch) will yield 5. Therefore, our origin for this cell is top-left diagonal neighbour.

Note that it is possible to have multiple optimum solution alignment. Therefore, you may have more than 1 "origin" for a single cell and therefore, multiple paths to traverse from bottom right cell to 0, 0. For our example, we will choose only one path.

Fig 4. The choosen path
We will start from bottom right cell (11, 9). Our first "origin" (or let's call it an arrow) is a diagonal. Therefore we will match 9th letter of sequence A with 11th letter of sequence B.
T
C

Our next arrow is also a diagonal so similarly, we will match A with A.
AT
AC

The next arrow is vertical/up/top. so we will insert a gap into sequence A and match it with 9th letter of sequence B
-AT
GAC

Next is diagonal arrow so we will match C with C
C-AT
CGAC

Next 2 arrow is vertical so we will insert 2 gaps to sequence A.
--C-AT
ATCGAC

The next 5 arrows is diagonals so we will just match them all at once
CATGC--C-AT
CATGCATCGAC

The last one is a horizontal arrow so we will insert a gap into sequence B and match it with the first letter of sequence .
GCATGC--C-AT
-CATGCATCGAC

And that's it! We have our optimum alignment of score 5.
You can find the finished table here in a spreadsheet here.

If you have any questions you can post it in the comments below.

No comments:

Post a Comment