AnyThing Tutorials: Bioinformatics: Gotoh Algorithm

Hi Guys!

Today we will discuss about Gotoh Algorithm. Gotoh algorithm is a global alignment algorithm. It aims to align 2 sequences in such a way that it will yield optimum score according to a scoring schema. It has the same aim as Needleman-Wunsch algorithm (in fact all global alignment algorithms have the same aim) except that Gotoh algorithm uses affine gap function (linear vs affine gap function).

Gotoh algorithm takes in 2 sequences, scoring schema, and affine gap function as its input. The scoring schema consists of scores for matches and mismatches. Affine gap function should be in this format:
$g(k) = \alpha + \beta * k$, for some parameter $\alpha$ and $\beta$.
$\alpha$ is called gap open cost and $\beta$ is called gap extend cost.
$k$ refers to the length of the gap.

For our example, we will use:

Sequence A: GCATGCCAT
Sequence B: CATGCATCGAC
Match Score: +2
Mismatch Score: -1
Gap open cost ($\alpha$): -2
Gap extend cost ($\beta$): -1

1. Initialization

In order to do Gotoh algorithm, we need 3 matrices/tables of size $(n+1) * (m+1)$, where $m$ is the length of sequence A and $n$ is the length of sequence B. For our convenience, let's call them Table A, Table B, and Table C.

Next, We fill assign initial values to each of the tables. Let's start with Table A.

(1) $V^A_{0, 0} = 0$

(2) $V^A_{0, j} = g(j)$

(3) $V^A_{i, 0} = g(i)$

Table B:

(1) $V^B_{0, 0} = 0$

(2) $V^B_{0, j} = -\infty$

(3) $V^B_{i, 0} = X$ (Not used)

Table C:

(1) $V^C_{0, 0} = 0$

(2) $V^C_{0, j} = X$ (Not used)

(3) $V^C_{i, 0} = -\infty$

$V^T_{i,j}$ denotes the value of the cell at position $i,j$ in table T. $i$ is the row number and $j$ is the column number (both starts from 0). $g(k)$ refers to the affine gap cost function above.

Note: We use $-\infty$ because of our example's scoring schema. The use of $\infty$ and $-\infty$ depends on the scoring schema.

The initialized tables would look like these:

Fig 1. Initialized Tables

2. Completing the Table

Now that we have the initialized table, we can start filling in values for the matrices/tables.
The formula for finding a value of a cell in table A is:

$V^A_{i, j} = \max \left\lbrace V^B_{i, j}, V^C_{i, j}, V^A_{i-1, j-1} + S(A_j, B_i)\right\rbrace$

$A_j$ refers to the $j^th$ letter of sequence A and $B_i$ refers to the $j^th$ letter of sequence B. $S(A_j, B_i)$ refers to the score of the 2 letters according to our scoring schema (match/mismatch).

The formula above tells us that we need to find out the values of the cell in Table B and C in the same position first. So, here's the formula to compute the value of a cell in Table B.

$V^B_{i, j} = \max \left\lbrace V^B_{i-1, j} + \beta, V^A_{i-1, j} + g(1) \right\rbrace$

The formula basically tells us to get the value of our top neighbour in Table A and B, add $g(1)$ and $\beta$ respectively, and choose the highest one.

The formula to compute the value of a cell in Table C:

$V^C_{i, j} = \max \left\lbrace V^C_{i, j-1} + \beta, V^A_{i, j-1} + g(1) \right\rbrace$

The formula basically tells us to get the value of our top neighbour in table A and C, add $g(1)$ and $\beta$ respectively, and choose the highest one.

Fig 2. Visualization

Note: We use $\max$ here because of our example's scoring schema. This is related to the use of $\infty$ and $-\infty$. Remember that our aim to use this algorithm is to make as many matches as possible in the alignment. That is why we assign positive value as match score and negative values as mismatch and gap. Since we want to make as many matches as possible, we would want to use $\max$. The intuition is that the higher the score, the better because that means we have more matches. Since we use $\max$ as the aggregation function, we should use $-\infty$ because if we use $\infty$, the whole table will be filled with $\infty$, which is not good since we won't be able to get our solution.
One example when we use $\min$ instead of $\max$ is when we have match score = -2, mismatch score = 1, gap open cost = 2, and gap extend cost = 1. This scoring schema is like a reverse of our example's scoring schema. We want to make as many matches as possible so the lesser the score, the better. Therefore, we use $\min$ and because we use $\min$, we should use $\infty$ instead of $-\infty$.

Fig 3. Cell 1, 1

Now, let's try to fill value for $V^A_{1, 1}$. We will need to find out values for $V^B_{1, 1}$ and $V^C_{1, 1}$ first.
$V^B_{1, 1} = \max\left\lbrace V^B_{0, 1} + -1, V^A_{0, 1} + (-2 + -1) \right\rbrace = \max\left\lbrace -\infty + -1, -3 + -3 \right\rbrace = -6$
$V^C_{1, 1} = \max\left\lbrace V^C_{1, 0} + -1, V^A_{1, 0} + (-2 + -1) \right\rbrace = \max\left\lbrace -\infty + -1, -3 + -3 \right\rbrace = -6$

Now let's fill $V^A_{1, 1}$.
$V^A_{1, 1} = \max \left\lbrace -6, -6, V^A_{0, 0} + S(A_1, B_1)\right\rbrace$
$V^A_{1, 1} = \max \left\lbrace -6, -6, 0 + S(G, C)\right\rbrace$
$V^A_{1, 1} = \max \left\lbrace -6, -6, 0 + -1\right\rbrace$
$V^A_{1, 1} = \max \left\lbrace -6, -6, -1\right\rbrace = -1$

Note: It is recommended to track where a cell's value originates from (by using arrows, for example) . Be aware that the origin could be on another table (from the formula) so the arrows might get messy.

Keep repeating until you fill out all cells in Table A (which means you also need to fill all cells of Table B and C).

The completed tables will look like these (I used -999 instead of $-\infty$ because i don't know how to put inifinity in spreadsheet. It doesn't affect the solution because -999 is low enough relatve to other values.):

Fig 4. Completed Tables

3. Traceback and Constructing the Solution

Similar to Needleman-Wunsch algorithm, we start our traceback from the buttom right corner of Table A to the top left corner of Table A (from $V^A_{n, m} to V^A_{0, 0}$). From the formulas in the previous section, we know that the value of a cell can only has a number of "origins":

A value of a cell in Table A ($V^A_{i, j}$) can only originate from

its diagonal neighbour in Table A ($V^A_{i-1, j-1}$). This case indicates that matching the 2 letters is our best action. After that, we move to the "origin" cell
a cell in Table B or C at the same position ($V^B_{i, j}$ or $V^C_{i, j}$). In this case, we do nothing (because the position stay the same) and we move to the "origin" cell.
cells containing initialized values, i.e. row 0 and column 0, will have either its top neighbour or left neighbour as its "origin".

A value of a cell in Table B ($V^B_{i, j}$) can only originate from:

its top neighbour in Table B (V^B_{i-1, j}) or in Table A (V^A_{i-1, j}). In both cases, we add a gap to the top sequence (in our case, sequence A is at the top) and match it with $i^{th}$ letter of the side sequence (in our case, sequence B is at the side) and then move to the "origin" cell

A value of a cell in Table C ($V^C_{i, j}$) can only originate from:

its left neighbour in Table C ($V^C_{i, j-1}$) or in Table A ($V^A_{i, j-1}$). In both cases, we add a gap to the side sequence and match it with $j^{th}$ letter of the top sequence and then move to the "origin" cell.

Now let's apply what we just learn to our example. Fig 4 shows the path I choose.

Fig 4. Traceback path

The first 2 arrows are diagonal so we will match the last 2 letters of both sequences.

A	T
A	C

Current position: A, 9, 7

Our next arrow points to a cell in Table B at the same position. Therefore, we do nothing.

A	T
A	C

Current position: B, 9,7

Our next arrow points to the top neighbour in Table A. Therefore, we add a gap to top sequence (sequence A) and match it with $7^th$ letter of side sequence (sequence B).

-	A	T
G	A	C

Current position: A, 8, 7

Next arrow is diagonal so we match the 2 letters.

C	-	A	T
C	G	A	C

Current position: A, 7, 6

Next arrow points to a cell in Table B at the same position so we do nothing.

C	-	A	T
C	G	A	C

Current position: B, 7, 6

Next arrow is pointing to its top neighbour so we add gap to top sequence and match it with corresponding letter of side sequence.

-	C	-	A	T
T	C	G	A	C

Current position: B, 6, 6

Next arrow is pointing to its top neighbour in Table A. Therefore, we do the same thing.

-	-	C	-	A	T
A	T	C	G	A	C

Current position: A, 5, 6

Next 5 arrows are diagonal so we will match the next 5 letters of both sequences.

C	A	T	C	G	-	-	C	-	A	T
C	A	T	C	G	A	T	C	G	A	C

Current position: A, 0, 1

Last arrow are horizontal arrow so we will add gap to the side sequence and match it with corresponding letter of top sequence.

G	C	A	T	C	G	-	-	C	-	A	T
-	C	A	T	C	G	A	T	C	G	A	C

Current position: A, 0, 0

And we are dont with the algortihm! You have your solution with score of 3 as indicated by the bottom, right corner of Table A.

You can find the solution in spreadsheet here.

AnyThing Tutorials

Saturday, April 9, 2016

Bioinformatics: Gotoh Algorithm

1. Initialization

2. Completing the Table

3. Traceback and Constructing the Solution

1 comment: