imported>SchlurcherBot: Bot: http → https

2025-12-28T00:41:53Z

Bot: http → https

Neue Seite

Der '''Gotoh-Algorithmus''' berechnet das [[Sequenzalignment|Alignment]] von zwei Sequenzen bei affinen Gapkosten. Er verwendet das Programmierparadigma der [[Dynamische Programmierung|dynamischen Programmierung]].

== Affine Gapkosten ==
Mit affinen Gapkosten bezeichnet man Kosten für [[Gap (Bioinformatik)|Gaps]] in Alignments, welche sich durch eine [[lineare Funktion]] <math>g(l) = \text{gap}\_\text{start} + (l - 1) \times \text{gap}\_\text{extend}</math> beschreiben lassen. Dabei ist <math>l</math> die Länge des Gaps. <math>\text{gap}\_\text{start}</math> sind die Kosten für den Start eines neuen Gap, und <math>\text{gap}\_\text{extend}</math> sind die Kosten für die Erweiterung eines existierenden Gaps um eine Stelle<ref>[[Stephen F. Altschul & Bruce W. Erickson]]: ''Optimal sequence alignment using affine gap costs'' Bulletin of Mathematical Biology volume 48, 603–616 (1986), Springer, 1986</ref>.

Biologisch können affine Gapkosten mehr Sinn als rein lineare Gapkosten ergeben, da man eine Insertion bzw. Deletion von mehreren Zeichen in bestimmten Zusammenhängen als wahrscheinlicher betrachten möchte als eine Insertion oder Deletion von einem einzigen Zeichen, z. B. beim Alignieren von [[cDNA]] gegen Genom-[[DNA]] (Gusfield, 1999).

== Matrix-Rekurrenzen ==
Spezifikation des Algorithmus durch [[Matrix (Mathematik)|Matrix]]-Rekurrenzen:

'''Initialisierung:'''

A[0, 0] = 0

B[0, 0] = 0

C[0, 0] = 0

A[i, 0] = C[i, 0] = -inf, 1 <= i <= m

A[0, j] = B[0, j] = -inf, 1 <= j <= n

B[i, 0] = g(i), 1 <= i <= m

C[0, j] = g(j), 1 <= i <= n

'''Rekursion:'''

<math>A(i,j) = \max \begin{Bmatrix}
A(i-1, j-1) + w(u[i - 1], v[j - 1]) \\
B(i-1, j-1) + w(u[i - 1], v[j - 1]) \\
C(i-1, j-1) + w(u[i - 1], v[j - 1] \\
\end{Bmatrix},\; 1\le i\le m,\; 1\le j\le n
</math>

<math>B(i,j) = \max \begin{Bmatrix}
A(i - 1, j)+\text{gap}\_\text{start} \\
B(i - 1, j)+\text{gap}\_\text{extend} \\
C(i - 1, j)+\text{gap}\_\text{start}\\
\end{Bmatrix},\; 1\le i\le m,\; 1\le j\le n
</math>

<math>C(i,j) = \max \begin{Bmatrix}
A(i, j - 1)+\text{gap}\_\text{start} \\
B(i, j - 1)+\text{gap}\_\text{start} \\
C(i, j - 1)+\text{gap}\_\text{extend} \\
\end{Bmatrix},\; 1\le i\le m,\; 1\le j\le n
</math>

* u,v – Sequenzen über einem Alphabet <math>\Sigma</math>
* m = Länge(u), n = Länge(v)
* w(a, b),<math>a, b\in\Sigma</math> - Ähnlichkeits-Funktion
* gap_start – Kosten für den Beginn eines Gaps
* gap_extend – Kosten für die Erweiterung eines Gaps
* g(l) - affine Gap-Kosten Funktion <math>g(l) = \text{gap}\_\text{start} + (l - 1)\times\text{gap}\_\text{extend}</math>
* -inf = <math>-\infty
</math>
* A(i,j) – der optimale Similarity Score des optimalen Alignment des Präfixes u[1..i] von u und des Präfixes v[1..j] von v unter affinen Gapkosten
* B(i,j) – der optimale Similarity Score des optimalen Alignment des Präfixes u[1..i] von u und des Präfixes v[1..j] von v, welches mit einem Gap in u endet
* C(i,j) – der optimale Similarity Score des optimalen Alignment des Präfixes u[1..i] von u und des Präfixes v[1..j] von v, welches mit einem Gap in v endet
Der optimale Score des Alignments ist das Maximum von: A[m, n], B[m, n], C[m, n].

Implementation in Pseudo-Code:
<syntaxhighlight lang="csharp" line="1">
//Initialisierung der Matrizen
a[0, 0] = 0;
b[0, 0] = 0;
c[0, 0] = 0;

FOR i = 1; TO i <= m; i++;
a[i, 0] = c[i, 0] = -inf;
b[i, 0] = g(i);

FOR j = 1; TO j <= n; j++;
a[0, j] = b[0, j] = -inf;
c[0, j] = g(j);

//Berechnen der restlichen Matrix
FOR i = 1; TO i <= m; i++;
FOR j = 1; TO j <= n; j++;
a[i, j] = get_max(a[i - 1, j - 1],
b[i - 1, j - 1],
c[i - 1, j - 1]) +
match_or_mismatch(u[i - 1], v[j - 1]);
b[i, j] = get_max(a[i - 1, j] + gap_start,
b[i - 1, j] + gap_extend,
c[i - 1, j] + gap_start);
c[i, j] = get_max(a[i, j - 1] + gap_start,
b[i, j - 1] + gap_start,
c[i, j - 1] + gap_extend);
//Speichere Ergebnis
result = get_max(a[m, n], b[m, n], c[m, n]);

function g(l)
return gap_start + (l - 1) * gap_extend;

function match_or_mismatch(x, y)
if( x == y)
return match_score;
else
return mismatch_score;
</syntaxhighlight>

Implementation in C# (ohne Fehlerbehandlung):
<syntaxhighlight lang="csharp" line="1">
using System;
using System.IO;
using System.Linq;

namespace DnaAlignment
{
class Program
{
const int GAP_PENALTY_START = -10;
const int GAP_PENALTY_EXTEND = -0.5;
const int MIN_INF = int.MinValue + 100; //Pseudo-Unendlich
const int MATCH = 3;
const int MISMATCH = -3;
static void Main(string[] args)
{
using (StreamReader reader = File.OpenText(args[0]))
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
if (null == line)
continue;
string[] sSplit = line.Split('|');
if (sSplit.Count() != 2)
continue;
string seqA = sSplit[0].Trim(' ');
string seqB = sSplit[1].Trim(' ');

//Initsialisiere Matrizen
int[,] a = new int[seqA.Length + 1, seqB.Length + 1];
int[,] b = new int[seqA.Length + 1, seqB.Length + 1];
int[,] c = new int[seqA.Length + 1, seqB.Length + 1];
a[0, 0] = 0;
b[0, 0] = 0;
c[0, 0] = 0;
for (int i = 1; i <= seqA.Length; i++)
{
a[i, 0] = c[i, 0] = MIN_INF;
b[i, 0] = G(i);
}
for (int j = 1; j <= seqB.Length; j++)
{
a[0, j] = b[0, j] = MIN_INF;
c[0, j] = G(j);
}
//Berechne Matrizen
for (int i = 1; i <= seqA.Length; i++)
{
for (int j = 1; j <= seqB.Length; j++)
{
a[i, j] = Get_Max_Value(
a[i - 1, j - 1],
b[i - 1, j - 1],
c[i - 1, j - 1]) +
Match_Or_Mismatch(seqA[i - 1], seqB[j - 1]);
b[i, j] = Get_Max_Value(
a[i - 1, j] + GAP_PENALTY_START,
b[i - 1, j] + GAP_PENALTY_EXTEND,
c[i - 1, j] + GAP_PENALTY_START);
c[i, j] = Get_Max_Value(
a[i, j - 1] + GAP_PENALTY_START,
b[i, j - 1] + GAP_PENALTY_START,
c[i, j - 1] + GAP_PENALTY_EXTEND);
}
}
Console.WriteLine(Get_Max_Value(a[seqA.Length, seqB.Length],
b[seqA.Length, seqB.Length],
c[seqA.Length, seqB.Length]));
}
}
private static int G(int index)
{
return GAP_PENALTY_START + (index - 1) * GAP_PENALTY_EXTEND;
}
private static int Get_Max_Value(int a, int b, int c)
{
return Math.Max(Math.Max(a, b), c);
}
private static int Match_Or_Mismatch(char a, char b)
{
return a == b ? MATCH : MISMATCH;
}
}
}
</syntaxhighlight>

== Effizienz ==
Die Laufzeit und der Speicherplatzbedarf des Algorithmus liegen in [[Landau-Symbole|O]](nm). Dies ist eine Verbesserung zum [[Needleman-Wunsch-Algorithmus]], welcher für beliebige Gapkosten eine Laufzeit von <math>O(nm^2+n^2m)</math> hat.
Im schlimmsten Fall ist die Matrix quadratisch (<math>n=m</math>), was beim Needleman-Wunsch-Algorithmus zu einer kubischen Laufzeit von <math>O(n^3 )</math> führt. Durch den Gotoh-Algorithmus mit seinen linearen Gap-Kosten verringert sich die Komplexität auf <math>O(n^2 )</math>.

Wenn nur der Score des optimalen Alignments berechnet werden soll, können alle Matrizen auch spalten- bzw. zeilenweise berechnet werden, da jeder Eintrag nur von der vorherigen Spalte bzw. Zeile abhängt. Also sinkt der Speicherplatzbedarf auf <math>O(m+n)</math>.

Der Gotoh-Algorithmus kann auch mit der [[Teile und herrsche (Informatik)|Divide-and-Conquer]]-Methode implementiert werden, so dass das optimale Alignment mit linearem Platzbedarf berechnet werden kann. Siehe [[Hirschberg-Algorithmus]].

== Literatur ==
* {{Literatur
|Autor=O. Gotoh
|Titel=An improved algorithm for matching biological sequences
|Sammelwerk=[[Journal of Molecular Biology]]
|Band=162
|Datum=1982
|Seiten=705–708
|Online=[https://jaligner.sourceforge.net/references/gotoh1982.pdf jaligner.sourceforge.net]
|Format=PDF
|KBytes=206}}
* {{Literatur
|Autor=Eugene W. Myers und Webb Miller
|Titel=Optimal alignments in linear space
|Sammelwerk=Bioinformatics
|Band=4
|Nummer=1
|Datum=1988
|Seiten=11–17
|Kommentar=Anwendung der linearen Speicherplatz Technik des [[Hirschberg-Algorithmus]] auf den Gotoh-Algorithmus
|Online=[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.123.327&rep=rep1&type=pdf citeseerx.ist.psu.edu]
|Format=PDF
|KBytes=
|DOI=10.1093/bioinformatics/4.1.11}}
* J. Stoye: [https://www.techfak.uni-bielefeld.de/~stoye/dropbox/preprint102.pdf Divide-and-Conquer Multiple Sequence Alignment] (PDF; 938 kB). Dissertation Thesis. Universität Bielefeld, Forschungsbericht der Technischen Fakultät, Abteilung Informationstechnik, Report 97-02, S. 26–27, 1997. {{ISSN|0946-7831}}
* D. Gusfield: ''Algorithms on Strings, Trees and Sequences.'' 11.8, 1999, ISBN 0-521-58519-8, S. 235–244.
* Saad Mneimneh: ''Computational Biology Lecture.'' 6: ''Affine gap penalty function, multiple sequence alignment.''<ref>{{Internetquelle |url=https://www.cs.hunter.cuny.edu/~saad/courses/compbio/lectures/lecture6.pdf |titel=Affine gap penalty function, multiple sequence alignment |autor=Saad Mneimneh |hrsg= |werk= |datum= |sprache=en |zugriff=2015-08-15 |format=PDF}}</ref>

== Weblinks ==
* [http://bibiserv.techfak.uni-bielefeld.de/cgi-bin/adp_AffineGlobSim bibiserv.techfak.uni-bielefeld.de] [[Common Gateway Interface|CGI]]-[[Skriptsprache|Script]] zur Berechnung von globalen Alignments mit affinen Gapkosten an der Universität Bielefeld

== Einzelnachweise ==
<references />

[[Kategorie:Bioinformatik]]
[[Kategorie:Optimierungsalgorithmus]]
[[Kategorie:Dynamische Programmierung]]

Gotoh-Algorithmus - Versionsgeschichte

imported>SchlurcherBot: Bot: http → https