<p>Masoumeh Haghpanahi | http://mashagh.github.io | 2015-01-31T16:54:44+00:00</p>
<p>Biomedical data fusion via sequence alignment—Part 2 | 2015-01-21T00:00:00+00:00 | http://mashagh.github.io/seq-alignment2</p>
<style>
.ps-code {
background-color: ivory;
padding: 10px 20px;
font-family: monospace, serif;
}
</style>
<p>This is the second part of my discussion on comparing and merging multiple time-series data sequences together. In <a href="/sequence-alignment-part-one">Part 1</a>, I briefly discussed how we can use Dynamic Programming to align two data sequences. In this post, I will describe a fast algorithm to extend sequence alignment to more than two sequences.</p>
<p>It’s easy to show that the time complexity of the <a href="http://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm">Needleman-Wunsch</a> algorithm applied to two sequences of approximately the same length <script type="math/tex">\bar{L}</script> is <script type="math/tex">O(\bar{L}^2)</script>. The algorithm can be extended directly to align <script type="math/tex">N>2</script> sequences simultaneously, but this comes at a huge cost: an exponential time complexity of <script type="math/tex">O(2^N \bar{L}^N)</script>.</p>
<p>So what can we do? This is another instance where an appropriate <a href="http://en.wikipedia.org/wiki/Data_structure"><strong>data structure</strong></a> helps <strong>algorithm design</strong> turn an impractical computation into a feasible one. Let me take a quick detour and briefly describe the capabilities of this data structure first. If you are not familiar with data structures and their purposes in general, you should familiarize yourself with the concept.</p>
<p>Union-find is a data structure that keeps track of a partition of a set S of size <script type="math/tex">n</script> into disjoint groups, and supports fast implementations of the following operations:</p>
<ul>
<li><code>find(x)</code>: returns the name of the group that contains element <script type="math/tex">x</script> in <script type="math/tex">O(\log n)</script>,</li>
<li><code>union(G1,G2)</code>: merges two groups G1 and G2 in <script type="math/tex">O(1)</script>, and</li>
<li><code>makeUnionFind(S)</code>: initializes the data structure with each element of set S in a separate group in <script type="math/tex">O(n)</script>.</li>
</ul>
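<p>To make the three operations concrete, here is a minimal Python sketch. The class name <code>SimpleUnionFind</code> and its method names are placeholders of my own, not the API of any particular library. It assumes union by size, which is one common way to obtain a logarithmic <code>find</code> and a constant-time <code>union</code> once the two group names are known.</p>

```python
class SimpleUnionFind:
    """Illustrative union-find with union by size (all names are hypothetical)."""

    def __init__(self, elements):
        # makeUnionFind(S): every element starts as the root of its own group.
        self.parent = {x: x for x in elements}
        self.size = {x: 1 for x in elements}

    def find(self, x):
        # Follow parent pointers up to the root, which serves as the group name.
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, g1, g2):
        # g1 and g2 must be group names, i.e. roots returned by find().
        if g1 == g2:
            return g1
        if self.size[g1] < self.size[g2]:
            g1, g2 = g2, g1
        # Attach the smaller group under the larger one to keep the trees shallow.
        self.parent[g2] = g1
        self.size[g1] += self.size[g2]
        return g1


uf = SimpleUnionFind(["a", "b", "c", "d"])
uf.union(uf.find("a"), uf.find("b"))
uf.union(uf.find("c"), uf.find("a"))
same_group = uf.find("b") == uf.find("c")
separate = uf.find("d") != uf.find("a")
```

<p>Because the smaller group is always attached under the larger one, no element can sit more than about <script type="math/tex">\log_2 n</script> pointers away from its root, which is where the logarithmic bound on <code>find</code> comes from.</p>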
<p>Back to the problem at hand, now that we know the properties of union-find, we use pairwise sequence alignment (between every distinct pair of sequences), and use a <strong>union-find</strong> data structure to keep track of the alignments. I’ll discuss later in this post why using this technique will hugely decrease the time complexity of multiple sequence alignment.</p>
<p>So here is what I’m going to do: I’ll first describe the algorithm in pseudocode, and then provide its Python implementation for those who would like to try it out for themselves. We need a little extra notation for the pseudocode. Let <script type="math/tex">X^{(i)}, i = 1, \ldots, N</script> denote the set of <script type="math/tex">N</script> data sequences, each containing <span><script type="math/tex">|X^{(i)}|</script></span> elements (elements being time points, in the case of time-series data). Also let <script type="math/tex">X^{(i)}_j</script> refer to the <script type="math/tex">j</script>th element of the <script type="math/tex">i</script>th sequence.</p>
<p><img src="/images/2015-01-21-seq-alignment2/pseudocode.png" alt="pseudocode" style="width:100%" /></p>
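<p>At first, the union-find data structure is initialized with each element of each sequence in its own separate group. Next, every distinct pair of data sequences is aligned, and their matched elements are put in the same group. At the end of the pairwise sequence alignments, the union-find data structure contains the alignment information. Using the time complexities of pairwise sequence alignment and the different union-find operations, it is easy to show that the time complexity of multiple sequence alignment using a union-find data structure is <script type="math/tex">O(N\bar{L}+ N^2\bar{L}\log(N\bar{L}) + N^2\bar{L}^{2})</script>. The first term is the initialization over all <script type="math/tex">N\bar{L}</script> elements, the second covers the <code>find</code> operations for up to <script type="math/tex">\bar{L}</script> matched pairs in each of the <script type="math/tex">O(N^2)</script> pairwise alignments, and the third is the cost of the pairwise alignments themselves. This is polynomial in both <script type="math/tex">N</script> and <script type="math/tex">\bar{L}</script>, a huge gain over the exponential time complexity described at the beginning of this post.</p>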
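<p>To get a feel for the size of this gain, the sketch below plugs numbers into the two complexity expressions. It ignores constant factors, so the values are only rough operation counts, and the choice of 5 sequences of length 100 is arbitrary. The function names are my own.</p>

```python
import math


def direct_cost(n_seqs, avg_len):
    # Direct N-way dynamic programming: O(2^N * L^N).
    return (2 ** n_seqs) * (avg_len ** n_seqs)


def union_find_cost(n_seqs, avg_len):
    # Pairwise alignments tracked with union-find:
    # O(N*L + N^2 * L * log(N*L) + N^2 * L^2).
    n, length = n_seqs, avg_len
    return n * length + n**2 * length * math.log2(n * length) + n**2 * length**2


n_seqs, avg_len = 5, 100  # arbitrary illustrative values
ratio = direct_cost(n_seqs, avg_len) / union_find_cost(n_seqs, avg_len)
print(f"direct: {direct_cost(n_seqs, avg_len):.2e}")
print(f"union-find: {union_find_cost(n_seqs, avg_len):.2e}")
print(f"ratio: {ratio:.2e}")
```

<p>Even for these modest sizes the ratio exceeds a million, and it grows quickly as more sequences are added.</p>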
<p>With the above description, it should be fairly easy to go through the Python implementation and find correspondences with the pseudocode. I have used <a href="https://gist.github.com/saran87/4751455">this</a> Python implementation of a union-find data structure. </p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from Classes import Node, UnionFind

# seqAlign (pairwise alignment, see Part 1) and matchUnion (defined below)
# must also be in scope.

def seqVoting(seq_list, tol):
    """Align every pair of sequences and record the matches in a union-find.

    tol is the matching tolerance passed through to seqAlign.
    """
    # Create the node list: one node per (sequence id, element)
    node_list = UnionFind()
    for i in range(len(seq_list)):
        for item in seq_list[i]:
            n = Node((i, item))
            node_list.addnode(n)
    # Align every distinct pair of sequences and merge their matched elements
    for i in range(len(seq_list) - 1):
        for j in range(i + 1, len(seq_list)):
            seq1_aligned, seq2_aligned = seqAlign(seq_list[i],
                                                  seq_list[j], tol)
            matchUnion(seq1_aligned, seq2_aligned, i, j, node_list)
    return node_list</code></pre></div>
<p><code>seqAlign</code> is the code for pairwise sequence alignment described in the previous post. <code>matchUnion</code>, which transfers the alignment information to the union-find data structure, has the following implementation:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def matchUnion(seq1_aligned, seq2_aligned, id1, id2, node_list):
    # Positions where either aligned sequence has a gap (NaN)
    nan_ind = np.r_[np.where(np.isnan(seq1_aligned))[0],
                    np.where(np.isnan(seq2_aligned))[0]]
    # Positions where both sequences have a value, i.e. matched elements
    non_nan_ind = [i for i in range(len(seq1_aligned))
                   if i not in nan_ind]
    for i in non_nan_ind:
        n1 = node_list.getNode((id1, seq1_aligned[i]))
        n2 = node_list.getNode((id2, seq2_aligned[i]))
        # Merge the two groups unless they are already the same group
        if node_list.findset(n1) != node_list.findset(n2):
            node_list.union(n2, n1)</code></pre></div>
<p>In summary, the algorithms described in the last two posts provide an efficient way to merge information of multiple time-series data. It should be evident by now that <strong>aligning</strong> two data sequences is an important first step for any data merging/fusion algorithm. Once data sequences are aligned, different deterministic or probabilistic merging algorithms (such as majority voting, mean or median data fusion, or Bayesian voting) can be applied to the aligned sequences to further analyze the data. </p>
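<p>As one possible illustration of the merging step, the sketch below combines majority voting with median fusion. It assumes the aligned groups have already been read out of the union-find structure as lists of <code>(sequence id, time point)</code> pairs. The function name <code>fuse_groups</code> and the sample data are hypothetical.</p>

```python
from statistics import median


def fuse_groups(groups, n_seqs):
    """Keep groups backed by a strict majority of sequences; fuse each by median."""
    fused = []
    for members in groups:
        # Count the distinct sequences that contributed to this group.
        voters = {seq_id for seq_id, _ in members}
        if len(voters) > n_seqs / 2:
            fused.append(median(t for _, t in members))
    return sorted(fused)


# Hypothetical aligned groups from three sequences: (sequence id, time point).
groups = [
    [(0, 1.00), (1, 1.02), (2, 0.99)],  # reported by all three sequences
    [(0, 2.50), (2, 2.54)],             # reported by two of three
    [(1, 3.70)],                        # reported by one only, voted out
]
fused = fuse_groups(groups, n_seqs=3)
```

<p>Here the third group is dropped because only one of the three sequences reports it, and each remaining group is replaced by the median of its time points.</p>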
<p>There are numerous applications for this technique, many of which I am not aware of, but you can find one application for this algorithm in <a href="https://www.researchgate.net/publication/270658970_Scoring_consensus_of_multiple_ECG_annotators_by_optimal_sequence_alignment">this</a> paper, which shows the importance of sequence alignment for comparing ECG delineation results with their corresponding gold standards. </p>
<p>Biomedical data fusion via sequence alignment—Part 1 | 2015-01-20T00:00:00+00:00 | http://mashagh.github.io/seq-alignment-1</p>
<p>A recurring need in bioengineering is to merge data from multiple sources of information. This can be data from different sensors monitoring the same event, or data produced by applying different methodologies to extract features from a biological signal. Either way, we have a set of time-series data, and we want to assess how similar the sequences are and merge their information. An important step in measuring the similarity between data sequences is to draw a correspondence between their elements, or in other words, to <strong>align</strong> the sequences with each other. In this and the upcoming post, I will discuss a simple method that can be used on its own <em>or</em> before data fusion techniques to significantly improve the results of your data merging pipeline.</p>
<p>The idea is very simple and comes from DNA sequence alignment, where different DNA strands are compared to identify regions of similarity between them. The underlying algorithm for sequence alignment defines a set of penalties for the occurrence of a mismatch, a gap, or a match at corresponding positions of the aligned sequences. This is much easier to explain with an example. Consider two DNA sequences “AGGGGGCT” and “AGGGCA”; one possible alignment between the two strands is:</p>
<p><img src="/images/2015-01-20-seq-alignment-1/seq_exp.png" alt="Alignment example" style="width:300px" /></p>
<p>The aligned sequences have 5 matches, 1 mismatch and 2 gaps (nucleotides in one sequence that are not matched with any nucleotide in the other sequence). If we set the penalties for mismatches, gaps and matches to <script type="math/tex">\sigma_{\textrm{mis}} = +2</script>, <script type="math/tex">\sigma_{\textrm{gap}} = +1</script>, and <script type="math/tex">\sigma_{\textrm{match}} = 0</script>, then the total <strong>cost</strong> of aligning these two sequences is <script type="math/tex">5\times 0 + 1 \times 2 + 2\times 1 = 4</script>.</p>
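<p>The cost computation can be sketched in a few lines of Python. The function name <code>alignment_cost</code> is my own, and the gapped rows below are one alignment that is consistent with the counts above (5 matches, 1 mismatch, 2 gaps), which is not necessarily the exact one drawn in the figure.</p>

```python
def alignment_cost(row1, row2, mismatch=2, gap=1, match=0):
    """Total penalty of an already aligned pair of rows ('-' marks a gap)."""
    cost = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            cost += gap
        elif a == b:
            cost += match
        else:
            cost += mismatch
    return cost


# A hypothetical alignment with 5 matches, 1 mismatch (T vs A) and 2 gaps.
cost = alignment_cost("AGGGGGCT", "AGGG--CA")  # 5*0 + 1*2 + 2*1 = 4
```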
<p>Once penalties are set, <a href="http://en.wikipedia.org/wiki/Dynamic_programming">Dynamic Programming</a> is used to find the <em>optimal</em> sequence alignment that minimizes the total cost.</p>
<p>At this point, you might be wondering how we can apply this technique to align two time series. The answer lies in an appropriate definition of the penalties. In most applications dealing with time, two measurements are considered matched if they are close enough to each other, i.e., if their time difference is less than a threshold <script type="math/tex">tol</script>. Using the threshold parameter (which can be defined and set based on the application), we choose the penalties to be <span><script type="math/tex">\sigma(t_1,t_2) = \frac{|t_1-t_2|}{tol/2}</script></span> and <script type="math/tex">\sigma_{\textrm{gap}} = 1</script>. In other words, any two matched time points contribute to the total cost with a weight proportional to their time difference, and any time point reported in only one sequence and missed by the other contributes a weight of +1. With this choice, pairing two time points is cheaper than leaving both unmatched (two gaps, for a cost of 2) exactly when their time difference is less than <script type="math/tex">tol</script>. Notice how <script type="math/tex">\sigma(t_1,t_2)</script> contains both the <script type="math/tex">\sigma_{\textrm{match}}</script> and <script type="math/tex">\sigma_{\textrm{mis}}</script> definitions within itself.</p>
<p>With this penalty function we can use the same Dynamic Programming approach that is used for DNA sequence alignment (sometimes referred to as the <a href="http://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm">Needleman-Wunsch</a> algorithm) to align two time series. The pseudocode on the <a href="http://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm">Wikipedia page</a> describing Needleman-Wunsch is detailed enough to implement the algorithm in your favorite programming language.</p>
<p>Using Dynamic Programming to compare two sequences might seem like overkill, but the power of this technique becomes more obvious when comparing more than two sequences. You might have simple heuristics for comparing a sequence with its corresponding gold standard, for example, but I’m pretty sure you’ll agree that those heuristics won’t work, or will have very large time complexities, when applied to more than two sequences. That is why I strongly recommend that you read my next post on multiple sequence alignment as well.</p>
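<p>The penalty definition translates almost directly into code. In the sketch below the function name <code>match_penalty</code>, the tolerance of 0.1 and the sample time points are illustrative assumptions. It compares the cost of pairing two time points with the cost of leaving both unmatched.</p>

```python
def match_penalty(t1, t2, tol):
    # sigma(t1, t2) = |t1 - t2| / (tol / 2)
    return abs(t1 - t2) / (tol / 2)


GAP_PENALTY = 1  # sigma_gap

tol = 0.1  # hypothetical tolerance, e.g. in seconds

# Two time points 0.04 apart (closer than tol) and two that are 0.15 apart.
close = match_penalty(1.00, 1.04, tol)
far = match_penalty(1.00, 1.15, tol)

# Pairing is preferred only if it costs less than two gaps.
prefer_match_close = close < 2 * GAP_PENALTY
prefer_match_far = far < 2 * GAP_PENALTY
```

<p>The first pair is cheaper to match than to leave as two gaps, while the second is not, so an optimal alignment would report the second pair as two unmatched time points.</p>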
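<p>For readers who prefer code to pseudocode, here is a minimal sketch of a Needleman-Wunsch style alignment with the time-based penalty. It is not the implementation used in these posts: the function name <code>align_times</code>, the use of NaN to mark gaps in the output, and the sample data are assumptions made for illustration.</p>

```python
import math


def align_times(seq1, seq2, tol, gap=1.0):
    """Globally align two sorted lists of time points; gaps are returned as NaN."""

    def sigma(a, b):
        # Penalty for pairing two time points.
        return abs(a - b) / (tol / 2)

    n, m = len(seq1), len(seq2)
    # cost[i][j] = minimum cost of aligning seq1[:i] with seq2[:j].
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sigma(seq1[i - 1], seq2[j - 1]),
                cost[i - 1][j] + gap,
                cost[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + sigma(seq1[i - 1], seq2[j - 1])):
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or cost[i][j] == cost[i - 1][j] + gap):
            out1.append(seq1[i - 1])
            out2.append(math.nan)
            i -= 1
        else:
            out1.append(math.nan)
            out2.append(seq2[j - 1])
            j -= 1
    return out1[::-1], out2[::-1], cost[n][m]


# Hypothetical time points reported by two sources.
a1, a2, total = align_times([1.00, 2.00, 3.00, 4.00],
                            [1.02, 2.97, 4.05, 5.00], tol=0.1)
```

<p>In this example the time point 2.00 in the first sequence and the time point 5.00 in the second have no counterpart within the tolerance, so each is paired with a gap, while the other three time points are matched.</p>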