Flak 1.0 - Ultra-fast Fuzzy Whole-genome Alignment
Flak 1.0 - Ultra-fast Fuzzy Whole-genome Alignment   Download FLAK 1.0


Support FLAK!


Native Approximate Sequence Alignment

FLAK provides native support for approximate string matching. This capability arises from the exploitation of a fuzzy hash map to store a reference genome, where approximately matching k-mer sequences are grouped together into fuzzy sets, enabling an average O(1) time complexity for insertion, deletion and search operations. In classical set theory, membership of a crisp set A of X is defined as a function ƒA(x), called the characteristic function of A:

Characteristic Function

Membership of a classical set is therefore predicated on a boolean characteristic function, requiring an exact match to align two subsequences. However, a fuzzy set A of X can be defined by a function μA(x), called the membership function of A, which spans the continuum of real numbers in the interval 0 . . . 1:

Fuzzy Function

Using this approach, approximately matching k-mer subsequences, 32-mers in the FLAK system, can be grouped together and accessed rapidly in a hash map.


The Fuzzy Membership Function μA(x) in FLAK

The membership function μA(x) is implemented using an optimised variation of the Levenshtein distance algorithm. For a query sequence S and a reference sequence T, both of size k, the membership value returned by the algorithm is μA(x) = 1 − levenshtein(S, T)/ k . Although, in common with other approximate string-matching algorithms based on dynamic programming, the Levenshtein distance has a space and time complexity of O(n2), for each 32-mer search the algorithm is only applied to the fuzzy sets in the matching bucket of a hash map. A table of the possible Levenshtein distances for a 32-mer and their corresponding fuzzy set membership values is shown below.


Edit DistanceμA(x)Edit DistanceμA(x)Edit DistanceμA(x)
Edit distances and their corresponding fuzzy membership values.
         Fuzzy Membership Function
Fuzzy membership values v/s Levenshtein Edit Distance.


β Cut-off Threshold

Fuzzy hash maps implement the μA(x) membership function with a minimal membership constraint, called a β cut-off threshold, that excludes weak sequences from a fuzzy set. Consequently, each matching k-mer in a fuzzy set has a membership value μA(x) | β < μA(x) < 1. The β cut-off threshold controls the specificity of an alignment and is chosen by the user of FLAK before comparing two genomes. The β threshold is also used by FLAK to terminate the execution of the Levenshtein distance algorithm when the threshold cannot possibly be reached. This contributes significantly to the overall speed of the FLAK.


Beta Cutoff Threshold
         Specifying Beta


As depicted in the figure above, setting a β cut-off threshold of 0.72 will exclude all query 32-mers alignments that have an edit distance greater than 9 from a matching seed in the reference database. For example, consider the 11/18 spaced seed using by PatternHunter: ###-#--#-#--##-###. As FLAK processes sequences in 32-mer chunks, this is implemented as the 11/32 spaced seed ###-#--#-#--##-###--------------, where hash (#) positions require a match and don't care positions (-) are extended from 11 to 32 positions. If a β cut-off threshold of 0.00 and a minimum match length of 32 were used, FLAK would return all 32-mers with matching hash positions. As the β cut-off threshold is increased, it has the effect of increasing the specificity of a 32-match, with a value of 0.97 excluding any 32-mer matches with an edit distance greater than 1. FLAK supports consecutive and spaced seeds and allows any custom seed of size ≤32 to be used. See Custom Alignment Seeds for details on how seeds are processed by FLAK.




© 2016.  FLAK (Fuzzy Logic Analysis of k-mers) Version 1.0     
Flak 1.0 - Ultra-fast Fuzzy Whole-genome Alignment