|Download FLAK 1.0|
Native Approximate Sequence AlignmentFLAK provides native support for approximate string matching. This capability arises from the exploitation of a fuzzy hash map to store a reference genome, where approximately matching k-mer sequences are grouped together into fuzzy sets, enabling an average O(1) time complexity for insertion, deletion and search operations. In classical set theory, membership of a crisp set A of X is defined as a function ƒA(x), called the characteristic function of A:
Membership of a classical set is therefore predicated on a boolean characteristic function, requiring an exact match to align two subsequences. However, a fuzzy set A of X can be defined by a function μA(x), called the membership function of A, which spans the continuum of real numbers in the interval 0 . . . 1:
Using this approach, approximately matching k-mer subsequences, 32-mers in the FLAK system, can be grouped together and accessed rapidly in a hash map.
The Fuzzy Membership Function μA(x) in FLAKThe membership function μA(x) is implemented using an optimised variation of the Levenshtein distance algorithm. For a query sequence S and a reference sequence T, both of size k, the membership value returned by the algorithm is μA(x) = 1 − levenshtein(S, T)/ k . Although, in common with other approximate string-matching algorithms based on dynamic programming, the Levenshtein distance has a space and time complexity of O(n2), for each 32-mer search the algorithm is only applied to the fuzzy sets in the matching bucket of a hash map. A table of the possible Levenshtein distances for a 32-mer and their corresponding fuzzy set membership values is shown below.
β Cut-off ThresholdFuzzy hash maps implement the μA(x) membership function with a minimal membership constraint, called a β cut-off threshold, that excludes weak sequences from a fuzzy set. Consequently, each matching k-mer in a fuzzy set has a membership value μA(x) | β < μA(x) < 1. The β cut-off threshold controls the specificity of an alignment and is chosen by the user of FLAK before comparing two genomes. The β threshold is also used by FLAK to terminate the execution of the Levenshtein distance algorithm when the threshold cannot possibly be reached. This contributes significantly to the overall speed of the FLAK.