Crisp v Fuzzy Seeds (Fuzzy k-mers)
FLAK permits the parametrisation of an alignment with any type of seed with a seed weight ≥8 and a length ≤32. This includes both consecutive and spaced-seeds. FLAK provides a selection of consecutive and spaced-seeds based on those recommended by Choi et al* and by Mac and Benson**.
In contrast with other genome aligners, which are based on exact-matching data structures like suffix trees or use the seed-and-extend model with consecutive or spaced seeds (crisp seeds), FLAK uses fuzzy seeds (fuzzy k-mers) to accommodate an approximate k-mer match.
Approximate or fuzzy matching enables FLAK to detect alignments in the presence of polymorphisms and indels (insertions and deletions) that are common in DNA sequences and reduce the sensitivity of exact-matching approaches.
FLAK processes sequences in chunks of 32 bases (32-mers) and uses a fuzzy hash map to provide the speed required for an approximate 32-mer match of a large amount of sequence data. Approximate does not mean inaccurate! As shown below, the fuzzy seeds used by FLAK are equally sensitive, but more specific than crisp consecutive or spaced seeds.
Consider the crisp spaced seed 111010010100110111, with a length of 18 and a seed weight of 11, used by PatternHunter***. The positions in the seed denoted by 1s are must-match indices, with Os indicating don't-care indices. This seed is defined in FLAK as ###-#--#-#--##-###, with a hash character (#) representing a must-match index and dashes (-) denoting don't-care positions. FLAK will execute a 32-mer alignment using this seed in a two-step operation:
- Hashing: An integer value is computed from the hash positions of the must-match indices in the seed. E.g., the 32-mer, S, with the sequence TATACCGGATATGACGTTACGGATAGGCCAAA will be processed as follows:
TATACCGGATATGACGTTACGGATAGGCCAAA Query 32-mer
--------------###-#--#-#--##-### Fuzzy Seed
--------------CGT-A--G-T--GC-AAA ⇒ hash(S) = hash integer
The generated hash integer is then used to identify the search bucket in a fuzzy hash map using the standard modulus method. If a seed has a length <32, FLAK will create a prefix padding of don't-care positions to extend the seed length to 32.
- Approximate 32-mer Comparison:
If the search bucket in the fuzzy hash map has one or more entries, a dynamic programming algorithm will compute the Levenshtein edit distance between candidate matches (see the section Native Approximate Sequence Alignment). Any matches at or above a user specified fuzzy threshold, the β cut-off, will be processed. For example, comparing the query 32-mer TATACCGGATATGACGTTACGGATAGGCCAAA against the reference 32-mer GGTACCGGATATGACGTAAAAGTTCCGCGAAA in the fuzzy hash map will produce a match for β=0.75 (an edit distance of 8):
TATACCGGATATGACGTTACGGATAGGCCAAA Query 32-mer
GGTACCGGATATGACGTAAAAGTTCCGCGAAA Reference 32-mer
*GTACCGGATATGACGT*A**G*T**GC*AAA Edit distance = 8 ⇒ μA(x) = 0.75
Because the seed will more than likely be padded with leading don't care positions, moving the must-match hash positions to the right, the dynamic programming algorithm can terminate a comparison if the β cut-off threshold cannot possibly be met. This provides a significant reduction in the time required to execute a genome comparison.
Controlling Sensitivity - Seed Weight
The sensitivity of any hash-based alignment mechanism that uses crisp seeds is determined by the seed weight, i.e. the number of hash indices in the seed. For crisp consecutive and spaced seeds, the seed weight also controls the specificity of a k-mer match and the overall running time of an alignment. As the seed weight increases, so too does the specificity of a k-mer match. Increasing the seed weight from 9 to 13 will have a significant impact on the speed, sensitivity and specificity of a crisp k-mer alignment.
For the fuzzy seeds used by FLAK, the seed weight controls both alignment speed and sensitivity, but does not determine specificity. The specificity of a fuzzy seed is determined by the β cut-off threshold.
Controlling Specificity - β Cut-off Threshold
The β cut-off threshold is a fuzzy value between 0 and 1 that is used to control the specificity of an alignment. For a 32-mer sequence, using the seed ###-#--#-#--##-### with a β cut-off of 0 will produce the exact same set of k-mer matches as a crisp spaced seed, i.e. the same sensitivity and specificity. Increasing the β cut-off to 1, will restrict matches to 32-mers with an edit distance of zero (an exact match). As depicted in the generalised diagram below, the range of numbers between 0 and 1 control the specificity of a fuzzy seed:
Increasing β increases seed specificity.
For example, selecting a β cut-off of 0.3 will yield a specificity of 20%, while a value of 0.8 will have a specificity of approximately 92%. The exact relationship between β and specificity depends on the k-mer spectrum of a genome, but will form an "S" curve. Typical usage will have a β cut-off threshold of 0.75 or greater.
Another way of understanding fuzzy seeds is to examine the number of unique keys generated by a hash map when a reference genome is loaded into memory. Consider the digram below, that shows the number of keys generated from the set of non-overlapping 32-mers from the 231.1Mbps H.sapiens Chromosome I, as the number of must-match positions in a seed is increased from 9 to 32. The β cut-off threshold was kept at 0.75 (an edit distance of 8) for this comparison:
H.sapiens Chromosome I (231.1Mbps) - Sensitivity and Specificity of Fuzzy v/s Crisp Seeds for β=0.75 (an edit distance of 8).
The area in yellow shows the increase in specificity of a fuzzy seed with a β cut-off of 0.75 compared to a crisp seed with the same seed weight. The green line at the top of the diagram shows the maximum number of unique 32-mers extracted from Chromosome I. While an increase in the number of map keys may seem to be a disadvantage, the higher number of unique keys produced by a fuzzy seed is fundamental to how fuzzy seeds can execute an approximate k-mer match in average O(1) time. With just 9 must-match indices, the crisp seed is highly sensitive, but will return a large number of false-positives because it has a low specificity. The low number of keys implies that a large number of k-mers are packed into a small number of buckets in a map. The fuzzy seed with the same number of must-match positions is equally sensitive, but much more specific. Combining a small seed weight (≥8) with a high value of β will provide a high degree of both sensitivity and specificity. See the section Custom Alignment Seeds on how to configure FLAK with a custom seed.
* K. P. Choi, F. Zeng, and L. Zhang, "Good spaced seeds for homology search", Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering, 2004. BIBE 2004, pp. 379-386, IEEE, 2004.
** D. Y. Mak and G. Benson, "All hits all the time: parameter-free calculation of spaced seed sensitivity", Bioinformatics, vol. 25, no. 3, pp. 302-308, 2009.
*** B. Ma, J. Tromp, and M. Li, "PatternHunter: faster and more sensitive homology search", Bioinformatics, vol. 18, p. 440, 2002.