At the start of the alignment process, FLAK creates a database from the set of non-overlapping 32-mers in the specified reference genome. Using 2-bit encoding, each 32-mer is stored as a 64-bit number. As the k-mer size is fixed at 32, the amount of memory consumed by a genome of length G will be O(G / 32). This default setting works well in practice and is necessary to prevent the available memory (heap space) of a computer being completely consumed by a large genome.
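The 2-bit packing described above can be sketched as follows. This is a minimal illustration rather than FLAK's actual code: the base-to-bits mapping (A=0, C=1, G=2, T=3) and the function names encode_kmer and non_overlapping_kmers are assumptions made for the example, and ambiguous bases such as N are not handled.

```python
K = 32  # k-mer size, fixed at 32 as described in the text

# Assumed mapping; the text does not state which 2-bit code FLAK uses.
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer of up to 32 bases into an integer, 2 bits per base."""
    if len(kmer) > K:
        raise ValueError("k-mer longer than 32 bases does not fit in 64 bits")
    value = 0
    for base in kmer:
        # Shift the bases seen so far left by 2 bits and append the new base.
        value = (value << 2) | BASE_BITS[base]
    return value


def non_overlapping_kmers(genome, k=K):
    """Yield (start, encoded k-mer) for consecutive, non-overlapping k-mers."""
    # Stepping by k means no base is shared between adjacent k-mers.
    for start in range(0, len(genome) - k + 1, k):
        yield start, encode_kmer(genome[start:start + k])
```

Because each base takes 2 bits, a 32-mer fills exactly 64 bits, which is why one 64-bit number per 32-mer is sufficient.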

It is, however, sometimes desirable to increase the amount of overlap between the reference 32-mers in the database. A greater overlap increases the sensitivity of a comparison and can facilitate the alignment of genomes that are highly diverged. An example is shown in the figure below. For the plot on the left, the genome of M. pneumoniae was indexed using the default setting (overlap = 0). The plot on the right was created from a full tiling of the reference genome, i.e. an overlap of 31 bases.


Visualisation of M. genitalium vs. M. pneumoniae with no overlap (left) and an overlap of 31 (right).


The degree of overlap can be specified when configuring parameters with the Alignment Wizard. At the bottom of the Select Reference Genome section, a slider allows users to specify an overlap value in the range 0 to 31.

Users should be wary of using this feature for large genomes or chromosomes, as the degree of overlap has a direct impact on the amount of memory consumed and, depending on the k-mer composition of the genome, may increase the running time of an alignment. For an overlap of i, consecutive reference k-mers start k - i bases apart, so the amount of memory required to store a reference genome of length G will be on the order of O(G / (k - i)), where k = 32. With the maximum overlap of 31, every position in the genome is indexed and the number of stored 32-mers is roughly 32 times that of the default setting. Consequently, unless a large amount of memory is available, the degree of overlap should only be changed for bacterial-sized genomes (< 10 Mbp).
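To get a rough feel for how the overlap setting scales the index, the number of stored 32-mers can be estimated as below. This is a back-of-the-envelope sketch only: the function names are hypothetical, and the byte estimate counts just the 8-byte encoded 32-mers, ignoring stored positions and any overhead of the data structure FLAK actually uses.

```python
def count_indexed_kmers(genome_length, overlap, k=32):
    """Number of k-mers stored when adjacent k-mers share `overlap` bases."""
    if not 0 <= overlap < k:
        raise ValueError("overlap must be in the range 0 to k - 1")
    if genome_length < k:
        return 0
    step = k - overlap  # distance between the starts of adjacent k-mers
    return (genome_length - k) // step + 1


def estimate_key_bytes(genome_length, overlap, k=32):
    """Lower bound on memory: 8 bytes (one 64-bit number) per stored k-mer."""
    return count_indexed_kmers(genome_length, overlap, k) * 8


if __name__ == "__main__":
    genome_length = 10_000_000  # a 10 Mbp genome, the upper bound suggested above
    for overlap in (0, 16, 31):
        megabytes = estimate_key_bytes(genome_length, overlap) / 1e6
        print(f"overlap={overlap:2d}: at least {megabytes:.1f} MB of encoded 32-mers")
```

The estimate grows in proportion to 1 / (k - i), so moving the slider from 0 to 31 multiplies the number of stored 32-mers by about 32.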



