Title: | Radix Tree and Trie-Based String Distances |
---|---|
Description: | A collection of Radix Tree and Trie algorithms for finding similar sequences and calculating sequence distances (Levenshtein and other distance metrics). This work was inspired by a trie implementation in Python: "Fast and Easy Levenshtein distance using a Trie." Hanov (2011) <http://stevehanov.ca/blog/index.php?id=114>. |
Authors: | Travers Ching [aut, cre, cph], Martin Moene [ctb, cph] (span-lite C++ library), Steve Hanov [ctb] (Trie levenshtein implementation in Python), Martin Leitner-Ankerl [ctb] (Ankerl unordered dense hashmap) |
Maintainer: | Travers Ching <[email protected]> |
License: | GPL-3 |
Version: | 0.2.8 |
Built: | 2024-11-07 06:14:27 UTC |
Source: | https://github.com/traversc/seqtrie |
Unique TCRB CDR3 sequences from the Nolan et al. 2020. CDR3s were extracted via IgBLAST. The license for this data is Creative Commons Attribution 4.0 International License.
data(covid_cdr3)
data(covid_cdr3)
A character vector of length 133,034.
Nolan, Sean, et al. "A large-scale database of T-cell receptor beta (TCRB) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2." (2020). doi: 10.21203/rs.3.rs-51964/v1.
data(covid_cdr3) # Average CDR3 length mean(nchar(covid_cdr3)) # [1] 43.56821
data(covid_cdr3) # Average CDR3 length mean(nchar(covid_cdr3)) # [1] 43.56821
Compute distances between all combinations of query and target sequences
dist_matrix( query, target, mode, cost_matrix = NULL, gap_cost = NULL, gap_open_cost = NULL, nthreads = 1, show_progress = FALSE )
dist_matrix( query, target, mode, cost_matrix = NULL, gap_cost = NULL, gap_open_cost = NULL, nthreads = 1, show_progress = FALSE )
query |
A character vector of query sequences. |
target |
A character vector of target sequences. |
mode |
The distance metric to use. One of hamming (hm), global (gb) or anchored (an). |
cost_matrix |
A custom cost matrix for use with the "global" or "anchored" distance metrics. See details. |
gap_cost |
The cost of a gap for use with the "global" or "anchored" distance metrics. See details. |
gap_open_cost |
The cost of a gap opening. See details. |
nthreads |
The number of threads to use for parallel computation. |
show_progress |
Whether to show a progress bar. |
This function calculates all combinations of pairwise distances based on Hamming, Levenshtein or Anchored algorithms. The output is a NxM matrix where N = length(query) and M = length(target). Note: this can take a really long time; be careful with input size.
Three types of distance metrics are supported, based on the form of alignment performed. These are: Hamming, Global (Levenshtein) and Anchored.
An anchored alignment is a form of semi-global alignment, where the query sequence is "anchored" (global) to the beginning of both the query and target sequences, but is semi-global in that the end of the either the query sequence or target sequence (but not both) can be unaligned. This type of alignment is sometimes called an "extension" alignment in literature.
In contrast a global alignment must align the entire query and target sequences. When mismatch and indel costs are equal to 1, this is also known as the Levenshtein distance.
By default, if mode == "global" or "anchored", all mismatches and indels are given a cost of 1. However, you can define your own distance metric by setting the cost_matrix and gap parameters. The cost_matrix is a strictly positive square integer matrix and should include all characters in query and target as column- and rownames. To set the cost of a gap (insertion or deletion) you can include a row and column named "gap" in the cost_matrix OR set the gap_cost parameter (a single positive integer). Similarly, the affine gap alignment can be set by including a row and column named "gap_open" in the cost_matrix OR setting the gap_open_cost parameter (a single positive integer). If affine alignment is used, the cost of a gap is defined as: TOTAL_GAP_COST = gap_open_cost + (gap_cost * gap_length).
If mode == "hamming" all alignment parameters are ignored; mismatch is given a distance of 1 and gaps are not allowed.
The output is a distance matrix between all query (rows) and target (columns) sequences. For anchored searches, the output also includes attributes "query_size" and "target_size" which are matrices containing the lengths of the query and target sequences that are aligned.
dist_matrix(c("ACGT", "AAAA"), c("ACG", "ACGT"), mode = "global")
dist_matrix(c("ACGT", "AAAA"), c("ACG", "ACGT"), mode = "global")
Compute the pairwise distance between two sets of sequences
dist_pairwise( query, target, mode, cost_matrix = NULL, gap_cost = NULL, gap_open_cost = NULL, nthreads = 1, show_progress = FALSE )
dist_pairwise( query, target, mode, cost_matrix = NULL, gap_cost = NULL, gap_open_cost = NULL, nthreads = 1, show_progress = FALSE )
query |
A character vector of query sequences. |
target |
A character vector of target sequences.. Must be the same length as query. |
mode |
The distance metric to use. One of hamming (hm), global (gb) or anchored (an). |
cost_matrix |
A custom cost matrix for use with the "global" or "anchored" distance metrics. See details. |
gap_cost |
The cost of a gap for use with the "global" or "anchored" distance metrics. See details. |
gap_open_cost |
The cost of a gap opening. See details. |
nthreads |
The number of threads to use for parallel computation. |
show_progress |
Whether to show a progress bar. |
This function calculates pairwise distances based on Hamming, Levenshtein or Anchored algorithms. query and target must be the same length.
Three types of distance metrics are supported, based on the form of alignment performed. These are: Hamming, Global (Levenshtein) and Anchored.
An anchored alignment is a form of semi-global alignment, where the query sequence is "anchored" (global) to the beginning of both the query and target sequences, but is semi-global in that the end of the either the query sequence or target sequence (but not both) can be unaligned. This type of alignment is sometimes called an "extension" alignment in literature.
In contrast a global alignment must align the entire query and target sequences. When mismatch and indel costs are equal to 1, this is also known as the Levenshtein distance.
By default, if mode == "global" or "anchored", all mismatches and indels are given a cost of 1. However, you can define your own distance metric by setting the cost_matrix and gap parameters. The cost_matrix is a strictly positive square integer matrix and should include all characters in query and target as column- and rownames. To set the cost of a gap (insertion or deletion) you can include a row and column named "gap" in the cost_matrix OR set the gap_cost parameter (a single positive integer). Similarly, the affine gap alignment can be set by including a row and column named "gap_open" in the cost_matrix OR setting the gap_open_cost parameter (a single positive integer). If affine alignment is used, the cost of a gap is defined as: TOTAL_GAP_COST = gap_open_cost + (gap_cost * gap_length).
If mode == "hamming" all alignment parameters are ignored; mismatch is given a distance of 1 and gaps are not allowed.
The output of this function is a vector of distances. If mode == "anchored" then the output also includes attributes "query_size" and "target_size" which are vectors containing the lengths of the query and target sequences that are aligned.
dist_pairwise(c("ACGT", "AAAA"), c("ACG", "ACGT"), mode = "global")
dist_pairwise(c("ACGT", "AAAA"), c("ACG", "ACGT"), mode = "global")
Find similar sequences within a distance threshold
dist_search( query, target, max_distance = NULL, max_fraction = NULL, mode = "levenshtein", cost_matrix = NULL, gap_cost = NULL, gap_open_cost = NULL, tree_class = "RadixTree", nthreads = 1, show_progress = FALSE )
dist_search( query, target, max_distance = NULL, max_fraction = NULL, mode = "levenshtein", cost_matrix = NULL, gap_cost = NULL, gap_open_cost = NULL, tree_class = "RadixTree", nthreads = 1, show_progress = FALSE )
query |
A character vector of query sequences. |
target |
A character vector of target sequences. |
max_distance |
how far to search in units of absolute distance. Can be a single value or a vector. Mutually exclusive with max_fraction. |
max_fraction |
how far to search in units of relative distance to each query sequence length. Can be a single value or a vector. Mutually exclusive with max_distance. |
mode |
The distance metric to use. One of hamming (hm), global (gb) or anchored (an). |
cost_matrix |
A custom cost matrix for use with the "global" or "anchored" distance metrics. See details. |
gap_cost |
The cost of a gap for use with the "global" or "anchored" distance metrics. See details. |
gap_open_cost |
The cost of a gap opening. See details. |
tree_class |
Which R6 class to use. Either RadixTree or RadixForest (default: RadixTree) |
nthreads |
The number of threads to use for parallel computation. |
show_progress |
Whether to show a progress bar. |
This function finds all sequences in target that are within a distance threshold of any sequence in query. This function uses either a RadixTree or RadixForest to store target sequences. See the R6 class documentation for additional details.
Three types of distance metrics are supported, based on the form of alignment performed. These are: Hamming, Global (Levenshtein) and Anchored.
An anchored alignment is a form of semi-global alignment, where the query sequence is "anchored" (global) to the beginning of both the query and target sequences, but is semi-global in that the end of the either the query sequence or target sequence (but not both) can be unaligned. This type of alignment is sometimes called an "extension" alignment in literature.
In contrast a global alignment must align the entire query and target sequences. When mismatch and indel costs are equal to 1, this is also known as the Levenshtein distance.
By default, if mode == "global" or "anchored", all mismatches and indels are given a cost of 1. However, you can define your own distance metric by setting the cost_matrix and gap parameters. The cost_matrix is a strictly positive square integer matrix and should include all characters in query and target as column- and rownames. To set the cost of a gap (insertion or deletion) you can include a row and column named "gap" in the cost_matrix OR set the gap_cost parameter (a single positive integer). Similarly, the affine gap alignment can be set by including a row and column named "gap_open" in the cost_matrix OR setting the gap_open_cost parameter (a single positive integer). If affine alignment is used, the cost of a gap is defined as: TOTAL_GAP_COST = gap_open_cost + (gap_cost * gap_length).
If mode == "hamming" all alignment parameters are ignored; mismatch is given a distance of 1 and gaps are not allowed.
The output is a data.frame of all matches with columns "query" and "target". For anchored searches, the output also includes attributes "query_size" and "target_size" which are vectors containing the portion of the query and target sequences that are aligned.
dist_search(c("ACGT", "AAAA"), c("ACG", "ACGT"), max_distance = 1, mode = "levenshtein")
dist_search(c("ACGT", "AAAA"), c("ACG", "ACGT"), max_distance = 1, mode = "levenshtein")
Generate a cost matrix for use with the search
method
generate_cost_matrix( charset, match = 0L, mismatch = 1L, gap = NULL, gap_open = NULL )
generate_cost_matrix( charset, match = 0L, mismatch = 1L, gap = NULL, gap_open = NULL )
charset |
A string representing all possible characters in both query and target sequences (e.g. "ACGT") |
match |
The cost of a match |
mismatch |
The cost of a mismatch |
gap |
The cost of a gap or NULL if this parameter will be set later. |
gap_open |
The cost of a gap opening or NULL. If this parameter is set, gap must also be set. |
A cost matrix
generate_cost_matrix("ACGT", match = 0, mismatch = 1)
generate_cost_matrix("ACGT", match = 0, mismatch = 1)
Radix Forest class implementation
The RadixForest class is a specialization of the RadixTree implementation. Instead of putting sequences into a single tree, the RadixForest class puts sequences into separate trees based on sequence length. This allows for faster searching of similar sequences based on Hamming or Levenshtein distance metrics. Unlike the RadixTree class, the RadixForest class does not support anchored searches or a custom cost matrix. See RadixTree for additional details.
forest_pointer
Map of sequence length to RadixTree
char_counter_pointer
Character count data for the purpose of validating input
new()
Create a new RadixForest object
RadixForest$new(sequences = NULL)
sequences
A character vector of sequences to insert into the forest
show()
Print the forest to screen
RadixForest$show()
to_string()
Print the forest to a string
RadixForest$to_string()
graph()
Plot of the forest using igraph
RadixForest$graph(depth = -1, root_label = "root", plot = TRUE)
depth
The tree depth to plot for each tree in the forest.
root_label
The label of the root node(s) in the plot.
plot
Whether to create a plot or return the data used to generate the plot.
A data frame of parent-child relationships used to generate the igraph plot OR a ggplot2 object
to_vector()
Output all sequences held by the forest as a character vector
RadixForest$to_vector()
A character vector of all sequences contained in the forest.
size()
Output the size of the forest (i.e. how many sequences are contained)
RadixForest$size()
The size of the forest
insert()
Insert new sequences into the forest
RadixForest$insert(sequences)
sequences
A character vector of sequences to insert into the forest
A logical vector indicating whether the sequence was inserted (TRUE) or already existing in the forest (FALSE)
erase()
Erase sequences from the forest
RadixForest$erase(sequences)
sequences
A character vector of sequences to erase from the forest
A logical vector indicating whether the sequence was erased (TRUE) or not found in the forest (FALSE)
find()
Find sequences in the forest
RadixForest$find(query)
query
A character vector of sequences to find in the forest
A logical vector indicating whether the sequence was found (TRUE) or not found in the forest (FALSE)
prefix_search()
Search for sequences in the forest that start with a specified prefix. E.g.: a query of "CAR" will find "CART", "CARBON", "CARROT", etc. but not "CATS".
RadixForest$prefix_search(query)
query
A character vector of sequences to search for in the forest
A data frame of all matches with columns "query" and "target".
search()
Search for sequences in the forest that are with a specified distance metric to a specified query.
RadixForest$search( query, max_distance = NULL, max_fraction = NULL, mode = "levenshtein", nthreads = 1, show_progress = FALSE )
query
A character vector of query sequences.
max_distance
how far to search in units of absolute distance. Can be a single value or a vector. Mutually exclusive with max_fraction.
max_fraction
how far to search in units of relative distance to each query sequence length. Can be a single value or a vector. Mutually exclusive with max_distance.
mode
The distance metric to use. One of hamming (hm), global (gb) or anchored (an).
nthreads
The number of threads to use for parallel computation.
show_progress
Whether to show a progress bar.
The output is a data.frame of all matches with columns "query" and "target".
validate()
Validate the forest
RadixForest$validate()
A logical indicating whether the forest is valid (TRUE) or not (FALSE). This is mostly an internal function for debugging purposes and should always return TRUE.
forest <- RadixForest$new() forest$insert(c("ACGT", "AAAA")) forest$erase("AAAA") forest$search("ACG", max_distance = 1, mode = "levenshtein") # query target distance # 1 ACG ACGT 1 forest$search("ACG", max_distance = 1, mode = "hamming") # query target distance # <0 rows> (or 0-length row.names)
forest <- RadixForest$new() forest$insert(c("ACGT", "AAAA")) forest$erase("AAAA") forest$search("ACG", max_distance = 1, mode = "levenshtein") # query target distance # 1 ACG ACGT 1 forest$search("ACG", max_distance = 1, mode = "hamming") # query target distance # <0 rows> (or 0-length row.names)
Radix Tree (trie) class implementation
The RadixTree class is a trie implementation. The primary usage is to be able to search of similar sequences based on a dynamic programming framework. This can be done using the search method which searches for similar sequences based on the Global, Anchored or Hamming distance metrics.
Three types of distance metrics are supported, based on the form of alignment performed. These are: Hamming, Global (Levenshtein) and Anchored.
An anchored alignment is a form of semi-global alignment, where the query sequence is "anchored" (global) to the beginning of both the query and target sequences, but is semi-global in that the end of the either the query sequence or target sequence (but not both) can be unaligned. This type of alignment is sometimes called an "extension" alignment in literature.
In contrast a global alignment must align the entire query and target sequences. When mismatch and indel costs are equal to 1, this is also known as the Levenshtein distance.
By default, if mode == "global" or "anchored", all mismatches and indels are given a cost of 1. However, you can define your own distance metric by setting the cost_matrix and gap parameters. The cost_matrix is a strictly positive square integer matrix and should include all characters in query and target as column- and rownames. To set the cost of a gap (insertion or deletion) you can include a row and column named "gap" in the cost_matrix OR set the gap_cost parameter (a single positive integer). Similarly, the affine gap alignment can be set by including a row and column named "gap_open" in the cost_matrix OR setting the gap_open_cost parameter (a single positive integer). If affine alignment is used, the cost of a gap is defined as: TOTAL_GAP_COST = gap_open_cost + (gap_cost * gap_length).
If mode == "hamming" all alignment parameters are ignored; mismatch is given a distance of 1 and gaps are not allowed.
root_pointer
Root of the RadixTree
char_counter_pointer
Character count data for the purpose of validating input
new()
Create a new RadixTree object
RadixTree$new(sequences = NULL)
sequences
A character vector of sequences to insert into the tree
show()
Print the tree to screen
RadixTree$show()
to_string()
Print the tree to a string
RadixTree$to_string()
A string representation of the tree
graph()
Plot of the tree using igraph (needs to be installed separately)
RadixTree$graph(depth = -1, root_label = "root", plot = TRUE)
depth
The tree depth to plot. If -1 (default), plot the entire tree.
root_label
The label of the root node in the plot.
plot
Whether to create a plot or return the data used to generate the plot.
A data frame of parent-child relationships used to generate the igraph plot OR a ggplot2 object
to_vector()
Output all sequences held by the tree as a character vector
RadixTree$to_vector()
A character vector of all sequences contained in the tree. Return order is not guaranteed.
size()
Output the size of the tree (i.e. how many sequences are contained)
RadixTree$size()
The size of the tree
insert()
Insert new sequences into the tree
RadixTree$insert(sequences)
sequences
A character vector of sequences to insert into the tree
A logical vector indicating whether the sequence was inserted (TRUE) or already existing in the tree (FALSE)
erase()
Erase sequences from the tree
RadixTree$erase(sequences)
sequences
A character vector of sequences to erase from the tree
A logical vector indicating whether the sequence was erased (TRUE) or not found in the tree (FALSE)
find()
Find sequences in the tree
RadixTree$find(query)
query
A character vector of sequences to find in the tree
A logical vector indicating whether the sequence was found (TRUE) or not found in the tree (FALSE)
prefix_search()
Search for sequences in the tree that start with a specified prefix. E.g.: a query of "CAR" will find "CART", "CARBON", "CARROT", etc. but not "CATS".
RadixTree$prefix_search(query)
query
A character vector of sequences to search for in the tree
A data frame of all matches with columns "query" and "target".
search()
Search for sequences in the tree that are with a specified distance metric to a specified query.
RadixTree$search( query, max_distance = NULL, max_fraction = NULL, mode = "levenshtein", cost_matrix = NULL, gap_cost = NULL, gap_open_cost = NULL, nthreads = 1, show_progress = FALSE )
query
A character vector of query sequences.
max_distance
how far to search in units of absolute distance. Can be a single value or a vector. Mutually exclusive with max_fraction.
max_fraction
how far to search in units of relative distance to each query sequence length. Can be a single value or a vector. Mutually exclusive with max_distance.
mode
The distance metric to use. One of hamming (hm), global (gb) or anchored (an).
cost_matrix
A custom cost matrix for use with the "global" or "anchored" distance metrics. See details.
gap_cost
The cost of a gap for use with the "global" or "anchored" distance metrics. See details.
gap_open_cost
The cost of a gap opening. See details.
nthreads
The number of threads to use for parallel computation.
show_progress
Whether to show a progress bar.
The output is a data.frame of all matches with columns "query" and "target". For anchored searches, the output also includes attributes "query_size" and "target_size" which are vectors containing the portion of the query and target sequences that are aligned.
validate()
Validate the tree
RadixTree$validate()
A logical indicating whether the tree is valid (TRUE) or not (FALSE). This is mostly an internal function for debugging purposes and should always return TRUE.
https://en.wikipedia.org/wiki/Radix_tree
tree <- RadixTree$new() tree$insert(c("ACGT", "AAAA")) tree$erase("AAAA") tree$search("ACG", max_distance = 1, mode = "levenshtein") # query target distance # 1 ACG ACGT 1 tree$search("ACG", max_distance = 1, mode = "hamming") # query target distance # <0 rows> (or 0-length row.names)
tree <- RadixTree$new() tree$insert(c("ACGT", "AAAA")) tree$erase("AAAA") tree$search("ACG", max_distance = 1, mode = "levenshtein") # query target distance # 1 ACG ACGT 1 tree$search("ACG", max_distance = 1, mode = "hamming") # query target distance # <0 rows> (or 0-length row.names)
Search for similar sequences based on splitting sequences into left and right sides and searching for matches in each side using a bi-directional anchored alignment.
split_search( query, target, query_split, target_split, edge_trim = 0L, max_distance = 0L, ... )
split_search( query, target, query_split, target_split, edge_trim = 0L, max_distance = 0L, ... )
query |
A character vector of query sequences. |
target |
A character vector of target sequences. |
query_split |
index to split query sequence. Should be within (edge_trim, nchar(query)-edge_trim] or -1 to indicate no split. |
target_split |
index to split target sequence. Should be within (edge_trim, nchar(query)-edge_trim] or -1 to indicate no split. |
edge_trim |
number of bases to trim from each side of the sequence (default value: 0). |
max_distance |
how far to search in units of absolute distance. Can be a single value or a vector. Mutually exclusive with max_fraction. |
... |
additional arguments passed to |
This function is useful for searching for similar sequences that may have variable windows of sequencing (e.g. different 5' and 3' primers) but contain the same core sequence or position. The two split parameters partition the query and target sequences into left and right sides, where left = stri_sub(sequence, edge_trim+1, split) and right = stri_sub(query, split+1, -edge_trim-1).
data.frame with columns query, target, and distance.
# Consider two sets of sequences # query1 AGACCTAA CCC # target1 AAGACCTAA CC # query2 GGGTGTAA CCACCC # target2 GGTGTAA CCAC # Despite having different frames, query1 and query2 and clearly # match to target1 and target2, respectively. # One could consider splitting based on a common core sequence, # e.g. a common TAA stop codon. split_search(query=c( "AGACCTAACCC", "GGGTGTAACCACCC"), target=c("AAGACCTAACC", "GGTGTAACCAC"), query_split=c(8, 8), target_split=c(9, 7), edge_trim=0, max_distance=0)
# Consider two sets of sequences # query1 AGACCTAA CCC # target1 AAGACCTAA CC # query2 GGGTGTAA CCACCC # target2 GGTGTAA CCAC # Despite having different frames, query1 and query2 and clearly # match to target1 and target2, respectively. # One could consider splitting based on a common core sequence, # e.g. a common TAA stop codon. split_search(query=c( "AGACCTAACCC", "GGGTGTAACCACCC"), target=c("AAGACCTAACC", "GGTGTAACCAC"), query_split=c(8, 8), target_split=c(9, 7), edge_trim=0, max_distance=0)