| Title: | Alt String Implementation |
|---|---|
| Description: | Provides an extendable, performant and multithreaded 'alt-string' implementation backed by 'C++' vectors and strings. |
| Authors: | Travers Ching [aut, cre, cph], Phillip Hazel [ctb] (Bundled PCRE2 code), Zoltan Herczeg [ctb, cph] (Bundled PCRE2 code), University of Cambridge [cph] (Bundled PCRE2 code), Tilera Corporation [cph] (Stack-less Just-In-Time compiler bundled with PCRE2), Yann Collet [ctb, cph] (Yann Collet is the author of the bundled xxHash code), Martin Leitner-Ankerl [ctb, cph] (Bundled ankerl::unordered_dense code) |
| Maintainer: | Travers Ching <[email protected]> |
| License: | GPL-3 |
| Version: | 0.19.0 |
| Built: | 2026-05-22 09:16:41 UTC |
| Source: | https://github.com/traversc/stringfish |
Converts a character vector to an 'sf_vec'-backed stringfish vector. 'convert_to_sf()' is a compatibility alias for 'convert_to_sf_vector()'.
convert_to_sf_vector(x, length.out = length(x))
| x | A character vector |
|---|---|
| length.out | Optional output length used to recycle 'x' |
Converts a character vector to a stringfish vector backed by 'sf_vec'. If 'length.out' is supplied, 'x' is recycled to that length before conversion. The opposite of 'materialize'.
The converted character vector
x <- convert_to_sf_vector(letters)
Converts a character vector to a slice-store-backed stringfish vector
convert_to_slice_store(x, length.out = length(x))
| x | A character vector |
|---|---|
| length.out | Optional output length used to recycle 'x' |
Converts a character vector to a stringfish vector backed by 'slice_store'. If 'length.out' is supplied, 'x' is recycled to that length before conversion. The converter pre-sizes the first 'slice_store' slice from the normalized string bytes when possible. The opposite of 'materialize'.
The converted character vector
x <- convert_to_slice_store(letters)
Returns the type of the character vector
get_string_type(x)
| x | The vector |
|---|---|
A function that returns the type of a character vector. Possible values are "normal vector", "stringfish vector", "stringfish vector (materialized)", "stringfish slice store", "stringfish slice store (materialized)" or "other alt-rep vector".
The type of vector
x <- sf_vector(10)
get_string_type(x) # returns "stringfish vector"
x <- character(10)
get_string_type(x) # returns "normal vector"
Materializes an alt-rep object
materialize(x)
| x | An alt-rep object |
|---|---|
Materializes any alt-rep object and then returns it. Note: the object is materialized regardless of whether the return value is assigned to a variable.
x
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
sf_assign(x, 2, "another string")
x <- materialize(x)
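The note about materialization happening even when the return value is discarded can be checked with 'get_string_type()'. The following is a minimal sketch, assuming the package is installed and that the type labels match those listed for 'get_string_type'; the variable names 'before' and 'after' are illustrative only.

```r
library(stringfish)

x <- sf_vector(3)
sf_assign(x, 1, "hello world")

# Type before materialization
before <- get_string_type(x)

# The return value is deliberately not assigned; invisible() only
# suppresses printing of the returned vector
invisible(materialize(x))

# Type after materialization of the same object
after <- get_string_type(x)
```

Under these assumptions, 'before' reports an unmaterialized stringfish vector and 'after' reports the materialized variant, even though 'x' was never reassigned.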
A function that generates random strings
random_strings(N, string_size = 50L, charset = "abcdefghijklmnopqrstuvwxyz", vector_mode = "stringfish")
| N | The number of strings to generate |
|---|---|
| string_size | Either a single non-negative integer applied to every output string, or a non-negative integer vector of length 'N'. |
| charset | The characters used to generate the random strings (default: abcdefghijklmnopqrstuvwxyz) |
| vector_mode | The type of character vector to generate (either stringfish or normal, default: stringfish) |
A convenience function for generating test strings.
A character vector of the random strings
set.seed(1)
x <- random_strings(1e6, 80L, "ACGT", vector_mode = "stringfish")
y <- random_strings(4, c(1L, 2L, 4L, 8L), "ACGT")
Assigns a new string to a stringfish vector or any other character vector
sf_assign(x, i, e)
| x | The vector |
|---|---|
| i | The index to assign to |
| e | The new string to place at index i in x |
A function to assign a new element to an existing character vector. If the vector is a stringfish vector, it does so without materialization.
No return value; the function assigns an element to an existing stringfish vector.
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
sf_assign(x, 2, "another string")
Pastes a series of strings together separated by the 'collapse' parameter
sf_collapse(x, collapse)
| x | A character vector |
|---|---|
| collapse | A single string |
This works the same way as 'paste0(x, collapse=collapse)'
A single string with all values in 'x' pasted together, separated by 'collapse'.
paste0, paste
x <- c("hello", "\xe4\xb8\x96\xe7\x95\x8c")
Encoding(x) <- "UTF-8"
sf_collapse(x, " ") # "hello world" in Japanese
sf_collapse(letters, "") # returns the alphabet
Returns a logical vector testing equality of strings from two string vectors
sf_compare(x, y, nthreads = getOption("stringfish.nthreads", 1L))
sf_equals(x, y, nthreads = getOption("stringfish.nthreads", 1L))
| x | A character vector of length 1 or the same non-zero length as y |
|---|---|
| y | Another character vector of length 1 or the same non-zero length as x |
| nthreads | Number of threads to use |
Note: the function tests semantic string equality. Non-byte text is normalized to a UTF-8 working representation, while 'CE_BYTES' strings are compared byte-for-byte.
A logical vector
sf_compare(letters, "a")
Appends vectors together
sf_concat(...)
sfc(...)
| ... | Any number of vectors, coerced to character vector if necessary |
|---|---|
A concatenated stringfish vector
sf_concat(letters, 1:5)
A function for detecting a pattern at the end of a string
sf_ends(subject, pattern, ...)
| subject | A character vector |
|---|---|
| pattern | A string to look for at the end |
| ... | Parameters passed to sf_grepl |
A logical vector: TRUE if there is a match, FALSE if there is no match, and NA if the subject was NA
endsWith, sf_starts
x <- c("alpha", "beta", "gamma", "delta", "epsilon")
sf_ends(x, "a")
A function that matches patterns and returns a logical vector
sf_grepl(subject, pattern, encode_mode = "auto", fixed = FALSE, nthreads = getOption("stringfish.nthreads", 1L))
| subject | The subject character vector to search |
|---|---|
| pattern | The pattern to search for |
| encode_mode | "auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) or single-byte characters are used. |
| fixed | Determines whether the pattern parameter should be interpreted literally or as a regular expression |
| nthreads | Number of threads to use |
The function uses the PCRE2 library, which is also used internally by R. The encoding is based on the pattern string (or forced via the encode_mode parameter). Note: the order of parameters is switched compared to the base R 'grepl' function, with subject being first.
A logical vector with the same length as subject
grepl
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
pattern <- "^hello"
sf_grepl(x, pattern)
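Because the argument order differs from base R, a side-by-side call may help when porting code. This is a small sketch assuming the package is installed; the input vector and the names 'base_hits' and 'sf_hits' are made up for illustration.

```r
library(stringfish)

x <- c("hello world", "goodbye")

# Base R: pattern first, subject second
base_hits <- grepl("^hello", x)

# stringfish: subject first, pattern second
sf_hits <- sf_grepl(x, "^hello")
```

For a simple ASCII pattern like this one, both calls are expected to flag only the first element.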
A function that performs pattern substitution
sf_gsub(subject, pattern, replacement, encode_mode = "auto", fixed = FALSE, nthreads = getOption("stringfish.nthreads", 1L))
| subject | The subject character vector to search |
|---|---|
| pattern | The pattern to search for |
| replacement | The replacement string |
| encode_mode | "auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) or single-byte characters are used. |
| fixed | Determines whether the pattern parameter should be interpreted literally or as a regular expression |
| nthreads | Number of threads to use |
The function uses the PCRE2 library, which is also used internally by R. However, the syntax may differ slightly. For example, capture groups are referenced as "\1" in R but as "$1" in PCRE2 (as in Perl). The encoding of the output is determined by the pattern (or forced using the encode_mode parameter), and encodings should be compatible: mixing ASCII and UTF-8 is okay, but mixing UTF-8 and latin1 is not. Note: the order of parameters is switched compared to the base R 'gsub' function, with subject being first.
A stringfish vector with the replacements applied
gsub
x <- "hello world"
pattern <- "^hello (.+)"
replacement <- "goodbye $1"
sf_gsub(x, pattern, replacement)
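The capture-group difference described above is easy to trip over when converting existing 'gsub' calls. The sketch below runs the same substitution both ways, assuming the package is installed; 'base_result' and 'sf_result' are illustrative names.

```r
library(stringfish)

x <- "hello world"
pattern <- "^hello (.+)"

# Base R: pattern first, backreference written as \\1 in an R string
base_result <- gsub(pattern, "goodbye \\1", x)

# stringfish: subject first, backreference written as $1
sf_result <- sf_gsub(x, pattern, "goodbye $1")
```

Both results are expected to contain the same replaced text for this input.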
Converts encoding of one character vector to another
sf_iconv(x, from, to, nthreads = getOption("stringfish.nthreads", 1L))
| x | An alt-rep object |
|---|---|
| from | The encoding to assume for 'x' |
| to | The new encoding |
| nthreads | Number of threads to use |
This is an analogue to the base R function 'iconv'. It converts a string from one encoding (e.g. latin1 or UTF-8) to another.
the converted character vector as a stringfish vector
iconv
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
sf_iconv(x, "latin1", "UTF-8")
Returns a vector of the positions of x in table
sf_match(x, table, nthreads = getOption("stringfish.nthreads", 1L))
| x | A character vector to search for in table |
|---|---|
| table | A character vector to be matched against x |
| nthreads | Number of threads to use |
Note: similarly to the base R function, long "table" vectors are not supported. This is due to the maximum integer value that can be returned ('.Machine$integer.max').
An integer vector of the indices of each x element's position in table
match
sf_match("c", letters)
Counts the number of characters in a character vector
sf_nchar(x, type = "chars", nthreads = getOption("stringfish.nthreads", 1L))
| x | A character vector |
|---|---|
| type | The type of counting to perform ("chars" or "bytes", default: "chars") |
| nthreads | Number of threads to use |
Returns the number of characters per string. The type of counting only matters for UTF-8 strings, where a character can be represented by multiple bytes.
An integer vector of the number of characters
nchar
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x <- sf_iconv(x, "latin1", "UTF-8")
sf_nchar(x) # counts characters
sf_nchar(x, type = "bytes") # counts bytes
Pastes a series of strings together
sf_paste(..., sep = "", nthreads = getOption("stringfish.nthreads", 1L))
| ... | Any number of character vector strings |
|---|---|
| sep | The separating string between strings |
| nthreads | Number of threads to use |
This works the same way as 'paste(..., sep = sep)'.
A character vector where elements of the arguments are pasted together
paste0, paste
x <- letters
y <- LETTERS
sf_paste(x, y, sep = ":")
A function that reads a file line by line
sf_readLines(file, encoding = "UTF-8")
| file | The file name |
|---|---|
| encoding | The encoding to use (Default: UTF-8) |
A function for reading in text data using 'std::ifstream'.
A stringfish vector of the lines in a file
readLines
file <- tempfile()
sf_writeLines(letters, file)
sf_readLines(file)
A function to split strings by a delimiter
sf_split(subject, split, encode_mode = "auto", fixed = FALSE, nthreads = getOption("stringfish.nthreads", 1L))
| subject | A character vector |
|---|---|
| split | A delimiter to split the string by |
| encode_mode | "auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) or single-byte characters are used. |
| fixed | Determines whether the split parameter should be interpreted literally or as a regular expression |
| nthreads | Number of threads to use |
A list of stringfish character vectors
strsplit
sf_split(datasets::state.name, "\\s") # split U.S. state names by any space character
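The 'fixed' argument matters when the delimiter is a regular-expression metacharacter. The following sketch, assuming the package is installed, splits on a literal dot; the input strings and the name 'parts' are illustrative.

```r
library(stringfish)

x <- c("a.b.c", "d.e")

# With fixed = TRUE the dot is treated as a literal character
# rather than as the regex "any character" wildcard
parts <- sf_split(x, ".", fixed = TRUE)

# One list element per input string
lengths(parts)
```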
A function for detecting a pattern at the start of a string
sf_starts(subject, pattern, ...)
| subject | A character vector |
|---|---|
| pattern | A string to look for at the start |
| ... | Parameters passed to sf_grepl |
A logical vector: TRUE if there is a match, FALSE if there is no match, and NA if the subject was NA
startsWith, sf_ends
x <- c("alpha", "beta", "gamma", "delta", "epsilon")
sf_starts(x, "a")
Extracts substrings from a character vector
sf_substr(x, start, stop, nthreads = getOption("stringfish.nthreads", 1L))
| x | A character vector |
|---|---|
| start | The position at which extraction begins |
| stop | The position at which extraction ends |
| nthreads | Number of threads to use |
This works the same way as 'substr', but in addition allows negative indexing. Negative indices count backwards from the end of the string, with -1 being the last character.
A stringfish vector of substrings
substr
x <- c("fa\xE7ile", "hello world")
Encoding(x) <- "latin1"
x <- sf_iconv(x, "latin1", "UTF-8")
sf_substr(x, 4, -1) # extracts from the 4th character to the last
## [1] "ile" "lo world"
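To make the negative-indexing rule concrete, the sketch below extracts the last three characters of each string twice: once with base 'substr' and explicit lengths, and once with negative indices. It assumes the package is installed; the input strings and variable names are illustrative.

```r
library(stringfish)

x <- c("stringfish", "hello world")

# Base R needs the string length to address the tail
n <- nchar(x)
base_tail <- substr(x, n - 2, n)

# stringfish: -3 is the third-to-last character, -1 is the last
sf_tail <- sf_substr(x, -3, -1)
```

For ASCII input such as this, the two approaches are expected to agree.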
A function converting a string to all lowercase
sf_tolower(x)
| x | A character vector |
|---|---|
Note: the function only converts ASCII characters.
A stringfish vector where all uppercase is converted to lowercase
tolower
x <- LETTERS
sf_tolower(x)
A function converting a string to all uppercase
sf_toupper(x)
| x | A character vector |
|---|---|
Note: the function only converts ASCII characters.
A stringfish vector where all lowercase is converted to uppercase
toupper
x <- letters
sf_toupper(x)
A function to remove leading/trailing whitespace
sf_trim(subject, which = c("both", "left", "right"), whitespace = "[ \\t\\r\\n]", ...)
| subject | A character vector |
|---|---|
| which | "both", "left", or "right"; determines which side's whitespace is removed |
| whitespace | Whitespace characters (default: "[ \\t\\r\\n]") |
| ... | Parameters passed to sf_gsub |
A stringfish vector with the selected leading/trailing whitespace removed
trimws
x <- c(" alpha ", " beta", " gamma ", "delta ", "epsilon ")
sf_trim(x)
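The 'which' argument restricts trimming to one side. This is a brief sketch, assuming the package is installed and that the default 'whitespace' class is used; the input and the names 'left_only' and 'right_only' are illustrative.

```r
library(stringfish)

x <- c("  alpha  ", "beta  ")

# Remove only leading whitespace
left_only <- sf_trim(x, which = "left")

# Remove only trailing whitespace
right_only <- sf_trim(x, which = "right")
```

For plain spaces, the results should line up with base R's 'trimws(x, which = "left")' and 'trimws(x, which = "right")'.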
Creates a new empty stringfish vector
sf_vector(len)
| len | Length of the new vector |
|---|---|
This is a backwards-compatible alias for 'sf_vector_create(len)'. It creates a new stringfish vector, an alt-rep character vector backed by a C++ "std::vector" as the internal memory representation. The vector type is "sfstring", which is a simple C++ class containing owned string bytes and a single byte (uint8_t) representing the encoding.
A new empty stringfish vector
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
sf_assign(x, 2, "another string")
Creates a new empty stringfish vector
sf_vector_create(len)
| len | Length of the new vector |
|---|---|
This function creates a new empty 'sf_vec'-backed stringfish vector. If you want to fill the vector from character data, use 'convert_to_sf_vector'.
A new stringfish vector
x <- sf_vector_create(4)
A function that writes text line by line
sf_writeLines(text, file, sep = "\n", na_value = "NA", encode_mode = "UTF-8")
| text | A character vector to write to file |
|---|---|
| file | Name of the file to write to |
| sep | The line separator character(s) |
| na_value | What to write in place of an NA string |
| encode_mode | "UTF-8" or "byte". If "UTF-8", text strings are normalized to UTF-8 while 'CE_BYTES' strings are written as raw bytes. |
A function for writing text data using 'std::ofstream'.
writeLines
file <- tempfile()
sf_writeLines(letters, file)
sf_readLines(file)
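The 'na_value' argument controls what is written for missing strings. The sketch below writes a vector containing an NA and reads it back, assuming the package is installed; the placeholder text "<missing>" and the variable names are arbitrary choices for illustration.

```r
library(stringfish)

file <- tempfile()

# The NA element is written as the placeholder given by na_value
sf_writeLines(c("alpha", NA, "gamma"), file, na_value = "<missing>")

# Reading back yields the placeholder as ordinary text
lines <- sf_readLines(file)

# Clean up the temporary file
unlink(file)
```

Note that the round trip is lossy: the placeholder comes back as a regular string, not as NA.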
Creates a new empty slice-store-backed stringfish vector
slice_store_create(len)
| len | Length of the new vector |
|---|---|
This function creates a new stringfish vector backed by 'slice_store', which stores string bytes in append-only slices plus per-element records. If you want to fill the vector from character data, use 'convert_to_slice_store'.
A new slice-store-backed stringfish vector
x <- slice_store_create(4)
Creates a new empty slice-store-backed stringfish vector with a fixed initial slice size
slice_store_create_with_size(len, initial_slice_size)
| len | Length of the new vector |
|---|---|
| initial_slice_size | Initial size of the first underlying 'slice_store' slice. |
This function creates a new stringfish vector backed by 'slice_store', and uses 'initial_slice_size' for the first slice allocation instead of the default heuristic. If you want to fill the vector from character data, use 'convert_to_slice_store'.
A new slice-store-backed stringfish vector
x <- slice_store_create_with_size(4, 256)
Compare strings semantically or exactly
string_identical(x, y, mode = c("semantic", "exact"))
| x | A character vector |
|---|---|
| y | Another character vector to compare to x |
| mode | Either '"semantic"' to compare text after normalizing non-byte strings to UTF-8, or '"exact"' to additionally require matching encoding. Strings marked as '"bytes"' are always compared exactly. |
TRUE if strings are identical under the selected comparison mode
identical
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
y <- iconv(x, "latin1", "UTF-8")
identical(x, y) # TRUE
string_identical(x, y) # TRUE
string_identical(x, y, mode = "exact") # FALSE