Predict the next nucleotide using an n-gram model.
Usage
predict_with_n_gram(
  path_input,
  distribution_matrix,
  default_pred = "random",
  vocabulary = c("A", "C", "G", "T"),
  file_sample = NULL,
  format = "fasta",
  return_data_frames = FALSE,
  step = 1
)
Arguments
- path_input
Path to a folder containing fasta files or to a single fasta file.
- distribution_matrix
A data frame containing the frequency of the next nucleotide given the previous n nucleotides (output of the n_gram_dist function).
- default_pred
Either a character from vocabulary or "random". Used as the prediction if an n-gram did not appear before. If "random", a random prediction is assigned.
- vocabulary
Vector of allowed characters. Samples outside the vocabulary are discarded.
- file_sample
If integer, size of the random sample of files in path_input.
- format
File format, either "fasta" or "fastq".
- return_data_frames
Boolean; whether to return a data frame with input, predictions, target position and true target.
- step
How often to take a sample.
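
To illustrate how distribution_matrix and default_pred interact, here is a minimal base R sketch of n-gram prediction. It is not the package's implementation: count_next and predict_next are hypothetical helpers introduced only for this illustration, and the sketch assumes the input sequence contains only characters from the vocabulary.

```r
# Hypothetical helpers for illustration; not part of the package.
vocabulary <- c("A", "C", "G", "T")

# For each n-gram in the sequence, count how often each character follows it.
count_next <- function(sequence, n, vocabulary) {
  chars <- strsplit(sequence, "")[[1]]
  counts <- list()
  for (i in seq_len(max(0, length(chars) - n))) {
    gram <- paste(chars[i:(i + n - 1)], collapse = "")
    target <- chars[i + n]
    if (is.null(counts[[gram]])) {
      counts[[gram]] <- setNames(integer(length(vocabulary)), vocabulary)
    }
    counts[[gram]][target] <- counts[[gram]][target] + 1L
  }
  counts
}

# Predict the most frequent follower of an n-gram.
# Unseen n-grams fall back to default_pred (a fixed character or "random").
predict_next <- function(gram, counts, vocabulary, default_pred = "random") {
  freq <- counts[[gram]]
  if (is.null(freq)) {
    if (identical(default_pred, "random")) {
      return(sample(vocabulary, 1))
    }
    return(default_pred)
  }
  names(freq)[which.max(freq)]
}

counts <- count_next("ACGTACGTACGA", n = 3, vocabulary = vocabulary)
seen <- predict_next("ACG", counts, vocabulary)                        # "ACG" was followed by T twice, A once
unseen_fixed <- predict_next("TTT", counts, vocabulary, default_pred = "A")
unseen_random <- predict_next("TTT", counts, vocabulary)               # random draw from vocabulary
```

In this sketch a fixed default_pred gives reproducible predictions for unseen n-grams, while "random" avoids biasing the output toward one nucleotide.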
Examples
# create dummy fasta files
temp_dir <- tempfile()
dir.create(temp_dir)
create_dummy_data(file_path = temp_dir,
                  num_files = 3,
                  seq_length = 8,
                  vocabulary = c("A", "C", "G", "T"),
                  num_seq = 2)
m <- n_gram_dist(path_input = temp_dir,
                 n = 3,
                 step = 1,
                 nuc_dist = FALSE)
# use distribution matrix to make predictions for one file
predictions <- predict_with_n_gram(path_input = list.files(temp_dir, full.names = TRUE)[1],
                                   distribution_matrix = m)
# show accuracy
predictions[[1]]
#> $accuracy
#> [1] 1
#>
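
As a rough illustration of what an accuracy value and a data frame of input, prediction, target position and true target could look like, the following base R sketch evaluates n-gram predictions on the same sequence the counts were built from. It is an assumption-laden sketch, not the package's code: evaluate_n_gram is a hypothetical helper, and treating step as "evaluate every step-th position" is one possible reading of the argument.

```r
# Hypothetical helper for illustration; not part of the package.
evaluate_n_gram <- function(sequence, n, step = 1) {
  chars <- strsplit(sequence, "")[[1]]
  starts <- seq_len(length(chars) - n)   # assumes the sequence is longer than n
  grams <- vapply(starts,
                  function(i) paste(chars[i:(i + n - 1)], collapse = ""),
                  character(1))
  targets <- chars[starts + n]

  # Most frequent follower for every observed n-gram.
  tab <- table(grams, targets)
  best <- colnames(tab)[apply(tab, 1, which.max)]
  names(best) <- rownames(tab)

  # Assumption: step = k keeps every k-th position for evaluation.
  keep <- seq(1, length(starts), by = step)
  data.frame(input = grams[keep],
             prediction = unname(best[grams[keep]]),
             target_position = starts[keep] + n,
             true_target = targets[keep],
             stringsAsFactors = FALSE)
}

res <- evaluate_n_gram("ACGTACGTACGA", n = 3, step = 2)
accuracy <- mean(res$prediction == res$true_target)
```

Because the counts and the evaluation use the same sequence here, the accuracy is optimistic; the same caveat applies whenever predictions are made on files that were also used to build the distribution matrix.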