Get distribution of next character given previous n nucleotides.
Usage
n_gram_dist(
  path_input,
  n = 2,
  vocabulary = c("A", "C", "G", "T"),
  format = "fasta",
  file_sample = NULL,
  step = 1,
  nuc_dist = FALSE
)
Arguments
- path_input
Path to folder containing fasta files or single fasta file.
- n
Size of the n-gram, i.e. the number of previous nucleotides to condition on.
- vocabulary
Vector of allowed characters; samples containing characters outside the vocabulary are discarded.
- format
File format, either "fasta" or "fastq".
- file_sample
If integer, size of random sample of files in path_input.
- step
How often to take a sample.
- nuc_dist
Nucleotide distribution.
Value
Returns a matrix with distributions of nucleotides given the previous n nucleotides.
A data frame of n-gram predictions.
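To make the returned quantity concrete, the following base R sketch estimates the distribution of the next character given the previous n characters for a single string. It is an illustration only, not the package's implementation: `next_char_dist()` and its `dna` argument are hypothetical names, and treating `step` as the stride between consecutive windows is an assumption about what "how often to take a sample" means.

```r
# Hypothetical helper, NOT part of the package: estimate P(next | previous n)
# from one character string using base R only.
next_char_dist <- function(dna, n = 2, vocabulary = c("A", "C", "G", "T"),
                           step = 1) {
  chars <- strsplit(dna, "")[[1]]
  num_windows <- length(chars) - n
  if (num_windows < 1) stop("sequence must be longer than n")

  # Assumption: `step` is the stride between consecutive window starts.
  starts <- seq(1, num_windows, by = step)

  # Context = n consecutive characters, target = the character that follows.
  contexts <- vapply(
    starts,
    function(i) paste(chars[i:(i + n - 1)], collapse = ""),
    character(1)
  )
  targets <- chars[starts + n]

  # Discard windows that contain any character outside the vocabulary.
  keep <- targets %in% vocabulary &
    vapply(
      starts,
      function(i) all(chars[i:(i + n - 1)] %in% vocabulary),
      logical(1)
    )

  counts <- table(
    context = contexts[keep],
    next_char = factor(targets[keep], levels = vocabulary)
  )

  # Normalise each row so that it sums to 1.
  prop.table(counts, margin = 1)
}

d <- next_char_dist("ACGTACGAACGT", n = 2)
round(d, 2)
```

Each row of `d` corresponds to one observed context and each column to one vocabulary entry, which mirrors the layout of the matrix described above. Unlike this sketch, the package function reads its input from fasta or fastq files.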
Examples
temp_dir <- tempfile()
dir.create(temp_dir)
create_dummy_data(file_path = temp_dir,
                  num_files = 3,
                  seq_length = 80,
                  vocabulary = c("A", "C", "G", "T"),
                  num_seq = 2)
m <- n_gram_dist(path_input = temp_dir,
                 n = 3,
                 step = 1,
                 nuc_dist = FALSE)
head(round(m, 2))
#>        A    C    G    T
#> AAA 0.25 0.50 0.12 0.12
#> CAA 0.73 0.00 0.00 0.27
#> GAA 0.29 0.29 0.29 0.14
#> TAA 0.40 0.40 0.20 0.00
#> ACA 0.71 0.14 0.14 0.00
#> CCA 0.00 0.29 0.29 0.43
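One way such a matrix might be used is to read off the most probable next nucleotide for each context. The sketch below rebuilds the six rows printed above by hand so that it runs without the package; `m_example` and `most_likely` are hypothetical names, and resolving ties in favour of the first column is an arbitrary choice made for reproducibility.

```r
# Rebuild the six printed rows by hand (values copied from the output above).
m_example <- matrix(
  c(0.25, 0.50, 0.12, 0.12,
    0.73, 0.00, 0.00, 0.27,
    0.29, 0.29, 0.29, 0.14,
    0.40, 0.40, 0.20, 0.00,
    0.71, 0.14, 0.14, 0.00,
    0.00, 0.29, 0.29, 0.43),
  ncol = 4,
  byrow = TRUE,
  dimnames = list(
    c("AAA", "CAA", "GAA", "TAA", "ACA", "CCA"),
    c("A", "C", "G", "T")
  )
)

# Column index of the largest probability in each row; ties go to the
# first column (an arbitrary but deterministic choice).
best_col <- max.col(m_example, ties.method = "first")

# Map column indices back to nucleotide labels, named by context.
most_likely <- setNames(colnames(m_example)[best_col], rownames(m_example))
most_likely
```

Rows with tied maxima, such as GAA and TAA here, have no single most likely successor, so the reported label for them depends on the tie-breaking rule.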