Get distribution of next character given previous n nucleotides.
Usage
n_gram_dist(
  path_input,
  n = 2,
  vocabulary = c("A", "C", "G", "T"),
  format = "fasta",
  file_sample = NULL,
  step = 1,
  nuc_dist = FALSE
)
Arguments
- path_input
- Path to folder containing fasta files or single fasta file. 
- n
- Size of the n-gram, i.e. the number of preceding nucleotides to condition on.
- vocabulary
- Vector of allowed characters. Samples containing characters outside the vocabulary are discarded.
- format
- File format, either "fasta" or "fastq".
- file_sample
- If an integer, the size of a random sample of files in path_input.
- step
- How often to take a sample. 
- nuc_dist
- Nucleotide distribution. 
Value
Returns a matrix with distributions of nucleotides given the previous n nucleotides.
A data frame of n-gram predictions.
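To make the returned quantity concrete, the following base R sketch estimates the distribution of the next character given the previous n characters from a single string. It is an illustration only and not the package's implementation: the helper name next_char_dist and its arguments are hypothetical, and it ignores file handling, file_sample, step and nuc_dist.

```r
# Hypothetical helper (an assumption for illustration, not part of the package):
# estimate P(next character | previous n characters) from one string.
next_char_dist <- function(seq_string, n = 2, vocabulary = c("A", "C", "G", "T")) {
  chars <- strsplit(seq_string, "")[[1]]
  # Enumerate every possible context of length n; these become the row names.
  contexts <- apply(expand.grid(rep(list(vocabulary), n)), 1, paste, collapse = "")
  counts <- matrix(0, nrow = length(contexts), ncol = length(vocabulary),
                   dimnames = list(contexts, vocabulary))
  if (length(chars) > n) {
    for (i in seq_len(length(chars) - n)) {
      context <- paste(chars[i:(i + n - 1)], collapse = "")
      nxt <- chars[i + n]
      # Skip windows that contain characters outside the vocabulary.
      if (context %in% contexts && nxt %in% vocabulary) {
        counts[context, nxt] <- counts[context, nxt] + 1
      }
    }
  }
  # Normalise each row to relative frequencies; contexts never seen give NaN.
  counts / rowSums(counts)
}

d <- next_char_dist("ACGTACGTACGA", n = 2)
d["CG", ]  # "CG" is followed by "T" twice and by "A" once
```

Each row of d is indexed by a context and each column by a candidate next character, which mirrors the layout of the output shown in the Examples section.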
Examples
temp_dir <- tempfile()
dir.create(temp_dir)
create_dummy_data(file_path = temp_dir,
                  num_files = 3,
                  seq_length = 80,
                  vocabulary = c("A", "C", "G", "T"),
                  num_seq = 2)
m <- n_gram_dist(path_input = temp_dir,
                 n = 3,
                 step = 1,
                 nuc_dist = FALSE)
head(round(m, 2))
#>        A    C    G    T
#> AAA 0.25 0.50 0.12 0.12
#> CAA 0.73 0.00 0.00 0.27
#> GAA 0.29 0.29 0.29 0.14
#> TAA 0.40 0.40 0.20 0.00
#> ACA 0.71 0.14 0.14 0.00
#> CCA 0.00 0.29 0.29 0.43
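As a small sketch of how such a result might be read, the snippet below rebuilds the six printed rows by hand and looks up individual contexts. It assumes, as the printed output suggests, that rows index the preceding nucleotides and columns the next nucleotide. The object m_head is a hypothetical stand-in for head(round(m, 2)); because the dummy data are generated randomly, a fresh run will generally give different numbers.

```r
# Values copied from the rounded output printed above (hypothetical stand-in for m).
m_head <- matrix(
  c(0.25, 0.50, 0.12, 0.12,
    0.73, 0.00, 0.00, 0.27,
    0.29, 0.29, 0.29, 0.14,
    0.40, 0.40, 0.20, 0.00,
    0.71, 0.14, 0.14, 0.00,
    0.00, 0.29, 0.29, 0.43),
  ncol = 4, byrow = TRUE,
  dimnames = list(c("AAA", "CAA", "GAA", "TAA", "ACA", "CCA"),
                  c("A", "C", "G", "T"))
)

# Distribution of the next nucleotide after the context "CAA".
m_head["CAA", ]

# Each row should sum to roughly 1 (not exactly, because of rounding).
rowSums(m_head)

# Most frequent next nucleotide per context.
# which.max returns the first maximum, so ties resolve to the leftmost column.
most_likely <- colnames(m_head)[apply(m_head, 1, which.max)]
names(most_likely) <- rownames(m_head)
most_likely
```

Ties such as the one in row TAA are partly an artifact of rounding, so it is safer to apply this kind of lookup to the unrounded matrix.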
