Computes start positions of samples — get_start

Helper function for data generators. Computes the start positions in a sequence from which samples can be extracted, given maxlen, the step size, and the constraints on ambiguous nucleotides.

Usage

get_start_ind(
  seq_vector,
  length_vector,
  maxlen,
  step,
  train_mode = "label",
  discard_amb_nuc = FALSE,
  vocabulary = c("A", "C", "G", "T")
)

Arguments

seq_vector: Vector of character sequences.
length_vector: Lengths of the sequences in seq_vector.
maxlen: Length of one predictor sequence.
step: Distance between the start positions of consecutive samples from one entry in seq_vector.
train_mode: Either "lm" for language model or "label" for label classification.
discard_amb_nuc: Whether to discard all samples that contain characters outside vocabulary.
vocabulary: Vector of allowed characters. Characters outside vocabulary get encoded as specified in ambiguous_nuc, which is not an argument of this function (presumably it is set in the calling data generator).
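
As a rough illustration of how maxlen and step interact, the following base R sketch lists the candidate start positions for a single sequence when no ambiguity constraint is applied. It is an assumption about the general idea, not the package's implementation: it ignores train_mode, and the variable names (seq_len_total, candidate_starts) are hypothetical.

```r
# Illustrative only: candidate window starts for one sequence,
# ignoring train_mode and any ambiguous-nucleotide filtering.
seq_vector <- "AAACCCNNNGGGTTT"
maxlen <- 4
step <- 2

seq_len_total <- nchar(seq_vector)           # 15 characters
last_start <- seq_len_total - maxlen + 1     # last window that still fits: 12
candidate_starts <- seq(1, last_start, by = step)
candidate_starts
#> [1]  1  3  5  7  9 11
```

Each candidate start s covers characters s to s + maxlen - 1, so a window is only usable if that whole range lies inside the sequence.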

Value

A numeric vector of start positions.

Examples

seq_vector <- c("AAACCCNNNGGGTTT")
get_start_ind(
  seq_vector = seq_vector,
  length_vector = nchar(seq_vector),
  maxlen = 4,
  step = 2,
  train_mode = "label",
  discard_amb_nuc = TRUE,
  vocabulary = c("A", "C", "G", "T"))
#> [1]  1  3 10 12
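
The result above is consistent with restarting the step grid after each stretch of ambiguous characters: windows starting at 1 and 3 fit into "AAACCC", and windows starting at 10 and 12 fit into "GGGTTT". The base R sketch below reproduces this for a single sequence. It is a hypothetical helper (sketch_start_ind is not part of the package) written under that assumption, and it may differ from get_start_ind for other inputs, for several sequences, or for train_mode = "lm".

```r
# Hypothetical helper, not part of the package: start positions inside
# maximal runs of in-vocabulary characters of a single sequence.
sketch_start_ind <- function(x, maxlen, step,
                             vocabulary = c("A", "C", "G", "T")) {
  chars <- strsplit(x, "")[[1]]
  valid <- chars %in% vocabulary

  # Run-length encode the valid/invalid flags to find contiguous stretches.
  runs <- rle(valid)
  run_end <- cumsum(runs$lengths)
  run_start <- run_end - runs$lengths + 1

  starts <- numeric(0)
  for (i in seq_along(runs$values)) {
    # Keep only in-vocabulary runs long enough to hold one full sample.
    if (runs$values[i] && runs$lengths[i] >= maxlen) {
      last_start <- run_end[i] - maxlen + 1
      starts <- c(starts, seq(run_start[i], last_start, by = step))
    }
  }
  starts
}

sketch_start_ind("AAACCCNNNGGGTTT", maxlen = 4, step = 2)
#> [1]  1  3 10 12
```

Splitting into runs first means that no returned window can overlap an ambiguous character, which is why no separate filtering step is needed in this sketch.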