Helper function for data generators. Computes the start positions in a sequence from which samples can be extracted, given maxlen, step size, and constraints on ambiguous nucleotides.

Usage

get_start_ind(
  seq_vector,
  length_vector,
  maxlen,
  step,
  train_mode = "label",
  discard_amb_nuc = FALSE,
  vocabulary = c("A", "C", "G", "T")
)

Arguments

seq_vector

Vector of character sequences.

length_vector

Lengths of the sequences in seq_vector.

maxlen

Length of one predictor sequence.

step

Distance between samples from one entry in seq_vector.

train_mode

Either "lm" for language model or "label" for label classification.

discard_amb_nuc

Whether to discard all samples that contain characters outside vocabulary.

vocabulary

Vector of allowed characters. Characters outside vocabulary are encoded as specified in ambiguous_nuc (an argument of the data generators, not of this function).

Value

A numeric vector of start positions.

Examples

seq_vector <- c("AAACCCNNNGGGTTT")
get_start_ind(
  seq_vector = seq_vector,
  length_vector = nchar(seq_vector),
  maxlen = 4,
  step = 2,
  train_mode = "label",
  discard_amb_nuc = TRUE,
  vocabulary = c("A", "C", "G", "T"))
#> [1]  1  3 10 12
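
The output above suggests that windows overlapping the ambiguous stretch (NNN) are dropped and that stepping restarts at the first valid position after it (10, not 11). The following is a minimal sketch of that logic for a single sequence in "label" mode, inferred only from this example. sketch_start_ind is a hypothetical name, not part of the package, and the actual implementation may differ, in particular for several sequences or for train_mode = "lm".

```r
# Hypothetical helper, NOT the package implementation: start positions for ONE
# sequence, assuming "label" mode and discard_amb_nuc = TRUE.
sketch_start_ind <- function(dna, maxlen, step,
                             vocabulary = c("A", "C", "G", "T")) {
  chars <- strsplit(dna, "")[[1]]
  valid <- chars %in% vocabulary

  # Run-length encode the validity mask to find stretches of allowed characters.
  runs <- rle(valid)
  run_end <- cumsum(runs$lengths)
  run_start <- run_end - runs$lengths + 1

  starts <- numeric(0)
  for (i in seq_along(runs$values)) {
    # Keep only valid stretches long enough to hold one full window.
    if (runs$values[i] && runs$lengths[i] >= maxlen) {
      last_start <- run_end[i] - maxlen + 1
      # Assumption: the step grid restarts at the beginning of each stretch.
      starts <- c(starts, seq(run_start[i], last_start, by = step))
    }
  }
  starts
}

sketch_start_ind("AAACCCNNNGGGTTT", maxlen = 4, step = 2)
#> [1]  1  3 10 12
```

Each returned position p marks a window covering characters p to p + maxlen - 1 that contains only characters from vocabulary.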