Helper function for data generators. Computes the start positions in a sequence at which samples can be extracted, given maxlen, the step size and the ambiguous nucleotide constraints.
Usage
get_start_ind(
  seq_vector,
  length_vector,
  maxlen,
  step,
  train_mode = "label",
  discard_amb_nuc = FALSE,
  vocabulary = c("A", "C", "G", "T")
)
Arguments
- seq_vector
- Vector of character sequences. 
- length_vector
- Lengths of the sequences in seq_vector.
- maxlen
- Length of one predictor sequence. 
- step
- Distance between the start positions of consecutive samples taken from one entry in seq_vector.
- train_mode
- Either "lm" for language model or "label" for label classification.
- discard_amb_nuc
- Whether to discard all samples that contain characters outside vocabulary. 
- vocabulary
- Vector of allowed characters. Characters outside vocabulary are encoded as specified in ambiguous_nuc.
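
To make the interplay of maxlen, step and discard_amb_nuc concrete, here is a minimal base R sketch of the idea for a single sequence. It is not the package's implementation: sketch_start_ind is a hypothetical name, it works on one string rather than on seq_vector and length_vector, and it assumes that "lm" mode needs one extra character after each window to serve as the prediction target.

```r
# Hypothetical illustration, not the implementation of get_start_ind().
sketch_start_ind <- function(sequence_string, maxlen, step,
                             train_mode = "label",
                             discard_amb_nuc = FALSE,
                             vocabulary = c("A", "C", "G", "T")) {
  stopifnot(train_mode %in% c("lm", "label"))
  # Assumption: a language-model sample needs maxlen + 1 characters
  # (maxlen inputs plus one target character).
  window <- if (train_mode == "lm") maxlen + 1 else maxlen
  n <- nchar(sequence_string)
  if (n < window) return(integer(0))

  # Candidate start positions (1-based), spaced `step` apart.
  starts <- seq.int(1, n - window + 1, by = step)
  if (!discard_amb_nuc) return(starts)

  # Flag characters outside the vocabulary and drop every window
  # that covers at least one of them.
  chars <- strsplit(sequence_string, "")[[1]]
  is_amb <- !(chars %in% vocabulary)
  keep <- vapply(
    starts,
    function(s) !any(is_amb[s:(s + window - 1)]),
    logical(1)
  )
  starts[keep]
}

# Position 5 holds the ambiguous character "N".
all_starts   <- sketch_start_ind("ACGTNACGTA", maxlen = 3, step = 2)
clean_starts <- sketch_start_ind("ACGTNACGTA", maxlen = 3, step = 2,
                                 discard_amb_nuc = TRUE)
```

In this sketch all_starts contains 1, 3, 5 and 7, while clean_starts keeps only 1 and 7, because the windows starting at positions 3 and 5 overlap the "N".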
