Helper function for data generators. Computes the start positions in each sequence from which samples can be extracted, given maxlen, the step size, and constraints on ambiguous nucleotides.
Usage
get_start_ind(
  seq_vector,
  length_vector,
  maxlen,
  step,
  train_mode = "label",
  discard_amb_nuc = FALSE,
  vocabulary = c("A", "C", "G", "T")
)
Arguments
- seq_vector
Vector of character sequences.
- length_vector
Lengths of the sequences in seq_vector.
- maxlen
Length of one predictor sequence.
- step
Distance between samples from one entry in seq_vector.
- train_mode
Either "lm" for language model or "label" for label classification.
- discard_amb_nuc
Whether to discard all samples that contain characters outside vocabulary.
- vocabulary
Vector of allowed characters. Characters outside vocabulary get encoded as specified in ambiguous_nuc.
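The interaction of maxlen, step and discard_amb_nuc can be illustrated with a minimal sketch. It is not the package's implementation: sketch_start_ind is a hypothetical helper that handles a single sequence and returns 1-based positions within that sequence, whereas get_start_ind takes a whole seq_vector together with length_vector and may index its result differently. The sketch also assumes that "lm" mode needs one extra character per sample as the prediction target; the text above does not state this.

```r
# Hypothetical helper for illustration only; not part of the package.
# Returns 1-based start positions of valid samples within ONE sequence.
sketch_start_ind <- function(x, maxlen, step, train_mode = "label",
                             discard_amb_nuc = FALSE,
                             vocabulary = c("A", "C", "G", "T")) {
  stopifnot(train_mode %in% c("lm", "label"))

  # Assumption: "lm" samples need maxlen predictors plus one target character.
  window <- if (train_mode == "lm") maxlen + 1 else maxlen

  n <- nchar(x)
  if (n < window) {
    return(numeric(0))  # sequence too short for even one sample
  }

  # Candidate starts: every `step` positions, as long as a full window fits.
  starts <- seq(1, n - window + 1, by = step)

  if (discard_amb_nuc) {
    chars <- strsplit(x, "")[[1]]
    is_amb <- !(chars %in% vocabulary)
    # Keep only windows that contain no character outside the vocabulary.
    keep <- vapply(
      starts,
      function(s) !any(is_amb[s:(s + window - 1)]),
      logical(1)
    )
    starts <- starts[keep]
  }

  starts
}

x <- "ACGTNACGTA"  # 10 characters, ambiguous "N" at position 5

sketch_start_ind(x, maxlen = 4, step = 2)
# 1 3 5 7

sketch_start_ind(x, maxlen = 4, step = 2, discard_amb_nuc = TRUE)
# 1 7   (windows starting at 3 and 5 contain "N")
```

Under these assumptions, a larger step thins out the candidate positions, while discard_amb_nuc = TRUE removes every window that overlaps a character outside vocabulary.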