Helper function for generator_fasta_lm. Encodes an integer sequence into an input/target list according to the output_format argument.
Usage
seq_encoding_lm(
  sequence = NULL,
  maxlen,
  vocabulary,
  start_ind,
  ambiguous_nuc = "zero",
  nuc_dist = NULL,
  quality_vector = NULL,
  return_int = FALSE,
  target_len = 1,
  use_coverage = FALSE,
  max_cov = NULL,
  cov_vector = NULL,
  n_gram = NULL,
  n_gram_stride = 1,
  output_format = "target_right",
  char_sequence = NULL,
  adjust_start_ind = FALSE,
  tokenizer = NULL
)
Arguments
- sequence: Sequence of integers.
- maxlen: Length of the predictor sequence.
- vocabulary: Vector of allowed characters. Characters outside the vocabulary get encoded as specified in ambiguous_nuc.
- start_ind: Start positions of samples in sequence.
- ambiguous_nuc: How to handle nucleotides outside the vocabulary; one of "zero", "empirical" or "equal". See train_model. Note that the "discard" option is not available for this function.
- nuc_dist: Nucleotide distribution.
- quality_vector: Vector of quality probabilities.
- return_int: Whether to return integer encoding or one-hot encoding (see the sketch in the Examples below).
- target_len: Number of nucleotides to predict at once for the language model.
- use_coverage: Integer or NULL. If not NULL, use coverage as encoding rather than one-hot encoding and normalize. Coverage information must be contained in the fasta header: there must be a string "cov_n" in the header, where n is some integer.
- max_cov: Biggest coverage value. Only applies if use_coverage is not NULL.
- cov_vector: Vector of coverage values associated with the input.
- n_gram: Integer. Encode the target not nucleotide-wise but by combining n nucleotides at once. For example, for n = 2: "AA" -> (1, 0, ..., 0), "AC" -> (0, 1, 0, ..., 0), "TT" -> (0, ..., 0, 1), where the one-hot vectors have length length(vocabulary)^n.
- n_gram_stride: Step size for n-gram encoding. For "AACCGGTT" with n_gram = 4 and n_gram_stride = 2, the generator encodes (AACC), (CCGG), (GGTT); for n_gram_stride = 4 it encodes (AACC), (GGTT). See the plain-R illustration at the end of the Examples.
- output_format: Determines the shape of the output tensor for the language model. One of "target_right", "target_middle_lstm", "target_middle_cnn" or "wavenet". For the sequence "AACCGTA", input X and target Y correspond as follows (see also the Examples below):
  - "target_right": X = "AACCGT", Y = "A"
  - "target_middle_lstm": X = (X_1 = "AAC", X_2 = "ATG"), Y = "C" (note the reversed order of X_2)
  - "target_middle_cnn": X = "AACGTA", Y = "C"
  - "wavenet": X = "AACCGT", Y = "ACCGTA"
- char_sequence: A character string.
- adjust_start_ind: Whether to shift values in start_ind so they start at 1; for example, (5, 11, 25) becomes (1, 7, 21).
- tokenizer: A keras tokenizer.
Examples
if (FALSE) { # reticulate::py_module_available("tensorflow")
# use integer sequence as input 
z <- seq_encoding_lm(sequence = c(1, 0, 5, 1, 3, 4, 3, 1, 4, 1, 2),
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1, 3),
                     ambiguous_nuc = "equal",
                     target_len = 1,
                     output_format = "target_right")
x <- z[[1]]
y <- z[[2]]
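# x is a one-hot array of shape (samples, maxlen, vocabulary size):
dim(x) # expect 2 5 4 (2 samples, maxlen 5, 4 characters)
dim(y) # expect 2 4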
x[1,,] # 1,0,5,1,3
y[1,] # 4
x[2,,] # 5,1,3,4,3
y[2,] # 1
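
# A sketch of integer encoding (assumes the same inputs as above): with
# return_int = TRUE the function should return integer tokens instead of
# one-hot vectors, as described under the return_int argument.
z_int <- seq_encoding_lm(sequence = c(1, 0, 5, 1, 3, 4, 3, 1, 4, 1, 2),
                         maxlen = 5,
                         vocabulary = c("a", "c", "g", "t"),
                         start_ind = c(1, 3),
                         ambiguous_nuc = "equal",
                         target_len = 1,
                         return_int = TRUE,
                         output_format = "target_right")
x_int <- z_int[[1]]
x_int[1, ] # integer tokens for the subsequence 1,0,5,1,3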
# use character string as input
z <- seq_encoding_lm(sequence = NULL,
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1, 3),
                     ambiguous_nuc = "zero",
                     target_len = 1,
                     output_format = "target_right",
                     char_sequence = "ACTaaTNTNaZ")
x <- z[[1]]
y <- z[[2]]
x[1,,] # actaa
y[1,] # t
x[2,,] # taatn
y[2,] # t
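
# A sketch for a different output_format (assumed call, analogous to the
# cases listed under output_format): with "target_middle_lstm" the input
# should be a list of two tensors, the second one in reversed order.
z <- seq_encoding_lm(sequence = NULL,
                     maxlen = 6,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = 1,
                     ambiguous_nuc = "zero",
                     target_len = 1,
                     output_format = "target_middle_lstm",
                     char_sequence = "AACCGTA")
x_1 <- z[[1]][[1]] # first part of the input, "AAC"
x_2 <- z[[1]][[2]] # second part in reversed order, "ATG"
y <- z[[2]]        # middle target, "C"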
}
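# Plain-R illustration (no tensorflow needed) of which windows n_gram and
# n_gram_stride select, following the n_gram_stride description above:
chars <- strsplit("AACCGGTT", "")[[1]]
n_gram <- 4
starts <- seq(1, length(chars) - n_gram + 1, by = 2) # n_gram_stride = 2
sapply(starts, function(i) paste(chars[i:(i + n_gram - 1)], collapse = ""))
# "AACC" "CCGG" "GGTT"; with by = 4 (n_gram_stride = 4): "AACC" "GGTT"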
