Helper function for `generator_fasta_lm`. Encodes an integer sequence to an input/target list according to the `output_format` argument.

## Usage

```
seq_encoding_lm(
  sequence = NULL,
  maxlen,
  vocabulary,
  start_ind,
  ambiguous_nuc = "zero",
  nuc_dist = NULL,
  quality_vector = NULL,
  return_int = FALSE,
  target_len = 1,
  use_coverage = FALSE,
  max_cov = NULL,
  cov_vector = NULL,
  n_gram = NULL,
  n_gram_stride = 1,
  output_format = "target_right",
  char_sequence = NULL,
  adjust_start_ind = FALSE,
  tokenizer = NULL
)
```

## Arguments

- sequence
Sequence of integers.

- maxlen
Length of predictor sequence.

- vocabulary
Vector of allowed characters. Characters outside the vocabulary get encoded as specified in `ambiguous_nuc`.

- start_ind
Start positions of samples in `sequence`.

- ambiguous_nuc
How to handle nucleotides outside the vocabulary, either `"zero"`, `"empirical"` or `"equal"`. See `train_model`. Note that the `"discard"` option is not available for this function.

- nuc_dist
Nucleotide distribution.

- quality_vector
Vector of quality probabilities.

- return_int
Whether to return integer encoding or one-hot encoding.

- target_len
Number of nucleotides to predict at once for language model.

- use_coverage
Integer or `NULL`. If not `NULL`, use coverage as encoding rather than one-hot encoding and normalize. Coverage information must be contained in the fasta header: there must be a string `"cov_n"` in the header, where `n` is some integer.

- max_cov
Biggest coverage value. Only applies if `use_coverage = TRUE`.

- cov_vector
Vector of coverage values associated with the input.

- n_gram
Integer, encode target not nucleotide-wise but combine n nucleotides at once. For example, for `n = 2`: `"AA" -> (1, 0, ..., 0)`, `"AC" -> (0, 1, 0, ..., 0)`, `"TT" -> (0, ..., 0, 1)`, where the one-hot vectors have length `length(vocabulary)^n`.

- n_gram_stride
Step size for n-gram encoding. For AACCGGTT with `n_gram = 4` and `n_gram_stride = 2`, the generator encodes `(AACC), (CCGG), (GGTT)`; for `n_gram_stride = 4`, the generator encodes `(AACC), (GGTT)`.

- output_format
Determines the shape of the output tensor for the language model. Either `"target_right"`, `"target_middle_lstm"`, `"target_middle_cnn"` or `"wavenet"`. Assume a sequence `"AACCGTA"`. The outputs correspond as follows:

  - `"target_right"`: X = "AACCGT", Y = "A"
  - `"target_middle_lstm"`: X = (X_1 = "AAC", X_2 = "ATG"), Y = "C" (note reversed order of X_2)
  - `"target_middle_cnn"`: X = "AACGTA", Y = "C"
  - `"wavenet"`: X = "AACCGT", Y = "ACCGTA"

- char_sequence
A character string.

- adjust_start_ind
Whether to shift values in `start_ind` to start at 1: for example, (5, 11, 25) becomes (1, 7, 21).

- tokenizer
A keras tokenizer.
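
The `output_format` cases above can be illustrated with plain string operations. This is a sketch for intuition only; it uses base R string functions, not the package's encoding path:

```
# Illustration only: slice "AACCGTA" the way each output_format does
s <- "AACCGTA"
substr(s, 1, 6)  # "target_right" X:  "AACCGT"
substr(s, 7, 7)  # "target_right" Y:  "A"
substr(s, 1, 3)  # "target_middle_lstm" X_1: "AAC"
substr(s, 4, 4)  # "target_middle_lstm" Y:   "C"
paste(rev(strsplit(substr(s, 5, 7), "")[[1]]), collapse = "")  # X_2 reversed: "ATG"
```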

## Examples

```
if (FALSE) { # reticulate::py_module_available("tensorflow")
# use integer sequence as input
z <- seq_encoding_lm(sequence = c(1,0,5,1,3,4,3,1,4,1,2),
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1,3),
                     ambiguous_nuc = "equal",
                     target_len = 1,
                     output_format = "target_right")
x <- z[[1]]
y <- z[[2]]
x[1,,] # 1,0,5,1,3
y[1,] # 4
x[2,,] # 5,1,3,4,3
y[2,] # 1
# use character string as input
z <- seq_encoding_lm(sequence = NULL,
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1,3),
                     ambiguous_nuc = "zero",
                     target_len = 1,
                     output_format = "target_right",
                     char_sequence = "ACTaaTNTNaZ")
x <- z[[1]]
y <- z[[2]]
x[1,,] # actaa
y[1,] # t
x[2,,] # taatn
y[2,] # t
}
```
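
A further sketch of the `n_gram` argument. This is untested here and the exact array shapes and the interplay of `target_len` with `n_gram` may vary between package versions; the point is that n-gram targets use one-hot vectors of length `length(vocabulary)^n`:

```
if (FALSE) { # reticulate::py_module_available("tensorflow")
# encode targets as 2-grams (assumes target_len = 2 supplies enough
# nucleotides for one 2-gram target per sample)
z <- seq_encoding_lm(sequence = c(1,0,5,1,3,4,3,1,4,1,2),
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1,3),
                     target_len = 2,
                     n_gram = 2,
                     n_gram_stride = 1,
                     output_format = "target_right")
dim(z[[2]]) # last dimension should be length(vocabulary)^2 = 16
}
```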