Returns encoding for integer or character sequence.
Usage
seq_encoding_label(
  sequence = NULL,
  maxlen,
  vocabulary,
  start_ind,
  ambiguous_nuc = "zero",
  nuc_dist = NULL,
  quality_vector = NULL,
  use_coverage = FALSE,
  max_cov = NULL,
  cov_vector = NULL,
  n_gram = NULL,
  n_gram_stride = 1,
  masked_lm = NULL,
  char_sequence = NULL,
  tokenizer = NULL,
  adjust_start_ind = FALSE,
  return_int = FALSE
)
Arguments
- sequence
Sequence of integers.
- maxlen
Length of predictor sequence.
- vocabulary
Vector of allowed characters. Characters outside vocabulary get encoded as specified in ambiguous_nuc.
- start_ind
Start positions of samples in sequence.
- ambiguous_nuc
How to handle nucleotides outside vocabulary, either "zero", "empirical" or "equal". See train_model. Note that the "discard" option is not available for this function.
- nuc_dist
Nucleotide distribution.
- quality_vector
Vector of quality probabilities.
- use_coverage
Integer or
NULL
. If notNULL
, use coverage as encoding rather than one-hot encoding and normalize. Coverage information must be contained in fasta header: there must be a string"cov_n"
in the header, wheren
is some integer.- max_cov
Biggest coverage value. Only applies if use_coverage = TRUE.
- cov_vector
Vector of coverage values associated to the input.
- n_gram
Integer. Encode the target not nucleotide-wise but by combining n nucleotides at once. For example, for n = 2: "AA" -> (1, 0,..., 0), "AC" -> (0, 1, 0,..., 0), "TT" -> (0,..., 0, 1), where the one-hot vectors have length length(vocabulary)^n.
- n_gram_stride
Step size for n-gram encoding. For AACCGGTT with n_gram = 4 and n_gram_stride = 2, generator encodes (AACC), (CCGG), (GGTT); for n_gram_stride = 4 generator encodes (AACC), (GGTT).
- masked_lm
If not NULL, input and target are equal except that some parts of the input are masked or random. Must be a list with the following arguments:
  - mask_rate: Rate of input to mask (rate of input to replace with mask token).
  - random_rate: Rate of input to set to random token.
  - identity_rate: Rate of input where sample weights are applied but input and output are identical.
  - include_sw: Whether to include sample weights.
  - block_len (optional): Masked/random/identity regions appear in blocks of size block_len.
- char_sequence
A character string.
- tokenizer
A keras tokenizer.
- adjust_start_ind
Whether to shift values in start_ind to start at 1: for example (5,11,25) becomes (1,7,21).
- return_int
Whether to return integer encoding instead of one-hot encoding.
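The n_gram, n_gram_stride and adjust_start_ind descriptions above can be illustrated with a small base-R sketch. The helpers ngram_windows, ngram_index and shift_start_ind are hypothetical names introduced only for illustration and are not part of the package. The lexicographic ordering of n-grams is an assumption that merely agrees with the three cases documented above ("AA" first, "AC" second, "TT" last); the package's internal ordering may differ.

```r
# Hypothetical helpers for illustration only; not the package implementation.

# Split a string into n-grams of length n, moving `stride` characters
# between the starts of consecutive n-grams.
ngram_windows <- function(s, n, stride = 1) {
  starts <- seq(1, nchar(s) - n + 1, by = stride)
  substring(s, starts, starts + n - 1)
}

# Position of the 1 in a one-hot vector of length length(vocabulary)^n,
# assuming n-grams are ordered lexicographically by vocabulary position.
ngram_index <- function(gram, vocabulary) {
  chars <- strsplit(gram, "")[[1]]
  digits <- match(chars, vocabulary) - 1   # 0-based vocabulary positions
  n <- length(chars)
  sum(digits * length(vocabulary)^((n - 1):0)) + 1
}

# Shift start indices so that the first one becomes 1.
shift_start_ind <- function(start_ind) start_ind - start_ind[1] + 1

vocabulary <- c("A", "C", "G", "T")

ngram_windows("AACCGGTT", n = 4, stride = 2)  # "AACC" "CCGG" "GGTT"
ngram_windows("AACCGGTT", n = 4, stride = 4)  # "AACC" "GGTT"

ngram_index("AA", vocabulary)  # 1
ngram_index("AC", vocabulary)  # 2
ngram_index("TT", vocabulary)  # 16, i.e. length(vocabulary)^2

shift_start_ind(c(5, 11, 25))  # 1 7 21
```

With a vocabulary of 4 characters, the one-hot vectors grow as 4^n, so larger values of n_gram quickly produce very wide targets.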
Examples
if (FALSE) { # reticulate::py_module_available("tensorflow")
# use integer sequence as input
x <- seq_encoding_label(sequence = c(1,0,5,1,3,4,3,1,4,1,2),
                        maxlen = 5,
                        vocabulary = c("a", "c", "g", "t"),
                        start_ind = c(1,3),
                        ambiguous_nuc = "equal")
x[1,,] # 1,0,5,1,3
x[2,,] # 5,1,3,4,3
# use character string as input
x <- seq_encoding_label(maxlen = 5,
                        vocabulary = c("a", "c", "g", "t"),
                        start_ind = c(1,3),
                        ambiguous_nuc = "equal",
                        char_sequence = "ACTaaTNTNaZ")
x[1,,] # actaa
x[2,,] # taatn
}
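For readers without a TensorFlow installation, the following base-R sketch mimics what the character-string example is expected to produce. It is not the package implementation: encode_windows is a hypothetical helper, and it assumes that input is lower-cased before matching, that "zero" yields an all-zero row, and that "equal" yields 1/length(vocabulary) in every position, as the ambiguous_nuc options are described for train_model. The "empirical" option is not covered.

```r
# Hypothetical base-R sketch; not the package implementation.
encode_windows <- function(char_sequence, maxlen, vocabulary, start_ind,
                           ambiguous_nuc = "equal") {
  chars <- strsplit(tolower(char_sequence), "")[[1]]
  v <- length(vocabulary)
  # One slice per sample: (number of samples) x maxlen x length(vocabulary)
  out <- array(0, dim = c(length(start_ind), maxlen, v))
  for (i in seq_along(start_ind)) {
    window <- chars[start_ind[i]:(start_ind[i] + maxlen - 1)]
    idx <- match(window, vocabulary)  # NA for characters outside vocabulary
    for (j in seq_len(maxlen)) {
      if (!is.na(idx[j])) {
        out[i, j, idx[j]] <- 1        # regular one-hot row
      } else if (ambiguous_nuc == "equal") {
        out[i, j, ] <- 1 / v          # spread mass equally over vocabulary
      }                               # "zero": row stays all zeros
    }
  }
  out
}

x <- encode_windows("ACTaaTNTNaZ", maxlen = 5,
                    vocabulary = c("a", "c", "g", "t"),
                    start_ind = c(1, 3))
dim(x)     # 2 5 4
x[1, 1, ]  # 1 0 0 0 (the "a" at position 1)
x[2, 5, ]  # 0.25 0.25 0.25 0.25 (the "n" at position 7)
```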