
Helper function for generator_fasta_lm. Encodes an integer sequence into an input/target list according to the output_format argument.

Usage

seq_encoding_lm(
  sequence = NULL,
  maxlen,
  vocabulary,
  start_ind,
  ambiguous_nuc = "zero",
  nuc_dist = NULL,
  quality_vector = NULL,
  return_int = FALSE,
  target_len = 1,
  use_coverage = FALSE,
  max_cov = NULL,
  cov_vector = NULL,
  n_gram = NULL,
  n_gram_stride = 1,
  output_format = "target_right",
  char_sequence = NULL,
  adjust_start_ind = FALSE,
  tokenizer = NULL
)

Arguments

sequence

Sequence of integers.

maxlen

Length of predictor sequence.

vocabulary

Vector of allowed characters. Characters outside the vocabulary are encoded as specified by ambiguous_nuc.

start_ind

Start positions of samples in sequence.

ambiguous_nuc

How to handle nucleotides outside the vocabulary: either "zero", "empirical" or "equal". See train_model. Note that the "discard" option is not available for this function.
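A minimal sketch of the presumable effect (behavior inferred from the option names, not from package internals): "zero" encodes an out-of-vocabulary character as an all-zero row, "equal" as a uniform distribution over the vocabulary.

z <- seq_encoding_lm(maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = 1,
                     ambiguous_nuc = "equal",
                     output_format = "target_right",
                     char_sequence = "ACTNAT")
z[[1]][1, 4, ] # row for "N"; expected c(0.25, 0.25, 0.25, 0.25) under "equal"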

nuc_dist

Nucleotide distribution, used when ambiguous_nuc = "empirical".

quality_vector

Vector of quality probabilities.

return_int

Whether to return integer encoding or one-hot encoding.

target_len

Number of nucleotides to predict at once for a language model.
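A minimal sketch of a multi-step target, assuming "target_right" simply extends the target window by target_len positions:

z <- seq_encoding_lm(maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = 1,
                     target_len = 2,
                     output_format = "target_right",
                     char_sequence = "AACCGTA")
# expected split under this assumption: X = "aaccg", Y = "ta"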

use_coverage

Integer or NULL. If not NULL, use coverage as the encoding rather than one-hot encoding, and normalize. Coverage information must be contained in the fasta header: there must be a string "cov_n" in the header, where n is some integer.
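A hypothetical illustration of the header convention (the header string and regular expression below are examples, not package internals):

header <- ">seq_1 cov_25"
cov <- as.integer(sub(".*cov_(\\d+).*", "\\1", header))
cov # 25; a coverage encoding would presumably scale rows by cov / max_cov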

max_cov

Maximum coverage value. Only applies if use_coverage = TRUE.

cov_vector

Vector of coverage values associated with the input.

n_gram

Integer. If not NULL, the target is encoded not per nucleotide but as a combination of n nucleotides at once. For example, for n = 2: "AA" -> (1, 0, ..., 0), "AC" -> (0, 1, 0, ..., 0), "TT" -> (0, ..., 0, 1), where the one-hot vectors have length length(vocabulary)^n.
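A plain-R sketch of the indexing this implies, assuming n-grams are ordered like base-length(vocabulary) numbers (consistent with the "AA"/"AC"/"TT" example above; not the package's internal code):

vocabulary <- c("a", "c", "g", "t")
n <- 2
ngram_index <- function(chars) {
  # interpret the n-gram as a base-4 number: "aa" -> 1, "ac" -> 2, ..., "tt" -> 16
  pos <- match(chars, vocabulary) - 1
  sum(pos * length(vocabulary)^((n - 1):0)) + 1
}
ngram_index(c("a", "c")) # 2, i.e. a one-hot vector of length 4^2 with the 1 in position 2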

n_gram_stride

Step size for n-gram encoding. For "AACCGGTT" with n_gram = 4 and n_gram_stride = 2, the generator encodes (AACC), (CCGG), (GGTT); for n_gram_stride = 4, the generator encodes (AACC), (GGTT).
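The stride logic in plain R (an illustration of the example above, not package code):

seq_chars <- "AACCGGTT"
n_gram <- 4
n_gram_stride <- 2
starts <- seq(1, nchar(seq_chars) - n_gram + 1, by = n_gram_stride)
substring(seq_chars, starts, starts + n_gram - 1) # "AACC" "CCGG" "GGTT"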

output_format

Determines the shape of the output tensor for a language model. Either "target_right", "target_middle_lstm", "target_middle_cnn" or "wavenet". Assume a sequence "AACCGTA". Outputs correspond as follows (see the sketch after this list):

  • "target_right": X = "AACCGT", Y = "A"

  • "target_middle_lstm": X = (X_1 = "AAC", X_2 = "ATG"), Y = "C" (note reversed order of X_2)

  • "target_middle_cnn": X = "AACGTA", Y = "C"

  • "wavenet": X = "AACCGT", Y = "ACCGTA"

char_sequence

A character string. Used as input instead of sequence when sequence = NULL.

adjust_start_ind

Whether to shift values in start_ind to start at 1: for example (5,11,25) becomes (1,7,21).
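The shift amounts to subtracting the smallest start position (plain-R illustration):

start_ind <- c(5, 11, 25)
start_ind - min(start_ind) + 1 # 1 7 21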

tokenizer

A keras tokenizer.

Value

A list of 2 tensors (input and target).

Examples

if (FALSE) { # reticulate::py_module_available("tensorflow")
# use integer sequence as input 

z <- seq_encoding_lm(sequence = c(1,0,5,1,3,4,3,1,4,1,2),
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1,3),
                     ambiguous_nuc = "equal",
                     target_len = 1,
                     output_format = "target_right")

x <- z[[1]]
y <- z[[2]]

x[1,,] # 1,0,5,1,3
y[1,] # 4

x[2,,] # 5,1,3,4,3
y[2,] # 1

# use character string as input
z <- seq_encoding_lm(sequence = NULL,
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1,3),
                     ambiguous_nuc = "zero",
                     target_len = 1,
                     output_format = "target_right",
                     char_sequence = "ACTaaTNTNaZ")

x <- z[[1]]
y <- z[[2]]

x[1,,] # actaa
y[1,] # t

x[2,,] # taatn
y[2,] # t
}
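A further sketch, assuming return_int = TRUE replaces one-hot rows with the integer encoding itself (the resulting shapes are an assumption, not documented output; requires tensorflow, as above):

z <- seq_encoding_lm(sequence = c(1,0,5,1,3,4,3,1,4,1,2),
                     maxlen = 5,
                     vocabulary = c("a", "c", "g", "t"),
                     start_ind = c(1,3),
                     return_int = TRUE,
                     output_format = "target_right")
# z[[1]] is then expected to hold integers, e.g. a (2, 5) matrix rather than a (2, 5, 4) array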