Data Generator • deepG

library(deepG)
library(magrittr)

Introduction

The most common use case for the deepG data generator is to extract samples from a collection of fasta (or fastq) files. The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$. We can differentiate between two approaches:

Language model: Part of a sequence is the input and another part is the target.
- Example: Predict the next nucleotide given the previous 100 nucleotides.
Label classification: Assign a label to a sequence.
- Example: Assign a label “virus” or “bacteria” to a sequence of length 100.

Suppose we are given 2 fasta files called “a.fasta” and “b.fasta” that look as follows:

a.fasta
>header_a1 AACCAAGG >header_a2 TTTGGG >header_a3 ACGTACGT

b.fasta
>header_b1 GTGTGT >header_b2 AAGG

If we want to extract sequences of length 4 from these files, there would be 17 possible samples (5 from AACCAAGG, 3 from TTTGGG, …). A naive approach would be to extract the samples in a sequential manner:

1. sample: AACC (a.fasta, header_a1: [AACC]AAGG)

2. sample: ACCA (a.fasta, header_a1: A[ACCA]AGG)

…

17. sample: AAGG (b.fasta, header_b2: [AAGG])

18. sample: AACC (a.fasta, header_a1: [AACC]AAGG; all 17 samples have been used, so extraction starts over)

…
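As a rough illustration of the sequential strategy, the following base R sketch enumerates every window of length 4 from the five toy sequences. It does not call deepG; the helper extract_windows is a hypothetical name used only for this example.

```r
# Toy sequences from a.fasta and b.fasta
seqs <- c(header_a1 = "AACCAAGG", header_a2 = "TTTGGG", header_a3 = "ACGTACGT",
          header_b1 = "GTGTGT", header_b2 = "AAGG")
window <- 4

# All windows of length `len` from one sequence, in sequential order
extract_windows <- function(s, len) {
  n <- nchar(s) - len + 1
  if (n < 1) return(character(0))
  starts <- seq_len(n)
  substring(s, starts, starts + len - 1)
}

samples <- unlist(lapply(seqs, extract_windows, len = window), use.names = FALSE)
length(samples)    # 17 possible samples
samples[c(1, 2)]   # "AACC" "ACCA": neighbouring samples differ by one shift
samples[17]        # "AAGG"
```

Neighbouring windows overlap in all but one position, which is the redundancy discussed in the next paragraph.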

For longer sequences this is not a desirable strategy, since the data is very redundant (consecutive samples often differ by just one nucleotide) and the model would often see long stretches of data from the same source. Choosing the samples completely at random can also be problematic, since we would constantly have to open new files. The deepG generator offers several options to control the data sampling strategy and achieve a good balance between the two approaches.

Data generator options

In the following code examples, we will mostly use the sequence abcdefghiiii to demonstrate some of the deepG data generator options. (In real-world applications you would usually have sequences from the ACGT vocabulary.)

sequence <- paste0("a", "b", "c", "d", "e", "f", "g", "h", "i", "i", "i", "i")
vocabulary <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")

We may store this sequence in a fasta file

temp_dir <- tempfile()
dir.create(temp_dir)
dir_path <- paste0(temp_dir, "/dummy_data")
dir.create(dir_path)
df <- data.frame(Sequence = sequence, Header = "label_1", stringsAsFactors = FALSE)
file_path <- file.path(dir_path, "a.fasta")
# sequence as fasta file
microseq::writeFasta(fdta = dplyr::as_tibble(df), out.file = file_path)

Since neural networks can only work with numeric data, we have to encode sequences of characters numerically. Usually this is achieved by one-hot encoding; some other approaches are also implemented: see the use_coverage, use_quality_score and ambiguous_nuc sections.

# one-hot encoding example
s <- c("a", "c", "a", "f", "i", "b")
# zero-based position of each character in the vocabulary
s_as_int_seq <- match(s, vocabulary) - 1
# fix the number of columns so it does not depend on which characters occur in s
one_hot_sample <- keras::to_categorical(s_as_int_seq, num_classes = length(vocabulary))
colnames(one_hot_sample) <- vocabulary
one_hot_sample

maxlen

The length of the input sequence.

vocabulary

The set of allowed characters in a sequence. What happens to characters outside the vocabulary can be controlled with the ambiguous_nuc argument.

train_type

The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$. The train_type argument determines how $X$ and $Y$ get extracted. Possible arguments for language models are:

“lm” or “lm_rds”: Given some sequence $s$, we take some subset of that sequence as input and the rest as target. How to split $s$ can be specified in the output_format argument.

Besides the language model approach, we can use label classification. This means we map some label to a sequence. For example, the target for some nucleotide sequence could be one of the labels “bacteria” or “virus”. We have to specify how to extract a label corresponding to a sequence. Possible arguments are:

“label_header”: get label from fasta headers.
“label_folder”: get label from folder, i.e. all files in one folder must belong to the same class.
“label_csv”: get label from csv file. The csv file should have one column named “file”. The targets then correspond to the entries in that row (except the “file” column). Example: if we are currently working with a file called “a.fasta”, there should be a row in our csv file with some target information for that file:

file        label_1   label_2
“a.fasta”   1         0

“label_rds”: rds file contains preprocessed list of input and target tensors.

Another option is “dummy_gen”: generator creates random data once and repeatedly returns them.

Extract target from fasta header (fasta header is “label_1” in example file):

# get target from header
vocabulary_label <- paste0("label_", 1:5)
gen <- get_generator(path = file_path,
                     train_type = "label_header",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label 
x # abcdef
y # label_1

Extract target from fasta folder:

# create data for second class
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# get target from folder
vocabulary_label <- paste0("label_", 1:2)
gen <- get_generator(path = c(file_path, file_path_2), # one entry for each class
                     train_type = "label_folder",
                     batch_size = 8,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]]
y <- z[[2]] 
x_1_1 <- x[1, , ]
colnames(x_1_1) <- vocabulary
x_1_1 # first sample from first class
x_2_1 <- x[5, , ]
colnames(x_2_1) <- vocabulary
x_2_1 # first sample from second class
colnames(y) <- vocabulary_label 
y # 4 samples from each class

Extract target from csv file:

# get target from csv
file <- c(basename(file_path), "xyz.fasta", "abc.fasta", "x_123.fasta")
vocabulary_label <- paste0("label_", 1:4) # same names as the csv columns below
label_1 <- c(1, 0, 0, 0)
label_2 <- c(0, 1, 0, 0)
label_3 <- c(0, 0, 1, 0)
label_4 <- c(0, 0, 0, 1)
df <- data.frame(file, label_1, label_2, label_3, label_4)
df
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)

gen <- get_generator(path = file_path,
                     train_type = "label_csv",
                     batch_size = 1,
                     maxlen = 6,
                     target_from_csv = csv_file,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label 
x # abcdef
y # label_1

Examples for language models follow in the next section.

output_format

The output_format argument determines the shape of the output for a language model, i.e. which part of a sequence is the input $X$ and which part is the target $Y$. Assume a sequence abcdefg and maxlen = 6. The outputs are then as follows:

“target_right”: $X=$ abcdef, $Y=$ g

“target_middle_lstm”: $X =$ ( $X_1 =$ abc, $X_2 =$ gfe), $Y=$ d (note reversed order of $X_2$ )

“target_middle_cnn”: $X =$ abcefg, $Y =$ d

“wavenet”: $X =$ abcdef, $Y =$ bcdefg
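The four layouts listed above can be reproduced on a plain character string. The sketch below uses base R only and does not call deepG; the variable and list names are chosen for illustration.

```r
s <- "abcdefg"
chars <- strsplit(s, "")[[1]]
maxlen <- 6
mid <- maxlen / 2 + 1   # position of the middle target (4, i.e. "d")

glue <- function(v) paste(v, collapse = "")

# target_right: first maxlen characters as input, next character as target
target_right <- list(x = glue(chars[1:maxlen]), y = chars[maxlen + 1])

# target_middle_lstm: left part, reversed right part, middle character as target
target_middle_lstm <- list(x1 = glue(chars[1:(mid - 1)]),
                           x2 = glue(rev(chars[(mid + 1):(maxlen + 1)])),
                           y  = chars[mid])

# target_middle_cnn: everything except the middle character as one input
target_middle_cnn <- list(x = glue(chars[-mid]), y = chars[mid])

# wavenet: target is the input shifted by one position
wavenet <- list(x = glue(chars[1:maxlen]), y = glue(chars[2:(maxlen + 1)]))

str(list(target_right = target_right, target_middle_lstm = target_middle_lstm,
         target_middle_cnn = target_middle_cnn, wavenet = wavenet))
```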

# target_right
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # g

# target_middle_lstm
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_lstm")

z <- gen()
x_1 <- z[[1]][[1]][1,,] 
x_2 <- z[[1]][[2]][1,,] 
y <- z[[2]] 
colnames(x_1) <- vocabulary
colnames(x_2) <- vocabulary
colnames(y) <- vocabulary
x_1 # abc
x_2 # gfe
y # d

# target_middle_cnn
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_cnn")

z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcefg
y # d

# wavenet
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "wavenet")

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]][1,,]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # bcdefg

batch_size

Number of samples in one batch.

# target_right
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 7,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]] 
dim(x)
dim(y)

step

We may determine how frequently we want to take a sample. If step = 1 we take a sample at every possible position. Let’s assume we want to predict the next character, i.e. part of the sequence is the input and the next character is the target. If maxlen = 3 and step = 1, the first three samples from abcdefghiiii are:

sample: X = abc, Y = d
sample: X = bcd, Y = e
sample: X = cde, Y = f

If step = 3:

sample: X = abc, Y = d
sample: X = def, Y = g
sample: X = ghi, Y = i

gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     step = 3, 
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,] #encodes abc
y <- z[[2]] # encodes d
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
# go 3 steps forward
z <- gen()
x <- z[[1]][1,,] #encodes def
y <- z[[2]] # encodes g
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
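The effect of step on the sample positions can also be worked out without the generator. The following base R sketch lists all input/target pairs for a given step; make_samples is a hypothetical helper, and the assumption is that every sample needs maxlen input characters plus one target character.

```r
sequence <- "abcdefghiiii"
maxlen <- 3

# All (input, target) pairs for a given step size
make_samples <- function(s, maxlen, step) {
  # last valid start leaves room for maxlen input characters and 1 target
  starts <- seq(1, nchar(s) - maxlen, by = step)
  data.frame(start = starts,
             x = substring(s, starts, starts + maxlen - 1),
             y = substring(s, starts + maxlen, starts + maxlen),
             stringsAsFactors = FALSE)
}

step_1 <- make_samples(sequence, maxlen, step = 1)
step_3 <- make_samples(sequence, maxlen, step = 3)
nrow(step_1)   # 9 samples
step_3         # abc/d, def/g, ghi/i
```

A larger step reduces the overlap between consecutive samples at the cost of fewer samples per sequence.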

padding

If the sequence is too short to create a single sample, we can pad the sequence with zero-vectors. If padding = FALSE, the generator will go to the next file or fasta entry until it finds a sequence long enough for a sample.

gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 15, # maxlen is longer than sequence
                     vocabulary = vocabulary,
                     step = 3,
                     padding = TRUE,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # first 4 entries are zero-vectors
y
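To make the padded shape concrete, here is a base R sketch that builds the same kind of input by hand: the last character is treated as the target, the remaining 11 characters are one-hot encoded, and zero rows are prepended until the input has maxlen rows. This mirrors the example above but is not deepG's implementation.

```r
vocabulary <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")
sequence <- "abcdefghiiii"
maxlen <- 15

chars <- strsplit(sequence, "")[[1]]
input_chars <- chars[-length(chars)]          # last character is the target

# one-hot encode: row i has a 1 in the column of character i
one_hot <- outer(input_chars, vocabulary, "==") * 1

# prepend zero rows so that the input has exactly maxlen rows
n_pad <- maxlen - nrow(one_hot)
x_padded <- rbind(matrix(0, nrow = n_pad, ncol = length(vocabulary)), one_hot)
colnames(x_padded) <- vocabulary

n_pad          # 4
dim(x_padded)  # 15 9
```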

ambiguous_nuc

A sequence might contain a character that does not lie inside our vocabulary. For example, let’s assume we discard e from our vocabulary. We have 4 options to handle this situation:

encode as zero vector

vocabulary_2 <- c("a", "b", "c", "d", "f", "g", "h", "i") # exclude "e" from vocabulary

# zero
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "zeros")
z <- gen()
x <- z[[1]][1,,] 
colnames(x) <- vocabulary_2
x # fifth row is zero vector

equal probability

# equal
gen <- get_generator(path = file_path,
                    train_type = "lm",
                    batch_size = 1,
                    maxlen = 6,
                    vocabulary = vocabulary_2,
                    output_format = "target_right",
                    ambiguous_nuc = "equal") 

z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # fifth row is 1/8 for every entry

use distribution of current file

# empirical
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "empirical") 

z <- gen()
x <- z[[1]][1,,] 
colnames(x) <- vocabulary_2
x # fifth row is distribution of file

discard

# discard
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "discard") 

z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # first sample with only characters from vocabulary is fghiii|i
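The four options can be compared side by side with a small base R sketch. It only computes the row that would stand in for the unknown character e; encode_unknown is a hypothetical helper, and the "empirical" variant is approximated by the relative frequencies of the in-vocabulary characters of the sequence, which is an assumption about how the file distribution is computed.

```r
vocabulary_2 <- c("a", "b", "c", "d", "f", "g", "h", "i")
sequence <- "abcdefghiiii"
chars <- strsplit(sequence, "")[[1]]

# Encoding vector for one out-of-vocabulary character
encode_unknown <- function(mode, chars, vocab) {
  switch(mode,
    zeros = rep(0, length(vocab)),
    equal = rep(1 / length(vocab), length(vocab)),
    empirical = {
      counts <- table(factor(chars[chars %in% vocab], levels = vocab))
      as.numeric(counts) / sum(counts)
    },
    stop("unknown mode: ", mode)
  )
}

enc <- sapply(c("zeros", "equal", "empirical"), encode_unknown,
              chars = chars, vocab = vocabulary_2)
rownames(enc) <- vocabulary_2
round(enc, 3)

# "discard" instead skips every window (6 inputs + 1 target) containing "e"
windows <- substring(sequence, 1:6, 7:12)
kept <- windows[!grepl("e", windows, fixed = TRUE)]
kept   # "fghiiii"
```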

proportion_per_seq

The proportion_per_seq argument gives the option to use a random subset instead of the full sequence.

cat("sequence is ", nchar(sequence), "characters long \n")
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     seed = 1,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     # take random subsequence using 50% of sequence 
                     proportion_per_seq = 0.5)

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # defgh
y # i
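One way to picture this option is as drawing a random contiguous stretch that covers the given proportion of the sequence, and then sampling only from that stretch. The sketch below is base R and rests on that assumption (it is consistent with the defgh / i result above, but it is not taken from the deepG source).

```r
set.seed(1)
sequence <- "abcdefghiiii"
proportion <- 0.5

# length of the stretch that is kept
sub_len <- floor(nchar(sequence) * proportion)

# random start such that the stretch fits into the sequence
start <- sample(nchar(sequence) - sub_len + 1, 1)
subsequence <- substr(sequence, start, start + sub_len - 1)

subsequence
nchar(subsequence)   # 6
```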

file_limit

Integer or NULL. If integer, use only specified number of randomly sampled files for training.

delete_used_files

If TRUE, delete a file once it has been used. Only applies to rds files.

x <- array(0, dim = c(1,5,4))
y <- matrix(0, ncol = 1)
rds_path <- tempfile(fileext = ".rds")
saveRDS(list(x, y), rds_path)

gen <- get_generator(path = rds_path,
                     delete_used_files = TRUE,
                     train_type = "label_rds",
                     batch_size = 1,
                     maxlen = 5)

z <- gen()
file.exists(rds_path)
# z <- gen()
# When called again, the generator waits until it finds one of the files listed in
# the initial `path` argument. This can be useful if other processes create rds files.

max_samples

Only use a fixed number of samples per file. Which samples to use is chosen randomly. (If random_sampling = FALSE, the chosen samples are consecutive.)

gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     step = 1,
                     seed = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)

z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # bcdef
x2 # cdefg

random_sampling

If you use max_samples, the generator randomly chooses a subset of all possible samples, but those samples are consecutive. With random_sampling = TRUE, the samples are drawn completely at random.

gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     seed = 66,
                     random_sampling = TRUE,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)

z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # efghi
x2 # defgh
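The difference between the two modes can be sketched in terms of sample start positions. The code below is base R only and assumes that consecutive sampling picks a random first position and then takes neighbouring samples, while random sampling draws positions independently; the variable names are illustrative.

```r
set.seed(3)
sequence <- "abcdefghiiii"
maxlen <- 5
max_samples <- 2

# number of possible samples: maxlen input characters plus 1 target each
n_possible <- nchar(sequence) - maxlen   # 7

# random_sampling = FALSE: random first position, then consecutive samples
first <- sample(n_possible - max_samples + 1, 1)
consecutive_starts <- first:(first + max_samples - 1)

# random_sampling = TRUE: positions drawn without regard to their neighbours
random_starts <- sample(n_possible, max_samples)

substring(sequence, consecutive_starts, consecutive_starts + maxlen - 1)
substring(sequence, random_starts, random_starts + maxlen - 1)
```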

target_len

Target length for language model.

gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 3, 
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]
y3 <- z[[2]][ , 3, ]
colnames(x) <- vocabulary
names(y1) <- vocabulary
names(y2) <- vocabulary
names(y3) <- vocabulary
x # abcde
y1 # f
y2 # g
y3 # h

n_gram / n_gram_stride

Encode the target of a language model not character by character but by combining n characters into one target. n_gram_stride determines the distance between the start positions of consecutive n-grams.

gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 6, 
                     n_gram = 3,
                     n_gram_stride = 3,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]

dim(x)[3] == length(vocabulary)^3
# x = abc as 3-gram
# y1 = def as 3-gram
# y2 = ghi as 3-gram
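The size of the n-gram encoding follows from counting: with a vocabulary of 9 characters there are 9^3 = 729 possible 3-grams. The base R sketch below splits the target into 3-grams and maps each one to an integer index; ngram_index is a hypothetical helper, and the ordering of n-grams it uses is one possible convention, not necessarily the one used by deepG.

```r
vocabulary <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")
n_gram <- 3
n_gram_stride <- 3
target <- "defghi"   # target_len = 6 characters following the input abc

# split the target into n-grams, one every n_gram_stride characters
starts <- seq(1, nchar(target) - n_gram + 1, by = n_gram_stride)
targets <- substring(target, starts, starts + n_gram - 1)
targets   # "def" "ghi"

# map an n-gram to a 1-based index, reading it as a number in base length(vocab)
ngram_index <- function(ngram, vocab) {
  digits <- match(strsplit(ngram, "")[[1]], vocab) - 1
  sum(digits * length(vocab)^rev(seq_along(digits) - 1)) + 1
}

length(vocabulary)^n_gram        # 729 possible 3-grams
ngram_index("aaa", vocabulary)   # 1
ngram_index("iii", vocabulary)   # 729
```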

add_noise

Add noise to the input. Must be NULL (no noise) or a list that specifies the noise distribution. The list contains the argument noise_type, which is either "normal" or "uniform". Optional arguments are sd and mean if noise_type is "normal" (defaults are sd = 1 and mean = 0), or min and max if noise_type is "uniform" (defaults are min = 0 and max = 1).

gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     add_noise = list(noise_type = "normal", mean = 0, sd = 0.01),
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- vocabulary
colnames(y) <- vocabulary
round(x, 3) # abcde + noise
y # f

proportion_entries

If a fasta file has multiple entries, you can randomly choose a subset. For example, if the file has 6 entries and proportion_entries = 0.5 the generator will randomly choose only 3 of the entries.

shuffle_file_order

Shuffle file order before iterating through files. Order gets reshuffled after every iteration.

shuffle_input

Whether to shuffle fasta entries if fasta file has multiple entries.

reverse_complement

If TRUE, randomly decide for every batch to use original sequence or its reverse complement. Only implemented for ACGT vocabulary.
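For reference, the reverse complement of a nucleotide sequence can be computed in base R as follows. This is a standalone sketch of the transformation itself; the function name is chosen for this example and is not a deepG function.

```r
# complement each base (A<->T, C<->G), then reverse the order
reverse_complement <- function(s) {
  complemented <- chartr("ACGT", "TGCA", s)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

reverse_complement("AACCAAGG")   # "CCTTGGTT"
```

Applying the function twice returns the original sequence.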

sample_by_file_size

Randomly choose new file by sampling according to file size (bigger files more likely).

concat_seq

Character string or NULL. If not NULL all entries from file get concatenated to one sequence with concat_seq string between them. Use concat_seq = "" if you don’t want to add a new token.

df <- data.frame(Sequence = c("AC", "AG", "AT"), Header = paste0("header", 1:3))
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 9,
                     vocabulary = c("A", "C", "G", "T", "Z"),
                     concat_seq = "ZZ",
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T", "Z")
colnames(y) <- c("A", "C", "G", "T", "Z")
x # ACZZAGZZA
y # T
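The expected input and target in this example can be checked by concatenating the entries by hand. The following base R sketch does exactly that, without calling the generator.

```r
entries <- c("AC", "AG", "AT")
maxlen <- 9

# join all fasta entries with the separator string between them
concatenated <- paste(entries, collapse = "ZZ")
concatenated   # "ACZZAGZZAT"

x_chars <- substr(concatenated, 1, maxlen)            # "ACZZAGZZA"
y_char <- substr(concatenated, maxlen + 1, maxlen + 1) # "T"
```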

set_learning

Use this when you want to assign one label to a set of samples. Only implemented for train_type = "label_folder". The input is a list with the following parameters:

samples_per_target: how many samples to use for one target.
maxlen: length of one sample.
reshape_mode: "time_dist", "multi_input" or "concat".
- If reshape_mode = "multi_input", the generator will produce samples_per_target separate inputs, each of length maxlen.
- If reshape_mode = "time_dist", the generator will produce a 4D input array. The dimensions correspond to (batch_size, samples_per_target, maxlen, length(vocabulary)).
- If reshape_mode = "concat", the generator will concatenate samples_per_target sequences of length maxlen into one long sequence.
buffer_len: only used if reshape_mode = "concat"; adds a new token between concatenated samples.
- If buffer_len is an integer, the sub-sequences are separated by buffer_len rows. The input length is (maxlen * samples_per_target) + buffer_len * (samples_per_target - 1).
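The input length formula for reshape_mode = "concat" can be evaluated directly. The helper below is a hypothetical convenience function written for this example.

```r
# input length for reshape_mode = "concat"
concat_input_length <- function(maxlen, samples_per_target, buffer_len = 0) {
  maxlen * samples_per_target + buffer_len * (samples_per_target - 1)
}

concat_input_length(maxlen = 4, samples_per_target = 3)                  # 12
concat_input_length(maxlen = 4, samples_per_target = 3, buffer_len = 2)  # 16
```

With 3 samples there are only 2 gaps between them, which is why buffer_len is multiplied by samples_per_target - 1.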

# create data for second label
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# multi_input 
set_learning <- list(reshape_mode = "multi_input",
                     maxlen = 4,
                     samples_per_target = 3)

gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 1, 
                     vocabulary = vocabulary,
                     set_learning = set_learning)

z <- gen()
x <- z[[1]]
y <- z[[2]]
length(x) # 3 samples per target
x_1_1 <- x[[1]][1, , ]
x_1_1 # abcd
x_1_2 <- x[[2]][1, , ]
x_1_2 # bcde
x_1_3 <- x[[3]][1, , ]
x_1_3 # cdef

x_2_1 <- x[[1]][2, , ]
x_2_1 # aaba
x_2_2 <- x[[2]][2, , ]
x_2_2 # abaa
x_2_3 <- x[[3]][2, , ]
x_2_3 # baac

colnames(y) <- c("label_1", "label_2")
y

# concat 
set_learning <- list(reshape_mode = "concat",
                     maxlen = 4,
                     samples_per_target = 3)

gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 2, 
                     vocabulary = vocabulary,
                     set_learning = set_learning)

z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x) 
x_1 <- x[1, , ]
colnames(x_1) <- vocabulary
x_1 # abcd | cdef | efgh
x_2 <- x[2, , ]
colnames(x_2) <- vocabulary
x_2 # aaba | baac | acaa

colnames(y) <- c("label_1", "label_2")
y

use_quality_score

If TRUE, instead of one-hot encoding, use quality score of fastq file.

df <- data.frame(Sequence = "ACAGAT", Header = "header_1", Quality = "!#*=?I")
fastq_path <- tempfile(fileext = ".fastq")
fastq_file <- microseq::writeFastq(df, fastq_path)
gen <- get_generator(path = fastq_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     format = "fastq",
                     vocabulary = c("A", "C", "G", "T"),
                     use_quality_score = TRUE,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA
y # T
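For orientation, quality strings in fastq files are commonly Phred+33 encoded: the ASCII code of each character minus 33 is the quality score Q, and the corresponding error probability is 10^(-Q/10). The base R sketch below decodes the quality string of the example; the Phred+33 offset is an assumption about the file, and how deepG turns these values into the input tensor is not shown here.

```r
bases <- strsplit("ACAGAT", "")[[1]]
quality <- "!#*=?I"

# Phred+33: quality score = ASCII code - 33
phred <- utf8ToInt(quality) - 33L

# probability that the base call is wrong
error_prob <- 10^(-phred / 10)

data.frame(base = bases, phred = phred, error_prob = signif(error_prob, 3))
```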

use_coverage

Integer or NULL. If not NULL, use coverage as encoding rather than one-hot encoding. Coverage information must be contained in fasta header: there must be a string “cov_n” in the header, where n is some integer.

df <- data.frame(Sequence = "ACAGAT", Header = "header_1_cov_8")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <-  get_generator(path = fasta_path,
                      train_type = "lm",
                      batch_size = 1,
                      maxlen = 5,
                      vocabulary = c("A", "C", "G", "T"),
                      use_coverage = 25,
                      output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA; 0.32 = 8/25
y # T
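The value 0.32 in the comment above is the coverage from the header divided by use_coverage. A base R sketch of that calculation, assuming the non-zero entries of the one-hot matrix are simply scaled by this ratio:

```r
header <- "header_1_cov_8"
use_coverage <- 25
nt <- c("A", "C", "G", "T")

# extract the integer that follows "cov_" in the header
coverage <- as.integer(sub(".*cov_([0-9]+).*", "\\1", header))
coverage / use_coverage   # 0.32

# one-hot encode ACAGA and scale by the coverage ratio
one_hot <- outer(strsplit("ACAGA", "")[[1]], nt, "==") * 1
colnames(one_hot) <- nt
x_cov <- one_hot * coverage / use_coverage
x_cov
```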

added_label_path

It is possible to feed a network additional information associated with a sequence. This information needs to be in a csv file. If all sequences in one file share the same label, the csv file should have one column named “file”.

We may add some additional input to our dummy data

file <- c(basename(file_path), "some_file_name.fasta")
df <- data.frame(file = file,
                 label_1 = c(0, 1), label_2 = c(1, 0), label_3 = c(1, 0))
df
write.csv(x = df, file = file.path(dir_path, "add_input.csv"), row.names = FALSE)

If we add the path to the csv file, the generator will map additional input to sequences:

gen <-  get_generator(path = dir_path,
                      train_type = "lm", 
                      batch_size = 1,
                      maxlen = 5,
                      output_format = "target_right",
                      vocabulary = vocabulary,
                      added_label_path = file.path(dir_path, "add_input.csv"),
                      add_input_as_seq = FALSE)  # don't treat added input as sequence
                      
z <- gen()
added_label_input <- z[[1]][[1]]
added_label_input
x <- z[[1]][[2]]
x[1, , ]
y <- z[[2]] 
y

If we want to train a network with additional labels, we have to add an additional input layer.

model <- create_model_lstm_cnn(
  maxlen = 5,
  layer_lstm = c(8, 8),
  layer_dense = c(4),
  label_input = 3 # additional input vector has length 3
)

# train_model(train_type = "lm", 
#             model = model,
#             path = file.path(dir_path, "train_files_1"),
#             path_val = file.path(dir_path, "validation_files_1"),
#             added_label_path = file.path(dir_path, "add_input.csv"),
#             steps_per_epoch = 5,
#             batch_size = 8,
#             epochs = 2)

return_int

Whether to return integer encoding rather than one-hot encoding.

df <- data.frame(Sequence = "ATCGC", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <-  get_generator(path = fasta_path,
                      train_type = "lm",
                      batch_size = 1,
                      return_int = TRUE,
                      padding = TRUE,
                      maxlen = 8,
                      vocabulary = c("A", "C", "G", "T"),
                      output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("pad", "pad", "pad", "pad", "A", "T", "C", "G")
x
colnames(y) <- "C"
y

Can also be combined with n-gram encoding:

df <- data.frame(Sequence = "AAACCCTTT", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <-  get_generator(path = fasta_path,
                      train_type = "lm",
                      batch_size = 1,
                      n_gram = 3,
                      n_gram_stride = 3,
                      return_int = TRUE,
                      maxlen = 6,
                      target_len = 3,
                      vocabulary = c("A", "C", "G", "T"),
                      output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("AAA", "CCC")
x
colnames(y) <- "TTT"
y

reshape_xy

Apply some function to the output of a generator call.

df <- data.frame(Sequence = "AAAATTTT", Header = "header_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
fx <- function(x = NULL, y = NULL) {
  return(x - 1)
}
fy <- function(x = NULL, y = NULL) {
  return(exp(y * 5))
}

gen <-  get_generator(path = fasta_path,
                      reshape_xy = list(x = fx, y = fy),
                      train_type = "label_folder",
                      batch_size = 1,
                      maxlen = 8)

z <- gen()
x <- z[[1]]
x[1,,]
y <- z[[2]]
y

masked_lm

Masks some parts of input sequence. Can be used for training BERT-like models.

nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.025, # set 2.5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE) # 0,1 matrix showing where masking was applied
gen <-  get_generator(path = fasta_path,
                      train_type = "masked_lm",
                      masked_lm = masked_lm,
                      batch_size = 1,
                      n_gram = 1,
                      n_gram_stride = 1,
                      return_int = TRUE,
                      maxlen = 100,
                      vocabulary = c("A", "C", "G", "T"))

z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ])
head(df)

Whenever sw (sample weight) column is 0, x and y columns are identical. Let’s look at rows where sw is 1:

df %>% dplyr::filter(sw == 1)

Here 5 is the mask token; it is always the size of the vocabulary + 1.

df %>% dplyr::filter(sw == 1 & x == 5) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 5) # 5% identity part and 2.5% random part (can randomly be the true value)
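The relation between x, y and sw can be reproduced with a few lines of base R. The sketch below is only an illustration of the masking idea: the counts per category are fixed by hand rather than derived from the rates, and the way positions are drawn is an assumption, not deepG's implementation.

```r
set.seed(42)
vocab_size <- 4
mask_token <- vocab_size + 1     # 5
y <- rep(1:4, each = 25)         # integer-encoded target of length 100

# illustrative counts for the three categories (chosen by hand)
n_mask <- 10
n_random <- 3
n_identity <- 5

# positions that receive a sample weight of 1
idx <- sample(length(y), n_mask + n_random + n_identity)
mask_idx <- idx[seq_len(n_mask)]
random_idx <- idx[n_mask + seq_len(n_random)]
# the remaining positions in idx form the identity part: x stays equal to y

x <- y
x[mask_idx] <- mask_token
x[random_idx] <- sample(vocab_size, n_random, replace = TRUE)

sw <- integer(length(y))
sw[idx] <- 1L

all(x[sw == 0] == y[sw == 0])   # positions without weight are unchanged
sum(x == mask_token)            # number of masked positions
```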

Can be combined with n-gram encoding and masking of fixed block size:

nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.05, # set 5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE, # 0,1 matrix showing where masking was applied
                  block_len = 3) # always mask at least 3 tokens in a row 
gen <-  get_generator(path = fasta_path,
                      train_type = "masked_lm",
                      masked_lm = masked_lm,
                      batch_size = 1,
                      n_gram = 3,
                      seed = 12,
                      n_gram_stride = 1,
                      return_int = TRUE,
                      maxlen = 100,
                      vocabulary = c("A", "C", "G", "T"))

z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ], position = 1:ncol(x))
head(df)
tail(df)

We can check that sample weights appear only in blocks.

which(sw == 1)

Here 65 is the mask token (4^3 + 1 = size of the vocabulary + 1).

df %>% dplyr::filter(sw == 1 & x == 65) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 65) # 5% identity part and 5% random part (can randomly be the true value)