Introduction
The most common use case for the deepG data generator is to extract samples from a collection of fasta (or fastq) files. The generator will always return a list of length 2. The first element is the input and the second the target . We can differentiate between 2 approaches
-
Language model: Part of a sequence is the input and other part the target.
- Example: Predict the next nucleotide given the previous 100 nucleotides.
-
Label classification: Assign a label to a sequence.
- Example: Assign a label “virus” or “bacteria” to a sequence of length 100.
Suppose we are given 2 fasta files called “a.fasta” and “b.fasta” that look as follows:
a.fasta >header_a1 AACCAAGG >header_a2 TTTGGG >header_a3 ACGTACGT |
b.fasta >header_b1 GTGTGT >header_b2 AAGG |
If we want to extract sequences of length 4 from these files, there would be 17 possible samples (5 from AACCAAGG, 3 from TTTGGG, …). A naive approach would be to extract the samples in a sequential manner:
1. sample:
a.fasta >header_a1 AACCAAGG >header_a2 TTTGGG >header_a3 ACGTACGT |
b.fasta >header_b1 GTGTGT >header_b2 AAGG |
2. sample:
a.fasta >header_a1 AACCAAGG >header_a2 TTTGGG >header_a3 ACGTACGT |
b.fasta >header_b1 GTGTGT >header_b2 AAGG |
…
17. sample:
a.fasta >header_a1 AACCAAGG >header_a2 TTTGGG >header_a3 ACGTACGT |
b.fasta >header_b1 GTGTGT >header_b2 AAGG |
18. sample:
a.fasta >header_a1 AACCAAGG >header_a2 TTTGGG >header_a3 ACGTACGT |
b.fasta >header_b1 GTGTGT >header_b2 AAGG |
…
For longer sequences this is not a desirable strategy since the data is very redundant (often just one nucleotide difference) and the model would often see long stretches of data from the same source. Choosing the samples completely at random can also be problematic since we would constantly have to open new files. The deepG generators offers several option to navigate the data sampling strategy to achieve a good balance between the two approaches.
Data generator options
In the following code examples, we will mostly use the sequence abcdefghiiii to demonstrate some of the deepG data generator options. (In real world application you would usually have sequences from the ACGT vocabulary.)
sequence <- paste0("a", "b", "c", "d", "e", "f", "g", "h", "i", "i", "i", "i")
vocabulary <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")
We may store this sequence in a fasta file
temp_dir <- tempfile()
dir.create(temp_dir)
dir_path <- paste0(temp_dir, "/dummy_data")
dir.create(dir_path)
df <- data.frame(Sequence = sequence, Header = "label_1", stringsAsFactors = FALSE)
file_path <- file.path(dir_path, "a.fasta")
# sequence as fasta file
microseq::writeFasta(fdta = dplyr::as_tibble(df), out.file = file_path)
Since neural networks can only work with numeric data, we have to encode sequences of characters with numeric data. Usually this is achieved by one-hot-encoding; there are some other approaches implemented: see use_coverage
, use_quality_score
and ambiguous_nuc
sections.
# one-hot encoding example
s <- c("a", "c", "a", "f", "i", "b")
s_as_int_seq <- vector("integer", length(s))
for (i in 1:length(s)) {
s_as_int_seq[i] <- which(s[i] == vocabulary) - 1
}
one_hot_sample <- keras::to_categorical(s_as_int_seq)
colnames(one_hot_sample) <- vocabulary
one_hot_sample
vocabulary
The set of allowed characters in a sequence. What happens to characters outside the vocabulary can be controlled with the ambiguous_nuc
argument.
train_type
The generator will always return a list of length 2. The first element is the input and the second the target . The train_type
argument determines how and get extracted. Possible arguments for language models are:
-
“lm” or “lm_rds”: Given some sequence , we take some subset of that sequence as input and the rest as target. How to split can be specified in
output_format
argument.
Besides the language model approach, we can use label classification . This means we map some label to a sequence. For example, the target for some nucleotide sequence could be one of the labels “bacteria” or “virus”. We have to specify how to extract a label corresponding to a sequence. Possible arguments are:
“label_header”: get label from fasta headers.
“label_folder”: get label from folder, i.e. all files in one folder must belong to the same class.
-
“label_csv”: get label from csv file. Csv file should have one column named “file”. The targets then correspond to entries in that row (except “file” column). Example: if we are currently working with a file called “a.fasta”, there should be a row in our csv file with some target information for that file
file label_1 label_2 “a.fasta” 1 0 “label_rds”: rds file contains preprocessed list of input and target tensors.
Another option is “dummy_gen”: generator creates random data once and repeatedly returns them.
Extract target from fasta header (fasta header is “label_1” in example file):
# get target from header
vocabulary_label <- paste0("label_", 1:5)
gen <- get_generator(path = file_path,
train_type = "label_header",
batch_size = 1,
maxlen = 6,
vocabulary = vocabulary,
vocabulary_label = vocabulary_label)
z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label
x # abcdef
y # label_1
Extract target from fasta folder:
# create data for second class
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)
# get target from folder
vocabulary_label <- paste0("label_", 1:2)
gen <- get_generator(path = c(file_path, file_path_2), # one entry for each class
train_type = "label_folder",
batch_size = 8,
maxlen = 6,
vocabulary = vocabulary,
vocabulary_label = vocabulary_label)
z <- gen()
x <- z[[1]]
y <- z[[2]]
x_1_1 <- x[1, , ]
colnames(x_1_1) <- vocabulary
x_1_1 # first sample from first class
x_2_1 <- x[5, , ]
colnames(x_2_1) <- vocabulary
x_2_1 # first sample from second class
colnames(y) <- vocabulary_label
y # 4 samples from each class
Extract target from csv file:
# get target from csv
file <- c(basename(file_path), "xyz.fasta", "abc.fasta", "x_123.fasta")
vocabulary_label <- paste0("label", 1:4)
label_1 <- c(1, 0, 0, 0)
label_2 <- c(0, 1, 0, 0)
label_3 <- c(0, 0, 1, 0)
label_4 <- c(0, 0, 0, 1)
df <- data.frame(file, label_1, label_2, label_3, label_4)
df
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
gen <- get_generator(path = file_path,
train_type = "label_csv",
batch_size = 1,
maxlen = 6,
target_from_csv = csv_file,
vocabulary = vocabulary,
vocabulary_label = vocabulary_label)
z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label
x # abcdef
y # label_1
Examples for language models follow in the next section.
output_format
The output_format
determines the shape of the output for a language model, i.e. part of a sequence is the input and another the target . Assume a sequence abcdefg and maxlen = 6
. Output correspond as follows
“target_right”: abcdef, g
“target_middle_lstm”: (abc, gfe), d (note reversed order of )
“target_middle_cnn”: abcefg, d
“wavenet”: abcdef, bcdefg
# target_right
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
maxlen = 6,
vocabulary = vocabulary,
output_format = "target_right")
z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # g
# target_middle_lstm
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
maxlen = 6,
vocabulary = vocabulary,
output_format = "target_middle_lstm")
z <- gen()
x_1 <- z[[1]][[1]][1,,]
x_2 <- z[[1]][[2]][1,,]
y <- z[[2]]
colnames(x_1) <- vocabulary
colnames(x_2) <- vocabulary
colnames(y) <- vocabulary
x_1 # abc
x_2 # gfe
y # d
# target_middle_cnn
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
maxlen = 6,
vocabulary = vocabulary,
output_format = "target_middle_cnn")
z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcefg
y # d
# wavenet
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
maxlen = 6,
vocabulary = vocabulary,
output_format = "wavenet")
z <- gen()
x <- z[[1]][1,,]
y <- z[[2]][1,,]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # bcdefg
batch_size
Number of samples in one batch.
# target_right
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 7,
maxlen = 6,
vocabulary = vocabulary,
output_format = "target_right")
z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x)
dim(y)
step
We may determine how frequently we want to take a sample. If step = 1
we take a sample at every possible step. Let’s assume we want to predict the next character, i.e. part of the sequence is the input and next character the target. If maxlen = 3, step = 1
:
sample: abcdefghiiii
sample: abcdefghiiii
sample: abcdefghiiii
if step = 3
sample: abcdefghiiii
sample: abcdefghiiii
sample: abcdefghiiii
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
maxlen = 3,
vocabulary = vocabulary,
step = 3,
output_format = "target_right")
z <- gen()
x <- z[[1]][1,,] #encodes abc
y <- z[[2]] # encodes d
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
# go 3 steps forward
z <- gen()
x <- z[[1]][1,,] #encodes def
y <- z[[2]] # encodes g
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
padding
If the sequence is too short to create a single sample, we can pad the sequence with zero-vectors. If padding = FALSE
the generator will go to next file/ fasta entry until it finds a sequence long enough for a sample.
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
maxlen = 15, # maxlen is longer than sequence
vocabulary = vocabulary,
step = 3,
padding = TRUE,
output_format = "target_right")
z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # first 4 entries are zero-vectors
y
ambiguous_nuc
A sequence might contain a character that does not lie inside our vocabulary. For example, let’s assume we discard e from our vocabulary. We have 4 options to handle this situation
- encode as zero vector
vocabulary_2 <- c("a", "b", "c", "d", "f", "g", "h", "i") # exclude "e" from vocabulary
# zero
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
maxlen = 6,
vocabulary = vocabulary_2,
output_format = "target_right",
ambiguous_nuc = "zeros")
z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # fifth row is zero vector
- equal probability
# equal
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
maxlen = 6,
vocabulary = vocabulary_2,
output_format = "target_right",
ambiguous_nuc = "equal")
z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # fifth row is 1/8 for every entry
- use distribution of current file
# empirical
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
maxlen = 6,
vocabulary = vocabulary_2,
output_format = "target_right",
ambiguous_nuc = "empirical")
z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # fifth row is distribuation of file
- discard
# discard
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
maxlen = 6,
vocabulary = vocabulary_2,
output_format = "target_right",
ambiguous_nuc = "discard")
z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # first sample with only characters from vocabulary is fghiii|i
proportion_per_seq
The proportion_per_seq
argument gives the option to use a random subset instead of the full sequence.
cat("sequence is ", nchar(sequence), "characters long \n")
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
maxlen = 5,
seed = 1,
vocabulary = vocabulary,
output_format = "target_right",
# take random subsequence using 50% of sequence
proportion_per_seq = 0.5)
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # defgh
y # i
file_limit
Integer or NULL. If integer, use only specified number of randomly sampled files for training.
delete_used_files
If true, delete file once used. Only applies for rds files.
x <- array(0, dim = c(1,5,4))
y <- matrix(0, ncol = 1)
rds_path <- tempfile(fileext = ".rds")
saveRDS(list(x, y), rds_path)
gen <- get_generator(path = rds_path,
delete_used_files = TRUE,
train_type = "label_rds",
batch_size = 1,
maxlen = 5)
z <- gen()
file.exists(rds_path)
# z <- gen()
# When calling the generator again, it will wait until it finds a file again from the files listed in
# the initial `path` argument. Can be used if another process(es) create rds files.
max_samples
Only use fixed number of samples per file. Randomly choose which samples to use. (If random_sampling = FALSE
, samples are consecutive.)
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 2,
maxlen = 5,
step = 1,
seed = 3,
vocabulary = vocabulary,
output_format = "target_right",
max_samples = 2)
z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # bcdef
x2 # cdefg
random_sampling
If you use max_samples
, generator will randomly choose subset from all possible samples, but those samples are consecutive. With random_sampling = TRUE
, samples are completely random.
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 2,
maxlen = 5,
seed = 66,
random_sampling = TRUE,
vocabulary = vocabulary,
output_format = "target_right",
max_samples = 2)
z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # efghi
x2 # defgh
target_len
Target length for language model.
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
target_len = 3,
maxlen = 5,
vocabulary = vocabulary,
output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]
y3 <- z[[2]][ , 3, ]
colnames(x) <- vocabulary
names(y1) <- vocabulary
names(y2) <- vocabulary
names(y3) <- vocabulary
x # abcde
y1 # f
y2 # g
y3 # h
n_gram / n_gram_stride
Encode target in language model not character wise but combine n characters to one target. n_gram_stride
determines the frequency of the n-gram encoding.
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
target_len = 6,
n_gram = 3,
n_gram_stride = 3,
maxlen = 3,
vocabulary = vocabulary,
output_format = "target_right")
z <- gen()
x <- z[[1]]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]
dim(x)[3] == length(vocabulary)^3
# x = abc as 3-gram
# y1 = def as 3-gram
# y2 = ghi as 3-gram
add_noise
Add noise to input. Must be a list that specifies noise distribution or NULL (no noise). List contains arguments noise_type
: either "normal"
or "uniform"
. Optional arguments sd
or mean
if noise_type
is "normal"
(default is sd=1
and mean=0
) or min
, max
if noise_type
is "uniform"
(default is min=0
, max=1
).
gen <- get_generator(path = file_path,
train_type = "lm",
batch_size = 1,
add_noise = list(noise_type = "normal", mean = 0, sd = 0.01),
maxlen = 5,
vocabulary = vocabulary,
output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
round(x, 3) # abcde + noise
y # f
proportion_entries
If a fasta file has multiple entries, you can randomly choose a subset. For example, if the file has 6 entries and proportion_entries = 0.5
the generator will randomly choose only 3 of the entries.
shuffle_file_order
Shuffle file order before iterating through files. Order gets reshuffled after every iteration.
reverse_complement
If TRUE
, randomly decide for every batch to use original sequence or its reverse complement. Only implemented for ACGT vocabulary.
sample_by_file_size
Randomly choose new file by sampling according to file size (bigger files more likely).
concat_seq
Character string or NULL
. If not NULL
all entries from file get concatenated to one sequence with concat_seq
string between them. Use concat_seq = ""
if you don’t want to add a new token.
df <- data.frame(Sequence = c("AC", "AG", "AT"), Header = paste0("header", 1:3))
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
train_type = "lm",
batch_size = 1,
maxlen = 9,
vocabulary = c("A", "C", "G", "T", "Z"),
concat_seq = "ZZ",
output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- c("A", "C", "G", "T", "Z")
colnames(y) <- c("A", "C", "G", "T", "Z")
x # ACZZAGZZA
y # T
set_learning
When you want to assign one label to set of samples. Only implemented for train_type = "label_folder"
. Input is a list with the following parameters
-
samples_per_target
how many samples to use for one target -
maxlen
length of one sample -
reshape_mode
:"time_dist", "multi_input"
or"concat"
.- If
reshape_mode = "multi_input"
, generator will producesamples_per_target
separate inputs, each of lengthmaxlen
. - If
reshape_mode = "time_dist"
, generator will produce a 4D input array. The dimensions correspond to(batch_size, samples_per_target, maxlen, length(vocabulary))
.
- If
reshape_mode
is"concat"
, generator will concatenatesamples_per_target
sequences of lengthmaxlen
to one long sequence.
- If
- If
reshape_mode = "concat"
, there is an additionalbuffer_len
argument: add new token between concatenated samples- If
buffer_len
is an integer, the sub-sequences are inter spaced withbuffer_len
rows. The input length is (maxlen
*samples_per_target
) +buffer_len
* (samples_per_target
- 1)
- If
# create data for second label
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)
# multi_input
set_learning <- list(reshape_mode = "multi_input",
maxlen = 4,
samples_per_target = 3)
gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
train_type = "label_folder",
batch_size = 2,
maxlen = 4,
step = 1,
vocabulary = vocabulary,
set_learning = set_learning)
z <- gen()
x <- z[[1]]
y <- z[[2]]
length(x) # 3 samples per target
x_1_1 <- x[[1]][1, , ]
x_1_1 # abcd
x_1_2 <- x[[2]][1, , ]
x_1_2 # bcde
x_1_3 <- x[[3]][1, , ]
x_1_3 # cdef
x_2_1 <- x[[1]][2, , ]
x_2_1 # aaba
x_2_2 <- x[[2]][2, , ]
x_2_2 # abaa
x_2_3 <- x[[3]][2, , ]
x_2_3 # baac
colnames(y) <- c("label_1", "label_2")
y
# concat
set_learning <- list(reshape_mode = "concat",
maxlen = 4,
samples_per_target = 3)
gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
train_type = "label_folder",
batch_size = 2,
maxlen = 4,
step = 2,
vocabulary = vocabulary,
set_learning = set_learning)
z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x)
x_1 <- x[1, , ]
colnames(x_1) <- vocabulary
x_1 # abcd | cdef | efgh
x_2 <- x[2, , ]
colnames(x_2) <- vocabulary
x_2 # aaba | baac | acaa
colnames(y) <- c("label_1", "label_2")
y
use_quality_score
If TRUE
, instead of one-hot encoding, use quality score of fastq file.
df <- data.frame(Sequence = "ACAGAT", Header = "header_1", Quality = "!#*=?I")
fastq_path <- tempfile(fileext = ".fastq")
fastq_file <- microseq::writeFastq(df, fastq_path)
gen <- get_generator(path = fastq_path,
train_type = "lm",
batch_size = 1,
maxlen = 5,
format = "fastq",
vocabulary = c("A", "C", "G", "T"),
use_quality_score = TRUE,
output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA
y # T
use_coverage
Integer or NULL
. If not NULL
, use coverage as encoding rather than one-hot encoding. Coverage information must be contained in fasta header: there must be a string “cov_n” in the header, where n is some integer.
df <- data.frame(Sequence = "ACAGAT", Header = "header_1_cov_8")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
train_type = "lm",
batch_size = 1,
maxlen = 5,
vocabulary = c("A", "C", "G", "T"),
use_coverage = 25,
output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA; 0.32 = 8/25
y # T
added_label_path
It is possible to feed a network additional information associated to a sequence. This information needs to be in a csv file. If all sequences in one file share the same label, the csv file should have one column named “file”.
We may add some additional input to our dummy data
file <- c(basename(file_path), "some_file_name.fasta")
df <- data.frame(file = file,
label_1 = c(0, 1), label_2 = c(1, 0), label_3 = c(1, 0))
df
write.csv(x = df, file = file.path(dir_path, "add_input.csv"), row.names = FALSE)
If we add the path to the csv file, the generator will map additional input to sequences:
gen <- get_generator(path = dir_path,
train_type = "lm",
batch_size = 1,
maxlen = 5,
output_format = "target_right",
vocabulary = vocabulary,
added_label_path = file.path(dir_path, "add_input.csv"),
add_input_as_seq = FALSE) # don't treat added input as sequence
z <- gen()
added_label_input <- z[[1]][[1]]
added_label_input
x <- z[[1]][[2]]
x[1, , ]
y <- z[[2]]
y
If we want to train a network with additional labels, we have to add an additional input layer.
model <- create_model_lstm_cnn(
maxlen = 5,
layer_lstm = c(8, 8),
layer_dense = c(4),
label_input = 3 # additional input vector has length 3
)
# train_model(train_type = "lm",
# model = model,
# path = file.path(dir_path, "train_files_1"),
# path_val = file.path(dir_path, "validation_files_1"),
# added_label_path = file.path(dir_path, "add_input.csv"),
# steps_per_epoch = 5,
# batch_size = 8,
# epochs = 2)
return_int
Whether to return integer encoding rather than one-hot encoding.
df <- data.frame(Sequence = "ATCGC", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
train_type = "lm",
batch_size = 1,
return_int = TRUE,
padding = TRUE,
maxlen = 8,
vocabulary = c("A", "C", "G", "T"),
output_format = "target_right")
z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("pad", "pad", "pad", "pad", "A", "T", "C", "G")
x
colnames(y) <- "C"
y
Can also be combined with n-gram encoding:
df <- data.frame(Sequence = "AAACCCTTT", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
train_type = "lm",
batch_size = 1,
n_gram = 3,
n_gram_stride = 3,
return_int = TRUE,
maxlen = 6,
target_len = 3,
vocabulary = c("A", "C", "G", "T"),
output_format = "target_right")
z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("AAA", "CCC")
x
colnames(y) <- "TTT"
y
reshape_xy
Apply some function to the output of a generator call.
df <- data.frame(Sequence = "AAAATTTT", Header = "header_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
fx <- function(x = NULL, y = NULL) {
return(x - 1)
}
fy <- function(x = NULL, y = NULL) {
return(exp(y * 5))
}
gen <- get_generator(path = fasta_path,
reshape_xy = list(x = fx, y = fy),
train_type = "label_folder",
batch_size = 1,
maxlen = 8)
z <- gen()
x <- z[[1]]
x[1,,]
y <- z[[2]]
y
masked_lm
Masks some parts of input sequence. Can be used for training BERT-like models.
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
random_rate = 0.025, # set 2.5% of input to random value
identity_rate = 0.05, # leave 5% unchanged
include_sw = TRUE) # 0,1 matrix showing where masking was applied
gen <- get_generator(path = fasta_path,
train_type = "masked_lm",
masked_lm = masked_lm,
batch_size = 1,
n_gram = 1,
n_gram_stride = 1,
return_int = TRUE,
maxlen = 100,
vocabulary = c("A", "C", "G", "T"))
z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ])
head(df)
Whenever sw (sample weight) column is 0, x and y columns are identical. Let’s look at rows where sw is 1:
Here 5 is the mask token, this is always the size of the vocabulary + 1.
df %>% dplyr::filter(sw == 1 & x == 5) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 5) # 5% identity part and 2.5% random part (can randomly be the true value)
Can be combined with n-gram encoding and masking of fixed block size:
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
random_rate = 0.05, # set 5% of input to random value
identity_rate = 0.05, # leave 5% unchanged
include_sw = TRUE, # 0,1 matrix showing where masking was applied
block_len = 3) # always mask at least 3 tokens in a row
gen <- get_generator(path = fasta_path,
train_type = "masked_lm",
masked_lm = masked_lm,
batch_size = 1,
n_gram = 3,
seed = 12,
n_gram_stride = 1,
return_int = TRUE,
maxlen = 100,
vocabulary = c("A", "C", "G", "T"))
z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ], position = 1:ncol(x))
head(df)
tail(df)
We can check that sample weights appear only in blocks.
which(sw == 1)
Here 65 is the mask token (4^3 + 1 = size of the vocabulary + 1).