Randomly select samples from fasta files — generator

The generators generator_fasta_lm, generator_fasta_label_header_csv and generator_fasta_label_folder randomly choose a consecutive sequence of samples when a max_samples argument is supplied. generator_random instead chooses samples at random.

Usage

generator_random(
  train_type = "label_folder",
  output_format = NULL,
  seed = 123,
  format = "fasta",
  reverse_complement = TRUE,
  path = NULL,
  batch_size = c(100),
  maxlen = 4,
  ambiguous_nuc = "equal",
  padding = FALSE,
  vocabulary = c("a", "c", "g", "t"),
  number_target_nt = 1,
  n_gram = NULL,
  n_gram_stride = NULL,
  sample_by_file_size = TRUE,
  max_samples = 1,
  skip_amb_nuc = NULL,
  vocabulary_label = NULL,
  target_from_csv = NULL,
  target_split = NULL,
  max_iter = 1000,
  verbose = TRUE,
  set_learning = NULL,
  shuffle_input = TRUE,
  reverse_complement_encoding = FALSE,
  proportion_entries = NULL,
  masked_lm = NULL,
  concat_seq = NULL,
  return_int = FALSE,
  reshape_xy = NULL
)

Arguments

train_type

Either "lm", "lm_rds", "masked_lm" for language model; "label_header", "label_folder", "label_csv", "label_rds" for classification or "dummy_gen".

Language model is trained to predict character(s) in a sequence.
"label_header"/"label_folder"/"label_csv" are trained to predict a corresponding class given a sequence as input.
If "label_header", class will be read from fasta headers.
If "label_folder", class will be read from folder, i.e. all files in one folder must belong to the same class.
If "label_csv", targets are read from a csv file. This file should have one column named "file". The targets then correspond to the entries in that row (except the "file" column). Example: if we are currently working with a file called "a.fasta" and the corresponding label is "label_1", there should be a row in our csv file
file        label_1   label_2
"a.fasta"   1         0
If "label_rds", generator will iterate over a set of .rds files, each containing a list of input and target tensors. Not implemented for models with multiple inputs.
If "lm_rds", generator will iterate over set of .rds files and will split tensor according to target_len argument (targets are last target_len nucleotides of each sequence).
If "dummy_gen", generator creates random data once and repeatedly feeds these to model.
If "masked_lm", generator masks some parts of the input. See masked_lm argument for details.
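The "label_csv" lookup described above can be sketched in base R. This is only an illustration of the csv layout, not the package's internal code; the helper target_for_file and the second row ("b.fasta") are hypothetical.

```r
# Build a small label table with the layout described above.
# The "b.fasta" row is made up for illustration.
labels <- data.frame(file = c("a.fasta", "b.fasta"),
                     label_1 = c(1, 0),
                     label_2 = c(0, 1))
csv_path <- tempfile(fileext = ".csv")
write.csv(labels, csv_path, row.names = FALSE)
label_table <- read.csv(csv_path)

# Hypothetical helper: return the target row for one file,
# i.e. every column of the matching row except "file".
target_for_file <- function(file_name, label_table) {
  row <- label_table[label_table$file == basename(file_name), , drop = FALSE]
  as.matrix(row[, setdiff(names(row), "file"), drop = FALSE])
}

target <- target_for_file("a.fasta", label_table)
target
```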

output_format

Determines the shape of the output tensor for a language model. Either "target_right", "target_middle_lstm", "target_middle_cnn" or "wavenet". Assume a sequence "AACCGTA". The outputs correspond as follows:

"target_right": X = "AACCGT", Y = "A"
"target_middle_lstm": X = (X_1 = "AAC", X_2 = "ATG"), Y = "C" (note reversed order of X_2)
"target_middle_cnn": X = "AACGTA", Y = "C"
"wavenet": X = "AACCGT", Y = "ACCGTA"
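The four splits listed above can be reproduced on plain strings. The function split_lm_sample below is a hypothetical helper written for this illustration; it works on characters rather than encoded tensors and assumes a single target nucleotide.

```r
# Hypothetical helper (not part of the package): split one sequence into
# predictor x and target y for each output_format, on the character level.
split_lm_sample <- function(s, output_format) {
  chars <- strsplit(s, "")[[1]]
  n <- length(chars)
  mid <- ceiling(n / 2)  # middle position used by the target_middle_* formats
  join <- function(v) paste(v, collapse = "")
  switch(output_format,
    target_right = list(x = join(chars[-n]), y = chars[n]),
    target_middle_lstm = list(
      # second input is the right flank in reversed order
      x = list(join(chars[seq_len(mid - 1)]), join(rev(chars[(mid + 1):n]))),
      y = chars[mid]),
    target_middle_cnn = list(x = join(chars[-mid]), y = chars[mid]),
    wavenet = list(x = join(chars[-n]), y = join(chars[-1])),
    stop("unknown output_format"))
}

split_lm_sample("AACCGTA", "target_middle_lstm")
```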

seed

Sets seed for set.seed function for reproducible results.

format

File format, either "fasta" or "fastq".

reverse_complement

Boolean, for every new file decide randomly to use original data or its reverse complement.
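As a rough illustration of what the reverse complement of a sequence is, the base R sketch below complements each nucleotide and reverses the result. The helper name and the 50/50 coin flip are assumptions for this example; the text above only states that the choice is random.

```r
# Hypothetical helper: reverse complement of a nucleotide string.
reverse_complement_string <- function(s) {
  complemented <- chartr("acgtACGT", "tgcaTGCA", s)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

original <- "AACCGTA"
# Assumed coin flip: keep the original or switch to the reverse complement.
use_rc <- sample(c(TRUE, FALSE), size = 1)
used <- if (use_rc) reverse_complement_string(original) else original
```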

path

Path to training data. If train_type is label_folder, should be a vector or list where each entry corresponds to a class (list elements can be directories and/or individual files). If train_type is not label_folder, can be a single directory or file or a list of directories and/or files.

batch_size

Number of samples in one batch.

maxlen

Length of predictor sequence.

ambiguous_nuc

How to handle nucleotides outside vocabulary, either "zero", "discard", "empirical" or "equal".

If "zero", input gets encoded as zero vector.
If "equal", input is repetition of 1/length(vocabulary).
If "discard", samples containing nucleotides outside vocabulary get discarded.
If "empirical", use nucleotide distribution of current file.
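A minimal sketch of how three of these options could affect a one-hot encoding is shown below. The function one_hot_sketch is hypothetical and covers only "zero", "equal" and "discard"; the "empirical" option is left out because it depends on the nucleotide distribution of the current file.

```r
# Hypothetical encoder: one row per position, one column per vocabulary entry.
one_hot_sketch <- function(s, vocabulary = c("a", "c", "g", "t"),
                           ambiguous_nuc = "zero") {
  chars <- strsplit(tolower(s), "")[[1]]
  idx <- match(chars, vocabulary)   # NA for characters outside vocabulary
  known <- !is.na(idx)
  if (ambiguous_nuc == "discard" && any(!known)) {
    return(NULL)                    # the whole sample is dropped
  }
  x <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
              dimnames = list(NULL, vocabulary))
  x[cbind(which(known), idx[known])] <- 1
  if (ambiguous_nuc == "equal") {
    x[!known, ] <- 1 / length(vocabulary)
  }
  x                                 # with "zero", unknown rows stay all zero
}

one_hot_sketch("acnt", ambiguous_nuc = "equal")
```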

padding

Whether to pad sequences too short for one sample with zeros.

vocabulary

Vector of allowed characters. Characters outside vocabulary get encoded as specified in ambiguous_nuc.

number_target_nt

Number of target nucleotides for language model.

n_gram

Integer, encode target not nucleotide wise but combine n nucleotides at once. For example for n=2, "AA" -> (1, 0,..., 0), "AC" -> (0, 1, 0,..., 0), "TT" -> (0,..., 0, 1), where the one-hot vectors have length length(vocabulary)^n.

n_gram_stride

Step size for n-gram encoding. For AACCGGTT with n_gram = 4 and n_gram_stride = 2, generator encodes (AACC), (CCGG), (GGTT); for n_gram_stride = 4 generator encodes (AACC), (GGTT).
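The windowing and the n-gram index can be sketched in base R. Both helpers are hypothetical; the index assumes n-grams are ordered as base-length(vocabulary) numbers, which matches the "AA", "AC" and "TT" examples given for n_gram but is otherwise an assumption.

```r
# Hypothetical helper: cut a sequence into n-grams with a given stride.
n_gram_windows <- function(s, n_gram, n_gram_stride = 1) {
  starts <- seq(1, nchar(s) - n_gram + 1, by = n_gram_stride)
  substring(s, starts, starts + n_gram - 1)
}

# Hypothetical helper: position of the 1 in the one-hot vector of an n-gram.
n_gram_index <- function(gram, vocabulary = c("a", "c", "g", "t")) {
  digits <- match(strsplit(tolower(gram), "")[[1]], vocabulary) - 1
  sum(digits * length(vocabulary)^rev(seq_along(digits) - 1)) + 1
}

n_gram_windows("AACCGGTT", n_gram = 4, n_gram_stride = 2)
n_gram_windows("AACCGGTT", n_gram = 4, n_gram_stride = 4)
n_gram_index("AC")
```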

sample_by_file_size

Sample new file weighted by file size (bigger files more likely).
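Size-weighted file sampling can be illustrated with base R's sample() and its prob argument. The two temporary files below are made up for the example, and the exact weighting used internally is not specified above, so proportional-to-bytes is an assumption.

```r
# Create two illustrative fasta files of different size.
dir_path <- tempfile()
dir.create(dir_path)
small_file <- file.path(dir_path, "small.fasta")
large_file <- file.path(dir_path, "large.fasta")
writeLines(c(">s1", "acgt"), small_file)
writeLines(c(">s1", strrep("acgt", 100)), large_file)

files <- c(small_file, large_file)
sizes <- file.size(files)

# Draw the next file with probability proportional to its size in bytes.
set.seed(123)
chosen <- sample(files, size = 1, prob = sizes / sum(sizes))
```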

max_samples

Maximum number of samples to use from one file. If not NULL and file has more than max_samples samples, will randomly choose a subset of max_samples samples.
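One way to picture the subsetting is in terms of possible start positions within a sequence. The helper below is a hypothetical sketch that assumes a sample is any window of maxlen nucleotides; the package may count samples differently (for example, including target positions).

```r
# Hypothetical helper: choose start positions of samples within one sequence.
sample_start_positions <- function(seq_len_nt, maxlen, max_samples = NULL) {
  starts <- seq_len(max(0, seq_len_nt - maxlen + 1))  # all possible windows
  if (!is.null(max_samples) && length(starts) > max_samples) {
    # keep a random subset of max_samples windows
    starts <- sort(starts[sample.int(length(starts), max_samples)])
  }
  starts
}

set.seed(123)
starts <- sample_start_positions(seq_len_nt = 20, maxlen = 4, max_samples = 3)
```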

skip_amb_nuc

Threshold of ambiguous nucleotides to accept in fasta entry. Complete entry will get discarded otherwise.

vocabulary_label

Character vector of possible targets. Targets outside vocabulary_label will get discarded.

target_from_csv

Path to csv file with target mapping. One column should be called "file" and other entries in row are the targets.

target_split

If target gets read from csv file, list of names to divide target tensor into list of tensors. Example: if csv file has header names "file", "label_1", "label_2", "label_3" and target_split = list(c("label_1", "label_2"), "label_3"), this will divide target matrix to list of length 2, where the first element contains columns named "label_1" and "label_2" and the second entry contains the column named "label_3".
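The column split from this example can be reproduced on a plain matrix. The target values below are made up; only the column names and the target_split list come from the description above.

```r
# Illustrative target matrix with the column names from the example.
target <- matrix(c(1, 0, 0,
                   0, 1, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("label_1", "label_2", "label_3")))

target_split <- list(c("label_1", "label_2"), "label_3")

# One matrix per list element, selecting columns by name.
split_target <- lapply(target_split, function(cols) target[, cols, drop = FALSE])
```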

max_iter

Stop after max_iter iterations have failed to produce a new batch.

verbose

Whether to show messages.

set_learning

Use this to assign one label to a set of samples. Only implemented for train_type = "label_folder". Input is a list with the following parameters:

samples_per_target: how many samples to use for one target.
maxlen: length of one sample.
reshape_mode: "time_dist", "multi_input" or "concat".
- If reshape_mode is "multi_input", generator will produce samples_per_target separate inputs, each of length maxlen (model should have samples_per_target input layers).
- If reshape_mode is "time_dist", generator will produce a 4D input array. The dimensions correspond to (batch_size, samples_per_target, maxlen, length(vocabulary)).
- If reshape_mode is "concat", generator will concatenate samples_per_target sequences of length maxlen to one long sequence.
If reshape_mode is "concat", there is an additional buffer_len argument. If buffer_len is an integer, the subsequences are interspaced with buffer_len rows. The input length is (maxlen * samples_per_target) + buffer_len * (samples_per_target - 1).
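The input sizes implied by the reshape modes can be checked with a little arithmetic. The helper concat_input_length is hypothetical and simply evaluates the formula above; the numeric values are illustrative.

```r
# Hypothetical helper: input length for reshape_mode = "concat".
concat_input_length <- function(maxlen, samples_per_target, buffer_len = 0) {
  maxlen * samples_per_target + buffer_len * (samples_per_target - 1)
}

# Three subsequences of length 100 separated by two buffers of 10 rows.
concat_input_length(maxlen = 100, samples_per_target = 3, buffer_len = 10)

# For reshape_mode = "time_dist" the input is a 4D array instead:
# (batch_size, samples_per_target, maxlen, length(vocabulary)).
time_dist_dim <- c(batch_size = 8, samples_per_target = 3,
                   maxlen = 100, vocabulary = 4)
```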

shuffle_input

Whether to shuffle entries in every fasta/fastq file before extracting samples.

reverse_complement_encoding

Whether to use both original sequence and reverse complement as two input sequences.

proportion_entries

Proportion of fasta entries to keep. For example, if fasta file has 50 entries and proportion_entries = 0.1, will randomly select 5 entries.

masked_lm

If not NULL, input and target are equal except some parts of the input are masked or random. Must be list with the following arguments:

mask_rate: Rate of input to mask (rate of input to replace with mask token).
random_rate: Rate of input to set to random token.
identity_rate: Rate of input where sample weights are applied but input and output are identical.
include_sw: Whether to include sample weights.
block_len (optional): Masked/random/identity regions appear in blocks of size block_len.

concat_seq

Character string or NULL. If not NULL, all entries from a file get concatenated to one sequence with the concat_seq string between them. Example: if the first entry is AACC, the second entry is TTTG and concat_seq = "ZZZ", this becomes AACCZZZTTTG.
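On plain character vectors, the same concatenation corresponds to paste() with a collapse string. This is only an illustration of the resulting sequence, not the package's implementation.

```r
entries <- c("AACC", "TTTG")   # sequences of the fasta entries in one file
concat_seq <- "ZZZ"

# Join all entries with the separator between them.
concatenated <- paste(entries, collapse = concat_seq)
concatenated
```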

return_int

Whether to return integer encoding or one-hot encoding.

reshape_xy

Can be a list of functions to apply to input and/or target. List elements (containing the reshape functions) must be called x for input or y for target, and each must have arguments called x and y. For example: reshape_xy = list(x = function(x, y) {return(x+1)}, y = function(x, y) {return(x+y)}). For the rds generator, the functions need an additional argument called sw.
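A sketch of how such a list might be applied to one batch is given below. The helper apply_reshape_xy is hypothetical, and it assumes both functions receive the original (unmodified) x and y; the actual order of application inside the generator is not stated above.

```r
reshape_xy <- list(x = function(x, y) { return(x + 1) },
                   y = function(x, y) { return(x + y) })

# Hypothetical helper: apply the optional x and y functions to a batch.
apply_reshape_xy <- function(x, y, reshape_xy) {
  new_x <- if (!is.null(reshape_xy$x)) reshape_xy$x(x = x, y = y) else x
  new_y <- if (!is.null(reshape_xy$y)) reshape_xy$y(x = x, y = y) else y
  list(new_x, new_y)
}

x <- matrix(1, nrow = 2, ncol = 2)  # illustrative input
y <- matrix(2, nrow = 2, ncol = 2)  # illustrative target
batch <- apply_reshape_xy(x, y, reshape_xy)
```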

Value

A generator function.

Examples

if (FALSE) { # reticulate::py_module_available("tensorflow")
path_input <- tempfile()
dir.create(path_input)
# create 2 fasta files called 'file_1.fasta', 'file_2.fasta'
create_dummy_data(file_path = path_input,
                  num_files = 2,
                  seq_length = 5,
                  num_seq = 1,
                  vocabulary = c("a", "c", "g", "t"))
dummy_labels <- data.frame(file = c('file_1.fasta', 'file_2.fasta'), # dummy labels
                           label1 = c(0, 1),
                           label2 = c(1, 0))
target_from_csv <- tempfile(fileext = '.csv')
write.csv(dummy_labels, target_from_csv, row.names = FALSE)
gen <- generator_random(path = path_input, batch_size = 2,
                        vocabulary_label = c('label_a', 'label_b'),
                        train_type = 'label_csv',
                        maxlen = 5, target_from_csv = target_from_csv)
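z <- gen()   # draw one batch: list of input tensor and target
dim(z[[1]])  # dimensions of the input tensor
z[[2]]       # targets read from the csv file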
}
