Iterates over a folder containing fasta/fastq files and produces encodings of predictor sequences and target variables. All files in path_corpus should belong to one class.


generator_fasta_label_folder(
  path_corpus,
  format = "fasta",
  batch_size = 256,
  maxlen = 250,
  max_iter = 10000,
  vocabulary = c("a", "c", "g", "t"),
  verbose = FALSE,
  shuffle_file_order = FALSE,
  step = 1,
  seed = 1234,
  shuffle_input = FALSE,
  file_limit = NULL,
  path_file_log = NULL,
  reverse_complement = TRUE,
  reverse_complement_encoding = FALSE,
  num_targets,
  ones_column,
  ambiguous_nuc = "zero",
  proportion_per_seq = NULL,
  read_data = FALSE,
  use_quality_score = FALSE,
  padding = TRUE,
  added_label_path = NULL,
  add_input_as_seq = NULL,
  skip_amb_nuc = NULL,
  max_samples = NULL,
  concat_seq = NULL,
  file_filter = NULL,
  use_coverage = NULL,
  proportion_entries = NULL,
  sample_by_file_size = FALSE,
  n_gram = NULL,
  n_gram_stride = 1,
  masked_lm = NULL,
  add_noise = NULL,
  return_int = FALSE,
  reshape_xy = NULL
)



Input directory where fasta/fastq files are located, or path to a single file ending with fasta or fastq (as specified in the format argument). Can also be a list of directories and/or files.


File format, either "fasta" or "fastq".


Number of samples in one batch.


Length of predictor sequence.


Stop if max_iter iterations have failed to produce a new batch.


Vector of allowed characters. Characters outside vocabulary get encoded as specified in ambiguous_nuc.


Whether to show messages.


Logical, whether to go through files randomly or sequentially.


How often to take a sample, i.e. the distance (in nucleotides) between the start positions of consecutive samples.
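
As a rough illustration of how step interacts with maxlen, the sketch below computes sample start positions for a single sequence. The variable names `seq_length` and `start_pos` are hypothetical, and the exact windowing used inside the generator is an assumption here, not taken from the package source.

```r
# Hypothetical sequence of 12 nucleotides, samples of length 7, step of 2
seq_length <- 12
maxlen <- 7
step <- 2

# Candidate start positions: every `step`-th position that still leaves
# room for a full sample of length `maxlen`
start_pos <- seq(1, seq_length - maxlen + 1, by = step)
start_pos
```

With step = 1 every possible window would be used; larger values thin out overlapping samples.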


Sets seed for set.seed function for reproducible results.


Whether to shuffle entries in every fasta/fastq file before extracting samples.


Integer or NULL. If integer, use only specified number of randomly sampled files for training. Ignored if greater than number of files in path.


If a path is specified, write the names of the files used to a csv file at that path.


Boolean. If TRUE, decide randomly for every new file whether to use the original data or its reverse complement.


Whether to use both original sequence and reverse complement as two input sequences.


Number of columns of target matrix.


Which column of target matrix contains ones.
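
The target matrix described by num_targets and ones_column can be pictured with a few lines of base R. This is only a sketch of the resulting shape, using the values from the example call at the end of this page (batch_size = 2, num_targets = 3, ones_column = 2); it does not reproduce the generator's internal code.

```r
batch_size <- 2
num_targets <- 3
ones_column <- 2

# One row per sample, one column per class; every sample from this
# folder gets a 1 in the column given by ones_column
y <- matrix(0, nrow = batch_size, ncol = num_targets)
y[, ones_column] <- 1
y
```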


How to handle nucleotides outside vocabulary, either "zero", "discard", "empirical" or "equal".

  • If "zero", input gets encoded as zero vector.

  • If "equal", input is repetition of 1/length(vocabulary).

  • If "discard", samples containing nucleotides outside vocabulary get discarded.

  • If "empirical", use nucleotide distribution of current file.
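
The "zero" and "equal" options above can be sketched for a single character as follows. `encode_char` is a hypothetical helper written for this illustration, not a function of the package; "discard" and "empirical" are left out because they depend on the whole sample or file.

```r
vocabulary <- c("a", "c", "g", "t")

# Hypothetical helper: encode one character as a vector of length(vocabulary)
encode_char <- function(ch, ambiguous_nuc = "zero") {
  if (ch %in% vocabulary) {
    return(as.numeric(vocabulary == ch))  # regular one-hot vector
  }
  switch(ambiguous_nuc,
         zero  = rep(0, length(vocabulary)),
         equal = rep(1 / length(vocabulary), length(vocabulary)),
         stop("option not covered in this sketch"))
}

encode_char("c")           # one-hot vector for "c"
encode_char("n", "zero")   # all zeros
encode_char("n", "equal")  # 0.25 in every position
```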


Numerical value between 0 and 1. Proportion of sequence to take samples from (use random subsequence).


If TRUE, the first element of the output is a list of length 2, each element containing one part of a paired read. maxlen should be 2 * length of one read.


Whether to use fastq quality scores. If TRUE, the input is not a one-hot encoding but corresponds to probabilities, for example (0.97, 0.01, 0.01, 0.01) instead of (1, 0, 0, 0).
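
One way such probabilities can be derived from a Phred quality score Q is via the standard error probability 10^(-Q/10). The sketch below assumes the error mass is split equally among the other nucleotides; `phred_to_probs` is a hypothetical helper, and the exact conversion used by the generator may differ.

```r
# Hypothetical helper: turn a base call and its Phred score into a
# probability vector over the vocabulary
phred_to_probs <- function(base, phred, vocabulary = c("a", "c", "g", "t")) {
  p_err <- 10^(-phred / 10)  # standard Phred error probability
  # Assumption: error probability is spread evenly over the other bases
  probs <- rep(p_err / (length(vocabulary) - 1), length(vocabulary))
  probs[vocabulary == base] <- 1 - p_err
  probs
}

phred_to_probs("a", phred = 20)
```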


Whether to pad sequences too short for one sample with zeros.


Path to file with additional input labels. Should be a csv file with one column named "file". Other columns should correspond to labels.


Boolean vector specifying, for each entry in added_label_path, whether rows from the csv file should be encoded as a sequence or used directly. If a row in your csv file is a sequence, this should be TRUE. For example, you may want to add another sequence, say ACCGT. This would correspond to 1,2,2,3,4 in the csv file (if vocabulary = c("A", "C", "G", "T")). If add_input_as_seq is TRUE, 1,2,2,3,4 gets one-hot encoded, so the added input is a 3D tensor. If add_input_as_seq is FALSE, the network is fed the raw data (a 2D tensor).
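
The ACCGT example can be reproduced in base R to show what one-hot encoding the integer row looks like for a single sample. This is a sketch for illustration only; stacking such matrices over a batch gives the 3D tensor mentioned above.

```r
vocabulary <- c("A", "C", "G", "T")
int_seq <- c(1, 2, 2, 3, 4)  # ACCGT as it would appear in the csv row

# Pick rows of the identity matrix to get one one-hot row per position
one_hot <- diag(length(vocabulary))[int_seq, ]
colnames(one_hot) <- vocabulary
one_hot
dim(one_hot)  # sequence length x vocabulary size
```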


Threshold of ambiguous nucleotides to accept in fasta entry. Complete entry will get discarded otherwise.


Maximum number of samples to use from one file. If not NULL and file has more than max_samples samples, will randomly choose a subset of max_samples samples.


Character string or NULL. If not NULL, all entries from a file get concatenated into one sequence with the concat_seq string between them. Example: if the first entry is AACC, the second entry is TTTG and concat_seq = "ZZZ", this becomes AACCZZZTTTG.
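
The concatenation in this example is equivalent to a single paste() call in base R, shown here purely as an illustration of the resulting string:

```r
entries <- c("AACC", "TTTG")
concat_seq <- "ZZZ"

# Join all entries of a file with the separator string in between
paste(entries, collapse = concat_seq)
```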


Vector of file names to use from path_corpus.


Integer or NULL. If not NULL, use coverage as encoding rather than one-hot encoding and normalize. Coverage information must be contained in fasta header: there must be a string "cov_n" in the header, where n is some integer.
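
To make the header requirement concrete, the sketch below extracts n from a "cov_n" substring with a regular expression. The header string is made up for this example, and the parsing shown is an assumption about how such a header could be read, not the package's own code.

```r
# Hypothetical fasta header containing coverage information
header <- "contig_17_cov_8_len_1200"

# Capture the integer that follows "cov_"
coverage <- as.integer(sub(".*cov_([0-9]+).*", "\\1", header))
coverage
```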


Proportion of fasta entries to keep. For example, if fasta file has 50 entries and proportion_entries = 0.1, will randomly select 5 entries.
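
The numbers in this example can be checked with a short sketch. The use of round() and sample() is an assumption made for illustration; the generator's own rounding and sampling rules are not documented here.

```r
set.seed(1234)
num_entries <- 50
proportion_entries <- 0.1

# Randomly choose which entries of the file to keep
keep <- sample(num_entries, size = round(num_entries * proportion_entries))
length(keep)
```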


Sample new file weighted by file size (bigger files more likely).


Integer. Encode the target not nucleotide-wise but by combining n nucleotides at once. For example, for n = 2, "AA" -> (1, 0,..., 0), "AC" -> (0, 1, 0,..., 0), "TT" -> (0,..., 0, 1), where the one-hot vectors have length length(vocabulary)^n.
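
The position of the 1 in such an n-gram vector can be computed by reading the n-gram as a number in base length(vocabulary). `ngram_index` is a hypothetical helper that reproduces the ordering of the examples above (AA first, AC second, TT last); it is a sketch, not the package's implementation.

```r
vocabulary <- c("A", "C", "G", "T")

# Hypothetical helper: 1-based index of an n-gram in a one-hot vector
# of length length(vocabulary)^n
ngram_index <- function(ngram, vocabulary) {
  chars <- strsplit(ngram, "")[[1]]
  digits <- match(chars, vocabulary) - 1  # A = 0, C = 1, G = 2, T = 3
  n <- length(chars)
  sum(digits * length(vocabulary)^((n - 1):0)) + 1
}

ngram_index("AA", vocabulary)
ngram_index("AC", vocabulary)
ngram_index("TT", vocabulary)
```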


Step size for n-gram encoding. For AACCGGTT with n_gram = 4 and n_gram_stride = 2, generator encodes (AACC), (CCGG), (GGTT); for n_gram_stride = 4 generator encodes (AACC), (GGTT).
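
The two stride examples can be reproduced with base R string functions. `ngrams` is a hypothetical helper used only to show which substrings are selected:

```r
# Hypothetical helper: split a sequence into n-grams taken every `stride` positions
ngrams <- function(s, n, stride) {
  starts <- seq(1, nchar(s) - n + 1, by = stride)
  substring(s, starts, starts + n - 1)
}

ngrams("AACCGGTT", n = 4, stride = 2)
ngrams("AACCGGTT", n = 4, stride = 4)
```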


If not NULL, input and target are equal except that some parts of the input are masked or replaced by random tokens. Must be a list with the following elements:

  • mask_rate: Rate of input to mask (rate of input to replace with mask token).

  • random_rate: Rate of input to set to random token.

  • identity_rate: Rate of input where sample weights are applied but input and output are identical.

  • include_sw: Whether to include sample weights.

  • block_len (optional): Masked/random/identity regions appear in blocks of size block_len.
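
A list with the elements described above could look like the following. The numeric values are arbitrary placeholders chosen for illustration, not recommended settings.

```r
# Illustrative values only
masked_lm <- list(
  mask_rate = 0.10,      # share of positions replaced by the mask token
  random_rate = 0.025,   # share of positions replaced by a random token
  identity_rate = 0.05,  # share of positions left unchanged but weighted
  include_sw = TRUE,     # also return sample weights
  block_len = 3          # optional: apply changes in blocks of 3 positions
)
str(masked_lm)
```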


NULL or list of arguments. If not NULL, the list must contain the element noise_type, which can be "normal" or "uniform". Optional elements are sd and mean if noise_type is "normal" (defaults are sd = 1 and mean = 0), or min and max if noise_type is "uniform" (defaults are min = 0 and max = 1).
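
A possible add_noise list, together with a sketch of what adding normal noise to an encoded input amounts to. Adding the noise element-wise to the encoded input is an assumption made for this illustration; the values and the toy matrix `x` are made up.

```r
# Illustrative configuration
add_noise <- list(noise_type = "normal", mean = 0, sd = 0.01)

set.seed(1)
# Toy one-hot encoded input: 2 positions, 4 vocabulary entries
x <- matrix(c(1, 0, 0, 0,
              0, 1, 0, 0), nrow = 2, byrow = TRUE)

# Assumption: noise is drawn per element and added to the encoding
noise <- rnorm(length(x), mean = add_noise$mean, sd = add_noise$sd)
x_noisy <- x + array(noise, dim = dim(x))
x_noisy
```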


Whether to return integer encoding or one-hot encoding.


Can be a list of functions to apply to input and/or target. List elements (containing the reshape functions) must be called x for input or y for target, and each function must have arguments called x and y. For example: reshape_xy = list(x = function(x, y) {return(x+1)}, y = function(x, y) {return(x+y)}). For the rds generator, the functions need an additional argument called sw.
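
The example list can be tried out on toy data to see what each function receives and returns. The matrices below are made up, and applying both functions to the original x and y (rather than to already reshaped values) is an assumption of this sketch.

```r
reshape_xy <- list(
  x = function(x, y) x + 1,  # new input
  y = function(x, y) x + y   # new target
)

# Toy input and target of identical shape so that x + y is defined
x <- matrix(0, nrow = 2, ncol = 3)
y <- matrix(1, nrow = 2, ncol = 3)

new_x <- reshape_xy$x(x, y)
new_y <- reshape_xy$y(x, y)
new_x
new_y
```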


A generator function.


if (FALSE) { # reticulate::py_module_available("tensorflow")
# create dummy fasta files
path_input_1 <- tempfile()
create_dummy_data(file_path = path_input_1, 
                  num_files = 2,
                  seq_length = 7,
                  num_seq = 1,
                  vocabulary = c("a", "c", "g", "t"))

gen <- generator_fasta_label_folder(path_corpus = path_input_1, batch_size = 2,
                                    num_targets = 3, ones_column = 2, maxlen = 7)
z <- gen()