Make prediction for nucleotide sequence or entries in fasta/fastq file

Optionally removes layers from a pretrained model and computes the model states for a fasta/fastq file or a nucleotide sequence. Writes the states to an h5 or csv file (the content of an h5 output can be accessed with the load_prediction function). The output_format argument controls how an input file is processed:

If "one_seq", computes predictions for the sequence argument or a fasta/fastq file. Combines all fasta entries in the file into one sequence. This means predictor sequences can contain elements from more than one fasta entry.
If "by_entry", will output a separate file for each fasta/fastq entry. Names of output files are: output_dir + "Nr" + i + filename + output_type, where i is the number of the fasta entry.
If "by_entry_one_file", will store predictions for all fasta entries in one h5 file.
If "one_pred_per_entry", will make one prediction for each entry, either by picking a random sample from long sequences or by padding short sequences.
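
The difference between "one_seq" and "by_entry" can be illustrated in base R, without the package itself. This is a minimal sketch under assumptions: the helper `windows()` is hypothetical and only mimics cutting an input into samples of length maxlen; it is not the function's actual implementation.

```r
# Two toy fasta entries and a toy input length
entries <- c(entry1 = "aaaaaa", entry2 = "cccccc")
maxlen <- 4

# Hypothetical helper (not part of the package): all windows of length
# maxlen, moving `step` positions at a time
windows <- function(s, maxlen, step = 1) {
  n <- nchar(s)
  if (n < maxlen) return(character(0))
  starts <- seq(1, n - maxlen + 1, by = step)
  substring(s, starts, starts + maxlen - 1)
}

# "one_seq" style: entries are concatenated first, so some windows
# contain characters from both entries (e.g. "aacc")
one_seq_windows <- windows(paste(entries, collapse = ""), maxlen)

# "by_entry" style: each entry is windowed on its own, so no window
# crosses an entry boundary
by_entry_windows <- lapply(entries, windows, maxlen = maxlen)
```

In the concatenated case, the windows "aaac", "aacc" and "accc" mix both entries; in the per-entry case no such window exists.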

Usage

predict_model(
  model,
  output_format = "one_seq",
  layer_name = NULL,
  sequence = NULL,
  path_input = NULL,
  round_digits = NULL,
  filename = "states.h5",
  step = 1,
  vocabulary = c("a", "c", "g", "t"),
  batch_size = 256,
  verbose = TRUE,
  return_states = FALSE,
  output_type = "h5",
  padding = "none",
  use_quality = FALSE,
  quality_string = NULL,
  mode = "label",
  lm_format = "target_right",
  output_dir = NULL,
  format = "fasta",
  include_seq = FALSE,
  reverse_complement_encoding = FALSE,
  ambiguous_nuc = "zero",
  ...
)

Arguments

model

A keras model.

output_format

Either "one_seq", "by_entry", "by_entry_one_file", "one_pred_per_entry".

layer_name

Name of layer to get output from. If NULL, will use the last layer.

sequence

Character string. If given, path_input is ignored.

path_input

Path to fasta/fastq file.

round_digits

Number of decimal places.

filename

Filename to store states in. No file output if argument is NULL. If output_format = "by_entry", adds "nr" + "i" after name, where i is entry number.

step

Frequency of sampling steps, i.e. the distance between the start positions of consecutive samples.
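
As a rough illustration of how step relates to the number of samples, the following base R sketch counts window start positions. It assumes that samples are windows of length maxlen whose start positions are step apart and that only complete windows are used; the function's exact boundary handling may differ.

```r
# Values taken from the first example below: a sequence of 200 nucleotides,
# a model with maxlen = 20, and step = 10
seq_length <- 200
maxlen <- 20
step <- 10

# Assumed start positions of consecutive samples: 1, 11, 21, ..., 181
starts <- seq(1, seq_length - maxlen + 1, by = step)
n_samples <- length(starts)
```

Under these assumptions, a larger step yields fewer samples and therefore a smaller output.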

vocabulary

Vector of allowed characters. Characters outside vocabulary get encoded as specified in ambiguous_nuc.

batch_size

Number of samples processed in one prediction batch.

verbose

Boolean, whether to print messages.

return_states

Return predictions as data frame. Only supported for output_format "one_seq".

output_type

"h5" or "csv". If output_format`` is "by_entries_one_file", "one_pred_per_entry"can only be"h5"`.

padding

Either "none", "maxlen", "standard" or "self".

If "none", apply no padding and skip sequences that are too short.
If "maxlen", pad with maxlen number of zero vectors.
If "standard", pad with zero vectors only if sequence is shorter than maxlen. Pads to minimum size required for one prediction.
If "self", concatenate sequence with itself until sequence is long enough for one prediction. Example: if sequence is "ACGT" and maxlen is 10, make prediction for "ACGTACGTAC". Only applied if sequence is shorter than maxlen.
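
The "self" option described above can be sketched in a few lines of base R. The helper `pad_self()` is hypothetical and only reproduces the behaviour described in the text (repeat the sequence until it is long enough for one prediction, then cut to maxlen); it is not the package's internal code.

```r
# Hypothetical helper (not part of the package): concatenate a sequence
# with itself until it reaches maxlen characters
pad_self <- function(seq_string, maxlen) {
  n <- nchar(seq_string)
  stopifnot(n > 0)
  # only applied if the sequence is shorter than maxlen
  if (n >= maxlen) return(seq_string)
  # repeat often enough to reach maxlen, then trim the surplus
  reps <- ceiling(maxlen / n)
  substr(strrep(seq_string, reps), 1, maxlen)
}

padded <- pad_self("ACGT", 10)      # "ACGTACGTAC"
unchanged <- pad_self("ACGTACGT", 4) # long enough already, returned as is
```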

use_quality

Whether to use quality scores.

quality_string

String for encoding with quality scores (as used in fastq format).

mode

Either "lm" for language model or "label" for label classification.

lm_format

Either "target_right", "target_middle_lstm", "target_middle_cnn" or "wavenet".

output_dir

Directory for file output.

format

File format: "fasta", "fastq" or "rds"; use "fasta.tar.gz" or "fastq.tar.gz" for tar.gz files.

include_seq

Whether to include input sequence in h5 file.

reverse_complement_encoding

Whether to use both original sequence and reverse complement as two input sequences.
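
For readers unfamiliar with the term, the reverse complement of a sequence can be computed in base R as sketched below. The helper `reverse_complement()` is hypothetical and only shows what the second input sequence would look like; how the two inputs are passed to the model is not shown here.

```r
# Hypothetical helper (not part of the package): complement each base
# (a <-> t, c <-> g) and reverse the order
reverse_complement <- function(seq_string) {
  chars <- strsplit(seq_string, "")[[1]]
  paste(rev(chartr("acgtACGT", "tgcaTGCA", chars)), collapse = "")
}

rc <- reverse_complement("aacg")  # "cgtt"
```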

ambiguous_nuc

How to handle nucleotides outside vocabulary, either "zero", "discard", "empirical" or "equal".

If "zero", input gets encoded as zero vector.
If "equal", input gets encoded as a vector with every entry equal to 1/length(vocabulary).
If "discard", samples containing nucleotides outside vocabulary get discarded.
If "empirical", use nucleotide distribution of current file.
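
The "zero" and "equal" options can be illustrated with a small one-hot encoder in base R. This is a sketch under assumptions: `encode_one_hot()` is a hypothetical helper, it lowercases the input before matching, and it covers only two of the four options; the package's own encoding may differ in detail.

```r
# Hypothetical helper (not part of the package): one-hot encode a sequence,
# with one row per position and one column per vocabulary character
encode_one_hot <- function(seq_string, vocabulary = c("a", "c", "g", "t"),
                           ambiguous_nuc = c("zero", "equal")) {
  ambiguous_nuc <- match.arg(ambiguous_nuc)
  chars <- strsplit(tolower(seq_string), "")[[1]]
  idx <- match(chars, vocabulary)  # NA for characters outside vocabulary
  mat <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
                dimnames = list(NULL, vocabulary))
  known <- !is.na(idx)
  # set the matching column to 1 for every known character
  mat[cbind(which(known), idx[known])] <- 1
  if (ambiguous_nuc == "equal") {
    # spread the weight evenly over the vocabulary for unknown characters
    mat[!known, ] <- 1 / length(vocabulary)
  }
  mat
}

enc_zero <- encode_one_hot("acn")                           # row 3 is all 0
enc_equal <- encode_one_hot("acn", ambiguous_nuc = "equal") # row 3 is all 0.25
```

With "discard", a sample containing "n" would be dropped instead of encoded, and with "empirical" the row would hold the nucleotide frequencies of the current file rather than equal weights.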

...

Further arguments for sequence encoding with seq_encoding_label.

Value

If return_states = TRUE, returns a list of model predictions and the positions of the corresponding sequences. If additionally include_seq = TRUE, the list also contains the sequence strings. If return_states = FALSE, returns nothing and only writes output to file(s).

Examples

if (FALSE) { # reticulate::py_module_available("tensorflow")
# make prediction for single sequence and write to h5 file
model <- create_model_lstm_cnn(maxlen = 20, layer_lstm = 8, layer_dense = 2, verbose = FALSE)
vocabulary <- c("a", "c", "g", "t")
sequence <- paste(sample(vocabulary, 200, replace = TRUE), collapse = "")
output_file <- tempfile(fileext = ".h5")
predict_model(output_format = "one_seq", model = model, step = 10,
              sequence = sequence, filename = output_file, mode = "label")

# make prediction for fasta file with multiple entries, write output to separate h5 files
fasta_path <- tempfile(fileext = ".fasta")
create_dummy_data(file_path = fasta_path, num_files = 1,
                  num_seq = 5, seq_length = 100,
                  write_to_file_path = TRUE)
model <- create_model_lstm_cnn(maxlen = 20, layer_lstm = 8, layer_dense = 2, verbose = FALSE)
output_dir <- tempfile()
dir.create(output_dir)
predict_model(output_format = "by_entry", model = model, step = 10, verbose = FALSE,
              output_dir = output_dir, mode = "label", path_input = fasta_path)
list.files(output_dir)
}