
Make prediction for nucleotide sequence or entries in fasta/fastq file
Source: R/predict.R (predict_model.Rd)
Removes layers (optional) from a pretrained model and calculates states of a fasta/fastq file or nucleotide sequence.
Writes states to h5 or csv file (access content of h5 output with load_prediction function).
The output_format argument controls how an input file is processed:
- If "one_seq", computes prediction for the sequence argument or a fasta/fastq file. Combines the fasta entries in a file into one sequence. This means predictor sequences can contain elements from more than one fasta entry.
- If "by_entry", will output a separate file for each fasta/fastq entry. Names of output files are: output_dir + "Nr" + i + filename + output_type, where i is the number of the fasta entry.
- If "by_entry_one_file", will store predictions for all fasta entries in one h5 file.
- If "one_pred_per_entry", will make one prediction for each entry, either by picking a random sample from long sequences or by padding short sequences.
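As a rough illustration of why "one_seq" predictor sequences can span more than one fasta entry, the following base R sketch concatenates two entries and slides a window over the result. It does not call predict_model; the entries, maxlen, step and the window start positions are assumptions chosen for illustration, and the package's exact windowing may differ.

```r
# Two hypothetical fasta entries (illustration only, not read from a file)
entries <- c(entry1 = "ACGTAC", entry2 = "GGTTCA")
maxlen <- 4  # assumed model input length
step <- 3    # assumed distance between window start positions

# "one_seq" style: combine all entries into a single sequence
combined <- paste(entries, collapse = "")

# Slide a window of length maxlen over the combined sequence
starts <- seq(1, nchar(combined) - maxlen + 1, by = step)
windows <- substring(combined, starts, starts + maxlen - 1)
print(windows)
```

The second window starts inside entry1 and ends inside entry2, which is the mixing that the "by_entry" options avoid.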
Usage
predict_model(
  model,
  output_format = "one_seq",
  layer_name = NULL,
  sequence = NULL,
  path_input = NULL,
  round_digits = NULL,
  filename = "states.h5",
  step = 1,
  vocabulary = c("a", "c", "g", "t"),
  batch_size = 256,
  verbose = TRUE,
  return_states = FALSE,
  output_type = "h5",
  padding = "none",
  use_quality = FALSE,
  quality_string = NULL,
  mode = "label",
  lm_format = "target_right",
  output_dir = NULL,
  format = "fasta",
  include_seq = FALSE,
  reverse_complement_encoding = FALSE,
  ambiguous_nuc = "zero",
  ...
)
Arguments
- model
- A keras model. 
- output_format
- Either "one_seq", "by_entry", "by_entry_one_file" or "one_pred_per_entry".
- layer_name
- Name of layer to get output from. If NULL, will use the last layer.
- sequence
- Character string. If given, path_input is ignored.
- path_input
- Path to fasta/fastq file.
- round_digits
- Number of decimal places to round predictions to.
- filename
- Filename to store states in. No file output if argument is NULL. If output_format = "by_entry", adds "nr" + "i" after name, where i is entry number.
- step
- Frequency of sampling steps, i.e. the distance between the start positions of consecutive samples.
- vocabulary
- Vector of allowed characters. Characters outside vocabulary get encoded as specified in ambiguous_nuc.
- batch_size
- Number of samples processed per batch.
- verbose
- Boolean. 
- return_states
- Return predictions as data frame. Only supported for output_format "one_seq".
- output_type
- "h5" or "csv". If output_format is "by_entry_one_file" or "one_pred_per_entry", can only be "h5".
- padding
- Either "none", "maxlen", "standard" or "self".
  - If "none", apply no padding and skip sequences that are too short.
  - If "maxlen", pad with maxlen number of zero vectors.
  - If "standard", pad with zero vectors only if sequence is shorter than maxlen. Pads to minimum size required for one prediction.
  - If "self", concatenate sequence with itself until sequence is long enough for one prediction. Example: if sequence is "ACGT" and maxlen is 10, make prediction for "ACGTACGTAC". Only applied if sequence is shorter than maxlen.
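The "self" option can be sketched in a few lines of base R. The helper pad_self below is hypothetical and written only for illustration; it is not the package's implementation, and it works on the raw string rather than on the encoded input.

```r
# Hypothetical helper mimicking padding = "self" (not the package's code)
pad_self <- function(sequence, maxlen) {
  stopifnot(nchar(sequence) > 0)
  # Only sequences shorter than maxlen are padded
  if (nchar(sequence) >= maxlen) return(sequence)
  # Repeat the sequence often enough, then cut to exactly maxlen characters
  reps <- ceiling(maxlen / nchar(sequence))
  substr(strrep(sequence, reps), 1, maxlen)
}

pad_self("ACGT", 10)  # "ACGTACGTAC"
```

A sequence that already has at least maxlen characters is returned unchanged, matching the note that "self" is only applied to short sequences.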
 
- use_quality
- Whether to use quality scores. 
- quality_string
- String for encoding with quality scores (as used in fastq format). 
- mode
- Either "lm" for language model or "label" for label classification.
- lm_format
- Either "target_right", "target_middle_lstm", "target_middle_cnn" or "wavenet".
- output_dir
- Directory for file output.
- format
- File format: "fasta", "fastq", "rds", or "fasta.tar.gz", "fastq.tar.gz" for tar.gz files.
- include_seq
- Whether to include input sequence in h5 file. 
- reverse_complement_encoding
- Whether to use both original sequence and reverse complement as two input sequences. 
- ambiguous_nuc
- How to handle nucleotides outside vocabulary, either "zero", "discard", "empirical" or "equal".
  - If "zero", input gets encoded as zero vector.
  - If "equal", input is repetition of 1/length(vocabulary).
  - If "discard", samples containing nucleotides outside vocabulary get discarded.
  - If "empirical", use nucleotide distribution of current file.
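To make the "zero" and "equal" options concrete, here is a minimal one-hot encoding sketch in base R. The function encode_one_hot is hypothetical and not part of the package; lowercasing the input and the matrix layout (one row per position, one column per vocabulary character) are assumptions made for this illustration.

```r
# Hypothetical one-hot encoder illustrating "zero" and "equal" (not the package's code)
encode_one_hot <- function(sequence, vocabulary = c("a", "c", "g", "t"),
                           ambiguous_nuc = c("zero", "equal")) {
  ambiguous_nuc <- match.arg(ambiguous_nuc)
  chars <- strsplit(tolower(sequence), "")[[1]]
  idx <- match(chars, vocabulary)  # NA for characters outside vocabulary
  mat <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
                dimnames = list(chars, vocabulary))
  known <- !is.na(idx)
  # Set a single 1 per row for characters inside the vocabulary
  mat[cbind(which(known), idx[known])] <- 1
  if (ambiguous_nuc == "equal") {
    # Spread the weight evenly over the vocabulary for unknown characters
    mat[!known, ] <- 1 / length(vocabulary)
  }
  mat
}

encode_one_hot("acn", ambiguous_nuc = "zero")   # row for "n" is all zeros
encode_one_hot("acn", ambiguous_nuc = "equal")  # row for "n" is 0.25 everywhere
```

The "discard" and "empirical" options are not covered here, since they depend on the whole sample or file rather than on a single position.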
 
- ...
- Further arguments for sequence encoding with seq_encoding_label.
Value
If return_states = TRUE, returns a list of model predictions and the positions of the corresponding sequences.
If additionally include_seq = TRUE, the list also contains the sequence strings.
If return_states = FALSE, returns nothing and only writes output to file(s).
Examples
if (FALSE) { # reticulate::py_module_available("tensorflow")
# make prediction for single sequence and write to h5 file
model <- create_model_lstm_cnn(maxlen = 20, layer_lstm = 8, layer_dense = 2, verbose = FALSE)
vocabulary <- c("a", "c", "g", "t")
sequence <- paste(sample(vocabulary, 200, replace = TRUE), collapse = "")
output_file <- tempfile(fileext = ".h5")
predict_model(output_format = "one_seq", model = model, step = 10,
              sequence = sequence, filename = output_file, mode = "label")
# make prediction for fasta file with multiple entries, write output to separate h5 files
fasta_path <- tempfile(fileext = ".fasta")
create_dummy_data(file_path = fasta_path, num_files = 1,
                  num_seq = 5, seq_length = 100,
                  write_to_file_path = TRUE)
model <- create_model_lstm_cnn(maxlen = 20, layer_lstm = 8, layer_dense = 2, verbose = FALSE)
output_dir <- tempfile()
dir.create(output_dir)
predict_model(output_format = "by_entry", model = model, step = 10, verbose = FALSE,
              output_dir = output_dir, mode = "label", path_input = fasta_path)
list.files(output_dir)
}