
Make prediction for nucleotide sequence or entries in fasta/fastq file
Source: R/predict.R
Optionally removes layers from a pretrained model and computes states for a nucleotide sequence or a fasta/fastq file.
Writes states to an h5 or csv file (access the content of h5 output with the load_prediction function).
There are several options for processing an input file:
If "one_seq", computes prediction for the sequence argument or a fasta/fastq file. Combines the fasta entries in a file into one sequence. This means predictor sequences can contain elements from more than one fasta entry.
If "by_entry", will output a separate file for each fasta/fastq entry. Names of output files are: output_dir + "Nr" + i + filename + output_type, where i is the number of the fasta entry.
If "by_entry_one_file", will store predictions for all fasta entries in one h5 file.
If "one_pred_per_entry", will make one prediction for each entry, either by picking a random sample from long sequences or by padding short sequences.
Usage
predict_model(
model,
output_format = "one_seq",
layer_name = NULL,
sequence = NULL,
path_input = NULL,
round_digits = NULL,
filename = "states.h5",
step = 1,
vocabulary = c("a", "c", "g", "t"),
batch_size = 256,
verbose = TRUE,
return_states = FALSE,
output_type = "h5",
padding = "none",
use_quality = FALSE,
quality_string = NULL,
mode = "label",
lm_format = "target_right",
output_dir = NULL,
format = "fasta",
include_seq = FALSE,
reverse_complement_encoding = FALSE,
ambiguous_nuc = "zero",
...
)
Arguments
- model
A keras model.
- output_format
Either "one_seq", "by_entry", "by_entry_one_file" or "one_pred_per_entry".
- layer_name
Name of layer to get output from. If NULL, will use the last layer.
- sequence
Character string, ignores path_input if argument given.
- path_input
Path to fasta file.
- round_digits
Number of decimal places.
- filename
Filename to store states in. No file output if argument is NULL. If output_format = "by_entry", adds "nr" + i after the name, where i is the entry number.
- step
Frequency of sampling steps.
- vocabulary
Vector of allowed characters. Characters outside vocabulary get encoded as specified in ambiguous_nuc.
- batch_size
Number of samples processed in one batch.
- verbose
Boolean.
- return_states
Return predictions as data frame. Only supported for output_format "one_seq".
- output_type
"h5" or "csv". If output_format is "by_entry_one_file" or "one_pred_per_entry", can only be "h5".
- padding
Either "none", "maxlen", "standard" or "self".
If "none", apply no padding and skip sequences that are too short.
If "maxlen", pad with maxlen number of zero vectors.
If "standard", pad with zero vectors only if sequence is shorter than maxlen. Pads to minimum size required for one prediction.
If "self", concatenate sequence with itself until sequence is long enough for one prediction. Example: if sequence is "ACGT" and maxlen is 10, make prediction for "ACGTACGTAC". Only applied if sequence is shorter than maxlen.
- use_quality
Whether to use quality scores.
- quality_string
String for encoding with quality scores (as used in fastq format).
- mode
Either "lm" for language model or "label" for label classification.
- lm_format
Either "target_right", "target_middle_lstm", "target_middle_cnn" or "wavenet".
- output_dir
Directory for file output.
- format
File format: "fasta", "fastq" or "rds", or "fasta.tar.gz" / "fastq.tar.gz" for tar.gz files.
- include_seq
Whether to include input sequence in h5 file.
- reverse_complement_encoding
Whether to use both original sequence and reverse complement as two input sequences.
- ambiguous_nuc
How to handle nucleotides outside vocabulary, either "zero", "discard", "empirical" or "equal".
If "zero", input gets encoded as zero vector.
If "equal", input is repetition of 1/length(vocabulary).
If "discard", samples containing nucleotides outside vocabulary get discarded.
If "empirical", use nucleotide distribution of current file.
- ...
Further arguments for sequence encoding with seq_encoding_label.
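The padding = "self" and ambiguous_nuc options described above can be illustrated with a minimal base R sketch. The helpers pad_self and one_hot are hypothetical names introduced here for illustration; they are not part of the package, and the package's internal encoding may differ in detail (for example, lowercasing the input is an assumption of this sketch).

```r
# Hypothetical helpers for illustration only; not the package's implementation.

# padding = "self": repeat the sequence until it reaches maxlen, then truncate.
pad_self <- function(seq, maxlen) {
  stopifnot(nchar(seq) > 0)
  if (nchar(seq) >= maxlen) return(seq)   # only applied to short sequences
  reps <- ceiling(maxlen / nchar(seq))
  substr(strrep(seq, reps), 1, maxlen)
}

# One-hot encode a sequence: rows are positions, columns are vocabulary entries.
# Only the "zero" and "equal" strategies are sketched here.
one_hot <- function(seq, vocabulary = c("a", "c", "g", "t"),
                    ambiguous_nuc = c("zero", "equal")) {
  ambiguous_nuc <- match.arg(ambiguous_nuc)
  chars <- strsplit(tolower(seq), "")[[1]]  # assumption: case-insensitive input
  idx <- match(chars, vocabulary)           # NA for characters outside vocabulary
  mat <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
                dimnames = list(NULL, vocabulary))
  known <- !is.na(idx)
  mat[cbind(which(known), idx[known])] <- 1
  if (ambiguous_nuc == "equal") {
    # every vocabulary entry gets the same weight at ambiguous positions
    mat[!known, ] <- 1 / length(vocabulary)
  }
  mat
}

padded <- pad_self("ACGT", maxlen = 10)
enc_zero <- one_hot("acnt", ambiguous_nuc = "zero")
enc_equal <- one_hot("acnt", ambiguous_nuc = "equal")
```

Here padded reproduces the "ACGTACGTAC" example from the padding description, and the third row of enc_zero and enc_equal (the "n" position) shows the difference between the two strategies.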
Value
If return_states = TRUE, returns a list of model predictions and the positions of the corresponding sequences.
If additionally include_seq = TRUE, the list also contains the sequence strings.
If return_states = FALSE, returns nothing and only writes output to file(s).
Examples
if (FALSE) { # reticulate::py_module_available("tensorflow")
  # make prediction for single sequence and write to h5 file
  model <- create_model_lstm_cnn(maxlen = 20, layer_lstm = 8, layer_dense = 2, verbose = FALSE)
  vocabulary <- c("a", "c", "g", "t")
  sequence <- paste(sample(vocabulary, 200, replace = TRUE), collapse = "")
  output_file <- tempfile(fileext = ".h5")
  predict_model(output_format = "one_seq", model = model, step = 10,
                sequence = sequence, filename = output_file, mode = "label")

  # make prediction for fasta file with multiple entries, write output to separate h5 files
  fasta_path <- tempfile(fileext = ".fasta")
  create_dummy_data(file_path = fasta_path, num_files = 1,
                    num_seq = 5, seq_length = 100,
                    write_to_file_path = TRUE)
  model <- create_model_lstm_cnn(maxlen = 20, layer_lstm = 8, layer_dense = 2, verbose = FALSE)
  output_dir <- tempfile()
  dir.create(output_dir)
  predict_model(output_format = "by_entry", model = model, step = 10, verbose = FALSE,
                output_dir = output_dir, mode = "label", path_input = fasta_path)
  list.files(output_dir)
}
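The following base R sketch illustrates why, with output_format = "one_seq", a predictor sequence can contain elements from more than one fasta entry. It needs neither a model nor TensorFlow. The values of maxlen and step are made up, and treating step as the distance between consecutive window starts is an assumption of this sketch rather than a statement about the package internals.

```r
# Illustration only; not the package's internal code.
entries <- c(entry1 = "acgtacgtacgt", entry2 = "ttttggggcccc")  # two entries, 12 nt each
maxlen <- 8   # hypothetical model input length
step <- 4     # assumed: distance between consecutive window starts

# "one_seq": concatenate all entries, then slide a window over the result
combined <- paste(entries, collapse = "")
starts <- seq(1, nchar(combined) - maxlen + 1, by = step)
ends <- starts + maxlen - 1
windows <- substring(combined, starts, ends)

# A window spans two entries if it starts in entry1 and ends in entry2
boundary <- nchar(entries[["entry1"]])
spans_entries <- starts <= boundary & ends > boundary

result <- data.frame(start = starts, window = windows, spans_entries = spans_entries)
print(result)
```

In this toy input the window starting at position 9 covers the end of entry1 and the start of entry2. Processing each entry on its own, as the "by_entry" options do, would avoid such mixed windows.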