Make prediction for nucleotide sequence or entries in fasta/fastq file
Source: R/predict.R
predict_model.Rd
Optionally removes layers from a pretrained model and computes the states for a fasta/fastq file or a nucleotide sequence.
Writes the states to an h5 or csv file (access the content of the h5 output with the load_prediction function).
There are several options for processing an input file:
If "one_seq", computes predictions for the sequence argument or the fasta/fastq file. Combines the fasta entries in the file into one sequence, so predictor sequences can contain elements from more than one fasta entry.
If "by_entry", writes a separate output file for each fasta/fastq entry. Names of the output files are: output_dir + "Nr" + i + filename + output_type, where i is the number of the fasta entry.
If "by_entry_one_file", stores the predictions for all fasta entries in one h5 file.
If "one_pred_per_entry", makes one prediction per entry, either by picking a random sample from long sequences or by padding short sequences.
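The difference between "one_seq" and "by_entry" can be illustrated with a small base R sketch that does not call predict_model. It assumes that a sample is a window of maxlen characters and that consecutive windows start step positions apart; window_starts and the toy entries are hypothetical names used only for illustration, not part of the package. With the "one_seq"-like processing the entries are concatenated first, so some windows span two entries; with the "by_entry"-like processing each entry is windowed on its own.

```r
# Illustration only: base R, no deepG functions are called.
# Assumption: a sample is a window of `maxlen` characters and consecutive
# windows start `step` positions apart.
window_starts <- function(seq_len, maxlen, step) {
  if (seq_len < maxlen) return(integer(0))  # too short for one sample
  seq(1, seq_len - maxlen + 1, by = step)
}

entries <- c(entry1 = "acgtacgtac", entry2 = "ttgcaa")  # toy fasta entries
maxlen <- 4
step <- 2

# "one_seq"-like: concatenate the entries, then window the combined sequence
combined <- paste(entries, collapse = "")
starts_one_seq <- window_starts(nchar(combined), maxlen, step)
windows_one_seq <- substring(combined, starts_one_seq, starts_one_seq + maxlen - 1)

# "by_entry"-like: window each entry separately
windows_by_entry <- lapply(entries, function(s) {
  st <- window_starts(nchar(s), maxlen, step)
  substring(s, st, st + maxlen - 1)
})

print(windows_one_seq)
print(windows_by_entry)
```

In this toy input the window "actt" only appears in the concatenated case, because it covers the end of the first entry and the start of the second.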
Usage
predict_model(
  model,
  output_format = "one_seq",
  layer_name = NULL,
  sequence = NULL,
  path_input = NULL,
  round_digits = NULL,
  filename = "states.h5",
  step = 1,
  vocabulary = c("a", "c", "g", "t"),
  batch_size = 256,
  verbose = TRUE,
  return_states = FALSE,
  output_type = "h5",
  padding = "none",
  use_quality = FALSE,
  quality_string = NULL,
  mode = "label",
  lm_format = "target_right",
  output_dir = NULL,
  format = "fasta",
  include_seq = FALSE,
  reverse_complement_encoding = FALSE,
  ambiguous_nuc = "zero",
  ...
)
Arguments
- model
A keras model.
- output_format
Either "one_seq", "by_entry", "by_entry_one_file" or "one_pred_per_entry".
- layer_name
Name of layer to get output from. If NULL, will use the last layer.
- sequence
Character string, ignores path_input if argument given.
- path_input
Path to fasta file.
- round_digits
Number of decimal places to round predictions to.
- filename
Filename to store states in. No file output if argument is NULL. If output_format = "by_entry", adds "nr" + "i" after name, where i is the entry number.
- step
Frequency of sampling steps.
- vocabulary
Vector of allowed characters. Characters outside vocabulary get encoded as specified in ambiguous_nuc.
- batch_size
Number of samples processed per batch.
- verbose
Boolean.
- return_states
Return predictions as data frame. Only supported for output_format "one_seq".
- output_type
"h5" or "csv". If output_format is "by_entry_one_file" or "one_pred_per_entry", can only be "h5".
- padding
Either "none", "maxlen", "standard" or "self".
If "none", apply no padding and skip sequences that are too short.
If "maxlen", pad with maxlen number of zero vectors.
If "standard", pad with zero vectors only if sequence is shorter than maxlen. Pads to minimum size required for one prediction.
If "self", concatenate sequence with itself until sequence is long enough for one prediction. Example: if sequence is "ACGT" and maxlen is 10, make prediction for "ACGTACGTAC". Only applied if sequence is shorter than maxlen.
- use_quality
Whether to use quality scores.
- quality_string
String for encoding with quality scores (as used in fastq format).
- mode
Either "lm" for language model or "label" for label classification.
- lm_format
Either "target_right", "target_middle_lstm", "target_middle_cnn" or "wavenet".
- output_dir
Directory for file output.
- format
File format: "fasta", "fastq", "rds", or "fasta.tar.gz", "fastq.tar.gz" for tar.gz files.
- include_seq
Whether to include input sequence in h5 file.
- reverse_complement_encoding
Whether to use both original sequence and reverse complement as two input sequences.
- ambiguous_nuc
How to handle nucleotides outside vocabulary, either "zero", "discard", "empirical" or "equal".
If "zero", input gets encoded as zero vector.
If "equal", input is repetition of 1/length(vocabulary).
If "discard", samples containing nucleotides outside vocabulary get discarded.
If "empirical", use nucleotide distribution of current file.
- ...
Further arguments for sequence encoding with seq_encoding_label.
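The padding = "self" option and the "zero" and "equal" settings of ambiguous_nuc can be made concrete with a base R sketch. This is not the package's implementation: pad_self and one_hot are hypothetical helper names, and the matrix layout (one row per position, one column per vocabulary character) is an assumption chosen for illustration.

```r
# Illustration only: base R re-creation of two documented behaviours.
# `pad_self` and `one_hot` are hypothetical helpers, not deepG functions.

# padding = "self": repeat the sequence until it is maxlen characters long;
# sequences that are already long enough are returned unchanged.
pad_self <- function(sequence, maxlen) {
  n <- nchar(sequence)
  stopifnot(n > 0)
  if (n >= maxlen) return(sequence)
  substr(strrep(sequence, ceiling(maxlen / n)), 1, maxlen)
}

# One-hot encoding with two of the ambiguous_nuc strategies:
# "zero"  -> characters outside the vocabulary become an all-zero row
# "equal" -> they become a row filled with 1 / length(vocabulary)
one_hot <- function(sequence, vocabulary = c("a", "c", "g", "t"),
                    ambiguous_nuc = c("zero", "equal")) {
  ambiguous_nuc <- match.arg(ambiguous_nuc)
  chars <- strsplit(tolower(sequence), "")[[1]]
  idx <- match(chars, vocabulary)  # NA for characters outside vocabulary
  m <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
              dimnames = list(NULL, vocabulary))
  known <- !is.na(idx)
  m[cbind(which(known), idx[known])] <- 1
  if (ambiguous_nuc == "equal") {
    m[!known, ] <- 1 / length(vocabulary)
  }
  m
}

print(pad_self("ACGT", 10))
print(one_hot("acng", ambiguous_nuc = "equal"))
```

The first call reproduces the "ACGTACGTAC" example given for padding = "self"; the second shows the row for "n" filled with 0.25 for a four-letter vocabulary.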
Value
If return_states = TRUE, returns a list of model predictions and the positions of the corresponding sequences.
If additionally include_seq = TRUE, the list also contains the sequence strings.
If return_states = FALSE, returns nothing and only writes output to file(s).
Examples
if (FALSE) { # reticulate::py_module_available("tensorflow")
  # make prediction for single sequence and write to h5 file
  model <- create_model_lstm_cnn(maxlen = 20, layer_lstm = 8, layer_dense = 2, verbose = FALSE)
  vocabulary <- c("a", "c", "g", "t")
  sequence <- paste(sample(vocabulary, 200, replace = TRUE), collapse = "")
  output_file <- tempfile(fileext = ".h5")
  predict_model(output_format = "one_seq", model = model, step = 10,
                sequence = sequence, filename = output_file, mode = "label")

  # make prediction for fasta file with multiple entries, write output to separate h5 files
  fasta_path <- tempfile(fileext = ".fasta")
  create_dummy_data(file_path = fasta_path, num_files = 1,
                    num_seq = 5, seq_length = 100,
                    write_to_file_path = TRUE)
  model <- create_model_lstm_cnn(maxlen = 20, layer_lstm = 8, layer_dense = 2, verbose = FALSE)
  output_dir <- tempfile()
  dir.create(output_dir)
  predict_model(output_format = "by_entry", model = model, step = 10, verbose = FALSE,
                output_dir = output_dir, mode = "label", path_input = fasta_path)
  list.files(output_dir)
}