Returns evaluation metrics such as confusion matrix, loss, AUC, AUPRC, MAE and MSE (depending on the output layer).
Usage
evaluate_model(
  path_input,
  model = NULL,
  batch_size = 100,
  step = 1,
  padding = FALSE,
  vocabulary = c("a", "c", "g", "t"),
  vocabulary_label = list(c("a", "c", "g", "t")),
  number_batches = 10,
  format = "fasta",
  target_middle = FALSE,
  mode = "lm",
  output_format = "target_right",
  ambiguous_nuc = "zero",
  evaluate_all_files = FALSE,
  verbose = TRUE,
  max_iter = 20000,
  target_from_csv = NULL,
  max_samples = NULL,
  proportion_per_seq = NULL,
  concat_seq = NULL,
  seed = 1234,
  auc = FALSE,
  auprc = FALSE,
  path_pred_list = NULL,
  exact_num_samples = NULL,
  activations = NULL,
  shuffle_file_order = FALSE,
  include_seq = FALSE,
  ...
)
Arguments
- path_input
Input directory where fasta, fastq or rds files are located.
- model
A keras model.
- batch_size
Number of samples per batch.
- step
How often to take a sample, i.e. the distance (in sequence positions) between the starts of consecutive samples.
- padding
Whether to pad sequences too short for one sample with zeros.
- vocabulary
Vector of allowed characters. Characters outside the vocabulary get encoded as specified in ambiguous_nuc.
- vocabulary_label
List of labels for targets of each output layer.
- number_batches
How many batches to evaluate.
- format
File format, "fasta", "fastq" or "rds".
- target_middle
Whether the model is a language model with separate input layers.
- mode
Either "lm" for language model or "label_header", "label_csv" or "label_folder" for label classification.
- output_format
Determines the shape of the output tensor for a language model. Either "target_right", "target_middle_lstm", "target_middle_cnn" or "wavenet". Assume a sequence "AACCGTA". The outputs correspond as follows:
"target_right": X = "AACCGT", Y = "A"
"target_middle_lstm": X = (X_1 = "AAC", X_2 = "ATG"), Y = "C" (note reversed order of X_2)
"target_middle_cnn": X = "AACGTA", Y = "C"
"wavenet": X = "AACCGT", Y = "ACCGTA"
- ambiguous_nuc
How to handle nucleotides outside vocabulary, either "zero", "discard", "empirical" or "equal".
If "zero", input gets encoded as zero vector.
If "equal", input is repetition of 1/length(vocabulary).
If "discard", samples containing nucleotides outside vocabulary get discarded.
If "empirical", use nucleotide distribution of current file.
- evaluate_all_files
Boolean, if TRUE will iterate over all files in path_input once. number_batches will be overwritten.
- verbose
Boolean.
- max_iter
Stop after max_iter iterations have failed to produce a new batch.
- target_from_csv
Path to csv file with target mapping. One column should be called "file"; the other entries in each row are the targets.
- max_samples
Maximum number of samples to use from one file. If not NULL and file has more than max_samples samples, will randomly choose a subset of max_samples samples.
- proportion_per_seq
Numerical value between 0 and 1. Proportion of sequence to take samples from (use random subsequence).
- concat_seq
Character string or NULL. If not NULL, all entries from a file get concatenated to one sequence with the concat_seq string between them. Example: if the first entry is AACC, the second entry is TTTG and concat_seq = "ZZZ", this becomes AACCZZZTTTG.
- seed
Sets seed for set.seed function for reproducible results.
- auc
Whether to include AUC metric. If output layer activation is "softmax", only possible for 2 targets. Computes the average if output layer has sigmoid activation and multiple targets.
- auprc
Whether to include AUPRC metric. If output layer activation is "softmax", only possible for 2 targets. Computes the average if output layer has sigmoid activation and multiple targets.
- path_pred_list
Path to store list of predictions (output of output layers) and corresponding true labels as rds file.
- exact_num_samples
Exact number of samples to evaluate. Use this if you want to evaluate a number of samples not divisible by batch_size. Useful if you want to evaluate a data set exactly once and already know the number of samples. Should be a vector if mode = "label_folder" (with same length as vocabulary_label) and else an integer.
- activations
List containing output formats for output layers (softmax, sigmoid or linear). If NULL, will be estimated from model.
- shuffle_file_order
Logical, whether to go through files randomly or sequentially.
- include_seq
Whether to store input. Only applies if
path_pred_list
is notNULL
.- ...
Further generator options. See get_generator.
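The output_format splits listed under Arguments can be reproduced with a few lines of base R. The sketch below is only an illustration and not the package's implementation: the helper name split_sample is hypothetical, and it assumes the target of the two "target_middle" formats is the character at position ceiling(n / 2), which matches the "AACCGTA" example.

```r
# Hypothetical helper (not part of the package): split one sequence into
# input X and target Y following the output_format conventions above.
split_sample <- function(seq, output_format = "target_right") {
  n <- nchar(seq)
  mid <- ceiling(n / 2)  # assumed position of the middle target
  rev_string <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")
  switch(output_format,
    # predict the last character from everything before it
    target_right = list(x = substr(seq, 1, n - 1), y = substr(seq, n, n)),
    # two inputs: left context, and right context in reversed order
    target_middle_lstm = list(
      x = list(x_1 = substr(seq, 1, mid - 1),
               x_2 = rev_string(substr(seq, mid + 1, n))),
      y = substr(seq, mid, mid)),
    # one input: left and right context concatenated, middle removed
    target_middle_cnn = list(
      x = paste0(substr(seq, 1, mid - 1), substr(seq, mid + 1, n)),
      y = substr(seq, mid, mid)),
    # target is the input shifted by one position
    wavenet = list(x = substr(seq, 1, n - 1), y = substr(seq, 2, n)),
    stop("unknown output_format: ", output_format)
  )
}

# Inspect the split for the sequence used in the argument description
str(split_sample("AACCGTA", "target_middle_lstm"))
```

Calling the helper with each of the four formats on "AACCGTA" gives the X and Y values shown in the output_format description.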
Examples
if (FALSE) { # reticulate::py_module_available("tensorflow")
  # create dummy data
  path_input <- tempfile()
  dir.create(path_input)
  create_dummy_data(file_path = path_input,
                    num_files = 3,
                    seq_length = 11,
                    num_seq = 5,
                    vocabulary = c("a", "c", "g", "t"))
  # create model
  model <- create_model_lstm_cnn(layer_lstm = 8, layer_dense = 4, maxlen = 10, verbose = FALSE)
  # evaluate
  evaluate_model(path_input = path_input,
                 model = model,
                 step = 11,
                 vocabulary = c("a", "c", "g", "t"),
                 vocabulary_label = list(c("a", "c", "g", "t")),
                 mode = "lm",
                 output_format = "target_right",
                 evaluate_all_files = TRUE,
                 verbose = FALSE)
}
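As a further illustration of the ambiguous_nuc options described under Arguments, the following base R sketch one-hot encodes a sequence and fills the rows of characters outside the vocabulary according to the chosen option. It is a minimal sketch, not the package's encoder: the helper name encode_seq and its file_freq argument (standing in for the nucleotide distribution of the current file) are hypothetical, and lower-casing the input is an assumption.

```r
# Hypothetical helper (not part of the package): one-hot encode a sequence and
# treat characters outside the vocabulary as described for ambiguous_nuc.
encode_seq <- function(seq, vocabulary = c("a", "c", "g", "t"),
                       ambiguous_nuc = "zero", file_freq = NULL) {
  chars <- strsplit(tolower(seq), "")[[1]]  # assumption: matching ignores case
  idx <- match(chars, vocabulary)           # NA for characters outside vocabulary
  unknown <- is.na(idx)
  # "discard": drop the whole sample if it contains an unknown character
  if (ambiguous_nuc == "discard" && any(unknown)) return(NULL)
  if (ambiguous_nuc == "empirical") stopifnot(length(file_freq) == length(vocabulary))
  mat <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
                dimnames = list(NULL, vocabulary))
  mat[cbind(which(!unknown), idx[!unknown])] <- 1  # regular one-hot rows
  fill <- switch(ambiguous_nuc,
    zero = rep(0, length(vocabulary)),
    equal = rep(1 / length(vocabulary), length(vocabulary)),
    empirical = file_freq,  # assumed to be supplied by the caller
    discard = NULL,
    stop("unknown ambiguous_nuc: ", ambiguous_nuc))
  # write the fill vector into every row that holds an unknown character
  if (any(unknown)) mat[unknown, ] <- rep(fill, each = sum(unknown))
  mat
}

# "n" is outside the vocabulary; with "equal" its row becomes 0.25 everywhere
encode_seq("acnt", ambiguous_nuc = "equal")
```

With "zero" the same row stays all zeros, and with "discard" the call returns NULL, so the sample would not be used.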