Returns evaluation metrics such as confusion matrix, loss, AUC, AUPRC, MAE and MSE (depending on the output layer).
Usage
evaluate_model(
  path_input,
  model = NULL,
  batch_size = 100,
  step = 1,
  padding = FALSE,
  vocabulary = c("a", "c", "g", "t"),
  vocabulary_label = list(c("a", "c", "g", "t")),
  number_batches = 10,
  format = "fasta",
  target_middle = FALSE,
  mode = "lm",
  output_format = "target_right",
  ambiguous_nuc = "zero",
  evaluate_all_files = FALSE,
  verbose = TRUE,
  max_iter = 20000,
  target_from_csv = NULL,
  max_samples = NULL,
  proportion_per_seq = NULL,
  concat_seq = NULL,
  seed = 1234,
  auc = FALSE,
  auprc = FALSE,
  path_pred_list = NULL,
  exact_num_samples = NULL,
  activations = NULL,
  shuffle_file_order = FALSE,
  include_seq = FALSE,
  ...
)
Arguments
- path_input
Input directory where fasta, fastq or rds files are located.
- model
A keras model.
- batch_size
Number of samples per batch.
- step
How often to take a sample.
- padding
Whether to pad sequences too short for one sample with zeros.
- vocabulary
Vector of allowed characters. Characters outside the vocabulary are encoded as specified in ambiguous_nuc.
- vocabulary_label
List of labels for targets of each output layer.
- number_batches
How many batches to evaluate.
- format
File format, "fasta", "fastq" or "rds".
- target_middle
Whether the model is a language model with separate input layers.
- mode
Either "lm" for language model, or "label_header", "label_csv" or "label_folder" for label classification.
- output_format
Determines the shape of the output tensor for a language model. Either "target_right", "target_middle_lstm", "target_middle_cnn" or "wavenet". Assume a sequence "AACCGTA". The outputs correspond as follows:
"target_right": X = "AACCGT", Y = "A"
"target_middle_lstm": X = (X_1 = "AAC", X_2 = "ATG"), Y = "C" (note reversed order of X_2)
"target_middle_cnn": X = "AACGTA", Y = "C"
"wavenet": X = "AACCGT", Y = "ACCGTA"
- ambiguous_nuc
How to handle nucleotides outside vocabulary, either "zero", "discard", "empirical" or "equal".
If "zero", input gets encoded as zero vector.
If "equal", input is repetition of 1/length(vocabulary).
If "discard", samples containing nucleotides outside vocabulary get discarded.
If "empirical", use nucleotide distribution of current file.
- evaluate_all_files
Boolean, if TRUE will iterate over all files in path_input once. number_batches will be overwritten.
- verbose
Boolean.
- max_iter
Stop after max_iter iterations have failed to produce a new batch.
- target_from_csv
Path to csv file with target mapping. One column should be called "file" and other entries in row are the targets.
- max_samples
Maximum number of samples to use from one file. If not NULL and file has more than max_samples samples, will randomly choose a subset of max_samples samples.
- proportion_per_seq
Numerical value between 0 and 1. Proportion of sequence to take samples from (use random subsequence).
- concat_seq
Character string or NULL. If not NULL, all entries from a file get concatenated to one sequence with the concat_seq string between them. Example: if the 1st entry is AACC, the 2nd entry is TTTG and concat_seq = "ZZZ", this becomes AACCZZZTTTG.
- seed
Sets seed for set.seed function for reproducible results.
- auc
Whether to include AUC metric. If output layer activation is "softmax", only possible for 2 targets. Computes the average if output layer has sigmoid activation and multiple targets.
- auprc
Whether to include AUPRC metric. If output layer activation is "softmax", only possible for 2 targets. Computes the average if output layer has sigmoid activation and multiple targets.
- path_pred_list
Path to store list of predictions (output of output layers) and corresponding true labels as rds file.
- exact_num_samples
Exact number of samples to evaluate. Use this if you want to evaluate a number of samples not divisible by batch_size, for example to evaluate a data set exactly once when you already know the number of samples. Should be a vector if mode = "label_folder" (with the same length as vocabulary_label), otherwise an integer.
- activations
List containing output formats for output layers (softmax, sigmoid or linear). If NULL, will be estimated from model.
- shuffle_file_order
Logical, whether to go through files randomly or sequentially.
- include_seq
Whether to store input. Only applies if path_pred_list is not NULL.
- ...
Further generator options. See get_generator.
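The X/Y splits listed under output_format can be reproduced with a few lines of base R. This is only an illustrative sketch: split_example() is a hypothetical helper, not a function of the package, and choosing the middle position as ceiling(n / 2) is an assumption that matches the 7-character example but may differ from what the generator does for other lengths.

```r
# Illustration only: reproduce the documented X/Y splits for one sequence.
# split_example() is a hypothetical helper, not part of the package.
split_example <- function(s, output_format) {
  n <- nchar(s)
  mid <- ceiling(n / 2)  # assumed middle position (4 for 7 characters)
  rev_string <- function(x) paste(rev(strsplit(x, "")[[1]]), collapse = "")
  switch(output_format,
    # predict the last character from everything before it
    target_right = list(x = substr(s, 1, n - 1), y = substr(s, n, n)),
    # two inputs: left part and reversed right part; target is the middle
    target_middle_lstm = list(
      x = list(x_1 = substr(s, 1, mid - 1),
               x_2 = rev_string(substr(s, mid + 1, n))),
      y = substr(s, mid, mid)),
    # one input: left and right parts joined; target is the middle
    target_middle_cnn = list(
      x = paste0(substr(s, 1, mid - 1), substr(s, mid + 1, n)),
      y = substr(s, mid, mid)),
    # target is the input shifted by one position
    wavenet = list(x = substr(s, 1, n - 1), y = substr(s, 2, n)),
    stop("unknown output_format")
  )
}

split_example("AACCGTA", "target_middle_lstm")
```

Calling the helper with each of the four values on "AACCGTA" returns the same X and Y strings as in the output_format description.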
Examples
if (FALSE) { # reticulate::py_module_available("tensorflow")
# create dummy data
path_input <- tempfile()
dir.create(path_input)
create_dummy_data(file_path = path_input,
                  num_files = 3,
                  seq_length = 11,
                  num_seq = 5,
                  vocabulary = c("a", "c", "g", "t"))
# create model
model <- create_model_lstm_cnn(layer_lstm = 8, layer_dense = 4, maxlen = 10, verbose = FALSE)
# evaluate
evaluate_model(path_input = path_input,
               model = model,
               step = 11,
               vocabulary = c("a", "c", "g", "t"),
               vocabulary_label = list(c("a", "c", "g", "t")),
               mode = "lm",
               output_format = "target_right",
               evaluate_all_files = TRUE,
               verbose = FALSE)
}
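To make the ambiguous_nuc options more concrete, the following base R sketch one-hot encodes a short sequence and fills positions outside the vocabulary according to the chosen option. encode_sketch() is a hypothetical helper and not the package's encoder. As a simplification, "empirical" here uses the character distribution of the single sequence passed in, whereas the argument description refers to the distribution of the current file.

```r
# Hypothetical helper, not part of the package: one-hot encode a sequence
# and handle characters outside the vocabulary as described for ambiguous_nuc.
encode_sketch <- function(s, vocabulary = c("a", "c", "g", "t"),
                          ambiguous_nuc = "zero") {
  chars <- strsplit(s, "")[[1]]
  known <- chars %in% vocabulary
  # "discard": drop the whole sample if it contains an unknown character
  if (ambiguous_nuc == "discard" && !all(known)) return(NULL)
  fill <- switch(ambiguous_nuc,
    zero = rep(0, length(vocabulary)),
    equal = rep(1 / length(vocabulary), length(vocabulary)),
    # simplification: distribution of this sequence, not of a whole file
    empirical = as.numeric(table(factor(chars[known], levels = vocabulary))) /
      sum(known),
    discard = NULL,
    stop("unknown ambiguous_nuc"))
  m <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
              dimnames = list(NULL, vocabulary))
  for (i in seq_along(chars)) {
    if (known[i]) m[i, chars[i]] <- 1 else m[i, ] <- fill
  }
  m
}

# "n" is outside the vocabulary, so the last row is filled with 0.25
encode_sketch("acn", ambiguous_nuc = "equal")
```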
