Evaluates a trained model on fasta, fastq or rds files

Returns evaluation metrics such as confusion matrix, loss, AUC, AUPRC, MAE and MSE (depending on the output layer).

Usage

evaluate_model(
  path_input,
  model = NULL,
  batch_size = 100,
  step = 1,
  padding = FALSE,
  vocabulary = c("a", "c", "g", "t"),
  vocabulary_label = list(c("a", "c", "g", "t")),
  number_batches = 10,
  format = "fasta",
  target_middle = FALSE,
  mode = "lm",
  output_format = "target_right",
  ambiguous_nuc = "zero",
  evaluate_all_files = FALSE,
  verbose = TRUE,
  max_iter = 20000,
  target_from_csv = NULL,
  max_samples = NULL,
  proportion_per_seq = NULL,
  concat_seq = NULL,
  seed = 1234,
  auc = FALSE,
  auprc = FALSE,
  path_pred_list = NULL,
  exact_num_samples = NULL,
  activations = NULL,
  shuffle_file_order = FALSE,
  include_seq = FALSE,
  ...
)

Arguments

path_input

Input directory where fasta, fastq or rds files are located.

model

A keras model.

batch_size

Number of samples per batch.

step

How often to take a sample, i.e. the distance in positions between the starts of consecutive samples.
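As a rough illustration of step, the base R sketch below lists where samples would start in one sequence, assuming a window of fixed length slides along the sequence in increments of step. The helper sample_starts and its window argument are hypothetical and not part of the package.

```r
# Hypothetical helper (not a package function): start positions of windows
# of length `window`, taken every `step` positions along a sequence.
sample_starts <- function(seq_len, window, step = 1) {
  if (seq_len < window) return(numeric(0))   # sequence too short for one sample
  seq(1, seq_len - window + 1, by = step)
}

starts_dense  <- sample_starts(seq_len = 20, window = 11, step = 1)  # 10 windows
starts_sparse <- sample_starts(seq_len = 20, window = 11, step = 3)  # starts 1, 4, 7, 10
```

A larger step yields fewer, less overlapping samples from the same sequence.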

padding

Whether to pad sequences that are too short for one sample with zeros.

vocabulary

Vector of allowed characters. Characters outside the vocabulary are encoded as specified in ambiguous_nuc.

vocabulary_label

List of labels for targets of each output layer.

number_batches

How many batches to evaluate.

format

File format, "fasta", "fastq" or "rds".

target_middle

Whether the model is a language model with separate input layers.

mode

Either "lm" for language model or "label_header", "label_csv" or "label_folder" for label classification.

output_format

Determines the shape of the output tensor for a language model. Either "target_right", "target_middle_lstm", "target_middle_cnn" or "wavenet". Assume a sequence "AACCGTA". Input X and target Y then correspond as follows:

"target_right": X = "AACCGT", Y = "A"
"target_middle_lstm": X = (X_1 = "AAC", X_2 = "ATG"), Y = "C" (note reversed order of X_2)
"target_middle_cnn": X = "AACGTA", Y = "C"
"wavenet": X = "AACCGT", Y = "ACCGTA"
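The four splits listed for output_format can be reproduced with a small base R sketch. The helper split_sample is hypothetical and only mirrors the "AACCGTA" example. How the package picks the middle position for even-length samples is not stated here, so the ceiling(n / 2) choice is an assumption.

```r
# Hypothetical helper: split one sample string into input X and target Y.
split_sample <- function(s, output_format) {
  n <- nchar(s)
  mid <- ceiling(n / 2)  # middle position (assumption for even lengths)
  rev_str <- function(x) paste(rev(strsplit(x, "")[[1]]), collapse = "")
  switch(output_format,
    target_right = list(x = substr(s, 1, n - 1), y = substr(s, n, n)),
    target_middle_lstm = list(
      # second input is the right-hand context in reversed order
      x = list(substr(s, 1, mid - 1), rev_str(substr(s, mid + 1, n))),
      y = substr(s, mid, mid)),
    target_middle_cnn = list(
      x = paste0(substr(s, 1, mid - 1), substr(s, mid + 1, n)),
      y = substr(s, mid, mid)),
    wavenet = list(x = substr(s, 1, n - 1), y = substr(s, 2, n)),
    stop("unknown output_format"))
}

str(split_sample("AACCGTA", "target_middle_lstm"))
```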

ambiguous_nuc

How to handle nucleotides outside vocabulary, either "zero", "discard", "empirical" or "equal".

If "zero", input gets encoded as zero vector.
If "equal", input is a vector with every entry equal to 1/length(vocabulary).
If "discard", samples containing nucleotides outside vocabulary get discarded.
If "empirical", use nucleotide distribution of current file.
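The effect of the four ambiguous_nuc options on a one-hot encoding can be sketched in base R. The helper encode_seq is hypothetical and is not the package's encoder. In particular, the "empirical" branch uses the character frequencies of the single input string as a stand-in for the distribution of the current file.

```r
# Hypothetical helper: one-hot encode a sequence (rows = positions,
# columns = vocabulary entries) with different handling of unknown characters.
encode_seq <- function(s, vocabulary = c("a", "c", "g", "t"),
                       ambiguous_nuc = "zero") {
  chars <- strsplit(tolower(s), "")[[1]]
  idx <- match(chars, vocabulary)        # NA marks characters outside vocabulary
  unknown <- is.na(idx)
  if (ambiguous_nuc == "discard" && any(unknown)) return(NULL)
  v <- length(vocabulary)
  m <- matrix(0, nrow = length(chars), ncol = v,
              dimnames = list(NULL, vocabulary))
  m[cbind(which(!unknown), idx[!unknown])] <- 1
  fill <- switch(ambiguous_nuc,
    zero = ,
    discard = rep(0, v),
    equal = rep(1 / v, v),
    # stand-in: frequencies within this sequence, not the whole file
    empirical = tabulate(idx[!unknown], nbins = v) / sum(!unknown),
    stop("unknown ambiguous_nuc"))
  if (any(unknown)) m[unknown, ] <- rep(fill, each = sum(unknown))
  m
}

encode_seq("acnt", ambiguous_nuc = "equal")   # third row is 0.25 in every column
```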

evaluate_all_files

Boolean. If TRUE, iterates over all files in path_input once; number_batches is then overwritten.

verbose

Boolean.

max_iter

Stop after max_iter iterations have failed to produce a new batch.

target_from_csv

Path to a csv file with the target mapping. One column should be called "file"; the other entries in each row are the targets.
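A target table of this shape could be written with base R as sketched below. The file names and the column names label_a and label_b are made up for illustration. Whether the "file" entries should include directories or extensions is not specified here.

```r
# Hypothetical target table: one row per input file, a "file" column,
# and the remaining columns holding the targets for that file.
targets <- data.frame(
  file    = c("file_1.fasta", "file_2.fasta", "file_3.fasta"),
  label_a = c(1, 0, 0),
  label_b = c(0, 1, 1)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(targets, csv_path, row.names = FALSE)

# Read the table back and look up the targets of one file.
tab <- read.csv(csv_path)
target_cols <- setdiff(names(tab), "file")
file_2_targets <- unlist(tab[tab$file == "file_2.fasta", target_cols],
                         use.names = FALSE)
```

The resulting csv_path is the kind of value that would be passed as target_from_csv, presumably together with mode = "label_csv".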

max_samples

Maximum number of samples to use from one file. If not NULL and a file has more than max_samples samples, a random subset of max_samples samples is chosen.

proportion_per_seq

Numerical value between 0 and 1. Proportion of sequence to take samples from (use random subsequence).

concat_seq

Character string or NULL. If not NULL, all entries from a file are concatenated to one sequence with the concat_seq string between them. Example: if the first entry is AACC, the second entry is TTTG and concat_seq = "ZZZ", this becomes AACCZZZTTTG.
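The concatenation in this example is equivalent to the following base R call. It only illustrates the resulting string, not how the package reads files.

```r
# Entries of one file, joined with the concat_seq separator between them.
entries <- c("AACC", "TTTG")
concat_seq <- "ZZZ"
joined <- paste(entries, collapse = concat_seq)
joined  # "AACCZZZTTTG"
```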

seed

Sets seed for set.seed function for reproducible results.

auc

Whether to include AUC metric. If output layer activation is "softmax", only possible for 2 targets. Computes the average if output layer has sigmoid activation and multiple targets.

auprc

Whether to include AUPRC metric. If output layer activation is "softmax", only possible for 2 targets. Computes the average if output layer has sigmoid activation and multiple targets.
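To make the averaging over multiple sigmoid targets concrete, here is a base R sketch that computes a rank-based AUC per target column and then takes the plain mean. Both the helper auc_binary and the assumption that the average is an unweighted mean over targets are illustrative. The package may compute these metrics differently.

```r
# Rank-based (Mann-Whitney) AUC for one binary target.
auc_binary <- function(y_true, y_score) {
  n_pos <- sum(y_true == 1)
  n_neg <- sum(y_true == 0)
  r <- rank(y_score)                     # average ranks handle tied scores
  (sum(r[y_true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy labels and scores for two sigmoid outputs (columns = targets).
y_true  <- cbind(c(1, 0, 1, 0), c(0, 0, 1, 1))
y_score <- cbind(c(0.9, 0.2, 0.6, 0.4), c(0.1, 0.7, 0.8, 0.3))

per_target <- sapply(seq_len(ncol(y_true)),
                     function(j) auc_binary(y_true[, j], y_score[, j]))
mean_auc <- mean(per_target)             # 0.875 for this toy data
```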

path_pred_list

Path to store list of predictions (output of output layers) and corresponding true labels as rds file.

exact_num_samples

Exact number of samples to evaluate. Use this to evaluate a number of samples that is not divisible by batch_size, for example to evaluate a data set exactly once when the number of samples is already known. Should be a vector if mode = "label_folder" (with the same length as vocabulary_label) and an integer otherwise.
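The arithmetic behind this argument can be sketched as follows. The sample count is a made-up value, and the sketch only shows why whole batches cannot cover such a data set exactly.

```r
batch_size  <- 100
num_samples <- 1234                           # hypothetical data set size

full_batches <- num_samples %/% batch_size    # 12 complete batches
remainder    <- num_samples %% batch_size     # 34 samples left over

# 12 batches cover 1200 samples and 13 batches would need 1300,
# so no whole number of batches matches 1234 samples exactly.
covered_12 <- full_batches * batch_size
covered_13 <- (full_batches + 1) * batch_size
```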

activations

List containing the activation of each output layer ("softmax", "sigmoid" or "linear"). If NULL, will be estimated from the model.

shuffle_file_order

Logical, whether to go through files randomly or sequentially.

include_seq

Whether to store input. Only applies if path_pred_list is not NULL.

...

Further generator options. See get_generator.

Value

A list of evaluation results. Each list element corresponds to an output layer of the model.

Examples

if (FALSE) { # reticulate::py_module_available("tensorflow")
# create dummy data
path_input <- tempfile()
dir.create(path_input)
create_dummy_data(file_path = path_input,
                  num_files = 3,
                  seq_length = 11, 
                  num_seq = 5,
                  vocabulary = c("a", "c", "g", "t"))
# create model
model <- create_model_lstm_cnn(layer_lstm = 8, layer_dense = 4, maxlen = 10, verbose = FALSE)
# evaluate
evaluate_model(path_input = path_input,
  model = model,
  step = 11,
  vocabulary = c("a", "c", "g", "t"),
  vocabulary_label = list(c("a", "c", "g", "t")),
  mode = "lm",
  output_format = "target_right",
  evaluate_all_files = TRUE,
  verbose = FALSE)
  
}