Creates a transformer network for classification. The model can consist of several stacked attention blocks.
Usage
create_model_transformer(
maxlen,
vocabulary_size = 4,
embed_dim = 64,
pos_encoding = "embedding",
head_size = 4L,
num_heads = 5L,
ff_dim = 8,
dropout = 0,
n = 10000,
layer_dense = 2,
dropout_dense = NULL,
flatten_method = "flatten",
last_layer_activation = "softmax",
loss_fn = "categorical_crossentropy",
solver = "adam",
learning_rate = 0.01,
label_noise_matrix = NULL,
bal_acc = FALSE,
f1_metric = FALSE,
auc_metric = FALSE,
label_smoothing = 0,
verbose = TRUE,
model_seed = NULL,
mixed_precision = FALSE,
mirrored_strategy = NULL
)
Arguments
- maxlen
Length of predictor sequence.
- vocabulary_size
Number of unique characters in the vocabulary.
- embed_dim
Dimension of the token embedding. No embedding if set to 0. Should be used when the input is not one-hot encoded (i.e. an integer sequence).
- pos_encoding
Either "sinusoid" or "embedding". How to add positional information. If "sinusoid", sine waves of different frequencies are added to the input (see the sketch after the argument list). If "embedding", the model learns a positional embedding.
- head_size
Dimension of the attention key.
- num_heads
Number of attention heads.
- ff_dim
Units of the first dense layer after the attention block(s).
- dropout
Vector of dropout rates after attention block(s).
- n
Frequency of sine waves for positional encoding. Only applied if pos_encoding = "sinusoid".
- layer_dense
Vector specifying the number of neurons per dense layer after the last attention block.
- dropout_dense
Dropout rates for the dense layers.
- flatten_method
How to process the output of the last attention block. Can be "max_ch_first", "max_ch_last", "average_ch_first", "average_ch_last", "both_ch_first", "both_ch_last", "all", "none" or "flatten". The "average_..."/"max_..." options apply global average/max pooling, with the _ch_first/_ch_last suffix deciding along which axis to pool. "both_ch_first"/"both_ch_last" use max and average pooling together, and "all" uses all four global pooling options together. If "flatten", the output of the last attention block is flattened. If "none", no flattening is applied.
- last_layer_activation
Activation function of the output layer(s), for example "sigmoid" or "softmax".
- loss_fn
Either "categorical_crossentropy" or "binary_crossentropy". If label_noise_matrix is given, a custom "noisy_loss" is used.
- solver
Optimization method; options are "adam", "adagrad", "rmsprop" or "sgd".
- learning_rate
Learning rate for optimizer.
- label_noise_matrix
Matrix of label noise. Each row stands for one class and its entries for the proportions of labels in that class. If the first class contains 5 percent wrong labels and the second class no noise, then
label_noise_matrix <- matrix(c(0.95, 0.05, 0, 1), nrow = 2, byrow = TRUE)
- bal_acc
Whether to add balanced accuracy.
- f1_metric
Whether to add F1 metric.
- auc_metric
Whether to add AUC metric.
- label_smoothing
Float in [0, 1]. If 0, no smoothing is applied. If > 0, the loss is computed between the predicted labels and a smoothed version of the true labels, where the smoothing squeezes the labels towards 0.5. The closer the argument is to 1, the more the labels get smoothed.
- verbose
Boolean.
- model_seed
Set seed for model parameters in TensorFlow if not NULL.
- mixed_precision
Whether to use mixed precision (https://www.tensorflow.org/guide/mixed_precision).
- mirrored_strategy
Whether to use a distributed mirrored strategy. If NULL, a mirrored strategy is used only if more than one GPU is available.
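The "sinusoid" option corresponds to the standard transformer positional encoding: position pos and even embedding dimension 2i receive sin(pos / n^(2i/embed_dim)), and odd dimensions the cosine counterpart, with n as the frequency base. The following plain-R sketch illustrates that formula; it is an illustration only, and the package internals may differ in detail.

# Sketch of sinusoid positional encoding (Vaswani et al., 2017).
# Assumes an even embed_dim.
sinusoid_encoding <- function(maxlen, embed_dim, n = 10000) {
  pos <- 0:(maxlen - 1)
  i <- 0:(embed_dim / 2 - 1)
  angle <- outer(pos, n^(-2 * i / embed_dim))    # maxlen x (embed_dim / 2)
  pe <- matrix(0, nrow = maxlen, ncol = embed_dim)
  pe[, seq(1, embed_dim, by = 2)] <- sin(angle)  # even dimensions: sine
  pe[, seq(2, embed_dim, by = 2)] <- cos(angle)  # odd dimensions: cosine
  pe
}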

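Examples
A minimal call, sketched from the defaults above. It assumes the package providing create_model_transformer and a working TensorFlow installation are available; maxlen = 500, the layer_dense entries and the learning rate are illustrative choices, and the last entry of layer_dense is taken as the number of output classes.

model <- create_model_transformer(
  maxlen = 500,
  vocabulary_size = 4,
  embed_dim = 64,                # token embedding for integer-encoded input
  pos_encoding = "embedding",    # learn positional embedding
  head_size = 4L,
  num_heads = 5L,
  ff_dim = 8,
  dropout = 0.1,                 # dropout after the attention block
  flatten_method = "flatten",
  layer_dense = c(32, 2),        # one hidden dense layer, 2 output classes
  last_layer_activation = "softmax",
  loss_fn = "categorical_crossentropy",
  solver = "adam",
  learning_rate = 0.001
)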