Creates a transformer network for classification. The model can consist of several stacked attention blocks.
Usage
create_model_transformer(
  maxlen,
  vocabulary_size = 4,
  embed_dim = 64,
  pos_encoding = "embedding",
  head_size = 4L,
  num_heads = 5L,
  ff_dim = 8,
  dropout = 0,
  n = 10000,
  layer_dense = 2,
  dropout_dense = NULL,
  flatten_method = "flatten",
  last_layer_activation = "softmax",
  loss_fn = "categorical_crossentropy",
  solver = "adam",
  learning_rate = 0.01,
  label_noise_matrix = NULL,
  bal_acc = FALSE,
  f1_metric = FALSE,
  auc_metric = FALSE,
  label_smoothing = 0,
  verbose = TRUE,
  model_seed = NULL,
  mixed_precision = FALSE,
  mirrored_strategy = NULL
)

Arguments
- maxlen
- Length of predictor sequence. 
- vocabulary_size
- Number of unique characters in the vocabulary. 
- embed_dim
- Dimension of the token embedding. If set to 0, no embedding is used. An embedding should be used when the input is not one-hot encoded (i.e., an integer sequence). 
- pos_encoding
- Either "sinusoid" or "embedding". How to add positional information. If "sinusoid", sine waves of different frequencies are added to the input. If "embedding", the model learns a positional embedding. See the sketch under Details.
- head_size
- Dimension of the attention key. Together with num_heads, ff_dim and dropout, this parameterizes each attention block; see the sketch under Details. 
- num_heads
- Number of attention heads. 
- ff_dim
- Units of the first dense layer of the attention block(s). 
- dropout
- Vector of dropout rates after attention block(s). 
- n
- Frequency of the sine waves for positional encoding. Only applied if pos_encoding = "sinusoid".
- layer_dense
- Vector specifying the number of neurons per dense layer after the output of the last attention block has been processed (see flatten_method). 
- dropout_dense
- Dropout for dense layers. 
- flatten_method
- How to process the output of the last attention block. Can be "max_ch_first", "max_ch_last", "average_ch_first", "average_ch_last", "both_ch_first", "both_ch_last", "all", "none" or "flatten". The "average_*"/"max_*" options apply global average/max pooling, with _ch_first/_ch_last deciding along which axis to pool. "both_ch_first"/"both_ch_last" use max and average pooling together, and "all" combines all four global pooling options. If "flatten", the output of the last attention block is flattened; if "none", no flattening is applied. See the pooling sketch under Details.
- last_layer_activation
- Activation function of the output layer(s), for example "sigmoid" or "softmax".
- loss_fn
- Either "categorical_crossentropy" or "binary_crossentropy". If label_noise_matrix is given, a custom "noisy_loss" is used.
- solver
- Optimization method; options are "adam", "adagrad", "rmsprop" or "sgd".
- learning_rate
- Learning rate for optimizer. 
- label_noise_matrix
- Matrix of label noise. Every row stands for one true class and the columns give the proportions of observed labels for that class. For example, if the first class contains 5 percent wrong labels and the second class no noise: label_noise_matrix <- matrix(c(0.95, 0.05, 0, 1), nrow = 2, byrow = TRUE). See the worked example under Details.
- bal_acc
- Whether to add balanced accuracy. 
- f1_metric
- Whether to add F1 metric. 
- auc_metric
- Whether to add AUC metric. 
- label_smoothing
- Float in [0, 1]. If 0, no smoothing is applied. If > 0, the loss is computed between the predicted labels and a smoothed version of the true labels, where the smoothing squeezes the labels towards 0.5. The closer the argument is to 1, the more the labels get smoothed. See the sketch under Details. 
- verbose
- Boolean. 
- model_seed
- Set seed for the model parameters in TensorFlow if not NULL.
- mixed_precision
- Whether to use mixed precision (https://www.tensorflow.org/guide/mixed_precision). 
- mirrored_strategy
- Whether to use a distributed mirrored strategy. If NULL, a mirrored strategy is used only if more than one GPU is available. 
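Details

The "sinusoid" option of pos_encoding adds sine and cosine waves of different frequencies to the input, with n acting as the frequency base. Below is a minimal sketch in base R, assuming the standard transformer formulation (the package's internal implementation may differ):

# Sketch only (assumption: standard sinusoidal positional encoding).
# Returns a maxlen x embed_dim matrix that is added to the token embeddings.
sinusoid_encoding <- function(maxlen, embed_dim, n = 10000) {
  pos <- 0:(maxlen - 1)
  i <- 0:(embed_dim / 2 - 1)
  angle_rates <- 1 / n^(2 * i / embed_dim)
  angles <- outer(pos, angle_rates)                # maxlen x (embed_dim / 2)
  enc <- matrix(0, nrow = maxlen, ncol = embed_dim)
  enc[, seq(1, embed_dim, by = 2)] <- sin(angles)  # columns 1, 3, ...: sine
  enc[, seq(2, embed_dim, by = 2)] <- cos(angles)  # columns 2, 4, ...: cosine
  enc
}

pe <- sinusoid_encoding(maxlen = 100, embed_dim = 64)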
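head_size, num_heads, ff_dim and dropout together parameterize each attention block. The sketch below shows how such a block is typically assembled with the keras R package; it assumes a standard post-norm residual layout, which the package may wire differently:

library(keras)

# Sketch of one attention block (assumption: standard post-norm layout).
attention_block <- function(inputs, embed_dim, head_size = 4L,
                            num_heads = 5L, ff_dim = 8L, dropout = 0) {
  # Multi-head self-attention with a residual connection
  att <- layer_multi_head_attention(list(inputs, inputs),
                                    num_heads = num_heads,
                                    key_dim = head_size)
  att <- layer_dropout(att, rate = dropout)
  x <- layer_layer_normalization(layer_add(list(inputs, att)))
  # Position-wise feed-forward part with a residual connection
  ff <- layer_dense(x, units = ff_dim, activation = "relu")
  ff <- layer_dense(ff, units = embed_dim)  # project back to embedding size
  ff <- layer_dropout(ff, rate = dropout)
  layer_layer_normalization(layer_add(list(x, ff)))
}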
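The pooling variants of flatten_method roughly correspond to the following keras layers. This mapping is an assumption (in particular, how the "_ch_first"/"_ch_last" suffix selects the pooling axis is package-specific):

library(keras)

# Sketch: mapping some flatten_method options onto keras pooling layers.
pool_output <- function(x, flatten_method = "flatten") {
  switch(flatten_method,
    "max_ch_last"     = layer_global_max_pooling_1d(x),
    "average_ch_last" = layer_global_average_pooling_1d(x),
    "both_ch_last"    = layer_concatenate(list(
                          layer_global_max_pooling_1d(x),
                          layer_global_average_pooling_1d(x))),
    "flatten"         = layer_flatten(x),
    "none"            = x)
}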
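The label_noise_matrix example from the argument description, written out; each row is one true class and must sum to 1:

# First class: 5 percent of labels flipped to the second class.
# Second class: no noise.
label_noise_matrix <- matrix(c(0.95, 0.05,
                               0,    1),
                             nrow = 2, byrow = TRUE)
rowSums(label_noise_matrix)  # 1 1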
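For label_smoothing, the smoothing described above typically follows the standard keras formula y * (1 - alpha) + alpha / K for K classes (an assumption here); for two classes this squeezes the labels towards 0.5:

# Sketch of label smoothing (assumption: standard keras behaviour).
smooth_labels <- function(y, alpha) {
  y * (1 - alpha) + alpha / ncol(y)  # squeeze one-hot labels towards 1/K
}

y <- matrix(c(1, 0,
              0, 1), nrow = 2, byrow = TRUE)  # one-hot labels, 2 classes
smooth_labels(y, alpha = 0.1)  # 0.95 / 0.05 instead of 1 / 0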
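Examples

A hypothetical call with illustrative, untuned settings; all arguments are described above:

# Transformer classifier for sequences of length 500 over a 4-letter
# vocabulary, using a learned token and positional embedding.
model <- create_model_transformer(
  maxlen = 500,
  vocabulary_size = 4,
  embed_dim = 32,             # learn a 32-dimensional token embedding
  pos_encoding = "embedding",
  head_size = 16L,
  num_heads = 4L,
  ff_dim = 64,
  dropout = 0.1,
  flatten_method = "average_ch_last",
  layer_dense = c(32, 2),     # hidden dense layer, then 2 output neurons
  last_layer_activation = "softmax",
  loss_fn = "categorical_crossentropy",
  solver = "adam",
  learning_rate = 0.001
)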
