PerFeatureTransformer ¶
Bases: Module
A Transformer model that processes one token per feature and sample.
This model extends the standard Transformer architecture to operate on a per-feature basis: each feature is processed separately while still leveraging the power of self-attention.
The model consists of an encoder, decoder, and optional components such as a feature positional embedding and a separate decoder for each feature.
__init__ ¶
__init__(
encoder: Module = encoders.SequentialEncoder(
encoders.LinearInputEncoderStep(
1,
DEFAULT_EMSIZE,
in_keys=["main"],
out_keys=["output"],
)
),
ninp: int = DEFAULT_EMSIZE,
nhead: int = 4,
nhid: int = DEFAULT_EMSIZE * 4,
nlayers: int = 10,
y_encoder: Module = encoders.SequentialEncoder(
encoders.NanHandlingEncoderStep(),
encoders.LinearInputEncoderStep(
2,
DEFAULT_EMSIZE,
out_keys=["output"],
in_keys=["main", "nan_indicators"],
),
),
decoder_dict: Dict[
str, Tuple[Optional[Type[Module]], int]
] = {"standard": (None, 1)},
init_method: Optional[str] = None,
activation: str = "gelu",
recompute_layer: bool = False,
min_num_layers_layer_dropout: Optional[int] = None,
repeat_same_layer: bool = False,
dag_pos_enc_dim: int = 0,
features_per_group: int = 1,
feature_positional_embedding: Optional[str] = None,
zero_init: bool = True,
use_separate_decoder: bool = False,
nlayers_decoder: Optional[int] = None,
use_encoder_compression_layer: bool = False,
precomputed_kv: Optional[
List[Union[Tensor, Tuple[Tensor, Tensor]]]
] = None,
cache_trainset_representation: bool = False,
**layer_kwargs: Any
)
Parameters:
Name | Type | Description | Default
---|---|---|---
encoder | Module | An nn.Module that takes in a batch of sequences of inputs and returns a tensor of shape (seq_len, batch_size, ninp). | SequentialEncoder(LinearInputEncoderStep(1, DEFAULT_EMSIZE, in_keys=['main'], out_keys=['output']))
ninp | int | Input dimension, also called the embedding dimension. | DEFAULT_EMSIZE
nhead | int | Number of attention heads. | 4
nhid | int | Hidden dimension in the MLP layers. | DEFAULT_EMSIZE * 4
nlayers | int | Number of layers, each consisting of a multi-head attention layer and an MLP layer. | 10
y_encoder | Module | An nn.Module that takes in a batch of sequences of targets and returns a tensor of shape (seq_len, batch_size, ninp). | SequentialEncoder(NanHandlingEncoderStep(), LinearInputEncoderStep(2, DEFAULT_EMSIZE, out_keys=['output'], in_keys=['main', 'nan_indicators']))
decoder_dict | Dict[str, Tuple[Optional[Type[Module]], int]] | | {'standard': (None, 1)}
activation | str | An activation function, e.g. "gelu" or "relu". | 'gelu'
recompute_layer | bool | If True, the transformer layers are recomputed on each forward pass during training. This saves memory. | False
min_num_layers_layer_dropout | Optional[int] | If set, enables layer dropout: during training, the last layers are dropped at random, keeping at least this many layers. | None
repeat_same_layer | bool | If True, a single layer is shared across all layer positions. This saves memory on weights. | False
features_per_group | int | If > 1, features are grouped into groups of this size and attention is applied across these groups. | 1
feature_positional_embedding | Optional[str] | There is a risk that the model confuses features with each other. This positional embedding is added to the features to help the model distinguish them. We recommend setting this to "subspace". | None
zero_init | bool | If True, the last sublayer of each attention and MLP layer is initialized with zeros, so the layers start out as identity functions. | True
use_separate_decoder | bool | If True, the decoder is separate from the encoder. | False
nlayers_decoder | Optional[int] | If use_separate_decoder is True, this is the number of layers in the decoder. The default is to use ⅓ of the layers for the decoder and ⅔ for the encoder. | None
use_encoder_compression_layer | bool | Experimental. | False
precomputed_kv | Optional[List[Union[Tensor, Tuple[Tensor, Tensor]]]] | Experimental. | None
layer_kwargs | Any | | {}
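A minimal construction sketch follows. The import path is an assumption and may differ between packagings of the model code; the keyword arguments are taken from the signature above, and the concrete values are only illustrative.

```python
# Construction sketch for PerFeatureTransformer. NOTE: the import path below
# is an assumption and may differ in your installation; the keyword arguments
# come from the documented signature.
from tabpfn.model.transformer import PerFeatureTransformer  # assumed path

model = PerFeatureTransformer(
    nhead=4,
    nlayers=10,
    features_per_group=1,                     # one feature per token group
    feature_positional_embedding="subspace",  # recommended above
    zero_init=True,                           # layers start as identity functions
)
```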
forward ¶
Performs a forward pass through the model.
This method supports multiple calling conventions:
- model(train_x, train_y, test_x, **kwargs)
- model((x, y), **kwargs)
- model((style, x, y), **kwargs)
Parameters:
Name | Type | Description | Default
---|---|---|---
train_x | | The input data for the training set. | required
train_y | | The target data for the training set. | required
test_x | | The input data for the test set. | required
x | | The input data. | required
y | | The target data. | required
style | | The style vector. | required
single_eval_pos | | The position to evaluate at. | required
only_return_standard_out | | Whether to only return the standard output. | required
data_dags | | The data DAGs for each example. | required
categorical_inds | | The indices of categorical features. | required
freeze_kv | | Whether to freeze the key and value weights. | required
Returns:
Type | Description
---|---
| The output of the model, which can be a tensor or a dictionary of tensors.
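The sketch below illustrates the first two calling conventions, continuing with the model constructed earlier. The tensor shapes are assumptions: inputs are laid out sequence-first, matching the (seq_len, batch_size, ninp) convention of the encoders documented above.

```python
# Forward-pass sketch, continuing with `model` from the construction example.
# The tensor shapes here are assumptions chosen to match the sequence-first
# (seq_len, batch_size, ...) convention of the encoders.
import torch

seq_len, batch_size, num_features = 60, 1, 4
train_len = 50

x = torch.randn(seq_len, batch_size, num_features)
y = torch.randn(train_len, batch_size)  # targets are only given for the training part

# Convention 1: explicit train/test split.
out = model(x[:train_len], y, x[train_len:])

# Convention 2: a single (x, y) tuple plus single_eval_pos marking the split.
out = model((x, y), single_eval_pos=train_len)
```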
PerFeatureEncoderLayer ¶
Bases: Module
Transformer encoder layer that processes each feature block separately.
This layer consists of multi-head attention between features, multi-head attention between items, and feedforward neural networks (MLPs). It supports various configurations and optimization options.
Parameters:
Name | Type | Description | Default
---|---|---|---
d_model | int | The dimensionality of the input and output embeddings. | required
nhead | int | The number of attention heads. | required
dim_feedforward | Optional[int] | The dimensionality of the feedforward network. Default is None (2 * d_model). | None
activation | str | The activation function to use in the MLPs. Default is "relu". | 'relu'
layer_norm_eps | float | The epsilon value for layer normalization. Default is 1e-5. | 1e-05
pre_norm | bool | Whether to apply layer normalization before or after the attention and MLPs. Default is False. | False
device | Optional[device] | The device to use for the layer parameters. Default is None. | None
dtype | Optional[dtype] | The data type to use for the layer parameters. Default is None. | None
recompute_attn | bool | Whether to recompute attention during backpropagation. Default is False. | False
second_mlp | bool | Whether to include a second MLP in the layer. Default is False. | False
layer_norm_with_elementwise_affine | bool | Whether to use elementwise affine parameters in layer normalization. Default is False. | False
zero_init | bool | Whether to initialize the output of the MLPs to zero. Default is False. | False
save_peak_mem_factor | Optional[int] | The factor to save peak memory; only effective with post-norm. Default is None. | None
attention_between_features | bool | Whether to apply attention between feature blocks. Default is True. | True
multiquery_item_attention | bool | Whether to use multiquery attention for items. Default is False. | False
multiquery_item_attention_for_test_set | bool | Whether to use multiquery attention for the test set. Default is False. | False
attention_init_gain | float | The gain value for initializing attention parameters. Default is 1.0. | 1.0
d_k | Optional[int] | The dimensionality of the query and key vectors. Default is None (d_model // nhead). | None
d_v | Optional[int] | The dimensionality of the value vectors. Default is None (d_model // nhead). | None
precomputed_kv | Union[None, Tensor, Tuple[Tensor, Tensor]] | Precomputed key-value pairs for attention. Default is None. | None
__init__ ¶
__init__(
d_model: int,
nhead: int,
dim_feedforward: Optional[int] = None,
activation: str = "relu",
layer_norm_eps: float = 1e-05,
pre_norm: bool = False,
device: Optional[device] = None,
dtype: Optional[dtype] = None,
recompute_attn: bool = False,
second_mlp: bool = False,
layer_norm_with_elementwise_affine: bool = False,
zero_init: bool = False,
save_peak_mem_factor: Optional[int] = None,
attention_between_features: bool = True,
multiquery_item_attention: bool = False,
multiquery_item_attention_for_test_set: bool = False,
two_sets_of_queries: bool = False,
attention_init_gain: float = 1.0,
d_k: Optional[int] = None,
d_v: Optional[int] = None,
precomputed_kv: Union[
None, Tensor, Tuple[Tensor, Tensor]
] = None,
) -> None
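A construction sketch for a single layer follows. The import path is an assumption; the arguments are taken from the signature above and the values are illustrative.

```python
# Construction sketch for PerFeatureEncoderLayer. NOTE: the import path is an
# assumption; the arguments come from the documented signature.
from tabpfn.model.layer import PerFeatureEncoderLayer  # assumed path

layer = PerFeatureEncoderLayer(
    d_model=128,                      # illustrative embedding size
    nhead=4,
    dim_feedforward=None,             # defaults to 2 * d_model
    activation="gelu",
    pre_norm=False,
    attention_between_features=True,  # attend across feature blocks
)
```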
forward ¶
forward(
state: Tensor,
single_eval_pos: Optional[int] = None,
cache_trainset_representation: bool = False,
att_src: Optional[Tensor] = None,
) -> Tensor
Pass the input through the encoder layer.
Parameters:
Name | Type | Description | Default
---|---|---|---
state | Tensor | The transformer state passed as input to the layer, of shape (batch_size, num_items, num_feature_blocks, d_model). | required
single_eval_pos | Optional[int] | The position from which on everything is treated as the test set. Default is None. | None
cache_trainset_representation | bool | Whether to cache the trainset representation. If single_eval_pos is set (> 0 and not None), create a cache of the trainset KV; this may require a lot of memory. Otherwise, use the cached KV representations for inference. Default is False. | False
att_src | Optional[Tensor] | The tensor to attend to from the final layer of the encoder, of shape (batch_size, num_train_items, num_feature_blocks, d_model). This is not currently supported together with multiquery_item_attention_for_test_set or cache_trainset_representation, although combining them would be possible. Default is None. | None
Returns:
Type | Description
---|---
Tensor | The transformer state passed through the encoder layer.
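The sketch below runs the layer constructed above on a random state tensor. The state shape follows the table above; the concrete sizes are illustrative assumptions.

```python
# Forward-pass sketch, continuing with `layer` from the construction example.
# The state shape follows the documented convention:
# (batch_size, num_items, num_feature_blocks, d_model).
import torch

batch_size, num_items, num_feature_blocks, d_model = 1, 60, 5, 128
state = torch.randn(batch_size, num_items, num_feature_blocks, d_model)

# Everything from item position 50 onwards is treated as the test set.
out = layer(state, single_eval_pos=50)
assert out.shape == state.shape  # the layer preserves the state shape
```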