PerFeatureTransformer

Bases: Module

A Transformer model that processes one token per feature and sample.

This model extends the standard Transformer architecture to operate on a per-feature basis. It allows for processing each feature separately while still leveraging the power of self-attention.

The model consists of an encoder, decoder, and optional components such as a feature positional embedding and a separate decoder for each feature.
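For orientation, here is a minimal construction sketch. The import path is an assumption (the class is assumed to come from a TabPFN-style package), not part of this API; all arguments are left at the defaults documented below.

```python
# Minimal sketch: build the model with its documented defaults.
# NOTE: the import path is an assumption; adjust it to wherever
# PerFeatureTransformer lives in your code base.
from tabpfn.model.transformer import PerFeatureTransformer  # assumed path

model = PerFeatureTransformer()  # default encoder, y_encoder, 10 layers, 4 heads
print(sum(p.numel() for p in model.parameters()))  # inspect the parameter count
```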

__init__

__init__(
    encoder: Module = encoders.SequentialEncoder(
        encoders.LinearInputEncoderStep(
            1,
            DEFAULT_EMSIZE,
            in_keys=["main"],
            out_keys=["output"],
        )
    ),
    ninp: int = DEFAULT_EMSIZE,
    nhead: int = 4,
    nhid: int = DEFAULT_EMSIZE * 4,
    nlayers: int = 10,
    y_encoder: Module = encoders.SequentialEncoder(
        encoders.NanHandlingEncoderStep(),
        encoders.LinearInputEncoderStep(
            2,
            DEFAULT_EMSIZE,
            out_keys=["output"],
            in_keys=["main", "nan_indicators"],
        ),
    ),
    decoder_dict: Dict[
        str, Tuple[Optional[Type[Module]], int]
    ] = {"standard": (None, 1)},
    init_method: Optional[str] = None,
    activation: str = "gelu",
    recompute_layer: bool = False,
    min_num_layers_layer_dropout: Optional[int] = None,
    repeat_same_layer: bool = False,
    dag_pos_enc_dim: int = 0,
    features_per_group: int = 1,
    feature_positional_embedding: Optional[str] = None,
    zero_init: bool = True,
    use_separate_decoder: bool = False,
    nlayers_decoder: Optional[int] = None,
    use_encoder_compression_layer: bool = False,
    precomputed_kv: Optional[
        List[Union[Tensor, Tuple[Tensor, Tensor]]]
    ] = None,
    cache_trainset_representation: bool = False,
    **layer_kwargs: Any
)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `encoder` | `Module` | A `nn.Module` that takes a batch of sequences of inputs and returns a tensor of shape (seq_len, batch_size, ninp). | `SequentialEncoder(LinearInputEncoderStep(1, DEFAULT_EMSIZE, in_keys=['main'], out_keys=['output']))` |
| `ninp` | `int` | Input dimension, also called the embedding dimension. | `DEFAULT_EMSIZE` |
| `nhead` | `int` | Number of attention heads. | `4` |
| `nhid` | `int` | Hidden dimension in the MLP layers. | `DEFAULT_EMSIZE * 4` |
| `nlayers` | `int` | Number of layers, each consisting of a multi-head attention layer and an MLP layer. | `10` |
| `y_encoder` | `Module` | A `nn.Module` that takes a batch of sequences of outputs and returns a tensor of shape (seq_len, batch_size, ninp). | `SequentialEncoder(NanHandlingEncoderStep(), LinearInputEncoderStep(2, DEFAULT_EMSIZE, out_keys=['output'], in_keys=['main', 'nan_indicators']))` |
| `decoder_dict` | `Dict[str, Tuple[Optional[Type[Module]], int]]` | | `{'standard': (None, 1)}` |
| `activation` | `str` | An activation function, e.g. `"gelu"` or `"relu"`. | `'gelu'` |
| `recompute_layer` | `bool` | If True, the transformer layers are recomputed on each forward pass during training. This is useful to save memory. | `False` |
| `min_num_layers_layer_dropout` | `Optional[int]` | If set, enables layer dropout: during training, the last layers are dropped at random, keeping at least this many layers. | `None` |
| `repeat_same_layer` | `bool` | If True, the same layer instance is reused for all layers, which saves memory on weights. | `False` |
| `features_per_group` | `int` | If > 1, features are grouped into groups of this size and attention is applied across groups. | `1` |
| `feature_positional_embedding` | `Optional[str]` | There is a risk that our models confuse features with each other. This positional embedding is added to the features to help the model distinguish them. We recommend setting this to `"subspace"`. | `None` |
| `zero_init` | `bool` | If True, the last sublayer of each attention and MLP layer is initialized with zeros, so the layers start out as identity functions. | `True` |
| `use_separate_decoder` | `bool` | If True, the decoder is separate from the encoder. | `False` |
| `nlayers_decoder` | `Optional[int]` | If `use_separate_decoder` is True, the number of layers in the decoder. The default is to use one third of the layers for the decoder and two thirds for the encoder. | `None` |
| `use_encoder_compression_layer` | `bool` | Experimental. | `False` |
| `precomputed_kv` | `Optional[List[Union[Tensor, Tuple[Tensor, Tensor]]]]` | Experimental. | `None` |
| `layer_kwargs` | `Any` | | `{}` |
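The sketch below combines several of the parameters above into a non-default configuration (grouped features, a subspace feature positional embedding, and a separate decoder). The import paths, the value used for `DEFAULT_EMSIZE`, and the meaning of the encoder step's first positional argument are assumptions.

```python
# Sketch: a customized PerFeatureTransformer.
# Import paths and the EMSIZE value are assumptions.
from tabpfn.model import encoders                            # assumed path
from tabpfn.model.transformer import PerFeatureTransformer   # assumed path

EMSIZE = 192  # stand-in for DEFAULT_EMSIZE (assumed value)

model = PerFeatureTransformer(
    encoder=encoders.SequentialEncoder(
        encoders.LinearInputEncoderStep(
            # Assumption: the first argument is the per-group input width,
            # mirroring the default of 1 when features_per_group == 1.
            2, EMSIZE, in_keys=["main"], out_keys=["output"],
        )
    ),
    ninp=EMSIZE,
    nhead=4,
    nhid=EMSIZE * 4,
    nlayers=10,
    features_per_group=2,                     # attend across groups of 2 features
    feature_positional_embedding="subspace",  # recommended above
    use_separate_decoder=True,                # ~1/3 of the layers go to the decoder
    zero_init=True,
)
```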

forward

forward(*args, **kwargs)

Performs a forward pass through the model.

This method supports multiple calling conventions:

- model(train_x, train_y, test_x, **kwargs)
- model((x, y), **kwargs)
- model((style, x, y), **kwargs)

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| `train_x` | The input data for the training set. | required |
| `train_y` | The target data for the training set. | required |
| `test_x` | The input data for the test set. | required |
| `x` | The input data. | required |
| `y` | The target data. | required |
| `style` | The style vector. | required |
| `single_eval_pos` | The position to evaluate at. | required |
| `only_return_standard_out` | Whether to only return the standard output. | required |
| `data_dags` | The data DAGs for each example. | required |
| `categorical_inds` | The indices of categorical features. | required |
| `freeze_kv` | Whether to freeze the key and value weights. | required |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` or `Dict[str, Tensor]` | The output of the model, which can be a tensor or a dictionary of tensors. |
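A forward-pass sketch using the first calling convention follows. The (seq_len, batch_size, num_features) input layout is an assumption inferred from the encoder contract above, and the import path is likewise assumed.

```python
# Sketch: forward pass with the model(train_x, train_y, test_x) convention.
# Shapes are assumptions: (seq_len, batch_size, num_features) for inputs,
# (seq_len, batch_size) for targets.
import torch
from tabpfn.model.transformer import PerFeatureTransformer  # assumed path

model = PerFeatureTransformer()    # defaults as documented above

train_x = torch.randn(100, 1, 5)   # 100 training rows, batch of 1, 5 features
train_y = torch.randn(100, 1)      # training targets
test_x = torch.randn(20, 1, 5)     # 20 test rows

with torch.no_grad():
    out = model(train_x, train_y, test_x, only_return_standard_out=True)

# The output is a tensor or a dict of tensors, depending on the configured decoders.
print(out.shape if torch.is_tensor(out) else {k: v.shape for k, v in out.items()})
```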

PerFeatureEncoderLayer

Bases: Module

Transformer encoder layer that processes each feature block separately.

This layer consists of multi-head attention between features, multi-head attention between items, and feedforward neural networks (MLPs). It supports various configurations and optimization options.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `d_model` | `int` | The dimensionality of the input and output embeddings. | required |
| `nhead` | `int` | The number of attention heads. | required |
| `dim_feedforward` | `Optional[int]` | The dimensionality of the feedforward network. None means 2 * d_model. | `None` |
| `activation` | `str` | The activation function to use in the MLPs. | `'relu'` |
| `layer_norm_eps` | `float` | The epsilon value for layer normalization. | `1e-05` |
| `pre_norm` | `bool` | Whether to apply layer normalization before (True) or after (False) the attention and MLPs. | `False` |
| `device` | `Optional[device]` | The device to use for the layer parameters. | `None` |
| `dtype` | `Optional[dtype]` | The data type to use for the layer parameters. | `None` |
| `recompute_attn` | `bool` | Whether to recompute attention during backpropagation. | `False` |
| `second_mlp` | `bool` | Whether to include a second MLP in the layer. | `False` |
| `layer_norm_with_elementwise_affine` | `bool` | Whether to use elementwise affine parameters in layer normalization. | `False` |
| `zero_init` | `bool` | Whether to initialize the output of the MLPs to zero. | `False` |
| `save_peak_mem_factor` | `Optional[int]` | The factor to save peak memory; only effective with post-norm. | `None` |
| `attention_between_features` | `bool` | Whether to apply attention between feature blocks. | `True` |
| `multiquery_item_attention` | `bool` | Whether to use multiquery attention for items. | `False` |
| `multiquery_item_attention_for_test_set` | `bool` | Whether to use multiquery attention for the test set. | `False` |
| `attention_init_gain` | `float` | The gain value for initializing attention parameters. | `1.0` |
| `d_k` | `Optional[int]` | The dimensionality of the query and key vectors. None means d_model // nhead. | `None` |
| `d_v` | `Optional[int]` | The dimensionality of the value vectors. None means d_model // nhead. | `None` |
| `precomputed_kv` | `Union[None, Tensor, Tuple[Tensor, Tensor]]` | Precomputed key-value pairs for attention. | `None` |
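A construction sketch using the parameters above; the import path and the concrete dimensions are assumptions.

```python
# Sketch: constructing a single PerFeatureEncoderLayer.
# The import path is an assumption; the arguments are documented above.
from tabpfn.model.layer import PerFeatureEncoderLayer  # assumed path

layer = PerFeatureEncoderLayer(
    d_model=192,
    nhead=4,
    dim_feedforward=2 * 192,  # mirrors the documented default of 2 * d_model
    activation="gelu",
    pre_norm=True,            # apply layer norm before attention and MLPs
    second_mlp=True,          # include a second MLP in the layer
    zero_init=True,           # initialize the MLP outputs to zero
)
```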

__init__

__init__(
    d_model: int,
    nhead: int,
    dim_feedforward: Optional[int] = None,
    activation: str = "relu",
    layer_norm_eps: float = 1e-05,
    pre_norm: bool = False,
    device: Optional[device] = None,
    dtype: Optional[dtype] = None,
    recompute_attn: bool = False,
    second_mlp: bool = False,
    layer_norm_with_elementwise_affine: bool = False,
    zero_init: bool = False,
    save_peak_mem_factor: Optional[int] = None,
    attention_between_features: bool = True,
    multiquery_item_attention: bool = False,
    multiquery_item_attention_for_test_set: bool = False,
    two_sets_of_queries: bool = False,
    attention_init_gain: float = 1.0,
    d_k: Optional[int] = None,
    d_v: Optional[int] = None,
    precomputed_kv: Union[
        None, Tensor, Tuple[Tensor, Tensor]
    ] = None,
) -> None

forward

forward(
    state: Tensor,
    single_eval_pos: Optional[int] = None,
    cache_trainset_representation: bool = False,
    att_src: Optional[Tensor] = None,
) -> Tensor

Pass the input through the encoder layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `state` | `Tensor` | The transformer state passed as input to the layer, of shape (batch_size, num_items, num_feature_blocks, d_model). | required |
| `single_eval_pos` | `Optional[int]` | The position from which on everything is treated as the test set. | `None` |
| `cache_trainset_representation` | `bool` | Whether to cache the trainset representation. If `single_eval_pos` is set (> 0 and not None), create a cache of the trainset KV; this may require a lot of memory. Otherwise, use the cached KV representations for inference. | `False` |
| `att_src` | `Optional[Tensor]` | The tensor to attend to from the final layer of the encoder, of shape (batch_size, num_train_items, num_feature_blocks, d_model). This does not currently work together with `multiquery_item_attention_for_test_set` or `cache_trainset_representation`, although combining them would be possible. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | The transformer state passed through the encoder layer. |
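Finally, a forward sketch with the documented state shape; the import path and the concrete sizes are assumptions.

```python
# Sketch: push a transformer state of shape
# (batch_size, num_items, num_feature_blocks, d_model) through the layer.
import torch

from tabpfn.model.layer import PerFeatureEncoderLayer  # assumed path

layer = PerFeatureEncoderLayer(d_model=192, nhead=4)

state = torch.randn(2, 64, 6, 192)      # batch of 2, 64 items, 6 feature blocks
out = layer(state, single_eval_pos=48)  # items from position 48 on are the test set
print(out.shape)                        # expected to match state.shape
```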