PerFeatureTransformer ¶
Bases: Module
A Transformer model that processes one token per feature and sample.
This model extends the standard Transformer architecture to operate on a per-feature basis: each feature is processed separately while still leveraging the power of self-attention.
The model consists of an encoder, decoder, and optional components such as a feature positional embedding and a separate decoder for each feature.
__init__ ¶
__init__(
encoder: Module = encoders.SequentialEncoder(
encoders.LinearInputEncoderStep(
1,
DEFAULT_EMSIZE,
in_keys=["main"],
out_keys=["output"],
)
),
ninp: int = DEFAULT_EMSIZE,
nhead: int = 4,
nhid: int = DEFAULT_EMSIZE * 4,
nlayers: int = 10,
y_encoder: Module = encoders.SequentialEncoder(
encoders.NanHandlingEncoderStep(),
encoders.LinearInputEncoderStep(
2,
DEFAULT_EMSIZE,
out_keys=["output"],
in_keys=["main", "nan_indicators"],
),
),
decoder_dict: Dict[
str, Tuple[Optional[Type[Module]], int]
] = {"standard": (None, 1)},
init_method: Optional[str] = None,
activation: str = "gelu",
recompute_layer: bool = False,
min_num_layers_layer_dropout: Optional[int] = None,
repeat_same_layer: bool = False,
dag_pos_enc_dim: int = 0,
features_per_group: int = 1,
feature_positional_embedding: Optional[str] = None,
zero_init: bool = True,
use_separate_decoder: bool = False,
nlayers_decoder: Optional[int] = None,
use_encoder_compression_layer: bool = False,
precomputed_kv: Optional[
List[Union[Tensor, Tuple[Tensor, Tensor]]]
] = None,
cache_trainset_representation: bool = False,
**layer_kwargs: Any
)
Parameters:
Name | Type | Description | Default
---|---|---|---
encoder | Module | An nn.Module that takes in a batch of sequences of inputs and returns a tensor of shape (seq_len, batch_size, ninp). | SequentialEncoder(LinearInputEncoderStep(1, DEFAULT_EMSIZE, in_keys=['main'], out_keys=['output']))
ninp | int | Input dimension, also called the embedding dimension. | DEFAULT_EMSIZE
nhead | int | Number of attention heads. | 4
nhid | int | Hidden dimension in the MLP layers. | DEFAULT_EMSIZE * 4
nlayers | int | Number of layers, each consisting of a multi-head attention layer and an MLP layer. | 10
y_encoder | Module | An nn.Module that takes in a batch of sequences of targets and returns a tensor of shape (seq_len, batch_size, ninp). | SequentialEncoder(NanHandlingEncoderStep(), LinearInputEncoderStep(2, DEFAULT_EMSIZE, out_keys=['output'], in_keys=['main', 'nan_indicators']))
decoder_dict | Dict[str, Tuple[Optional[Type[Module]], int]] | | {'standard': (None, 1)}
activation | str | An activation function, e.g. "gelu" or "relu". | 'gelu'
recompute_layer | bool | If True, the transformer layers are recomputed on each forward pass during training. This saves memory. | False
min_num_layers_layer_dropout | Optional[int] | If set, enables layer dropout: during training, the last layers are dropped at random, keeping at least this many layers. | None
repeat_same_layer | bool | If True, a single layer is shared across all layer positions. This saves memory on weights. | False
features_per_group | int | If > 1, features are grouped into groups of this size and attention is applied across these groups. | 1
feature_positional_embedding | Optional[str] | There is a risk that the model confuses features with each other. This positional embedding is added to the features to help the model distinguish them. We recommend setting this to "subspace". | None
zero_init | bool | If True, the last sublayer of each attention and MLP layer is initialized with zeros, so the layers start out as identity functions. | True
use_separate_decoder | bool | If True, the decoder is separate from the encoder. | False
nlayers_decoder | Optional[int] | If use_separate_decoder is True, this is the number of layers in the decoder. The default is to use ⅓ of the layers for the decoder and ⅔ for the encoder. | None
use_encoder_compression_layer | bool | Experimental. | False
precomputed_kv | Optional[List[Union[Tensor, Tuple[Tensor, Tensor]]]] | Experimental. | None
layer_kwargs | Any | | {}
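A minimal construction sketch follows. The import path is an assumption and may differ between packagings of the model code; the keyword arguments are taken from the signature above, and the concrete values are only illustrative.

```python
# Construction sketch for PerFeatureTransformer. NOTE: the import path below
# is an assumption and may differ in your installation; the keyword arguments
# come from the documented signature.
from tabpfn.model.transformer import PerFeatureTransformer  # assumed path

model = PerFeatureTransformer(
    nhead=4,
    nlayers=10,
    features_per_group=1,                     # one feature per token group
    feature_positional_embedding="subspace",  # recommended above
    zero_init=True,                           # layers start as identity functions
)
```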
forward ¶
Performs a forward pass through the model.
This method supports multiple calling conventions:
- model(train_x, train_y, test_x, **kwargs)
- model((x, y), **kwargs)
- model((style, x, y), **kwargs)
Parameters:
Name | Type | Description | Default
---|---|---|---
train_x | | The input data for the training set. | required
train_y | | The target data for the training set. | required
test_x | | The input data for the test set. | required
x | | The input data. | required
y | | The target data. | required
style | | The style vector. | required
single_eval_pos | | The position to evaluate at. | required
only_return_standard_out | | Whether to only return the standard output. | required
data_dags | | The data DAGs for each example. | required
categorical_inds | | The indices of categorical features. | required
freeze_kv | | Whether to freeze the key and value weights. | required
Returns:
Type | Description
---|---
| The output of the model, which can be a tensor or a dictionary of tensors.
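The sketch below illustrates the first two calling conventions, continuing with the model constructed earlier. The tensor shapes are assumptions: inputs are laid out sequence-first, matching the (seq_len, batch_size, ninp) convention of the encoders documented above.

```python
# Forward-pass sketch, continuing with `model` from the construction example.
# The tensor shapes here are assumptions chosen to match the sequence-first
# (seq_len, batch_size, ...) convention of the encoders.
import torch

seq_len, batch_size, num_features = 60, 1, 4
train_len = 50

x = torch.randn(seq_len, batch_size, num_features)
y = torch.randn(train_len, batch_size)  # targets are only given for the training part

# Convention 1: explicit train/test split.
out = model(x[:train_len], y, x[train_len:])

# Convention 2: a single (x, y) tuple plus single_eval_pos marking the split.
out = model((x, y), single_eval_pos=train_len)
```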
PerFeatureEncoderLayer ¶
Bases: Module
Transformer encoder layer that processes each feature block separately.
This layer consists of multi-head attention between features, multi-head attention between items, and feedforward neural networks (MLPs). It supports various configurations and optimization options.
Parameters:
Name | Type | Description | Default
---|---|---|---
d_model | int | The dimensionality of the input and output embeddings. | required
nhead | int | The number of attention heads. | required
dim_feedforward | Optional[int] | The dimensionality of the feedforward network. Default is None (2 * d_model). | None
activation | str | The activation function to use in the MLPs. Default is "relu". | 'relu'
layer_norm_eps | float | The epsilon value for layer normalization. Default is 1e-5. | 1e-05
pre_norm | bool | Whether to apply layer normalization before or after the attention and MLPs. Default is False. | False
device | Optional[device] | The device to use for the layer parameters. Default is None. | None
dtype | Optional[dtype] | The data type to use for the layer parameters. Default is None. | None
recompute_attn | bool | Whether to recompute attention during backpropagation. Default is False. | False
second_mlp | bool | Whether to include a second MLP in the layer. Default is False. | False
layer_norm_with_elementwise_affine | bool | Whether to use elementwise affine parameters in layer normalization. Default is False. | False
zero_init | bool | Whether to initialize the output of the MLPs to zero. Default is False. | False
save_peak_mem_factor | Optional[int] | The factor to save peak memory; only effective with post-norm. Default is None. | None
attention_between_features | bool | Whether to apply attention between feature blocks. Default is True. | True
multiquery_item_attention | bool | Whether to use multiquery attention for items. Default is False. | False
multiquery_item_attention_for_test_set | bool | Whether to use multiquery attention for the test set. Default is False. | False
attention_init_gain | float | The gain value for initializing attention parameters. Default is 1.0. | 1.0
d_k | Optional[int] | The dimensionality of the query and key vectors. Default is None (d_model // nhead). | None
d_v | Optional[int] | The dimensionality of the value vectors. Default is None (d_model // nhead). | None
precomputed_kv | Union[None, Tensor, Tuple[Tensor, Tensor]] | Precomputed key-value pairs for attention. Default is None. | None
__init__ ¶
__init__(
d_model: int,
nhead: int,
dim_feedforward: Optional[int] = None,
activation: str = "relu",
layer_norm_eps: float = 1e-05,
pre_norm: bool = False,
device: Optional[device] = None,
dtype: Optional[dtype] = None,
recompute_attn: bool = False,
second_mlp: bool = False,
layer_norm_with_elementwise_affine: bool = False,
zero_init: bool = False,
save_peak_mem_factor: Optional[int] = None,
attention_between_features: bool = True,
multiquery_item_attention: bool = False,
multiquery_item_attention_for_test_set: bool = False,
two_sets_of_queries: bool = False,
attention_init_gain: float = 1.0,
d_k: Optional[int] = None,
d_v: Optional[int] = None,
precomputed_kv: Union[
None, Tensor, Tuple[Tensor, Tensor]
] = None,
) -> None
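A construction sketch for a single layer follows. The import path is an assumption; the arguments are taken from the signature above and the values are illustrative.

```python
# Construction sketch for PerFeatureEncoderLayer. NOTE: the import path is an
# assumption; the arguments come from the documented signature.
from tabpfn.model.layer import PerFeatureEncoderLayer  # assumed path

layer = PerFeatureEncoderLayer(
    d_model=128,                      # illustrative embedding size
    nhead=4,
    dim_feedforward=None,             # defaults to 2 * d_model
    activation="gelu",
    pre_norm=False,
    attention_between_features=True,  # attend across feature blocks
)
```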
forward ¶
forward(
state: Tensor,
single_eval_pos: Optional[int] = None,
cache_trainset_representation: bool = False,
att_src: Optional[Tensor] = None,
) -> Tensor
Pass the input through the encoder layer.
Parameters:
Name | Type | Description | Default
---|---|---|---
state | Tensor | The transformer state passed as input to the layer, of shape (batch_size, num_items, num_feature_blocks, d_model). | required
single_eval_pos | Optional[int] | The position from which on everything is treated as the test set. Default is None. | None
cache_trainset_representation | bool | Whether to cache the trainset representation. If single_eval_pos is set (> 0 and not None), create a cache of the trainset KV; this may require a lot of memory. Otherwise, use the cached KV representations for inference. Default is False. | False
att_src | Optional[Tensor] | The tensor to attend to from the final layer of the encoder, of shape (batch_size, num_train_items, num_feature_blocks, d_model). This is not currently supported together with multiquery_item_attention_for_test_set or cache_trainset_representation, although combining them would be possible. Default is None. | None
Returns:
Type | Description
---|---
Tensor | The transformer state passed through the encoder layer.
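The sketch below runs the layer constructed above on a random state tensor. The state shape follows the table above; the concrete sizes are illustrative assumptions.

```python
# Forward-pass sketch, continuing with `layer` from the construction example.
# The state shape follows the documented convention:
# (batch_size, num_items, num_feature_blocks, d_model).
import torch

batch_size, num_items, num_feature_blocks, d_model = 1, 60, 5, 128
state = torch.randn(batch_size, num_items, num_feature_blocks, d_model)

# Everything from item position 50 onwards is treated as the test set.
out = layer(state, single_eval_pos=50)
assert out.shape == state.shape  # the layer preserves the state shape
```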