TabPFNUnsupervisedModel

Bases: BaseEstimator

TabPFN unsupervised model for imputation, outlier detection, and synthetic data generation.

This model combines a TabPFNClassifier for categorical features and a TabPFNRegressor for numerical features to perform various unsupervised learning tasks on tabular data.

Parameters:

Name Type Description Default
tabpfn_clf

TabPFNClassifier, optional TabPFNClassifier instance for handling categorical features. If not provided, the model assumes that there are no categorical features in the data.

None
tabpfn_reg

TabPFNRegressor, optional TabPFNRegressor instance for handling numerical features. If not provided, the model assumes that there are no numerical features in the data.

None

Attributes:

Name Type Description
categorical_features

list List of indices of categorical features in the input data.

Example
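>>> # Import paths below are assumed; adjust them to your installation.
>>> from tabpfn import TabPFNClassifier, TabPFNRegressor
>>> from tabpfn_extensions.unsupervised import TabPFNUnsupervisedModel
>>>
>>> tabpfn_clf = TabPFNClassifier()
>>> tabpfn_reg = TabPFNRegressor()
>>> model = TabPFNUnsupervisedModel(tabpfn_clf, tabpfn_reg)
>>>
>>> X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
>>> model.fit(X)
>>>
>>> X_imputed = model.impute(X)
>>> X_outliers = model.outliers(X)
>>> X_synthetic = model.generate_synthetic_data(n_samples=100)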

__init__

__init__(
    tabpfn_clf: Optional[TabPFNClassifier] = None,
    tabpfn_reg: Optional[TabPFNRegressor] = None,
) -> None

Initialize the TabPFNUnsupervisedModel.

Parameters:

Name Type Description Default
tabpfn_clf

TabPFNClassifier, optional TabPFNClassifier instance for handling categorical features. If not provided, the model assumes that there are no categorical features in the data.

None
tabpfn_reg

TabPFNRegressor, optional TabPFNRegressor instance for handling numerical features. If not provided, the model assumes that there are no numerical features in the data.

None

fit

fit(X: ndarray, y: Optional[ndarray] = None) -> None

Fit the model to the input data.

Parameters:

Name Type Description Default
X

array-like of shape (n_samples, n_features) Input data to fit the model.

required
y

array-like of shape (n_samples,), optional Target values.

None

Returns:

Name Type Description
self None

TabPFNUnsupervisedModel Fitted model.

set_categorical_features

set_categorical_features(categorical_features)

Set the list of indices of categorical features in the input data.

impute

impute(
    X: tensor, t: float = 1e-09, n_permutations: int = 10
) -> tensor

Impute missing values in the input data.

Parameters:

Name Type Description Default
X

torch.Tensor of shape (n_samples, n_features) Input data with missing values encoded as np.nan.

required
t

float, default=0.000000001 Temperature for sampling from the imputation distribution. Lower values result in more deterministic imputations.

1e-09

Returns:

Type Description
tensor

torch.Tensor of shape (n_samples, n_features) Imputed data with missing values replaced.
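
The effect of the temperature t can be illustrated with a small standalone sketch. It assumes that t divides the scores before a softmax, which is the usual meaning of a sampling temperature; the library's actual sampling code may differ. The helper sample_with_temperature and the logits are hypothetical and are not part of TabPFN.

```python
import math
import random


def sample_with_temperature(logits, t, rng):
    """Sample an index from softmax(logits / t) (hypothetical helper)."""
    scaled = [v / t for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate values

# Near-zero temperature (the default of impute): the largest score
# dominates completely, so every draw picks the same index.
cold = [sample_with_temperature(logits, 1e-9, rng) for _ in range(200)]

# Temperature 1.0: draws follow the softmax distribution and vary.
warm = [sample_with_temperature(logits, 1.0, rng) for _ in range(200)]

print("distinct indices at t=1e-9:", len(set(cold)))
print("distinct indices at t=1.0:", len(set(warm)))
```

Under this assumption, the default t=1e-09 makes imputation close to deterministic, while larger values spread the draws across plausible values.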

get_embeddings

get_embeddings(
    X: tensor, per_column: bool = False
) -> tensor

Get the transformer embeddings for the test data X.

Parameters:

Name Type Description Default
X

tensor Test data to compute embeddings for.

required

Returns:

Type Description
tensor

torch.Tensor of shape (n_samples, embedding_dim)

outliers

outliers(X: tensor, n_permutations: int = 10) -> tensor

Compute a per-sample probability for outlier detection. This is the preferred implementation: the probability of each sample in X is the product of its per-feature probabilities, following the chain rule of probability. The first feature is estimated using a zero feature as input.

Parameters:

Name Type Description Default
X

tensor Samples to calculate the sample probability for, shape (n_samples, n_features).

required

Returns:

Type Description
tensor

Sample unnormalized probability for each sample in X, shape (n_samples,)
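
The chain rule factorization, p(x) = p(x_1) * p(x_2 | x_1) * ... * p(x_d | x_1, ..., x_{d-1}), can be sketched with toy conditionals. The functions p_first and p_second below are made-up stand-ins for the per-feature predictions of the fitted models, and summing logarithms is a common numerical choice rather than a statement about the library's internals.

```python
import math


def chain_rule_log_prob(sample, conditionals):
    """Sum log p(x_i | x_<i) over the features, in order."""
    log_p = 0.0
    for i, cond in enumerate(conditionals):
        # Each conditional sees the current value and all earlier features.
        log_p += math.log(cond(sample[i], sample[:i]))
    return log_p


def p_first(value, previous):
    # Hypothetical marginal for the first binary feature (no earlier features).
    return 0.9 if value == 0 else 0.1


def p_second(value, previous):
    # Hypothetical conditional: the second feature tends to copy the first.
    return 0.8 if value == previous[0] else 0.2


conditionals = [p_first, p_second]
samples = [(0, 0), (0, 1), (1, 0)]

# Exponentiate the summed log-probabilities to get one score per sample.
scores = [math.exp(chain_rule_log_prob(s, conditionals)) for s in samples]

# The sample with the lowest probability is the strongest outlier candidate.
most_unusual = samples[scores.index(min(scores))]
print(scores, most_unusual)
```

In this toy setup, low scores mark samples whose feature combinations are unlikely under the conditionals, which is how the returned probabilities would typically be read.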

generate_synthetic_data

generate_synthetic_data(
    n_samples=100, t=1.0, n_permutations=3
)

Generate synthetic data using the fitted models. Generation reuses the imputation method, applied to a matrix filled entirely with NaNs. Samples are generated feature by feature in a single pass, so samples are not dependent on each other per feature.

Parameters:

Name Type Description Default
n_samples

int, default=100 Number of synthetic samples to generate.

100
t

float, default=1.0 Temperature for sampling from the imputation distribution. Lower values result in more deterministic samples.

1.0

Returns:

Type Description

torch.Tensor of shape (n_samples, n_features) Generated synthetic data.
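
The idea of generating data by imputing an all-NaN matrix can be sketched as follows. The function toy_impute is a hypothetical stand-in that fills each column by drawing from the training values of that column; the real model would instead predict each feature with the fitted TabPFN models, so this only illustrates the control flow.

```python
import math
import random


def toy_impute(X, train, rng):
    """Fill NaN cells column by column (hypothetical stand-in imputer)."""
    filled = [row[:] for row in X]
    n_features = len(train[0])
    for j in range(n_features):
        column = [row[j] for row in train]
        for row in filled:
            if math.isnan(row[j]):
                # Toy rule: draw from the observed values of this column.
                row[j] = rng.choice(column)
    return filled


train = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
n_samples = 4

# Start from a matrix in which every entry is missing.
all_nan = [[float("nan")] * len(train[0]) for _ in range(n_samples)]

# "Imputing" the fully missing matrix yields the synthetic samples.
synthetic = toy_impute(all_nan, train, random.Random(0))
print(synthetic)
```

Each row is filled independently of the other rows, one feature at a time, which mirrors the single-pass behavior described above.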