TabPFNUnsupervisedModel
Bases: BaseEstimator
TabPFN unsupervised model for imputation, outlier detection, and synthetic data generation.
This model combines a TabPFNClassifier for categorical features and a TabPFNRegressor for numerical features to perform various unsupervised learning tasks on tabular data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tabpfn_clf | TabPFNClassifier, optional | TabPFNClassifier instance for handling categorical features. If not provided, the model assumes that there are no categorical features in the data. | None |
tabpfn_reg | TabPFNRegressor, optional | TabPFNRegressor instance for handling numerical features. If not provided, the model assumes that there are no numerical features in the data. | None |
Attributes:
Name | Type | Description |
---|---|---|
categorical_features | list | List of indices of categorical features in the input data. |
>>> tabpfn_clf = TabPFNClassifier()
>>> tabpfn_reg = TabPFNRegressor()
>>> model = TabPFNUnsupervisedModel(tabpfn_clf, tabpfn_reg)
>>>
>>> X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
>>> model.fit(X)
>>>
>>> X_imputed = model.impute(X)
>>> X_outliers = model.outliers(X)
>>> X_synthetic = model.generate_synthetic_data(n_samples=100)
__init__
__init__(
tabpfn_clf: Optional[TabPFNClassifier] = None,
tabpfn_reg: Optional[TabPFNRegressor] = None,
) -> None
Initialize the TabPFNUnsupervisedModel.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tabpfn_clf | TabPFNClassifier, optional | TabPFNClassifier instance for handling categorical features. If not provided, the model assumes that there are no categorical features in the data. | None |
tabpfn_reg | TabPFNRegressor, optional | TabPFNRegressor instance for handling numerical features. If not provided, the model assumes that there are no numerical features in the data. | None |
fit
Fit the model to the input data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like of shape (n_samples, n_features) | Input data to fit the model. | required |
y | array-like of shape (n_samples,), optional | Target values. | None |
Returns:
Name | Type | Description |
---|---|---|
self | None | TabPFNUnsupervisedModel. Fitted model. |
impute
Impute missing values in the input data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | torch.Tensor of shape (n_samples, n_features) | Input data with missing values encoded as np.nan. | required |
t | float | Temperature for sampling from the imputation distribution. Lower values result in more deterministic imputations. | 1e-09 |
Returns:
Type | Description |
---|---|
torch.Tensor of shape (n_samples, n_features) | Imputed data with missing values replaced. |
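The effect of the temperature `t` can be illustrated with a small, library-independent sketch. It assumes the common convention of raising each probability to the power `1/t` and renormalizing; whether `impute` applies `t` in exactly this way is an assumption, and `apply_temperature` and `sample_category` are hypothetical helper names, not part of the TabPFN API.

```python
import math
import random


def apply_temperature(probs, t):
    """Rescale a categorical distribution by temperature t (t > 0)."""
    # Work in log space: p_i ** (1 / t), then renormalize.
    logits = [math.log(p) / t if p > 0 else float("-inf") for p in probs]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_category(probs, t, rng):
    """Draw one category index from the temperature-scaled distribution."""
    scaled = apply_temperature(probs, t)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


probs = [0.2, 0.5, 0.3]
rng = random.Random(0)

# A very low temperature concentrates all mass on the most likely category.
near_deterministic = [sample_category(probs, 1e-9, rng) for _ in range(20)]

# A temperature of 1.0 leaves the distribution unchanged.
unchanged = apply_temperature(probs, 1.0)
```

Under this convention, the default `t` of 1e-09 makes every draw return the most likely category, which matches the statement that lower values give more deterministic imputations.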
get_embeddings
Get the transformer embeddings for the test data X.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | tensor | | required |
Returns:
Type | Description |
---|---|
tensor | torch.Tensor of shape (n_samples, embedding_dim) |
outliers
Preferred implementation for outlier detection. Calculates the sample probability for each sample in X by multiplying the probabilities of the individual features according to the chain rule of probability. The first feature has no preceding features to condition on, so it is estimated using a zero feature as input.
Parameters:
X: Samples to calculate the sample probability for, shape (n_samples, n_features).
Returns:
Type | Description |
---|---|
tensor | Unnormalized sample probability for each sample in X, shape (n_samples,). |
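The chain-rule scoring described for `outliers` can be sketched without the library. In the sketch below, `toy_conditional_density` is a hypothetical Gaussian stand-in for the per-feature predictors, and the fixed prior for the first feature is an assumption standing in for the "zero feature" input. The sketch sums log densities rather than multiplying probabilities, which is mathematically equivalent but less prone to underflow.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def toy_conditional_density(i, value, previous):
    """Hypothetical stand-in for a per-feature predictor: p(x_i | x_<i)."""
    # Feature 0 has no predecessors, so it is scored against a fixed prior.
    # Later features are assumed to be centered on the preceding feature.
    mean = 0.0 if i == 0 else previous[-1]
    return gaussian_pdf(value, mean, 1.0)


def sample_log_density(row):
    """Sum log p(x_i | x_<i) over all features (chain rule in log space)."""
    total = 0.0
    for i, value in enumerate(row):
        total += math.log(toy_conditional_density(i, value, row[:i]))
    return total


X = [[0.1, 0.2, 0.1], [0.0, -0.3, -0.2], [0.2, 6.0, 0.1]]
scores = [sample_log_density(row) for row in X]

# The row with the lowest score is the least likely under the toy model.
most_anomalous = min(range(len(X)), key=lambda k: scores[k])
```

Here the third row receives the lowest score because its second feature is far from what the toy model expects given the first. The scores are only meaningful relative to one another.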
generate_synthetic_data
Generate synthetic data using the trained models. Generation reuses the imputation method: it is given a matrix consisting entirely of NaNs and fills in every value. Each feature is generated for all samples in a single pass, so the generated samples do not depend on one another.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_samples | int | Number of synthetic samples to generate. | 100 |
t | float | Temperature for sampling from the imputation distribution. Lower values result in more deterministic samples. | 1.0 |
Returns:
Type | Description |
---|---|
torch.Tensor of shape (n_samples, n_features) | Generated synthetic data. |
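The idea of generating data by imputing an all-NaN matrix can be shown with a toy sketch. `toy_impute` and `toy_generate` are hypothetical names, and the Gaussian model inside them is an assumption used purely for illustration; the real model draws from TabPFN's predictive distributions instead.

```python
import math
import random


def toy_impute(X, rng):
    """Hypothetical imputer: fill NaNs one feature (column) at a time."""
    filled = [list(row) for row in X]
    n_features = len(filled[0])
    for j in range(n_features):
        for row in filled:
            if math.isnan(row[j]):
                # Assumed toy model: first feature ~ N(0, 1); each later
                # feature is centered on the feature before it in the row.
                mean = 0.0 if j == 0 else row[j - 1]
                row[j] = rng.gauss(mean, 1.0)
    return filled


def toy_generate(n_samples, n_features, rng):
    """Generate rows by imputing a matrix that is entirely NaN."""
    all_missing = [[math.nan] * n_features for _ in range(n_samples)]
    return toy_impute(all_missing, rng)


synthetic = toy_generate(n_samples=5, n_features=3, rng=random.Random(0))
```

Each row is filled using only its own earlier features, so no generated row depends on another. Observed (non-NaN) values passed to `toy_impute` are left untouched.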