elliptic_toolkit package
Elliptic Bitcoin Dataset Toolkit
A comprehensive Python toolkit for working with the Elliptic Bitcoin dataset, providing utilities for data loading, temporal analysis, graph neural network modeling, and evaluation.
- elliptic_toolkit.download_dataset(root: str = 'elliptic_bitcoin_dataset', raw_file_names=['elliptic_txs_features.csv', 'elliptic_txs_edgelist.csv', 'elliptic_txs_classes.csv'], force: bool = False, url: str = 'https://data.pyg.org/datasets/elliptic')[source]
Download the Elliptic Bitcoin dataset from PyTorch Geometric’s dataset repository.
- Parameters:
root (str, optional) – The root directory where the dataset will be stored. Defaults to “elliptic_bitcoin_dataset”.
raw_file_names (list, optional) – List of raw file names to download. Defaults to ['elliptic_txs_features.csv', 'elliptic_txs_edgelist.csv', 'elliptic_txs_classes.csv'].
force (bool, optional) – Whether to force re-download the dataset if it already exists. Defaults to False.
url (str, optional) – The base URL for the dataset files. Defaults to ‘https://data.pyg.org/datasets/elliptic’.
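A minimal usage sketch; per the parameter descriptions above, the download is a no-op when the files already exist unless force=True:

```python
from elliptic_toolkit import download_dataset

download_dataset(root="elliptic_bitcoin_dataset")  # skipped if files exist
download_dataset(force=True)                       # re-download regardless
```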
- elliptic_toolkit.process_dataset(folder_path: str = 'elliptic_bitcoin_dataset', features_file: str = 'elliptic_txs_features.csv', classes_file: str = 'elliptic_txs_classes.csv', edges_file: str = 'elliptic_txs_edgelist.csv')[source]
Loads, validates, and processes the Elliptic Bitcoin dataset.
- Returns:
nodes_df (pandas.DataFrame) – DataFrame with shape (203769, 167). Columns:
- 'time': discrete time step (int)
- 'feat_0' … 'feat_164': node features (float)
- 'class': node label (int: 1 for illicit, 0 for licit, -1 for unknown/missing)
The 'class' column uses -1 to indicate missing labels (transductive setting). The 'txId' column is dropped in the returned DataFrame; the row order matches the input file.
edges_df (pandas.DataFrame) – DataFrame with shape (234355, 2). Columns:
- 'txId1': source node index (int, row index in nodes_df)
- 'txId2': target node index (int, row index in nodes_df)
Each row represents a directed edge in the transaction graph, with node indices corresponding to rows in nodes_df.
Notes
All IDs in ‘edges_df’ are mapped to row indices in ‘nodes_df’.
The function performs strict validation on shapes, unique values, and label distribution.
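A short sketch of loading and filtering; the tuple unpacking is an assumption based on the Returns section above:

```python
from elliptic_toolkit import download_dataset, process_dataset

download_dataset()                      # ensure the raw CSVs are present
nodes_df, edges_df = process_dataset()  # assumed tuple return, per Returns above

print(nodes_df.shape)   # (203769, 167): time, feat_0..feat_164, class
print(edges_df.shape)   # (234355, 2): txId1, txId2 as row indices

# Keep only labeled nodes (-1 marks unknown labels).
labeled = nodes_df[nodes_df["class"] != -1]
```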
- elliptic_toolkit.temporal_split(times, test_size=0.2)[source]
- elliptic_toolkit.temporal_split(times: ndarray, test_size=0.2)
- elliptic_toolkit.temporal_split(times: Tensor, test_size=0.2)
- elliptic_toolkit.temporal_split(nodes_df: DataFrame, test_size=0.2, return_X_y=True)
Split data into temporal train/test sets based on unique time steps.
- Parameters:
times (np.ndarray, torch.Tensor, or pandas.DataFrame) – The time information or data to split. For DataFrames, must contain a ‘time’ column.
test_size (float, default=0.2) – Proportion of unique time steps to include in the test split (between 0.0 and 1.0).
- Returns:
For array/tensor input –
- train_indices, test_indices (array-like) – Indices for the training and test sets.
For DataFrame input with return_X_y=True (the default) –
- (X_train, y_train), (X_test, y_test) – tuple of tuples, where:
  - X_train (pandas.DataFrame) – Training features (all columns except 'class').
  - y_train (pandas.Series) – Training labels (the 'class' column).
  - X_test (pandas.DataFrame) – Test features (all columns except 'class').
  - y_test (pandas.Series) – Test labels (the 'class' column).
For DataFrame input with return_X_y=False –
- train_df, test_df (pandas.DataFrame) – The full training and test DataFrames, already sliced by time.
Type-specific behavior:
- np.ndarray – Uses numpy operations to split by unique time values.
- torch.Tensor – Uses torch operations to split by unique time values (no CPU/GPU transfer).
- pandas.DataFrame – Splits on the 'time' column. If return_X_y=True, unpacks X and y based on the 'class' column; otherwise returns the sliced DataFrames.
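A short sketch covering the array and DataFrame overloads; the toy data here is purely illustrative:

```python
import numpy as np
import pandas as pd

from elliptic_toolkit import temporal_split

# Array input: row indices split so the last 20% of unique time steps
# form the test set.
times = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
train_idx, test_idx = temporal_split(times, test_size=0.2)

# DataFrame input: must contain 'time' and 'class' columns.
df = pd.DataFrame({
    "time": times,
    "feat_0": np.random.randn(10),
    "class": np.random.randint(0, 2, 10),
})
(X_train, y_train), (X_test, y_test) = temporal_split(df, test_size=0.2)
train_df, test_df = temporal_split(df, test_size=0.2, return_X_y=False)
```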
- elliptic_toolkit.load_labeled_data(test_size=0.2, root='elliptic_bitcoin_dataset')[source]
Utility function to load the dataset, keep only labeled data, and split it temporally into train and test sets.
- Parameters:
test_size (float, default=0.2) – Proportion of unique time steps to include in the test split (between 0.0 and 1.0).
root (str, optional) – The root directory where the dataset is stored. Defaults to "elliptic_bitcoin_dataset".
- Returns:
(X_train, y_train), (X_test, y_test) – X_train, y_train: training features and labels; X_test, y_test: test features and labels.
- Return type:
tuple of tuples
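A minimal sketch; LogisticRegression is a stand-in estimator here, and whether X keeps its 'time' column is an assumption (see DropTime below for removing it inside a pipeline):

```python
from sklearn.linear_model import LogisticRegression

from elliptic_toolkit import load_labeled_data

(X_train, y_train), (X_test, y_test) = load_labeled_data(test_size=0.2)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```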
- class elliptic_toolkit.GNNBinaryClassifier(data, model, hidden_dim=64, num_layers=3, dropout=0.5, norm=None, jk='last', learning_rate_init=0.01, weight_decay=0.0005, balance_loss=True, max_iter=200, verbose=False, n_iter_no_change=10, tol=0.0001, device='auto', heads=None, **kwargs)[source]
Bases: ClassifierMixin, BaseEstimator
Graph Neural Network Binary Classifier with early stopping.
A scikit-learn compatible binary classifier that wraps PyTorch Geometric GNN models. Currently supports transductive, full-batch models (GCN, GAT).
The training loss is monitored and the model is considered converged if the loss does not improve for n_iter_no_change consecutive iterations by at least tol. This early stopping mechanism is always enabled, similar to MLPClassifier in scikit-learn.
- Parameters:
data (torch_geometric.data.Data) – Graph data object containing node features (x), edge indices (edge_index), and node labels (y).
model (torch.nn.Module) – The GNN model class to instantiate for training.
hidden_dim (int, default=64) – Number of hidden units in each layer.
num_layers (int, default=3) – Number of layers in the neural network.
dropout (float, default=0.5) – Dropout probability for regularization.
norm (str or Callable, default=None) – Normalization layer; forwarded to the underlying PyG model constructor.
jk (str, default='last') – Jumping Knowledge mode; forwarded to the underlying PyG model constructor.
learning_rate_init (float, default=0.01) – Initial learning rate for the Adam optimizer.
weight_decay (float, default=5e-4) – L2 regularization strength.
balance_loss (bool, default=True) – Whether to balance the loss function by weighting positive samples. If True, uses positive class weighting in BCEWithLogitsLoss based on class frequencies. If False, uses unweighted loss.
max_iter (int, default=200) – Maximum number of training iterations.
verbose (bool, default=False) – Whether to print training progress.
n_iter_no_change (int, default=10) – Number of consecutive iterations with no improvement to trigger early stopping.
tol (float, default=1e-4) – Tolerance for improvement. Training stops if loss improvement is less than this value.
device (str or torch.device, default='auto') – Device to use for computation. Can be ‘cpu’, ‘cuda’, ‘auto’, or a torch.device object. If ‘auto’, will use CUDA if available, otherwise CPU.
heads (int, default=None) – Number of attention heads for GAT models. Only applicable when model=GAT. Ignored with a warning for other model types.
**kwargs (dict) – Additional keyword arguments passed to the model constructor.
- Attributes:
loss_curve (list) – List of loss values at each training iteration.
model – The trained GNN model after calling fit.
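A hedged end-to-end sketch on a toy graph; the Data construction and index handling are illustrative assumptions, while the estimator calls follow the method signatures documented below:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCN

from elliptic_toolkit import GNNBinaryClassifier, temporal_split

# Toy transductive graph: 8 nodes, 4 features, a simple chain of edges.
x = torch.randn(8, 4)
edge_index = torch.tensor([[0, 1, 2, 3, 4, 5, 6],
                           [1, 2, 3, 4, 5, 6, 7]])
y = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
data = Data(x=x, edge_index=edge_index, y=y)

# Temporal split on per-node time steps (see temporal_split above).
times = torch.tensor([1, 1, 2, 2, 3, 3, 4, 4])
train_idx, test_idx = temporal_split(times, test_size=0.25)

clf = GNNBinaryClassifier(data, GCN, hidden_dim=32, num_layers=2, max_iter=50)
clf.fit(train_idx)                   # X is node indices into `data`
proba = clf.predict_proba(test_idx)  # ndarray of shape (n_test, 2)
pred = clf.predict(test_idx)         # 0/1 labels
```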
- __init__(data, model, hidden_dim=64, num_layers=3, dropout=0.5, norm=None, jk='last', learning_rate_init=0.01, weight_decay=0.0005, balance_loss=True, max_iter=200, verbose=False, n_iter_no_change=10, tol=0.0001, device='auto', heads=None, **kwargs)[source]
- fit(X, y=None)[source]
Fit the GNN model to the training data.
Training automatically stops when the loss stops improving for n_iter_no_change consecutive iterations, similar to MLPClassifier.
- Parameters:
X (array-like) – Indices of the training samples in the graph (passed as X for sklearn compatibility).
y (array-like, default=None) – Target values (ignored, present for sklearn compatibility).
- Returns:
self – Returns self for method chaining.
- Return type:
GNNBinaryClassifier
- Warns:
UserWarning – If training stops due to max_iter being reached without convergence.
- predict(X)[source]
Predict class labels for the given node indices.
- Parameters:
X (array-like) – Indices of the test samples in the graph.
- Returns:
predictions – Predicted class labels (0 or 1).
- Return type:
ndarray of shape (n_samples,)
- Raises:
ValueError – If the classifier has not been fitted yet.
- predict_proba(X)[source]
Predict class probabilities for the given node indices.
- Parameters:
X (array-like) – Indices of the test samples in the graph.
- Returns:
probabilities – Predicted class probabilities. First column contains probabilities for class 0, second column for class 1.
- Return type:
ndarray of shape (n_samples, 2)
- Raises:
ValueError – If the classifier has not been fitted yet.
- property classes_
The class labels known to the classifier.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → GNNBinaryClassifier
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- class elliptic_toolkit.MLPWrapper(num_layers=2, hidden_dim=16, hidden_layer_sizes=None, alpha=0.0001, learning_rate_init=0.001, batch_size='auto', max_iter=1000)[source]
Bases: MLPClassifier
Wrapper around sklearn's MLPClassifier that allows specifying the number of layers and hidden dimension directly. This is useful for hyperparameter tuning, where hyperparameters need to be independent. Some parameters of the base MLPClassifier are fixed to ensure consistent behavior:
- shuffle=False: disables shuffling to maintain temporal order.
- early_stopping=False: disables the internal train/validation split (validation-loss based early stopping) in favor of training-loss based early stopping.
- Parameters:
num_layers (int, default=2) – Number of hidden layers in the MLP.
hidden_dim (int, default=16) – Number of units in each hidden layer.
hidden_layer_sizes (tuple or None, default=None) – If provided, this overrides num_layers and hidden_dim. Should be a tuple specifying the size of each hidden layer.
alpha (float, default=0.0001) – L2 regularization term.
learning_rate_init (float, default=0.001) – Initial learning rate.
batch_size (int or 'auto', default='auto') – Size of minibatches for stochastic optimizers.
max_iter (int, default=1000) – Maximum number of iterations.
- __init__(num_layers=2, hidden_dim=16, hidden_layer_sizes=None, alpha=0.0001, learning_rate_init=0.001, batch_size='auto', max_iter=1000)[source]
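A small sketch; the claim that num_layers=3 with hidden_dim=32 is equivalent to hidden_layer_sizes=(32, 32, 32) is an assumption based on the parameter descriptions above:

```python
from sklearn.datasets import make_classification

from elliptic_toolkit import MLPWrapper

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# num_layers/hidden_dim are expanded into an MLP architecture
# (assumed equivalent to hidden_layer_sizes=(32, 32, 32)).
mlp = MLPWrapper(num_layers=3, hidden_dim=32).fit(X, y)

# An explicit tuple overrides num_layers and hidden_dim.
mlp = MLPWrapper(hidden_layer_sizes=(64, 32)).fit(X, y)
```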
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → MLPWrapper
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- set_partial_fit_request(*, classes: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') → MLPWrapper
Configure whether metadata should be requested to be passed to the partial_fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to partial_fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
classes (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the classes parameter in partial_fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in partial_fit.
- Returns:
self – The updated object.
- Return type:
MLPWrapper
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → MLPWrapper
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- set_params(**params)[source]
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter>, so that it's possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- class elliptic_toolkit.DropTime(drop=True)[source]
Bases: BaseEstimator, TransformerMixin
Transformer for dropping the ‘time’ column from a DataFrame. Useful in scikit-learn pipelines.
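A minimal pipeline sketch; StandardScaler and the MLPWrapper settings are illustrative choices, and it is an assumption here that X keeps its 'time' column up to this step:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from elliptic_toolkit import DropTime, MLPWrapper

# DropTime removes the 'time' column before the downstream steps see it.
pipe = Pipeline([
    ("drop_time", DropTime()),
    ("scale", StandardScaler()),
    ("mlp", MLPWrapper(num_layers=2, hidden_dim=16)),
])
```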
- class elliptic_toolkit.TemporalRollingCV(n_splits=5, *, test_size=None, max_train_size=None, gap=0, time_col='time')[source]
Bases: TimeSeriesSplit
Time-based cross-validation iterator that extends scikit-learn’s TimeSeriesSplit to work with data that has explicit time step values (like the Elliptic Bitcoin dataset).
This class inherits from TimeSeriesSplit and adds functionality to handle datasets where multiple samples can belong to the same time step. It maps the time step indices to actual row indices in the dataset, allowing it to be used with datasets like the Elliptic Bitcoin dataset.
This CV strategy ensures that for each fold:
1. Training data comes from earlier time periods.
2. The test set is a continuous time window following the training data.
3. Each fold expands the training window and shifts the test window forward.
- Parameters:
n_splits (int, default=5) – Number of splits to generate.
test_size (int, default=None) – Size of the test window in time steps. If None, it is calculated from n_splits.
max_train_size (int, default=None) – Maximum number of time steps to use for training. If None, all available time steps are used.
gap (int, default=0) – Number of time steps to skip between training and test sets.
time_col (str, default='time') – Name of the column containing time step information.
- split(X, y=None, groups=None)[source]
Generate indices to split data into training and test sets.
Unlike standard TimeSeriesSplit, this method works with explicit time step values and maps them to actual row indices in the dataset. This allows it to handle datasets where multiple samples can belong to the same time step.
- Parameters:
X (array-like or DataFrame) – Training data. If a DataFrame, must contain the column specified by time_col; otherwise, time values must be passed through the groups parameter.
y (array-like, optional) – Targets for the training data (ignored).
groups (array-like, optional) – Time values for each sample if X doesn't have the time column specified by time_col.
- Yields:
train_index (ndarray) – Indices of rows in the training set.
test_index (ndarray) – Indices of rows in the test set.
Notes:
The yielded indices refer to rows in the original dataset, not time steps. This makes the cross-validator compatible with scikit-learn’s model selection tools.
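A hedged sketch combining the pieces above; keeping DropTime inside the pipeline means the splitter still sees the 'time' column while the estimator never does:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

from elliptic_toolkit import DropTime, MLPWrapper, TemporalRollingCV

cv = TemporalRollingCV(n_splits=5, gap=1)  # reads the 'time' column from X

pipe = Pipeline([("drop_time", DropTime()), ("mlp", MLPWrapper())])

search = GridSearchCV(
    pipe,
    param_grid={"mlp__hidden_dim": [16, 32], "mlp__num_layers": [1, 2]},
    cv=cv,
    scoring="average_precision",
)
# X_train must keep its 'time' column (e.g. as loaded by load_labeled_data):
# search.fit(X_train, y_train)
```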
- elliptic_toolkit.plot_evals(est, X_test, y_test, y_train, *, time_steps_test=None)[source]
Generate two evaluation plots for a classifier:
1. Precision-Recall curve on the test set.
2. Rolling/cumulative AP and illicit rate by time step.
- Parameters:
est (classifier) – Trained classifier with predict_proba method.
X_test (pd.DataFrame, array-like) – Test features. Must contain a ‘time’ column unless time_steps_test is provided.
y_test (numpy.ndarray) – Test labels (binary).
y_train (numpy.ndarray) – Training labels (binary), used for reference illicit rate.
time_steps_test (numpy.ndarray, optional) – Time step values for test set. If None, will use X_test[‘time’].
- Returns:
pr_fig (matplotlib.figure.Figure) – Figure for the precision-recall curve.
temporal_fig (matplotlib.figure.Figure) – Figure for the rolling/cumulative AP and illicit rate by time step.
Notes
This function assumes arrays are numpy ndarrays. X_test may be a torch.Tensor, but est.predict_proba must return numpy arrays.
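A short usage sketch; est, X_test, y_test, and y_train stand in for a fitted classifier and the splits from load_labeled_data above, and it is assumed X_test keeps its 'time' column so time_steps_test can be omitted:

```python
from elliptic_toolkit import plot_evals

# est: any fitted classifier exposing predict_proba.
pr_fig, temporal_fig = plot_evals(est, X_test, y_test, y_train)
pr_fig.savefig("pr_curve.png")
temporal_fig.savefig("temporal_eval.png")
```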
- elliptic_toolkit.plot_marginals(cv_results, max_ticks=10)[source]
For each hyperparameter in cv_results, plot the marginal mean and standard deviation (error bar) of the test scores. The marginal mean/std for each hyperparameter value is computed by averaging, across all other hyperparameters, the per-fold mean/std (i.e., the averages of the mean_test_score and std_test_score columns).
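A hedged sketch; `search` is the fitted GridSearchCV from the TemporalRollingCV example above, and it is assumed cv_results is accepted as a DataFrame carrying the mean_test_score / std_test_score columns:

```python
import pandas as pd

from elliptic_toolkit import plot_marginals

cv_results = pd.DataFrame(search.cv_results_)
plot_marginals(cv_results, max_ticks=10)
```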
- elliptic_toolkit.parse_search_cv_logs(file_path, trim=True)[source]
Parse the hyperparameter search results. If trim is True, only return columns with more than one unique value.
- Parameters:
file_path (str) – Path to the log file containing the search output.
trim (bool, default=True) – If True, only return columns with more than one unique value.
- Returns:
res – DataFrame with hyperparameter results.
- Return type:
pandas.DataFrame
Notes
Assumes each relevant line in the log file contains 'END', the CV number as [CV x/y], and hyperparameters in the format param=value.
Example line:
[CV 1/5] END accuracy=0.95, learning_rate=0.01, num_layers=3,; acc=0.95, total time=3min
The specific regex patterns can be adjusted in the regex_map dictionary.
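A closing sketch tying the two utilities together; "search.log" is a hypothetical file of captured verbose search output, and feeding the parsed frame to plot_marginals is an assumption that it carries the score columns the plot relies on:

```python
from elliptic_toolkit import parse_search_cv_logs, plot_marginals

res = parse_search_cv_logs("search.log")  # trim=True keeps varying params only
print(res.head())

plot_marginals(res)  # assumed: parsed frame is accepted as cv_results
```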