Note

This page was generated from a Jupyter notebook.

Elliptic Bitcoin Dataset: Random Forest Example

This notebook demonstrates how to load, preprocess, and model the Elliptic Bitcoin dataset using a Random Forest classifier. It covers:

- Downloading and preparing the dataset
- Training a machine learning model with a scikit-learn pipeline
- Performing hyperparameter tuning with temporal cross-validation
- Visualizing results and evaluating model performance over time

This example serves both as documentation and as a reproducible reference for users of the elliptic_toolkit package.

[1]:

from elliptic_toolkit.dataset import download_dataset, load_labeled_data

Loading the Elliptic Bitcoin Dataset

To work with the Elliptic Bitcoin dataset, you first need to ensure the data is available locally. Use the download_dataset function to automatically download the dataset from PyTorch Geometric. The data will be saved in the elliptic_bitcoin_dataset folder by default, which will be created if it does not already exist.

If the dataset files are already present, they are not downloaded again unless you set force=True, so repeated runs reuse the local copy in the expected location.

[2]:

download_dataset()

Now that the dataset is available, you can load it into memory using the load_labeled_data utility function. This function:

- Maps the class labels as follows:
  - 1: Illicit
  - 0: Licit
  - -1: Unknown
- Maps transaction indices to row indices for easier data handling.
- Automatically performs a temporal train/test split, where by default the latest 20% of time steps are reserved for testing.

This setup ensures your data is ready for machine learning workflows, with clear class labels and a reproducible split between training and testing sets.

[3]:

(X_train, y_train), (X_test, y_test) = load_labeled_data()
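To make the temporal split concrete, here is a minimal sketch of how such a split could be derived from a time column alone. The helper name temporal_split, its test_fraction parameter, and the rounding rule are assumptions made for illustration; this is not the implementation used by load_labeled_data.

```python
import math


def temporal_split(times, test_fraction=0.2):
    """Return (train_idx, test_idx) so the latest time steps go to the test set.

    Hypothetical helper: the cut is made on unique time steps, not on rows,
    so a single time step never ends up on both sides of the split.
    """
    steps = sorted(set(times))
    # Number of distinct time steps reserved for testing (at least one).
    n_test_steps = max(1, math.ceil(len(steps) * test_fraction))
    test_steps = set(steps[-n_test_steps:])
    train_idx = [i for i, t in enumerate(times) if t not in test_steps]
    test_idx = [i for i, t in enumerate(times) if t in test_steps]
    return train_idx, test_idx


# Toy time column: ten time steps with two transactions each.
times = [t for t in range(1, 11) for _ in range(2)]
train_idx, test_idx = temporal_split(times)
print(sorted({times[i] for i in test_idx}))  # time steps held out for testing
```

With ten time steps and the default fraction, the two latest steps are held out, and every training transaction precedes every test transaction in time.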

Training a model

With the data prepared, you can now train a machine learning model using your preferred scikit-learn estimator. In this example, we use a pipeline that includes the custom DropTime transformer, which removes the time column from the features. This step is important to prevent the model from learning spurious correlations based on time.

- Select any scikit-learn-compatible model (here, a random forest is used).
- The pipeline ensures that preprocessing and modeling steps are applied consistently during fitting and prediction.

This approach helps maintain the integrity of your evaluation by avoiding data leakage from temporal information.

[4]:

import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import PrecisionRecallDisplay

from elliptic_toolkit.model_wrappers import DropTime

pipe = Pipeline([
    ("drop_time", DropTime()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42))
])

pipe.fit(X_train, y_train)

PrecisionRecallDisplay.from_estimator(pipe, X_test, y_test, name="Random Forest")
plt.show()

../_images/examples_random_forest_7_0.svg

Hyperparameter Tuning

To further improve model performance, you can perform hyperparameter tuning using cross-validation. The TemporalRollingCV cross-validator is designed for time-dependent data: it creates temporally ordered splits, similar to sklearn.model_selection.TimeSeriesSplit, but ensures that training and test sets never share a time step. This matters because each time step in this dataset aggregates many transactions, so a split made purely by row position could place transactions from the same time step on both sides.

This approach helps prevent data leakage and provides a more realistic evaluation of model performance on future, unseen data.
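The idea of splitting on time steps rather than on rows can be sketched in plain Python. The function rolling_time_splits below is hypothetical and assumes an expanding training window with equally sized test blocks, in the spirit of TimeSeriesSplit; the actual behavior and options of TemporalRollingCV may differ.

```python
def rolling_time_splits(times, n_splits=3):
    """Yield (train_idx, test_idx) pairs split on unique time steps.

    Hypothetical sketch: an expanding training window followed by a
    fixed-size block of later time steps, so no time step is ever shared
    between the training and test side of a fold.
    """
    steps = sorted(set(times))
    block = len(steps) // (n_splits + 1)  # time steps per test block
    if block == 0:
        raise ValueError("not enough distinct time steps for n_splits")
    first_test = len(steps) - n_splits * block
    for start in range(first_test, len(steps), block):
        train_steps = set(steps[:start])
        test_steps = set(steps[start:start + block])
        train_idx = [i for i, t in enumerate(times) if t in train_steps]
        test_idx = [i for i, t in enumerate(times) if t in test_steps]
        yield train_idx, test_idx


# Toy time column: eight time steps with three transactions each.
times = [t for t in range(1, 9) for _ in range(3)]
folds = list(rolling_time_splits(times, n_splits=3))
for train_idx, test_idx in folds:
    train_t = sorted({times[i] for i in train_idx})
    test_t = sorted({times[i] for i in test_idx})
    print(train_t, "->", test_t)
```

Because membership is decided per time step, each fold trains only on steps that come strictly before its test steps.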

[5]:

from sklearn.model_selection import GridSearchCV

from elliptic_toolkit.temporal_cv import TemporalRollingCV

grid = GridSearchCV(
    pipe,
    param_grid={
        "clf__n_estimators": [10, 50, 100],
    },
    cv=TemporalRollingCV(n_splits=3),
    scoring="average_precision",
    n_jobs=1,
    verbose=10
)
grid.fit(X_train, y_train)

Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV 1/3; 1/3] START clf__n_estimators=10........................................
[CV 1/3; 1/3] END .........clf__n_estimators=10;, score=0.972 total time=   0.4s
[CV 2/3; 1/3] START clf__n_estimators=10........................................
[CV 2/3; 1/3] END .........clf__n_estimators=10;, score=0.753 total time=   0.7s
[CV 3/3; 1/3] START clf__n_estimators=10........................................
[CV 3/3; 1/3] END .........clf__n_estimators=10;, score=0.927 total time=   1.1s
[CV 1/3; 2/3] START clf__n_estimators=50........................................
[CV 1/3; 2/3] END .........clf__n_estimators=50;, score=0.983 total time=   2.0s
[CV 2/3; 2/3] START clf__n_estimators=50........................................
[CV 2/3; 2/3] END .........clf__n_estimators=50;, score=0.788 total time=   3.4s
[CV 3/3; 2/3] START clf__n_estimators=50........................................
[CV 3/3; 2/3] END .........clf__n_estimators=50;, score=0.949 total time=   5.3s
[CV 1/3; 3/3] START clf__n_estimators=100.......................................
[CV 1/3; 3/3] END ........clf__n_estimators=100;, score=0.985 total time=   3.9s
[CV 2/3; 3/3] START clf__n_estimators=100.......................................
[CV 2/3; 3/3] END ........clf__n_estimators=100;, score=0.783 total time=   6.7s
[CV 3/3; 3/3] START clf__n_estimators=100.......................................
[CV 3/3; 3/3] END ........clf__n_estimators=100;, score=0.952 total time=  10.5s

[5]:

GridSearchCV(cv=TemporalRollingCV(gap=0, max_train_size=None, n_splits=3, test_size=None,
         time_col='time'),
             estimator=Pipeline(steps=[('drop_time', DropTime()),
                                       ('clf',
                                        RandomForestClassifier(n_estimators=10,
                                                               random_state=42))]),
             n_jobs=1, param_grid={'clf__n_estimators': [10, 50, 100]},
             scoring='average_precision', verbose=10)

best_estimator_: Pipeline

DropTime parameters: drop=True

RandomForestClassifier parameters: n_estimators=50, random_state=42 (all other parameters at their scikit-learn defaults)
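As a quick sanity check on the search, the per-fold scores printed in the verbose log above can be averaged by hand. The sketch below copies those scores as printed, so it works with values rounded to three decimals rather than the full-precision scores stored in grid.cv_results_.

```python
from statistics import mean

# Per-fold average-precision scores copied from the verbose log (3 decimals).
fold_scores = {
    10: [0.972, 0.753, 0.927],
    50: [0.983, 0.788, 0.949],
    100: [0.985, 0.783, 0.952],
}

mean_scores = {n: mean(scores) for n, scores in fold_scores.items()}
for n, score in mean_scores.items():
    print(f"n_estimators={n}: mean average precision = {score:.3f}")
```

From the rounded values, 50 and 100 trees are indistinguishable (both about 0.907) and clearly ahead of 10 trees (about 0.884). GridSearchCV ranks candidates on the unrounded mean scores, so it can still report a single best value, here n_estimators=50. The second fold scores markedly lower than the other two for every candidate, which is the kind of variation over time that the later evaluation plots help to examine.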

Visualizing hyperparameter search results

To better understand the impact of each hyperparameter on model performance, you can visualize the marginal effects using the plot_marginals utility function. These plots show how changes in a single hyperparameter affect the evaluation score, helping you identify which parameters are most influential and guiding further tuning decisions.

[6]:

from elliptic_toolkit.plots import plot_marginals

# Iterate over the plots produced by plot_marginals and display each one.
for marginal in plot_marginals(grid.cv_results_):
    plt.show()
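What a marginal effect means here can be shown without any plotting: average the mean test score over all candidates that share one value of a hyperparameter. The sketch below is only an illustration of that idea. The helper marginal_means is hypothetical, the second hyperparameter (clf__max_depth) and all scores are made up, and the only parts taken from scikit-learn are the "params" and "mean_test_score" keys of cv_results_.

```python
from collections import defaultdict
from statistics import mean


def marginal_means(cv_results, param):
    """Average the mean test score over all candidates sharing a value of `param`.

    Hypothetical helper, not part of elliptic_toolkit.
    """
    grouped = defaultdict(list)
    for params, score in zip(cv_results["params"], cv_results["mean_test_score"]):
        grouped[params[param]].append(score)
    return {value: mean(scores) for value, scores in grouped.items()}


# Toy stand-in for grid.cv_results_ with a 2 x 2 grid; the scores are made up.
toy_results = {
    "params": [
        {"clf__n_estimators": 10, "clf__max_depth": 5},
        {"clf__n_estimators": 10, "clf__max_depth": 10},
        {"clf__n_estimators": 50, "clf__max_depth": 5},
        {"clf__n_estimators": 50, "clf__max_depth": 10},
    ],
    "mean_test_score": [0.80, 0.84, 0.88, 0.90],
}

by_trees = marginal_means(toy_results, "clf__n_estimators")
by_depth = marginal_means(toy_results, "clf__max_depth")
print(by_trees)
print(by_depth)
```

In this toy grid the score changes more across n_estimators than across max_depth, which is the kind of comparison a marginal plot makes visible. With a single tuned hyperparameter, as in the search above, the marginal is simply the mean score per candidate.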

Model evaluation

To thoroughly evaluate your model, you can use the plot_evals utility function. This function provides both a Precision-Recall curve and a rolling evaluation plot, allowing you to:

- Assess the overall precision and recall of your model.
- Visualize how model performance changes as you test on data further away in time from the training period.
- Reference the illicit rate, which is plotted on a separate axis, to help contextualize model performance relative to the prevalence of illicit transactions over time.

This temporal evaluation is especially useful for understanding how well your model generalizes to future, unseen data and for detecting any performance degradation over time.

[7]:

from elliptic_toolkit.plots import plot_evals

# Iterate over the figures produced by plot_evals and display each one.
for fig in plot_evals(grid, X_test, y_test, y_train):
    plt.show()

../_images/examples_random_forest_13_0.svg

../_images/examples_random_forest_13_1.svg
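The per-time-step view can also be approximated by hand. The following sketch groups test transactions by time step and reports precision, recall, and the illicit rate for each step. It is an illustration under stated assumptions: the helper per_step_metrics is hypothetical, it works on hard 0/1 predictions rather than predicted probabilities, the toy values are made up, and it is not how plot_evals computes its curves.

```python
from collections import defaultdict


def per_step_metrics(times, y_true, y_pred):
    """Return {time_step: (precision, recall, illicit_rate)} for binary labels.

    Hypothetical helper: 1 marks an illicit transaction, 0 a licit one.
    Precision or recall is None when its denominator is zero.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "n": 0})
    for t, truth, pred in zip(times, y_true, y_pred):
        c = counts[t]
        c["n"] += 1
        if pred == 1 and truth == 1:
            c["tp"] += 1
        elif pred == 1 and truth == 0:
            c["fp"] += 1
        elif pred == 0 and truth == 1:
            c["fn"] += 1
    metrics = {}
    for t in sorted(counts):
        c = counts[t]
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        precision = c["tp"] / predicted_pos if predicted_pos else None
        recall = c["tp"] / actual_pos if actual_pos else None
        metrics[t] = (precision, recall, actual_pos / c["n"])
    return metrics


# Toy data: two test time steps with four transactions each (values made up).
times = [40, 40, 40, 40, 41, 41, 41, 41]
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
metrics = per_step_metrics(times, y_true, y_pred)
for step, (precision, recall, illicit_rate) in metrics.items():
    print(step, precision, recall, illicit_rate)
```

Reading the metrics next to the illicit rate helps separate a genuine drop in model quality from a time step that simply contains very few illicit transactions.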