Note
This page was generated from a Jupyter notebook.
Elliptic Bitcoin Dataset: Random Forest Example
This notebook demonstrates how to load, preprocess, and model the Elliptic Bitcoin dataset using a Random Forest classifier. It covers:
Downloading and preparing the dataset
Training a machine learning model with a scikit-learn pipeline
Performing hyperparameter tuning with temporal cross-validation
Visualizing results and evaluating model performance over time
This example serves both as documentation and as a reproducible reference for users of the elliptic_toolkit package.
[1]:
from elliptic_toolkit.dataset import download_dataset, load_labeled_data
Loading the Elliptic Bitcoin Dataset
To work with the Elliptic Bitcoin dataset, you first need to ensure the data is available locally. Use the download_dataset
function to automatically download the dataset from PyTorch Geometric. The data will be saved in the elliptic_bitcoin_dataset
folder by default, which will be created if it does not already exist.
If the dataset files are already present, they will not be downloaded again unless you set force=True. This ensures the required data is in the expected location for further analysis.
[2]:
download_dataset()
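The skip-unless-forced behaviour described above is a common caching idiom. The sketch below only illustrates that idiom: `ensure_file`, its `fetch` callback, and the file name `example.bin` are hypothetical and not part of elliptic_toolkit, whose download_dataset may be implemented differently.

```python
from pathlib import Path
import tempfile


def ensure_file(path, fetch, force=False):
    """Create `path` via `fetch` unless it already exists (hypothetical helper)."""
    path = Path(path)
    if path.exists() and not force:
        return False  # existing copy reused, nothing fetched
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    path.write_bytes(fetch())
    return True  # file was written or refreshed


# Demo in a temporary directory, with lambdas standing in for the real download.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "elliptic_bitcoin_dataset" / "example.bin"
    first = ensure_file(target, lambda: b"v1")              # fetches
    second = ensure_file(target, lambda: b"v2")             # skipped: already present
    kept = target.read_bytes()
    third = ensure_file(target, lambda: b"v2", force=True)  # forced refresh
    refreshed = target.read_bytes()
```

Returning a boolean makes it easy for a caller to log whether a download actually happened.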
Now that the dataset is available, you can load it into memory using the load_labeled_data
utility function. This function:
Maps the class labels as follows:
1: Illicit
0: Licit
-1: Unknown
Maps transaction indices to row indices for easier data handling.
Automatically performs a temporal train/test split, where by default the latest 20% of time steps are reserved for testing.
This setup ensures your data is ready for machine learning workflows, with clear class labels and a reproducible split between training and testing sets.
[3]:
(X_train, y_train), (X_test, y_test) = load_labeled_data()
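To make the idea of a temporal split concrete, here is a minimal sketch that reserves the latest 20% of distinct time steps for testing. It runs on toy data and is not the toolkit's implementation: the helper name `temporal_split_mask` and the rounding rule are assumptions made for illustration.

```python
import numpy as np


def temporal_split_mask(time, test_fraction=0.2):
    """Return a boolean mask that is True for rows in the latest time steps."""
    steps = np.unique(time)  # sorted distinct time steps
    # Assumption: round to the nearest whole number of steps, at least one.
    n_test = max(1, int(round(len(steps) * test_fraction)))
    return np.isin(time, steps[-n_test:])


# Toy time column: 10 time steps with 3 transactions each (not the real dataset).
time = np.repeat(np.arange(1, 11), 3)
is_test = temporal_split_mask(time)

train_steps = np.unique(time[~is_test])
test_steps = np.unique(time[is_test])
```

Because the split is made on whole time steps rather than on rows, every training row precedes every test row in time.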
Training a model
With the data prepared, you can now train a machine learning model using your preferred scikit-learn estimator. In this example, we use a pipeline that includes the custom DropTime
transformer, which removes the time column from the features. This step is important to prevent the model from learning spurious correlations based on time.
Select any scikit-learn compatible model (here, a random forest is used).
The pipeline ensures preprocessing and modeling steps are applied consistently.
This approach helps maintain the integrity of your evaluation by avoiding data leakage from temporal information.
[4]:
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import PrecisionRecallDisplay
from elliptic_toolkit.model_wrappers import DropTime
pipe = Pipeline([
    ("drop_time", DropTime()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipe.fit(X_train, y_train)
PrecisionRecallDisplay.from_estimator(pipe, X_test, y_test, name="Random Forest")
plt.show()
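For readers curious how a column-dropping step fits into a scikit-learn pipeline, the sketch below shows a minimal transformer of that kind. It is an illustrative stand-in rather than the toolkit's DropTime: the class name `DropColumnSketch` is hypothetical, and it assumes the features arrive as a pandas DataFrame with a column named `time`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropColumnSketch(BaseEstimator, TransformerMixin):
    """Illustrative stand-in: drops one named column from a DataFrame."""

    def __init__(self, column="time"):
        self.column = column

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        # Returns a new DataFrame; the input is left unchanged.
        return X.drop(columns=[self.column])


# Toy frame with made-up values: a time column and two features.
X_toy = pd.DataFrame({"time": [1, 1, 2], "f0": [0.1, 0.4, 0.2], "f1": [1.0, 0.0, 0.5]})
X_no_time = DropColumnSketch().fit_transform(X_toy)
```

Keeping the time column in the input and dropping it inside the pipeline means a time-aware cross-validator can still read it when building splits, while the classifier never sees it.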
Hyperparameter Tuning
To further improve model performance, you can tune hyperparameters using cross-validation. The TemporalRollingCV cross-validator is designed for time-dependent data: it creates temporally ordered splits, similar to sklearn.model_selection.TimeSeriesSplit, but ensures that training and test sets never share a time index. This matters because time is an aggregated feature in this dataset, so many transactions carry the same time step.
This approach helps prevent data leakage and provides a more realistic evaluation of model performance on future, unseen data.
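TimeSeriesSplit cuts on row positions, so rows that share a time step can land on both sides of a cut. The sketch below shows one way to split on distinct time steps instead. It is not the TemporalRollingCV implementation: the function name `rolling_time_splits` and the expanding training window are assumptions made for illustration.

```python
import numpy as np


def rolling_time_splits(time, n_splits=3):
    """Yield (train_idx, test_idx) pairs that keep each time step on one side."""
    steps = np.unique(time)
    # Cut the distinct time steps into n_splits + 1 consecutive blocks.
    blocks = np.array_split(steps, n_splits + 1)
    for k in range(1, n_splits + 1):
        train_steps = np.concatenate(blocks[:k])  # all blocks before block k
        test_steps = blocks[k]                    # the next block in time
        yield (
            np.flatnonzero(np.isin(time, train_steps)),
            np.flatnonzero(np.isin(time, test_steps)),
        )


# Toy time column: 8 time steps with 2 rows each (not the real dataset).
time = np.repeat(np.arange(8), 2)
splits = list(rolling_time_splits(time, n_splits=3))
```

Each fold trains on all earlier blocks and tests on the next one, so no time step appears in both the training and the test indices of a fold.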
[5]:
from sklearn.model_selection import GridSearchCV
from elliptic_toolkit.temporal_cv import TemporalRollingCV
grid = GridSearchCV(
    pipe,
    param_grid={
        "clf__n_estimators": [10, 50, 100],
    },
    cv=TemporalRollingCV(n_splits=3),
    scoring="average_precision",
    n_jobs=1,
    verbose=10,
)
grid.fit(X_train, y_train)
Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV 1/3; 1/3] START clf__n_estimators=10........................................
[CV 1/3; 1/3] END .........clf__n_estimators=10;, score=0.972 total time= 0.4s
[CV 2/3; 1/3] START clf__n_estimators=10........................................
[CV 2/3; 1/3] END .........clf__n_estimators=10;, score=0.753 total time= 0.7s
[CV 3/3; 1/3] START clf__n_estimators=10........................................
[CV 3/3; 1/3] END .........clf__n_estimators=10;, score=0.927 total time= 1.1s
[CV 1/3; 2/3] START clf__n_estimators=50........................................
[CV 1/3; 2/3] END .........clf__n_estimators=50;, score=0.983 total time= 2.0s
[CV 2/3; 2/3] START clf__n_estimators=50........................................
[CV 2/3; 2/3] END .........clf__n_estimators=50;, score=0.788 total time= 3.4s
[CV 3/3; 2/3] START clf__n_estimators=50........................................
[CV 3/3; 2/3] END .........clf__n_estimators=50;, score=0.949 total time= 5.3s
[CV 1/3; 3/3] START clf__n_estimators=100.......................................
[CV 1/3; 3/3] END ........clf__n_estimators=100;, score=0.985 total time= 3.9s
[CV 2/3; 3/3] START clf__n_estimators=100.......................................
[CV 2/3; 3/3] END ........clf__n_estimators=100;, score=0.783 total time= 6.7s
[CV 3/3; 3/3] START clf__n_estimators=100.......................................
[CV 3/3; 3/3] END ........clf__n_estimators=100;, score=0.952 total time= 10.5s
[5]:
GridSearchCV(cv=TemporalRollingCV(gap=0, max_train_size=None, n_splits=3, test_size=None, time_col='time'), estimator=Pipeline(steps=[('drop_time', DropTime()), ('clf', RandomForestClassifier(n_estimators=10, random_state=42))]), n_jobs=1, param_grid={'clf__n_estimators': [10, 50, 100]}, scoring='average_precision', verbose=10)
GridSearchCV parameters
estimator | Pipeline(step...m_state=42))])
param_grid | {'clf__n_estimators': [10, 50, ...]}
scoring | 'average_precision'
n_jobs | 1
refit | True
cv | TemporalRolli...me_col='time')
verbose | 10
pre_dispatch | '2*n_jobs'
error_score | nan
return_train_score | False
DropTime parameters
drop | True
RandomForestClassifier parameters
n_estimators | 50
criterion | 'gini'
max_depth | None
min_samples_split | 2
min_samples_leaf | 1
min_weight_fraction_leaf | 0.0
max_features | 'sqrt'
max_leaf_nodes | None
min_impurity_decrease | 0.0
bootstrap | True
oob_score | False
n_jobs | None
random_state | 42
verbose | 0
warm_start | False
class_weight | None
ccp_alpha | 0.0
max_samples | None
monotonic_cst | None
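As a quick check on the search log, the per-fold scores can be averaged by hand. The snippet below copies the nine scores printed in the log and computes the mean and the fold-to-fold spread for each candidate. It uses only the rounded values shown in the log, so it may not reproduce the search's internal ranking exactly.

```python
from statistics import mean

# Per-fold average-precision scores copied from the GridSearchCV log.
fold_scores = {
    10: [0.972, 0.753, 0.927],
    50: [0.983, 0.788, 0.949],
    100: [0.985, 0.783, 0.952],
}

mean_scores = {n: mean(scores) for n, scores in fold_scores.items()}

for n_estimators, scores in fold_scores.items():
    spread = max(scores) - min(scores)  # gap between best and worst fold
    print(f"n_estimators={n_estimators:>3}: mean={mean_scores[n_estimators]:.3f}, spread={spread:.3f}")
```

To three decimals the means are 0.884, 0.907 and 0.907, so the rounded log cannot separate 50 from 100 trees. The second fold scores markedly lower than the other two for every candidate, which suggests that the variation between time periods is larger than the variation between these settings.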
Visualizing hyperparameter search results
To better understand the impact of each hyperparameter on model performance, you can visualize the marginal effects using the plot_marginals
utility function. These plots show how changes in a single hyperparameter affect the evaluation score, helping you identify which parameters are most influential and guiding further tuning decisions.
[6]:
from elliptic_toolkit.plots import plot_marginals
for marginal in plot_marginals(grid.cv_results_):
    plt.show()  # render the current plot before moving to the next one
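The idea behind a marginal effect can be sketched without any plotting: for each hyperparameter, average the mean test score over all candidates that share a value. The example below uses made-up numbers in the layout of GridSearchCV's cv_results_. It adds a second hyperparameter, clf__max_depth, purely for illustration, since the grid in this notebook varies only clf__n_estimators. The helper `marginal_means` is hypothetical and is not how plot_marginals is necessarily implemented.

```python
from collections import defaultdict
from statistics import mean

# Made-up search results using the "params" and "mean_test_score" keys of cv_results_.
cv_results = {
    "params": [
        {"clf__n_estimators": 10, "clf__max_depth": 5},
        {"clf__n_estimators": 10, "clf__max_depth": None},
        {"clf__n_estimators": 50, "clf__max_depth": 5},
        {"clf__n_estimators": 50, "clf__max_depth": None},
    ],
    "mean_test_score": [0.80, 0.84, 0.86, 0.90],
}


def marginal_means(results):
    """Average the mean test score over all candidates sharing a parameter value."""
    grouped = defaultdict(lambda: defaultdict(list))
    for params, score in zip(results["params"], results["mean_test_score"]):
        for name, value in params.items():
            grouped[name][value].append(score)
    return {
        name: {value: mean(scores) for value, scores in by_value.items()}
        for name, by_value in grouped.items()
    }


marginals = marginal_means(cv_results)
```

In this toy grid, moving from 10 to 50 trees shifts the marginal score more than changing the depth limit does, which is the kind of comparison a marginal plot makes visible at a glance.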
Model evaluation
To thoroughly evaluate your model, you can use the plot_evals
utility function. This function provides both a Precision-Recall curve and a rolling evaluation plot, allowing you to:
Assess the overall precision and recall of your model.
Visualize how model performance changes as you test on data further away in time from the training period.
Reference the illicit rate, which is plotted on a separate axis, to help contextualize model performance relative to the prevalence of illicit transactions over time.
This temporal evaluation is especially useful for understanding how well your model generalizes to future, unseen data and for detecting any performance degradation over time.
[7]:
from elliptic_toolkit.plots import plot_evals
for fig in plot_evals(grid, X_test, y_test, y_train):
    plt.show()  # render the current figure before moving to the next one
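The quantities behind a rolling evaluation can be computed directly: score the model separately on each test time step and record the illicit rate alongside. The sketch below does this on synthetic labels and scores. It does not use plot_evals or the real dataset, and the time step values and score distribution are invented for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: 5 test time steps with 200 transactions each.
time = np.repeat(np.arange(40, 45), 200)
y_true = rng.binomial(1, 0.1, size=time.size)  # roughly 10% illicit
# Scores that are higher on average for illicit rows, clipped to [0, 1].
y_score = np.clip(0.6 * y_true + rng.normal(0.3, 0.2, time.size), 0, 1)

rows = []
for step in np.unique(time):
    mask = time == step  # rows belonging to this time step only
    rows.append({
        "time": int(step),
        "illicit_rate": float(y_true[mask].mean()),
        "average_precision": float(average_precision_score(y_true[mask], y_score[mask])),
    })
```

The illicit rate is a useful reference because it is the average precision expected from a classifier that ranks transactions at random, so a per-step score should be read against the prevalence in that step.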