elliptic_toolkit.dataset module

elliptic_toolkit.dataset.download_dataset(root: str = 'elliptic_bitcoin_dataset', raw_file_names=['elliptic_txs_features.csv', 'elliptic_txs_edgelist.csv', 'elliptic_txs_classes.csv'], force: bool = False, url: str = 'https://data.pyg.org/datasets/elliptic')

Download the Elliptic Bitcoin dataset from PyTorch Geometric’s dataset repository.

Parameters:

root (str, optional) – The root directory where the dataset will be stored. Defaults to “elliptic_bitcoin_dataset”.
raw_file_names (list, optional) – List of raw file names to download. Defaults to ['elliptic_txs_features.csv', 'elliptic_txs_edgelist.csv', 'elliptic_txs_classes.csv'].
force (bool, optional) – Whether to re-download the dataset even if it already exists. Defaults to False.
url (str, optional) – The base URL for the dataset files. Defaults to ‘https://data.pyg.org/datasets/elliptic’.
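
The interaction between root, raw_file_names, force, and url can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: plan_downloads is a hypothetical helper, it performs no network access, and it assumes each file is fetched from the base URL joined with the file name (the real function may handle archives or paths differently).

```python
import os


def plan_downloads(
    root="elliptic_bitcoin_dataset",
    raw_file_names=(
        "elliptic_txs_features.csv",
        "elliptic_txs_edgelist.csv",
        "elliptic_txs_classes.csv",
    ),
    force=False,
    url="https://data.pyg.org/datasets/elliptic",
):
    """Hypothetical helper: list (source_url, target_path) pairs still to fetch."""
    plan = []
    for name in raw_file_names:
        target = os.path.join(root, name)
        # Assumption: an existing file is skipped unless force=True.
        if os.path.exists(target) and not force:
            continue
        # Assumption: the source is simply "<base url>/<file name>".
        plan.append((url.rstrip("/") + "/" + name, target))
    return plan


# Show what would be fetched for the default arguments.
for source, target in plan_downloads():
    print(source, "->", target)
```

With force=True every file appears in the plan again, which mirrors the documented meaning of the flag.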

elliptic_toolkit.dataset.process_dataset(folder_path: str = 'elliptic_bitcoin_dataset', features_file: str = 'elliptic_txs_features.csv', classes_file: str = 'elliptic_txs_classes.csv', edges_file: str = 'elliptic_txs_edgelist.csv')

Loads, validates, and processes the Elliptic Bitcoin dataset.

Returns:

nodes_df (pandas.DataFrame) – DataFrame with shape (203769, 167). Columns:
- 'time': Discrete time step (int)
- 'feat_0' … 'feat_164': Node features (float)
- 'class': Node label (int: 1 for illicit, 0 for licit, -1 for unknown/missing)
The 'class' column uses -1 to indicate missing labels (transductive setting). The 'txId' column is dropped from the returned DataFrame; the rows keep the order of the input file.
edges_df (pandas.DataFrame) – DataFrame with shape (234355, 2). Columns:
- 'txId1': Source node index (int, row index in nodes_df)
- 'txId2': Target node index (int, row index in nodes_df)
Each row represents a directed edge in the transaction graph, with node indices corresponding to rows in nodes_df.

Notes

All IDs in ‘edges_df’ are mapped to row indices in ‘nodes_df’.
The function performs strict validation on shapes, unique values, and label distribution.
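
The ID-to-row-index mapping described in the first note can be illustrated with a small, dependency-free sketch. The helper remap_edges and the transaction IDs below are made up for illustration; the toolkit itself works on pandas DataFrames and may implement the mapping differently.

```python
def remap_edges(tx_ids, edges):
    """Hypothetical helper: replace raw transaction IDs with row positions."""
    # Row position of each transaction ID, in the order the nodes are stored.
    index_of = {tx_id: row for row, tx_id in enumerate(tx_ids)}
    return [(index_of[src], index_of[dst]) for src, dst in edges]


# Made-up IDs standing in for the 'txId' column of the features file.
tx_ids = [1001, 1002, 1003]
edges = [(1001, 1002), (1003, 1001)]

remapped = remap_edges(tx_ids, edges)
print(remapped)  # [(0, 1), (2, 0)]
```

After this step an edge endpoint can be used directly as a positional index into the node table, which is why the 'txId' column is no longer needed.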

elliptic_toolkit.dataset.temporal_split(times, test_size=0.2)

elliptic_toolkit.dataset.temporal_split(times: ndarray, test_size=0.2)

elliptic_toolkit.dataset.temporal_split(times: Tensor, test_size=0.2)

elliptic_toolkit.dataset.temporal_split(nodes_df: DataFrame, test_size=0.2, return_X_y=True)

Split data into temporal train/test sets based on unique time steps.

Parameters:

times (np.ndarray, torch.Tensor, or pandas.DataFrame) – The time information or data to split. For DataFrames, must contain a ‘time’ column.
test_size (float, default=0.2) – Proportion of unique time steps to include in the test split (between 0.0 and 1.0).

Returns:

For array/tensor input:

train_indices, test_indices (array-like) – Indices for the training and test sets.

For DataFrame input:

(X_train, y_train), (X_test, y_test) (tuple of tuples), where:

X_train (pandas.DataFrame) – Training features (all columns except 'class').
y_train (pandas.Series) – Training labels (the 'class' column).
X_test (pandas.DataFrame) – Test features (all columns except 'class').
y_test (pandas.Series) – Test labels (the 'class' column).

Or, if return_X_y=False:

train_df, test_df (pandas.DataFrame) – The full training and test DataFrames, already sliced by time.

Type-specific behavior

np.ndarray – Uses NumPy operations to split by unique time values.
torch.Tensor – Uses torch operations to split by unique time values (no CPU/GPU transfer).
pandas.DataFrame – Splits based on the 'time' column. If return_X_y=True, unpacks X and y based on the 'class' column; otherwise, returns the sliced DataFrames.
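
A split over unique time steps, as opposed to a split over individual rows, can be sketched in plain Python as follows. This is only an illustration of the idea: temporal_split_indices is a hypothetical name, and the sketch assumes that the latest time steps form the test set and that the number of test steps is rounded to the nearest integer (at least one). The documentation above does not state either detail.

```python
def temporal_split_indices(times, test_size=0.2):
    """Hypothetical sketch: split row indices by unique time step."""
    unique_times = sorted(set(times))
    # Assumption: round the number of test steps, keeping at least one.
    n_test = max(1, round(len(unique_times) * test_size))
    # Assumption: the most recent time steps go to the test set.
    test_times = set(unique_times[-n_test:])
    train_idx = [i for i, t in enumerate(times) if t not in test_times]
    test_idx = [i for i, t in enumerate(times) if t in test_times]
    return train_idx, test_idx


# Eight rows spread over five time steps.
times = [1, 1, 2, 3, 3, 4, 5, 5]
train_idx, test_idx = temporal_split_indices(times, test_size=0.2)
print(train_idx)  # [0, 1, 2, 3, 4, 5]
print(test_idx)   # [6, 7]
```

Because whole time steps are assigned to one side, no time step contributes rows to both sets, and the test fraction measured in rows can differ from test_size.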

elliptic_toolkit.dataset.load_labeled_data(test_size=0.2, root='elliptic_bitcoin_dataset')

Utility function to load data, select only labeled data and split temporally into train and test sets.

Parameters:

test_size (float, default=0.2) – Proportion of unique time steps to include in the test split (between 0.0 and 1.0).
root (str, optional) – The root directory where the dataset is stored. Defaults to “elliptic_bitcoin_dataset”.

Returns:

(X_train, y_train), (X_test, y_test) (tuple of tuples) – Training features and labels, followed by test features and labels.
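
The combination of label filtering and temporal splitting can be sketched on toy rows as below. The function labeled_train_test and the sample values are hypothetical, rows are plain dictionaries rather than DataFrames, and the sketch assumes that unlabeled rows (class -1) are removed before the time steps are split and that the latest steps form the test set.

```python
def labeled_train_test(rows, test_size=0.2):
    """Hypothetical sketch: keep labeled rows, then split them by time step."""
    # Assumption: rows with class == -1 are dropped before splitting.
    labeled = [r for r in rows if r["class"] != -1]
    steps = sorted({r["time"] for r in labeled})
    n_test = max(1, round(len(steps) * test_size))
    test_steps = set(steps[-n_test:])

    def unpack(subset):
        # Features are every column except 'class'; labels are 'class'.
        X = [{k: v for k, v in r.items() if k != "class"} for r in subset]
        y = [r["class"] for r in subset]
        return X, y

    train = [r for r in labeled if r["time"] not in test_steps]
    test = [r for r in labeled if r["time"] in test_steps]
    return unpack(train), unpack(test)


# Toy rows with a single made-up feature column.
rows = [
    {"time": 1, "feat_0": 0.1, "class": 0},
    {"time": 2, "feat_0": 0.4, "class": -1},
    {"time": 3, "feat_0": 0.7, "class": 1},
    {"time": 4, "feat_0": 0.2, "class": 0},
    {"time": 5, "feat_0": 0.9, "class": 1},
]
(X_train, y_train), (X_test, y_test) = labeled_train_test(rows)
print(y_train)  # [0, 1, 0]
print(y_test)   # [1]
```

The returned structure matches the documented ((X_train, y_train), (X_test, y_test)) layout, so it can be unpacked the same way.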