elliptic_toolkit.dataset module
- elliptic_toolkit.dataset.download_dataset(root: str = 'elliptic_bitcoin_dataset', raw_file_names=['elliptic_txs_features.csv', 'elliptic_txs_edgelist.csv', 'elliptic_txs_classes.csv'], force: bool = False, url: str = 'https://data.pyg.org/datasets/elliptic')
Download the Elliptic Bitcoin dataset from PyTorch Geometric’s dataset repository.
- Parameters:
root (str, optional) – The root directory where the dataset will be stored. Defaults to “elliptic_bitcoin_dataset”.
raw_file_names (list, optional) – List of raw file names to download. Defaults to ['elliptic_txs_features.csv', 'elliptic_txs_edgelist.csv', 'elliptic_txs_classes.csv'].
force (bool, optional) – Whether to force re-download the dataset if it already exists. Defaults to False.
url (str, optional) – The base URL for the dataset files. Defaults to ‘https://data.pyg.org/datasets/elliptic’.
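The way root, raw_file_names, force, and url interact can be pictured with a small planning helper. This is only a sketch: plan_downloads is a hypothetical name, not part of elliptic_toolkit, and joining the base URL and the file name with a slash is an assumption about how the file URLs are formed.

```python
import os


def plan_downloads(root, raw_file_names, force=False,
                   url="https://data.pyg.org/datasets/elliptic"):
    """Return (source_url, destination_path) pairs that still need fetching.

    Hypothetical helper: it only plans the work and never touches the network.
    """
    plan = []
    for name in raw_file_names:
        dest = os.path.join(root, name)
        # Without force, a file that is already on disk is left alone.
        if os.path.exists(dest) and not force:
            continue
        # Assumption: each raw file sits directly under the base URL.
        plan.append((f"{url.rstrip('/')}/{name}", dest))
    return plan
```

With force=False the plan lists only the files missing from root; with force=True it lists every file in raw_file_names.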
- elliptic_toolkit.dataset.process_dataset(folder_path: str = 'elliptic_bitcoin_dataset', features_file: str = 'elliptic_txs_features.csv', classes_file: str = 'elliptic_txs_classes.csv', edges_file: str = 'elliptic_txs_edgelist.csv')
Loads, validates, and processes the Elliptic Bitcoin dataset.
- Returns:
nodes_df (pandas.DataFrame) – DataFrame with shape (203769, 167). Columns:
'time': Discrete time step (int)
'feat_0' … 'feat_164': Node features (float)
'class': Node label (int: 1 for illicit, 0 for licit, -1 for unknown/missing)
The 'class' column uses -1 to indicate missing labels (transductive setting). The 'txId' column is dropped from the returned DataFrame; the rows keep the order of the input file.
edges_df (pandas.DataFrame) – DataFrame with shape (234355, 2). Columns:
'txId1': Source node index (int, row index in nodes_df)
'txId2': Target node index (int, row index in nodes_df)
Each row represents a directed edge in the transaction graph, with node indices corresponding to rows in nodes_df.
Notes
All IDs in ‘edges_df’ are mapped to row indices in ‘nodes_df’.
The function performs strict validation on shapes, unique values, and label distribution.
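The remapping of edge IDs to row indices described in the notes can be illustrated as follows. This is a minimal sketch under stated assumptions: remap_edges is a hypothetical helper, the IDs are made-up values, and the library's own implementation may differ.

```python
def remap_edges(tx_ids, edges):
    """Map raw transaction IDs in `edges` to row positions in `tx_ids`."""
    # Position of each transaction ID in the node table.
    row_of = {tx_id: row for row, tx_id in enumerate(tx_ids)}
    if len(row_of) != len(tx_ids):
        raise ValueError("transaction IDs must be unique")
    # Each directed edge (source, target) becomes a pair of row indices.
    return [(row_of[src], row_of[dst]) for src, dst in edges]


tx_ids = [101, 205, 309]            # made-up IDs, in node-table order
edges = [(101, 205), (309, 101)]    # directed edges between raw IDs
print(remap_edges(tx_ids, edges))   # [(0, 1), (2, 0)]
```

After remapping, the edge list can index directly into the rows of the node table, which is why the 'txId' column is no longer needed.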
- elliptic_toolkit.dataset.temporal_split(times, test_size=0.2)
- elliptic_toolkit.dataset.temporal_split(times: ndarray, test_size=0.2)
- elliptic_toolkit.dataset.temporal_split(times: Tensor, test_size=0.2)
- elliptic_toolkit.dataset.temporal_split(nodes_df: DataFrame, test_size=0.2, return_X_y=True)
Split data into temporal train/test sets based on unique time steps.
- Parameters:
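times (np.ndarray, torch.Tensor, or pandas.DataFrame) – The time steps to split on, or the data to split. A DataFrame (passed as nodes_df) must contain a 'time' column.
test_size (float, default=0.2) – Proportion of unique time steps to include in the test split (between 0.0 and 1.0).
return_X_y (bool, default=True) – DataFrame input only. If True, each split is returned as a (features, labels) pair; if False, the sliced DataFrames are returned.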
- Returns:
For array/tensor input –
- train_indices, test_indices (array-like) – Indices for the training and test sets.
For DataFrame input –
- (X_train, y_train), (X_test, y_test) (tuple of tuples)
- X_train (pandas.DataFrame) – Training features (all columns except 'class').
- y_train (pandas.Series) – Training labels (the 'class' column).
- X_test (pandas.DataFrame) – Test features (all columns except 'class').
- y_test (pandas.Series) – Test labels (the 'class' column).
- Or, if return_X_y=False:
- train_df, test_df (pandas.DataFrame) – The full training and test DataFrames, already sliced by time.
Type-specific behavior
- np.ndarray: Uses NumPy operations to split by unique time values.
- torch.Tensor: Uses torch operations to split by unique time values (no CPU/GPU transfer).
- pandas.DataFrame: Splits based on the 'time' column. If return_X_y=True, unpacks X and y based on the 'class' column; otherwise, returns the sliced DataFrames.
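Splitting on unique time steps rather than on individual rows can be sketched in plain Python. Two assumptions are made here that the documentation does not confirm: the test set takes the latest time steps, and the number of test steps is the rounded product of test_size and the number of unique steps. temporal_split_sketch is a hypothetical name.

```python
def temporal_split_sketch(times, test_size=0.2):
    """Return (train_indices, test_indices) split on unique time steps."""
    unique_steps = sorted(set(times))
    # Assumption: at least one step is held out, count rounded to nearest.
    n_test = max(1, round(len(unique_steps) * test_size))
    # Assumption: the latest time steps form the test set.
    test_steps = set(unique_steps[-n_test:])
    train_idx = [i for i, t in enumerate(times) if t not in test_steps]
    test_idx = [i for i, t in enumerate(times) if t in test_steps]
    return train_idx, test_idx


times = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
train_idx, test_idx = temporal_split_sketch(times, test_size=0.2)
print(train_idx)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test_idx)   # [8, 9]
```

Because whole time steps are assigned to one side, no time step appears in both the training and the test indices.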
- elliptic_toolkit.dataset.load_labeled_data(test_size=0.2, root='elliptic_bitcoin_dataset')
Utility function that loads the data, keeps only the labeled rows, and splits them temporally into train and test sets.
- Parameters:
test_size (float, default=0.2) – Proportion of unique time steps to include in the test split (between 0.0 and 1.0).
root (str, optional) – The root directory where the dataset is stored. Defaults to “elliptic_bitcoin_dataset”.
- Returns:
(X_train, y_train), (X_test, y_test) – Training features and labels, followed by test features and labels.
- Return type:
tuple of tuples
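The filter-then-split order described above can be illustrated with plain dictionaries standing in for DataFrame rows. This is a sketch under stated assumptions: labeled_split is a hypothetical helper, the held-out time steps are passed in explicitly instead of being derived from test_size, and the sample values are made up.

```python
def labeled_split(rows, test_steps):
    """Drop unlabeled rows, split by time step, and separate X from y."""
    # Rows whose class is -1 carry no label and are discarded first.
    labeled = [r for r in rows if r["class"] != -1]

    def unpack(subset):
        # Features are every column except 'class'; labels are 'class'.
        X = [{k: v for k, v in r.items() if k != "class"} for r in subset]
        y = [r["class"] for r in subset]
        return X, y

    train = [r for r in labeled if r["time"] not in test_steps]
    test = [r for r in labeled if r["time"] in test_steps]
    return unpack(train), unpack(test)


rows = [
    {"time": 1, "feat_0": 0.1, "class": 0},
    {"time": 1, "feat_0": 0.4, "class": -1},
    {"time": 2, "feat_0": 0.7, "class": 1},
    {"time": 3, "feat_0": 0.2, "class": -1},
    {"time": 3, "feat_0": 0.9, "class": 0},
]
(X_train, y_train), (X_test, y_test) = labeled_split(rows, test_steps={3})
print(y_train)  # [0, 1]
print(y_test)   # [0]
```

The returned structure mirrors the documented return value: a pair of (features, labels) pairs, training first.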