elliptic_toolkit.dataset module

elliptic_toolkit.dataset.download_dataset(root: str = 'elliptic_bitcoin_dataset', raw_file_names=['elliptic_txs_features.csv', 'elliptic_txs_edgelist.csv', 'elliptic_txs_classes.csv'], force: bool = False, url: str = 'https://data.pyg.org/datasets/elliptic')

Download the Elliptic Bitcoin dataset from PyTorch Geometric’s dataset repository.

Parameters:
  • root (str, optional) – The root directory where the dataset will be stored. Defaults to “elliptic_bitcoin_dataset”.

  • raw_file_names (list, optional) – List of raw file names to download. Defaults to [‘elliptic_txs_features.csv’, ‘elliptic_txs_edgelist.csv’, ‘elliptic_txs_classes.csv’].

  • force (bool, optional) – Whether to force re-download the dataset if it already exists. Defaults to False.

  • url (str, optional) – The base URL for the dataset files. Defaults to ‘https://data.pyg.org/datasets/elliptic’.
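
The interaction between root, raw_file_names, force, and url can be illustrated with a minimal sketch. The helper name download_files and its injectable fetch argument are hypothetical and are not part of elliptic_toolkit. How the toolkit actually builds the remote path, and whether it unpacks archives, is not stated on this page, so the plain "url/name" join below is an assumption.

```python
import os
from urllib.request import urlretrieve


def download_files(root, file_names, url, force=False, fetch=urlretrieve):
    """Hypothetical sketch: fetch each file into `root` unless a copy exists."""
    os.makedirs(root, exist_ok=True)
    downloaded = []
    for name in file_names:
        target = os.path.join(root, name)
        # Keep the existing copy unless the caller asks for a fresh download.
        if os.path.exists(target) and not force:
            continue
        # Assumption: each file sits directly under the base URL.
        fetch(f"{url}/{name}", target)
        downloaded.append(name)
    return downloaded
```

With force=False a second call is a no-op for files already on disk; force=True fetches them again.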

elliptic_toolkit.dataset.process_dataset(folder_path: str = 'elliptic_bitcoin_dataset', features_file: str = 'elliptic_txs_features.csv', classes_file: str = 'elliptic_txs_classes.csv', edges_file: str = 'elliptic_txs_edgelist.csv')

Loads, validates, and processes the Elliptic Bitcoin dataset.

Returns:

  • nodes_df (pandas.DataFrame) – DataFrame with shape (203769, 167). Columns:

    • ‘time’: Discrete time step (int)

    • ‘feat_0’ … ‘feat_164’: Node features (float)

    • ‘class’: Node label (int: 1 for illicit, 0 for licit, -1 for unknown/missing)

    The ‘class’ column uses -1 to indicate missing labels (transductive setting). The ‘txId’ column is dropped from the returned DataFrame; rows keep the order of the input file.

  • edges_df (pandas.DataFrame) – DataFrame with shape (234355, 2). Columns:

    • ‘txId1’: Source node index (int, row index in nodes_df)

    • ‘txId2’: Target node index (int, row index in nodes_df)

    Each row represents a directed edge in the transaction graph, with node indices corresponding to rows in nodes_df.

Notes

  • All IDs in ‘edges_df’ are mapped to row indices in ‘nodes_df’.

  • The function performs strict validation on shapes, unique values, and label distribution.
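
The mapping from transaction IDs to row indices described above can be sketched with pandas on a toy graph. The values below are made up for illustration, and this is not the toolkit's actual implementation, only one plausible way to do the re-indexing.

```python
import pandas as pd

# Toy stand-ins for the raw node and edge tables (hypothetical values).
nodes = pd.DataFrame({"txId": [501, 17, 230], "time": [1, 1, 2]})
edges = pd.DataFrame({"txId1": [501, 17], "txId2": [17, 230]})

# Look-up table: transaction ID -> row position in `nodes`.
id_to_row = pd.Series(nodes.index.to_numpy(), index=nodes["txId"])

# Replace the IDs in both edge columns with row positions.
edges_idx = edges.apply(lambda col: col.map(id_to_row))

# Every edge endpoint must resolve to a known node.
if edges_idx.isna().any().any():
    raise ValueError("edge list references an unknown txId")

# Once edges are re-indexed, the ID column is no longer needed.
nodes = nodes.drop(columns="txId")
```

After this step an edge (0, 1) means "row 0 of nodes points to row 1 of nodes", which is the convention most graph libraries expect for an edge index.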

elliptic_toolkit.dataset.temporal_split(times, test_size=0.2)
elliptic_toolkit.dataset.temporal_split(times: ndarray, test_size=0.2)
elliptic_toolkit.dataset.temporal_split(times: Tensor, test_size=0.2)
elliptic_toolkit.dataset.temporal_split(nodes_df: DataFrame, test_size=0.2, return_X_y=True)

Split data into temporal train/test sets based on unique time steps.

Parameters:
  • times (np.ndarray, torch.Tensor, or pandas.DataFrame) – The time information or data to split. For DataFrames, must contain a ‘time’ column.

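  • test_size (float, default=0.2) – Proportion of unique time steps to include in the test split (between 0.0 and 1.0).

  • return_X_y (bool, default=True) – DataFrame input only. If True, return (X, y) pairs for each split; if False, return the sliced DataFrames.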

Returns:

  • For array/tensor input

    train_indices, test_indices (array-like) – Indices for the training and test sets.

  • For DataFrame input with return_X_y=True

    (X_train, y_train), (X_test, y_test) (tuple of tuples)

    • X_train (pandas.DataFrame) – Training features (all columns except ‘class’).

    • y_train (pandas.Series) – Training labels (the ‘class’ column).

    • X_test (pandas.DataFrame) – Test features (all columns except ‘class’).

    • y_test (pandas.Series) – Test labels (the ‘class’ column).

  • For DataFrame input with return_X_y=False

    train_df, test_df (pandas.DataFrame) – The full training and test DataFrames, already sliced by time.

Type-specific behavior

  • np.ndarray – Uses numpy operations to split by unique time values.

  • torch.Tensor – Uses torch operations to split by unique time values (no CPU/GPU transfer).

  • pandas.DataFrame – Splits based on the ‘time’ column. If return_X_y=True, unpacks X and y based on the ‘class’ column; otherwise, returns the sliced DataFrames.
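
A split over unique time steps, rather than over individual rows, can be sketched for the np.ndarray case as follows. The function name temporal_split_sketch is hypothetical. Two details are assumptions that this page does not confirm: the test set takes the latest time steps, and the number of test steps is rounded to the nearest integer with a minimum of one.

```python
import numpy as np


def temporal_split_sketch(times, test_size=0.2):
    """Hypothetical sketch: hold out the latest fraction of unique time steps."""
    unique_times = np.unique(times)  # np.unique returns sorted values
    # Assumption: round to the nearest count and keep at least one test step.
    n_test = max(1, int(round(len(unique_times) * test_size)))
    test_times = unique_times[-n_test:]  # assumption: latest steps are the test set
    test_mask = np.isin(times, test_times)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


times = np.array([1, 1, 2, 3, 3, 4, 5, 5, 5, 5])
train_indices, test_indices = temporal_split_sketch(times, test_size=0.2)
```

Because the proportion applies to time steps, the share of rows in the test set can differ from test_size: here one of five time steps is held out, but it covers four of ten rows.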

elliptic_toolkit.dataset.load_labeled_data(test_size=0.2, root='elliptic_bitcoin_dataset')

Utility function to load data, select only labeled data, and split temporally into train and test sets.

Parameters:
  • test_size (float, default=0.2) – Proportion of unique time steps to include in the test split (between 0.0 and 1.0).

  • root (str, optional) – The root directory where the dataset is stored. Defaults to “elliptic_bitcoin_dataset”.

Returns:

(X_train, y_train), (X_test, y_test) – X_train and y_train are the training features and labels; X_test and y_test are the test features and labels.

Return type:

tuple of tuples
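
The "labeled rows only, then temporal split" flow can be sketched on a toy frame that uses the documented column names and the -1 marker for missing labels. The values are made up, and holding out only the single latest time step is an assumption made to keep the example short; it is not necessarily what load_labeled_data does.

```python
import pandas as pd

# Toy frame in the documented layout (hypothetical values).
nodes_df = pd.DataFrame({
    "time":   [1, 1, 2, 2, 3, 3],
    "feat_0": [0.1, 0.4, 0.3, 0.9, 0.5, 0.7],
    "class":  [0, -1, 1, 0, -1, 1],
})

# Keep only rows with a known label (-1 marks unknown/missing).
labeled = nodes_df[nodes_df["class"] != -1]

# Assumption: hold out the latest time step as the test set.
cutoff = labeled["time"].max()
train_df = labeled[labeled["time"] < cutoff]
test_df = labeled[labeled["time"] >= cutoff]

# Separate features from labels, mirroring the documented return shape.
X_train, y_train = train_df.drop(columns="class"), train_df["class"]
X_test, y_test = test_df.drop(columns="class"), test_df["class"]
```

Filtering before splitting means the train and test sets contain only 0/1 labels, so they can be passed directly to a standard binary classifier.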