gents.dataset package

Submodules

gents.dataset.base module

class gents.dataset.base.BaseDataModule(seq_len: int, seq_dim: int, condition: str, batch_size: int, inference_batch_size: int, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], **kwargs)

Bases: LightningDataModule, ABC

Base class for time series data module in PyTorch Lightning.

Parameters:
  • seq_len (int) – Target sequence length

  • seq_dim (int) – Target sequence dimension, for univariate time series, set as 1

  • condition (str) – Condition type. Choose from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.

  • batch_size (int) – Training and validation batch size.

  • inference_batch_size (int) – Testing batch size.

  • max_time (float, optional) – A time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to span [0, …, max_time]. Defaults to None.

  • add_coeffs (str, optional) – Whether to include interpolation coefficients, which are needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.

  • irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping some time steps from the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.

  • data_dir (str, optional) – Directory in which to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to Path.cwd()/“data”.

  • train_val_test (List[float], optional) – Ratios of the training, validation and testing datasets. Must sum to 1.0. Defaults to [0.7, 0.2, 0.1].

  • **kwargs – Additional arguments for the dataset
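The effect of max_time on the time step index can be illustrated with a minimal sketch. The helper name make_time_index and the linear rescaling rule (last step mapped exactly to max_time) are assumptions made for illustration; the library's own implementation may differ.

```python
def make_time_index(total_seq_len, max_time=None):
    """Return the time stamps for one window.

    Without max_time: 0, 1, ..., total_seq_len - 1.
    With max_time: the same steps rescaled linearly so the last one equals
    max_time (assumed rule, not confirmed by the library).
    """
    steps = [float(i) for i in range(total_seq_len)]
    if max_time is None or total_seq_len < 2:
        return steps
    scale = max_time / (total_seq_len - 1)
    return [s * scale for s in steps]


print(make_time_index(5))                # [0.0, 1.0, 2.0, 3.0, 4.0]
print(make_time_index(5, max_time=1.0))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Rescaling of this kind keeps the time axis in a fixed range regardless of window length, which is often convenient for continuous-time models.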

predict_dataloader()

An iterable or collection of iterables specifying prediction samples.

For more information about multiple dataloaders, see this section.

It’s recommended that all data downloads and preparation happen in prepare_data().

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns:

A torch.utils.data.DataLoader or a sequence of them specifying prediction samples.

prepare_data() → None

Use this to download and save time series files.

Default path is data/DATASET_NAME/data_tsl{SEQ_LEN}_tsd{SEQ_DIM}_ir{IRREGULAR_DROPOUT}.pt.

The default data path is checked first. If no data file exists there, self.get_data() is called and the tuple (data, data_mask, class_label) is saved at the default path.
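The check-then-create behaviour of prepare_data() can be sketched as follows. This is an illustration under stated assumptions, not the library's code: default_data_path, the standalone prepare_data function and fake_get_data are hypothetical names, and pickle stands in for whatever serializer the library uses so that the sketch has no dependencies.

```python
import pickle
import tempfile
from pathlib import Path


def default_data_path(data_dir, dataset_name, seq_len, seq_dim, irregular_dropout):
    """Build the file path following the naming scheme described above."""
    file_name = f"data_tsl{seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt"
    return Path(data_dir) / dataset_name / file_name


def prepare_data(path, get_data):
    """Create the data file only if it is missing; return True if it was created."""
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    data, data_mask, class_label = get_data()
    # Assumption: pickle replaces the library's serializer in this sketch.
    with path.open("wb") as f:
        pickle.dump((data, data_mask, class_label), f)
    return True


def fake_get_data():
    # Hypothetical loader: two samples, three steps, one channel, no labels.
    data = [[[0.1], [0.2], [0.3]], [[0.4], [0.5], [0.6]]]
    data_mask = [[[True], [True], [True]], [[True], [True], [True]]]
    return data, data_mask, None


with tempfile.TemporaryDirectory() as tmp:
    path = default_data_path(tmp, "DATASET_NAME", 3, 1, 0.0)
    first = prepare_data(path, fake_get_data)   # file is created
    second = prepare_data(path, fake_get_data)  # file already exists, skipped
    file_name = path.name
```

Because the file name encodes the sequence length, dimension and dropout rate, changing any of these produces a new file rather than silently reusing a stale one.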

setup(stage)

Use this to load the data file, perform the train/val/test split, and build the datasets.

Default train/val/test ratio = [0.7, 0.2, 0.1].
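How ratios translate into split sizes can be sketched as below. The helper split_sizes, the rounding rule (remainder goes to the test set) and the contiguous, in-order split are assumptions for illustration; the library may shuffle or round differently.

```python
def split_sizes(n_samples, ratios=(0.7, 0.2, 0.1)):
    """Turn train/val/test ratios into sample counts that add up to n_samples."""
    if abs(sum(ratios) - 1.0) > 1e-8:
        raise ValueError("train_val_test ratios must sum to 1.0")
    n_train = int(n_samples * ratios[0])
    n_val = int(n_samples * ratios[1])
    n_test = n_samples - n_train - n_val  # remainder, so nothing is lost to rounding
    return n_train, n_val, n_test


samples = list(range(100))
n_train, n_val, n_test = split_sizes(len(samples))

# Assumed contiguous split: train first, then validation, then test.
train = samples[:n_train]
val = samples[n_train:n_train + n_val]
test = samples[n_train + n_val:]
```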

test_dataloader()

An iterable or collection of iterables specifying test samples.

For more information about multiple dataloaders, see this section.

For data processing use the following pattern:
  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

Do not assign state in prepare_data().

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

train_dataloader()

An iterable or collection of iterables specifying training samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern:
  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

Do not assign state in prepare_data().

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

val_dataloader()

An iterable or collection of iterables specifying validation samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

class gents.dataset.base.TSDataset(data: Tensor, data_mask: BoolTensor | None = None, class_label: IntTensor | None = None, condition: str | None = None, max_time: int | None = None, add_coeffs: str | None = None, **kwargs)

Bases: Dataset

Time series dataset.

Parameters:
  • data (torch.Tensor) – Time series data, in shape of [n_sample, total_seq_len, seq_dim]. Each sample can be a window from a long time series, or a single signal from an object (e.g. a patient’s measurements, a trajectory).

  • data_mask (torch.BoolTensor, optional) – Time series data mask, in shape of [n_sample, total_seq_len, seq_dim]. It records whether the original data has missing values: 1 for observed, 0 for missing. If None, the data is assumed to have no missing values. Defaults to None.

  • class_label (torch.IntTensor, optional) – Class labels for time series, in shape of [n_sample, ]. Defaults to None.

  • condition (str, optional) – Condition type. Choose from [None, ‘predict’, ‘impute’, ‘class’]. Defaults to None.

  • max_time (int, optional) – A time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to span [0, …, max_time]. Defaults to None.

  • add_coeffs (str, optional) – Whether to include interpolation coefficients, which are needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.

  • **kwargs – Additional arguments for the model, e.g. obs_len, missing_rate

Each batch is a dictionary with the following keys:
  • “seq”: [batch_size, total_seq_len, seq_dim]. Target time series window

  • “t”: [batch_size, total_seq_len]. Time step index at each time step in the window.

  • “data_mask”: [batch_size, total_seq_len, seq_dim]. Time series data mask

  • “c”: [batch_size, obs_len / seq_len]. Condition. Empty if unconditional.

  • “coeffs”: [batch_size, total_seq_len, seq_dim]. Cubic spline or linear interpolation coefficients for NCDE-related models. Empty if add_coeffs is None.

class gents.dataset.base.WebDownloadDataModule(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: BaseDataModule

Data module that downloads data (a .csv file, or a .zip archive containing a .csv file) directly from a URL.

Class attributes:

D

Total sequence dimensions in the original data.

Type:

int

index_col

Index column (i.e. the timestamp column), passed to pd.read_csv.

Type:

str | int

url

Download URL.

Type:

str | int

csv_dir

If data_source=‘zip’, csv_dir must be given to locate the .csv file inside the archive.

Type:

str | int

data_source

Choose from [‘zip’, ‘csv’].

Type:

str

Parameters:
  • seq_len (int) – Target sequence length

  • select_seq_dim (List[int | str], optional) – Subset of the sequence channels. Either a list of int giving the chosen channel indices, or a list of str giving the column names of the pd.DataFrame object. If None, all channels are used. Defaults to None.

  • batch_size (int, optional) – Training and validation batch size. Defaults to 32.

  • data_dir (str, optional) – Directory in which to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.

  • train_val_test (List[float], optional) – Ratios of the training, validation and testing datasets. Must sum to 1.0. Defaults to [0.7, 0.2, 0.1].

  • condition (str, optional) – Condition type. Choose from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.

  • scale (bool, optional) – If True, StandardScaler is used for z-score normalization. Mu and sigma are computed from the training data only and then applied to all time steps. Defaults to True.

  • inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.

  • max_time (float, optional) – A time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to span [0, …, max_time]. Defaults to None.

  • add_coeffs (str, optional) – Whether to include interpolation coefficients, which are needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.

  • irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping some time steps from the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.

  • **kwargs – Additional arguments for the model
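The scale option, with statistics taken from the training data only, can be sketched in plain Python. The helpers fit_scaler and transform are hypothetical stand-ins for StandardScaler, and treating the first 70% of the series as the training part is an assumption made for illustration.

```python
from statistics import fmean, pstdev


def fit_scaler(train_values):
    """Return (mu, sigma) estimated from the training values only."""
    mu = fmean(train_values)
    sigma = pstdev(train_values)  # population std, which StandardScaler also uses
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant channels
    return mu, sigma


def transform(values, mu, sigma):
    """Apply z-score normalization with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


series = [float(v) for v in range(1, 11)]  # 1.0 ... 10.0, a single channel
n_train = 7                                # assumed: first 70% is the training part
mu, sigma = fit_scaler(series[:n_train])
scaled = transform(series, mu, sigma)      # all time steps use the training statistics
```

Fitting on the training part alone keeps validation and test statistics from leaking into the normalization.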

get_data()

Use this to download the time series, (optionally) scale the data, and (optionally) simulate irregular time series.

The downloaded file will be saved as archive.csv or archive.zip depending on the data_source.
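Simulating irregular time series by dropping time steps can be sketched as below. This is only an illustration: simulate_irregular is a hypothetical helper, and it assumes each time step is dropped independently with probability irregular_dropout and that dropped values are marked as missing (None here) with a mask that is True for observed steps.

```python
import random


def simulate_irregular(series, dropout, seed=0):
    """Return (values, mask); dropped steps become None and mask is False there."""
    if not 0.0 <= dropout <= 1.0:
        raise ValueError("dropout must lie in [0.0, 1.0]")
    rng = random.Random(seed)
    # random() is in [0, 1), so dropout=0.0 keeps every step and 1.0 drops every step.
    mask = [rng.random() >= dropout for _ in series]
    values = [v if keep else None for v, keep in zip(series, mask)]
    return values, mask


series = [0.1, 0.2, 0.3, 0.4, 0.5]
kept_all, mask_all = simulate_irregular(series, dropout=0.0)
dropped_all, mask_none = simulate_irregular(series, dropout=1.0)
some, mask_some = simulate_irregular(series, dropout=0.4)
```

The mask follows the same convention as data_mask (observed versus missing), so it could be stored alongside the data.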