gents.dataset package

Subpackages

gents.dataset.modules package
- Module contents
  - AirQuality
  - ECG
  - ETTh1
  - ETTh2
  - ETTm1
  - ETTm2
  - Electricity
  - Energy
  - Exchange
  - MuJoCo
  - Physionet
  - SineND
  - Spiral2D
  - Stocks
  - Traffic
  - Weather

Submodules

gents.dataset.base module

class gents.dataset.base.BaseDataModule(seq_len: int, seq_dim: int, condition: str, batch_size: int, inference_batch_size: int, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], **kwargs)

Bases: LightningDataModule, ABC

Base class for time series data modules in PyTorch Lightning.

Parameters:

seq_len (int) – Target sequence length
seq_dim (int) – Target sequence dimension, for univariate time series, set as 1
condition (str) – Condition type. Choose from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.
batch_size (int) – Training and validation batch size.
inference_batch_size (int) – Testing batch size.
max_time (float, optional) – A time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to [0, …, max_time]. Defaults to None.
add_coeffs (str, optional) – Whether to include interpolation coefficients. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, coefficients are not included. Defaults to None.
irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series data by randomly dropping some time steps from the original data. Must be in [0.0, 1.0]. Defaults to 0.0.
data_dir (str, optional) – Directory in which to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.
train_val_test (List[float], optional) – Ratios of the training, validation and testing datasets. Should sum to 1.0. Defaults to [0.7, 0.2, 0.1].
**kwargs – Additional arguments for the dataset

predict_dataloader()

An iterable or collection of iterables specifying prediction samples.

For more information about multiple dataloaders, see the multiple dataloaders section of the PyTorch Lightning documentation.

It’s recommended that all data downloads and preparation happen in prepare_data().

predict()
prepare_data()
setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns: A torch.utils.data.DataLoader or a sequence of them specifying prediction samples.

prepare_data() → None

Use this to download and save time series files.

Default path is data/DATASET_NAME/data_tsl{SEQ_LEN}_tsd{SEQ_DIM}_ir{IRREGULAR_DROPOUT}.pt.

The default data path is checked first. If no data file exists there, self.get_data() is called and the tuple (data, data_mask, class_label) is saved at the default path.
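
The check-then-generate behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: default_data_path and prepare_cached are hypothetical helpers, get_data is passed in as a plain callable, and pickle stands in for whatever serializer is actually used to write the .pt file.

```python
import pickle
import tempfile
from pathlib import Path


def default_data_path(data_dir, dataset_name, seq_len, seq_dim, irregular_dropout):
    """Build the default file path from the documented naming pattern (hypothetical helper)."""
    file_name = f"data_tsl{seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt"
    return Path(data_dir) / dataset_name / file_name


def prepare_cached(path, get_data):
    """Call get_data() and save its result only if no file exists at `path`.

    Returns True if the file was created, False if it already existed.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    data, data_mask, class_label = get_data()
    with path.open("wb") as f:
        # Assumption: pickle is used only to keep this sketch dependency-free.
        pickle.dump((data, data_mask, class_label), f)
    return True


calls = []


def fake_get_data():
    """Stand-in for a subclass's get_data(); records how often it is called."""
    calls.append(1)
    return [[0.0, 1.0]], [[1, 1]], None


with tempfile.TemporaryDirectory() as tmp:
    path = default_data_path(tmp, "SineND", 24, 1, 0.0)
    first = prepare_cached(path, fake_get_data)   # file is missing, so data is generated
    second = prepare_cached(path, fake_get_data)  # file exists, so get_data() is skipped
    file_name = path.name
```

The second call returns without invoking get_data(), which is the point of checking the default path first.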

setup(stage)

Use this to load the data file, perform train/val/test splits, and build the datasets.

Default train/val/test ratio = [0.7, 0.2, 0.1].
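
As a rough illustration of how a ratio list such as [0.7, 0.2, 0.1] maps to sample counts, consider the sketch below. split_indices is a hypothetical helper, and the contiguous, in-order split is an assumption: this page does not say whether the library shuffles samples before splitting.

```python
def split_indices(n_samples, train_val_test=(0.7, 0.2, 0.1)):
    """Return (train, val, test) index ranges for n_samples, split in order."""
    if abs(sum(train_val_test) - 1.0) > 1e-8:
        raise ValueError("train_val_test ratios should sum to 1.0")
    n_train = int(n_samples * train_val_test[0])
    n_val = int(n_samples * train_val_test[1])
    # The test split takes whatever remains, so no sample is lost to rounding.
    train = range(0, n_train)
    val = range(n_train, n_train + n_val)
    test = range(n_train + n_val, n_samples)
    return train, val, test


train, val, test = split_indices(1000)
```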

test_dataloader()

An iterable or collection of iterables specifying test samples.

For more information about multiple dataloaders, see the multiple dataloaders section of the PyTorch Lightning documentation.

For data processing use the following pattern:

download in prepare_data()

process and split in setup()

However, the above are only necessary for distributed processing.

Warning

Do not assign state in prepare_data().

test()
prepare_data()
setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

train_dataloader()

An iterable or collection of iterables specifying training samples.

For more information about multiple dataloaders, see the multiple dataloaders section of the PyTorch Lightning documentation.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern:

download in prepare_data()

process and split in setup()

However, the above are only necessary for distributed processing.

Warning

Do not assign state in prepare_data().

fit()
prepare_data()
setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

val_dataloader()

An iterable or collection of iterables specifying validation samples.

For more information about multiple dataloaders, see the multiple dataloaders section of the PyTorch Lightning documentation.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

fit()
validate()
prepare_data()
setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

Bases: Dataset

Time series dataset.

Parameters:

data (torch.Tensor) – Time series data, in shape of [n_sample, total_seq_len, seq_dim]. Each sample can be a window from a long time series or a single signal from an object (e.g. a patient's measurements, a trajectory).
data_mask (torch.BoolTensor, optional) – Time series data mask, in shape of [n_sample, total_seq_len, seq_dim]. It records whether the original data has missing values: 1 for observed, 0 for missing. If None, the data is assumed to have no missing values. Defaults to None.
class_label (torch.IntTensor, optional) – Class labels for time series, in shape of [n_sample, ]. Defaults to None.
condition (str, optional) – Condition type. Choose from [None, ‘predict’, ‘impute’, ‘class’]. Defaults to None.
max_time (int, optional) – A time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to [0, …, max_time]. Defaults to None.
add_coeffs (str, optional) – Whether to include interpolation coefficients. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, coefficients are not included. Defaults to None.
**kwargs – Additional arguments for the model, e.g. obs_len, missing_rate

A batch dictionary:

“seq”: [batch_size, total_seq_len, seq_dim]. Target time series window
“t”: [batch_size, total_seq_len]. Time step index at each time step in the window.
“data_mask”: [batch_size, total_seq_len, seq_dim]. Time series data mask
“c”: [batch_size, obs_len / seq_len]. Condition. Empty if unconditional.
“coeffs”: [batch_size, total_seq_len, seq_dim]. Coefficients of the cubic spline or linear interpolation used by NCDE-related models. Empty if add_coeffs is None.
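
To make the “t” entry and the effect of max_time concrete, here is a small sketch for a single sample. make_time_index is a hypothetical helper, and linear rescaling so that the last step equals max_time is an assumption based on the parameter description, not a confirmed detail of the library.

```python
def make_time_index(total_seq_len, max_time=None):
    """Return the time step index, optionally rescaled so the last step equals max_time."""
    t = [float(i) for i in range(total_seq_len)]
    if max_time is None or total_seq_len < 2:
        return t
    # Assumption: linear rescaling so that the index spans [0, max_time].
    scale = max_time / (total_seq_len - 1)
    return [i * scale for i in t]


t_default = make_time_index(5)               # plain index 0, 1, 2, 3, 4
t_scaled = make_time_index(5, max_time=1.0)  # same steps, rescaled to end at 1.0
```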

class gents.dataset.base.WebDownloadDataModule(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: BaseDataModule

Data module that downloads data (.csv, or .zip containing a .csv) directly from a URL.

Class attributes:

D

Total sequence dimensions in the original data.

Type: int

index_col

Index column (i.e. the timestamp column), passed to pd.read_csv.

Type: str | int

url

Download URL.

Type: str | int

csv_dir

If data_source=‘zip’, csv_dir should be given to locate the .csv file.

Type: str | int

data_source

Choose from [‘zip’, ‘csv’].

Type: str

Parameters:

seq_len (int) – Target sequence length
select_seq_dim (List[int | str], optional) – Subset of the sequence channels. Either a list of int giving the chosen channel indices or a list of str giving the column names of the pd.DataFrame object. If None, all channels are used. Defaults to None.
batch_size (int, optional) – Training and validation batch size. Defaults to 32.
data_dir (str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.
train_val_test (List[float], optional) – Ratios of the training, validation and testing datasets. Should sum to 1.0. Defaults to [0.7, 0.2, 0.1].
condition (str, optional) – Condition type. Choose from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.
scale (bool, optional) – If True, StandardScaler will be used for z-score normalization. Training data will be used for calculating mu and sigma, then transform all time steps. Defaults to True.
inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.
max_time (float, optional) – A time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to [0, …, max_time]. Defaults to None.
add_coeffs (str, optional) – Whether to include interpolation coefficients. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, coefficients are not included. Defaults to None.
irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series data by randomly dropping some time steps from the original data. Must be in [0.0, 1.0]. Defaults to 0.0.
**kwargs – Additional arguments for the model
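
The scale option fits the normalization statistics on training data and then applies them to all time steps. A dependency-free sketch of that idea for one channel follows. fit_zscore and apply_zscore are hypothetical names, and treating the first 70% of the steps as the training portion is an assumption made for illustration.

```python
import statistics


def fit_zscore(train_values):
    """Compute mean and population standard deviation from training values only."""
    mu = statistics.mean(train_values)
    # pstdev uses ddof=0, which matches scikit-learn's StandardScaler.
    sigma = statistics.pstdev(train_values)
    return mu, sigma


def apply_zscore(values, mu, sigma):
    """Transform every time step with the training statistics."""
    return [(v - mu) / sigma for v in values]


series = [float(v) for v in range(1, 11)]  # 10 time steps of a single channel
n_train = 7  # assumed training portion: the first 70% of the steps
mu, sigma = fit_zscore(series[:n_train])
scaled = apply_zscore(series, mu, sigma)
```

Fitting on the training portion only keeps validation and test values from leaking into mu and sigma, so later steps can fall outside the range seen in training.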

get_data()

Use this to download the time series, (optionally) scale the data and (optionally) simulate irregular time series.

The downloaded file will be saved as archive.csv or archive.zip depending on the data_source.
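
One way to picture the irregular-sampling simulation is the sketch below, which drops each time step independently with probability irregular_dropout and records the result in a mask (1 for observed, 0 for missing). simulate_irregular is a hypothetical helper. Marking dropped steps as NaN rather than removing them, and the seed argument, are assumptions made for illustration.

```python
import random


def simulate_irregular(seq, dropout, seed=0):
    """Return (seq_with_gaps, mask): dropped steps become NaN and get mask 0."""
    rng = random.Random(seed)
    out, mask = [], []
    for step in seq:
        # rng.random() lies in [0, 1), so dropout=0.0 keeps everything
        # and dropout=1.0 drops everything.
        keep = rng.random() >= dropout
        mask.append(1 if keep else 0)
        out.append(step if keep else float("nan"))
    return out, mask


seq = [0.1, 0.2, 0.3, 0.4, 0.5]
kept_all, mask_all = simulate_irregular(seq, dropout=0.0)
dropped_all, mask_none = simulate_irregular(seq, dropout=1.0)
partial, mask_partial = simulate_irregular(seq, dropout=0.3)
```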