gents.dataset package
Subpackages
Submodules
gents.dataset.base module
- class gents.dataset.base.BaseDataModule(seq_len: int, seq_dim: int, condition: str, batch_size: int, inference_batch_size: int, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], **kwargs)
Bases: LightningDataModule, ABC
Base class for time series data modules in PyTorch Lightning.
- Parameters:
seq_len (int) – Target sequence length
seq_dim (int) – Target sequence dimension, for univariate time series, set as 1
condition (str) – Condition type, chosen from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.
batch_size (int) – Training and validation batch size.
inference_batch_size (int) – Testing batch size.
max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to [0, …, max_time]. Defaults to None.
add_coeffs (str, optional) – Whether to include interpolation coefficients. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.
irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series data by randomly dropping some time steps from the original data. Must be in [0.0, 1.0]. Defaults to 0.0.
data_dir (str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to Path.cwd() / “data”.
train_val_test (List[float], optional) – Ratios of the training, validation and testing datasets. Should sum to 1.0. Defaults to [0.7, 0.2, 0.1].
**kwargs – Additional arguments for the dataset
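To illustrate the max_time parameter, the sketch below builds the default time step index and rescales it so that the last step equals max_time. The helper name make_time_index and the linear rescaling are assumptions made for illustration; the package may compute the index differently.

```python
from typing import List, Optional


def make_time_index(total_seq_len: int, max_time: Optional[float] = None) -> List[float]:
    """Return [0, 1, ..., total_seq_len - 1], optionally rescaled to [0, max_time].

    Hypothetical helper: linear rescaling is an assumption, not the package's code.
    """
    steps = [float(i) for i in range(total_seq_len)]
    if max_time is None or total_seq_len < 2:
        return steps
    scale = max_time / (total_seq_len - 1)
    return [i * scale for i in steps]


default_index = make_time_index(5)               # unscaled integer steps
scaled_index = make_time_index(5, max_time=1.0)  # same steps mapped onto [0, 1]
```

A scaled index keeps the spacing between steps uniform while bounding the time range, which can be convenient for continuous-time models.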
- predict_dataloader()
An iterable or collection of iterables specifying prediction samples.
For more information about multiple dataloaders, see this section.
It’s recommended that all data downloads and preparation happen in prepare_data().
This dataloader is used by predict().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Returns:
A torch.utils.data.DataLoader or a sequence of them specifying prediction samples.
- prepare_data() → None
Use this to download and save time series files.
The default path is data/DATASET_NAME/data_tsl{SEQ_LEN}_tsd{SEQ_DIM}_ir{IRREGULAR_DROPOUT}.pt.
The default data path is checked first. If no data file exists there, self.get_data() is called and the tuple (data, data_mask, class_label) is saved at the default path.
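The check-then-create behaviour described for prepare_data() can be sketched as follows. This is a minimal standalone illustration, not the package's implementation: the function names are hypothetical, fake_get_data stands in for self.get_data(), and pickle stands in for whatever serializer the package actually uses to write the .pt file.

```python
import pickle
import tempfile
from pathlib import Path


def default_data_path(data_dir, dataset_name, seq_len, seq_dim, irregular_dropout):
    """Build the cache path using the naming pattern given in the docs."""
    file_name = f"data_tsl{seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt"
    return Path(data_dir) / dataset_name / file_name


def prepare_data(path, get_data):
    """Call get_data() and save its result only when no cached file exists.

    Returns True if the file was created, False if it was already present.
    pickle is an assumption standing in for the real serializer.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(get_data(), f)
    return True


def fake_get_data():
    # Hypothetical stand-in returning (data, data_mask, class_label).
    return ([[0.0, 1.0]], [[True, True]], None)


with tempfile.TemporaryDirectory() as tmp:
    path = default_data_path(tmp, "DATASET_NAME", 24, 1, 0.0)
    created_first = prepare_data(path, fake_get_data)   # file missing, so it is written
    created_second = prepare_data(path, fake_get_data)  # file present, so nothing happens
    file_name = path.name
```

Because the file name encodes the sequence length, dimension and dropout rate, changing any of them leads to a different cache file rather than silently reusing stale data.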
- setup(stage)
Use this to load the data file, perform train/val/test splits, and build the datasets.
Default train/val/test ratio = [0.7, 0.2, 0.1].
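As a rough illustration of how the ratios translate into split sizes, the sketch below converts train_val_test into sample counts. The helper split_sizes is hypothetical, and whether the package splits samples randomly or in order is not stated here, so only the counts are shown.

```python
from typing import List, Tuple


def split_sizes(n_sample: int, ratios: List[float]) -> Tuple[int, int, int]:
    """Turn train/val/test ratios into sample counts that add up to n_sample."""
    if len(ratios) != 3:
        raise ValueError("expected three ratios: train, val, test")
    if abs(sum(ratios) - 1.0) > 1e-8:
        raise ValueError("train_val_test ratios should sum to 1.0")
    n_train = round(n_sample * ratios[0])
    n_val = round(n_sample * ratios[1])
    n_test = n_sample - n_train - n_val  # the remainder goes to the test split
    return n_train, n_val, n_test


sizes = split_sizes(1000, [0.7, 0.2, 0.1])
```

Assigning the remainder to the last split guarantees that every sample lands in exactly one split even when the ratios do not divide the sample count evenly.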
- test_dataloader()
An iterable or collection of iterables specifying test samples.
For more information about multiple dataloaders, see this section.
For data processing use the following pattern:
download in prepare_data()
process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
This dataloader is used by test().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Note
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
- train_dataloader()
An iterable or collection of iterables specifying training samples.
For more information about multiple dataloaders, see this section.
The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
download in prepare_data()
process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
This dataloader is used by fit().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- val_dataloader()
An iterable or collection of iterables specifying validation samples.
For more information about multiple dataloaders, see this section.
The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
This dataloader is used by fit() and validate().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Note
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
- class gents.dataset.base.TSDataset(data: Tensor, data_mask: BoolTensor | None = None, class_label: IntTensor | None = None, condition: str | None = None, max_time: int | None = None, add_coeffs: str | None = None, **kwargs)
Bases: Dataset
Time series dataset.
- Parameters:
data (torch.Tensor) – Time series data, in shape of [n_sample, total_seq_len, seq_dim]. Each sample can be a window from a long time series, or a single signal from an object (e.g. a patient’s measurements, a trajectory).
data_mask (torch.BoolTensor, optional) – Time series data mask, in shape of [n_sample, total_seq_len, seq_dim]. It records whether the original data has missing values: 1 for observed, 0 for missing. If None, the data is assumed to have no missing values. Defaults to None.
class_label (torch.IntTensor, optional) – Class labels for time series, in shape of [n_sample, ]. Defaults to None.
condition (str, optional) – Condition type. Choose from [None, ‘predict’, ‘impute’, ‘class’]. Defaults to None.
max_time (int, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to [0, …, max_time]. Defaults to None.
add_coeffs (str, optional) – Whether to include interpolation coefficients. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.
**kwargs – Additional arguments for the model, e.g. obs_len, missing_rate
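The data parameter expects samples of shape [n_sample, total_seq_len, seq_dim], and one way to obtain them is to cut windows from a long series. The sketch below shows that windowing on nested lists, together with an all-observed mask of matching shape. The helper sliding_windows and its stride argument are hypothetical; real code would typically operate on tensors.

```python
from typing import List


def sliding_windows(series: List[List[float]], total_seq_len: int,
                    stride: int = 1) -> List[List[List[float]]]:
    """Cut a series [n_steps][seq_dim] into windows [n_sample][total_seq_len][seq_dim]."""
    n_steps = len(series)
    starts = range(0, n_steps - total_seq_len + 1, stride)
    return [series[s:s + total_seq_len] for s in starts]


# A toy series with 6 time steps and 2 channels.
long_series = [[float(t), float(10 * t)] for t in range(6)]
windows = sliding_windows(long_series, total_seq_len=4)
shape = (len(windows), len(windows[0]), len(windows[0][0]))

# Mask with the same shape as the data: True marks an observed value.
data_mask = [[[True] * len(step) for step in window] for window in windows]
```

With a stride of 1, consecutive windows overlap by total_seq_len - 1 steps, which yields many samples from one recording at the cost of strong correlation between them.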
- A batch dictionary:
“seq”: [batch_size, total_seq_len, seq_dim]. Target time series window
“t”: [batch_size, total_seq_len]. Time step index at each time step in the window.
“data_mask”: [batch_size, total_seq_len, seq_dim]. Time series data mask
“c”: [batch_size, obs_len / seq_len]. Condition. Empty if unconditional.
“coeffs”: [batch_size, total_seq_len, seq_dim]. Cubic spline or linear interpolation coefficients for NCDE-related models. Empty if add_coeffs is None.
- class gents.dataset.base.WebDownloadDataModule(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases: BaseDataModule
Data module that directly downloads data (a .csv file, or a .zip file that contains a .csv) from a URL.
Class attributes:
- D
Total sequence dimensions in the original data.
- Type:
int
- index_col
Index column (i.e. the time stamp column), as used by pd.read_csv.
- Type:
str | int
- url
Download URL.
- Type:
str | int
- csv_dir
If data_source=‘zip’, csv_dir must be given to locate the .csv file.
- Type:
str | int
- data_source
Choose from [‘zip’, ‘csv’].
- Type:
str
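A subclass would typically set these class attributes to describe its data source. The sketch below uses a stand-in base class so that it runs on its own. The subclass name, channel count, column name, URL and file path are placeholders, and archive_name is a hypothetical helper that only mirrors the documented naming of the downloaded file.

```python
class WebDownloadDataModuleStub:
    """Stand-in for gents.dataset.base.WebDownloadDataModule (not the real class)."""

    D: int
    index_col: "str | int"
    url: str
    csv_dir: str
    data_source: str

    @classmethod
    def archive_name(cls) -> str:
        # Hypothetical helper: the download is documented as being saved
        # as archive.csv or archive.zip depending on data_source.
        if cls.data_source not in ("zip", "csv"):
            raise ValueError("data_source must be 'zip' or 'csv'")
        return f"archive.{cls.data_source}"


class MyZippedDataset(WebDownloadDataModuleStub):
    D = 7                                    # placeholder channel count
    index_col = "date"                       # placeholder time stamp column
    url = "https://example.com/my_data.zip"  # placeholder URL
    csv_dir = "my_data/my_data.csv"          # placeholder path inside the zip
    data_source = "zip"


archive = MyZippedDataset.archive_name()
```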
- Parameters:
seq_len (int) – Target sequence length
select_seq_dim (List[int | str], optional) – Subset of all sequence channels. Can be a list of int giving the chosen channel indices, or a list of str giving the column names of the pd.DataFrame object. If None, use all channels. Defaults to None.
batch_size (int, optional) – Training and validation batch size. Defaults to 32.
data_dir (str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.
train_val_test (List[float], optional) – Ratios of the training, validation and testing datasets. Should sum to 1.0. Defaults to [0.7, 0.2, 0.1].
condition (str, optional) – Condition type, chosen from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.
scale (bool, optional) – If True, StandardScaler is used for z-score normalization. mu and sigma are computed from the training data and then used to transform all time steps. Defaults to True.
inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.
max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to [0, …, max_time]. Defaults to None.
add_coeffs (str, optional) – Whether to include interpolation coefficients. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.
irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series data by randomly dropping some time steps from the original data. Must be in [0.0, 1.0]. Defaults to 0.0.
**kwargs – Additional arguments for the model
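The effect of irregular_dropout can be pictured with the sketch below, which marks a random subset of time steps in one window as missing and returns a matching mask. The function simulate_irregular is hypothetical. Dropping whole time steps across all channels and filling them with NaN are assumptions made for this example, not a description of the package's code.

```python
import random
from typing import List, Tuple


def simulate_irregular(window: List[List[float]], dropout: float,
                       seed: int = 0) -> Tuple[List[List[float]], List[List[bool]]]:
    """Randomly drop time steps from one window [total_seq_len][seq_dim].

    Returns (values, mask): dropped steps hold NaN and have a False mask.
    """
    if not 0.0 <= dropout <= 1.0:
        raise ValueError("dropout must be in [0.0, 1.0]")
    rng = random.Random(seed)
    n_steps = len(window)
    n_drop = round(n_steps * dropout)
    dropped = set(rng.sample(range(n_steps), n_drop))
    values = [
        [float("nan")] * len(step) if t in dropped else list(step)
        for t, step in enumerate(window)
    ]
    mask = [[t not in dropped] * len(step) for t, step in enumerate(window)]
    return values, mask


window = [[float(t), float(t) + 0.5] for t in range(10)]  # 10 steps, 2 channels
values, mask = simulate_irregular(window, dropout=0.3, seed=1)
n_missing_steps = sum(1 for row in mask if not any(row))
```

Keeping the mask alongside the values lets downstream code distinguish simulated gaps from real observations.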
- get_data()
Use this to download the time series, (optionally) scale the data, and (optionally) simulate irregular time series.
The downloaded file will be saved as archive.csv or archive.zip depending on the data_source.
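The optional scaling step is z-score normalization with statistics taken from the training portion only. The sketch below reproduces that idea for a single channel using the standard library. The function names are hypothetical, and the population standard deviation is used because that is what scikit-learn's StandardScaler computes.

```python
from statistics import fmean, pstdev
from typing import List, Tuple


def fit_standard_scaler(train_values: List[float]) -> Tuple[float, float]:
    """Compute mu and sigma from the training values only."""
    mu = fmean(train_values)
    sigma = pstdev(train_values)  # population std, as in StandardScaler
    return mu, (sigma if sigma > 0 else 1.0)  # avoid dividing by zero


def transform(values: List[float], mu: float, sigma: float) -> List[float]:
    """Apply z-score normalization with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
n_train = 7  # first 70% of the steps act as training data in this toy example
mu, sigma = fit_standard_scaler(series[:n_train])
scaled = transform(series, mu, sigma)  # all steps use the training statistics
```

Fitting on the training portion alone keeps information from the validation and test portions out of the normalization constants.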