gents.dataset.modules package

Module contents

class gents.dataset.modules.AirQuality(seq_len: int = 24, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float = 1.0, add_coeffs: str | None = None, irregular_dropout: float = 0.0, train_val_test: List[float] = [0.7, 0.2, 0.1], **kwargs)

Bases: BaseDataModule

AirQuality dataset (only the Beijing stations are used).

Note

The original data has missing values, so irregular_dropout is disabled.

feat_name

Feature names of multivariate time series. [“PM2.5”, “PM10”, “NO2”, “CO”, “O3”, “SO2”]

Type:

List[str]

D

Number of variates, i.e. len(feat_name)

Type:

int

urls

download link

Type:

str

Parameters:
  • seq_len (int) – Target sequence length

  • select_seq_dim (List[int | str], optional) – Subset of the sequence channels, given either as a list of int channel indices or as a list of str column names of the pd.DataFrame object. If None, use all channels. Defaults to None.

  • batch_size (int, optional) – Training and validation batch size. Defaults to 32.

  • data_dir (str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.

  • condition (str, optional) – Condition type, chosen from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation. Defaults to None.

  • scale (bool, optional) – If True, StandardScaler is used for z-score normalization. mu and sigma are computed on the training data and then applied to all time steps. Defaults to True.

  • inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.

  • max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to span [0, …, max_time]. Defaults to 1.0.

  • add_coeffs (str, optional) – Whether to include interpolation coefficients, which are needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.

  • irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping some time steps of the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.

  • **kwargs – Additional arguments for the model
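The scale option above describes fitting the normalization on the training data only and then applying it to every time step. A minimal sketch of that idea for a single channel, using only the standard library: the helper names fit_zscore and apply_zscore are hypothetical, and treating the first 70% of the series as the training portion is an assumption, not something the source specifies.

```python
import statistics


def fit_zscore(train_values):
    """Estimate mu and sigma from the training portion only."""
    mu = statistics.fmean(train_values)
    # Population standard deviation (ddof=0), which is what
    # scikit-learn's StandardScaler uses.
    sigma = statistics.pstdev(train_values)
    if sigma == 0.0:
        sigma = 1.0  # avoid division by zero for a constant channel
    return mu, sigma


def apply_zscore(values, mu, sigma):
    """Standardize every time step with the training statistics."""
    return [(v - mu) / sigma for v in values]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
n_train = 7  # assumed: first 70% of the series is the training portion
mu, sigma = fit_zscore(series[:n_train])
scaled = apply_zscore(series, mu, sigma)
```

Because the statistics come from the training portion alone, validation and test values can fall outside the range seen during fitting, which is the expected behavior for this kind of normalization.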

class gents.dataset.modules.ECG(seq_len: int = 140, seq_dim: int = 1, batch_size: int = 32, data_dir: str = './data', condition: str | None = None, inference_batch_size: int = 1024, max_time: float = 1.0, add_coeffs: str | None = None, irregular_dropout: float = 0.0, train_val_test: List[float] = [0.7, 0.2, 0.1], **kwargs)

Bases: BaseDataModule

ECG5000 dataset. The raw data has already been scaled.

Note

This is a univariate time series dataset, i.e. seq_dim = 1.

Note

This is a fixed-length dataset: every time series has total_seq_len = 140. For the ‘predict’ condition, the lengths must satisfy total_seq_len = obs_len + seq_len <= 140.

Note

Class labels of ECG are patient statuses (in total 5 labels).

L

Total sequence length, fixed to 140.

Type:

int

D

Number of variates, fixed to 1.

Type:

int

n_classes

Total number of class labels, fixed to 5.

Type:

int

urls

download link

Type:

str

Parameters:
  • seq_len (int, optional) – Target sequence length, fixed to 140.

  • seq_dim (int, optional) – Target sequence dimensions, fixed to 1.

  • batch_size (int, optional) – Training and validation batch size. Defaults to 32.

  • data_dir (str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.

  • condition (str, optional) – Condition type, chosen from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation. Defaults to None.

  • inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.

  • max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to span [0, …, max_time]. Defaults to 1.0.

  • add_coeffs (str, optional) – Whether to include interpolation coefficients, which are needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.

  • irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping some time steps of the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.

  • **kwargs – Additional arguments for the model
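The length rule for the ‘predict’ condition, obs_len + seq_len <= 140, can be checked before splitting a series into an observed window and a target window. The sketch below is illustrative only: split_for_prediction is a hypothetical helper, and taking both windows from the start of the series is an assumption about how the split is laid out.

```python
TOTAL_SEQ_LEN = 140  # every ECG5000 series has exactly 140 time steps


def split_for_prediction(series, obs_len, seq_len):
    """Split one series into an observed window and a target window."""
    if obs_len < 0 or seq_len <= 0:
        raise ValueError("obs_len must be >= 0 and seq_len must be > 0")
    if obs_len + seq_len > TOTAL_SEQ_LEN:
        raise ValueError(
            f"obs_len + seq_len = {obs_len + seq_len} exceeds {TOTAL_SEQ_LEN}"
        )
    observed = series[:obs_len]
    target = series[obs_len:obs_len + seq_len]
    return observed, target


series = [float(t) for t in range(TOTAL_SEQ_LEN)]
observed, target = split_for_prediction(series, obs_len=100, seq_len=40)
```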

class gents.dataset.modules.ETTh1(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: WebDownloadDataModule

ETTh1 dataset. We download the preprocessed data according to TSLib.

D

Total number of variates, 7.

Type:

int

index_col

Time index column name.

Type:

int | str

urls

download link

Type:

str

data_source

Original data file type, fixed to ‘csv’.

Type:

str

class gents.dataset.modules.ETTh2(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: ETTh1

class gents.dataset.modules.ETTm1(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: ETTh1

class gents.dataset.modules.ETTm2(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: ETTh1

class gents.dataset.modules.Electricity(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: WebDownloadDataModule

Electricity dataset. The original data is resampled from 15-minute to 1-hour resolution. We download the preprocessed data according to TSLib.

D

Total number of variates, 321.

Type:

int

index_col

Time index column name.

Type:

int | str

urls

download link

Type:

str

csv_dir

.csv file path in the .zip file, fixed to “electricity/electricity.csv”.

Type:

str

data_source

Original data file type, fixed to ‘zip’.

Type:

str

class gents.dataset.modules.Energy(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: WebDownloadDataModule

Energy dataset. We download the preprocessed data from the TimeGAN repo.

D

Total number of variates, 28.

Type:

int

index_col

Time index column name.

Type:

int | str

urls

download link

Type:

str

data_source

Original data file type, fixed to ‘csv’.

Type:

str

class gents.dataset.modules.Exchange(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: WebDownloadDataModule

Exchange dataset. We download the preprocessed data according to TSLib.

D

Total number of variates, 8.

Type:

int

index_col

Time index column name.

Type:

int | str

urls

download link

Type:

str

csv_dir

.csv file path in the .zip file, fixed to “exchange_rate/exchange_rate.csv”.

Type:

str

data_source

Original data file type, fixed to ‘zip’.

Type:

str

class gents.dataset.modules.MuJoCo(seq_len: int = 200, select_seq_dim: List[int] | None = None, num_samples: int = 5000, batch_size: int = 32, data_dir: str = './data', condition: str | None = None, inference_batch_size: int = 1024, max_time: float = 1.0, add_coeffs: str | None = None, irregular_dropout: float = 0.0, train_val_test: List[float] = [0.7, 0.2, 0.1], **kwargs)

Bases: BaseDataModule

MuJoCo dataset simulated with the hopper standing task.

Note

Requires the DeepMind Control Suite to run (pip install dm_control).

D

Total number of variates, 14.

Type:

int

Parameters:
  • seq_len (int) – Target sequence length

  • select_seq_dim (List[int], optional) – Subset of the sequence channels, given as a list of int channel indices. If None, use all channels. Defaults to None.

  • num_samples (int, optional) – Number of total simulated curves. Defaults to 5000.

  • batch_size (int, optional) – Training and validation batch size. Defaults to 32.

  • data_dir (str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.

  • condition (str, optional) – Condition type, chosen from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation. Defaults to None.

  • inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.

  • max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to span [0, …, max_time]. Defaults to 1.0.

  • add_coeffs (str, optional) – Whether to include interpolation coefficients, which are needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.

  • irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping some time steps of the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.

  • **kwargs – Additional arguments for the model
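To make the irregular_dropout parameter concrete, the sketch below drops each time step independently with the given probability and marks dropped steps as NaN alongside a boolean mask. This is an assumption about the mechanism: the source only says that some time steps are randomly dropped, so the per-step independence, the NaN encoding, and the helper name drop_time_steps are all hypothetical.

```python
import math
import random


def drop_time_steps(series, dropout, seed=0):
    """Randomly mark time steps as missing with probability `dropout`."""
    if not 0.0 <= dropout <= 1.0:
        raise ValueError("dropout must lie in [0.0, 1.0]")
    rng = random.Random(seed)
    # mask[t] is True when time step t is kept (observed).
    mask = [rng.random() >= dropout for _ in series]
    values = [x if keep else math.nan for x, keep in zip(series, mask)]
    return values, mask


series = [0.1 * t for t in range(200)]
values, mask = drop_time_steps(series, dropout=0.3)
n_observed = sum(mask)
```

With dropout=0.0 every step is kept, and with dropout=1.0 every step is dropped, which matches the two ends of the documented range.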

class gents.dataset.modules.Physionet(agg_minutes: int = 60, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float = 1.0, add_coeffs: str | None = None, irregular_dropout: float = 0.0, train_val_test: List[float] = [0.7, 0.2, 0.1], **kwargs)

Bases: BaseDataModule

Physionet Challenge 2012 Dataset

There are two sets, set-a and set-b, which we concatenate to form the final dataset. Each patient’s measurements are recorded irregularly over 48 hours, and each record carries a timestamp in “HH:MM” format. The data can therefore be treated as a long sequence at 1-minute resolution with 48*60 = 2880 time steps. Users can aggregate (resample) the data into shorter sequences by setting agg_minutes; for example, agg_minutes=60 resamples to 1-hour resolution, giving sequences with 48 time steps.

As a result, the dataset has missing values at the time steps where nothing was recorded.
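The aggregation described above can be sketched for a single variable as follows. Each “HH:MM” timestamp is converted to minutes, assigned to a bin of width agg_minutes, and bins without any record stay missing. Averaging the values within a bin is an assumption (the source does not say which aggregation is used), and the helper names bin_index and aggregate are hypothetical.

```python
import math


def bin_index(timestamp, agg_minutes):
    """Map an "HH:MM" timestamp to the index of its aggregation bin."""
    hours, minutes = timestamp.split(":")
    return (int(hours) * 60 + int(minutes)) // agg_minutes


def aggregate(records, agg_minutes=60, total_minutes=48 * 60):
    """Average the values that fall into each bin; empty bins become NaN."""
    n_steps = math.ceil(total_minutes / agg_minutes)
    sums = [0.0] * n_steps
    counts = [0] * n_steps
    for timestamp, value in records:
        # Clamp so a measurement stamped exactly "48:00" lands in the last bin.
        idx = min(bin_index(timestamp, agg_minutes), n_steps - 1)
        sums[idx] += value
        counts[idx] += 1
    return [s / c if c else math.nan for s, c in zip(sums, counts)]


records = [("00:07", 1.0), ("00:37", 3.0), ("02:15", 5.0)]
hourly = aggregate(records, agg_minutes=60)
```

Here the two measurements in the first hour are averaged, the second hour has no record and stays NaN, and the result has 48 time steps.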

Note

The original data has missing values, so irregular_dropout is disabled.

Note

If condition=‘class’ is set, only set-a data is used, since set-b data has no class labels.

D

Total sequence dimensions in the original data, fixed to 35.

Type:

int

L

Maximum sequence length, fixed to 2880.

Type:

int

url

Download URL.

Type:

str | int

Parameters:
  • agg_minutes (int, optional) – Aggregation minutes. Defaults to 60.

  • select_seq_dim (List[int | str], optional) – Subset of the sequence channels, given either as a list of int channel indices or as a list of str column names of the pd.DataFrame object. If None, use all channels. Defaults to None.

  • batch_size (int, optional) – Training and validation batch size. Defaults to 32.

  • data_dir (str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.

  • condition (str, optional) – Condition type, chosen from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation. Defaults to None.

  • scale (bool, optional) – If True, StandardScaler is used for z-score normalization. mu and sigma are computed on the training data and then applied to all time steps. Defaults to True.

  • inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.

  • max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to span [0, …, max_time]. Defaults to 1.0.

  • add_coeffs (str, optional) – Whether to include interpolation coefficients, which are needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.
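The max_time behavior described above amounts to rescaling the integer time step index so that it spans [0, max_time]. A minimal sketch, assuming a simple linear rescaling; the helper name time_index is hypothetical and the exact formula used by the library is not stated in the source.

```python
def time_index(total_seq_len, max_time=None):
    """Build the time step index, optionally rescaled to [0, max_time]."""
    steps = list(range(total_seq_len))
    if max_time is None or total_seq_len < 2:
        # No rescaling requested (or nothing to rescale): keep 0, 1, 2, ...
        return [float(s) for s in steps]
    # Linear rescaling so the last step maps exactly to max_time.
    return [max_time * s / (total_seq_len - 1) for s in steps]


raw = time_index(5)
scaled = time_index(5, max_time=1.0)
```

With agg_minutes=60 the Physionet sequences have 48 steps, so under this assumption max_time=1.0 would map step 47 to 1.0.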

class gents.dataset.modules.SineND(seq_len: int = 200, seq_dim: int = 1, num_samples: int = 1000, batch_size: int = 32, data_dir: Path | str = Path.cwd() / 'data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: BaseDataModule

Simulated sine waves with N dimensions. For each dimension, \(x(t)=\sin(at+b), a \sim \mathcal{U}[0.05, 0.4], b \sim \mathcal{U}[0., 1.5]\).
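The formula above can be turned into a small simulator. The sketch below draws a and b once per dimension and evaluates x(t) = sin(at + b) on integer time steps t = 0, 1, …, seq_len - 1. The time grid is an assumption (the source does not state which values t takes), and simulate_sine is a hypothetical helper rather than the library's own generator.

```python
import math
import random


def simulate_sine(seq_len=200, seq_dim=1, seed=0):
    """Return one sample as a list of seq_len rows with seq_dim values each."""
    rng = random.Random(seed)
    # One (a, b) pair per dimension: a ~ U[0.05, 0.4], b ~ U[0, 1.5].
    params = [
        (rng.uniform(0.05, 0.4), rng.uniform(0.0, 1.5)) for _ in range(seq_dim)
    ]
    return [[math.sin(a * t + b) for a, b in params] for t in range(seq_len)]


sample = simulate_sine(seq_len=200, seq_dim=3)
```

Generating num_samples such curves with different seeds would give a dataset of shape (num_samples, seq_len, seq_dim).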

Parameters:
  • seq_len (int, optional) – Target sequence length. Defaults to 200.

  • seq_dim (int, optional) – Total simulated dimensions. Defaults to 1.

  • num_samples (int, optional) – Number of total simulated curves. Defaults to 1000.

  • batch_size (int, optional) – Training and validation batch size. Defaults to 32.

  • data_dir (Path | str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to Path.cwd() / “data”.

  • condition (str, optional) – Condition type, chosen from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation. Defaults to None.

  • inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.

  • max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to span [0, …, max_time]. Defaults to None.

  • add_coeffs (str, optional) – Whether to include interpolation coefficients, which are needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.

  • irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping some time steps of the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.

class gents.dataset.modules.Spiral2D(seq_len: int = 200, num_samples: int = 1000, batch_size: int = 32, data_dir: Path | str = Path.cwd() / 'data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: BaseDataModule

n_classes = 2

Simulated 2D spiral curves with clockwise or counterclockwise direction. For one curve, \(x_1(t)=\pm r(t) \cos(t), x_2(t)=r(t) \sin(t)\), where \(r(t)=a + bt, a \sim \mathcal{U}[0, 0.5), b \sim \mathcal{U}[0, 0.2)\).
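A single spiral following the formula above can be simulated as shown below. The sign in front of x_1 selects the direction: with the plus sign the curve turns counterclockwise as t grows, and with the minus sign it is mirrored and turns clockwise. The time range (t_max) and the uniform time grid are assumptions, since the source does not state them, and simulate_spiral is a hypothetical helper.

```python
import math
import random


def simulate_spiral(seq_len=200, clockwise=False, t_max=4 * math.pi, seed=0):
    """Return one spiral as a list of (x1, x2) points."""
    rng = random.Random(seed)
    a = rng.random() * 0.5  # a ~ U[0, 0.5)
    b = rng.random() * 0.2  # b ~ U[0, 0.2)
    sign = -1.0 if clockwise else 1.0
    points = []
    for i in range(seq_len):
        t = t_max * i / (seq_len - 1)  # assumed uniform grid on [0, t_max]
        r = a + b * t
        points.append((sign * r * math.cos(t), r * math.sin(t)))
    return points


ccw = simulate_spiral(clockwise=False, seed=0)
cw = simulate_spiral(clockwise=True, seed=0)
```

Using the same seed for both calls gives two curves that differ only in direction, which is one plausible reading of the two class labels (n_classes = 2).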

Parameters:
  • seq_len (int, optional) – Target sequence length. Defaults to 200.

  • num_samples (int, optional) – Number of total simulated curves. Defaults to 1000.

  • batch_size (int, optional) – Training and validation batch size. Defaults to 32.

  • data_dir (Path | str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to Path.cwd() / “data”.

  • condition (str, optional) – Condition type, chosen from [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation. Defaults to None.

  • inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.

  • max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to span [0, …, max_time]. Defaults to None.

  • add_coeffs (str, optional) – Whether to include interpolation coefficients, which are needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, no coefficients are included. Defaults to None.

  • irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping some time steps of the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.
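The default file name pattern quoted under data_dir can be reproduced with an f-string, which helps when locating or clearing cached data. This is only a sketch: cache_path is a hypothetical helper, and how the library formats the irregular_dropout value inside the name is an assumption.

```python
from pathlib import Path


def cache_path(data_dir, total_seq_len, seq_dim, irregular_dropout):
    """Build the cached data file path from the documented name pattern."""
    name = f"data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt"
    return Path(data_dir) / name


path = cache_path(
    Path.cwd() / "data", total_seq_len=200, seq_dim=2, irregular_dropout=0.0
)
```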

class gents.dataset.modules.Stocks(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: WebDownloadDataModule

Stocks dataset. We download the preprocessed data from the TimeGAN repo.

D

Total number of variates, 6.

Type:

int

index_col

Time index column name.

Type:

int | str

urls

download link

Type:

str

data_source

Original data file type, fixed to ‘csv’.

Type:

str

class gents.dataset.modules.Traffic(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: WebDownloadDataModule

Traffic dataset. We download the preprocessed data according to TSLib.

D

Total number of variates, 862.

Type:

int

index_col

Time index column name.

Type:

int | str

urls

download link

Type:

str

csv_dir

.csv file path in the .zip file, fixed to “traffic/traffic.csv”.

Type:

str

data_source

Original data file type, fixed to ‘zip’.

Type:

str

class gents.dataset.modules.Weather(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)

Bases: WebDownloadDataModule

Weather dataset. We download the preprocessed data according to TSLib.

D

Total number of variates, 21.

Type:

int

index_col

Time index column name.

Type:

int | str

urls

download link

Type:

str

csv_dir

.csv file path in the .zip file, fixed to “weather/weather.csv”.

Type:

str

data_source

Original data file type, fixed to ‘zip’.

Type:

str