gents.dataset.modules package
Module contents
- class gents.dataset.modules.AirQuality(seq_len: int = 24, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float = 1.0, add_coeffs: str | None = None, irregular_dropout: float = 0.0, train_val_test: List[float] = [0.7, 0.2, 0.1], **kwargs)
Bases:
BaseDataModule
AirQuality dataset (uses only the Beijing stations).
Note
The original data already contains missing values, so irregular_dropout is disabled.
- feat_name
Feature names of multivariate time series. [“PM2.5”, “PM10”, “NO2”, “CO”, “O3”, “SO2”]
- Type:
List[str]
- D
Number of variates, i.e. len(feat_name)
- Type:
int
- urls
-
- Type:
str
- Parameters:
seq_len (int) – Target sequence length
select_seq_dim (List[int | str], optional) – Subset of the sequence channels. Either a list of int giving the chosen channel indices, or a list of str giving the column names of the pd.DataFrame object. If None, use all channels. Defaults to None.
batch_size (int, optional) – Training and validation batch size. Defaults to 32.
data_dir (str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.
condition (str, optional) – Condition type, one of [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.
scale (bool, optional) – If True, StandardScaler will be used for z-score normalization. Training data will be used for calculating mu and sigma, then transform all time steps. Defaults to True.
inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.
max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to [0, …, max_time]. Defaults to 1.0.
add_coeffs (str, optional) – Include interpolation coefficients or not. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, don’t include. Defaults to None.
irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping time steps from the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.
**kwargs – Additional arguments for the model
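The scale behaviour described above (statistics computed on the training data, then applied to all time steps) can be sketched as follows. This is a minimal single-channel illustration using only the standard library, not the gents implementation: fit_zscore, apply_zscore, the toy series and the contiguous 70/20/10 split are all assumptions made for the example.

```python
from statistics import fmean, pstdev


def fit_zscore(train_values):
    """Return (mu, sigma) computed from the training values only."""
    mu = fmean(train_values)
    # Population standard deviation (ddof=0), the estimator StandardScaler uses.
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for a constant channel
    return mu, sigma


def apply_zscore(values, mu, sigma):
    """Normalize any split with the statistics fitted on the training split."""
    return [(v - mu) / sigma for v in values]


# Hypothetical single-channel series, split 70/20/10 in time order.
series = [float(v) for v in range(10)]
train, val, test = series[:7], series[7:9], series[9:]

mu, sigma = fit_zscore(train)
train_z = apply_zscore(train, mu, sigma)
val_z = apply_zscore(val, mu, sigma)
test_z = apply_zscore(test, mu, sigma)
```

Fitting on the training split only keeps validation and test statistics out of the normalization constants.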
- class gents.dataset.modules.ECG(seq_len: int = 140, seq_dim: int = 1, batch_size: int = 32, data_dir: str = './data', condition: str | None = None, inference_batch_size: int = 1024, max_time: float = 1.0, add_coeffs: str | None = None, irregular_dropout: float = 0.0, train_val_test: List[float] = [0.7, 0.2, 0.1], **kwargs)
Bases:
BaseDataModule
ECG5000 dataset. The raw data has already been scaled.
Note
This is a univariate time series dataset, i.e. seq_dim = 1.
Note
This is a fixed-length dataset: every time series has total_seq_len = 140. For the ‘predict’ condition, the lengths must satisfy total_seq_len = obs_len + seq_len <= 140.
Note
Class labels of ECG are patient statuses (in total 5 labels).
- L
Total sequence length, fixed to 140.
- Type:
int
- D
Number of variates, fixed to 1.
- Type:
int
- n_classes
Total number of class labels, fixed to 5.
- Type:
int
- urls
-
- Type:
str
- Parameters:
seq_len (int, optional) – Target sequence length, fixed to 140.
seq_dim (int, optional) – Target sequence dimensions, fixed to 1.
batch_size (int, optional) – Training and validation batch size. Defaults to 32.
data_dir (str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.
condition (str, optional) – Condition type, one of [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.
inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.
max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to [0, …, max_time]. Defaults to 1.0.
add_coeffs (str, optional) – Include interpolation coefficients or not. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, don’t include. Defaults to None.
irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping time steps from the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.
**kwargs – Additional arguments for the model
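The length rule for the ‘predict’ condition noted above (obs_len + seq_len <= 140) can be illustrated with a small sketch. How gents actually chooses obs_len and slices the series is not stated here, so split_for_prediction and its arguments are hypothetical; the sketch assumes the observed part is a prefix and the target window follows it directly.

```python
TOTAL_SEQ_LEN = 140  # fixed length of every ECG5000 series


def split_for_prediction(series, obs_len, seq_len):
    """Split one series into an observed prefix and a target window.

    Hypothetical helper: assumes the target directly follows the observed part.
    """
    if len(series) != TOTAL_SEQ_LEN:
        raise ValueError(f"expected {TOTAL_SEQ_LEN} time steps, got {len(series)}")
    if obs_len + seq_len > TOTAL_SEQ_LEN:
        raise ValueError("obs_len + seq_len must not exceed 140")
    observed = series[:obs_len]
    target = series[obs_len:obs_len + seq_len]
    return observed, target


# Toy series whose values equal their time index.
series = list(range(TOTAL_SEQ_LEN))
observed, target = split_for_prediction(series, obs_len=100, seq_len=40)
```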
- class gents.dataset.modules.ETTh1(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
WebDownloadDataModule
ETTh1 dataset. We download the preprocessed data according to TSLib.
- D
Total number of variates, 7.
- Type:
int
- index_col
Time index column name.
- Type:
int | str
- urls
-
- Type:
str
- data_source
Original data file type, fixed to ‘csv’.
- Type:
str
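The max_time argument in the signature above is described for the other modules on this page: a time step index [0, 1, …, total_seq_len - 1] is generated and, if max_time is given, rescaled to [0, …, max_time]. A minimal sketch of that rescaling, assuming it is linear so that the last step maps exactly to max_time; time_index is a hypothetical helper, not part of gents.

```python
def time_index(total_seq_len, max_time=None):
    """Return the time stamps attached to a sequence of the given length."""
    steps = range(total_seq_len)
    if max_time is None or total_seq_len < 2:
        # No rescaling: plain integer steps as floats.
        return [float(s) for s in steps]
    last = total_seq_len - 1
    # Assumed linear rescaling: step 0 -> 0.0, last step -> max_time.
    return [s / last * max_time for s in steps]


raw = time_index(3)
scaled = time_index(5, max_time=1.0)
```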
- class gents.dataset.modules.ETTh2(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
ETTh1
- class gents.dataset.modules.ETTm1(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
ETTh1
- class gents.dataset.modules.ETTm2(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
ETTh1
- class gents.dataset.modules.Electricity(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
WebDownloadDataModule
Electricity dataset. The original data is resampled from 15-minute to 1-hour resolution. We download the preprocessed data according to TSLib.
- D
Total number of variates, 321.
- Type:
int
- index_col
Time index column name.
- Type:
int | str
- urls
-
- Type:
str
- csv_dir
.csv file path in the .zip file, fixed to “electricity/electricity.csv”.
- Type:
str
- data_source
Original data file type, fixed to ‘zip’.
- Type:
str
- class gents.dataset.modules.Energy(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
WebDownloadDataModule
Energy dataset. We download the preprocessed data from the TimeGAN repo.
- D
Total number of variates, 28.
- Type:
int
- index_col
Time index column name.
- Type:
int | str
- urls
-
- Type:
str
- data_source
Original data file type, fixed to ‘csv’.
- Type:
str
- class gents.dataset.modules.Exchange(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
WebDownloadDataModule
Exchange dataset. We download the preprocessed data according to TSLib.
- D
Total number of variates, 8.
- Type:
int
- index_col
Time index column name.
- Type:
int | str
- urls
-
- Type:
str
- csv_dir
.csv file path in the .zip file, fixed to “exchange_rate/exchange_rate.csv”.
- Type:
str
- data_source
Original data file type, fixed to ‘zip’.
- Type:
str
- class gents.dataset.modules.MuJoCo(seq_len: int = 200, select_seq_dim: List[int] | None = None, num_samples: int = 5000, batch_size: int = 32, data_dir: str = './data', condition: str | None = None, inference_batch_size: int = 1024, max_time: float = 1.0, add_coeffs: str | None = None, irregular_dropout: float = 0.0, train_val_test: List[float] = [0.7, 0.2, 0.1], **kwargs)
Bases:
BaseDataModule
MuJoCo dataset with the hopper standing task.
Note
Requires the DeepMind Control Suite to run (pip install dm_control).
- D
Total number of variates, 14.
- Type:
int
- Parameters:
seq_len (int) – Target sequence length
select_seq_dim (List[int]) – Subset of the sequence channels, given as a list of int with the chosen channel indices. If None, use all channels. Defaults to None.
num_samples (int, optional) – Number of total simulated curves. Defaults to 5000.
batch_size (int) – Training and validation batch size.
data_dir (str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.
condition (str) – Condition type, one of [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.
inference_batch_size (int) – Testing batch size.
max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to [0, …, max_time]. Defaults to 1.0.
add_coeffs (str, optional) – Include interpolation coefficients or not. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, don’t include. Defaults to None.
irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping time steps from the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.
**kwargs – Additional arguments for the model
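The irregular_dropout parameter above simulates irregular sampling by randomly dropping time steps. One way to picture this, as a sketch under assumptions rather than the gents implementation: each time step is dropped independently with probability irregular_dropout and marked as NaN. simulate_irregular, its seed argument and the returned mask are hypothetical.

```python
import math
import random


def simulate_irregular(series, dropout, seed=0):
    """Drop each time step independently with probability `dropout`.

    Returns the series with dropped steps set to NaN, plus a boolean
    mask that is True where the value was kept.
    """
    if not 0.0 <= dropout <= 1.0:
        raise ValueError("dropout must lie in [0.0, 1.0]")
    rng = random.Random(seed)
    observed, mask = [], []
    for value in series:
        # rng.random() is in [0, 1), so dropout=0 keeps all, dropout=1 drops all.
        keep = rng.random() >= dropout
        mask.append(keep)
        observed.append(value if keep else math.nan)
    return observed, mask


series = [0.1 * t for t in range(200)]
observed, mask = simulate_irregular(series, dropout=0.3)
```

Returning the mask alongside the values lets downstream code distinguish simulated gaps from genuine observations.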
- class gents.dataset.modules.Physionet(agg_minutes: int = 60, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float = 1.0, add_coeffs: str | None = None, irregular_dropout: float = 0.0, train_val_test: List[float] = [0.7, 0.2, 0.1], **kwargs)
Bases:
BaseDataModule
Physionet Challenge 2012 Dataset
There are two sets, set-a and set-b, which we concatenate to form the final dataset. For each patient, measurements are recorded irregularly over 48 hours, and each record carries a timestamp in “HH:MM” format. The data can therefore be treated as a long sequence at 1-minute resolution with 48*60 = 2880 time steps. Users can aggregate (resample) the sequence to make it shorter by setting agg_minutes; e.g. agg_minutes=60 resamples at the 1-hour level, giving sequences with 48 time steps.
As a result, the dataset has missing values at the unrecorded time steps.
Note
The original data already contains missing values, so irregular_dropout is disabled.
Note
If condition=‘class’ is set, only set-a data is used, since set-b data has no class labels.
- D
Total sequence dimensions in the original data, fixed to 35.
- Type:
int
- L
Maximum sequence length, fixed to 2880.
- Type:
int
- url
download url link.
- Type:
str | int
- Parameters:
agg_minutes (int, optional) – Aggregation minutes. Defaults to 60.
select_seq_dim (List[int | str], optional) – Subset of the sequence channels. Either a list of int giving the chosen channel indices, or a list of str giving the column names of the pd.DataFrame object. If None, use all channels. Defaults to None.
batch_size (int, optional) – Training and validation batch size. Defaults to 32.
data_dir (str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to “./data”.
condition (str, optional) – Condition type, one of [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.
scale (bool, optional) – If True, StandardScaler will be used for z-score normalization. Training data will be used for calculating mu and sigma, then transform all time steps. Defaults to True.
inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.
max_time (float, optional) – The time step index [0, 1, …, total_seq_len - 1] is generated automatically. If max_time is given, the index is rescaled to [0, …, max_time]. Defaults to 1.0.
add_coeffs (str, optional) – Include interpolation coefficients or not. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, don’t include. Defaults to None.
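The agg_minutes resampling described above can be sketched in a few lines. This is an illustration under stated assumptions, not the gents code: records are hypothetical (timestamp, value) pairs for a single channel, values falling in the same bin are averaged, a record stamped exactly at the end of the window is clamped into the last bin, and bins without any record stay None (the missing values mentioned above).

```python
import math


def to_minutes(stamp):
    """Convert an "HH:MM" timestamp into minutes since the start of the record."""
    hours, minutes = stamp.split(":")
    return int(hours) * 60 + int(minutes)


def aggregate(records, agg_minutes=60, total_minutes=48 * 60):
    """Average irregular (stamp, value) records into fixed-width time bins."""
    n_bins = math.ceil(total_minutes / agg_minutes)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for stamp, value in records:
        # Assumption: clamp a record at the very end of the window into the last bin.
        idx = min(to_minutes(stamp) // agg_minutes, n_bins - 1)
        sums[idx] += value
        counts[idx] += 1
    # Bins with no record are left as None, i.e. missing.
    return [s / c if c else None for s, c in zip(sums, counts)]


records = [("00:07", 1.0), ("00:37", 3.0), ("02:15", 5.0)]
hourly = aggregate(records, agg_minutes=60)
half_hourly = aggregate(records, agg_minutes=30)
```

With agg_minutes=60 the 2880-minute window collapses to 48 bins, matching the sequence length stated above; smaller values of agg_minutes give longer and sparser sequences.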
- class gents.dataset.modules.SineND(seq_len: int = 200, seq_dim: int = 1, num_samples: int = 1000, batch_size: int = 32, data_dir: Path | str = Path.cwd() / 'data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
BaseDataModule
Simulated sine waves with N dimensions. For each dimension, \(x(t)=\sin(at+b), a \sim \mathcal{U}[0.05, 0.4], b \sim \mathcal{U}[0., 1.5]\).
- Parameters:
seq_len (int, optional) – Target sequence length. Defaults to 200.
seq_dim (int, optional) – Total simulated dimensions. Defaults to 1.
num_samples (int, optional) – Number of total simulated curves. Defaults to 1000.
batch_size (int, optional) – Training and validation batch size. Defaults to 32.
data_dir (Path | str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to Path.cwd()/“data”.
condition (str, optional) – Condition type, one of [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.
inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.
max_time (float, optional) – Time step index [0, 1, …, total_seq_len - 1] will be automatically generated. If max_time is given, then scale the time step index, [0, …, max_time]. Defaults to None.
add_coeffs (str, optional) – Include interpolation coefficients or not. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, don’t include. Defaults to None.
irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping time steps from the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.
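The sine-wave formula above, x(t) = sin(at + b) with a and b drawn per dimension, can be sketched with the standard library alone. simulate_sine is a hypothetical function, and the sketch assumes integer time steps t = 0, 1, …, seq_len - 1 and a fresh (a, b) pair for every sample and dimension; the actual time grid used by SineND is not stated here.

```python
import math
import random


def simulate_sine(seq_len=200, seq_dim=1, num_samples=1000, seed=0):
    """Return nested lists of shape (num_samples, seq_len, seq_dim)."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_samples):
        # One frequency a ~ U[0.05, 0.4] and phase b ~ U[0, 1.5] per dimension.
        params = [
            (rng.uniform(0.05, 0.4), rng.uniform(0.0, 1.5))
            for _ in range(seq_dim)
        ]
        sample = [
            [math.sin(a * t + b) for a, b in params]
            for t in range(seq_len)  # assumed integer time steps
        ]
        data.append(sample)
    return data


waves = simulate_sine(seq_len=10, seq_dim=3, num_samples=4)
```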
- class gents.dataset.modules.Spiral2D(seq_len: int = 200, num_samples: int = 1000, batch_size: int = 32, data_dir: Path | str = Path.cwd() / 'data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
BaseDataModule
- n_classes = 2
Simulated 2D spiral curves with clockwise or counterclockwise direction. For one curve, \(x_1(t)=\pm r(t) \cos(t), x_2(t)=r(t) \sin(t)\) where \(r(t)=a + bt, a \sim \mathcal{U}[0, 0.5), b \sim \mathcal{U}[0, 0.2)\).
- Parameters:
seq_len (int, optional) – Target sequence length. Defaults to 200.
num_samples (int, optional) – Number of total simulated curves. Defaults to 1000.
batch_size (int, optional) – Training and validation batch size. Defaults to 32.
data_dir (Path | str, optional) – Directory to save the data file (default name: “data_tsl{total_seq_len}_tsd{seq_dim}_ir{irregular_dropout}.pt”). Defaults to Path.cwd()/“data”.
condition (str, optional) – Condition type, one of [None, ‘predict’, ‘impute’, ‘class’]. None stands for unconditional generation.
inference_batch_size (int, optional) – Testing batch size. Defaults to 1024.
max_time (float, optional) – Time step index [0, 1, …, total_seq_len - 1] will be automatically generated. If max_time is given, then scale the time step index, [0, …, max_time]. Defaults to None.
add_coeffs (str, optional) – Include interpolation coefficients or not. Needed for KoVAE, GTGAN and SDEGAN. Choose from [None, ‘linear’, ‘cubic_spline’]. If None, don’t include. Defaults to None.
irregular_dropout (float, optional) – Dropout rate used to simulate irregular time series by randomly dropping time steps from the original data. Must lie in [0.0, 1.0]. Defaults to 0.0.
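The spiral equations above can be turned into a short simulation. This is a sketch, not the Spiral2D implementation: simulate_spiral is hypothetical, the time grid (evenly spaced t in [0, t_max] with t_max = 4π) is an assumption, and so is the mapping of the sign to a class label (0 for the "+" sign, which traces the curve counterclockwise; 1 for the "-" sign, which traces it clockwise).

```python
import math
import random


def simulate_spiral(seq_len=200, seed=0, t_max=4 * math.pi):
    """Simulate one 2D spiral; returns (list of (x1, x2) points, class label)."""
    if seq_len < 2:
        raise ValueError("seq_len must be at least 2")
    rng = random.Random(seed)
    a = rng.random() * 0.5  # a ~ U[0, 0.5)
    b = rng.random() * 0.2  # b ~ U[0, 0.2)
    label = rng.randrange(2)  # assumed labels: 0 = counterclockwise, 1 = clockwise
    sign = 1.0 if label == 0 else -1.0
    curve = []
    for i in range(seq_len):
        t = t_max * i / (seq_len - 1)  # assumed evenly spaced time grid
        r = a + b * t  # radius grows linearly with t
        curve.append((sign * r * math.cos(t), r * math.sin(t)))
    return curve, label


curve, label = simulate_spiral(seq_len=50, seed=1)
```

Because b is non-negative, the distance of each point from the origin never decreases along the curve, which gives a simple sanity check on simulated samples.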
- class gents.dataset.modules.Stocks(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
WebDownloadDataModule
Stocks dataset. We download the preprocessed data from the TimeGAN repo.
- D
Total number of variates, 6.
- Type:
int
- index_col
Time index column name.
- Type:
int | str
- urls
-
- Type:
str
- data_source
Original data file type, fixed to ‘csv’.
- Type:
str
- class gents.dataset.modules.Traffic(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
WebDownloadDataModule
Traffic dataset. We download the preprocessed data according to TSLib.
- D
Total number of variates, 862.
- Type:
int
- index_col
Time index column name.
- Type:
int | str
- urls
-
- Type:
str
- csv_dir
.csv file path in the .zip file, fixed to “traffic/traffic.csv”.
- Type:
str
- data_source
Original data file type, fixed to ‘zip’.
- Type:
str
- class gents.dataset.modules.Weather(seq_len: int, select_seq_dim: List[int | str] | None = None, batch_size: int = 32, data_dir: str = './data', train_val_test: List[float] = [0.7, 0.2, 0.1], condition: str | None = None, scale: bool = True, inference_batch_size: int = 1024, max_time: float | None = None, add_coeffs: str | None = None, irregular_dropout: float = 0.0, **kwargs)
Bases:
WebDownloadDataModule
Weather dataset. We download the preprocessed data according to TSLib.
- D
Total number of variates, 21.
- Type:
int
- index_col
Time index column name.
- Type:
int | str
- urls
-
- Type:
str
- csv_dir
.csv file path in the .zip file, fixed to “weather/weather.csv”.
- Type:
str
- data_source
Original data file type, fixed to ‘zip’.
- Type:
str