Data Module#

Dataloaders and datasets for modeling optimization problems.

Datasets#

class rlaopt.data.Dataset(X, y, device=None, dtype=torch.float32)[source]#

Bases: BaseDataset, TensorDataset

In-memory dataset for classical machine learning tasks.

Handles data matrices with labels/response vectors that fit in memory. Automatically converts numpy arrays and pandas DataFrames/Series to PyTorch tensors. Suitable for GLMs, classical statistical problems, and convex optimization tasks.

Parameters:

X (Tensor, np.ndarray, pd.DataFrame, or pd.Series) – Feature matrix of shape (n_samples, n_features).
y (Tensor, np.ndarray, pd.DataFrame, or pd.Series) – Target array of shape (n_samples, …). May have any number of dimensions.
device (str or torch.device, optional) – Device to place tensors on (e.g., ‘cpu’, ‘cuda’, ‘cuda:0’). Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to torch.float32.

Raises:

ValueError – If X is not 2-dimensional or if X and y have mismatched sample sizes.
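
The two conditions that trigger this ValueError can be pictured with a small standalone sketch. The helper name `check_shapes` is hypothetical and works on plain shape tuples rather than tensors; it only illustrates the documented rule, and the library's own validation code may differ.

```python
def check_shapes(x_shape, y_shape):
    """Hypothetical helper mirroring the documented checks on X and y."""
    # X must be a 2-D matrix of shape (n_samples, n_features).
    if len(x_shape) != 2:
        raise ValueError(
            f"X must be 2-dimensional, got {len(x_shape)} dimension(s)"
        )
    # y may have any number of dimensions, but its first axis must
    # contain the same number of samples as X.
    if len(y_shape) == 0 or y_shape[0] != x_shape[0]:
        raise ValueError(
            f"X has {x_shape[0]} samples but y has shape {y_shape}"
        )
    return x_shape[0], x_shape[1]


# A (100, 10) feature matrix with a (100, 3) multi-target array passes.
n_samples, n_features = check_shapes((100, 10), (100, 3))
print(n_samples, n_features)
```

Under this sketch, a 1-D X such as shape (100,) or a y with 99 rows would both raise ValueError.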

Examples

>>> # From numpy
>>> X = np.random.randn(100, 10)
>>> y = np.random.randn(100)
>>> data = Dataset(X, y)

>>> # From pandas with device specification
>>> df = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4]})
>>> y = pd.Series([5, 6])
>>> data = Dataset(df, y, device='cuda')

>>> # Multi-target
>>> y_multi = np.random.randn(100, 3)
>>> data = Dataset(X, y_multi)

__init__(X, y, device=None, dtype=torch.float32)[source]#

Initialize Dataset with feature matrix and target array.

Parameters:

X (Tensor | ndarray | DataFrame | Series)
y (Tensor | ndarray | DataFrame | Series)
device (str | device | None)
dtype (dtype)

__getitem__(index)[source]#: Retrieve a sample and its index.

classmethod from_numpy(X, y, device=None, dtype=torch.float32)[source]#

Create a Dataset from numpy arrays.

Parameters:

X (np.ndarray) – Feature matrix of shape (n_samples, n_features).
y (np.ndarray) – Target array of shape (n_samples, …). May have any number of dimensions.
device (str or torch.device, optional) – Device to place tensors on (e.g., ‘cpu’, ‘cuda’, ‘cuda:0’). Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to torch.float32.

Returns:

Dataset instance with data on specified device.

Return type:

Dataset

Examples

>>> X = np.random.randn(100, 10)
>>> y = np.random.randn(100)
>>> data = Dataset.from_numpy(X, y, device='cuda')

classmethod from_pandas(X, y, device=None, dtype=torch.float32)[source]#

Create a Dataset from pandas DataFrames or Series.

Parameters:

X (pd.DataFrame or pd.Series) – Feature data of shape (n_samples, n_features).
y (pd.DataFrame or pd.Series) – Target data of shape (n_samples, …). May have any number of dimensions.
device (str or torch.device, optional) – Device to place tensors on (e.g., ‘cpu’, ‘cuda’, ‘cuda:0’). Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to torch.float32.

Returns:

Dataset instance with data on specified device.

Return type:

Dataset

Examples

>>> # From separate DataFrames
>>> df_X = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})
>>> df_y = pd.Series([7, 8, 9])
>>> data = Dataset.from_pandas(df_X, df_y)

>>> # From a single DataFrame using column selection
>>> df = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6], 'y': [7, 8, 9]})
>>> data = Dataset.from_pandas(df[['x1', 'x2']], df['y'])

>>> # Multi-target
>>> df_multi = pd.DataFrame({'x1': [1, 2], 'y1': [3, 4], 'y2': [5, 6]})
>>> data = Dataset.from_pandas(df_multi[['x1']], df_multi[['y1', 'y2']])

to(device)[source]#

Move dataset to specified device.

Parameters:: device (str or torch.device) – Target device (e.g., ‘cpu’, ‘cuda’, ‘cuda:0’).
Returns:: New Dataset instance on the specified device.
Return type:: Dataset

Examples

>>> data = Dataset(X, y, device='cpu')
>>> data_gpu = data.to('cuda')
>>> print(data_gpu.device)  # cuda:0

property device: device#

Device where the dataset tensors are stored.

Type:: torch.device

property dtype: dtype#

Data type of the dataset tensors.

Type:: torch.dtype

property num_samples#

Total number of samples in the dataset.

Type:: int

property feature_dimension#

Number of features in the dataset.

Type:: int

property target_dimension#

Dimension(s) of the target.

Returns 1 for 1D targets (shape (n,)). For multi-dimensional targets, returns a tuple of dimensions excluding the sample dimension.

Examples

>>> # 1D target
>>> y = torch.randn(100)
>>> data = Dataset(X, y)
>>> data.target_dimension  # 1

>>> # 2D multi-target
>>> y = torch.randn(100, 5)
>>> data = Dataset(X, y)
>>> data.target_dimension  # (5,)

>>> # 3D target (e.g., images)
>>> y = torch.randn(100, 3, 28, 28)
>>> data = Dataset(X, y)
>>> data.target_dimension  # (3, 28, 28)

Type:: int or tuple
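
The rule described above can also be written as a function of the target's shape alone. The following is a minimal sketch; the name `target_dimension_from_shape` is hypothetical and is not part of rlaopt.

```python
def target_dimension_from_shape(y_shape):
    """Hypothetical helper applying the documented rule to a target shape."""
    # 1-D targets of shape (n,) report a scalar dimension of 1.
    if len(y_shape) == 1:
        return 1
    # Otherwise drop the leading sample axis and keep the rest as a tuple.
    return tuple(y_shape[1:])


for shape in [(100,), (100, 5), (100, 3, 28, 28)]:
    print(shape, "->", target_dimension_from_shape(shape))
```

Note that the return type changes between the two cases (int versus tuple), so calling code that needs a uniform type has to handle both.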

property X#

Feature matrix of shape (n_samples, n_features).

Type:: Tensor

property y#

Target array of shape (n_samples, …).

Type:: Tensor

__repr__()[source]#

Return string representation of the dataset.

Returns:: String showing dataset dimensions and device.
Return type:: str

class rlaopt.data.BatchedDataset[source]#

Bases: BaseDataset, ABC

Abstract base class for datasets that are too large to fit in memory.

Subclasses must implement __getitem__ and __len__ following torch.utils.data.Dataset conventions, as well as properties to introspect feature and target dimensions.

This class is designed for datasets that can only be accessed in batches, where loading the entire dataset into memory is infeasible.

Examples

>>> class MyLargeDataset(BatchedDataset):
...     def __init__(self, data_path):
...         self.data_path = data_path
...         # Load metadata to determine shapes
...
...     def __getitem__(self, idx):
...         # Load sample(s) from disk
...         return X, y, idx
...
...     def __len__(self):
...         return self.total_samples
...
...     @property
...     def feature_dimension(self):
...         return self.n_features
...
...     @property
...     def target_dimension(self):
...         return self.n_targets

__init__()[source]#: Initialize BatchedDataset.

abstractmethod __getitem__(idx)[source]#

Retrieve a sample or batch of samples.

Parameters:: idx (int or slice) – Index or slice of samples to retrieve.
Returns:: (X, y, idx) where X is the features, y is the target(s), and idx is the index or slice that was requested.
Return type:: tuple

abstractmethod __len__()[source]#

Return the total number of samples in the dataset.

Returns:: Total number of samples.
Return type:: int

property num_samples#

Total number of samples in the dataset.

Type:: int

abstract property feature_dimension#

Dimension(s) of the feature space.

Subclasses should implement this by inspecting metadata or a sample.

Type:: int or tuple

abstract property target_dimension#

Dimension(s) of the target space.

Subclasses should implement this by inspecting metadata or a sample.

Type:: int or tuple
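
To make the subclass contract concrete, here is a self-contained sketch that runs without rlaopt installed. `BatchedDatasetSketch` is a stand-in that mimics the documented interface and is not the real `rlaopt.data.BatchedDataset`. `ListBackedDataset` is a toy subclass that keeps rows in Python lists instead of on disk. Defining `num_samples` as `len(self)` is an assumption made for illustration.

```python
from abc import ABC, abstractmethod


class BatchedDatasetSketch(ABC):
    """Stand-in for the documented interface; NOT rlaopt.data.BatchedDataset."""

    @abstractmethod
    def __getitem__(self, idx):
        """Return (X, y, idx) for an int or slice index."""

    @abstractmethod
    def __len__(self):
        """Return the total number of samples."""

    @property
    def num_samples(self):
        # Assumption: the sample count is simply the dataset length.
        return len(self)

    @property
    @abstractmethod
    def feature_dimension(self):
        """Dimension(s) of the feature space."""

    @property
    @abstractmethod
    def target_dimension(self):
        """Dimension(s) of the target space."""


class ListBackedDataset(BatchedDatasetSketch):
    """Toy subclass: rows live in Python lists rather than on disk."""

    def __init__(self, rows, targets):
        self._rows = rows
        self._targets = targets

    def __getitem__(self, idx):
        # List indexing accepts both ints and slices, so one code path
        # covers single samples and contiguous batches.
        return self._rows[idx], self._targets[idx], idx

    def __len__(self):
        return len(self._rows)

    @property
    def feature_dimension(self):
        # Inspect a sample to determine the number of features.
        return len(self._rows[0])

    @property
    def target_dimension(self):
        # Targets in this toy example are scalars.
        return 1


ds = ListBackedDataset([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.0, 1.0, 0.0])
X, y, idx = ds[0:2]
print(ds.num_samples, ds.feature_dimension, ds.target_dimension)
```

Because the four members are abstract, instantiating `BatchedDatasetSketch` directly raises TypeError, which is the usual way an ABC enforces that subclasses supply them.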

Data Loading#

class rlaopt.data.DataLoader(dataset, batch_size=1, shuffle=None, sampler=None, batch_sampler=None, num_workers=0, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='', in_order=True)[source]#

Bases: torch.utils.data.DataLoader

Extended PyTorch DataLoader with custom Dataset support and lazy labels.

This DataLoader extends torch.utils.data.DataLoader to work specifically with Dataset and BatchedDataset types, providing additional functionality for accessing training labels efficiently based on the dataset type.

Parameters:

dataset (Dataset | BatchedDataset) – A Dataset or BatchedDataset instance to load data from.
batch_size – Number of samples per batch. Default: 1.
shuffle – Whether to shuffle the data at every epoch. Default: None.
sampler – Strategy to draw samples from the dataset. Default: None.
batch_sampler – Strategy to draw batches of samples. Default: None.
num_workers – Number of subprocesses for data loading. Default: 0.
pin_memory – Whether to copy tensors into CUDA pinned memory. Default: False.
drop_last – Whether to drop the last incomplete batch. Default: False.
timeout – Timeout value for collecting a batch from workers. Default: 0.
worker_init_fn – Function called on each worker subprocess. Default: None.
multiprocessing_context – Multiprocessing context for workers. Default: None.
generator – Random number generator for sampling. Default: None.
prefetch_factor – Number of batches loaded in advance by each worker. Default: None.
persistent_workers – Whether to keep workers alive between epochs. Default: False.
pin_memory_device – Device where tensors should be pinned. Default: “”.
in_order – Whether to maintain order when loading data. Default: True.

Raises:

TypeError – If dataset is not an instance of Dataset or BatchedDataset.

y#: Property that returns all training labels from the dataset. For Dataset instances, labels are retrieved directly from memory. For BatchedDataset instances, labels are collected by iterating through all batches.
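
For the BatchedDataset case, collecting labels batch by batch might look roughly like the sketch below. The helper `collect_labels` is hypothetical, and it assumes each batch is an (X, y, idx) tuple as documented for `BatchedDataset.__getitem__`. Plain lists stand in for tensors so the example runs on its own.

```python
def collect_labels(batches):
    """Hypothetical helper: gather labels from an iterable of (X, y, idx) batches."""
    labels = []
    for _, batch_y, _ in batches:
        # Only the label part of each batch is kept; features are discarded.
        labels.extend(batch_y)
    return labels


# Two toy batches covering three samples in total.
batches = [
    ([[1.0], [2.0]], [0, 1], [0, 1]),
    ([[3.0]], [1], [2]),
]
all_labels = collect_labels(batches)
print(all_labels)
```

A full pass like this touches every batch once, so for a dataset that does not fit in memory it is likely to be far more expensive than the in-memory Dataset case, where the labels are already available.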

Example

>>> dataset = MyDataset(...)
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
>>> labels = loader.y  # Access all training labels
>>> for batch_x, batch_y in loader:
...     pass  # Training loop

__init__(dataset, batch_size=1, shuffle=None, sampler=None, batch_sampler=None, num_workers=0, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='', in_order=True)[source]#

Initialize the DataLoader with the given dataset and parameters.

Parameters:: dataset (Dataset | BatchedDataset)

get_batch()[source]#

Fetch the next batch from the DataLoader.

Automatically resets the iterator upon consumption (end of epoch).

Return type:: tuple[Tensor, …]
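
One plausible reading of this reset behavior is the pattern sketched below: keep an iterator, and when it is exhausted, create a fresh one and continue. `CyclingFetcher` is a hypothetical stand-in written for illustration. It is not the rlaopt implementation, which may handle the end of an epoch differently.

```python
class CyclingFetcher:
    """Hypothetical stand-in illustrating the reset-on-exhaustion pattern."""

    def __init__(self, iterable):
        self._iterable = iterable
        self._iterator = iter(iterable)

    def get_batch(self):
        try:
            return next(self._iterator)
        except StopIteration:
            # End of epoch: start a fresh pass and return its first batch.
            # (An empty iterable would raise StopIteration again here.)
            self._iterator = iter(self._iterable)
            return next(self._iterator)


# Two batches per "epoch"; five calls wrap around twice.
fetcher = CyclingFetcher([("batch0",), ("batch1",)])
fetched = [fetcher.get_batch() for _ in range(5)]
print(fetched)
```

This style suits optimizers that request one batch per step without writing an explicit epoch loop.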

property shuffle: bool#: Whether the dataloader shuffles data each epoch.

property y#: Get all training labels from the dataset.
