Data Module
Dataloaders and datasets for modeling optimization problems.
Datasets
- class rlaopt.data.Dataset(X, y, device=None, dtype=torch.float32)
Bases: BaseDataset, TensorDataset
In-memory dataset for classical machine learning tasks.
Handles data matrices with labels/response vectors that fit in memory. Automatically converts numpy arrays and pandas DataFrames/Series to PyTorch tensors. Suitable for GLMs, classical statistical problems, and convex optimization tasks.
- Parameters:
X (Tensor, np.ndarray, pd.DataFrame, or pd.Series) – Feature matrix of shape (n_samples, n_features).
y (Tensor, np.ndarray, pd.DataFrame, or pd.Series) – Target array of shape (n_samples, …). Can be any dimensionality.
device (str or torch.device, optional) – Device to place tensors on (e.g., ‘cpu’, ‘cuda’, ‘cuda:0’). Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to torch.float32.
- Raises:
ValueError – If X is not 2-dimensional, or if X and y have mismatched sample sizes.
Examples
>>> # From numpy
>>> X = np.random.randn(100, 10)
>>> y = np.random.randn(100)
>>> data = Dataset(X, y)

>>> # From pandas with device specification
>>> df = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4]})
>>> y = pd.Series([5, 6])
>>> data = Dataset(df, y, device='cuda')

>>> # Multi-target
>>> y_multi = np.random.randn(100, 3)
>>> data = Dataset(X, y_multi)
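The two conditions listed under Raises can be pictured with a small stand-alone function. This is only a sketch: `check_shapes` and `raises_value_error` are hypothetical helpers that operate on plain shape tuples, the error messages are invented for illustration, and none of this is taken from the library's own validation code.

```python
def check_shapes(x_shape, y_shape):
    """Raise ValueError under the two documented conditions (illustrative only)."""
    if len(x_shape) != 2:
        raise ValueError(
            f"X must be 2-dimensional, got {len(x_shape)} dimension(s)"
        )
    # Treating a 0-dimensional y as a mismatch is an assumption of this sketch.
    if len(y_shape) == 0 or y_shape[0] != x_shape[0]:
        raise ValueError(
            f"X has {x_shape[0]} samples but y has shape {tuple(y_shape)}"
        )


def raises_value_error(x_shape, y_shape):
    """Return True if check_shapes rejects the given pair of shapes."""
    try:
        check_shapes(x_shape, y_shape)
    except ValueError:
        return True
    return False
```

Under these assumptions, a 1D feature array or a target with a different leading dimension is rejected, while a multi-target array such as shape (100, 3) paired with X of shape (100, 10) passes.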
- __init__(X, y, device=None, dtype=torch.float32)
Initialize Dataset with feature matrix and target array.
- classmethod from_numpy(X, y, device=None, dtype=torch.float32)
Create a Dataset from numpy arrays.
- Parameters:
X (np.ndarray) – Feature matrix of shape (n_samples, n_features).
y (np.ndarray) – Target array of shape (n_samples, …). Can be any dimensionality.
device (str or torch.device, optional) – Device to place tensors on (e.g., ‘cpu’, ‘cuda’, ‘cuda:0’). Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to torch.float32.
- Returns:
Dataset instance with data on specified device.
- Return type:
Dataset
Examples
>>> X = np.random.randn(100, 10)
>>> y = np.random.randn(100)
>>> data = Dataset.from_numpy(X, y, device='cuda')
- classmethod from_pandas(X, y, device=None, dtype=torch.float32)
Create a Dataset from pandas DataFrames or Series.
- Parameters:
X (pd.DataFrame or pd.Series) – Feature data of shape (n_samples, n_features).
y (pd.DataFrame or pd.Series) – Target data of shape (n_samples, …). Can be any dimensionality.
device (str or torch.device, optional) – Device to place tensors on (e.g., ‘cpu’, ‘cuda’, ‘cuda:0’). Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to torch.float32.
- Returns:
Dataset instance with data on specified device.
- Return type:
Dataset
Examples
>>> # From separate DataFrames
>>> df_X = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})
>>> df_y = pd.Series([7, 8, 9])
>>> data = Dataset.from_pandas(df_X, df_y)

>>> # From a single DataFrame using column selection
>>> df = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6], 'y': [7, 8, 9]})
>>> data = Dataset.from_pandas(df[['x1', 'x2']], df['y'])

>>> # Multi-target
>>> df_multi = pd.DataFrame({'x1': [1, 2], 'y1': [3, 4], 'y2': [5, 6]})
>>> data = Dataset.from_pandas(df_multi[['x1']], df_multi[['y1', 'y2']])
- to(device)
Move dataset to specified device.
- Parameters:
device (str or torch.device) – Target device (e.g., ‘cpu’, ‘cuda’, ‘cuda:0’).
- Returns:
New Dataset instance on the specified device.
- Return type:
Dataset
Examples
>>> data = Dataset(X, y, device='cpu')
>>> data_gpu = data.to('cuda')
>>> print(data_gpu.device)  # cuda:0
- property target_dimension
Dimension(s) of the target.
Returns 1 for 1D targets (shape (n,)). For multi-dimensional targets, returns a tuple of dimensions excluding the sample dimension.
Examples
>>> # 1D target
>>> y = torch.randn(100)
>>> data = Dataset(X, y)
>>> data.target_dimension  # 1

>>> # 2D multi-target
>>> y = torch.randn(100, 5)
>>> data = Dataset(X, y)
>>> data.target_dimension  # (5,)

>>> # 3D target (e.g., images)
>>> y = torch.randn(100, 3, 28, 28)
>>> data = Dataset(X, y)
>>> data.target_dimension  # (3, 28, 28)
- Type:
int or tuple
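The rule for target_dimension can be written as a function of the target's shape alone. The helper below is a hypothetical illustration of the documented behavior, not the property's actual implementation; it takes a plain shape tuple so that it runs without torch.

```python
def target_dimension_from_shape(y_shape):
    """Mirror the documented rule: 1 for 1D targets, else the trailing dims."""
    if len(y_shape) == 1:
        return 1
    # Drop the leading sample dimension and keep the rest as a tuple.
    return tuple(y_shape[1:])
```

Note the asymmetry this implies for callers: a 1D target yields the integer 1, while a target of shape (n, 1) yields the tuple (1,).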
- property X
Feature matrix of shape (n_samples, n_features).
- Type:
Tensor
- property y
Target array of shape (n_samples, …).
- Type:
Tensor
- class rlaopt.data.BatchedDataset
Bases: BaseDataset, ABC
Abstract base class for datasets that are too large to fit in memory.
Subclasses must implement __getitem__ and __len__ following torch.utils.data.Dataset conventions, as well as properties to introspect feature and target dimensions.
This class is designed for datasets that can only be accessed in batches, where loading the entire dataset into memory is infeasible.
Examples
>>> class MyLargeDataset(BatchedDataset):
...     def __init__(self, data_path):
...         self.data_path = data_path
...         # Load metadata to determine shapes
...
...     def __getitem__(self, idx):
...         # Load sample(s) from disk
...         return X, y
...
...     def __len__(self):
...         return self.total_samples
...
...     @property
...     def feature_dimension(self):
...         return self.n_features
...
...     @property
...     def target_dimension(self):
...         return self.n_targets
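For a version of this pattern that runs end to end, the sketch below reads rows from a CSV file on demand by recording byte offsets up front. Everything here is an assumption made for illustration: `BatchedDatasetSketch` is a hypothetical stand-in for the abstract interface (so the example runs without rlaopt installed), `CsvRowDataset` is an invented subclass, the file is assumed to be non-empty and purely numeric, and the last column is assumed to hold the target.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class BatchedDatasetSketch(ABC):
    """Hypothetical stand-in for the abstract interface described above."""

    @abstractmethod
    def __getitem__(self, idx): ...

    @abstractmethod
    def __len__(self): ...

    @property
    @abstractmethod
    def feature_dimension(self): ...

    @property
    @abstractmethod
    def target_dimension(self): ...


class CsvRowDataset(BatchedDatasetSketch):
    """Loads one row per access; only the row offsets are held in memory."""

    def __init__(self, path):
        self.path = path
        self._offsets = []
        # Single pass to record the byte position at which each row starts.
        with open(path, "rb") as f:
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():
                    self._offsets.append(pos)
        # Infer the feature count from the first row (last column is the target).
        self._n_features = len(self._read_row(0)) - 1

    def _read_row(self, idx):
        with open(self.path, "rb") as f:
            f.seek(self._offsets[idx])
            line = f.readline().decode("utf-8").strip()
        return [float(v) for v in line.split(",")]

    def __getitem__(self, idx):
        row = self._read_row(idx)
        return row[:-1], row[-1]

    def __len__(self):
        return len(self._offsets)

    @property
    def feature_dimension(self):
        return self._n_features

    @property
    def target_dimension(self):
        return 1


# Small demonstration on a temporary two-row file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    with open(path, "w") as f:
        f.write("1.0,2.0,3.0\n4.0,5.0,6.0\n")
    ds = CsvRowDataset(path)
    n_rows = len(ds)
    first_sample = ds[0]
    second_sample = ds[1]
```

A real subclass would return tensors rather than Python lists and would likely keep the file handle open or use a memory-mapped format; the offset index is just one simple way to get random access without loading the whole file.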
- abstractmethod __len__()
Return the total number of samples in the dataset.
- Returns:
Total number of samples.
- Return type:
int
Data Loading
- class rlaopt.data.DataLoader(dataset, batch_size=1, shuffle=None, sampler=None, batch_sampler=None, num_workers=0, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='', in_order=True)
Bases: DataLoader
Extended PyTorch DataLoader with custom Dataset support and lazy labels.
This DataLoader extends torch.utils.data.DataLoader to work specifically with Dataset and BatchedDataset types, providing additional functionality for accessing training labels efficiently based on the dataset type.
- Parameters:
dataset (Dataset | BatchedDataset) – A Dataset or BatchedDataset instance to load data from.
batch_size – Number of samples per batch. Default: 1.
shuffle – Whether to shuffle the data at every epoch. Default: None.
sampler – Strategy to draw samples from the dataset. Default: None.
batch_sampler – Strategy to draw batches of samples. Default: None.
num_workers – Number of subprocesses for data loading. Default: 0.
pin_memory – Whether to copy tensors into CUDA pinned memory. Default: False.
drop_last – Whether to drop the last incomplete batch. Default: False.
timeout – Timeout value for collecting a batch from workers. Default: 0.
worker_init_fn – Function called on each worker subprocess. Default: None.
multiprocessing_context – Multiprocessing context for workers. Default: None.
generator – Random number generator for sampling. Default: None.
prefetch_factor – Number of batches loaded in advance by each worker. Default: None.
persistent_workers – Whether to keep workers alive between epochs. Default: False.
pin_memory_device – Device where tensors should be pinned. Default: “”.
in_order – Whether to maintain order when loading data. Default: True.
- Raises:
TypeError – If dataset is not an instance of Dataset or BatchedDataset.
- y
Property that returns all training labels from the dataset. For Dataset instances, labels are retrieved directly from memory. For BatchedDataset instances, labels are collected by iterating through all batches.
Example
>>> dataset = MyDataset(...)
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
>>> labels = loader.y  # Access all training labels
>>> for batch_x, batch_y in loader:
...     # Training loop
...     pass
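The two label-access paths described for `y` can be contrasted in a few lines of plain Python. This is a rough sketch under stated assumptions: `InMemory` and `OnDemand` are hypothetical stand-ins for the two dataset kinds, `all_labels` is an invented helper, and the real property may batch, cache, or order labels differently.

```python
class InMemory:
    """Hypothetical stand-in for a dataset whose labels are already in memory."""

    def __init__(self, X, y):
        self.X, self.y = X, y


class OnDemand:
    """Hypothetical stand-in for a dataset that yields one (x, y) pair per index."""

    def __init__(self, pairs):
        self._pairs = pairs

    def __getitem__(self, idx):
        return self._pairs[idx]

    def __len__(self):
        return len(self._pairs)


def all_labels(dataset, batch_size=2):
    # In-memory case: the labels are read directly, with no iteration.
    if isinstance(dataset, InMemory):
        return list(dataset.y)
    # Batched case: walk the dataset one batch at a time and gather the labels.
    labels = []
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        labels.extend(dataset[i][1] for i in range(start, stop))
    return labels
```

The practical difference is cost: the first path is a direct read, while the second touches every sample once, which can be expensive for data stored on disk.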
- __init__(dataset, batch_size=1, shuffle=None, sampler=None, batch_sampler=None, num_workers=0, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='', in_order=True)
Initialize the DataLoader with the given dataset and parameters.
- Parameters:
dataset (Dataset | BatchedDataset)
- get_batch()
Fetch the next batch from the DataLoader.
Automatically resets the iterator upon consumption (end of epoch).
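One plausible reading of this reset behavior is sketched below in plain Python. `CyclingBatches` is a hypothetical helper written for illustration and is not the library's implementation; in particular, it assumes that the call which hits the end of an epoch immediately returns the first batch of the next one, and that the source is non-empty.

```python
class CyclingBatches:
    """Hands out batches one at a time, restarting when the source is exhausted."""

    def __init__(self, batches):
        self._batches = batches          # any re-iterable, e.g. a list of batches
        self._iterator = iter(batches)

    def get_batch(self):
        try:
            return next(self._iterator)
        except StopIteration:
            # End of epoch: begin a fresh pass and return its first batch.
            self._iterator = iter(self._batches)
            return next(self._iterator)


stream = CyclingBatches([[0, 1], [2, 3], [4]])
seen = [stream.get_batch() for _ in range(5)]
```

With three batches per pass, the fourth and fifth calls return the first two batches again, so a training loop can request a fixed number of steps without tracking epoch boundaries itself.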
- property y
Get all training labels from the dataset.