Data¶
DeepChem dc.data provides APIs for handling your data.
If your data is stored in files such as CSV or SDF, you can use the Data Loaders. The Data Loaders read your data, convert it to features (for example, SMILES strings to ECFP fingerprints), and save the features in a Dataset class. If your data consists of Python objects like NumPy arrays or pandas DataFrames, you can use the Datasets directly.
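For example, an in-memory NumPy array can be wrapped in a Dataset directly (a minimal sketch; the array contents are arbitrary):
>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.rand(4, 3)
>>> y = np.random.rand(4,)
>>> dataset = dc.data.NumpyDataset(X=X, y=y)
>>> len(dataset)
4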
Datasets¶
DeepChem dc.data.Dataset objects are one of the core building blocks of DeepChem programs. Dataset objects hold representations of data for machine learning and are widely used throughout DeepChem. The goal of the Dataset class is to be maximally interoperable with other common representations of machine learning datasets. For this reason we provide interconversion methods mapping from Dataset objects to pandas DataFrames, TensorFlow Datasets, and PyTorch datasets.
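For example, a Dataset can be round-tripped through a pandas DataFrame (a minimal sketch):
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(3, 2), y=np.random.rand(3,))
>>> df = dataset.to_dataframe()
>>> dataset2 = dc.data.NumpyDataset.from_dataframe(df)
>>> len(dataset2)
3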
NumpyDataset¶
The dc.data.NumpyDataset class provides an in-memory implementation of the abstract Dataset which stores its data in numpy.ndarray objects.
-
class
NumpyDataset
(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None, n_tasks: int = 1)[source]¶ A Dataset defined by in-memory numpy arrays.
This subclass of Dataset stores arrays X,y,w,ids in memory as numpy arrays. This makes it very easy to construct NumpyDataset objects.
Examples
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(5, 3), y=np.random.rand(5,), ids=np.arange(5))
-
__init__
(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None, n_tasks: int = 1) → None[source]¶ Initialize this object.
- Parameters
X (np.ndarray) – Input features. A numpy array of shape (n_samples,…).
y (np.ndarray, optional (default None)) – Labels. A numpy array of shape (n_samples, …). Note that each label can have an arbitrary shape.
w (np.ndarray, optional (default None)) – Weights. Should either be 1D array of shape (n_samples,) or if there’s more than one task, of shape (n_samples, n_tasks).
ids (np.ndarray, optional (default None)) – Identifiers. A numpy array of shape (n_samples,)
n_tasks (int, default 1) – Number of learning tasks.
-
get_shape
() → Tuple[Tuple[int, …], Tuple[int, …], Tuple[int, …], Tuple[int, …]][source]¶ Get the shape of the dataset.
Returns four tuples, giving the shape of the X, y, w, and ids arrays.
-
iterbatches
(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over minibatches from the dataset.
Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).
- Parameters
batch_size (int, optional (default None)) – Number of elements in each batch.
epochs (int, default 1) – Number of epochs to walk over dataset.
deterministic (bool, optional (default False)) – If True, follow deterministic order.
pad_batches (bool, optional (default False)) – If True, pad each batch to batch_size.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
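For example, a minimal sketch of minibatch iteration (array shapes chosen for illustration):
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4,))
>>> for X_b, y_b, w_b, ids_b in dataset.iterbatches(batch_size=2, deterministic=True):
...     print(X_b.shape)
(2, 3)
(2, 3)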
-
itersamples
() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over the samples in the dataset.
- Returns
Iterator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
Examples
>>> dataset = NumpyDataset(np.ones((2,2)))
>>> for x, y, w, id in dataset.itersamples():
...     print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [0.0] [0.0] 0
[1.0, 1.0] [0.0] [0.0] 1
-
transform
(transformer: transformers.Transformer, **args) → deepchem.data.datasets.NumpyDataset[source]¶ Construct a new dataset by applying a transformation to every sample in this dataset.
The argument is a function that can be called as follows:
>> newx, newy, neww = fn(x, y, w)
It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.
- Parameters
transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset
- Returns
A newly constructed NumpyDataset object
- Return type
NumpyDataset
-
select
(indices: Sequence[int], select_dir: Optional[str] = None) → deepchem.data.datasets.NumpyDataset[source]¶ Creates a new dataset from a selection of indices from self.
- Parameters
indices (List[int]) – List of indices to select.
select_dir (str, optional (default None)) – Used to provide same API as DiskDataset. Ignored since NumpyDataset is purely in-memory.
- Returns
A selected NumpyDataset object
- Return type
NumpyDataset
-
make_pytorch_dataset
(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]¶ Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.
- Parameters
epochs (int, default 1) – The number of times to iterate over the Dataset
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
- Returns
torch.utils.data.IterableDataset that iterates over the data in this dataset.
- Return type
torch.utils.data.IterableDataset
Note
This method requires PyTorch to be installed.
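As a sketch (assuming PyTorch is installed), the returned IterableDataset can be consumed like any Python iterable:
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4,))
>>> torch_ds = dataset.make_pytorch_dataset(epochs=1, batch_size=2)
>>> for X_b, y_b, w_b, ids_b in torch_ds:
...     pass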
-
static
from_DiskDataset
(ds: deepchem.data.datasets.DiskDataset) → deepchem.data.datasets.NumpyDataset[source]¶ Convert DiskDataset to NumpyDataset.
- Parameters
ds (DiskDataset) – DiskDataset to transform to NumpyDataset.
- Returns
A new NumpyDataset created from DiskDataset.
- Return type
NumpyDataset
-
to_json
(self, fname: str) → None[source]¶ Dump this NumpyDataset to a JSON file.
- Parameters
fname (str) – The name of the json file.
-
static
from_json
(fname: str) → deepchem.data.datasets.NumpyDataset[source]¶ Create NumpyDataset from the json file.
- Parameters
fname (str) – The name of the json file.
- Returns
A new NumpyDataset created from the json file.
- Return type
NumpyDataset
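A sketch of the JSON round trip (the filename here is arbitrary):
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(3, 2), y=np.random.rand(3,))
>>> dataset.to_json('dataset.json')
>>> restored = NumpyDataset.from_json('dataset.json')
>>> len(restored)
3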
-
static
merge
(datasets: Sequence[deepchem.data.datasets.Dataset]) → deepchem.data.datasets.NumpyDataset[source]¶ Merge multiple NumpyDatasets.
- Parameters
datasets (List[Dataset]) – List of datasets to merge.
- Returns
A single NumpyDataset containing all the samples from all datasets.
- Return type
NumpyDataset
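For example, merging two in-memory datasets (a minimal sketch):
>>> import numpy as np
>>> d1 = NumpyDataset(X=np.random.rand(3, 2))
>>> d2 = NumpyDataset(X=np.random.rand(2, 2))
>>> merged = NumpyDataset.merge([d1, d2])
>>> len(merged)
5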
-
static
from_dataframe
(df: pandas.core.frame.DataFrame, X: Optional[Union[str, Sequence[str]]] = None, y: Optional[Union[str, Sequence[str]]] = None, w: Optional[Union[str, Sequence[str]]] = None, ids: Optional[str] = None)[source]¶ Construct a Dataset from the contents of a pandas DataFrame.
- Parameters
df (pd.DataFrame) – The pandas DataFrame
X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
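A minimal sketch with explicit column names (the column names here are arbitrary):
>>> import pandas as pd
>>> df = pd.DataFrame({'feat': [1.0, 2.0], 'label': [0.0, 1.0]})
>>> dataset = NumpyDataset.from_dataframe(df, X='feat', y='label')
>>> len(dataset)
2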
-
get_statistics
(X_stats: bool = True, y_stats: bool = True) → Tuple[float, …][source]¶ Compute and return statistics of this dataset.
Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.
- Parameters
X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
- Returns
If X_stats == True, returns (X_means, X_stds).
If y_stats == True, returns (y_means, y_stds).
If both are true, returns (X_means, X_stds, y_means, y_stds).
- Return type
Tuple
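For instance, with both flags left at their default of True, the call returns a 4-tuple (a minimal sketch):
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10,))
>>> X_means, X_stds, y_means, y_stds = dataset.get_statistics()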
-
make_tf_dataset
(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]¶ Create a tf.data.Dataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.
- Parameters
batch_size (int, default 100) – The number of samples to include in each batch.
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
- Returns
TensorFlow Dataset that iterates over the same data.
- Return type
tf.data.Dataset
Note
This class requires TensorFlow to be installed.
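As a sketch (assuming TensorFlow is installed; shapes chosen for illustration), the returned dataset yields (X, y, w) batches:
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4,))
>>> tf_ds = dataset.make_tf_dataset(batch_size=2, epochs=1)
>>> for X_b, y_b, w_b in tf_ds:
...     print(X_b.shape)
(2, 3)
(2, 3)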
-
to_dataframe
() → pandas.core.frame.DataFrame[source]¶ Construct a pandas DataFrame containing the data from this Dataset.
- Returns
Pandas dataframe. If there is only a single feature per datapoint, it will have column “X”; otherwise it will have columns “X1,X2,…” for features. If there is only a single label per datapoint, it will have column “y”; otherwise it will have columns “y1,y2,…” for labels. If there is only a single weight per datapoint, it will have column “w”; otherwise it will have columns “w1,w2,…”. It will have column “ids” for identifiers.
- Return type
pd.DataFrame
-
DiskDataset¶
The dc.data.DiskDataset class allows for the storage of larger datasets on disk. Each DiskDataset is associated with a directory in which it writes its contents to disk. Note that a DiskDataset can be very large, so some of the utility methods to access fields of a Dataset can be prohibitively expensive.
-
class
DiskDataset
(data_dir: str)[source]¶ A Dataset that is stored as a set of files on disk.
The DiskDataset is the workhorse class of DeepChem that facilitates analyses on large datasets. Use this class whenever you’re working with a large dataset that can’t be easily manipulated in RAM.
On disk, a DiskDataset has a simple structure. All files for a given DiskDataset are stored in a data_dir. The contents of data_dir should be laid out as follows:
data_dir/
  metadata.csv.gzip
  tasks.json
  shard-0-X.npy
  shard-0-y.npy
  shard-0-w.npy
  shard-0-ids.npy
  ...
The metadata is constructed by static method DiskDataset._construct_metadata and saved to disk by DiskDataset._save_metadata. The metadata itself consists of a csv file which has columns (‘ids’, ‘X’, ‘y’, ‘w’, ‘ids_shape’, ‘X_shape’, ‘y_shape’, ‘w_shape’). tasks.json consists of a list of task names for this dataset.
The actual data is stored in .npy files (numpy array files) of the form ‘shard-0-X.npy’, ‘shard-0-y.npy’, etc.
The basic structure of DiskDataset is quite robust and will likely serve you well for datasets up to roughly 100 GB. However, note that DiskDataset has not been tested for very large datasets at the terabyte range and beyond; you may be better served by implementing a custom Dataset class for those use cases.
Examples
Let’s walk through a simple example of constructing a new DiskDataset.
>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
If you have already saved a DiskDataset to data_dir, you can reinitialize it with
>> data_dir = "/path/to/my/data"
>> dataset = dc.data.DiskDataset(data_dir)
Once you have a dataset you can access its attributes as follows
>>> X = np.random.rand(10, 10)
>>> y = np.random.rand(10,)
>>> w = np.ones_like(y)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> X, y, w = dataset.X, dataset.y, dataset.w
One thing to beware of is that dataset.X, dataset.y, and dataset.w load data from disk! If you have a large dataset, these operations can be extremely slow, so iterate through the dataset instead.
>>> for (xi, yi, wi, idi) in dataset.itersamples():
...     pass
Note
DiskDataset originally had a simpler metadata format without shape information. Older DiskDataset objects had metadata files with columns (‘ids’, ‘X’, ‘y’, ‘w’) and no additional shape columns. DiskDataset maintains backwards compatibility with this older metadata format, but for performance reasons we recommend not using legacy metadata for new projects.
-
__init__
(data_dir: str) → None[source]¶ Load a constructed DiskDataset from disk
Note that this method cannot construct a new disk dataset. Instead use static methods DiskDataset.create_dataset or DiskDataset.from_numpy for that purpose. Use this constructor instead to load a DiskDataset that has already been created on disk.
- Parameters
data_dir (str) – Location on disk of an existing DiskDataset.
-
static
create_dataset
(shard_generator: Iterable[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]], data_dir: Optional[str] = None, tasks: Optional[Sequence] = []) → deepchem.data.datasets.DiskDataset[source]¶ Creates a new DiskDataset
- Parameters
shard_generator (Iterable[Batch]) – An iterable (either a list or generator) that provides tuples of data (X, y, w, ids). Each tuple will be written to a separate shard on disk.
data_dir (str, optional (default None)) – Filename for data directory. Creates a temp directory if none specified.
tasks (Sequence, optional (default [])) – List of tasks for this dataset.
- Returns
A new DiskDataset constructed from the given data
- Return type
DiskDataset
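A sketch of building a DiskDataset from a generator of shards (shapes and contents are arbitrary):
>>> import numpy as np
>>> import deepchem as dc
>>> def shard_generator():
...     for i in range(2):
...         X = np.random.rand(5, 3)
...         y = np.random.rand(5, 1)
...         w = np.ones((5, 1))
...         ids = np.arange(i * 5, (i + 1) * 5)
...         yield (X, y, w, ids)
>>> dataset = dc.data.DiskDataset.create_dataset(shard_generator(), tasks=["task1"])
>>> len(dataset)
10
>>> dataset.get_number_shards()
2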
-
load_metadata
() → Tuple[List[str], pandas.core.frame.DataFrame][source]¶ Helper method that loads metadata from disk.
-
static
write_data_to_disk
(data_dir: str, basename: str, tasks: numpy.ndarray, X: Optional[numpy.ndarray] = None, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → List[Optional[str]][source]¶ Static helper method to write data to disk.
This helper method is used to write a shard of data to disk.
- Parameters
data_dir (str) – Data directory to write shard to.
basename (str) – Basename for the shard in question.
tasks (np.ndarray) – The names of the tasks in question.
X (np.ndarray, optional (default None)) – The features array.
y (np.ndarray, optional (default None)) – The labels array.
w (np.ndarray, optional (default None)) – The weights array.
ids (np.ndarray, optional (default None)) – The identifiers array.
- Returns
List with values [out_ids, out_X, out_y, out_w, out_ids_shape, out_X_shape, out_y_shape, out_w_shape] giving the filenames of the locations on disk to which these respective arrays were written.
- Return type
List[Optional[str]]
-
move
(new_data_dir: str, delete_if_exists: Optional[bool] = True) → None[source]¶ Moves dataset to new directory.
- Parameters
new_data_dir (str) – The new directory name to move this dataset to.
delete_if_exists (bool, optional (default True)) – If this option is set, delete the destination directory if it exists before moving. This is set to True by default to be backwards compatible with behavior in earlier versions of DeepChem.
Note
This is a stateful operation! self.data_dir will be moved into new_data_dir. If delete_if_exists is set to True (by default this is set True), then new_data_dir is deleted if it’s a pre-existing directory.
-
copy
(new_data_dir: str) → deepchem.data.datasets.DiskDataset[source]¶ Copies dataset to new directory.
- Parameters
new_data_dir (str) – The new directory name to copy this dataset to.
- Returns
A copied DiskDataset object.
- Return type
DiskDataset
Note
This is a stateful operation! Any data at new_data_dir will be deleted and self.data_dir will be deep copied into new_data_dir.
-
reshard
(shard_size: int) → None[source]¶ Reshards data to have specified shard size.
- Parameters
shard_size (int) – The size of each shard.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(100, 10)
>>> d = dc.data.DiskDataset.from_numpy(X)
>>> d.reshard(shard_size=10)
>>> d.get_number_shards()
10
Note
If this DiskDataset is in legacy_metadata format, reshard will convert this dataset to have non-legacy metadata.
-
itershards
() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Return an object that iterates over all shards in dataset.
Datasets are stored in sharded fashion on disk. Each call to next() for the generator defined by this function returns the data from a particular shard. The order of shards returned is guaranteed to remain fixed.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
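A sketch of shard-wise iteration, which avoids loading the full dataset into memory at once:
>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(100, 10)
>>> d = dc.data.DiskDataset.from_numpy(X)
>>> d.reshard(shard_size=25)
>>> for X_s, y_s, w_s, ids_s in d.itershards():
...     print(X_s.shape)
(25, 10)
(25, 10)
(25, 10)
(25, 10)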
-
iterbatches
(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over minibatches from the dataset.
It is guaranteed that the number of batches returned is math.ceil(len(dataset)/batch_size). Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).
- Parameters
batch_size (int, optional (default None)) – Number of elements in a batch. If None, then it yields batches with size equal to the size of each individual shard.
epochs (int, default 1) – Number of epochs to walk over dataset.
deterministic (bool, default False) – If False, shuffle each shard before generating batches; if True, keep the on-disk order. Note that this shuffling is only local: data is never mixed between different shards.
pad_batches (bool, default False) – Whether or not we should pad the last batch, globally, such that it has exactly batch_size elements.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
-
itersamples
() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over the samples in the dataset.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
Examples
>>> dataset = DiskDataset.from_numpy(np.ones((2,2)), np.ones((2,1)))
>>> for x, y, w, id in dataset.itersamples():
...     print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [1.0] [1.0] 0
[1.0, 1.0] [1.0] [1.0] 1
-
transform
(transformer: transformers.Transformer, parallel: bool = False, out_dir: Optional[str] = None, **args) → deepchem.data.datasets.DiskDataset[source]¶ Construct a new dataset by applying a transformation to every sample in this dataset.
The argument is a function that can be called as follows:
>> newx, newy, neww = fn(x, y, w)
It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.
- Parameters
transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset.
parallel (bool, default False) – If True, use multiple processes to transform the dataset in parallel.
out_dir (str, optional (default None)) – The directory to save the new dataset in. If this is omitted, a temporary directory is created automatically.
- Returns
A newly constructed Dataset object
- Return type
DiskDataset
-
make_pytorch_dataset
(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]¶ Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.
- Parameters
epochs (int, default 1) – The number of times to iterate over the Dataset
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
- Returns
torch.utils.data.IterableDataset that iterates over the data in this dataset.
- Return type
torch.utils.data.IterableDataset
Note
This method requires PyTorch to be installed.
-
static
from_numpy
(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None, tasks: Optional[Sequence] = None, data_dir: Optional[str] = None) → deepchem.data.datasets.DiskDataset[source]¶ Creates a DiskDataset object from specified Numpy arrays.
- Parameters
X (np.ndarray) – Feature array.
y (np.ndarray, optional (default None)) – Labels array.
w (np.ndarray, optional (default None)) – Weights array.
ids (np.ndarray, optional (default None)) – Identifiers array.
tasks (Sequence, optional (default None)) – Tasks in this dataset
data_dir (str, optional (default None)) – The directory to write this dataset to. If none is specified, will use a temporary directory instead.
- Returns
A new DiskDataset constructed from the provided information.
- Return type
DiskDataset
-
static
merge
(datasets: Iterable[deepchem.data.datasets.Dataset], merge_dir: Optional[str] = None) → deepchem.data.datasets.DiskDataset[source]¶ Merges provided datasets into a merged dataset.
- Parameters
datasets (Iterable[Dataset]) – List of datasets to merge.
merge_dir (str, optional (default None)) – The new directory path to store the merged DiskDataset.
- Returns
A merged DiskDataset.
- Return type
DiskDataset
-
subset
(shard_nums: Sequence[int], subset_dir: Optional[str] = None) → deepchem.data.datasets.DiskDataset[source]¶ Creates a subset of the original dataset on disk.
- Parameters
shard_nums (Sequence[int]) – The indices of the shards to extract from the original DiskDataset.
subset_dir (str, optional (default None)) – The new directory path to store the subset DiskDataset.
- Returns
A subset DiskDataset.
- Return type
DiskDataset
-
sparse_shuffle
() → None[source]¶ Shuffling that exploits data sparsity to shuffle large datasets.
If feature vectors are sparse, say circular fingerprints or any other representation that contains few nonzero values, it can be possible to exploit the sparsity of the vector to simplify shuffles. This method implements a sparse shuffle by compressing sparse feature vectors down into a compressed representation, then shuffles this compressed dataset in memory and writes the results to disk.
Note
This method only works for 1-dimensional feature vectors (does not work for tensorial featurizations). Note that this shuffle is performed in place.
-
complete_shuffle
(data_dir: Optional[str] = None) → deepchem.data.datasets.Dataset[source]¶ Completely shuffle across all data, across all shards.
Note
The algorithm used for this complete shuffle is O(N^2) where N is the number of shards. It simply constructs each shard of the output dataset one at a time. Since the complete shuffle can take a long time, it’s useful to watch the logging output. Each shuffled shard is constructed using select() which logs as it selects from each original shard. This will result in O(N^2) logging statements, one for each extraction of shuffled shard i’s contributions from original shard j.
- Parameters
data_dir (Optional[str], (default None)) – Directory to write the shuffled dataset to. If none is specified a temporary directory will be used.
- Returns
A DiskDataset whose data is a randomly shuffled version of this dataset.
- Return type
DiskDataset
-
shuffle_each_shard
(shard_basenames: Optional[List[str]] = None) → None[source]¶ Shuffles elements within each shard of the dataset.
- Parameters
shard_basenames (List[str], optional (default None)) – The basenames for each shard. If this isn’t specified, will assume the basenames of form “shard-i” used by create_dataset and reshard.
-
get_shard
(i: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray][source]¶ Retrieves data for the i-th shard from disk.
- Parameters
i (int) – Shard index for shard to retrieve batch from.
- Returns
The batch data for the i-th shard.
- Return type
Batch
-
get_shard_ids
(i: int) → numpy.ndarray[source]¶ Retrieves the list of IDs for the i-th shard from disk.
- Parameters
i (int) – Shard index for shard to retrieve ids from.
- Returns
A numpy array of ids for i-th shard.
- Return type
np.ndarray
-
get_shard_y
(i: int) → numpy.ndarray[source]¶ Retrieves the labels for the i-th shard from disk.
- Parameters
i (int) – Shard index for shard to retrieve labels from.
- Returns
A numpy array of labels for i-th shard.
- Return type
np.ndarray
-
get_shard_w
(i: int) → numpy.ndarray[source]¶ Retrieves the weights for the i-th shard from disk.
- Parameters
i (int) – Shard index for shard to retrieve weights from.
- Returns
A numpy array of weights for i-th shard.
- Return type
np.ndarray
-
add_shard
(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → None[source]¶ Adds a data shard.
- Parameters
X (np.ndarray) – Feature array.
y (np.ndarray, optional (default None)) – Labels array.
w (np.ndarray, optional (default None)) – Weights array.
ids (np.ndarray, optional (default None)) – Identifiers array.
-
set_shard
(shard_num: int, X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → None[source]¶ Writes data shard to disk.
- Parameters
shard_num (int) – Shard index for shard to set new data.
X (np.ndarray) – Feature array.
y (np.ndarray, optional (default None)) – Labels array.
w (np.ndarray, optional (default None)) – Weights array.
ids (np.ndarray, optional (default None)) – Identifiers array.
-
select
(indices: Sequence[int], select_dir: Optional[str] = None, select_shard_size: Optional[int] = None, output_numpy_dataset: Optional[bool] = False) → deepchem.data.datasets.Dataset[source]¶ Creates a new dataset from a selection of indices from self.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> selected = dataset.select([1, 3, 4])
>>> len(selected)
3
- Parameters
indices (Sequence) – List of indices to select.
select_dir (str, optional (default None)) – Path to new directory that the selected indices will be copied to.
select_shard_size (Optional[int], (default None)) – If specified, the shard size to use for the output selected DiskDataset. If output_numpy_dataset is False and this is not specified, it defaults to the current dataset’s shard size.
output_numpy_dataset (Optional[bool], (default False)) – If True, output an in-memory NumpyDataset instead of a DiskDataset. Note that select_dir and select_shard_size must be None if this is True.
- Returns
A dataset containing the selected samples. The default dataset is DiskDataset. If output_numpy_dataset is True, the dataset is NumpyDataset.
- Return type
Dataset
-
property
memory_cache_size
[source]¶ Get the size of the memory cache for this dataset, measured in bytes.
-
get_shape
() → Tuple[Tuple[int, …], Tuple[int, …], Tuple[int, …], Tuple[int, …]][source]¶ Finds shape of dataset.
Returns four tuples, giving the shape of the X, y, w, and ids arrays.
-
static
from_dataframe
(df: pandas.core.frame.DataFrame, X: Optional[Union[str, Sequence[str]]] = None, y: Optional[Union[str, Sequence[str]]] = None, w: Optional[Union[str, Sequence[str]]] = None, ids: Optional[str] = None)[source]¶ Construct a Dataset from the contents of a pandas DataFrame.
- Parameters
df (pd.DataFrame) – The pandas DataFrame
X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
-
get_statistics
(X_stats: bool = True, y_stats: bool = True) → Tuple[float, …][source]¶ Compute and return statistics of this dataset.
Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.
- Parameters
X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
- Returns
If X_stats == True, returns (X_means, X_stds).
If y_stats == True, returns (y_means, y_stds).
If both are true, returns (X_means, X_stds, y_means, y_stds).
- Return type
Tuple
-
make_tf_dataset
(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]¶ Create a tf.data.Dataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.
- Parameters
batch_size (int, default 100) – The number of samples to include in each batch.
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
- Returns
TensorFlow Dataset that iterates over the same data.
- Return type
tf.data.Dataset
Note
This class requires TensorFlow to be installed.
-
to_dataframe
() → pandas.core.frame.DataFrame[source]¶ Construct a pandas DataFrame containing the data from this Dataset.
- Returns
Pandas dataframe. If there is only a single feature per datapoint, it will have column “X”; otherwise it will have columns “X1,X2,…” for features. If there is only a single label per datapoint, it will have column “y”; otherwise it will have columns “y1,y2,…” for labels. If there is only a single weight per datapoint, it will have column “w”; otherwise it will have columns “w1,w2,…”. It will have column “ids” for identifiers.
- Return type
pd.DataFrame
-
ImageDataset¶
The dc.data.ImageDataset class is optimized to allow for convenient processing of image-based datasets.
-
class
ImageDataset
(X: Union[numpy.ndarray, List[str]], y: Optional[Union[numpy.ndarray, List[str]]], w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None)[source]¶ A Dataset that loads data from image files on disk.
-
__init__
(X: Union[numpy.ndarray, List[str]], y: Optional[Union[numpy.ndarray, List[str]]], w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → None[source]¶ Create a dataset whose X and/or y array is defined by image files on disk.
- Parameters
X (np.ndarray or List[str]) – The dataset’s input data. This may be either a single NumPy array directly containing the data, or a list containing the paths to the image files
y (np.ndarray or List[str]) – The dataset’s labels. This may be either a single NumPy array directly containing the data, or a list containing the paths to the image files
w (np.ndarray, optional (default None)) – a 1D or 2D array containing the weights for each sample or sample/task pair
ids (np.ndarray, optional (default None)) – the sample IDs
-
get_shape
() → Tuple[Tuple[int, …], Tuple[int, …], Tuple[int, …], Tuple[int, …]][source]¶ Get the shape of the dataset.
Returns four tuples, giving the shape of the X, y, w, and ids arrays.
-
iterbatches
(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over minibatches from the dataset.
Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).
- Parameters
batch_size (int, optional (default None)) – Number of elements in each batch.
epochs (int, default 1) – Number of epochs to walk over dataset.
deterministic (bool, default False) – If True, follow deterministic order.
pad_batches (bool, default False) – If True, pad each batch to batch_size.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
-
itersamples
() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over the samples in the dataset.
- Returns
Iterator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
-
transform
(transformer: transformers.Transformer, **args) → deepchem.data.datasets.NumpyDataset[source]¶ Construct a new dataset by applying a transformation to every sample in this dataset.
The argument is a function that can be called as follows:
>> newx, newy, neww = fn(x, y, w)
It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.
- Parameters
transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset
- Returns
A newly constructed NumpyDataset object
- Return type
NumpyDataset
-
select
(indices: Sequence[int], select_dir: Optional[str] = None) → deepchem.data.datasets.ImageDataset[source]¶ Creates a new dataset from a selection of indices from self.
- Parameters
indices (Sequence) – List of indices to select.
select_dir (str, optional (default None)) – Used to provide same API as DiskDataset. Ignored since ImageDataset is purely in-memory.
- Returns
A selected ImageDataset object
- Return type
ImageDataset
-
make_pytorch_dataset
(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]¶ Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.
- Parameters
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
- Returns
torch.utils.data.IterableDataset that iterates over the data in this dataset.
- Return type
torch.utils.data.IterableDataset
Note
This method requires PyTorch to be installed.
-
static
from_dataframe
(df: pandas.core.frame.DataFrame, X: Optional[Union[str, Sequence[str]]] = None, y: Optional[Union[str, Sequence[str]]] = None, w: Optional[Union[str, Sequence[str]]] = None, ids: Optional[str] = None)[source]¶ Construct a Dataset from the contents of a pandas DataFrame.
- Parameters
df (pd.DataFrame) – The pandas DataFrame
X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
-
get_statistics
(X_stats: bool = True, y_stats: bool = True) → Tuple[float, …][source]¶ Compute and return statistics of this dataset.
Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.
- Parameters
X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
- Returns
If X_stats == True, returns (X_means, X_stds).
If y_stats == True, returns (y_means, y_stds).
If both are true, returns (X_means, X_stds, y_means, y_stds).
- Return type
Tuple
-
make_tf_dataset
(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]¶ Create a tf.data.Dataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.
- Parameters
batch_size (int, default 100) – The number of samples to include in each batch.
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
- Returns
TensorFlow Dataset that iterates over the same data.
- Return type
tf.data.Dataset
Note
This class requires TensorFlow to be installed.
-
to_dataframe
() → pandas.core.frame.DataFrame[source]¶ Construct a pandas DataFrame containing the data from this Dataset.
- Returns
Pandas dataframe. If there is only a single feature per datapoint, it will have column “X”; otherwise it will have columns “X1,X2,…” for features. If there is only a single label per datapoint, it will have column “y”; otherwise it will have columns “y1,y2,…” for labels. If there is only a single weight per datapoint, it will have column “w”; otherwise it will have columns “w1,w2,…”. It will have column “ids” for identifiers.
- Return type
pd.DataFrame
-
Data Loaders¶
Processing large amounts of input data to construct a dc.data.Dataset object can require some amount of hacking. To simplify this process, you can use the dc.data.DataLoader classes. These classes provide utilities to load and process large amounts of data.
CSVLoader¶
-
class
CSVLoader
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Creates Dataset objects from input CSV files.
This class provides conveniences to load data from CSV files. It’s possible to directly featurize data from CSV files using pandas, but this class may prove useful if you’re processing large CSV files that you don’t want to manipulate directly in memory.
Examples
Let’s suppose we have some smiles and labels
>>> smiles = ["C", "CCC"] >>> labels = [1.5, 2.3]
Let’s put these in a dataframe.
>>> import pandas as pd
>>> df = pd.DataFrame(list(zip(smiles, labels)), columns=["smiles", "task1"])
Let’s now write this to disk somewhere. We can now use CSVLoader to process this CSV dataset.
>>> import tempfile
>>> import deepchem as dc
>>> with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
...     df.to_csv(tmpfile.name)
...     loader = dc.data.CSVLoader(["task1"], feature_field="smiles",
...                                featurizer=dc.feat.CircularFingerprint())
...     dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2
Of course in practice you should already have your data in a CSV file if you’re using CSVLoader. If your data is already in memory, use InMemoryLoader instead.
-
__init__
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Initializes CSVLoader.
- Parameters
tasks (List[str]) – List of task names
featurizer (Featurizer) – Featurizer to use to process data.
feature_field (str, optional (default None)) – Field with data to be featurized.
id_field (str, optional, (default None)) – CSV column that holds sample identifier
smiles_field (str, optional (default None) (DEPRECATED)) – Name of field that holds smiles string.
log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
-
create_dataset
(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]¶ Creates and returns a Dataset object by featurizing provided files.
Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.
This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A DiskDataset object containing a featurized representation of data from inputs.
- Return type
DiskDataset
-
UserCSVLoader¶
-
class
UserCSVLoader
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Handles loading of CSV files with user-defined features.
This is a convenience class that allows for descriptors already present in a CSV file to be extracted without any featurization necessary.
Examples
Let’s suppose we have some descriptors and labels. (Imagine that these descriptors have been computed by an external program.)
>>> desc1 = [1, 43]
>>> desc2 = [-2, -22]
>>> labels = [1.5, 2.3]
>>> ids = ["cp1", "cp2"]
Let’s put these in a dataframe.
>>> import pandas as pd
>>> df = pd.DataFrame(list(zip(ids, desc1, desc2, labels)), columns=["id", "desc1", "desc2", "task1"])
Let’s now write this to disk somewhere. We can now use UserCSVLoader to process this CSV dataset.
>>> import tempfile
>>> import deepchem as dc
>>> featurizer = dc.feat.UserDefinedFeaturizer(["desc1", "desc2"])
>>> with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
...     df.to_csv(tmpfile.name)
...     loader = dc.data.UserCSVLoader(["task1"], id_field="id",
...                                    featurizer=featurizer)
...     dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2
>>> dataset.X[0, 0]
1
The difference between UserCSVLoader and CSVLoader is that our descriptors (our features) have already been computed for us, but are spread across multiple columns of the CSV file.
Of course in practice you should already have your data in a CSV file if you’re using UserCSVLoader. If your data is already in memory, use InMemoryLoader instead.
-
__init__
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Initializes UserCSVLoader.
- Parameters
tasks (List[str]) – List of task names
featurizer (Featurizer) – Featurizer to use to process data.
feature_field (str, optional (default None)) – Field with data to be featurized.
id_field (str, optional, (default None)) – CSV column that holds sample identifier
smiles_field (str, optional (default None) (DEPRECATED)) – Name of field that holds smiles string.
log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
-
create_dataset
(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]¶ Creates and returns a Dataset object by featurizing provided files.
Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.
This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A DiskDataset object containing a featurized representation of data from inputs.
- Return type
DiskDataset
-
ImageLoader¶
-
class
ImageLoader
(tasks: Optional[List[str]] = None)[source]¶ Handles loading of image files.
This class allows for loading of images in various formats. For user convenience, also accepts zip-files and directories of images and uses some limited intelligence to attempt to traverse subdirectories which contain images.
-
__init__
(tasks: Optional[List[str]] = None)[source]¶ Initialize image loader.
At present, custom image featurizers aren’t supported by this loader class.
- Parameters
tasks (List[str], optional (default None)) – List of task names for image labels.
-
create_dataset
(inputs: Union[str, Sequence[str], Tuple[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192, in_memory: bool = False) → deepchem.data.datasets.Dataset[source]¶ Creates and returns a Dataset object by featurizing provided image files and labels/weights.
- Parameters
inputs (Union[OneOrMany[str], Tuple[Any]]) – The inputs provided should be one of the following:
- filename
- list of filenames
- Tuple (list of filenames, labels)
- Tuple (list of filenames, labels, weights)
Each file in a given list of filenames should either be of a supported image format (.png, .tif only for now) or of a compressed folder of image files (only .zip for now). If labels or weights are provided, they must correspond to the sorted order of all filenames provided, with one label/weight per file.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Shard size when loading data.
in_memory (bool, optional (default False)) – If True, return an in-memory NumpyDataset. Otherwise, return an ImageDataset.
- Returns
if in_memory == False, the return value is ImageDataset.
if in_memory == True and data_dir is None, the return value is NumpyDataset.
if in_memory == True and data_dir is not None, the return value is DiskDataset.
- Return type
ImageDataset, NumpyDataset, or DiskDataset
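As a sketch (the archive name here is hypothetical; any .zip of .png/.tif files would do):
>> import deepchem as dc
>> loader = dc.data.ImageLoader()
>> dataset = loader.create_dataset("images.zip")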
-
JsonLoader¶
JSON is a flexible file format that is human-readable, lightweight, and more compact than other open standard formats like XML. JSON files are similar to Python dictionaries of key-value pairs. All keys must be strings, but values can be any of (string, number, object, array, boolean, or null), so the format is more flexible than CSV. JSON is used for describing structured data and to serialize objects. It is conveniently used to read/write pandas DataFrames with the pandas.read_json and pandas.DataFrame.to_json methods.
-
class
JsonLoader
(tasks: List[str], feature_field: str, featurizer: deepchem.feat.base_classes.Featurizer, label_field: Optional[str] = None, weight_field: Optional[str] = None, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Creates Dataset objects from input json files.
This class provides conveniences to load data from json files. It’s possible to directly featurize data from json files using pandas, but this class may prove useful if you’re processing large json files that you don’t want to manipulate directly in memory.
It is meant to load JSON files formatted as “records” in line-delimited format, which allows for sharding:
list like [{column -> value}, ... , {column -> value}]
Examples
Let’s create the sample dataframe.
>>> composition = ["LiCoO2", "MnO2"]
>>> labels = [1.5, 2.3]
>>> import pandas as pd
>>> df = pd.DataFrame(list(zip(composition, labels)), columns=["composition", "task"])
Dump the dataframe to a JSON file formatted as “records” in line-delimited format, then load the JSON file with JsonLoader.
>>> import tempfile
>>> import deepchem as dc
>>> with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
...     df.to_json(tmpfile.name, orient='records', lines=True)
...     featurizer = dc.feat.ElementPropertyFingerprint()
...     loader = dc.data.JsonLoader(["task"], feature_field="composition", featurizer=featurizer)
...     dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2
-
__init__
(tasks: List[str], feature_field: str, featurizer: deepchem.feat.base_classes.Featurizer, label_field: Optional[str] = None, weight_field: Optional[str] = None, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Initializes JsonLoader.
- Parameters
tasks (List[str]) – List of task names
feature_field (str) – JSON field with data to be featurized.
featurizer (Featurizer) – Featurizer to use to process data
label_field (str, optional (default None)) – Field with target variables.
weight_field (str, optional (default None)) – Field with weights.
id_field (str, optional (default None)) – Field for identifying samples.
log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
-
create_dataset
(input_files: Union[str, Sequence[str]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.DiskDataset[source]¶ Creates a Dataset from input JSON files.
- Parameters
input_files (OneOrMany[str]) – List of JSON filenames.
data_dir (Optional[str], default None) – Name of directory where featurized data is stored.
shard_size (int, optional (default 8192)) – Shard size when loading data.
- Returns
A DiskDataset object containing a featurized representation of data from input_files.
- Return type
DiskDataset
-
SDFLoader¶
-
class
SDFLoader
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, sanitize: bool = False, log_every_n: int = 1000)[source]¶ Creates a Dataset object from SDF input files.
This class provides conveniences to load and featurize data from Structure Data Files (SDFs). SDF is a standard format for structural information (3D coordinates of atoms and bonds) of molecular compounds.
Examples
>>> import deepchem as dc
>>> import os
>>> current_dir = os.path.dirname(os.path.realpath(__file__))
>>> featurizer = dc.feat.CircularFingerprint(size=16)
>>> loader = dc.data.SDFLoader(["LogP(RRCK)"], featurizer=featurizer, sanitize=True)
>>> dataset = loader.create_dataset(os.path.join(current_dir, "tests", "membrane_permeability.sdf"))
>>> len(dataset)
2
-
__init__
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, sanitize: bool = False, log_every_n: int = 1000)[source]¶ Initialize SDF Loader
- Parameters
tasks (list[str]) – List of task names. These will be loaded from the SDF file.
featurizer (Featurizer) – Featurizer to use to process data
sanitize (bool, optional (default False)) – Whether to sanitize molecules.
log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
-
create_dataset
(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]¶ Creates and returns a Dataset object by featurizing provided files.
Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.
This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A DiskDataset object containing a featurized representation of data from inputs.
- Return type
DiskDataset
-
FASTALoader¶
-
class
FASTALoader
[source]¶ Handles loading of FASTA files.
FASTA files are commonly used to hold sequence data. This class provides convenience methods to load FASTA data and one-hot encode the genomic sequences for use in downstream learning tasks.
-
create_dataset
(input_files: Union[str, Sequence[str]], data_dir: Optional[str] = None, shard_size: Optional[int] = None) → deepchem.data.datasets.DiskDataset[source]¶ Creates a Dataset from input FASTA files.
At present, FASTA support is limited and only allows for one-hot featurization, and doesn’t allow for sharding.
- Parameters
input_files (List[str]) – List of fasta files.
data_dir (str, optional (default None)) – Name of directory where featurized data is stored.
shard_size (int, optional (default None)) – For now, this argument is ignored and each FASTA file gets its own shard.
- Returns
A DiskDataset object containing a featurized representation of data from input_files.
- Return type
DiskDataset
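As a sketch (the filename here is hypothetical; the sequences in it are one-hot encoded):
>> import deepchem as dc
>> loader = dc.data.FASTALoader()
>> dataset = loader.create_dataset("sequences.fasta")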
-
InMemoryLoader¶
The dc.data.InMemoryLoader is designed to facilitate the processing of large datasets where you already hold the raw data in memory (say, in a pandas DataFrame).
-
class
InMemoryLoader
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Facilitates featurization of in-memory objects.
When featurizing a dataset, it’s often the case that the initial set of data (pre-featurization) fits handily within memory. (For example, perhaps it fits within a column of a pandas DataFrame.) In this case, it would be convenient to directly be able to featurize this column of data. However, the process of featurization often generates large arrays which quickly eat up available memory. This class provides convenient capabilities to process such in-memory data by checkpointing generated features periodically to disk.
Example
Here’s an example with only datapoints and no labels or weights.
>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(smiles, shard_size=2)
>>> len(dataset)
4
Here’s an example with both datapoints and labels
>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> labels = [1, 0, 1, 0]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(zip(smiles, labels), shard_size=2)
>>> len(dataset)
4
Here’s an example with datapoints, labels, weights and ids all provided.
>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> labels = [1, 0, 1, 0]
>>> weights = [1.5, 0, 1.5, 0]
>>> ids = ["C", "CC", "CCC", "CCCC"]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(zip(smiles, labels, weights, ids), shard_size=2)
>>> len(dataset)
4
-
__init__
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Construct a DataLoader object.
This constructor is mainly provided as a template. As a user, you shouldn’t ever call it directly.
- Parameters
tasks (List[str]) – List of task names
featurizer (Featurizer) – Featurizer to use to process data.
id_field (str, optional (default None)) – Name of field that holds sample identifier. Note that the meaning of “field” depends on the input data type and can have a different meaning in different subclasses. For example, a CSV file could have a field as a column, and an SDF file could have a field as molecular property.
log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
-
create_dataset
(inputs: Sequence[Any], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.DiskDataset[source]¶ Creates and returns a Dataset object by featurizing the provided inputs.
Reads in inputs and uses self.featurizer to featurize them. For large collections of inputs, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.
This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas DataFrame. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (Sequence[Any]) – List of inputs to process. Entries can be arbitrary objects so long as they are understood by self.featurizer.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A DiskDataset object containing a featurized representation of data from inputs.
- Return type
DiskDataset
Data Classes¶
DeepChem featurizers often transform datapoints into “data classes”: classes
that hold all the information needed to train a model on that datapoint.
Models then convert these data classes into training tensors in their
default_generator
methods.
Graph Data¶
These classes document the data classes for graph convolutions.
We plan to simplify these classes (ConvMol
, MultiConvMol
, WeaveMol
)
into a joint data representation (GraphData
) for all graph convolutions in a future version of DeepChem,
so these APIs may not remain stable.
Graph convolution models that inherit from KerasModel depend on ConvMol, MultiConvMol, or WeaveMol, while models that inherit from TorchModel depend on GraphData.
-
class
ConvMol
(atom_features, adj_list, max_deg=10, min_deg=0)[source]¶ Holds information about a molecule.
Internally re-sorts atoms to be in order of increasing degree. Note that only heavy atoms (hydrogens excluded) are considered here.
-
__init__
(atom_features, adj_list, max_deg=10, min_deg=0)[source]¶ - Parameters
atom_features (np.ndarray) – Has shape (n_atoms, n_feat)
adj_list (list) – List of length n_atoms, with neighbor indices of each atom.
max_deg (int, optional) – Maximum degree of any atom.
min_deg (int, optional) – Minimum degree of any atom.
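Examples
A minimal sketch of constructing a ConvMol by hand (the feature values here are arbitrary placeholders):
>>> import numpy as np
>>> from deepchem.feat.mol_graphs import ConvMol
>>> # Three atoms with four features each; atom 0 is bonded to atoms 1 and 2.
>>> atom_features = np.random.rand(3, 4)
>>> adj_list = [[1, 2], [0], [0]]
>>> mol = ConvMol(atom_features, adj_list)
>>> mol.get_atom_features().shape
(3, 4)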
-
get_atom_features
()[source]¶ Returns canonicalized version of atom features.
Features are sorted by atom degree, with the original order maintained when degrees are equal.
-
get_adjacency_list
()[source]¶ Returns a canonicalized adjacency list.
Canonicalized means that the atoms are re-ordered by degree.
- Returns
Canonicalized form of adjacency list.
- Return type
list
-
get_deg_adjacency_lists
()[source]¶ Returns adjacency lists grouped by atom degree.
- Returns
Has length (max_deg+1-min_deg). The element at position deg is itself a list of the neighbor-lists for atoms with degree deg.
- Return type
list
-
get_deg_slice
()[source]¶ Returns degree-slice tensor.
The deg_slice tensor allows indexing into a flattened version of the molecule’s atoms. Assume atoms are sorted in order of degree. Then deg_slice[deg][0] is the starting position for atoms of degree deg in the flattened list, and deg_slice[deg][1] is the number of atoms with degree deg.
Note deg_slice has shape (max_deg+1-min_deg, 2).
- Returns
deg_slice – Shape (max_deg+1-min_deg, 2)
- Return type
np.ndarray
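For intuition, a small worked illustration (hypothetical values, assuming min_deg=0 and max_deg=2 for brevity): if the sorted molecule consists of two atoms of degree 1 followed by one atom of degree 2, the deg_slice tensor would be
>>> import numpy as np
>>> deg_slice = np.array([[0, 0],   # degree 0: starts at index 0, contains 0 atoms
...                       [0, 2],   # degree 1: starts at index 0, contains 2 atoms
...                       [2, 1]])  # degree 2: starts at index 2, contains 1 atom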
-
static
get_null_mol
(n_feat, max_deg=10, min_deg=0)[source]¶ Constructs a null molecule.
Returns one molecule with one atom of each degree, with every atom connected to itself and carrying n_feat features.
- Parameters
n_feat (int) – Number of features for the nodes in the null molecule.
-
static
agglomerate_mols
(mols, max_deg=10, min_deg=0)[source]¶ Concatenates a list of ConvMol objects into one mol object that can be used to feed into TensorFlow placeholders. The indexing of the molecules is preserved during the combination, but the indexing of the atoms changes substantially.
- Parameters
mols (list) – ConvMol objects to be combined into one molecule.
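Examples
A brief sketch of combining two small ConvMol objects (feature values are arbitrary placeholders):
>>> import numpy as np
>>> from deepchem.feat.mol_graphs import ConvMol
>>> feats = np.random.rand(3, 4)
>>> adj = [[1, 2], [0], [0]]
>>> combined = ConvMol.agglomerate_mols([ConvMol(feats, adj), ConvMol(feats, adj)])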
-
-
class
MultiConvMol
(nodes, deg_adj_lists, deg_slice, membership, num_mols)[source]¶ Holds information about multiple molecules, for use in feeding information into TensorFlow. Generated by the agglomerate_mols function.
-
class
WeaveMol
(nodes, pairs, pair_edges)[source]¶ Molecular featurization object for weave convolutions.
These objects are produced by WeaveFeaturizer and feed into WeaveModel. The underlying implementation is inspired by [1].
References
- [1]
Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of Computer-Aided Molecular Design 30.8 (2016): 595-608.
-
class
GraphData
(node_features: numpy.ndarray, edge_index: numpy.ndarray, edge_features: Optional[numpy.ndarray] = None, node_pos_features: Optional[numpy.ndarray] = None)[source]¶ GraphData class
This data class is almost the same as torch_geometric.data.Data.
-
node_features
[source]¶ Node feature matrix with shape [num_nodes, num_node_features]
- Type
np.ndarray
-
edge_index
[source]¶ Graph connectivity in COO format with shape [2, num_edges]
- Type
np.ndarray, dtype int
-
edge_features
[source]¶ Edge feature matrix with shape [num_edges, num_edge_features]
- Type
np.ndarray, optional (default None)
-
node_pos_features
[source]¶ Node position matrix with shape [num_nodes, num_dimensions].
- Type
np.ndarray, optional (default None)
-
num_edges_features
[source]¶ The number of features per edge in the graph
- Type
int, optional (default None)
Examples
>>> import numpy as np
>>> node_features = np.random.rand(5, 10)
>>> edge_index = np.array([[0, 1, 2, 3, 4], [1, 2, 3, 4, 0]], dtype=np.int64)
>>> graph = GraphData(node_features=node_features, edge_index=edge_index)
-
__init__
(node_features: numpy.ndarray, edge_index: numpy.ndarray, edge_features: Optional[numpy.ndarray] = None, node_pos_features: Optional[numpy.ndarray] = None)[source]¶ - Parameters
node_features (np.ndarray) – Node feature matrix with shape [num_nodes, num_node_features]
edge_index (np.ndarray, dtype int) – Graph connectivity in COO format with shape [2, num_edges]
edge_features (np.ndarray, optional (default None)) – Edge feature matrix with shape [num_edges, num_edge_features]
node_pos_features (np.ndarray, optional (default None)) – Node position matrix with shape [num_nodes, num_dimensions].
-
Base Classes (for develop)¶
Dataset¶
The dc.data.Dataset
class is the abstract parent class for all
datasets. This class should never be instantiated directly, but it
contains a number of useful method implementations.
-
class
Dataset
[source]¶ Abstract base class for datasets defined by X, y, w elements.
Dataset objects are used to store representations of a dataset as used in a machine learning task. Datasets contain features X, labels y, weights w and identifiers ids. Different subclasses of Dataset may choose to hold X, y, w, ids in memory or on disk.
The Dataset class attempts to provide strong interoperability with other machine learning dataset representations. Interconversion methods allow Dataset objects to be converted to and from numpy arrays, pandas DataFrames, TensorFlow datasets, and PyTorch datasets (for PyTorch, currently only to, not from).
Note that you can never instantiate a Dataset object directly. Instead you will need to instantiate one of the concrete subclasses.
-
__len__
() → int[source]¶ Get the number of elements in the dataset.
- Returns
The number of elements in the dataset.
- Return type
int
-
get_shape
() → Tuple[Tuple[int, …], Tuple[int, …], Tuple[int, …], Tuple[int, …]][source]¶ Get the shape of the dataset.
Returns four tuples, giving the shape of the X, y, w, and ids arrays.
- Returns
The tuple contains four elements, which are the shapes of the X, y, w, and ids arrays.
- Return type
Tuple
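Examples
A short sketch using an in-memory dataset; the w shape shown assumes the default per-task weights that NumpyDataset fills in:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(5, 3), y=np.random.rand(5, 1))
>>> dataset.get_shape()
((5, 3), (5, 1), (5, 1), (5,))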
-
property
X
[source]¶ Get the X vector for this dataset as a single numpy array.
- Returns
A numpy array of features X.
- Return type
np.ndarray
Note
If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.
-
property
y
[source]¶ Get the y vector for this dataset as a single numpy array.
- Returns
A numpy array of labels y.
- Return type
np.ndarray
Note
If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.
-
property
ids
[source]¶ Get the ids vector for this dataset as a single numpy array.
- Returns
A numpy array of identifiers ids.
- Return type
np.ndarray
Note
If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.
-
property
w
[source]¶ Get the weight vector for this dataset as a single numpy array.
- Returns
A numpy array of weights w.
- Return type
np.ndarray
Note
If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.
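Examples
A quick illustration of the X and w properties using an in-memory dataset (NumpyDataset fills in default labels, weights, and ids when they are not supplied):
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(4, 2))
>>> dataset.X.shape
(4, 2)
>>> dataset.w.shape
(4, 1)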
-
iterbatches
(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over minibatches from the dataset.
Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).
- Parameters
batch_size (int, optional (default None)) – Number of elements in each batch.
epochs (int, optional (default 1)) – Number of epochs to walk over dataset.
deterministic (bool, optional (default False)) – If True, follow deterministic order.
pad_batches (bool, optional (default False)) – If True, pad each batch to batch_size.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
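Examples
For instance, iterating over ten samples in padded batches of four yields three equally sized batches (the last is padded up from two samples):
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> for X_b, y_b, w_b, ids_b in dataset.iterbatches(batch_size=4, pad_batches=True):
...     print(X_b.shape)
(4, 3)
(4, 3)
(4, 3)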
-
itersamples
() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over the samples in the dataset.
Examples
>>> dataset = NumpyDataset(np.ones((2,2)))
>>> for x, y, w, id in dataset.itersamples():
...     print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [0.0] [0.0] 0
[1.0, 1.0] [0.0] [0.0] 1
-
transform
(transformer: transformers.Transformer, **args) → deepchem.data.datasets.Dataset[source]¶ Construct a new dataset by applying a transformation to every sample in this dataset.
The argument is a function that can be called as follows:
>> newx, newy, neww = fn(x, y, w)
It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.
- Parameters
transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset.
- Returns
A newly constructed Dataset object.
- Return type
Dataset
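Examples
A brief sketch using a normalization transformer; NormalizationTransformer is one concrete dc.trans.Transformer, and any other Transformer could stand in for it here:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4, 1))
>>> transformer = dc.trans.NormalizationTransformer(transform_y=True, dataset=dataset)
>>> transformed = dataset.transform(transformer)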
-
select
(indices: Sequence[int], select_dir: Optional[str] = None) → deepchem.data.datasets.Dataset[source]¶ Creates a new dataset from a selection of indices from self.
- Parameters
indices (Sequence) – List of indices to select.
select_dir (str, optional (default None)) – Path to new directory that the selected indices will be copied to.
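Examples
For example, selecting three samples out of ten from an in-memory dataset:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.arange(10).reshape(10, 1))
>>> selected = dataset.select([0, 2, 4])
>>> len(selected)
3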
-
get_statistics
(X_stats: bool = True, y_stats: bool = True) → Tuple[float, …][source]¶ Compute and return statistics of this dataset.
Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.
- Parameters
X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
- Returns
If X_stats == True, returns (X_means, X_stds).
If y_stats == True, returns (y_means, y_stds).
If both are true, returns (X_means, X_stds, y_means, y_stds).
- Return type
Tuple
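Examples
A short sketch; with both flags left at their defaults, the four statistics come back in the order documented above:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> X_means, X_stds, y_means, y_stds = dataset.get_statistics()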
-
make_tf_dataset
(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]¶ Create a tf.data.Dataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.
- Parameters
batch_size (int, default 100) – The number of samples to include in each batch.
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
- Returns
TensorFlow Dataset that iterates over the same data.
- Return type
tf.data.Dataset
Note
This class requires TensorFlow to be installed.
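Examples
A minimal sketch, assuming TensorFlow is installed; each value yielded by the resulting tf.data.Dataset is an (X, y, w) batch as described above:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> tf_ds = dataset.make_tf_dataset(batch_size=5, epochs=1)
>>> for X_b, y_b, w_b in tf_ds:
...     print(X_b.shape)
(5, 3)
(5, 3)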
-
make_pytorch_dataset
(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]¶ Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.
- Parameters
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
- Returns
torch.utils.data.IterableDataset that iterates over the data in this dataset.
- Return type
torch.utils.data.IterableDataset
Note
This class requires PyTorch to be installed.
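Examples
A minimal sketch, assuming PyTorch is installed; with batch_size set, each yielded value is an (X, y, w, id) batch:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> torch_ds = dataset.make_pytorch_dataset(epochs=1, batch_size=5)
>>> len(list(torch_ds))
2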
-
to_dataframe
() → pandas.core.frame.DataFrame[source]¶ Construct a pandas DataFrame containing the data from this Dataset.
- Returns
Pandas DataFrame. If there is only a single feature per datapoint, the DataFrame will have a column “X”; otherwise it will have columns “X1”, “X2”, … for the features. If there is only a single label per datapoint, it will have a column “y”; otherwise columns “y1”, “y2”, … for the labels. If there is only a single weight per datapoint, it will have a column “w”; otherwise columns “w1”, “w2”, …. It will always have a column “ids” for the identifiers.
- Return type
pd.DataFrame
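Examples
A short sketch; with two features and a single label per datapoint, the columns follow the naming scheme described above (shown sorted for a stable doctest):
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(3, 2), y=np.random.rand(3, 1))
>>> df = dataset.to_dataframe()
>>> sorted(df.columns)
['X1', 'X2', 'ids', 'w', 'y']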
-
static
from_dataframe
(df: pandas.core.frame.DataFrame, X: Optional[Union[str, Sequence[str]]] = None, y: Optional[Union[str, Sequence[str]]] = None, w: Optional[Union[str, Sequence[str]]] = None, ids: Optional[str] = None)[source]¶ Construct a Dataset from the contents of a pandas DataFrame.
- Parameters
df (pd.DataFrame) – The pandas DataFrame
X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
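Examples
A round-trip sketch paired with to_dataframe(); leaving the column arguments as None lets the default column names be detected automatically:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(3, 2), y=np.random.rand(3, 1))
>>> df = dataset.to_dataframe()
>>> restored = dc.data.NumpyDataset.from_dataframe(df)
>>> len(restored)
3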
-
DataLoader¶
The dc.data.DataLoader
class is the abstract parent class for all
dataloaders. This class should never be instantiated directly, but it
contains a number of useful method implementations.
-
class
DataLoader
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Handles loading/featurizing of data from disk.
The main use of DataLoader and its child classes is to make it easier to load large datasets into Dataset objects.
DataLoader is an abstract superclass that provides a general framework for loading data into DeepChem. This class should never be instantiated directly. To load your own type of data, make a subclass of DataLoader and provide your own implementation for the create_dataset() method.
To construct a Dataset from input data, first instantiate a concrete data loader (that is, an object which is an instance of a subclass of DataLoader) with a given Featurizer object. Then call the data loader’s create_dataset() method on a list of input files that hold the source data to process. Note that each subclass of DataLoader is specialized to handle one type of input data so you will have to pick the loader class suitable for your input data type.
Note that it isn’t necessary to use a data loader to process input data. You can directly use Featurizer objects to featurize provided input into numpy arrays, but note that this calculation will be performed in memory, so you will have to write generators that walk the source files and write featurized data to disk yourself. DataLoader and its subclasses make this process easier for you by performing this work under the hood.
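Examples
As an illustration, a minimal hypothetical subclass (ListLoader is not part of DeepChem) might featurize a list of inputs in one pass and return an in-memory dataset; real subclasses typically shard their work to disk as described above:
>>> import deepchem as dc
>>> class ListLoader(dc.data.DataLoader):  # hypothetical minimal subclass
...     def create_dataset(self, inputs, data_dir=None, shard_size=8192):
...         # Featurize everything in memory; a production loader would shard.
...         features = self.featurizer.featurize(inputs)
...         return dc.data.NumpyDataset(X=features)
>>> loader = ListLoader(tasks=["task1"], featurizer=dc.feat.CircularFingerprint())
>>> dataset = loader.create_dataset(["C", "CC", "CCC"])
>>> len(dataset)
3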
-
__init__
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Construct a DataLoader object.
This constructor is provided mainly as a template. As a user, you should never call it directly.
- Parameters
tasks (List[str]) – List of task names
featurizer (Featurizer) – Featurizer to use to process data.
id_field (str, optional (default None)) – Name of the field that holds the sample identifier. Note that the meaning of “field” depends on the input data type and can differ between subclasses: for example, a field could be a column of a CSV file or a molecular property of an SDF file.
log_every_n (int, optional (default 1000)) – Writes a logging statement every log_every_n items processed.
-
featurize
(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]¶ Featurize provided files and write to specified location.
DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.
For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
data_dir (str, default None) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A Dataset object containing a featurized representation of data from inputs.
- Return type
Dataset
-
create_dataset
(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]¶ Creates and returns a Dataset object by featurizing provided files.
Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.
This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A DiskDataset object containing a featurized representation of data from inputs.
- Return type
DiskDataset