Data¶
DeepChem dc.data provides APIs for handling your data.
If your data is stored in files such as CSV or SDF, you can use the Data Loaders. The Data Loaders read your data, convert it to features (for example, SMILES strings to ECFP fingerprints), and save the features in a Dataset class. If your data consists of Python objects like NumPy arrays or pandas DataFrames, you can use the Datasets directly.
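For example, an in-memory NumPy array can be wrapped in a Dataset directly (a minimal sketch; the array contents are arbitrary):
>>> import numpy as np
>>> import deepchem as dc
>>> X = np.random.rand(4, 3)
>>> y = np.random.rand(4,)
>>> dataset = dc.data.NumpyDataset(X=X, y=y)
>>> len(dataset)
4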
Datasets¶
DeepChem dc.data.Dataset objects are one of the core building blocks of DeepChem programs. Dataset objects hold representations of data for machine learning and are widely used throughout DeepChem. The goal of the Dataset class is to be maximally interoperable with other common representations of machine learning datasets. For this reason we provide interconversion methods mapping from Dataset objects to pandas DataFrames, TensorFlow Datasets, and PyTorch datasets.
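For example, a Dataset can be round-tripped through a pandas DataFrame (a minimal sketch):
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(3, 2), y=np.random.rand(3,))
>>> df = dataset.to_dataframe()
>>> dataset2 = dc.data.NumpyDataset.from_dataframe(df)
>>> len(dataset2)
3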
NumpyDataset¶
The dc.data.NumpyDataset class provides an in-memory implementation of the abstract Dataset which stores its data in numpy.ndarray objects.
-
class
NumpyDataset
(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None, n_tasks: int = 1)[source]¶ A Dataset defined by in-memory numpy arrays.
This subclass of Dataset stores arrays X,y,w,ids in memory as numpy arrays. This makes it very easy to construct NumpyDataset objects.
Examples
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(5, 3), y=np.random.rand(5,), ids=np.arange(5))
-
__init__
(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None, n_tasks: int = 1) → None[source]¶ Initialize this object.
- Parameters
X (np.ndarray) – Input features. A numpy array of shape (n_samples,…).
y (np.ndarray, optional (default None)) – Labels. A numpy array of shape (n_samples, …). Note that each label can have an arbitrary shape.
w (np.ndarray, optional (default None)) – Weights. Should either be 1D array of shape (n_samples,) or if there’s more than one task, of shape (n_samples, n_tasks).
ids (np.ndarray, optional (default None)) – Identifiers. A numpy array of shape (n_samples,)
n_tasks (int, default 1) – Number of learning tasks.
-
get_shape
() → Tuple[Tuple[int, …], Tuple[int, …], Tuple[int, …], Tuple[int, …]][source]¶ Get the shape of the dataset.
Returns four tuples, giving the shape of the X, y, w, and ids arrays.
-
iterbatches
(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over minibatches from the dataset.
Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).
- Parameters
batch_size (int, optional (default None)) – Number of elements in each batch.
epochs (int, default 1) – Number of epochs to walk over dataset.
deterministic (bool, optional (default False)) – If True, follow deterministic order.
pad_batches (bool, optional (default False)) – If True, pad each batch to batch_size.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
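For example, a minimal sketch of minibatch iteration (array shapes chosen for illustration):
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4,))
>>> for X_b, y_b, w_b, ids_b in dataset.iterbatches(batch_size=2, deterministic=True):
...     print(X_b.shape)
(2, 3)
(2, 3)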
-
itersamples
() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over the samples in the dataset.
- Returns
Iterator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
Examples
>>> dataset = NumpyDataset(np.ones((2,2)))
>>> for x, y, w, id in dataset.itersamples():
...     print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [0.0] [0.0] 0
[1.0, 1.0] [0.0] [0.0] 1
-
transform
(transformer: transformers.Transformer, **args) → deepchem.data.datasets.NumpyDataset[source]¶ Construct a new dataset by applying a transformation to every sample in this dataset.
The argument is a function that can be called as follows:
>> newx, newy, neww = fn(x, y, w)
It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.
- Parameters
transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset
- Returns
A newly constructed NumpyDataset object
- Return type
NumpyDataset
-
select
(indices: Sequence[int], select_dir: Optional[str] = None) → deepchem.data.datasets.NumpyDataset[source]¶ Creates a new dataset from a selection of indices from self.
- Parameters
indices (List[int]) – List of indices to select.
select_dir (str, optional (default None)) – Used to provide same API as DiskDataset. Ignored since NumpyDataset is purely in-memory.
- Returns
A selected NumpyDataset object
- Return type
NumpyDataset
-
make_pytorch_dataset
(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]¶ Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.
- Parameters
epochs (int, default 1) – The number of times to iterate over the Dataset
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
- Returns
torch.utils.data.IterableDataset that iterates over the data in this dataset.
- Return type
torch.utils.data.IterableDataset
Note
This method requires PyTorch to be installed.
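As a sketch (assuming PyTorch is installed), the returned IterableDataset can be consumed like any Python iterable:
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4,))
>>> torch_ds = dataset.make_pytorch_dataset(epochs=1, batch_size=2)
>>> for X_b, y_b, w_b, ids_b in torch_ds:
...     pass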
-
static
from_DiskDataset
(ds: deepchem.data.datasets.DiskDataset) → deepchem.data.datasets.NumpyDataset[source]¶ Convert DiskDataset to NumpyDataset.
- Parameters
ds (DiskDataset) – DiskDataset to transform to NumpyDataset.
- Returns
A new NumpyDataset created from DiskDataset.
- Return type
NumpyDataset
-
to_json
(self, fname: str) → None[source]¶ Dump this NumpyDataset to a JSON file.
- Parameters
fname (str) – The name of the json file.
-
static
from_json
(fname: str) → deepchem.data.datasets.NumpyDataset[source]¶ Create NumpyDataset from the json file.
- Parameters
fname (str) – The name of the json file.
- Returns
A new NumpyDataset created from the json file.
- Return type
NumpyDataset
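A sketch of the JSON round trip (the filename here is arbitrary):
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(3, 2), y=np.random.rand(3,))
>>> dataset.to_json('dataset.json')
>>> restored = NumpyDataset.from_json('dataset.json')
>>> len(restored)
3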
-
static
merge
(datasets: Sequence[deepchem.data.datasets.Dataset]) → deepchem.data.datasets.NumpyDataset[source]¶ Merge multiple NumpyDatasets.
- Parameters
datasets (List[Dataset]) – List of datasets to merge.
- Returns
A single NumpyDataset containing all the samples from all datasets.
- Return type
NumpyDataset
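For example, merging two in-memory datasets (a minimal sketch):
>>> import numpy as np
>>> d1 = NumpyDataset(X=np.random.rand(3, 2))
>>> d2 = NumpyDataset(X=np.random.rand(2, 2))
>>> merged = NumpyDataset.merge([d1, d2])
>>> len(merged)
5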
-
static
from_dataframe
(df: pandas.core.frame.DataFrame, X: Optional[Union[str, Sequence[str]]] = None, y: Optional[Union[str, Sequence[str]]] = None, w: Optional[Union[str, Sequence[str]]] = None, ids: Optional[str] = None)[source]¶ Construct a Dataset from the contents of a pandas DataFrame.
- Parameters
df (pd.DataFrame) – The pandas DataFrame
X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
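A minimal sketch with explicit column names (the column names here are arbitrary):
>>> import pandas as pd
>>> df = pd.DataFrame({'feat': [1.0, 2.0], 'label': [0.0, 1.0]})
>>> dataset = NumpyDataset.from_dataframe(df, X='feat', y='label')
>>> len(dataset)
2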
-
get_statistics
(X_stats: bool = True, y_stats: bool = True) → Tuple[float, …][source]¶ Compute and return statistics of this dataset.
Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.
- Parameters
X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
- Returns
If X_stats == True, returns (X_means, X_stds).
If y_stats == True, returns (y_means, y_stds).
If both are true, returns (X_means, X_stds, y_means, y_stds).
- Return type
Tuple
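For instance, with both flags left at their default of True, the call returns a 4-tuple (a minimal sketch):
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10,))
>>> X_means, X_stds, y_means, y_stds = dataset.get_statistics()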
-
make_tf_dataset
(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]¶ Create a tf.data.Dataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.
- Parameters
batch_size (int, default 100) – The number of samples to include in each batch.
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
- Returns
TensorFlow Dataset that iterates over the same data.
- Return type
tf.data.Dataset
Note
This class requires TensorFlow to be installed.
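As a sketch (assuming TensorFlow is installed; shapes chosen for illustration), the returned dataset yields (X, y, w) batches:
>>> import numpy as np
>>> dataset = NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4,))
>>> tf_ds = dataset.make_tf_dataset(batch_size=2, epochs=1)
>>> for X_b, y_b, w_b in tf_ds:
...     print(X_b.shape)
(2, 3)
(2, 3)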
-
to_dataframe
() → pandas.core.frame.DataFrame[source]¶ Construct a pandas DataFrame containing the data from this Dataset.
- Returns
Pandas dataframe. If there is only a single feature per datapoint, it will have column “X”; otherwise it will have columns “X1,X2,…” for features. If there is only a single label per datapoint, it will have column “y”; otherwise it will have columns “y1,y2,…” for labels. If there is only a single weight per datapoint, it will have column “w”; otherwise it will have columns “w1,w2,…”. It will have column “ids” for identifiers.
- Return type
pd.DataFrame
-
DiskDataset¶
The dc.data.DiskDataset class allows for the storage of larger datasets on disk. Each DiskDataset is associated with a directory in which it writes its contents to disk. Note that a DiskDataset can be very large, so some of the utility methods to access fields of a Dataset can be prohibitively expensive.
-
class
DiskDataset
(data_dir: str)[source]¶ A Dataset that is stored as a set of files on disk.
The DiskDataset is the workhorse class of DeepChem that facilitates analyses on large datasets. Use this class whenever you’re working with a large dataset that can’t be easily manipulated in RAM.
On disk, a DiskDataset has a simple structure. All files for a given DiskDataset are stored in a data_dir. The contents of data_dir should be laid out as follows:
data_dir/
  metadata.csv.gzip
  tasks.json
  shard-0-X.npy
  shard-0-y.npy
  shard-0-w.npy
  shard-0-ids.npy
  ...
The metadata is constructed by static method DiskDataset._construct_metadata and saved to disk by DiskDataset._save_metadata. The metadata itself consists of a csv file which has columns (‘ids’, ‘X’, ‘y’, ‘w’, ‘ids_shape’, ‘X_shape’, ‘y_shape’, ‘w_shape’). tasks.json consists of a list of task names for this dataset.
The actual data is stored in .npy files (numpy array files) of the form ‘shard-0-X.npy’, ‘shard-0-y.npy’, etc.
The basic structure of DiskDataset is quite robust and will likely serve you well for datasets up to roughly 100 GB. However, note that DiskDataset has not been tested for very large datasets at the terabyte range and beyond; you may be better served by implementing a custom Dataset class for those use cases.
Examples
Let’s walk through a simple example of constructing a new DiskDataset.
>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
If you have already saved a DiskDataset to data_dir, you can reinitialize it with
>> data_dir = "/path/to/my/data"
>> dataset = dc.data.DiskDataset(data_dir)
Once you have a dataset you can access its attributes as follows
>>> X = np.random.rand(10, 10)
>>> y = np.random.rand(10,)
>>> w = np.ones_like(y)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> X, y, w = dataset.X, dataset.y, dataset.w
One thing to beware of is that dataset.X, dataset.y, and dataset.w load data from disk! If you have a large dataset, these operations can be extremely slow, so iterate through the dataset instead.
>>> for (xi, yi, wi, idi) in dataset.itersamples():
...     pass
Note
DiskDataset originally had a simpler metadata format without shape information. Older DiskDataset objects had metadata files with columns (‘ids’, ‘X’, ‘y’, ‘w’) and no additional shape columns. DiskDataset maintains backwards compatibility with this older metadata format, but for performance reasons we recommend not using legacy metadata for new projects.
-
__init__
(data_dir: str) → None[source]¶ Load a constructed DiskDataset from disk
Note that this method cannot construct a new disk dataset. Instead use static methods DiskDataset.create_dataset or DiskDataset.from_numpy for that purpose. Use this constructor instead to load a DiskDataset that has already been created on disk.
- Parameters
data_dir (str) – Location on disk of an existing DiskDataset.
-
static
create_dataset
(shard_generator: Iterable[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]], data_dir: Optional[str] = None, tasks: Optional[Sequence] = []) → deepchem.data.datasets.DiskDataset[source]¶ Creates a new DiskDataset
- Parameters
shard_generator (Iterable[Batch]) – An iterable (either a list or generator) that provides tuples of data (X, y, w, ids). Each tuple will be written to a separate shard on disk.
data_dir (str, optional (default None)) – Filename for data directory. Creates a temp directory if none specified.
tasks (Sequence, optional (default [])) – List of tasks for this dataset.
- Returns
A new DiskDataset constructed from the given data
- Return type
DiskDataset
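A sketch of building a DiskDataset from a generator of shards (shapes and contents are arbitrary):
>>> import numpy as np
>>> import deepchem as dc
>>> def shard_generator():
...     for i in range(2):
...         X = np.random.rand(5, 3)
...         y = np.random.rand(5, 1)
...         w = np.ones((5, 1))
...         ids = np.arange(i * 5, (i + 1) * 5)
...         yield (X, y, w, ids)
>>> dataset = dc.data.DiskDataset.create_dataset(shard_generator(), tasks=["task1"])
>>> len(dataset)
10
>>> dataset.get_number_shards()
2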
-
load_metadata
() → Tuple[List[str], pandas.core.frame.DataFrame][source]¶ Helper method that loads metadata from disk.
-
static
write_data_to_disk
(data_dir: str, basename: str, tasks: numpy.ndarray, X: Optional[numpy.ndarray] = None, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → List[Optional[str]][source]¶ Static helper method to write data to disk.
This helper method is used to write a shard of data to disk.
- Parameters
data_dir (str) – Data directory to write shard to.
basename (str) – Basename for the shard in question.
tasks (np.ndarray) – The names of the tasks in question.
X (np.ndarray, optional (default None)) – The features array.
y (np.ndarray, optional (default None)) – The labels array.
w (np.ndarray, optional (default None)) – The weights array.
ids (np.ndarray, optional (default None)) – The identifiers array.
- Returns
List with values [out_ids, out_X, out_y, out_w, out_ids_shape, out_X_shape, out_y_shape, out_w_shape] giving the filenames of the locations on disk to which these respective arrays were written.
- Return type
List[Optional[str]]
-
move
(new_data_dir: str, delete_if_exists: Optional[bool] = True) → None[source]¶ Moves dataset to new directory.
- Parameters
new_data_dir (str) – The new directory name to move this dataset to.
delete_if_exists (bool, optional (default True)) – If this option is set, delete the destination directory if it exists before moving. This is set to True by default to be backwards compatible with behavior in earlier versions of DeepChem.
Note
This is a stateful operation! self.data_dir will be moved into new_data_dir. If delete_if_exists is set to True (by default this is set True), then new_data_dir is deleted if it’s a pre-existing directory.
-
copy
(new_data_dir: str) → deepchem.data.datasets.DiskDataset[source]¶ Copies dataset to new directory.
- Parameters
new_data_dir (str) – The new directory name to copy this dataset to.
- Returns
A copied DiskDataset object.
- Return type
DiskDataset
Note
This is a stateful operation! Any data at new_data_dir will be deleted and self.data_dir will be deep copied into new_data_dir.
-
reshard
(shard_size: int) → None[source]¶ Reshards data to have specified shard size.
- Parameters
shard_size (int) – The size of each shard.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(100, 10)
>>> d = dc.data.DiskDataset.from_numpy(X)
>>> d.reshard(shard_size=10)
>>> d.get_number_shards()
10
Note
If this DiskDataset is in legacy_metadata format, reshard will convert this dataset to have non-legacy metadata.
-
itershards
() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Return an object that iterates over all shards in dataset.
Datasets are stored in sharded fashion on disk. Each call to next() for the generator defined by this function returns the data from a particular shard. The order of shards returned is guaranteed to remain fixed.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
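A sketch of shard-wise iteration, which avoids loading the full dataset into memory at once:
>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(100, 10)
>>> d = dc.data.DiskDataset.from_numpy(X)
>>> d.reshard(shard_size=25)
>>> for X_s, y_s, w_s, ids_s in d.itershards():
...     print(X_s.shape)
(25, 10)
(25, 10)
(25, 10)
(25, 10)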
-
iterbatches
(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over minibatches from the dataset.
It is guaranteed that the number of batches returned is math.ceil(len(dataset)/batch_size). Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).
- Parameters
batch_size (int, optional (default None)) – Number of elements in a batch. If None, then it yields batches with size equal to the size of each individual shard.
epochs (int, default 1) – Number of epochs to walk over dataset.
deterministic (bool, default False) – If False, shuffle each shard before generating batches; if True, keep the on-disk order. Note that this shuffling is only local: data is never mixed between different shards.
pad_batches (bool, default False) – Whether or not we should pad the last batch, globally, such that it has exactly batch_size elements.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
-
itersamples
() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over the samples in the dataset.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
Examples
>>> dataset = DiskDataset.from_numpy(np.ones((2,2)), np.ones((2,1)))
>>> for x, y, w, id in dataset.itersamples():
...     print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [1.0] [1.0] 0
[1.0, 1.0] [1.0] [1.0] 1
-
transform
(transformer: transformers.Transformer, parallel: bool = False, out_dir: Optional[str] = None, **args) → deepchem.data.datasets.DiskDataset[source]¶ Construct a new dataset by applying a transformation to every sample in this dataset.
The argument is a function that can be called as follows:
>> newx, newy, neww = fn(x, y, w)
It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.
- Parameters
transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset.
parallel (bool, default False) – If True, use multiple processes to transform the dataset in parallel.
out_dir (str, optional (default None)) – The directory to save the new dataset in. If this is omitted, a temporary directory is created automatically.
- Returns
A newly constructed Dataset object
- Return type
DiskDataset
-
make_pytorch_dataset
(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]¶ Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.
- Parameters
epochs (int, default 1) – The number of times to iterate over the Dataset
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
- Returns
torch.utils.data.IterableDataset that iterates over the data in this dataset.
- Return type
torch.utils.data.IterableDataset
Note
This method requires PyTorch to be installed.
-
static
from_numpy
(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None, tasks: Optional[Sequence] = None, data_dir: Optional[str] = None) → deepchem.data.datasets.DiskDataset[source]¶ Creates a DiskDataset object from specified Numpy arrays.
- Parameters
X (np.ndarray) – Feature array.
y (np.ndarray, optional (default None)) – Labels array.
w (np.ndarray, optional (default None)) – Weights array.
ids (np.ndarray, optional (default None)) – Identifiers array.
tasks (Sequence, optional (default None)) – Tasks in this dataset
data_dir (str, optional (default None)) – The directory to write this dataset to. If none is specified, will use a temporary directory instead.
- Returns
A new DiskDataset constructed from the provided information.
- Return type
DiskDataset
-
static
merge
(datasets: Iterable[deepchem.data.datasets.Dataset], merge_dir: Optional[str] = None) → deepchem.data.datasets.DiskDataset[source]¶ Merges provided datasets into a merged dataset.
- Parameters
datasets (Iterable[Dataset]) – List of datasets to merge.
merge_dir (str, optional (default None)) – The new directory path to store the merged DiskDataset.
- Returns
A merged DiskDataset.
- Return type
DiskDataset
-
subset
(shard_nums: Sequence[int], subset_dir: Optional[str] = None) → deepchem.data.datasets.DiskDataset[source]¶ Creates a subset of the original dataset on disk.
- Parameters
shard_nums (Sequence[int]) – The indices of the shards to extract from the original DiskDataset.
subset_dir (str, optional (default None)) – The new directory path to store the subset DiskDataset.
- Returns
A subset DiskDataset.
- Return type
DiskDataset
-
sparse_shuffle
() → None[source]¶ Shuffling that exploits data sparsity to shuffle large datasets.
If feature vectors are sparse, say circular fingerprints or any other representation that contains few nonzero values, it can be possible to exploit the sparsity of the vector to simplify shuffles. This method implements a sparse shuffle by compressing sparse feature vectors down into a compressed representation, then shuffles this compressed dataset in memory and writes the results to disk.
Note
This method only works for 1-dimensional feature vectors (does not work for tensorial featurizations). Note that this shuffle is performed in place.
-
complete_shuffle
(data_dir: Optional[str] = None) → deepchem.data.datasets.Dataset[source]¶ Completely shuffle across all data, across all shards.
Note
The algorithm used for this complete shuffle is O(N^2) where N is the number of shards. It simply constructs each shard of the output dataset one at a time. Since the complete shuffle can take a long time, it’s useful to watch the logging output. Each shuffled shard is constructed using select() which logs as it selects from each original shard. This will result in O(N^2) logging statements, one for each extraction of shuffled shard i’s contributions from original shard j.
- Parameters
data_dir (Optional[str], (default None)) – Directory to write the shuffled dataset to. If none is specified a temporary directory will be used.
- Returns
A DiskDataset whose data is a randomly shuffled version of this dataset.
- Return type
DiskDataset
-
shuffle_each_shard
(shard_basenames: Optional[List[str]] = None) → None[source]¶ Shuffles elements within each shard of the dataset.
- Parameters
shard_basenames (List[str], optional (default None)) – The basenames for each shard. If this isn’t specified, will assume the basenames of form “shard-i” used by create_dataset and reshard.
-
get_shard
(i: int) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray][source]¶ Retrieves data for the i-th shard from disk.
- Parameters
i (int) – Shard index for shard to retrieve batch from.
- Returns
The batch data for the i-th shard.
- Return type
Batch
-
get_shard_ids
(i: int) → numpy.ndarray[source]¶ Retrieves the list of IDs for the i-th shard from disk.
- Parameters
i (int) – Shard index for shard to retrieve ids from.
- Returns
A numpy array of ids for i-th shard.
- Return type
np.ndarray
-
get_shard_y
(i: int) → numpy.ndarray[source]¶ Retrieves the labels for the i-th shard from disk.
- Parameters
i (int) – Shard index for shard to retrieve labels from.
- Returns
A numpy array of labels for i-th shard.
- Return type
np.ndarray
-
get_shard_w
(i: int) → numpy.ndarray[source]¶ Retrieves the weights for the i-th shard from disk.
- Parameters
i (int) – Shard index for shard to retrieve weights from.
- Returns
A numpy array of weights for i-th shard.
- Return type
np.ndarray
-
add_shard
(X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → None[source]¶ Adds a data shard.
- Parameters
X (np.ndarray) – Feature array.
y (np.ndarray, optional (default None)) – Labels array.
w (np.ndarray, optional (default None)) – Weights array.
ids (np.ndarray, optional (default None)) – Identifiers array.
-
set_shard
(shard_num: int, X: numpy.ndarray, y: Optional[numpy.ndarray] = None, w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → None[source]¶ Writes data shard to disk.
- Parameters
shard_num (int) – Shard index for shard to set new data.
X (np.ndarray) – Feature array.
y (np.ndarray, optional (default None)) – Labels array.
w (np.ndarray, optional (default None)) – Weights array.
ids (np.ndarray, optional (default None)) – Identifiers array.
-
select
(indices: Sequence[int], select_dir: Optional[str] = None, select_shard_size: Optional[int] = None, output_numpy_dataset: Optional[bool] = False) → deepchem.data.datasets.Dataset[source]¶ Creates a new dataset from a selection of indices from self.
Examples
>>> import deepchem as dc
>>> import numpy as np
>>> X = np.random.rand(10, 10)
>>> dataset = dc.data.DiskDataset.from_numpy(X)
>>> selected = dataset.select([1, 3, 4])
>>> len(selected)
3
- Parameters
indices (Sequence) – List of indices to select.
select_dir (str, optional (default None)) – Path to new directory that the selected indices will be copied to.
select_shard_size (Optional[int], (default None)) – If specified, the shard size to use for the output selected DiskDataset. If output_numpy_dataset is False and this is not specified, it defaults to the current dataset’s shard size.
output_numpy_dataset (Optional[bool], (default False)) – If True, output an in-memory NumpyDataset instead of a DiskDataset. Note that select_dir and select_shard_size must be None if this is True.
- Returns
A dataset containing the selected samples. The default dataset is DiskDataset. If output_numpy_dataset is True, the dataset is NumpyDataset.
- Return type
Dataset
-
property
memory_cache_size
[source]¶ Get the size of the memory cache for this dataset, measured in bytes.
-
get_shape
() → Tuple[Tuple[int, …], Tuple[int, …], Tuple[int, …], Tuple[int, …]][source]¶ Finds shape of dataset.
Returns four tuples, giving the shape of the X, y, w, and ids arrays.
-
static
from_dataframe
(df: pandas.core.frame.DataFrame, X: Optional[Union[str, Sequence[str]]] = None, y: Optional[Union[str, Sequence[str]]] = None, w: Optional[Union[str, Sequence[str]]] = None, ids: Optional[str] = None)[source]¶ Construct a Dataset from the contents of a pandas DataFrame.
- Parameters
df (pd.DataFrame) – The pandas DataFrame
X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
-
get_statistics
(X_stats: bool = True, y_stats: bool = True) → Tuple[float, …][source]¶ Compute and return statistics of this dataset.
Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.
- Parameters
X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
- Returns
If X_stats == True, returns (X_means, X_stds).
If y_stats == True, returns (y_means, y_stds).
If both are true, returns (X_means, X_stds, y_means, y_stds).
- Return type
Tuple
-
make_tf_dataset
(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]¶ Create a tf.data.Dataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.
- Parameters
batch_size (int, default 100) – The number of samples to include in each batch.
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
- Returns
TensorFlow Dataset that iterates over the same data.
- Return type
tf.data.Dataset
Note
This class requires TensorFlow to be installed.
-
to_dataframe
() → pandas.core.frame.DataFrame[source]¶ Construct a pandas DataFrame containing the data from this Dataset.
- Returns
Pandas dataframe. If there is only a single feature per datapoint, it will have column “X”; otherwise it will have columns “X1,X2,…” for features. If there is only a single label per datapoint, it will have column “y”; otherwise it will have columns “y1,y2,…” for labels. If there is only a single weight per datapoint, it will have column “w”; otherwise it will have columns “w1,w2,…”. It will have column “ids” for identifiers.
- Return type
pd.DataFrame
-
ImageDataset¶
The dc.data.ImageDataset class is optimized to allow for convenient processing of image-based datasets.
-
class
ImageDataset
(X: Union[numpy.ndarray, List[str]], y: Optional[Union[numpy.ndarray, List[str]]], w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None)[source]¶ A Dataset that loads data from image files on disk.
-
__init__
(X: Union[numpy.ndarray, List[str]], y: Optional[Union[numpy.ndarray, List[str]]], w: Optional[numpy.ndarray] = None, ids: Optional[numpy.ndarray] = None) → None[source]¶ Create a dataset whose X and/or y array is defined by image files on disk.
- Parameters
X (np.ndarray or List[str]) – The dataset’s input data. This may be either a single NumPy array directly containing the data, or a list containing the paths to the image files
y (np.ndarray or List[str]) – The dataset’s labels. This may be either a single NumPy array directly containing the data, or a list containing the paths to the image files
w (np.ndarray, optional (default None)) – a 1D or 2D array containing the weights for each sample or sample/task pair
ids (np.ndarray, optional (default None)) – the sample IDs
-
get_shape
() → Tuple[Tuple[int, …], Tuple[int, …], Tuple[int, …], Tuple[int, …]][source]¶ Get the shape of the dataset.
Returns four tuples, giving the shape of the X, y, w, and ids arrays.
-
iterbatches
(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over minibatches from the dataset.
Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).
- Parameters
batch_size (int, optional (default None)) – Number of elements in each batch.
epochs (int, default 1) – Number of epochs to walk over dataset.
deterministic (bool, default False) – If True, follow deterministic order.
pad_batches (bool, default False) – If True, pad each batch to batch_size.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
-
itersamples
() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over the samples in the dataset.
- Returns
Iterator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
-
transform
(transformer: transformers.Transformer, **args) → deepchem.data.datasets.NumpyDataset[source]¶ Construct a new dataset by applying a transformation to every sample in this dataset.
The argument is a function that can be called as follows:
>> newx, newy, neww = fn(x, y, w)
It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.
- Parameters
transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset
- Returns
A newly constructed NumpyDataset object
- Return type
NumpyDataset
-
select
(indices: Sequence[int], select_dir: Optional[str] = None) → deepchem.data.datasets.ImageDataset[source]¶ Creates a new dataset from a selection of indices from self.
- Parameters
indices (Sequence) – List of indices to select.
select_dir (str, optional (default None)) – Used to provide same API as DiskDataset. Ignored since ImageDataset is purely in-memory.
- Returns
A selected ImageDataset object
- Return type
ImageDataset
-
make_pytorch_dataset
(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]¶ Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.
- Parameters
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
- Returns
torch.utils.data.IterableDataset that iterates over the data in this dataset.
- Return type
torch.utils.data.IterableDataset
Note
This method requires PyTorch to be installed.
-
static
from_dataframe
(df: pandas.core.frame.DataFrame, X: Optional[Union[str, Sequence[str]]] = None, y: Optional[Union[str, Sequence[str]]] = None, w: Optional[Union[str, Sequence[str]]] = None, ids: Optional[str] = None)[source]¶ Construct a Dataset from the contents of a pandas DataFrame.
- Parameters
df (pd.DataFrame) – The pandas DataFrame
X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
-
get_statistics
(X_stats: bool = True, y_stats: bool = True) → Tuple[float, …][source]¶ Compute and return statistics of this dataset.
Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.
- Parameters
X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
- Returns
If X_stats == True, returns (X_means, X_stds).
If y_stats == True, returns (y_means, y_stds).
If both are true, returns (X_means, X_stds, y_means, y_stds).
- Return type
Tuple
-
make_tf_dataset
(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]¶ Create a tf.data.Dataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.
- Parameters
batch_size (int, default 100) – The number of samples to include in each batch.
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
- Returns
TensorFlow Dataset that iterates over the same data.
- Return type
tf.data.Dataset
Note
This class requires TensorFlow to be installed.
-
to_dataframe
() → pandas.core.frame.DataFrame[source]¶ Construct a pandas DataFrame containing the data from this Dataset.
- Returns
Pandas dataframe. If there is only a single feature per datapoint, it will have column “X”; otherwise it will have columns “X1,X2,…” for features. If there is only a single label per datapoint, it will have column “y”; otherwise it will have columns “y1,y2,…” for labels. If there is only a single weight per datapoint, it will have column “w”; otherwise it will have columns “w1,w2,…”. It will have column “ids” for identifiers.
- Return type
pd.DataFrame
-
Data Loaders¶
Processing large amounts of input data to construct a dc.data.Dataset object can require some amount of hacking. To simplify this process, you can use the dc.data.DataLoader classes. These classes provide utilities to load and process large amounts of data.
CSVLoader¶
-
class
CSVLoader
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Creates Dataset objects from input CSV files.
This class provides conveniences to load data from CSV files. It’s possible to directly featurize data from CSV files using pandas, but this class may prove useful if you’re processing large CSV files that you don’t want to manipulate directly in memory.
Examples
Let’s suppose we have some smiles and labels
>>> smiles = ["C", "CCC"] >>> labels = [1.5, 2.3]
Let’s put these in a dataframe.
>>> import pandas as pd
>>> df = pd.DataFrame(list(zip(smiles, labels)), columns=["smiles", "task1"])
Let’s now write this to disk somewhere. We can now use CSVLoader to process this CSV dataset.
>>> import tempfile
>>> import deepchem as dc
>>> with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
...     df.to_csv(tmpfile.name)
...     loader = dc.data.CSVLoader(["task1"], feature_field="smiles",
...                                featurizer=dc.feat.CircularFingerprint())
...     dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2
Of course in practice you should already have your data in a CSV file if you’re using CSVLoader. If your data is already in memory, use InMemoryLoader instead.
-
__init__
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Initializes CSVLoader.
- Parameters
tasks (List[str]) – List of task names
featurizer (Featurizer) – Featurizer to use to process data.
feature_field (str, optional (default None)) – Field with data to be featurized.
id_field (str, optional, (default None)) – CSV column that holds sample identifier
smiles_field (str, optional (default None) (DEPRECATED)) – Name of field that holds smiles string.
log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
-
create_dataset
(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]¶ Creates and returns a Dataset object by featurizing provided files.
Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.
This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A DiskDataset object containing a featurized representation of data from inputs.
- Return type
DiskDataset
-
UserCSVLoader¶
-
class
UserCSVLoader
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Handles loading of CSV files with user-defined features.
This is a convenience class that allows for descriptors already present in a CSV file to be extracted without any featurization necessary.
Examples
Let’s suppose we have some descriptors and labels. (Imagine that these descriptors have been computed by an external program.)
>>> desc1 = [1, 43]
>>> desc2 = [-2, -22]
>>> labels = [1.5, 2.3]
>>> ids = ["cp1", "cp2"]
Let’s put these in a dataframe.
>>> import pandas as pd
>>> df = pd.DataFrame(list(zip(ids, desc1, desc2, labels)), columns=["id", "desc1", "desc2", "task1"])
Let’s now write this to disk somewhere. We can now use UserCSVLoader to process this CSV dataset.
>>> import tempfile
>>> import deepchem as dc
>>> featurizer = dc.feat.UserDefinedFeaturizer(["desc1", "desc2"])
>>> with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
...     df.to_csv(tmpfile.name)
...     loader = dc.data.UserCSVLoader(["task1"], id_field="id",
...                                    featurizer=featurizer)
...     dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2
>>> dataset.X[0, 0]
1
The difference between UserCSVLoader and CSVLoader is that our descriptors (our features) have already been computed for us, but are spread across multiple columns of the CSV file.
Of course in practice you should already have your data in a CSV file if you’re using UserCSVLoader. If your data is already in memory, use InMemoryLoader instead.
-
__init__
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, feature_field: Optional[str] = None, id_field: Optional[str] = None, smiles_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Initializes UserCSVLoader.
- Parameters
tasks (List[str]) – List of task names
featurizer (Featurizer) – Featurizer to use to process data.
feature_field (str, optional (default None)) – Field with data to be featurized.
id_field (str, optional, (default None)) – CSV column that holds sample identifier
smiles_field (str, optional (default None) (DEPRECATED)) – Name of field that holds smiles string.
log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
-
create_dataset
(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]¶ Creates and returns a Dataset object by featurizing provided files.
Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.
This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A DiskDataset object containing a featurized representation of data from inputs.
- Return type
DiskDataset
-
ImageLoader¶
-
class
ImageLoader
(tasks: Optional[List[str]] = None)[source]¶ Handles loading of image files.
This class allows for loading of images in various formats. For user convenience, also accepts zip-files and directories of images and uses some limited intelligence to attempt to traverse subdirectories which contain images.
-
__init__
(tasks: Optional[List[str]] = None)[source]¶ Initialize image loader.
At present, custom image featurizers aren’t supported by this loader class.
- Parameters
tasks (List[str], optional (default None)) – List of task names for image labels.
-
create_dataset
(inputs: Union[str, Sequence[str], Tuple[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192, in_memory: bool = False) → deepchem.data.datasets.Dataset[source]¶ Creates and returns a Dataset object by featurizing provided image files and labels/weights.
- Parameters
inputs (Union[OneOrMany[str], Tuple[Any]]) – The inputs provided should be one of the following:
- filename
- list of filenames
- Tuple (list of filenames, labels)
- Tuple (list of filenames, labels, weights)
Each file in a given list of filenames should either be of a supported image format (.png, .tif only for now) or of a compressed folder of image files (only .zip for now). If labels or weights are provided, they must correspond to the sorted order of all filenames provided, with one label/weight per file.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Shard size when loading data.
in_memory (bool, optional (default False)) – If True, return an in-memory NumpyDataset. Otherwise, return an ImageDataset.
- Returns
if in_memory == False, the return value is ImageDataset.
if in_memory == True and data_dir is None, the return value is NumpyDataset.
if in_memory == True and data_dir is not None, the return value is DiskDataset.
- Return type
ImageDataset, NumpyDataset, or DiskDataset
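As a sketch (the archive name here is hypothetical; any .zip of .png/.tif files would do):
>> import deepchem as dc
>> loader = dc.data.ImageLoader()
>> dataset = loader.create_dataset("images.zip")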
-
JsonLoader¶
JSON is a flexible file format that is human-readable, lightweight, and more compact than other open standard formats like XML. JSON files are similar to Python dictionaries of key-value pairs. All keys must be strings, but values can be any of (string, number, object, array, boolean, or null), so the format is more flexible than CSV. JSON is used for describing structured data and to serialize objects. It is conveniently used to read/write pandas DataFrames with the pandas.read_json and pandas.DataFrame.to_json methods.
-
class
JsonLoader
(tasks: List[str], feature_field: str, featurizer: deepchem.feat.base_classes.Featurizer, label_field: Optional[str] = None, weight_field: Optional[str] = None, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Creates Dataset objects from input json files.
This class provides conveniences to load data from json files. It’s possible to directly featurize data from json files using pandas, but this class may prove useful if you’re processing large json files that you don’t want to manipulate directly in memory.
It is meant to load JSON files formatted as “records” in line-delimited format, which allows for sharding:
list like [{column -> value}, ... , {column -> value}]
Examples
Let’s create the sample dataframe.
>>> composition = ["LiCoO2", "MnO2"]
>>> labels = [1.5, 2.3]
>>> import pandas as pd
>>> df = pd.DataFrame(list(zip(composition, labels)), columns=["composition", "task"])
Dump the dataframe to a JSON file formatted as “records” in line-delimited format, then load the JSON file with JsonLoader.
>>> import tempfile
>>> import deepchem as dc
>>> with dc.utils.UniversalNamedTemporaryFile(mode='w') as tmpfile:
...     df.to_json(tmpfile.name, orient='records', lines=True)
...     featurizer = dc.feat.ElementPropertyFingerprint()
...     loader = dc.data.JsonLoader(["task"], feature_field="composition", featurizer=featurizer)
...     dataset = loader.create_dataset(tmpfile.name)
>>> len(dataset)
2
-
__init__
(tasks: List[str], feature_field: str, featurizer: deepchem.feat.base_classes.Featurizer, label_field: Optional[str] = None, weight_field: Optional[str] = None, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Initializes JsonLoader.
- Parameters
tasks (List[str]) – List of task names
feature_field (str) – JSON field with data to be featurized.
featurizer (Featurizer) – Featurizer to use to process data
label_field (str, optional (default None)) – Field with target variables.
weight_field (str, optional (default None)) – Field with weights.
id_field (str, optional (default None)) – Field for identifying samples.
log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
-
create_dataset
(input_files: Union[str, Sequence[str]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.DiskDataset[source]¶ Creates a Dataset from input JSON files.
- Parameters
input_files (OneOrMany[str]) – List of JSON filenames.
data_dir (Optional[str], default None) – Name of directory where featurized data is stored.
shard_size (int, optional (default 8192)) – Shard size when loading data.
- Returns
A DiskDataset object containing a featurized representation of data from input_files.
- Return type
DiskDataset
-
SDFLoader¶
-
class
SDFLoader
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, sanitize: bool = False, log_every_n: int = 1000)[source]¶ Creates a Dataset object from SDF input files.
This class provides conveniences to load and featurize data from Structure Data Files (SDFs). SDF is a standard format for structural information (3D coordinates of atoms and bonds) of molecular compounds.
Examples
>>> import deepchem as dc
>>> import os
>>> current_dir = os.path.dirname(os.path.realpath(__file__))
>>> featurizer = dc.feat.CircularFingerprint(size=16)
>>> loader = dc.data.SDFLoader(["LogP(RRCK)"], featurizer=featurizer, sanitize=True)
>>> dataset = loader.create_dataset(os.path.join(current_dir, "tests", "membrane_permeability.sdf"))
>>> len(dataset)
2
-
__init__
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, sanitize: bool = False, log_every_n: int = 1000)[source]¶ Initialize SDF Loader
- Parameters
tasks (list[str]) – List of task names. These will be loaded from the SDF file.
featurizer (Featurizer) – Featurizer to use to process data
sanitize (bool, optional (default False)) – Whether to sanitize molecules.
log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
-
create_dataset
(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]¶ Creates and returns a Dataset object by featurizing provided files.
Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.
This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A DiskDataset object containing a featurized representation of data from inputs.
- Return type
DiskDataset
-
FASTALoader¶
-
class
FASTALoader
[source]¶ Handles loading of FASTA files.
FASTA files are commonly used to hold sequence data. This class provides convenience methods to load FASTA data and one-hot encode the genomic sequences for use in downstream learning tasks.
-
create_dataset
(input_files: Union[str, Sequence[str]], data_dir: Optional[str] = None, shard_size: Optional[int] = None) → deepchem.data.datasets.DiskDataset[source]¶ Creates a Dataset from input FASTA files.
At present, FASTA support is limited and only allows for one-hot featurization, and doesn’t allow for sharding.
- Parameters
input_files (List[str]) – List of fasta files.
data_dir (str, optional (default None)) – Name of directory where featurized data is stored.
shard_size (int, optional (default None)) – For now, this argument is ignored and each FASTA file gets its own shard.
- Returns
A DiskDataset object containing a featurized representation of data from input_files.
- Return type
DiskDataset
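As a sketch (the filename here is hypothetical; the sequences in it are one-hot encoded):
>> import deepchem as dc
>> loader = dc.data.FASTALoader()
>> dataset = loader.create_dataset("sequences.fasta")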
-
InMemoryLoader¶
The dc.data.InMemoryLoader is designed to facilitate the processing of large datasets where you already hold the raw data in memory (say, in a pandas DataFrame).
-
class
InMemoryLoader
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Facilitates featurization of in-memory objects.
When featurizing a dataset, it’s often the case that the initial set of data (pre-featurization) fits handily within memory. (For example, perhaps it fits within a column of a pandas DataFrame.) In this case, it would be convenient to directly be able to featurize this column of data. However, the process of featurization often generates large arrays which quickly eat up available memory. This class provides convenient capabilities to process such in-memory data by checkpointing generated features periodically to disk.
Example
Here’s an example with only datapoints and no labels or weights.
>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(smiles, shard_size=2)
>>> len(dataset)
4
Here’s an example with both datapoints and labels
>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> labels = [1, 0, 1, 0]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(zip(smiles, labels), shard_size=2)
>>> len(dataset)
4
Here’s an example with datapoints, labels, weights and ids all provided.
>>> import deepchem as dc
>>> smiles = ["C", "CC", "CCC", "CCCC"]
>>> labels = [1, 0, 1, 0]
>>> weights = [1.5, 0, 1.5, 0]
>>> ids = ["C", "CC", "CCC", "CCCC"]
>>> featurizer = dc.feat.CircularFingerprint()
>>> loader = dc.data.InMemoryLoader(tasks=["task1"], featurizer=featurizer)
>>> dataset = loader.create_dataset(zip(smiles, labels, weights, ids), shard_size=2)
>>> len(dataset)
4
-
__init__
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Construct a DataLoader object.
This constructor is mainly provided as a template. As a user, you shouldn’t ever call it directly.
- Parameters
tasks (List[str]) – List of task names
featurizer (Featurizer) – Featurizer to use to process data.
id_field (str, optional (default None)) – Name of field that holds sample identifier. Note that the meaning of “field” depends on the input data type and can have a different meaning in different subclasses. For example, a CSV file could have a field as a column, and an SDF file could have a field as molecular property.
log_every_n (int, optional (default 1000)) – Writes a logging statement this often.
-
create_dataset
(inputs: Sequence[Any], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.DiskDataset[source]¶ Creates and returns a Dataset object by featurizing the provided inputs.
Reads in inputs and uses self.featurizer to featurize them. For large collections of inputs, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.
This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas DataFrame. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (Sequence[Any]) – List of inputs to process. Entries can be arbitrary objects so long as they are understood by self.featurizer.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A DiskDataset object containing a featurized representation of data from inputs.
- Return type
DiskDataset
Data Classes¶
DeepChem featurizers often transform datapoints into “data classes”: classes
that hold all the information needed to train a model on that datapoint.
Models then convert these data classes into training tensors in their
default_generator
methods.
Graph Data¶
These classes document the data classes for graph convolutions.
We plan to simplify these classes (ConvMol
, MultiConvMol
, WeaveMol
)
into a joint data representation (GraphData
) for all graph convolutions in a future version of DeepChem,
so these APIs may not remain stable.
Graph convolution models that inherit from KerasModel depend on ConvMol, MultiConvMol, or WeaveMol, while models that inherit from TorchModel depend on GraphData.
-
class
ConvMol
(atom_features, adj_list, max_deg=10, min_deg=0)[source]¶ Holds information about a molecule.
Internally re-sorts atoms to be in order of increasing degree. Note that only heavy atoms (hydrogens excluded) are considered here.
-
__init__
(atom_features, adj_list, max_deg=10, min_deg=0)[source]¶ - Parameters
atom_features (np.ndarray) – Has shape (n_atoms, n_feat)
adj_list (list) – List of length n_atoms, with neighbor indices of each atom.
max_deg (int, optional) – Maximum degree of any atom.
min_deg (int, optional) – Minimum degree of any atom.
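Examples
A minimal sketch of constructing a ConvMol by hand (the feature values here are arbitrary placeholders):
>>> import numpy as np
>>> from deepchem.feat.mol_graphs import ConvMol
>>> # Three atoms with four features each; atom 0 is bonded to atoms 1 and 2.
>>> atom_features = np.random.rand(3, 4)
>>> adj_list = [[1, 2], [0], [0]]
>>> mol = ConvMol(atom_features, adj_list)
>>> mol.get_atom_features().shape
(3, 4)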
-
get_atom_features
()[source]¶ Returns canonicalized version of atom features.
Features are sorted by atom degree, with the original order maintained when degrees are equal.
-
get_adjacency_list
()[source]¶ Returns a canonicalized adjacency list.
Canonicalized means that the atoms are re-ordered by degree.
- Returns
Canonicalized form of adjacency list.
- Return type
list
-
get_deg_adjacency_lists
()[source]¶ Returns adjacency lists grouped by atom degree.
- Returns
Has length (max_deg+1-min_deg). The element at position deg is itself a list of the neighbor-lists for atoms with degree deg.
- Return type
list
-
get_deg_slice
()[source]¶ Returns degree-slice tensor.
The deg_slice tensor allows indexing into a flattened version of the molecule’s atoms. Assume atoms are sorted in order of degree. Then deg_slice[deg][0] is the starting position for atoms of degree deg in the flattened list, and deg_slice[deg][1] is the number of atoms with degree deg.
Note deg_slice has shape (max_deg+1-min_deg, 2).
- Returns
deg_slice – Shape (max_deg+1-min_deg, 2)
- Return type
np.ndarray
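For intuition, a small worked illustration (hypothetical values, assuming min_deg=0 and max_deg=2 for brevity): if the sorted molecule consists of two atoms of degree 1 followed by one atom of degree 2, the deg_slice tensor would be
>>> import numpy as np
>>> deg_slice = np.array([[0, 0],   # degree 0: starts at index 0, contains 0 atoms
...                       [0, 2],   # degree 1: starts at index 0, contains 2 atoms
...                       [2, 1]])  # degree 2: starts at index 2, contains 1 atom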
-
static
get_null_mol
(n_feat, max_deg=10, min_deg=0)[source]¶ Constructs a null molecule.
Returns one molecule with one atom of each degree, with every atom connected to itself and carrying n_feat features.
- Parameters
n_feat (int) – Number of features for the nodes in the null molecule.
-
static
agglomerate_mols
(mols, max_deg=10, min_deg=0)[source]¶ Concatenates a list of ConvMol objects into one mol object that can be used to feed into TensorFlow placeholders. The indexing of the molecules is preserved during the combination, but the indexing of the atoms changes substantially.
- Parameters
mols (list) – ConvMol objects to be combined into one molecule.
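Examples
A brief sketch of combining two small ConvMol objects (feature values are arbitrary placeholders):
>>> import numpy as np
>>> from deepchem.feat.mol_graphs import ConvMol
>>> feats = np.random.rand(3, 4)
>>> adj = [[1, 2], [0], [0]]
>>> combined = ConvMol.agglomerate_mols([ConvMol(feats, adj), ConvMol(feats, adj)])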
-
-
class
MultiConvMol
(nodes, deg_adj_lists, deg_slice, membership, num_mols)[source]¶ Holds information about multiple molecules, for use in feeding information into TensorFlow. Generated by the agglomerate_mols function.
-
class
WeaveMol
(nodes, pairs, pair_edges)[source]¶ Molecular featurization object for weave convolutions.
These objects are produced by WeaveFeaturizer and feed into WeaveModel. The underlying implementation is inspired by [1].
References
- [1]
Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of Computer-Aided Molecular Design 30.8 (2016): 595-608.
-
class
GraphData
(node_features: numpy.ndarray, edge_index: numpy.ndarray, edge_features: Optional[numpy.ndarray] = None, node_pos_features: Optional[numpy.ndarray] = None)[source]¶ GraphData class
This data class is almost the same as torch_geometric.data.Data.
-
node_features
[source]¶ Node feature matrix with shape [num_nodes, num_node_features]
- Type
np.ndarray
-
edge_index
[source]¶ Graph connectivity in COO format with shape [2, num_edges]
- Type
np.ndarray, dtype int
-
edge_features
[source]¶ Edge feature matrix with shape [num_edges, num_edge_features]
- Type
np.ndarray, optional (default None)
-
node_pos_features
[source]¶ Node position matrix with shape [num_nodes, num_dimensions].
- Type
np.ndarray, optional (default None)
-
num_edges_features
[source]¶ The number of features per edge in the graph
- Type
int, optional (default None)
Examples
>>> import numpy as np
>>> node_features = np.random.rand(5, 10)
>>> edge_index = np.array([[0, 1, 2, 3, 4], [1, 2, 3, 4, 0]], dtype=np.int64)
>>> graph = GraphData(node_features=node_features, edge_index=edge_index)
-
__init__
(node_features: numpy.ndarray, edge_index: numpy.ndarray, edge_features: Optional[numpy.ndarray] = None, node_pos_features: Optional[numpy.ndarray] = None)[source]¶ - Parameters
node_features (np.ndarray) – Node feature matrix with shape [num_nodes, num_node_features]
edge_index (np.ndarray, dtype int) – Graph connectivity in COO format with shape [2, num_edges]
edge_features (np.ndarray, optional (default None)) – Edge feature matrix with shape [num_edges, num_edge_features]
node_pos_features (np.ndarray, optional (default None)) – Node position matrix with shape [num_nodes, num_dimensions].
-
Base Classes (for develop)¶
Dataset¶
The dc.data.Dataset
class is the abstract parent class for all
datasets. This class should never be instantiated directly, but it
contains a number of useful method implementations.
-
class
Dataset
[source]¶ Abstract base class for datasets defined by X, y, w elements.
Dataset objects are used to store representations of a dataset as used in a machine learning task. Datasets contain features X, labels y, weights w and identifiers ids. Different subclasses of Dataset may choose to hold X, y, w, ids in memory or on disk.
The Dataset class attempts to provide strong interoperability with other machine learning dataset representations. Interconversion methods allow Dataset objects to be converted to and from numpy arrays, pandas DataFrames, TensorFlow datasets, and PyTorch datasets (for PyTorch, currently only to, not from).
Note that you can never instantiate a Dataset object directly. Instead you will need to instantiate one of the concrete subclasses.
-
__len__
() → int[source]¶ Get the number of elements in the dataset.
- Returns
The number of elements in the dataset.
- Return type
int
-
get_shape
() → Tuple[Tuple[int, …], Tuple[int, …], Tuple[int, …], Tuple[int, …]][source]¶ Get the shape of the dataset.
Returns four tuples, giving the shape of the X, y, w, and ids arrays.
- Returns
The tuple contains four elements, which are the shapes of the X, y, w, and ids arrays.
- Return type
Tuple
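Examples
A short sketch using an in-memory dataset; the w shape shown assumes the default per-task weights that NumpyDataset fills in:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(5, 3), y=np.random.rand(5, 1))
>>> dataset.get_shape()
((5, 3), (5, 1), (5, 1), (5,))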
-
property
X
[source]¶ Get the X vector for this dataset as a single numpy array.
- Returns
A numpy array of features X.
- Return type
np.ndarray
Note
If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.
-
property
y
[source]¶ Get the y vector for this dataset as a single numpy array.
- Returns
A numpy array of labels y.
- Return type
np.ndarray
Note
If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.
-
property
ids
[source]¶ Get the ids vector for this dataset as a single numpy array.
- Returns
A numpy array of identifiers ids.
- Return type
np.ndarray
Note
If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.
-
property
w
[source]¶ Get the weight vector for this dataset as a single numpy array.
- Returns
A numpy array of weights w.
- Return type
np.ndarray
Note
If data is stored on disk, accessing this field may involve loading data from disk and could potentially be slow. Using iterbatches() or itersamples() may be more efficient for larger datasets.
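Examples
A quick illustration of the X and w properties using an in-memory dataset (NumpyDataset fills in default labels, weights, and ids when they are not supplied):
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(4, 2))
>>> dataset.X.shape
(4, 2)
>>> dataset.w.shape
(4, 1)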
-
iterbatches
(batch_size: Optional[int] = None, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False) → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over minibatches from the dataset.
Each minibatch is returned as a tuple of four numpy arrays: (X, y, w, ids).
- Parameters
batch_size (int, optional (default None)) – Number of elements in each batch.
epochs (int, optional (default 1)) – Number of epochs to walk over dataset.
deterministic (bool, optional (default False)) – If True, follow deterministic order.
pad_batches (bool, optional (default False)) – If True, pad each batch to batch_size.
- Returns
Generator which yields tuples of four numpy arrays (X, y, w, ids).
- Return type
Iterator[Batch]
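Examples
For instance, iterating over ten samples in padded batches of four yields three equally sized batches (the last is padded up from two samples):
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> for X_b, y_b, w_b, ids_b in dataset.iterbatches(batch_size=4, pad_batches=True):
...     print(X_b.shape)
(4, 3)
(4, 3)
(4, 3)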
-
itersamples
() → Iterator[Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]][source]¶ Get an object that iterates over the samples in the dataset.
Examples
>>> dataset = NumpyDataset(np.ones((2,2)))
>>> for x, y, w, id in dataset.itersamples():
...     print(x.tolist(), y.tolist(), w.tolist(), id)
[1.0, 1.0] [0.0] [0.0] 0
[1.0, 1.0] [0.0] [0.0] 1
-
transform
(transformer: transformers.Transformer, **args) → deepchem.data.datasets.Dataset[source]¶ Construct a new dataset by applying a transformation to every sample in this dataset.
The argument is a function that can be called as follows:
>> newx, newy, neww = fn(x, y, w)
It might be called only once with the whole dataset, or multiple times with different subsets of the data. Each time it is called, it should transform the samples and return the transformed data.
- Parameters
transformer (dc.trans.Transformer) – The transformation to apply to each sample in the dataset.
- Returns
A newly constructed Dataset object.
- Return type
Dataset
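Examples
A brief sketch using a normalization transformer; NormalizationTransformer is one concrete dc.trans.Transformer, and any other Transformer could stand in for it here:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(4, 3), y=np.random.rand(4, 1))
>>> transformer = dc.trans.NormalizationTransformer(transform_y=True, dataset=dataset)
>>> transformed = dataset.transform(transformer)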
-
select
(indices: Sequence[int], select_dir: Optional[str] = None) → deepchem.data.datasets.Dataset[source]¶ Creates a new dataset from a selection of indices from self.
- Parameters
indices (Sequence) – List of indices to select.
select_dir (str, optional (default None)) – Path to new directory that the selected indices will be copied to.
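Examples
For example, selecting three samples out of ten from an in-memory dataset:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.arange(10).reshape(10, 1))
>>> selected = dataset.select([0, 2, 4])
>>> len(selected)
3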
-
get_statistics
(X_stats: bool = True, y_stats: bool = True) → Tuple[float, …][source]¶ Compute and return statistics of this dataset.
Uses self.itersamples() to compute means and standard deviations of the dataset. Can compute on large datasets that don’t fit in memory.
- Parameters
X_stats (bool, optional (default True)) – If True, compute feature-level mean and standard deviations.
y_stats (bool, optional (default True)) – If True, compute label-level mean and standard deviations.
- Returns
If X_stats == True, returns (X_means, X_stds).
If y_stats == True, returns (y_means, y_stds).
If both are true, returns (X_means, X_stds, y_means, y_stds).
- Return type
Tuple
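Examples
A short sketch; with both flags left at their defaults, the four statistics come back in the order documented above:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> X_means, X_stds, y_means, y_stds = dataset.get_statistics()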
-
make_tf_dataset
(batch_size: int = 100, epochs: int = 1, deterministic: bool = False, pad_batches: bool = False)[source]¶ Create a tf.data.Dataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w) for one batch.
- Parameters
batch_size (int, default 100) – The number of samples to include in each batch.
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
pad_batches (bool, default False) – If True, batches are padded as necessary to make the size of each batch exactly equal batch_size.
- Returns
TensorFlow Dataset that iterates over the same data.
- Return type
tf.data.Dataset
Note
This class requires TensorFlow to be installed.
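Examples
A minimal sketch, assuming TensorFlow is installed; each value yielded by the resulting tf.data.Dataset is an (X, y, w) batch as described above:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> tf_ds = dataset.make_tf_dataset(batch_size=5, epochs=1)
>>> for X_b, y_b, w_b in tf_ds:
...     print(X_b.shape)
(5, 3)
(5, 3)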
-
make_pytorch_dataset
(epochs: int = 1, deterministic: bool = False, batch_size: Optional[int] = None)[source]¶ Create a torch.utils.data.IterableDataset that iterates over the data in this Dataset.
Each value returned by the Dataset’s iterator is a tuple of (X, y, w, id) containing the data for one batch, or for a single sample if batch_size is None.
- Parameters
epochs (int, default 1) – The number of times to iterate over the Dataset.
deterministic (bool, default False) – If True, the data is produced in order. If False, a different random permutation of the data is used for each epoch.
batch_size (int, optional (default None)) – The number of samples to return in each batch. If None, each returned value is a single sample.
- Returns
torch.utils.data.IterableDataset that iterates over the data in this dataset.
- Return type
torch.utils.data.IterableDataset
Note
This class requires PyTorch to be installed.
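Examples
A minimal sketch, assuming PyTorch is installed; with batch_size set, each yielded value is an (X, y, w, id) batch:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(10, 3), y=np.random.rand(10, 1))
>>> torch_ds = dataset.make_pytorch_dataset(epochs=1, batch_size=5)
>>> len(list(torch_ds))
2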
-
to_dataframe
() → pandas.core.frame.DataFrame[source]¶ Construct a pandas DataFrame containing the data from this Dataset.
- Returns
Pandas DataFrame. If there is only a single feature per datapoint, the DataFrame will have a column “X”; otherwise it will have columns “X1”, “X2”, … for the features. If there is only a single label per datapoint, it will have a column “y”; otherwise columns “y1”, “y2”, … for the labels. If there is only a single weight per datapoint, it will have a column “w”; otherwise columns “w1”, “w2”, …. It will always have a column “ids” for the identifiers.
- Return type
pd.DataFrame
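Examples
A short sketch; with two features and a single label per datapoint, the columns follow the naming scheme described above (shown sorted for a stable doctest):
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(3, 2), y=np.random.rand(3, 1))
>>> df = dataset.to_dataframe()
>>> sorted(df.columns)
['X1', 'X2', 'ids', 'w', 'y']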
-
static
from_dataframe
(df: pandas.core.frame.DataFrame, X: Optional[Union[str, Sequence[str]]] = None, y: Optional[Union[str, Sequence[str]]] = None, w: Optional[Union[str, Sequence[str]]] = None, ids: Optional[str] = None)[source]¶ Construct a Dataset from the contents of a pandas DataFrame.
- Parameters
df (pd.DataFrame) – The pandas DataFrame
X (str or List[str], optional (default None)) – The name of the column or columns containing the X array. If this is None, it will look for default column names that match those produced by to_dataframe().
y (str or List[str], optional (default None)) – The name of the column or columns containing the y array. If this is None, it will look for default column names that match those produced by to_dataframe().
w (str or List[str], optional (default None)) – The name of the column or columns containing the w array. If this is None, it will look for default column names that match those produced by to_dataframe().
ids (str, optional (default None)) – The name of the column containing the ids. If this is None, it will look for default column names that match those produced by to_dataframe().
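Examples
A round-trip sketch paired with to_dataframe(); leaving the column arguments as None lets the default column names be detected automatically:
>>> import numpy as np
>>> import deepchem as dc
>>> dataset = dc.data.NumpyDataset(X=np.random.rand(3, 2), y=np.random.rand(3, 1))
>>> df = dataset.to_dataframe()
>>> restored = dc.data.NumpyDataset.from_dataframe(df)
>>> len(restored)
3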
-
DataLoader¶
The dc.data.DataLoader
class is the abstract parent class for all
dataloaders. This class should never be instantiated directly, but it
contains a number of useful method implementations.
-
class
DataLoader
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Handles loading/featurizing of data from disk.
The main use of DataLoader and its child classes is to make it easier to load large datasets into Dataset objects.
DataLoader is an abstract superclass that provides a general framework for loading data into DeepChem. This class should never be instantiated directly. To load your own type of data, make a subclass of DataLoader and provide your own implementation for the create_dataset() method.
To construct a Dataset from input data, first instantiate a concrete data loader (that is, an object which is an instance of a subclass of DataLoader) with a given Featurizer object. Then call the data loader’s create_dataset() method on a list of input files that hold the source data to process. Note that each subclass of DataLoader is specialized to handle one type of input data so you will have to pick the loader class suitable for your input data type.
Note that it isn’t necessary to use a data loader to process input data. You can directly use Featurizer objects to featurize provided input into numpy arrays, but note that this calculation will be performed in memory, so you will have to write generators that walk the source files and write featurized data to disk yourself. DataLoader and its subclasses make this process easier for you by performing this work under the hood.
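Examples
As an illustration, a minimal hypothetical subclass (ListLoader is not part of DeepChem) might featurize a list of inputs in one pass and return an in-memory dataset; real subclasses typically shard their work to disk as described above:
>>> import deepchem as dc
>>> class ListLoader(dc.data.DataLoader):  # hypothetical minimal subclass
...     def create_dataset(self, inputs, data_dir=None, shard_size=8192):
...         # Featurize everything in memory; a production loader would shard.
...         features = self.featurizer.featurize(inputs)
...         return dc.data.NumpyDataset(X=features)
>>> loader = ListLoader(tasks=["task1"], featurizer=dc.feat.CircularFingerprint())
>>> dataset = loader.create_dataset(["C", "CC", "CCC"])
>>> len(dataset)
3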
-
__init__
(tasks: List[str], featurizer: deepchem.feat.base_classes.Featurizer, id_field: Optional[str] = None, log_every_n: int = 1000)[source]¶ Construct a DataLoader object.
This constructor is provided mainly as a template. As a user, you should never call it directly.
- Parameters
tasks (List[str]) – List of task names
featurizer (Featurizer) – Featurizer to use to process data.
id_field (str, optional (default None)) – Name of the field that holds the sample identifier. Note that the meaning of “field” depends on the input data type and can differ between subclasses: for example, a field could be a column of a CSV file or a molecular property of an SDF file.
log_every_n (int, optional (default 1000)) – Writes a logging statement every log_every_n items processed.
-
featurize
(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]¶ Featurize provided files and write to specified location.
DEPRECATED: This method is now a wrapper for create_dataset() and calls that method under the hood.
For large datasets, automatically shards into smaller chunks for convenience. This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
data_dir (str, default None) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A Dataset object containing a featurized representation of data from inputs.
- Return type
Dataset
-
create_dataset
(inputs: Union[Any, Sequence[Any]], data_dir: Optional[str] = None, shard_size: Optional[int] = 8192) → deepchem.data.datasets.Dataset[source]¶ Creates and returns a Dataset object by featurizing provided files.
Reads in inputs and uses self.featurizer to featurize the data in these inputs. For large files, automatically shards into smaller chunks of shard_size datapoints for convenience. Returns a Dataset object that contains the featurized dataset.
This implementation assumes that the helper methods _get_shards and _featurize_shard are implemented and that each shard returned by _get_shards is a pandas dataframe. You may choose to reuse or override this method in your subclass implementations.
- Parameters
inputs (List) – List of inputs to process. Entries can be filenames or arbitrary objects.
data_dir (str, optional (default None)) – Directory to store featurized dataset.
shard_size (int, optional (default 8192)) – Number of examples stored in each shard.
- Returns
A DiskDataset object containing a featurized representation of data from inputs.
- Return type
DiskDataset