The DeepChem library is packaged alongside the MoleculeNet suite of datasets.
One of the most important parts of machine learning applications is finding a suitable dataset.
The MoleculeNet suite has curated a whole range of datasets and loaded them into DeepChem
dc.data.Dataset objects for convenience.
When training a model or running a benchmark, the user needs specific datasets.
However, at the beginning, this search can be exhausting and confusing. The
following cheatsheet aims to help DeepChem users identify more easily which
dataset to use for their purposes.
Each row represents a dataset and gives a brief description of it. The columns
indicate the type of data (molecular properties, images, or
materials) and the number of data points. Each dataset is referenced with a
link to its paper. Finally, some entries still need further information.
If you are proposing a new dataset to be included in the
MoleculeNet benchmarking suite, please follow the instructions below.
Please review the datasets already available in MolNet before contributing.
Write a load_dataset function that documents the dataset and add your load function to deepchem.molnet.__init__.py for easy importing.
Prepare your dataset as a .tar.gz or .zip file. Accepted filetypes include CSV, JSON, and SDF.
Ask a member of the technical steering committee to add your .tar.gz or .zip file to the DeepChem AWS bucket. Modify your load function to pull down the dataset from AWS.
Below is an example of how to load a MoleculeNet dataset and featurizer. This approach will work for any dataset in MoleculeNet by changing the load function and featurizer. For more details on the featurizers, see the Featurizers section.
Note that the “w” matrix represents the weight of each sample. Some assays may have missing values, in which case the weight is 0. Otherwise, the weight is 1.
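The weighting rule can be sketched in plain Python. This is a minimal illustration rather than DeepChem code: the label matrix y is made up, and None stands in for a missing assay value (an assumption; real datasets may mark missing values differently).

```python
# Hypothetical label matrix: 3 samples x 2 tasks.
# None marks a missing assay value (an assumption for this sketch).
y = [
    [1.0, 0.0],
    [None, 1.0],
    [0.0, None],
]

# Weight is 0 where the label is missing and 1 otherwise.
w = [[0.0 if label is None else 1.0 for label in row] for row in y]

for labels, weights in zip(y, w):
    print(labels, weights)
```

A loss that multiplies each per-sample, per-task error by the matching entry of w then ignores the missing measurements.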
Additionally, the environment variable DEEPCHEM_DATA_DIR can be set, for example with os.environ['DEEPCHEM_DATA_DIR'] = 'path/to/store/featurized/dataset'. When DEEPCHEM_DATA_DIR is set, the molnet loader stores the featurized dataset in the specified directory. The next time the dataset is loaded, it is fetched directly from that directory rather than featurized again from the raw data.
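A minimal sketch of setting the variable from Python follows. The directory name deepchem_data under the system temporary directory is a hypothetical choice for illustration; any writable path should work.

```python
import os
import tempfile

# Hypothetical cache location (an assumption); any writable directory would do.
data_dir = os.path.join(tempfile.gettempdir(), "deepchem_data")
os.makedirs(data_dir, exist_ok=True)

# Set the variable before calling a MoleculeNet loader so that the
# featurized dataset is stored in, and later reloaded from, this directory.
os.environ["DEEPCHEM_DATA_DIR"] = data_dir

print(os.environ["DEEPCHEM_DATA_DIR"])
```

Setting the variable inside the process affects only that process; exporting it in the shell is the usual alternative when several scripts should share one cache.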
BACE dataset with classification labels (“class”). The BACE dataset
contains 1513 compounds with binary classification
labels (0 or 1).
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
The BACE dataset provides quantitative IC50 and qualitative (binary label)
binding results for a set of inhibitors of human beta-secretase 1 (BACE-1).
All data are experimental values reported in scientific literature over the
past decade, some with detailed crystal structures available. A collection
of 1522 compounds is provided, along with the regression labels of IC50. The
number of tasks in the dataset is one.
Scaffold splitting is recommended for this dataset.
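To illustrate what scaffold splitting does, here is a minimal pure-Python sketch. It assumes a scaffold key has already been computed for every molecule (in practice this is typically a Bemis-Murcko scaffold derived with RDKit); the function name scaffold_split and the scaffold strings are hypothetical, and this is not DeepChem's implementation.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split sample indices so that no scaffold spans two subsets."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest scaffold groups first, so rare scaffolds land in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = round(frac_train * n)
    valid_cutoff = round((frac_train + frac_valid) * n)

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical precomputed scaffold keys for 10 molecules.
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole", "furan"]
train, valid, test = scaffold_split(scaffolds)
print(train, valid, test)
```

Because whole scaffold groups are assigned together, the validation and test sets contain structural families the model has not seen during training, which gives a harder and more realistic estimate than a random split.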
The raw data csv file contains columns below:
“mol” - SMILES representation of the molecular structure
“pIC50” - Negative log of the IC50 binding affinity
“class” - Binary labels for inhibitor
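As a worked example of the “pIC50” column, the sketch below converts an IC50 to its negative base-10 logarithm. It assumes the usual convention that the IC50 is expressed in mol/L before taking the logarithm; the helper name pic50_from_ic50_nm and the input values are hypothetical.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9
    return -math.log10(ic50_molar)


# Hypothetical IC50 values in nM; stronger inhibitors give larger pIC50.
for ic50_nm in (1.0, 100.0, 10000.0):
    print(ic50_nm, round(pic50_from_ic50_nm(ic50_nm), 2))
```

Under this convention a tenfold drop in IC50 raises pIC50 by one unit.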
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
This dataset contains 6 images of human HT29 colon cancer cells. The task is
to learn to predict the cell counts in these images. This dataset is too small
to train algorithms on, but it might serve as a good test dataset.
https://data.broadinstitute.org/bbbc/BBBC001/
Parameters:
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
This dataset contains data corresponding to 5 samples of Drosophila Kc167
cells. There are 10 fields of view for each sample, each an image of size
512x512. Ground truth labels contain cell counts for this dataset. Full
details about this dataset are present at
https://data.broadinstitute.org/bbbc/BBBC002/.
Parameters:
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
This dataset contains data corresponding to 15 samples of mouse embryos with DIC.
Each image is of size 640x480. Ground truth labels contain cell counts and
segmentation masks for this dataset. Full details about this dataset are present at
https://data.broadinstitute.org/bbbc/BBBC003/.
Parameters:
load_segmentation_mask (bool) – if True, the dataset will contain segmentation masks as labels. Otherwise,
the dataset will contain cell counts as labels.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Examples
Importing necessary modules
>>> import deepchem as dc
>>> import numpy as np
We can load the BBBC003 dataset with 2 types of labels: segmentation masks and
cell counts. We will first load the dataset with cell counts as labels.
We now have a dataset with 15 samples, each with 300 cells. The images are of
size 640x480. The labels are cell counts. We can verify this as follows:
Note: The image labelled ‘7_19_M2E15.tif’ is transposed to 480x640 in the source file, along with its
segmentation mask. To match it with the other images, we need to transpose it back to 640x480.
This image is found at index 6 in the train dataset (assuming no shuffling has taken place).
First, we load the dataset as usual and split it into X, y, w and ids. Here, X is the list
of input images, y is the list of labels, w is the list of weights and ids is the list of
IDs for each sample.
We can now transpose the image at index 6 in the input data (train_x):
>>> train_x[6] = train_x[6].T
We can now verify that the image is of size 640x480:
>>> print(train_x[6].shape)
(640, 480)
This is also seen in the segmentation mask with the same filename and index, in which
case, we transpose the label (train_y) instead of the input data:
>>> train_y[6] = train_y[6].T
We can now verify that the mask is of size 640x480:
>>> train_y[6].shape
(640, 480)
This dataset contains data corresponding to 20 samples of synthetically generated
fluorescent cell population images. There are 300 cells in each sample, each an image
of size 950x950. Ground truth labels contain cell counts and segmentation masks for
this dataset. Full details about this dataset are present at
https://data.broadinstitute.org/bbbc/BBBC004/.
Parameters:
overlap_probability (float from list {0.0, 0.15, 0.3, 0.45, 0.6}) – the overlap probability of the synthetic cells in the images
load_segmentation_mask (bool) – if True, the dataset will contain segmentation masks as labels. Otherwise,
the dataset will contain cell counts as labels.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Examples
Importing necessary modules
>>> import deepchem as dc
>>> import numpy as np
We can load the BBBC004 dataset with 2 types of labels: segmentation masks and
cell counts. We will first load the dataset with cell counts as labels.
We now have a dataset with 20 samples, each with 300 cells. The images are of
size 950x950. The labels are cell counts. We can verify this as follows:
This dataset contains data corresponding to 19,200 samples of synthetically generated
fluorescent cell population images. These images were simulated for a given cell count
with a clustering probability of 25% and a CCD noise variance of 0.0001. Focus blur
was simulated by applying varying Gaussian filters to the images. Each image is of
size 520x696. Ground truth labels contain cell counts for this dataset. Full details
about this dataset are present at
https://data.broadinstitute.org/bbbc/BBBC005/.
Parameters:
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Examples
Importing necessary modules
>>> import deepchem as dc
>>> import numpy as np
We will now load the BBBC005 dataset with cell counts as labels.
We now have a dataset with a total of 19,200 samples with cell counts in
the range of 1-100. The images are of size 520x696. The labels are cell
counts. We have a train-val-test split of 80:10:10. We can verify this as follows:
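The subset sizes implied by these figures can be checked with simple arithmetic. The sketch below only assumes that the 80:10:10 ratio applies exactly to the 19,200 samples; it does not load the dataset.

```python
n_samples = 19200

# Split ratio expressed in whole percent to keep the arithmetic exact.
percentages = {"train": 80, "valid": 10, "test": 10}

sizes = {name: n_samples * pct // 100 for name, pct in percentages.items()}

for name, size in sizes.items():
    print(name, size)
```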
The blood-brain barrier penetration (BBBP) dataset is designed for the
modeling and prediction of barrier permeability. As a membrane separating
circulating blood and brain extracellular fluid, the blood-brain barrier
blocks most drugs, hormones and neurotransmitters. Penetration of the
barrier is therefore a long-standing issue in the development of drugs targeting
the central nervous system.
This dataset includes binary labels for over 2000 compounds on their
permeability properties.
Scaffold splitting is recommended for this dataset.
The raw data csv file contains columns below:
“name” - Name of the compound
“smiles” - SMILES representation of the molecular structure
“p_np” - Binary labels for penetration/non-penetration
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
This dataset is based on release 22.1 of the data from https://www.ebi.ac.uk/chembl/.
Two subsets of the data are available, depending on the “set” argument. “sparse”
is a large dataset with 244,245 compounds. As the name suggests, the data is
extremely sparse, with most compounds having activity data for only one target.
“5thresh” is a much smaller set (23,871 compounds) that includes only compounds
with activity data for at least five targets.
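The relationship between the two subsets can be sketched with a toy activity table. Everything here is hypothetical (compound names, target names, values); the sketch only illustrates the "at least five targets" criterion and is not how the distributed files were actually built.

```python
# Hypothetical activity table: compound ID -> {target ID: measured activity}.
activities = {
    "cmpd_a": {"t1": 6.2},
    "cmpd_b": {"t1": 5.1, "t2": 7.3, "t3": 4.8, "t4": 6.0, "t5": 5.5},
    "cmpd_c": {"t2": 8.1, "t3": 6.6},
    "cmpd_d": {"t1": 4.9, "t2": 5.0, "t3": 5.2, "t4": 7.7, "t5": 6.1, "t6": 5.8},
}

MIN_TARGETS = 5  # threshold behind the "5thresh" name

# "sparse" keeps every compound; "5thresh" keeps only well-annotated ones.
sparse_set = sorted(activities)
thresh_set = sorted(
    compound
    for compound, targets in activities.items()
    if len(targets) >= MIN_TARGETS
)

print(sparse_set)
print(thresh_set)
```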
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
set (str) – the subset to load, either “sparse” or “5thresh”
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Loads the ChEMBL25 dataset, featurizes it, and does a split.
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
The ClinTox dataset compares drugs approved by the FDA and
drugs that have failed clinical trials for toxicity reasons.
The dataset includes two classification tasks for 1491 drug
compounds with known chemical structures:
clinical trial toxicity (or absence of toxicity)
FDA approval status.
The list of FDA-approved drugs is compiled from the SWEETLEAD
database, and the list of drugs that failed clinical trials for
toxicity reasons is compiled from the Aggregate Analysis of
ClinicalTrials.gov (AACT) database.
Random splitting is recommended for this dataset.
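For comparison with scaffold-based approaches, a random split simply shuffles sample indices and cuts them by fraction. The sketch below is a plain-Python illustration, not DeepChem's splitter; the function name random_split, the 80/10/10 fractions, and the seed are assumptions, and 1491 is the compound count quoted above.

```python
import random


def random_split(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle indices reproducibly and cut them into three subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_train = round(frac_train * n_samples)
    n_valid = round(frac_valid * n_samples)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


train, valid, test = random_split(1491)
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models on the same benchmark.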
The raw data csv file contains columns below:
“smiles” - SMILES representation of the molecular structure
“FDA_APPROVED” - FDA approval status
“CT_TOX” - Clinical trial results
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
The Delaney (ESOL) dataset is a regression dataset containing structures and
water solubility data for 1128 compounds. The dataset is widely used to
validate machine learning models on estimating solubility directly from
molecular structures (as encoded in SMILES strings).
Scaffold splitting is recommended for this dataset.
The raw data csv file contains columns below:
“Compound ID” - Name of the compound
“smiles” - SMILES representation of the molecular structure
“measured log solubility in mols per litre” - Log-scale water solubility
of the compound, used as label
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Loads FACTOR dataset; does not do train/test split
The Factors dataset is an in-house dataset from Merck that was first introduced in the following paper:
Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.
It contains 1500 Merck in-house compounds that were measured
for IC50 of inhibition on 12 serine proteases. Unlike most of
the other datasets featured in MoleculeNet, the Factors
collection does not have structures for the compounds tested
since they were proprietary Merck compounds. However, the
collection does feature pre-computed descriptors for these
compounds.
Note that the original train/valid/test split from the source
data was preserved here, so this function doesn’t allow for
alternate modes of splitting. Similarly, since the source data
came pre-featurized, it is not possible to apply alternative
featurizations.
Parameters:
shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored since featurization pre-computed
split (optional) – Ignored since split pre-computed
reload (bool, optional) – Whether to automatically re-load from disk
The FreeSolv dataset is a collection of experimental and calculated hydration
free energies for small molecules in water.
Here, we are using a modified version of the dataset with the molecule SMILES string
and the corresponding experimental hydration free energy.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“mol” - SMILES representation of the molecular structure
“y” - Experimental hydration free energy
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
The HIV dataset was introduced by the Drug Therapeutics
Program (DTP) AIDS Antiviral Screen, which tested the ability
to inhibit HIV replication for over 40,000 compounds.
Screening results were evaluated and placed into three
categories: confirmed inactive (CI), confirmed active (CA) and
confirmed moderately active (CM). We further combine the
latter two labels, making it a classification task between
inactive (CI) and active (CA and CM).
Scaffold splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles”: SMILES representation of the molecular structure
“activity”: Three-class labels for screening results: CI/CM/CA
“HIV_active”: Binary labels for screening results: 1 (CA/CM) and 0 (CI)
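The merging of the three screening categories into a binary label can be sketched as follows. The helper name hiv_active and the sample inputs are hypothetical; the mapping itself (CA and CM to 1, CI to 0) follows the column description above.

```python
ACTIVE_CLASSES = {"CA", "CM"}
KNOWN_CLASSES = {"CI", "CM", "CA"}


def hiv_active(activity: str) -> int:
    """Map a three-class screening result to the binary label."""
    if activity not in KNOWN_CLASSES:
        raise ValueError(f"unknown activity class: {activity!r}")
    return int(activity in ACTIVE_CLASSES)


# Hypothetical screening results for four compounds.
screening = ["CI", "CM", "CA", "CI"]
labels = [hiv_active(activity) for activity in screening]
print(labels)
```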
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
The HOPV dataset is the “Harvard Organic
Photovoltaic Dataset”. This dataset includes 350 small
molecules and polymers that were utilized as p-type materials
in OPVs. Experimental properties include: HOMO [a.u.], LUMO
[a.u.], Electrochemical gap [a.u.], Optical gap [a.u.], Power
conversion efficiency [%], Open circuit potential [V], Short
circuit current density [mA/cm^2], and fill factor [%].
Theoretical calculations in the original dataset have been
removed (for now).
Lopez, Steven A., et al. “The Harvard organic photovoltaic dataset.” Scientific data 3.1 (2016): 1-7.
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Loads kaggle datasets. Generates if not stored already.
The Kaggle dataset is an in-house dataset from Merck that was first introduced in the following paper:
Ma, Junshui, et al. “Deep neural nets as a method for quantitative structure–activity relationships.” Journal of chemical information and modeling 55.2 (2015): 263-274.
It contains 100,000 unique Merck in-house compounds that were
measured on 15 enzyme inhibition and ADME/TOX datasets.
Unlike most of the other datasets featured in MoleculeNet,
the Kaggle collection does not have structures for the
compounds tested since they were proprietary Merck compounds.
However, the collection does feature pre-computed descriptors
for these compounds.
Note that the original train/valid/test split from the source
data was preserved here, so this function doesn’t allow for
alternate modes of splitting. Similarly, since the source data
came pre-featurized, it is not possible to apply alternative
featurizations.
Parameters:
shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored since featurization pre-computed
split (optional) – Ignored since split pre-computed
reload (bool, optional) – Whether to automatically re-load from disk
Loads Kinase datasets, does not do train/test split
The Kinase dataset is an in-house dataset from Merck that was first introduced in the following paper:
Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.
It contains 2500 Merck in-house compounds that were measured
for IC50 of inhibition on 99 protein kinases. Unlike most of
the other datasets featured in MoleculeNet, the Kinase
collection does not have structures for the compounds tested
since they were proprietary Merck compounds. However, the
collection does feature pre-computed descriptors for these
compounds.
Note that the original train/valid/test split from the source
data was preserved here, so this function doesn’t allow for
alternate modes of splitting. Similarly, since the source data
came pre-featurized, it is not possible to apply alternative
featurizations.
Parameters:
shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored since featurization pre-computed
split (optional) – Ignored since split pre-computed
reload (bool, optional) – Whether to automatically re-load from disk
Lipophilicity is an important feature of drug molecules that affects both
membrane permeability and solubility. The lipophilicity dataset, curated
from ChEMBL database, provides experimental results of octanol/water
distribution coefficient (logD at pH 7.4) of 4200 compounds.
Scaffold splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles” - SMILES representation of the molecular structure
“exp” - Measured octanol/water distribution coefficient (logD) of the
compound, used as label
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
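The parameters above can be combined as in the following minimal sketch. It assumes the loader is exposed as dc.molnet.load_lipo and that "ECFP", "scaffold", and "normalization" are valid shortcut names; the wrapper load_lipophilicity and the LIPO_KWARGS dictionary are hypothetical helpers introduced only for illustration.

```python
# Keyword arguments mirroring the parameters documented above.
LIPO_KWARGS = {
    "featurizer": "ECFP",               # assumed shortcut name from dc.molnet.featurizers
    "splitter": "scaffold",             # scaffold splitting is recommended for this dataset
    "transformers": ["normalization"],  # assumed shortcut name from dc.molnet.transformers
    "reload": True,                     # cache the featurized datasets on the first call
}


def load_lipophilicity(**overrides):
    """Hypothetical wrapper: needs deepchem installed and network access."""
    import deepchem as dc  # imported lazily so this module loads without deepchem

    kwargs = {**LIPO_KWARGS, **overrides}
    tasks, (train, valid, test), transformers = dc.molnet.load_lipo(**kwargs)
    return tasks, (train, valid, test), transformers
```

Calling load_lipophilicity(splitter="random") would override only the splitter and keep the remaining defaults.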
Materials datasets include inorganic crystal structures, chemical
compositions, and target properties like formation energies and band
gaps. Machine learning problems in materials science commonly include
predicting the value of a continuous (regression) or categorical
(classification) property of a material based on its chemical composition
or crystal structure. “Inverse design” is also of great interest, in which
ML methods generate crystal structures that have a desired property.
Other areas where ML is applicable in materials include discovering new
or modified phenomenological models that describe material behavior.
Contains 4604 experimentally measured band gaps for inorganic
crystal structure compositions. In benchmark studies, random forest
models achieved a mean absolute error of 0.45 eV during five-fold
nested cross validation on this dataset.
For more details on the dataset see [1]_. For more details
on previous benchmarks for this dataset, see [2]_.
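The benchmark figure above combines two ideas: the mean absolute error (MAE) metric and five-fold nested cross validation. The sketch below illustrates both in plain Python. It is not the benchmark's actual protocol: the function names are hypothetical, folds are contiguous, and no shuffling is applied.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def nested_folds(n_samples, outer_k=5, inner_k=5):
    """Yield (outer_train, outer_test, inner_splits) for nested cross validation.

    The inner splits subdivide only the outer training indices, so the outer
    test fold is never seen during model selection.
    """
    for outer_train, outer_test in k_fold_indices(n_samples, outer_k):
        inner = [
            ([outer_train[i] for i in tr], [outer_train[i] for i in va])
            for tr, va in k_fold_indices(len(outer_train), inner_k)
        ]
        yield outer_train, outer_test, inner
```

In a typical nested setup the inner splits select hyperparameters, and the outer test folds give the reported MAE.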
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Returns:
tasks, datasets, transformers –
tasks: list
Column names corresponding to machine learning target variables.
datasets: tuple
train, validation, test splits of data as
deepchem.data.datasets.Dataset instances.
transformers: list
deepchem.trans.transformers.Transformer instances applied
to dataset.
Contains 18928 perovskite structures and their formation energies.
In benchmark studies, random forest models and crystal graph
neural networks achieved mean absolute errors of 0.23 and 0.05 eV/atom,
respectively, during five-fold nested cross validation on this
dataset.
For more details on the dataset see [1]_. For more details
on previous benchmarks for this dataset, see [2]_.
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Returns:
tasks, datasets, transformers –
tasks: list
Column names corresponding to machine learning target variables.
datasets: tuple
train, validation, test splits of data as
deepchem.data.datasets.Dataset instances.
transformers: list
deepchem.trans.transformers.Transformer instances applied
to dataset.
Contains 132752 calculated formation energies and inorganic
crystal structures from the Materials Project database. In benchmark
studies, random forest models achieved a mean absolute error of
0.116 eV/atom during five-fold nested cross validation on this
dataset.
For more details on the dataset see [1]_. For more details
on previous benchmarks for this dataset, see [2]_.
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Returns:
tasks, datasets, transformers –
tasks: list
Column names corresponding to machine learning target variables.
datasets: tuple
train, validation, test splits of data as
deepchem.data.datasets.Dataset instances.
transformers: list
deepchem.trans.transformers.Transformer instances applied
to dataset.
Contains 106113 inorganic crystal structures from the Materials
Project database labeled as metals or nonmetals. In benchmark
studies, random forest models achieved a mean ROC-AUC of
0.9 during five-fold nested cross validation on this
dataset.
For more details on the dataset see [1]_. For more details
on previous benchmarks for this dataset, see [2]_.
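For readers unfamiliar with the ROC-AUC metric quoted above, the following sketch computes it directly from its pairwise definition: the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. Encoding metals as 1 and nonmetals as 0 is an assumption made for illustration, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = metal, 0 = nonmetal) and classifier scores.
example_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])  # 3 of 4 pairs ranked correctly
```

The pairwise loop is quadratic in the number of samples, so it suits small illustrations rather than a dataset of this size.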
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Returns:
tasks, datasets, transformers –
tasks: list
Column names corresponding to machine learning target variables.
datasets: tuple
train, validation, test splits of data as
deepchem.data.datasets.Dataset instances.
transformers: list
deepchem.trans.transformers.Transformer instances applied
to dataset.
The Maximum Unbiased Validation (MUV) group is a benchmark dataset selected
from PubChem BioAssay by applying a refined nearest neighbor analysis.
The MUV dataset contains 17 challenging tasks for around 90 thousand
compounds and is specifically designed for validation of virtual screening
techniques.
Scaffold splitting is recommended for this dataset.
The raw data csv file contains columns below:
“mol_id” - PubChem CID of the compound
“smiles” - SMILES representation of the molecular structure
“MUV-XXX” - Measured results (Active/Inactive) for bioassays
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
PubChem BioAssay (PCBA) is a database consisting of biological activities of
small molecules generated by high-throughput screening. We use a subset of
PCBA, containing 128 bioassays measured over 400 thousand compounds,
used by previous work to benchmark machine learning methods.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“mol_id” - PubChem CID of the compound
“smiles” - SMILES representation of the molecular structure
“PCBA-XXX” - Measured results (Active/Inactive) for bioassays
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
The PDBBind dataset includes experimental binding affinity data
and structures for 4852 protein-ligand complexes from the “refined set”
and 12800 complexes from the “general set” in PDBBind v2019 and 193
complexes from the “core set” in PDBBind v2013.
The refined set removes data with obvious problems
in 3D structure, binding data, or other aspects and should therefore
be a better starting point for docking/scoring studies. Details on
the criteria used to construct the refined set can be found in [4]_.
The general set does not include the refined set. The core set is
a subset of the refined set that is not updated annually.
Random splitting is recommended for this dataset.
The raw dataset contains the columns below:
“ligand” - SDF of the molecular structure
“protein” - PDB of the protein structure
“-logKd/Ki” - Experimentally measured binding affinity of the complex, used as label
Parameters:
featurizer (ComplexFeaturizer or str) – the complex featurizer to use for processing the data.
Alternatively you can pass one of the names from
dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
pocket (bool (default True)) – If true, use only the binding pocket for featurization.
set_name (str (default 'core')) – Name of dataset to download. ‘refined’, ‘general’, and ‘core’ are supported.
Returns:
tasks, datasets, transformers –
tasks: list
Column names corresponding to machine learning target variables.
datasets: tuple
train, validation, test splits of data as
deepchem.data.datasets.Dataset instances.
transformers: list
deepchem.trans.transformers.Transformer instances applied
to dataset.
QM7 is a subset of GDB-13 (a database of nearly 1 billion
stable and synthetically accessible organic molecules)
containing up to 7 heavy atoms C, N, O, and S. The 3D
Cartesian coordinates of the most stable conformations and
their atomization energies were determined using ab-initio
density functional theory (PBE0/tier2 basis set). This dataset
also provides Coulomb matrices as calculated in [Rupp et al.
PRL, 2012].
Stratified splitting is recommended for this dataset.
The data file (.mat format; Python users can load it with
scipy.io.loadmat) contains five arrays:
“X” - (7165 x 23 x 23), Coulomb matrices
“T” - (7165), atomization energies (unit: kcal/mol)
“P” - (5 x 1433), cross-validation splits as used in [Montavon et al.
NIPS, 2012]
“Z” - (7165 x 23), atomic charges
“R” - (7165 x 23 x 3), Cartesian coordinates (unit: Bohr) of each atom in
the molecules
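The Coulomb matrix mentioned above can be rebuilt from the atomic charges and coordinates. The sketch below follows the definition commonly attributed to Rupp et al.: diagonal entries are 0.5 * Z_i^2.4 and off-diagonal entries are Z_i * Z_j / |R_i - R_j|. The function name is hypothetical, and the code assumes any padding rows (atoms with zero charge) have already been stripped from a molecule's "Z" and "R" entries.

```python
import math


def coulomb_matrix(charges, coords):
    """Coulomb matrix for a single molecule.

    charges: nuclear charges Z_i, one per atom
    coords:  (x, y, z) positions in Bohr, one per atom
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: fitted self-interaction of atom i.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Coulomb repulsion between nuclei i and j.
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Hypothetical two-atom example: two hydrogen nuclei 1.4 Bohr apart.
h2_matrix = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
```

The result is symmetric by construction, so only the upper triangle would need to be stored in practice.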
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Note
DeepChem 2.4.0 has turned on sanitization for this dataset by
default. For the QM7 dataset, this means that calling this
function will return 6838 compounds instead of 7160 in the source
dataset file. This appears to be due to valence specification
mismatches in the dataset that weren’t caught in earlier more lax
versions of RDKit. Note that this may subtly affect benchmarking
results on this dataset.
QM8 is the dataset used in a study on modeling quantum
mechanical calculations of electronic spectra and excited
state energy of small molecules. Multiple methods, including
time-dependent density functional theories (TDDFT) and
second-order approximate coupled-cluster (CC2), are applied to
a collection of molecules that include up to eight heavy atoms
(also a subset of the GDB-17 database). In our collection,
there are four excited state properties calculated by four
different methods on 22 thousand samples:
S0 -> S1 transition energy E1 and the corresponding oscillator strength f1
S0 -> S2 transition energy E2 and the corresponding oscillator strength f2
E1, E2, f1, f2 are in atomic units. f1, f2 are in length representation
Random splitting is recommended for this dataset.
The source data contain:
qm8.sdf: molecular structures
qm8.sdf.csv: tables for molecular properties
Column 1: Molecule ID (gdb9 index) mapping to the .sdf file
Columns 2-5: RI-CC2/def2TZVP
Columns 6-9: LR-TDPBE0/def2SVP
Columns 10-13: LR-TDPBE0/def2TZVP
Columns 14-17: LR-TDCAM-B3LYP/def2TZVP
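The column layout above can be captured in a small lookup. This sketch maps a 1-based column number of qm8.sdf.csv to the method that produced it. The dictionary and function names are hypothetical, and the order of E1, E2, f1, f2 within each group of four is deliberately left out because it is not stated here.

```python
# 1-based column ranges for each method, as listed above.
QM8_METHOD_COLUMNS = {
    "RI-CC2/def2TZVP": range(2, 6),
    "LR-TDPBE0/def2SVP": range(6, 10),
    "LR-TDPBE0/def2TZVP": range(10, 14),
    "LR-TDCAM-B3LYP/def2TZVP": range(14, 18),
}


def method_for_column(column):
    """Return the method for a 1-based column, or None for the molecule ID column."""
    if column == 1:
        return None
    for method, columns in QM8_METHOD_COLUMNS.items():
        if column in columns:
            return method
    raise ValueError(f"column {column} is outside the 17 documented columns")
```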
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Note
DeepChem 2.4.0 has turned on sanitization for this dataset by
default. For the QM8 dataset, this means that calling this
function will return 21747 compounds instead of 21786 in the source
dataset file. This appears to be due to valence specification
mismatches in the dataset that weren’t caught in earlier more lax
versions of RDKit. Note that this may subtly affect benchmarking
results on this dataset.
QM9 is a comprehensive dataset that provides geometric, energetic,
electronic and thermodynamic properties for a subset of the GDB-17
database, comprising 134 thousand stable organic molecules with up
to 9 heavy atoms. All molecules are modeled using density
functional theory (B3LYP/6-31G(2df,p) based DFT).
Random splitting is recommended for this dataset.
The source data contain:
qm9.sdf: molecular structures
qm9.sdf.csv: tables for molecular properties
“mol_id” - Molecule ID (gdb9 index) mapping to the .sdf file
“A” - Rotational constant (unit: GHz)
“B” - Rotational constant (unit: GHz)
“C” - Rotational constant (unit: GHz)
“mu” - Dipole moment (unit: D)
“alpha” - Isotropic polarizability (unit: Bohr^3)
“homo” - Highest occupied molecular orbital energy (unit: Hartree)
“lumo” - Lowest unoccupied molecular orbital energy (unit: Hartree)
“gap” - Gap between HOMO and LUMO (unit: Hartree)
“r2” - Electronic spatial extent (unit: Bohr^2)
“zpve” - Zero point vibrational energy (unit: Hartree)
“u0” - Internal energy at 0K (unit: Hartree)
“u298” - Internal energy at 298.15K (unit: Hartree)
“h298” - Enthalpy at 298.15K (unit: Hartree)
“g298” - Free energy at 298.15K (unit: Hartree)
“cv” - Heat capacity at 298.15K (unit: cal/(mol*K))
“u0_atom” - Atomization energy at 0K (unit: kcal/mol)
“u298_atom” - Atomization energy at 298.15K (unit: kcal/mol)
“h298_atom” - Atomization enthalpy at 298.15K (unit: kcal/mol)
“g298_atom” - Atomization free energy at 298.15K (unit: kcal/mol)
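Because the properties above mix Hartree and kcal/mol, a unit conversion is often the first step before comparing them. The sketch below derives the HOMO-LUMO gap from the “homo” and “lumo” columns and converts it to other units. The conversion factors are rounded standard values, and the function name and input values are hypothetical.

```python
HARTREE_TO_EV = 27.211386             # rounded conversion factor
HARTREE_TO_KCAL_PER_MOL = 627.509474  # rounded conversion factor


def frontier_gap(homo, lumo):
    """Return the HOMO-LUMO gap in Hartree, eV, and kcal/mol.

    homo, lumo: orbital energies in Hartree, as in the "homo" and "lumo" columns.
    """
    gap = lumo - homo
    return {
        "hartree": gap,
        "ev": gap * HARTREE_TO_EV,
        "kcal_per_mol": gap * HARTREE_TO_KCAL_PER_MOL,
    }


# Hypothetical orbital energies, not taken from the dataset.
example_gap = frontier_gap(homo=-0.25, lumo=0.05)
```

The “gap” column should agree with the Hartree value computed this way, which makes it a convenient sanity check after loading.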
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
Note
DeepChem 2.4.0 has turned on sanitization for this dataset by
default. For the QM9 dataset, this means that calling this
function will return 132480 compounds instead of 133885 in the
source dataset file. This appears to be due to valence
specification mismatches in the dataset that weren’t caught in
earlier more lax versions of RDKit. Note that this may subtly
affect benchmarking results on this dataset.
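To put the sanitization notes for QM7, QM8, and QM9 in perspective, the short calculation below derives how many compounds are dropped and what share of the source file that represents. The counts are the ones quoted in the notes; the names are hypothetical.

```python
# (compounds in source file, compounds returned after sanitization)
SANITIZATION_COUNTS = {
    "QM7": (7160, 6838),
    "QM8": (21786, 21747),
    "QM9": (133885, 132480),
}


def dropped_summary(counts):
    """Map each dataset to (number dropped, percentage of the source file dropped)."""
    summary = {}
    for name, (source, returned) in counts.items():
        dropped = source - returned
        summary[name] = (dropped, 100.0 * dropped / source)
    return summary


for name, (dropped, percent) in dropped_summary(SANITIZATION_COUNTS).items():
    print(f"{name}: {dropped} compounds dropped ({percent:.2f}% of source)")
```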
The Free Solvation Database, FreeSolv(SAMPL), provides experimental and
calculated hydration free energy of small molecules in water. The calculated
values are derived from alchemical free energy calculations using molecular
dynamics simulations. The experimental values are included in the benchmark
collection.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“iupac” - IUPAC name of the compound
“smiles” - SMILES representation of the molecular structure
“expt” - Measured solvation energy (unit: kcal/mol) of the compound,
used as label
“calc” - Calculated solvation energy (unit: kcal/mol) of the compound
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
The Side Effect Resource (SIDER) is a database of marketed
drugs and adverse drug reactions (ADR). The version of the
SIDER dataset in DeepChem has grouped drug side effects into
27 system organ classes following MedDRA classifications
measured for 1427 approved drugs.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles”: SMILES representation of the molecular structure
“Hepatobiliary disorders” ~ “Injury, poisoning and procedural
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
The “Toxicology in the 21st Century” (Tox21) initiative created a public
database measuring toxicity of compounds, which has been used in the 2014
Tox21 Data Challenge. This dataset contains qualitative toxicity measurements
for 8k compounds on 12 different targets, including nuclear receptors and
stress response pathways.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles” - SMILES representation of the molecular structure
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
tasks (List[str], (optional)) – Specify the set of tasks to load. If no task is specified, then it loads
ToxCast is an extended data collection from the same
initiative as Tox21, providing toxicology data for a large
library of compounds based on in vitro high-throughput
screening. The processed collection includes qualitative
results of over 600 experiments on 8k compounds.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles”: SMILES representation of the molecular structure
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
The USPTO dataset consists of over 1.8 million organic chemical reactions
extracted from US patents and patent applications. The dataset contains the
reactions in the form of reaction SMILES, which have the general format:
reactant>reagent>product.
MolNet provides the ability to load subsets of the USPTO dataset, namely MIT,
STEREO and 50K. The MIT dataset contains around 479K reactions, curated by
Jin et al. The STEREO dataset contains around 1 million reactions; it does
not have duplicates and the reactions include stereochemical information.
The 50K dataset contains 50,000 reactions and is the benchmark for
retrosynthesis predictions. The reactions are additionally classified into 10
reaction classes. The canonicalized version of the dataset used by the loader
is the same as that used by Somnath et al.
The loader uses the SpecifiedSplitter to reproduce the splits specified
by Schwaller et al. and Dai et al. Custom splitters could also be used. There
is a toggle in the loader to skip the source/target transformation needed for
seq2seq tasks. There is an additional toggle to load the dataset with the
reagents and reactants separated or mixed. This alters the entries in source
by replacing the ‘>’ with ‘.’, effectively loading them as a unified
SMILES string.
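The reaction SMILES format and the separated/mixed toggle described above can be illustrated without the loader. The sketch below is an assumption about the general idea, not the loader's actual implementation. The function names are hypothetical, and an empty reagent field is skipped when mixing to avoid a trailing ‘.’.

```python
def split_reaction_smiles(reaction):
    """Split 'reactant>reagent>product' into its three fields."""
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got {len(parts) - 1}")
    reactants, reagents, products = parts
    return reactants, reagents, products


def build_source_target(reaction, sep_reagent=True):
    """Return (source, target) strings for a seq2seq-style task.

    sep_reagent=True keeps reactants and reagents apart as 'reactants>reagents';
    sep_reagent=False joins them with '.' into one unified SMILES string.
    """
    reactants, reagents, products = split_reaction_smiles(reaction)
    if sep_reagent:
        source = f"{reactants}>{reagents}"
    else:
        source = ".".join(part for part in (reactants, reagents) if part)
    return source, products


# Illustrative reaction: acid-catalysed esterification of acetic acid with ethanol.
reaction = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC"
separated = build_source_target(reaction, sep_reagent=True)
mixed = build_source_target(reaction, sep_reagent=False)
```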
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
subset (str (default 'MIT')) – Subset of dataset to download. ‘FULL’, ‘MIT’, ‘STEREO’, and ‘50K’ are supported.
sep_reagent (bool (default True)) – Toggle to load dataset with reactants and reagents either separated or mixed.
skip_transform (bool (default True)) – Toggle to skip the source/target transformation.
Returns:
tasks, datasets, transformers –
tasks: list
Column names corresponding to machine learning target variables.
datasets: tuple
train, validation, test splits of data as
deepchem.data.datasets.Dataset instances.
transformers: list
deepchem.trans.transformers.Transformer instances applied
to dataset.
The UV dataset is an in-house dataset from Merck that was first introduced in the following paper:
Ramsundar, Bharath, et al. “Is multitask deep learning practical for pharma?.” Journal of chemical information and modeling 57.8 (2017): 2068-2076.
The UV dataset tests 10,000 of Merck’s internal compounds on
190 absorption wavelengths between 210 and 400 nm. Unlike
most of the other datasets featured in MoleculeNet, the UV
collection does not have structures for the compounds tested
since they were proprietary Merck compounds. However, the
collection does feature pre-computed descriptors for these
compounds.
Note that the original train/valid/test split from the source
data was preserved here, so this function doesn’t allow for
alternate modes of splitting. Similarly, since the source data
came pre-featurized, it is not possible to apply alternative
featurizations.
Parameters:
shard_size (int, optional) – Size of the DiskDataset shards to write on disk
featurizer (optional) – Ignored, since the featurization is pre-computed
split (optional) – Ignored, since the split is pre-computed
reload (bool, optional) – Whether to automatically reload from disk
ZINC15 is a dataset of over 230 million purchasable compounds for
virtual screening of small molecules to identify structures that
are likely to bind to drug targets. ZINC15 data is currently available
in 2D (SMILES string) format.
MolNet provides subsets of 250K, 1M, and 10M “lead-like” compounds
from ZINC15. The full dataset of 270M “goldilocks” compounds is also
available. Compounds in ZINC15 are labeled by their molecular weight
and LogP (solubility) values. Each compound also has information about how
readily available (purchasable) it is and its reactivity. Lead-like
compounds have molecular weight between 300 and 350 Daltons and LogP
between -1 and 3.5. Goldilocks compounds are lead-like compounds with
LogP values further restricted to between 2 and 3.
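The lead-like and goldilocks ranges above can be written as two simple predicates. This is a sketch only: treating the bounds as inclusive is an assumption, and the function names and example compounds are hypothetical.

```python
def is_lead_like(mol_wt: float, logp: float) -> bool:
    """Molecular weight in [300, 350] Daltons and LogP in [-1, 3.5]."""
    return 300 <= mol_wt <= 350 and -1 <= logp <= 3.5


def is_goldilocks(mol_wt: float, logp: float) -> bool:
    """Lead-like, with LogP further restricted to [2, 3]."""
    return is_lead_like(mol_wt, logp) and 2 <= logp <= 3


def classify(mol_wt: float, logp: float) -> str:
    if is_goldilocks(mol_wt, logp):
        return "goldilocks"
    if is_lead_like(mol_wt, logp):
        return "lead-like"
    return "neither"


# Made-up (name, molecular weight, LogP) triples for illustration.
compounds = [("a", 320.4, 2.5), ("b", 310.0, 0.2), ("c", 410.7, 2.5)]
labels = {name: classify(mw, lp) for name, mw, lp in compounds}
print(labels)
```

Every goldilocks compound is also lead-like, so the check for the narrower class has to come first.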
If reload = True and data_dir (save_dir) is specified, the loader
will attempt to load the raw dataset (featurized dataset) from disk.
Otherwise, the dataset will be downloaded from the DeepChem AWS bucket.
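The reload behaviour follows a common cache-or-build pattern. The sketch below illustrates that pattern in general terms and is not DeepChem's implementation: `load_with_cache`, the `dataset.json` file name, and the JSON format are all assumptions made for the example, and `build_dataset` stands in for the download and featurization step.

```python
import json
import os
import tempfile


def load_with_cache(save_dir, build_dataset, reload=True):
    """Return (dataset, source), where source is 'disk' or 'built'."""
    cache_file = os.path.join(save_dir, "dataset.json") if save_dir else None
    # Reuse the cached copy only when reloading is enabled and a cache exists.
    if reload and cache_file and os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f), "disk"
    dataset = build_dataset()  # stands in for download + featurization
    if reload and cache_file:
        os.makedirs(save_dir, exist_ok=True)
        with open(cache_file, "w") as f:
            json.dump(dataset, f)
    return dataset, "built"


with tempfile.TemporaryDirectory() as tmp:
    first = load_with_cache(tmp, lambda: [1.0, 2.0, 3.0])
    second = load_with_cache(tmp, lambda: [1.0, 2.0, 3.0])
print(first[1], second[1])
```

The first call builds the data and writes the cache; the second call finds the cached file and skips the build.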
Parameters:
featurizer (Featurizer or str) – the featurizer to use for processing the data. Alternatively you can pass
one of the names from dc.molnet.featurizers as a shortcut.
splitter (Splitter or str) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data
will be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data. Each one is specified by a
TransformerGenerator or, as a shortcut, one of the names from
dc.molnet.transformers.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str) – a directory to save the dataset in
size (str (default '250K')) – Size of dataset to download. ‘250K’, ‘1M’, ‘10M’, and ‘270M’ are supported.
format (str (default '2D')) – Format of data to download. 2D SMILES strings or 3D SDF files.
tasks (List[str], optional, default [‘molwt’, ‘logp’, ‘reactive’]) – Specify the set of tasks to load. If no task is specified, the default
set of tasks is loaded: molwt, logp, and reactive.
Returns:
tasks, datasets, transformers –
tasks (list) – Column names corresponding to machine learning target variables.
datasets (tuple) – train, validation, test splits of data as
deepchem.data.datasets.Dataset instances.
transformers (list) – deepchem.trans.transformers.Transformer instances applied
to dataset.
Return type:
tuple
Notes
The total ZINC dataset with SMILES strings contains hundreds of millions
of compounds and is over 100GB! ZINC250K is recommended for experimentation.
The full set of 270M goldilocks compounds is 23GB.
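The return value described above is a triple whose middle element is itself a tuple of splits. The sketch below shows the unpacking with plain lists as stand-ins, so it runs without downloading anything. `count_samples` and `load_zinc250k` are hypothetical helpers. The call inside `load_zinc250k` assumes that `dc.molnet.load_zinc15` accepts the shortcut names "ECFP" and "random" and that it falls back to the 250K subset when no size is given. It is not executed here, because it needs deepchem and network access.

```python
def count_samples(datasets):
    """Map split names to sample counts for a (train, valid, test) tuple."""
    names = ("train", "valid", "test")
    return {name: len(split) for name, split in zip(names, datasets)}


def load_zinc250k():
    """Download (or reload) the default subset; needs deepchem and network."""
    import deepchem as dc  # imported lazily so the rest runs without it

    tasks, datasets, transformers = dc.molnet.load_zinc15(
        featurizer="ECFP", splitter="random", reload=True)
    return tasks, datasets, transformers


# Stand-in return value: plain lists in place of Dataset objects.
tasks, datasets, transformers = (
    ["molwt", "logp", "reactive"], ([0] * 8, [0], [0]), [])
train, valid, test = datasets
sizes = count_samples(datasets)
print(tasks, sizes)
```

The same unpacking applies to the result of `load_zinc250k()`, whose splits are Dataset objects rather than lists.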
The dataset consists of different configurations of adsorbates (i.e., N and NO)
on a platinum surface, represented as a lattice, together with their formation
energies. There are 648 different adsorbate configurations in this dataset,
represented as Pymatgen Structure objects.
Each Pymatgen Structure object has site_properties with the following keys:
“SiteTypes”, indicating whether a site is an active site “A1” or a spectator
site “S1”.
“oss”, the different occupational sites. For spectator sites, set it to -1.
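To make the two keys concrete, here is a sketch of a `site_properties` dictionary with a small consistency check. The helper `check_site_properties` is hypothetical, and the occupancy codes 0 and 1 used for the active sites are placeholders, since the encoding for active sites is not specified here. A dictionary of this shape is what would be passed as the `site_properties` argument when constructing a Pymatgen Structure.

```python
def check_site_properties(site_properties):
    """Validate the 'SiteTypes' and 'oss' entries against the rules above."""
    site_types = site_properties["SiteTypes"]
    oss = site_properties["oss"]
    if len(site_types) != len(oss):
        raise ValueError("SiteTypes and oss must have one entry per site")
    for index, (kind, occupancy) in enumerate(zip(site_types, oss)):
        if kind not in ("A1", "S1"):
            raise ValueError(f"site {index}: unknown site type {kind!r}")
        # Spectator sites carry no occupancy information, so they use -1.
        if kind == "S1" and occupancy != -1:
            raise ValueError(f"site {index}: spectator site must have oss -1")
    return True


# Two spectator sites followed by two active sites (placeholder codes).
site_properties = {
    "SiteTypes": ["S1", "S1", "A1", "A1"],
    "oss": [-1, -1, 0, 1],
}
is_valid = check_site_properties(site_properties)
print(is_valid)
```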
Parameters:
featurizer (Featurizer (default LCNNFeaturizer)) – the featurizer to use for processing the data. Using the
LCNNFeaturizer is recommended.
splitter (Splitter (default RandomSplitter)) – the splitter to use for splitting the data into training, validation, and
test sets. Alternatively you can pass one of the names from
dc.molnet.splitters as a shortcut. If this is None, all the data will
be included in a single dataset.
transformers (list of TransformerGenerators or strings) – the Transformers to apply to the data, chosen to suit the featurizer. No
transformation is required for the LCNNFeaturizer.
reload (bool) – if True, the first call for a particular featurizer and splitter will cache
the datasets to disk, and subsequent calls will reload the cached datasets.
data_dir (str) – a directory to save the raw data in
save_dir (str, optional (default None)) – a directory to save the dataset in