Hi folks, my name is Selin Cetin and I’m a rising senior studying materials science and engineering at Northwestern University. This summer I’ve been working with Dr. Santanu Chaudhuri in the Applied Materials Division on a machine learning project. Broadly, the goal is to predict the properties of plant-based building blocks from their structures. The project originally focused on cellulose, with the aim of predicting the properties of cellulosic materials from their structures, but the complexity of cellulose’s structure made this infeasible. We then shifted our focus to a more fundamental version of the task: learning the properties of “plant-based building blocks”. This term covers a range of small carbohydrate molecules found in plants, including glucose, fructose, starch, and small portions of cellulose.
As climate concerns mount, the pressure to move toward sustainable manufacturing practices grows ever greater. Products containing fossil-fuel-based plastics attract particular scrutiny because of the significant amount of greenhouse gases released during their life cycle. One materials solution is to replace components in these products, if not the entire product, with biodegradable materials. Another area of interest in sustainability research is the circular economy, in which product components are continuously reused and recycled to suit a range of uses. To achieve these two goals, increased use of biomaterials in traditionally plastic products and the development of circular economies around biomaterial-based products, the “tunability” of materials must be understood. For example, if adding a certain functional group to a material decreases its tensile strength but increases its solubility, that relationship is important to uncover so that materials can be recycled effectively to suit different applications.
To implement machine learning, one must have a database on which to learn, and creating this database was the first part of my project. The property we are targeting is melting point. It was difficult to procure a large structure-property database containing only plant-based carbohydrates, so we chose melting point because it is the property most frequently reported for the molecules of interest and therefore yields the largest possible database. Future work may build on this project by targeting a different property if enough data becomes available. To create the database, I used a web API to import a structure representation and the melting point of plant-based carbohydrates from PubChem [1]. The structure representations came in the form of SMILES strings, which are text-based representations of molecules. An example of a SMILES string, for the molecule glucose, can be seen in Figure 1 [2]. This dataset was quite small, so I supplemented it with data from the Jean-Claude Bradley Open Melting Point dataset [3] and with data from ChemSpider [4], which had to be added by hand.
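As a rough illustration of the kind of web API query described above, here is a minimal Python sketch. It assumes PubChem's PUG REST and PUG View endpoints; the post does not say which endpoints or property names were actually used, so the URL patterns, the `CanonicalSMILES` property name, and the helper names are all assumptions. The sample reply is hand-written to mimic the shape of a PUG REST property response, so the example runs without network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URLs for PubChem's programmatic interfaces.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"


def smiles_url(name):
    """Build a PUG REST URL asking for a SMILES string by compound name."""
    return f"{PUG_REST}/compound/name/{quote(name)}/property/CanonicalSMILES/JSON"


def melting_point_url(cid):
    """Build a PUG View URL asking for the 'Melting Point' section of a record."""
    return f"{PUG_VIEW}/data/compound/{cid}/JSON?heading=Melting+Point"


def fetch_json(url):
    """Download and decode a JSON reply (needs network access; not called below)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def parse_smiles(payload):
    """Return (CID, SMILES) from a PUG REST property reply."""
    record = payload["PropertyTable"]["Properties"][0]
    # Accept any SMILES-like key, since the exact property name may differ.
    smiles = next(value for key, value in record.items() if "SMILES" in key)
    return record["CID"], smiles


# Hand-written stand-in for a reply about D-glucose (CID 5793), for illustration.
SAMPLE_REPLY = {
    "PropertyTable": {
        "Properties": [
            {"CID": 5793, "CanonicalSMILES": "C(C1C(C(C(C(O1)O)O)O)O)O"}
        ]
    }
}

cid, smiles = parse_smiles(SAMPLE_REPLY)
print(cid, smiles)
print(melting_point_url(cid))
```

In a real run, `fetch_json(smiles_url("D-glucose"))` would stand in for `SAMPLE_REPLY`. Melting point replies are more deeply nested and often contain several reported values per compound, so they would need extra cleaning before being added to a database.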
For training the model, I will also use the ChEMBL dataset [5], a much larger dataset of drug-like molecules with bioactive properties. This should improve the model’s ability to isolate important features from the molecules.
We are able to use a larger dataset for much of the training process because of the role feature extraction plays in machine learning. The structures of the molecules will first be featurized using either an Extended-Connectivity Fingerprint (ECFP) representation or a graph-based representation. The extraction process for an ECFP representation is shown in Figure 2 [6].
These representations can be produced from the SMILES strings contained in the database created from PubChem. The purpose of doing this is to format the data in a way that is easier for a machine learning model to learn from. The model can then learn to connect the featurized structure of a molecule to the property of interest. It is important to note that machine learning does not elucidate the underlying physics of the structure-property relationships, but instead reveals patterns that researchers may then investigate further.
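To make the idea of a circular fingerprint more concrete, here is a toy Python sketch. It is a simplified, ECFP-inspired illustration and not the real ECFP algorithm: it takes a hand-coded molecular graph instead of parsing a SMILES string, uses only the element symbol and number of neighbors as the starting atom identifier, and skips the duplicate-removal step. The function name, hashing choice, and bit-vector length are all assumptions made for this example. In practice, a cheminformatics toolkit would normally generate the fingerprints.

```python
import zlib


def _hash(obj):
    """Deterministic integer hash of a small Python object."""
    return zlib.crc32(repr(obj).encode("utf-8"))


def circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Toy circular fingerprint.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (atom_index, atom_index, bond_order) tuples
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbors[i].append((order, j))
        neighbors[j].append((order, i))

    # Radius 0: each atom starts with an identifier built from local facts.
    ids = [_hash((symbol, len(neighbors[i]))) for i, symbol in enumerate(atoms)]
    features = set(ids)

    # Each round folds the neighbors' identifiers into the atom's own, so the
    # identifier describes a neighborhood that is one bond larger than before.
    for layer in range(1, radius + 1):
        new_ids = []
        for i in range(len(atoms)):
            environment = sorted((order, ids[j]) for order, j in neighbors[i])
            new_ids.append(_hash((layer, ids[i], environment)))
        ids = new_ids
        features.update(ids)

    # Fold every collected identifier into a fixed-length bit vector.
    bits = [0] * n_bits
    for feature in features:
        bits[feature % n_bits] = 1
    return bits


# Ethanol (C-C-O) written with two different atom orderings.
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_reordered = circular_fingerprint(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
print(sum(ethanol), "bits set;", "order-independent:", ethanol == ethanol_reordered)
```

Because the identifiers depend only on each atom's chemistry and surroundings, not on its position in the input list, the same molecule gives the same bit vector however its atoms are ordered. That property is what lets a model compare molecules of different sizes using one fixed-length input.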
Discussion with Dr. Prasanna Balaprakash revealed that much of the work done within a molecular machine learning model is simply identifying which features are important and which are not. This can be generalized beyond plant-based carbohydrates and their melting points. It is likely that the features that matter for property prediction remain largely the same across most organic molecules; what changes is how those features influence the property of interest. Because of this, the final few layers of a model can be optimized for a particular dataset, in this case the melting point dataset of plant-based carbohydrates, while the bulk of the model, which may have been trained on a larger dataset to predict a different property, remains the same. A visualization of this may be seen in Figure 3 [7].
This approach is called transfer learning, and it has been very successful in the field of image processing. It has also been applied successfully to images of molecules and raw SMILES strings, as seen in the paper “ChemNet: A Transferable and Generalizable Deep Neural Network for Small-Molecule Property Prediction” [8]. It has been used with varying degrees of success with other representations of molecules; Hu et al. found that pre-training on graph representations needed to be done at both the node level and the full graph level to be effective [9]. I currently intend to attempt transfer learning with ECFP molecule representations.
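The idea of keeping the bulk of a model fixed and retraining only its final layer can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the model planned for this project: the "pretrained" feature extractor is a single layer of random weights standing in for layers learned on a larger dataset, the inputs are random bit vectors standing in for fingerprints, and the targets are synthetic numbers rather than real melting points. All names and sizes are hypothetical.

```python
import math
import random

rng = random.Random(0)
N_BITS, HIDDEN = 16, 8

# Stand-in for layers pretrained on a large dataset; never updated below.
W_frozen = [[rng.gauss(0, 0.5) for _ in range(N_BITS)] for _ in range(HIDDEN)]


def extract(x):
    """Frozen feature extractor: one tanh layer with fixed weights."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_frozen]


def predict(head, bias, x):
    """Trainable linear head applied on top of the frozen features."""
    return sum(h * f for h, f in zip(head, extract(x))) + bias


def mse(head, bias, X, y):
    """Mean squared error of the full model on a dataset."""
    return sum((predict(head, bias, x) - t) ** 2 for x, t in zip(X, y)) / len(X)


def fine_tune(X, y, lr=0.05, epochs=200):
    """Fit only the head and bias by gradient descent; W_frozen is untouched."""
    feats = [extract(x) for x in X]  # frozen features can be computed once
    head, bias, n = [0.0] * HIDDEN, 0.0, len(X)
    for _ in range(epochs):
        errs = [sum(h * f for h, f in zip(head, fx)) + bias - t
                for fx, t in zip(feats, y)]
        grad_head = [2 / n * sum(e * fx[k] for e, fx in zip(errs, feats))
                     for k in range(HIDDEN)]
        grad_bias = 2 / n * sum(errs)
        head = [h - lr * g for h, g in zip(head, grad_head)]
        bias -= lr * grad_bias
    return head, bias


# Small synthetic "target" dataset: random bit vectors and made-up targets.
X = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(20)]
y = [0.1 * sum(x) + 0.5 for x in X]

head, bias = fine_tune(X, y)
print("loss before:", mse([0.0] * HIDDEN, 0.0, X, y))
print("loss after: ", mse(head, bias, X, y))
```

Only `HIDDEN + 1` numbers are learned from the small dataset here, which is the appeal of this approach when target data is scarce. Whether features learned on a dataset like ChEMBL transfer well to carbohydrate melting points is the open question the project sets out to test.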
I’ve enjoyed my time in the NAISE program this summer, and would like to thank Dr. Santanu Chaudhuri, Dr. Prasanna Balaprakash, Xiaoli Yan, and Jennifer Dunn for all their guidance. I look forward to digging into the machine learning model for the remainder of the month, and hope that it will be a useful tool that can be expanded and explored in the future by others.
1. PubChem. (n.d.). PubChem. Retrieved August 11, 2021, from https://pubchem.ncbi.nlm.nih.gov/
2. PubChem. (n.d.). D-Glucose. Retrieved August 6, 2021, from https://pubchem.ncbi.nlm.nih.gov/compound/5793
3. Bradley, J.-C., Williams, A., & Lang, A. (2014). Jean-Claude Bradley Open Melting Point Dataset [Data set]. figshare. https://doi.org/10.6084/M9.FIGSHARE.1031637
4. ChemSpider | Search and share chemistry. (n.d.). Retrieved August 11, 2021, from http://www.chemspider.com/
5. ChEMBL Database. (n.d.). Retrieved August 11, 2021, from https://www.ebi.ac.uk/chembl/
6. Tilbec, H. (2018, April 24). Cheminformatics—ECFP & Neural Graph Fingerprint. Medium. https://medium.com/@hacertilbec/cheminformatics-ecfp-neural-graph-fingerprint-c98a98e12b04
7. Yamada, H., Liu, C., Wu, S., Koyama, Y., Ju, S., Shiomi, J., Morikawa, J., & Yoshida, R. (2019). Predicting Materials Properties with Little Data Using Shotgun Transfer Learning. ACS Central Science, 5(10), 1717–1730. https://doi.org/10.1021/acscentsci.9b00804
8. Goh, G. B., Siegel, C., Vishnu, A., & Hodas, N. O. (2017). ChemNet: A Transferable and Generalizable Deep Neural Network for Small-Molecule Property Prediction. https://www.arxiv-vanity.com/papers/1712.02734/
9. Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V., & Leskovec, J. (2020). Strategies for Pre-training Graph Neural Networks. ArXiv:1905.12265 [Cs, Stat]. http://arxiv.org/abs/1905.12265