Safety norms for drug design are strict, with at least three stages of trials. One early test concerns the cardiotoxicity of a molecule, that is, whether the compound blocks any cardiac ion channel. Chemical libraries contain millions of compounds, and screening them in a research laboratory is expensive and time consuming. Accurate a priori, in silico classification of non-blocking molecules can reduce the screening effort for an effective drug by half. A compound also has to be checked for other risk factors alongside its therapeutic effect; these tests can likewise be done on a computer. To enable computer modelling, the molecules are provided in the Simplified Molecular Input Line Entry System (SMILES) format. In this study, the SMILES strings are decoded using the Chemistry Development Kit (CDK), a cheminformatics library written in Java. The kit is accessed from the R statistical software environment through the rJava package, which is in turn wrapped by the rcdk package. The rcdk functions parse the strings representing the molecular structure and return structure-activity descriptors that are known to be good predictors of biological activity. These descriptors, together with the known blocking behaviour of each molecule, form the input to the Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine, Logistic Regression, and Artificial Neural Network algorithms. This paper reports the results of a data analysis project, carried out with freely available software tools, to determine the best subset of molecular descriptors from the large set that is available.
Keywords: Data mining, Bayesian classification problem, random forest, gradient boosting, biochemistry
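The idea of searching for a best subset of descriptors, as described in the abstract, can be illustrated with a small sketch. It is not the authors' method: the study works in R with rcdk descriptors and six classifiers, whereas this example uses Python's standard library, a simple nearest-centroid classifier, exhaustive search over subsets, and leave-one-out accuracy as the score. The descriptor names (`d1`, `d2`, `d3`) and all values are hypothetical, assumed to be already scaled to [0, 1].

```python
from itertools import combinations

# Hypothetical, pre-scaled descriptor values with a blocker label (1 = blocks).
# These rows are illustrative only and are not data from the paper.
NAMES = ("d1", "d2", "d3")
ROWS = [
    ((0.90, 0.70, 0.20), 1),
    ((0.80, 0.60, 0.90), 1),
    ((0.85, 0.80, 0.50), 1),
    ((0.70, 0.40, 0.10), 1),
    ((0.20, 0.30, 0.80), 0),
    ((0.10, 0.50, 0.30), 0),
    ((0.30, 0.20, 0.60), 0),
    ((0.25, 0.45, 0.15), 0),
]


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def loo_accuracy(rows, cols):
    """Leave-one-out accuracy of a nearest-centroid classifier on columns `cols`."""
    hits = 0
    for i, (x, y) in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        cents = {}
        for label in (0, 1):
            pts = [tuple(r[c] for c in cols) for r, lab in train if lab == label]
            cents[label] = centroid(pts)
        probe = tuple(x[c] for c in cols)
        pred = min(cents, key=lambda lab: sq_dist(probe, cents[lab]))
        hits += pred == y
    return hits / len(rows)


def best_subset(rows, names):
    """Try every non-empty descriptor subset, smallest first, and keep the best."""
    best_cols, best_acc = (), -1.0
    for k in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), k):
            acc = loo_accuracy(rows, cols)
            if acc > best_acc:  # strict: ties keep the smaller, earlier subset
                best_cols, best_acc = cols, acc
    return tuple(names[c] for c in best_cols), best_acc


if __name__ == "__main__":
    print(best_subset(ROWS, NAMES))
```

Exhaustive search is only feasible for a handful of descriptors, since the number of subsets doubles with each descriptor added. For a large descriptor set, a greedy or importance-based selection would be the more realistic choice.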
@article{RO_2021__55_5_2769_0,
author = {Toppur, Badri and Jaims, K. J.},
title = {Determining the best set of molecular descriptors for a {Toxicity} classification problem},
journal = {RAIRO. Operations Research},
pages = {2769--2783},
year = {2021},
publisher = {EDP-Sciences},
volume = {55},
number = {5},
doi = {10.1051/ro/2021134},
mrnumber = {4313828},
zbl = {1476.62238},
language = {en},
url = {https://www.numdam.org/articles/10.1051/ro/2021134/}
}
TY  - JOUR
AU  - Toppur, Badri
AU  - Jaims, K. J.
TI  - Determining the best set of molecular descriptors for a Toxicity classification problem
JO  - RAIRO. Operations Research
PY  - 2021
SP  - 2769
EP  - 2783
VL  - 55
IS  - 5
PB  - EDP-Sciences
UR  - https://www.numdam.org/articles/10.1051/ro/2021134/
DO  - 10.1051/ro/2021134
LA  - en
ID  - RO_2021__55_5_2769_0
ER  -
%0 Journal Article
%A Toppur, Badri
%A Jaims, K. J.
%T Determining the best set of molecular descriptors for a Toxicity classification problem
%J RAIRO. Operations Research
%D 2021
%P 2769-2783
%V 55
%N 5
%I EDP-Sciences
%U https://www.numdam.org/articles/10.1051/ro/2021134/
%R 10.1051/ro/2021134
%G en
%F RO_2021__55_5_2769_0
Toppur, Badri; Jaims, K. J. Determining the best set of molecular descriptors for a Toxicity classification problem. RAIRO. Operations Research, Volume 55 (2021) no. 5, pp. 2769-2783. doi: 10.1051/ro/2021134