Sylvie Nguyen¹,², Stéphanie Aguero Pizzolo¹,², Raphaël Terreux¹,²
¹ National Center for Scientific Research, Lyon, France
² Lyon 1 University, Lyon, France
Abstract. Recent advances in generative models enable the virtual design of molecules with specific functions. Although these models can produce novel structures, the synthetic feasibility of the generated molecules remains a critical challenge. Existing methods estimate synthetic accessibility (SA) using heuristic calculations or deep learning, but their generalization is limited by small training datasets. To address this lack of data, Support Vector Machine (SVM) methods are also used to predict the SA of a molecule, yet they provide only binary classifications (easy/difficult) without quantifying synthesis complexity. We developed a new method based on Support Vector Regression (SVR) that predicts the SA as a continuous score ranging from 1 (easy) to 10 (difficult). This method incorporates human knowledge through two assumptions: (1) molecules composed of previously synthesized fragments should be easier to synthesize, especially if these fragments have been used frequently; and (2) molecules containing previously connected fragment pairs should be easier to synthesize, especially if these connections are common. To distinguish between molecules sharing similar fragments, we apply a structural penalty derived from the SAScore heuristic method [1], which accounts for large rings, non-standard ring fusions, stereocomplexity and molecule size. Molecular descriptors complement these fragment-based features to capture structural diversity. The model was trained and tested on a dataset of 1,770 drug-like molecules assessed by chemists [1, 2], using scaffold-based splitting. It achieved a Q² of 0.88 on the 353-molecule test set and a cross-validated R² of 0.87, showing good predictive performance. Existing methods were then evaluated on the same dataset, and the comparative analysis showed that our SVR approach is more accurate.
[1] P. Ertl and A. Schuffenhauer, "Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions", J. Cheminform., vol. 1, no. 1, Dec. 2009, doi: 10.1186/1758-2946-1-8.
[2] R. P. Sheridan et al., "Modeling a Crowdsourced Definition of Molecular Complexity", J. Chem. Inf. Model., vol. 54, no. 6, pp. 1604-1616, June 2014, doi: 10.1021/ci5001778.
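The two fragment-based assumptions stated in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the fragment names, the toy reference set, the dictionary layout, and the use of a mean log-frequency as the feature are all assumptions made for illustration. It only shows how fragment and fragment-pair frequencies could be turned into numeric features that a regressor (for example an SVR) might consume.

```python
import math
from collections import Counter

def build_counts(reference_molecules):
    """Count fragments and connected fragment pairs in a reference set.

    Hypothetical layout: each molecule is a dict with "fragments"
    (list of fragment names) and "bonds" (list of connected pairs).
    """
    frag_counts, pair_counts = Counter(), Counter()
    for mol in reference_molecules:
        frag_counts.update(mol["fragments"])
        # Sort each pair so (a, b) and (b, a) are counted together.
        pair_counts.update(tuple(sorted(p)) for p in mol["bonds"])
    return frag_counts, pair_counts

def mean_log_frequency(items, counts):
    """Average log(1 + count) over items; unseen items contribute 0."""
    if not items:
        return 0.0
    return sum(math.log1p(counts[i]) for i in items) / len(items)

def fragment_features(mol, frag_counts, pair_counts):
    """Return (fragment feature, fragment-pair feature) for one molecule."""
    pairs = [tuple(sorted(p)) for p in mol["bonds"]]
    return (
        mean_log_frequency(mol["fragments"], frag_counts),
        mean_log_frequency(pairs, pair_counts),
    )

# Toy reference set standing in for previously synthesized molecules.
reference = [
    {"fragments": ["benzene", "amide"],
     "bonds": [("benzene", "amide")]},
    {"fragments": ["benzene", "amide", "pyridine"],
     "bonds": [("amide", "benzene"), ("amide", "pyridine")]},
    {"fragments": ["benzene", "ester"],
     "bonds": [("benzene", "ester")]},
]
frag_counts, pair_counts = build_counts(reference)

# A molecule built from common fragments and a common connection.
common = {"fragments": ["benzene", "amide"],
          "bonds": [("benzene", "amide")]}
# A molecule containing a fragment and a connection never seen before.
rare = {"fragments": ["benzene", "azetidine"],
        "bonds": [("benzene", "azetidine")]}

common_feats = fragment_features(common, frag_counts, pair_counts)
rare_feats = fragment_features(rare, frag_counts, pair_counts)
print("common:", common_feats)
print("rare:  ", rare_feats)
```

Under these assumptions, higher feature values indicate more familiar chemistry, so a regressor trained on chemist-assigned scores would be expected to map them to lower (easier) SA values. The structural penalty and molecular descriptors mentioned in the abstract would be appended as further feature columns and are not sketched here.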
Keywords: Artificial Intelligence; Rational Design; Synthetic Accessibility
| ID: 24, Contact: Aguero Pizzolo STEPHANIE, stephanie.aguero-pizzolo@univ-lyon1.fr | NTREM 2026 |