Abstract
Organic synthesis is widely used in drug discovery and development. The intelligent prediction and analysis of
high-throughput coupling reaction yields is an important and challenging research topic in the field of
organic synthesis. However, existing methods focus on prediction itself rather than on studying and interpreting the
internal relationship between reaction conditions and yield. To tackle this problem, an intelligent analysis model for
organic chemical synthesis that combines topological data analysis (TDA) with the Light Gradient Boosting Machine
(LightGBM), named OCS-TGBM, is proposed to explore the internal relationship between reaction conditions and yield and to
identify high-yield reaction conditions and their combinations. To further enhance the performance of the OCS-TGBM
model, a stratified diversity sampling strategy is introduced. Experimental results show that the OCS-TGBM model is
superior to other methods in analyzing and predicting the reaction performance of high-throughput organic chemical
synthesis. It also provides intelligent assistance for the optimal design of reaction systems and the evaluation of
reaction conditions, thereby accelerating the drug discovery and development process.