TY - JOUR
T1 - Machine Learning Approach for the Estimation of Henry’s Law Constant Based on Molecular Descriptors
AU - Ullah, Atta
AU - Shaheryar, Muhammad
AU - Lim, Ho Jin
N1 - Publisher Copyright: © 2024 by the authors.
PY - 2024/6
Y1 - 2024/6
N2 - In atmospheric chemistry, the Henry’s law constant (HLC) is crucial for understanding the distribution of organic compounds across gas, particle, and aqueous phases. Quantitative structure–property relationship (QSPR) models described in scientific research are generally tailored to specific groups or categories of substances and are often developed using a limited set of experimental data. This study developed a machine learning model using an extensive dataset of experimental HLCs for approximately 1100 organic compounds. Molecular descriptors calculated using alvaDesc software (v 2.0) were used to train the models. A hybrid approach was adopted for feature selection, ensuring alignment with the domain knowledge. Based on the root mean squared error (RMSE) of the training and test data after cross-validation, Gradient Boosting (GB) was selected as a model for predicting HLC. The hyperparameters of the selected model were optimized using the automated hyperparameter optimization framework Optuna. The impact of features on the target variable was assessed using the SHapley Additive exPlanations (SHAP). The optimized model demonstrated strong performance across the training, evaluation, and test datasets, achieving coefficients of determination (R²) of 0.96, 0.78, and 0.74, respectively. The developed model was used to estimate the HLC of compounds associated with carbon capture and storage (CCS) emissions and secondary organic aerosols.
AB - In atmospheric chemistry, the Henry’s law constant (HLC) is crucial for understanding the distribution of organic compounds across gas, particle, and aqueous phases. Quantitative structure–property relationship (QSPR) models described in scientific research are generally tailored to specific groups or categories of substances and are often developed using a limited set of experimental data. This study developed a machine learning model using an extensive dataset of experimental HLCs for approximately 1100 organic compounds. Molecular descriptors calculated using alvaDesc software (v 2.0) were used to train the models. A hybrid approach was adopted for feature selection, ensuring alignment with the domain knowledge. Based on the root mean squared error (RMSE) of the training and test data after cross-validation, Gradient Boosting (GB) was selected as a model for predicting HLC. The hyperparameters of the selected model were optimized using the automated hyperparameter optimization framework Optuna. The impact of features on the target variable was assessed using the SHapley Additive exPlanations (SHAP). The optimized model demonstrated strong performance across the training, evaluation, and test datasets, achieving coefficients of determination (R²) of 0.96, 0.78, and 0.74, respectively. The developed model was used to estimate the HLC of compounds associated with carbon capture and storage (CCS) emissions and secondary organic aerosols.
KW - atmospheric chemistry
KW - Henry’s law constant
KW - machine learning
KW - molecular descriptors
UR - https://www.scopus.com/pages/publications/85197270306
U2 - 10.3390/atmos15060706
DO - 10.3390/atmos15060706
M3 - Article
AN - SCOPUS:85197270306
SN - 2073-4433
VL - 15
JO - Atmosphere
JF - Atmosphere
IS - 6
M1 - 706
ER -