Accurate and interpretable prediction of chemical oxygen demand using explainable boosting algorithms with SHAP analysis
Di Nunno F.; Granata F.
2026-01-01
Abstract
Accurate prediction of Chemical Oxygen Demand (COD) is vital for effective water quality management and pollution control. This study compares six ensemble boosting models (AdaBoost, CatBoost, XGBoost, LightGBM, HistGBRT, and NGBoost) for estimating COD from multiple water quality parameters, including pH, dissolved oxygen, suspended solids, and specific conductance. Data from two monitoring stations in South Korea (Toilchun and Hwangji) were used to train and validate the models. Model performance was evaluated using the root mean square error (RMSE), mean absolute error (MAE), correlation coefficient (R), Nash-Sutcliffe efficiency (NSE), and percent bias (PBIAS), while interpretability was assessed through SHapley Additive exPlanations (SHAP). NGBoost achieved the highest predictive accuracy at Toilchun (R = 0.979, NSE = 0.958, RMSE = 0.397 mg/L), while CatBoost performed best at Hwangji (R = 0.861, NSE = 0.733, RMSE = 0.477 mg/L). Because NGBoost provides predictive probability distributions rather than single estimates, its results also reflect model uncertainty, supporting a more robust quantification of COD variability. SHAP analysis identified total organic carbon (TOC), biochemical oxygen demand (BOD₅), and suspended solids (SS) as the most influential variables controlling COD dynamics.

| File | Size | Format | |
|---|---|---|---|
| Accurate and interpretable prediction of chemical oxygen demand using explainable boosting algorithms with Shap analysis.pdf (open access; license: Creative Commons) | 6.37 MB | Adobe PDF | View/Open |
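
The five evaluation metrics named in the abstract (RMSE, MAE, R, NSE, PBIAS) can be illustrated with a minimal sketch. This is not the authors' code: the function name `regression_metrics` and the sample values are hypothetical, and the PBIAS sign convention used here (positive when the model underestimates) is an assumption, since the abstract does not state which convention the study follows.

```python
import math


def regression_metrics(obs, sim):
    """Compute RMSE, MAE, Pearson R, NSE and PBIAS for paired values.

    obs: observed values (e.g. measured COD in mg/L)
    sim: simulated values from a model, in the same order and units
    """
    if len(obs) != len(sim) or len(obs) < 2:
        raise ValueError("obs and sim must have equal length of at least 2")

    n = len(obs)
    errors = [s - o for o, s in zip(obs, sim)]
    sq_err_sum = sum(e * e for e in errors)

    rmse = math.sqrt(sq_err_sum / n)
    mae = sum(abs(e) for e in errors) / n

    mean_obs = sum(obs) / n
    mean_sim = sum(sim) / n
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)

    # Pearson correlation; undefined if either series is constant.
    r = cov / math.sqrt(var_obs * var_sim)

    # Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the
    # skill of always predicting the observed mean.
    nse = 1.0 - sq_err_sum / var_obs

    # Percent bias. ASSUMED convention: 100 * sum(obs - sim) / sum(obs),
    # so positive values indicate underestimation.
    pbias = 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)

    return {"RMSE": rmse, "MAE": mae, "R": r, "NSE": nse, "PBIAS": pbias}


# Illustrative values only; these are not data from the study.
example_obs = [2.0, 4.0, 6.0, 8.0]
example_sim = [2.5, 3.5, 6.5, 7.5]
example_metrics = regression_metrics(example_obs, example_sim)
```

RMSE and MAE carry the units of the target (mg/L for COD), whereas R and NSE are dimensionless and PBIAS is a percentage, which is why the abstract reports units only for RMSE.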
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

