Accurate and interpretable prediction of chemical oxygen demand using explainable boosting algorithms with SHAP analysis

Di Nunno F.; Granata F.
2026-01-01

Abstract

Accurate prediction of chemical oxygen demand (COD) is vital for effective water quality management and pollution control. This study compares six ensemble boosting models (AdaBoost, CatBoost, XGBoost, LightGBM, HistGBRT, and NGBoost) for estimating COD from multiple water quality parameters, including pH, dissolved oxygen, suspended solids, and specific conductance. Data from two monitoring stations in South Korea (Toilchun and Hwangji) were used to train and validate the models. Model performance was evaluated using the root mean square error (RMSE), mean absolute error (MAE), correlation coefficient (R), Nash-Sutcliffe efficiency (NSE), and percent bias (PBIAS), while interpretability was assessed through SHapley Additive exPlanations (SHAP). NGBoost achieved the highest predictive accuracy at Toilchun (R = 0.979, NSE = 0.958, RMSE = 0.397 mg/L), while CatBoost performed best at Hwangji (R = 0.861, NSE = 0.733, RMSE = 0.477 mg/L). Because NGBoost outputs a predictive probability distribution rather than a single estimate, its results also convey predictive uncertainty, supporting a more robust quantification of COD variability. SHAP analysis identified total organic carbon (TOC), biochemical oxygen demand (BOD₅), and suspended solids (SS) as the most influential variables controlling COD dynamics.
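The five evaluation metrics named in the abstract can be illustrated with a minimal sketch. The function name `evaluate`, the sample values, and the PBIAS sign convention (observed minus simulated, so positive values indicate underestimation) are assumptions made here for illustration; the abstract does not state the formulas the authors used.

```python
import math


def evaluate(obs, sim):
    """Return RMSE, MAE, R, NSE and PBIAS for paired observed/simulated values.

    Hypothetical helper, not taken from the paper. PBIAS is computed as
    100 * sum(obs - sim) / sum(obs), which is one common convention.
    """
    if len(obs) != len(sim) or len(obs) < 2:
        raise ValueError("obs and sim must have the same length (>= 2)")

    n = len(obs)
    mean_obs = sum(obs) / n
    mean_sim = sum(sim) / n

    errors = [s - o for o, s in zip(obs, sim)]
    sse = sum(e * e for e in errors)  # sum of squared errors

    rmse = math.sqrt(sse / n)
    mae = sum(abs(e) for e in errors) / n

    # Pearson correlation coefficient between observations and simulations.
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    r = cov / math.sqrt(var_obs * var_sim)

    # Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the
    # skill of always predicting the observed mean.
    nse = 1.0 - sse / var_obs

    # Percent bias under the assumed sign convention.
    pbias = 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)

    return {"RMSE": rmse, "MAE": mae, "R": r, "NSE": nse, "PBIAS": pbias}


# Made-up COD values in mg/L, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
simulated = [2.5, 3.5, 6.5, 7.5]
metrics = evaluate(observed, simulated)
```

The sketch assumes the observed series is not constant and does not sum to zero; otherwise R, NSE, or PBIAS would involve a division by zero.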

Files in this item:
File: Accurate and interpretable prediction of chemical oxygen demand using explainable boosting algorithms with Shap analysis.pdf
Access: open access
License: Creative Commons
Size: 6.37 MB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11580/123229