Climate Variables Prediction Using Time Series Analysis: A Comparative Study of Statistical and Machine Learning Models Under the KDD Methodology
Accurate forecasting of climate variables remains difficult because of their inherent non-linearity, high variability, imbalance in precipitation events, and multi-scale temporal dependencies. This study develops a unified and reproducible forecasting framework grounded in the Knowledge Discovery in Databases (KDD) methodology. The framework integrates classical statistical models (ARIMA, SARIMA), machine learning methods (Linear Regression, Random Forest, XGBoost), deep learning architectures (LSTM in univariate, multivariate, and multi-step configurations), and a hybrid ARIMA+LSTM approach designed to capture both linear and non-linear temporal structure. In addition, a dual classification–regression strategy is implemented to model both precipitation occurrence and intensity, addressing the imbalance and intermittency characteristic of rainfall data. Results show that LSTM-based models significantly outperform classical approaches, achieving up to 25% lower RMSE than ARIMA, particularly at multi-hour forecasting horizons. The hybrid ARIMA+LSTM model yields further reductions in residual error, confirming the benefit of combining autoregressive linear components with deep recurrent architectures. The classification–regression approach improves precipitation-event recall by 18%, enhancing the model's ability to detect low-frequency rainfall events. Overall, the findings support deep learning and hybrid models as robust and scalable alternatives for operational climate forecasting.
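To make the quantities reported above concrete, the following is a minimal Python sketch of how a two-stage (occurrence, then intensity) precipitation forecast might be combined and scored. It is not the study's implementation: the function names, the 0.5 decision threshold, and the convention of reporting improvement as a relative RMSE reduction against a baseline are all assumptions made for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def recall(event_true, event_pred):
    """Fraction of observed events that were predicted as events."""
    true_pos = sum(1 for t, p in zip(event_true, event_pred) if t and p)
    false_neg = sum(1 for t, p in zip(event_true, event_pred) if t and not p)
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def combine_two_stage(p_rain, intensity, threshold=0.5):
    """Combine a classifier's rain probability with a regressor's amount.

    The regressor's amount is kept only where the classifier predicts
    rain (probability >= threshold, an assumed cut-off); elsewhere the
    forecast is zero.
    """
    return [amt if p >= threshold else 0.0 for p, amt in zip(p_rain, intensity)]


def relative_rmse_reduction(rmse_baseline, rmse_model):
    """Relative reduction in RMSE of a model against a baseline."""
    return (rmse_baseline - rmse_model) / rmse_baseline
```

Under this convention, a statement such as "25% lower RMSE than ARIMA" would correspond to `relative_rmse_reduction(rmse_arima, rmse_lstm)` returning 0.25, where both arguments are hypothetical scores computed on the same test set.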
