Impacts of Missing Value Imputation on Clustering Algorithms in Academic Achievement Data

This paper explores how different approaches to handling missing data shape the outcomes of unsupervised clustering in educational datasets. Three commonly used imputation methods – mean substitution, Random Forest, and Multiple Imputation by Chained Equations (MICE) – were applied to student records from a Croatian higher education institution. Each dataset was analysed individually using three clustering algorithms: K-means, Ward’s hierarchical clustering, and the Gaussian mixture model (GMM). Nine datasets were created, and the optimal number of clusters was determined using silhouette coefficients. The results indicate that the choice of imputation method has a pronounced impact only on Ward’s hierarchical clustering, which yielded different cluster solutions depending on the imputation technique. In contrast, K-means and GMM were highly robust, consistently identifying the same number of clusters in all datasets: eight for K-means and two for GMM. Despite these methodological differences, several significant and interpretable student profiles emerged repeatedly, including high-achieving students, older students with long study durations, and low-achieving early school leavers. The results thus confirm that clustering is sensitive to preprocessing, but that the degree of sensitivity depends on the algorithm: Ward’s hierarchical method is strongly affected by how missing data are handled, whereas K-means and GMM remain stable regardless of the imputation method. These findings may benefit researchers who apply unsupervised machine learning in educational data mining, especially when working with incomplete datasets from different sources.
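The workflow summarised above (impute, cluster, select the number of clusters by silhouette coefficient) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, runs on synthetic data rather than the student records, and treats the feature standardisation, the candidate range of 2 to 6 clusters, and all function names as hypothetical choices. `IterativeImputer` is used as a single-imputation approximation of MICE, and the same class with a random forest estimator stands in for Random Forest imputation; the paper does not state which implementations were used.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def make_toy_data(n=150, seed=0):
    """Synthetic stand-in for student records: 3 groups, 4 features, ~10% missing."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0, 0, 0, 0], [5, 5, 5, 5], [0, 5, 0, 5]], dtype=float)
    X = np.vstack([c + rng.normal(size=(n // 3, 4)) for c in centers])
    X[rng.random(X.shape) < 0.1] = np.nan  # knock out values at random
    return X


def build_imputers(seed=0):
    """Three imputers loosely matching the methods named in the abstract (assumed mapping)."""
    return {
        "mean": SimpleImputer(strategy="mean"),
        "random_forest": IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=10, random_state=seed),
            max_iter=3, random_state=seed),
        "mice_like": IterativeImputer(
            max_iter=10, sample_posterior=True, random_state=seed),
    }


def cluster_labels(algo, X, k, seed=0):
    """Fit one of the three clustering algorithms and return hard labels."""
    if algo == "kmeans":
        return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    if algo == "ward":
        return AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    if algo == "gmm":
        return GaussianMixture(n_components=k, random_state=seed).fit_predict(X)
    raise ValueError(f"unknown algorithm: {algo}")


def best_k_by_silhouette(algo, X, k_range=range(2, 7), seed=0):
    """Return (k, score) with the highest mean silhouette coefficient."""
    scores = {}
    for k in k_range:
        labels = cluster_labels(algo, X, k, seed)
        if len(np.unique(labels)) < 2:  # silhouette needs at least two clusters
            continue
        scores[k] = silhouette_score(X, labels)
    best = max(scores, key=scores.get)
    return best, scores[best]


def run_grid(X_missing, seed=0):
    """Evaluate every imputation x clustering combination (3 x 3 = 9)."""
    results = {}
    for imp_name, imputer in build_imputers(seed).items():
        X = StandardScaler().fit_transform(imputer.fit_transform(X_missing))
        for algo in ("kmeans", "ward", "gmm"):
            results[(imp_name, algo)] = best_k_by_silhouette(algo, X, seed=seed)
    return results


results = run_grid(make_toy_data())
for (imp_name, algo), (k, score) in sorted(results.items()):
    print(f"{imp_name:>13} + {algo:<6} -> k={k}, silhouette={score:.3f}")
```

Comparing the selected k across the three imputers for a fixed algorithm gives a simple stability check of the kind the abstract reports; the values obtained on this toy data say nothing about the study's actual results.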

Bojan Radišić
Faculty of Tourism and Rural Development in Požega, Josip Juraj Strossmayer University of Osijek
Croatia

Sanja Seljan
Faculty of Humanities and Social Sciences, University of Zagreb, Department of Information and Communication Sciences
Croatia

Ivan Dunđer
Faculty of Humanities and Social Sciences, University of Zagreb, Department of Information and Communication Sciences
Croatia