Machine Learning-Based Discovery of Patterns in Academic Data to Support Decision-Making in Educational Institutions
Understanding and classifying students according to their academic outcomes is a critical challenge for higher education institutions aiming to improve student retention and success rates. This study combines unsupervised clustering with supervised prediction models to analyze student performance data. In the first phase, a bivariate analysis is performed to identify trends and relationships between variables. Then, unsupervised methods are applied to discover latent groupings of students, revealing four meaningful categories: very good performing, good performing, regular, and at risk. These clusters provide insight into the heterogeneity of academic trajectories, while also highlighting areas of overlap between the groups. Finally, supervised algorithms, including Random Forest, Logistic Regression, and XGBoost, are trained to predict the cluster membership of new students based on their academic performance. The models are evaluated on a data set covering different student characteristics, using conventional metrics such as Precision, Recall, and F-score, allowing for a comparative assessment of predictive performance. The integration of supervised prediction with unsupervised discovery enables institutions to classify new students into risk-related categories even before their results are available, thus supporting proactive and personalized interventions.
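
The two-stage workflow described above (cluster first, then train classifiers to predict cluster membership) can be illustrated with a minimal sketch. This is not the study's implementation: the data are synthetic, the three features are hypothetical, and KMeans is an assumed choice, since the abstract says only that unsupervised methods were used. XGBoost is left out so the sketch depends on scikit-learn alone.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for student records; the three features are hypothetical
# (for example: grade average, credits passed, attendance rate).
X, _ = make_blobs(n_samples=400, centers=4, n_features=3, random_state=0)

# Standardize so that no single feature dominates the distance computation.
# Scaling the full data set here is a simplification for illustration.
X = StandardScaler().fit_transform(X)

# Phase 1: unsupervised discovery of four groups (KMeans is an assumption).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Phase 2: supervised models learn to reproduce the cluster assignment,
# so that new students can be placed into a group.
X_train, X_test, y_train, y_test = train_test_split(
    X, cluster_labels, test_size=0.25, stratify=cluster_labels, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Evaluate each model with macro-averaged Precision, Recall, and F-score.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    scores[name] = {"precision": precision, "recall": recall, "f1": f1}

for name, s in scores.items():
    print(f"{name}: precision={s['precision']:.3f} "
          f"recall={s['recall']:.3f} f1={s['f1']:.3f}")
```

Macro averaging weights the four groups equally, which matters when the at-risk group is small. Mapping cluster indices to names such as "at risk" would require inspecting the cluster centers, which this sketch does not attempt.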
