Supervised Machine Learning for Classifying German Parliamentary Documents: A Comparative Study
Organizations in regulation-intensive sectors require timely access to relevant parliamentary information, yet keyword-based monitoring of political documents scales poorly and produces substantial noise. This paper evaluates supervised machine learning pipelines for binary classification of German parliamentary documents to support automated political reporting in the energy sector. Using a corpus from five East German state parliaments, the study systematically compares preprocessing strategies, vector representations (TF-IDF, fastText, EuroBERT), and classifiers under pronounced class imbalance. Results show that while fine-tuned EuroBERT achieves the strongest overall metric profile, a TF-IDF plus multinomial Naive Bayes pipeline with advanced preprocessing delivers the highest F2-score and superior retraining efficiency. Theoretically, the study contributes empirical evidence that simple, well-tuned models can rival transformers in German, domain-specific text classification. Practically, it provides a deployable, resource-efficient classifier validated for integration into an automated political reporting system.
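For readers unfamiliar with the F2-score named above: it is the F-beta measure with beta = 2, which weights recall more heavily than precision. The following minimal sketch computes it from confusion-matrix counts. The function name `fbeta_score_from_counts` and the example counts are hypothetical illustrations, not code or results from the study.

```python
def fbeta_score_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Compute F-beta from true positives, false positives, and false negatives.

    With beta > 1, false negatives (missed relevant documents) are penalized
    more strongly than false positives. beta = 2 gives the F2-score.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives at all, to avoid division by zero.
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts (illustrative only): two classifiers with the same
# number of errors, distributed differently.
few_misses = fbeta_score_from_counts(tp=8, fp=4, fn=2)       # more false alarms
few_false_alarms = fbeta_score_from_counts(tp=8, fp=2, fn=4)  # more misses

print(f"F2 with few misses:       {few_misses:.3f}")
print(f"F2 with few false alarms: {few_false_alarms:.3f}")
```

Both hypothetical classifiers make six errors, but the one that misses fewer relevant documents obtains the higher F2-score, which is the behavior one would want from a metric when overlooking a relevant document is costlier than reviewing an irrelevant one.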
