Singularity Score For Evaluating Topic Relevance In Tiny Text

Topic modeling is widely used to extract relevant information and insights from text, owing to its strong results. Any application of the technique requires evaluating the topics it identifies. However, when the text is very short, with fewer than 10 words per document on average, the classical evaluation metrics can be unreliable. To extract meaningful topics and identify the most suitable modeling technique, this study applied topic modeling to this type of data, referred to here as tiny text, using user-generated Portuguese texts collected from post-its during PLANAPP workshops. Six datasets with different preprocessing steps were tested using LDA and BERTopic, the latter with two sentence-transformer models (Multilingual and AlBERTina). As expected, the classical evaluation metrics proved inconsistent for such short texts, which motivated the creation of a new measure of topic coherence, the Singularity Score (SS), designed to mimic the judgment of human annotators. Results show that BERTopic produced more coherent topics, even though LDA scored higher on the traditional metrics. In summary, this work demonstrates that topic modeling can be applied effectively to tiny Portuguese texts, identifies BERTopic as the most suitable approach, and introduces SS as a novel metric for assessing topic quality.

Nicole Nunes

Ana Rita Peixoto

Ana Maria De Almeida