The Shiniest Tool, or the Most Elegant Solution? Revisiting Text Classification in the Age of Transformers

It's fascinating to watch the current pace of innovation in AI. Transformer-based models like Gemini and GPT-4.1 have set a new bar for processing natural language, and that has rippled out into a wave of experimentation across automation, data science, and almost every text classification benchmark you can name.
But for all the acclaim, that prowess isn't always reliable in practice. A small personal experiment recently challenged what we thought we knew about classification — and quietly raised a bigger question about how we go about choosing tools in the first place.
A small problem, a familiar instinct
The task was deliberately modest: classify short-form questions from an online forum into one of nine categories — artificial intelligence, astronomy, beer, coffee, computer graphics, martial arts, open data, quantum computing, sports. The kind of dataset you'd assume a modern transformer would chew through without breaking a sweat.
The data itself had some texture worth noting. Questions averaged around 50 characters, with the longest topping out at around 150 — short-form, in other words. Categories were imbalanced enough to warrant stratified sampling. And the "noise" in the text wasn't really noise at all: equations, technical syntax like MaxPooling2D, acronyms like GPT-3, and unicode-laden formulas (Lyman‐α, weak-) all carried real domain meaning. Strip them out and you'd lose the signal you were trying to classify.
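For readers who want to see what stratified sampling amounts to, here is a minimal standard-library sketch. The function name, the 20% test fraction and the toy label counts are illustrative assumptions, not the split used in the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test example per class, so rare categories are never dropped.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy, made-up label distribution (not the real dataset).
labels = ["coffee"] * 50 + ["beer"] * 30 + ["quantum computing"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
```

With a plain random split, a small category such as the toy "quantum computing" class could end up with almost no test examples; splitting per class avoids that.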
A quick exploratory pass also confirmed something that mattered for what came next. Word clouds and n-gram analysis showed each category had reasonably distinct vocabulary — but nearly half the questions in each category contained none of the top 20 words for that class. Bag-of-words alone wouldn't cut it. Context would matter.
Which, honestly, is the perfect setup for a transformer to shine.
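The top-20 coverage check from the exploratory pass can be sketched roughly as follows. The tokenizer, the tiny stop-word list and the toy questions are assumptions for illustration; the notebook's own preprocessing may differ.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"how", "to", "the", "a", "is", "what", "of", "in", "for"}

def tokenize(text):
    # Keep hyphenated tokens such as "gpt-3" together, since they carry domain meaning.
    return [t for t in re.findall(r"\w[\w-]*", text.lower()) if t not in STOP_WORDS]

def top_k_coverage(questions, labels, k=20):
    """Per class: share of questions containing at least one of that class's top-k words."""
    word_counts = defaultdict(Counter)
    for text, label in zip(questions, labels):
        word_counts[label].update(tokenize(text))

    top_words = {
        label: {word for word, _ in counts.most_common(k)}
        for label, counts in word_counts.items()
    }

    hits, totals = Counter(), Counter()
    for text, label in zip(questions, labels):
        totals[label] += 1
        if top_words[label] & set(tokenize(text)):
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Toy data; k=1 keeps the example easy to follow by hand.
questions = ["How to brew coffee", "Best coffee grinder", "Espresso pressure",
             "IPA hops", "Hops bitterness"]
labels = ["coffee", "coffee", "coffee", "beer", "beer"]
coverage = top_k_coverage(questions, labels, k=1)
```

In the toy data, "Espresso pressure" is clearly a coffee question yet contains none of the class's top words, which is the kind of gap the coverage figure measures.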
The transformer that didn't quite
We started with BART-MNLI in a zero-shot setup. The intuition was sound: transformers were built precisely to capture context, so they should naturally handle equations, acronyms and technical syntax as part of meaning rather than noise.
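A zero-shot setup of this kind can be sketched with the Hugging Face transformers pipeline. The checkpoint name facebook/bart-large-mnli is an assumption (the post only says "BART-MNLI"), and the helper names are hypothetical. The library is imported lazily because it needs transformers plus a backend such as PyTorch, and it downloads the model weights on first use.

```python
# The nine categories named in the post.
CANDIDATE_LABELS = [
    "artificial intelligence", "astronomy", "beer", "coffee", "computer graphics",
    "martial arts", "open data", "quantum computing", "sports",
]

def build_classifier(model_name="facebook/bart-large-mnli"):
    # Assumed checkpoint; imported lazily so this file loads without transformers installed.
    from transformers import pipeline
    return pipeline("zero-shot-classification", model=model_name)

def predict_category(classifier, question, labels=CANDIDATE_LABELS):
    # The pipeline returns a dict whose "labels" and "scores" are sorted by descending score.
    result = classifier(question, candidate_labels=labels)
    return result["labels"][0], result["scores"][0]

# Hypothetical usage:
# classifier = build_classifier()
# label, score = predict_category(classifier, "Why is MaxPooling2D used after Conv2D?")
```

No training happens here: the model only sees the label names, which is why the wording of a label such as "open data" can shift predictions.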
The numbers told a different story:
- F1-Score: 0.591 (initial) → 0.540 (after relabelling)
- Precision: 0.803 → 0.800
- Recall: 0.559 → 0.516
- Accuracy: 0.559 → 0.516
The confusion matrix exposed a particular weakness around the "open data" category: questions about datasets in other domains kept getting pulled into that label. Renaming it to "sourcing data" to better reflect its semantic content was a reasonable hypothesis, but F1, recall and accuracy all drifted downward while precision stayed essentially flat. There was also a non-trivial practical cost: at roughly one second per question, a full cross-validation run would have taken almost ten hours, forcing the evaluation onto a 5% stratified subset.
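The identical recall and accuracy figures above are what support-weighted averaging would produce, although the post does not say which averaging was used, so treat that as an assumption. The sketch below computes weighted metrics by hand on made-up predictions in which one label absorbs questions from other classes.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n         # each class weighted by its share of true examples
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up example: "open data" swallows one astronomy and one coffee question.
y_true = ["open data", "open data", "astronomy", "astronomy", "coffee", "coffee"]
y_pred = ["open data", "open data", "open data", "astronomy", "open data", "coffee"]
metrics = weighted_metrics(y_true, y_pred)
```

In this toy case precision comes out higher than recall, and weighted recall equals accuracy by construction. It reproduces the shape of the transformer's numbers, not the numbers themselves.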
So we tried something humbler.

When older models quietly outperform
We trained two RNN variants from scratch — an LSTM and a GRU — both with a 100-dimensional embedding layer and a single recurrent layer feeding a softmax. Five epochs, batch size 32, nothing exotic.
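A model of that shape might look like the Keras sketch below. The embedding size, epochs, batch size and class count come from the post; the vocabulary size, number of recurrent units, optimizer and loss are assumptions, and the notebook's actual architecture may differ.

```python
GRU_CONFIG = {
    "num_classes": 9,        # from the post: nine categories
    "embedding_dim": 100,    # from the post
    "epochs": 5,             # from the post
    "batch_size": 32,        # from the post
    "vocab_size": 20_000,    # assumption: not stated in the post
    "recurrent_units": 64,   # assumption: not stated in the post
}

def build_gru_classifier(config=GRU_CONFIG):
    # Imported lazily so this file loads without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Embedding(config["vocab_size"], config["embedding_dim"]),
        keras.layers.GRU(config["recurrent_units"]),  # swap for keras.layers.LSTM to get the LSTM variant
        keras.layers.Dense(config["num_classes"], activation="softmax"),
    ])
    # Assumed optimizer and loss; the loss expects integer class labels.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical usage, with x_train as padded integer token ids and y_train as integer labels:
# model = build_gru_classifier()
# model.fit(x_train, y_train, epochs=GRU_CONFIG["epochs"],
#           batch_size=GRU_CONFIG["batch_size"], validation_split=0.1)
```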
The GRU's results on the held-out test set:
- Accuracy: 0.8950
- Precision: 0.8873
- Recall: 0.8765
- F1-Score: 0.8815
That's roughly a 29 to 38 percentage point lift over the transformer in accuracy, recall and F1 (precision rose by a more modest 8 to 9 points), on a model that's smaller, trains in minutes, and runs predictions in milliseconds. The LSTM landed in similar territory; we picked the GRU on efficiency grounds, given near-identical performance.
There were caveats. Training loss kept dropping while validation loss crept up after epoch two — classic overfitting signs that regularisation, early stopping, or tuning could probably tame. But even the unoptimised GRU left the zero-shot transformer well behind.
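Early stopping was not part of the experiment, but its logic is simple enough to sketch. The function name and the validation losses below are made up; the losses only mimic the curve described above, with the best value at epoch two.

```python
def early_stop(val_losses, patience=1, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed, for patience-based early stopping."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses: improving until epoch 2, then creeping up.
val_losses = [0.62, 0.41, 0.44, 0.49, 0.55]
stop_epoch, best_epoch = early_stop(val_losses, patience=1)
```

In Keras the same idea is available as the EarlyStopping callback, where restore_best_weights=True rolls the model back to its best epoch.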
What this experiment was really about
It's tempting to read this as "RNNs beat transformers" and move on. That's not quite the lesson.
A fairer reading is that task-model fit matters more than model prestige. The transformer was being asked to do zero-shot classification on short, domain-heavy text where the relevant signal lived in technical vocabulary it had no specific training on. The GRU, meanwhile, got to learn the dataset's vocabulary and class boundaries directly. It's not surprising in hindsight — but the instinct to reach for the transformer first was strong enough to make us skip past simpler options.
And that instinct is the part worth examining.
A bigger pattern, beyond NLP
This experience didn't just challenge our assumptions about classification — it made us think about the growing reach of agentic GenAI into automation more broadly. It's easy to get caught up in the genuine promise these systems hold. In the right contexts, they're transformative. But they also tend to demand extensive edge-case handling, careful prompt engineering, and a level of operational complexity that older, more deterministic approaches simply don't.
In our enthusiasm to adopt them, we risk overlooking simpler, more robust automation that would do the job with less ceremony and more reliability. A regex, a rules engine, a small classifier — these aren't glamorous, but for many problems they're the right shape.
So the question we keep coming back to: for any problem in front of us, do we truly need the shiniest tool, or simply the most elegant solution?
Perhaps clarity comes not from novelty, but from what works simply.
Notebook: NLP Model Review — Evaluating Text Classification Techniques
