Training Transformers for Document Classification
- Vasileios Iosifidis
- Jun 5
- 5 min read
For many businesses, managing documents is a persistent challenge. Invoices, contracts, customer records, and emails accumulate quickly, and manual labeling or categorization often becomes an unavoidable drain on time and resources. As teams grow and data volumes increase, keeping up with this task can become very expensive.

While automation with machine learning has long promised a solution, the reality is more complicated. Most traditional approaches struggle with the diversity and complexity of real-world documents: subtle formatting changes, new templates, or inconsistent terminology can easily throw off even well-designed models, leading to more manual intervention, higher costs, and frustration for teams.
Drawing on my experience implementing document classification systems across industries, I’ve seen these challenges firsthand. But recent breakthroughs in large language models (LLMs) are opening new possibilities for low-cost and effective AI processes. Fine-tuned on your specific documents, LLMs can deliver robust, accurate, and scalable document labeling pipelines that adapt to the unique demands of your business. In this post, I’ll explore how businesses can leverage modern AI to streamline document management, making classification faster, more reliable, and far less burdensome for your team.
Preprocessing Dataset
Before building any automated document classification pipeline, you need a dataset that’s both substantial and relevant to real business needs. For this use case, I wanted to move beyond toy problems, so I picked a resume dataset that contains CVs from 24 different professions, all in English and provided as PDF files. This closely mirrors a real-world use case: classifying unstructured documents across diverse business categories.
If you want to follow along or try this on your documents, you can find the dataset here: Kaggle Resume Dataset. Of course, the beauty of this approach is that with a few tweaks, you can apply the same workflow to any kind of business documents—contracts, invoices, or customer feedback.
For this use case, each resume comes as a PDF, so the first step is to extract the text and pair it with its corresponding profession label. Once you have the raw data, it’s tempting to jump right into training, but careful preparation will save you headaches down the road! For this demo, I kept preprocessing simple and practical (a short sketch follows the list below):
Stratified Splitting: I divided the data into train, validation, and test sets using stratified sampling. This ensures every profession is well-represented in each split.
Minimal Cleaning: Since all resumes are in English and reasonably formatted, I skipped heavy preprocessing, but you may want to clean up headers, footers, or anonymize sensitive info in your workflow.
Label Encoding: Profession categories are encoded as numerical labels for compatibility with most machine learning frameworks.
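To make these steps concrete, here’s a minimal sketch of the extraction-and-split pipeline. It assumes the dataset is unpacked as one folder per profession and uses pypdf for text extraction; the folder layout, the library choice, and the 80/10/10 split ratios are my assumptions, not details fixed by the original setup:

```python
# Minimal preprocessing sketch: extract text from resume PDFs, encode the
# profession labels, and create stratified train/validation/test splits.
from pathlib import Path

from pypdf import PdfReader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def extract_text(pdf_path: Path) -> str:
    """Concatenate the text of every page in a resume PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Assumed layout: data/<profession>/<resume>.pdf
texts, professions = [], []
for pdf in Path("data").glob("*/*.pdf"):
    texts.append(extract_text(pdf))
    professions.append(pdf.parent.name)

# Encode profession names as integer class labels
encoder = LabelEncoder()
labels = encoder.fit_transform(professions)

# Stratified 80/10/10 split so every profession is represented in each set
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)
```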
Model Selection
When it comes to document classification with transformer-based models, one size rarely fits all. The choice of model can have a dramatic impact on both performance and cost—especially when working with long, noisy, or domain-specific documents, such as resumes or contracts. That’s why, for this use case, I decided to benchmark four models that offer complementary strengths: BERT, BigBird, DeBERTa, and a more recent architecture, ModernBERT.
Each of these models tackles the document classification challenge from a slightly different angle. Let’s briefly introduce them and outline why they made the shortlist:
BERT is the baseline transformer model that popularized bidirectional attention. It’s fast, well-documented, and performs well on many short-text classification tasks. However, it struggles with longer inputs due to its fixed 512-token limit, which can be a bottleneck.
BigBird addresses BERT’s context limitation by using sparse attention, allowing it to process sequences up to 4,096 tokens efficiently. This makes it ideal for long-form documents like CVs or contracts, where key information might appear anywhere.
DeBERTa (v3 base) introduces disentangled attention and relative position embeddings, offering stronger performance across many benchmarks compared to standard BERT. It’s a strong candidate when classification accuracy is a priority, especially on nuanced tasks where language structure and semantics matter.
ModernBERT is a lightweight BERT-style model trained on modern English corpora (published in December 2024). Its goal is to retain BERT’s strengths while being faster and more efficient for practical deployment. It incorporates recent techniques such as rotary position embeddings (RoPE) to handle sequences of up to 8,192 tokens with minimal memory consumption. A loading sketch for all four models follows below.
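All four candidates can be loaded through Hugging Face’s `transformers` Auto classes with a sequence-classification head. The sketch below is illustrative: the exact checkpoint names and the per-model maximum lengths are my assumptions about a setup matching the descriptions above.

```python
# Sketch: load each candidate model with a 24-way classification head.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_LABELS = 24  # one class per profession in the resume dataset

# (assumed Hugging Face checkpoint, max input tokens used in this sketch)
CANDIDATES = {
    "bert": ("bert-base-uncased", 512),
    "bigbird": ("google/bigbird-roberta-base", 4096),
    "deberta": ("microsoft/deberta-v3-base", 512),
    "modernbert": ("answerdotai/ModernBERT-base", 2048),
}

def load_candidate(name: str):
    checkpoint, max_length = CANDIDATES[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=NUM_LABELS
    )
    return tokenizer, model, max_length
```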
Performance Evaluation
To evaluate the four models fairly, I used the weighted F1-score on the validation set as the primary metric. Since this is a multi-class classification task with some class imbalance (e.g., niche professions like legal assistant versus software engineer), a weighted F1 provides a more reliable signal than accuracy alone. Alongside this, I monitored the training loss over time to observe convergence behavior and signs of overfitting.
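In practice, the weighted F1 plugs straight into the Hugging Face `Trainer` as a `compute_metrics` callback. Here’s a sketch; the hyperparameters are illustrative placeholders rather than the values used in this benchmark, and `model`, `train_ds`, and `val_ds` are assumed from the earlier steps:

```python
# Sketch: fine-tune with weighted F1 as the validation metric.
import numpy as np
from sklearn.metrics import f1_score
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1_weighted": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(
    output_dir="out",
    eval_strategy="steps",            # score the validation set during training
    eval_steps=100,
    logging_steps=100,                # log training loss for convergence plots
    num_train_epochs=3,               # illustrative, not the benchmark's value
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,                      # from the loading sketch above
    args=args,
    train_dataset=train_ds,           # tokenized splits, assumed prepared
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
)
trainer.train()
```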
Here’s what the results revealed:

The top plot shows the evaluation F1-score over training steps on the validation set:
ModernBERT clearly outperforms all other models, reaching an impressive 0.7981 F1. Its ability to handle longer documents (up to 2,048 tokens) appears to be a key differentiator, allowing it to capture more contextual information from full resumes without truncation.
BERT comes second with an F1 of 0.6152. Despite its 512-token limit, it performs reasonably well thanks to strong pretraining and careful fine-tuning.
BigBird and DeBERTa trail behind with F1-scores of 0.512 and 0.509, respectively. This is surprising, especially for DeBERTa, which generally performs strongly on benchmarks. However, its shorter context window and possibly higher model complexity may have hurt performance given the fixed training setup.
The bottom plot shows the training loss trajectory (on the training set):
ModernBERT not only converges the fastest but also reaches the lowest final loss, indicating a good fit without obvious overfitting.
BERT and BigBird show slower convergence and higher final loss values.
DeBERTa struggles with both convergence and stability—its loss plateaus early and remains higher than the rest, suggesting underfitting or poor alignment with the dataset in this configuration.
To validate generalization, I also evaluated all models on the test set. As shown in the Figure below, ModernBERT remained the top performer with an F1 score above 0.83, significantly ahead of the others. BERT, BigBird, and DeBERTa clustered closer together around 0.56–0.58, reinforcing that raw architecture complexity doesn’t guarantee better results in real-world document tasks. What mattered more was usable context, training stability, and alignment with the problem at hand.
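For reference, the test-set score can be computed with the same `Trainer`; a brief sketch, assuming `trainer` and a tokenized `test_ds` from the sketches above:

```python
from sklearn.metrics import f1_score

# Sketch: final weighted F1 on the held-out test set.
preds = trainer.predict(test_ds)
y_pred = preds.predictions.argmax(axis=-1)
print(f1_score(preds.label_ids, y_pred, average="weighted"))
```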

Final Thoughts: From AI Model to Business Process Automation
Document classification is deceptively difficult. Real-world documents are messy: long, inconsistently structured, and full of edge cases that trip up traditional models. What this benchmark made clear is that performance hinges on how much context a model can actually use. BERT is fast and predictable but limited by its short context window. BigBird and DeBERTa offer more architectural complexity, but that complexity didn’t translate into better results here. In contrast, ModernBERT quietly outperformed across the board: training faster, generalizing better, and adapting to long, unstructured inputs with minimal tuning.
For business and product teams, the takeaway is simple: better models reduce manual work and unlock faster, more scalable document workflows. You don’t need to hire a large ML team or spend weeks tuning hyperparameters to get results. With the right architecture and a clean pipeline, automating document labeling is not just possible—it’s practical. And once that system is in place, your team can focus on what matters: acting on insights, not chasing PDFs.