
Overloaded with documents? How AI Extracts Strategic Insights from Unstructured Text

  • Writer: Vasileios Iosifidis
  • Jun 21
  • 4 min read

Updated: Jun 22

For teams dealing with a constant flow of unstructured text—support tickets, feedback forms, internal notes—it doesn’t take long before things become unmanageable. Even small organizations can end up with hundreds of messages each week, and trying to manually sort, tag, or summarize them just isn’t realistic.


Topic modeling has been around for years as a way to group similar messages, but older methods like LDA often struggle in practice. They rely on clean, structured text and tend to fall apart when faced with typos, slang, or inconsistent phrasing. The output? Often vague clusters with labels that don’t tell you much.


📈 Discover how real businesses use AI to create value. Join the newsletter for practical use cases and strategic insights.

With transformer-based models, that’s finally changing. These models understand text the way humans do—capturing meaning beyond just keywords. In this post, I’ll walk through how I use BERTopic, a transformer-powered topic modeling tool, to extract structure and insight from noisy, real-world data quickly.


This approach is useful for anyone trying to make sense of large volumes of text: customer support teams looking to identify recurring issues, product managers tracking feedback themes, or ops leads trying to surface blind spots.


Working with textual data


For this use case, I wanted a dataset that felt close to the kind of problems I see in various projects: lots of messy text with overlapping themes. I used a dataset from the billingsmoore/text-clustering-example-data collection. It’s small, a bit noisy, and exactly the kind of data you'd expect to see in the early stages of a real project. The dataset includes short text snippets grouped under broad topics, but the actual language used in each entry is informal and often ambiguous.
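If you want to follow along, here’s a minimal sketch of loading the data from the Hugging Face Hub. The split and column names ("text", "label") are assumptions on my part, so check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Load the example dataset; split and column names are assumptions,
# not confirmed from the dataset card.
dataset = load_dataset("billingsmoore/text-clustering-example-data", split="train")
docs = dataset["text"]      # assumed column holding the raw snippets
labels = dataset["label"]   # assumed column holding the original broad topics
```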


That’s what makes it interesting: instead of relying on the original topics, I ran the text through a transformer-based topic modeling pipeline using BERTopic. The goal wasn’t to replicate the labels, but to let the model discover its own structure based on the semantic similarity between the texts.


This approach is great when:

  • You don’t have labeled data

  • You want to explore themes before doing any classification

  • Or you just need a quick way to summarize large amounts of content


The goal here isn’t to build a perfect classifier. Instead, I wanted to show what transformer-based topic modeling can do out of the box with very little tuning: extract recurring themes, group similar documents, and give a quick visual overview of what’s going on in the data.


How It Works


The setup for this kind of analysis is straightforward. I used BERTopic as the main tool for topic modeling, which combines several components under the hood to get from raw text to meaningful clusters. At the core of the pipeline is a sentence embedding model—in this case, Qwen/Qwen3-Embedding-0.6B, a compact but surprisingly powerful model that turns each text into a dense vector representation. This step is key: rather than comparing texts word by word, we compare them by meaning.
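To make the embedding step concrete, here’s roughly what it looks like with the sentence-transformers library (a sketch; it assumes a version recent enough to support the Qwen3 embedding family):

```python
from sentence_transformers import SentenceTransformer

# Load the embedding model (assumes sentence-transformers supports
# the Qwen3 embedding models in your installed version).
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Each text becomes a dense vector; semantically similar texts
# end up close together in this space.
embeddings = embedding_model.encode(docs, show_progress_bar=True)
print(embeddings.shape)  # (num_docs, embedding_dim)
```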


Once we have embeddings, BERTopic handles the rest:

  • UMAP reduces the high-dimensional vectors to a low-dimensional space, which makes clustering tractable (and, in 2D, lets us visualize the documents).

  • HDBSCAN finds clusters in the reduced space without requiring us to predefine the number of topics.

  • Finally, a class-based TF-IDF method selects keywords that best represent each cluster, giving us human-readable topic labels.


All of this happens with just a few lines of code, but the result is a set of topics that actually reflect the structure of the data—no manual labeling needed, and no reliance on brittle, rule-based logic.
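For reference, a minimal version of that pipeline looks something like this. The hyperparameters (neighbors, minimum cluster size) are illustrative defaults, not tuned values:

```python
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# Dimensionality reduction before clustering; n_components=5 is a
# common default rather than a tuned choice.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)

# Density-based clustering; no need to predefine the number of topics.
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean",
                        cluster_selection_method="eom")

topic_model = BERTopic(
    embedding_model=embedding_model,  # from the snippet above
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)

# fit_transform returns one topic id per document (-1 marks outliers);
# keywords per topic come from the class-based TF-IDF step.
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
print(topic_model.get_topic_info().head())
```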


Results: Exploring the Topics


After running the text through the pipeline, the model identified a handful of distinct topics based purely on the content—no labels, no prompts, no supervision.


To get a better sense of how those topics align with the original dataset structure, I compared the predicted topics with the original labels. While the goal wasn’t to match them exactly, it’s still a useful sanity check. The heatmap below shows how well the model’s clusters correspond to the ground truth categories.
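One way to build that comparison, sketched here with pandas and seaborn, is a simple cross-tabulation of the discovered topics against the original labels (reusing the assumed labels column from the loading snippet):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Cross-tabulate discovered topics against the original broad labels,
# normalized per label so each row sums to 1.
ct = pd.crosstab(pd.Series(labels, name="original label"),
                 pd.Series(topics, name="discovered topic"),
                 normalize="index")

sns.heatmap(ct, annot=True, fmt=".2f", cmap="Blues")
plt.title("Overlap between discovered topics and original labels")
plt.tight_layout()
plt.show()
```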

In the figure below, I plot the generated clusters in 2D space. This gives a sense of the overlap between the discovered structure and the original labeled categories. Some clusters line up cleanly; others reveal overlaps or mixed content, which is expected given the noisy nature of the data.

Similar documents are pulled together, and different themes are pushed apart. You can see several dense clusters and a few more diffuse regions—this reflects how confident the model is in those groupings.
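BERTopic ships with interactive visualizations for exactly this kind of inspection; a short sketch:

```python
# Interactive 2D map of documents colored by topic (a Plotly figure).
# Passing the precomputed embeddings avoids recomputing them.
fig = topic_model.visualize_documents(docs, embeddings=embeddings)
fig.show()
```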


These visuals are helpful when working with clients or teams who want to see what the model is doing, not just read about accuracy or loss. They make it easier to validate the output and start asking more targeted questions about the themes that emerge.


Conclusion


Even with a small and slightly messy dataset, transformer-based topic modeling can uncover structure that’s both meaningful and actionable. Without any manual labeling, I was able to group similar texts, extract key themes, and visualize the results in a way that makes it easier to reason about the documents’ content.


This kind of approach is especially useful in real-world settings where data is messy, labels are missing, and teams don’t have the time (or budget) for months of annotation work.


If your business is working with customer feedback, support tickets, survey responses, or internal documents—and you're looking for faster ways to understand what’s going on—this kind of analysis can save you a lot of time and surface insights that would otherwise go unnoticed.

📥 Want practical AI use cases? Subscribe to stay informed.
