
Advanced RAG: Chat with your video and audio files using contextual chunking

  • Writer: Vasileios Iosifidis
  • Apr 2
  • 6 min read

Imagine you have a huge corpus of video and/or audio files, and you need to find that particular discussion you recorded with a friend about how to build that cool feature on your website that will break the internet. Of course, you haven't named your files accordingly, and you want to find not just the file but the exact minute of that part of the discussion, because who has time to listen to the whole thing, right?


If this problem sounds familiar, don't waste time searching your files one by one trying to figure out which file it was or at what point the discussion took place. Fear not, because LLMs can save the day! In this post, I will dive into advanced RAG, showing how to design a system that solves exactly this: retrieving and answering your questions based on your multimedia corpus. The hidden heroes of this system are: i) the Whisper (v3-turbo) speech-to-text model from OpenAI, ii) the Qwen2 (1.5B-instruct) embedding model from Alibaba, iii) the Qwen2.5 (32B-instruct) generator LLM from Alibaba, iv) Ollama for effortless local LLM deployment, and v) the Streamlit Python package for bringing the chatbot to life.


Working with video/audio data


So, you’ve got hours of audio, from podcasts and meetings to your cousin’s rambling voicemails, and you need it all transcribed yesterday? Manually typing it out would take a loooot of time, and let’s be real, your fingers have better things to do. Luckily, modern speech-to-text technology has evolved, and now you can dump hours of video and audio into a model and get clean, timestamped text without wasting hours of your own.


The trick is picking the right tool for the job, because some models handle accents while others do not. Of course, you can always fine-tune a model on your niche vocabulary (looking at you, medical jargon and startup buzzwords). And if you’re concerned about privacy, you can run plenty of them offline, so your secret brainstorming sessions or the company's secrets stay in-house.


OpenAI has open-sourced Whisper v3 Turbo, a state-of-the-art speech recognition model that delivers exceptional accuracy across diverse audio conditions. Its ability to handle varying accents, background noise, and technical terminology makes it an ideal choice for processing large volumes of audio content efficiently. The model is fairly small, around 800M parameters, and straightforward to use on audio data. You can also fine-tune it quite easily on your own data, vocabulary, and accent; here is a great, easy-to-implement guide to fine-tuning it to your own voice and accent!
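To make the transcription step concrete, here is a minimal sketch using the open-source openai-whisper package; the model name and the audio file path are placeholders rather than my exact pipeline setup.

```python
# pip install -U openai-whisper  (requires ffmpeg)
import whisper

# Load the ~800M-parameter turbo checkpoint (name assumed from recent openai-whisper releases).
model = whisper.load_model("large-v3-turbo")

# Transcribe a placeholder file; the result includes timestamped segments.
result = model.transcribe("meeting_recording.mp3")

for seg in result["segments"]:
    print(f"[{seg['start']:7.1f}s - {seg['end']:7.1f}s] {seg['text'].strip()}")
```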


Contextual vs Standard chunking


Contextual chunking and standard chunking are two fundamentally different approaches to text segmentation, each with distinct advantages and limitations.


Standard/Naive chunking divides text into fixed-length segments (e.g., fixed character length) without considering semantic structure. While this method is simple and computationally efficient, it often splits sentences, paragraphs, or ideas in unnatural ways, leading to fragmented meaning. This makes naive chunking less effective for tasks requiring deep comprehension, such as question answering or document summarization. However, its speed and simplicity make it a practical choice for large-scale preprocessing where raw efficiency outweighs the need for contextual coherence.
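For illustration, a naive chunker can be as small as the sketch below; the window size and overlap are arbitrary choices, not tuned values.

```python
def naive_chunks(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-length character windows, ignoring sentence boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```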


In contrast, contextual chunking enriches each segment with relevant document-wide information. This ensures that key themes, entities, and relationships are preserved within each chunk, improving performance in semantic search, retrieval-augmented generation (RAG), and other NLP tasks. The downside is that contextual chunking requires more sophisticated processing, such as semantic analysis or entity recognition, which increases computational overhead. Despite this trade-off, its ability to maintain meaning and coherence makes it far superior for applications where understanding context is critical.


In the system below, as I highlight in the examples, I found that contextual chunking was able to retrieve more relevant information because of its ability to enrich chunks with information from the whole document. I used an approach similar to Anthropic's contextual retrieval to generate the contextualized chunks. The main difference is that instead of using the whole document to enrich each chunk, I first created a summary of the document and then used the summary instead.
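Here is a rough sketch of that summary-based enrichment using the Ollama Python client; the model tag and prompt wording are illustrative assumptions, not my exact prompts.

```python
import ollama

GEN_MODEL = "qwen2.5:32b-instruct"  # assumed Ollama tag; use whichever tag you pulled

def summarize(transcript: str) -> str:
    # One summary per file, reused for every chunk of that file.
    resp = ollama.chat(model=GEN_MODEL, messages=[
        {"role": "user",
         "content": f"Summarize the key topics of this transcript in one short paragraph:\n\n{transcript}"}
    ])
    return resp["message"]["content"]

def contextualize(chunk: str, summary: str) -> str:
    # Ask the LLM for a short sentence situating the chunk within the document summary,
    # then prepend it to the chunk before embedding.
    resp = ollama.chat(model=GEN_MODEL, messages=[
        {"role": "user",
         "content": (f"Document summary:\n{summary}\n\nChunk:\n{chunk}\n\n"
                     "Write one short sentence situating this chunk within the document.")}
    ])
    return resp["message"]["content"].strip() + "\n\n" + chunk
```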


System Overview: Multimedia Data Processing and Q&A


I will describe a way to build a system that efficiently retrieves relevant information from multimedia sources. The figure below illustrates the pipeline I designed for processing multimedia data and storing it in a vector database. The vector database is essential for the retrieval part. At the top, multimedia data sources serve as the input, which could include audio, video, or other rich media formats. These raw data sources undergo a series of transformations through specialized processing modules. These steps typically include tasks such as file format standardization, speech-to-text transcription, summarization, chunk augmentation, and embedding generation.



Once the data has been processed and the transcripts have been chunked, the chunks are converted into vectors and sent to a vector database for efficient retrieval.
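As a sketch of this indexing step, the snippet below embeds each contextualized chunk with an embedding model served through Ollama and stores it in ChromaDB; both the model tag and the choice of ChromaDB are my assumptions for illustration, not necessarily what the final system uses.

```python
import ollama
import chromadb

EMB_MODEL = "qwen2-1.5b-instruct-embed"  # placeholder tag for the Qwen2 1.5B embedding model

client = chromadb.PersistentClient(path="./vector_store")
collection = client.get_or_create_collection("lectures")

def index_chunks(chunks: list[dict]) -> None:
    # Each chunk dict is assumed to carry: id, contextualized text, source file, start/end minutes.
    for c in chunks:
        emb = ollama.embeddings(model=EMB_MODEL, prompt=c["text"])["embedding"]
        collection.add(
            ids=[c["id"]],
            embeddings=[emb],
            documents=[c["text"]],
            metadatas=[{"file": c["file"], "start_min": c["start"], "end_min": c["end"]}],
        )
```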



So, let's assume that the data is loaded into the vector db; now what? The system still needs a way to let the user interact with it, right? For this purpose, I built a User Interface (UI) using an off-the-shelf package, namely Streamlit. Streamlit is an open-source Python library that simplifies building interactive web applications for data science and machine learning. With just a few lines of code, you can create dashboards, visualizations, and data tools without needing front-end development expertise.
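A minimal sketch of such a chat UI could look like this; rag_answer is a hypothetical helper, sketched after the next paragraph.

```python
import streamlit as st

st.title("Chat with your lectures")

if "history" not in st.session_state:
    st.session_state.history = []

# Replay the conversation so far.
for role, text in st.session_state.history:
    st.chat_message(role).write(text)

if query := st.chat_input("Ask something about your videos..."):
    st.chat_message("user").write(query)
    answer = rag_answer(query)  # hypothetical RAG helper, sketched below
    st.chat_message("assistant").write(answer)
    st.session_state.history += [("user", query), ("assistant", answer)]
```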


The user provides a query to the system through a simple web UI. The query is converted into a vector, which is passed to the vector database to retrieve the most relevant information. Afterward, this information, together with the original query, is provided to the LLM generator as a prompt, and the output is sent back to the user (by the way, that is the basic RAG pipeline).
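A hedged sketch of that basic loop, reusing the collection, EMB_MODEL, and GEN_MODEL names from the earlier snippets, might look like this:

```python
def rag_answer(query: str, k: int = 5) -> str:
    # 1. Embed the query with the same model used for the chunks.
    q_emb = ollama.embeddings(model=EMB_MODEL, prompt=query)["embedding"]

    # 2. Retrieve the k most similar chunks from the vector database.
    hits = collection.query(query_embeddings=[q_emb], n_results=k)
    context = "\n\n".join(hits["documents"][0])

    # 3. Hand the retrieved context plus the original question to the generator LLM.
    resp = ollama.chat(model=GEN_MODEL, messages=[
        {"role": "user",
         "content": f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"}
    ])
    return resp["message"]["content"]
```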


Data sources and RAG evaluation


To better understand the prompt examples below, I will briefly explain the raw data that I used for the system. I used some of my favorite lecture series on YouTube, which resulted in around 300 mp4 video files, summarized below.

These files are converted into thousands of vector embeddings and can be easily queried from the system. I used a 2-minute interval to split the audio of each of these videos so that the system can effectively fetch the most relevant piece of information based on the user query.
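For illustration, cutting the audio into fixed 2-minute windows could be done with pydub (which wraps ffmpeg); the function and file names below are placeholders.

```python
from pydub import AudioSegment  # pip install pydub; requires ffmpeg on the PATH

def split_audio(path: str, minutes: int = 2) -> list[AudioSegment]:
    """Cut an audio/video file into fixed-length windows for chunk-level retrieval."""
    audio = AudioSegment.from_file(path)
    window_ms = minutes * 60 * 1000
    return [audio[i:i + window_ms] for i in range(0, len(audio), window_ms)]

# Example: export each 2-minute piece of a lecture as a wav file for transcription.
for idx, piece in enumerate(split_audio("lecture_01.mp4")):
    piece.export(f"lecture_01_part{idx:03d}.wav", format="wav")
```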


As I mentioned, the contextualized chunking strategy performed better than the standard approach. The reason is that the contextualized chunks are enriched with information from the whole file, which gives the truly relevant chunks higher semantic similarity scores relative to irrelevant ones.


Look at the following example, in which I want to find the big-O complexity of an insertion on the van Emde Boas data structure (a blast from the past if you are into some advanced data structures!). The standard chunking strategy (right chat; retrieved chunks are displayed at the bottom) fetched information from irrelevant chunks and totally missed providing the right answer, in contrast to the other strategy!



Standard chunking is not all bad, of course... There are cases where it can answer the query even if not all the relevant chunks are returned. In the example below, standard chunking fetched the relevant info that comes from minutes [2', 6'). Contextual chunking retrieved the same chunks but in a slightly better time interval, e.g., [0', 6'). Both answers are good enough based on the content the LLM receives, but the latter seems more robust to me!



Final Thoughts


Chatting with your multimedia files has never been easier! In this blog, I built an end-to-end system that parses video data, processes it, and provides a WebUI for the user to chat with it. For this implementation, I utilized open-source models and packages that enabled me to build an impressively accurate system with low effort and practically zero cost! In the next post, I will go over the full technicalities of dockerizing this RAG system to show you how easy it is to serve a RAG system in a container!




** If you are interested in other ML use-cases, please contact me using the form (and include a publicly available dataset for the case; I'm always curious to explore new problems).




 
 