
Sum-It-Up Agent: An Open-Source AI Agent for Meeting Intelligence

If you do a lot of meetings, you know the pattern: you join, you pay attention, you even take notes, and then two hours later you're asking yourself, "wait, what did we decide?" Multiply that by back-to-back calls, time zones, and the classic "can you send a quick recap?", and you get a real productivity sink.


Sum-It-Up Agent is my open-source answer to that problem: a modular AI agent that ingests audio or video, transcribes it, figures out what kind of meeting it was, generates a structured summary, and then ships that output to places people actually read (Slack, email, PDF today, with Jira and others on the roadmap).

What I wanted to build was not "yet another script that calls Whisper + an LLM". I wanted a system I could iterate on like a real product: components that can be swapped, prompts that can be versioned without touching core code, and services that can be deployed independently.

This post walks through the repo section by section and explains how the pieces fit together, why I designed them this way, and what I would improve next.

An agent made of services, not a monolith


At a high level, Sum-It-Up Agent is split into four MCP services, with one orchestrator coordinating the end-to-end flow:

  • Audio Processor MCP: diarization + transcription
  • Topic Classifier MCP: classify the meeting type from transcript content
  • Summarizer MCP: generate meeting-type-specific summaries using file-backed prompt templates
  • Communicator MCP: deliver output via email, Slack webhook, and PDF export

Why this matters:

  • You can scale heavy parts (audio + LLM inference) independently.
  • You can replace one part without rewriting everything.
  • You can run the same agent locally for a PoC and later deploy services in a more production-like way.

Orchestrator (the brain of the system)


The Orchestrator is the central AI controller. It takes the user's request and input file, calls the appropriate MCP servers in sequence, manages intermediate artifacts (transcript JSON, summary JSON, exports), and determines where to deliver the results (Slack, email, PDF). It is also where failures, retries, and step-level status can be handled cleanly because the workflow is explicit and traceable.
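
To make that concrete, here is a minimal sketch of the orchestration loop. The client objects and function names are illustrative stand-ins, not the repo's actual API:

```python
import json
from pathlib import Path

# Hypothetical client stubs for the four MCP services; in the repo the
# orchestrator talks to them over MCP rather than direct imports.
from mcp_clients import audio, classifier, summarizer, communicator

def run_pipeline(input_file: str, instructions: str, workdir: str = "artifacts") -> None:
    out = Path(workdir)
    out.mkdir(exist_ok=True)

    # 1. Ingest + transcribe; persist the transcript so later steps
    #    (and debugging) operate on files, not hidden state.
    transcript = audio.transcribe(input_file)
    (out / "transcript.json").write_text(json.dumps(transcript))

    # 2. Predict the meeting type from the transcript content.
    meeting_type = classifier.classify(transcript)

    # 3. Generate a meeting-type-specific summary artifact.
    summary = summarizer.summarize(transcript, meeting_type, instructions)
    (out / "summary.json").write_text(json.dumps(summary))

    # 4. Deliver to whichever channels the user asked for (Slack, email, PDF).
    communicator.deliver(summary, instructions)
```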

Audio Processor MCP


  • Ingests audio/video and normalizes it for processing (format, sample rate, mono, etc.); see the sketch below.
  • Produces the transcript artifact (and can include diarization / speaker segments depending on configuration).
  • Exports transcription outputs in a reusable form so downstream steps can operate on files, not hidden state.
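
Here is roughly what the normalization + transcription step boils down to, as a generic sketch using ffmpeg and openai-whisper; the repo's actual ASR backend and parameters may differ:

```python
import subprocess
import whisper  # pip install openai-whisper; a stand-in for your configured ASR backend

def normalize_and_transcribe(input_file: str) -> dict:
    # Normalize to 16 kHz mono WAV, which is what most ASR models expect.
    wav = "normalized.wav"
    subprocess.run(
        ["ffmpeg", "-y", "-i", input_file, "-ac", "1", "-ar", "16000", wav],
        check=True,
    )

    # Transcribe; the result includes the full text plus timestamped
    # segments, which are the raw material for speaker attribution.
    model = whisper.load_model("base")
    result = model.transcribe(wav)
    return {
        "text": result["text"],
        "segments": [
            {"start": s["start"], "end": s["end"], "text": s["text"]}
            for s in result["segments"]
        ],
    }
```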

Topic Classifier MCP


  • Takes the transcript artifact and predicts the meeting type (standup, project sync, sales, 1:1, etc.); a minimal version is sketched below.
  • Enables “right prompt for the right meeting” so summaries are not generic blobs.
  • Can export classification results for debugging, evaluation, and future analytics.
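
A minimal version of that classification step, assuming a local Ollama server; the label set and prompt wording here are my illustration:

```python
import requests

MEETING_TYPES = ["standup", "project_sync", "sales", "one_on_one", "general"]

def classify_meeting(transcript_text: str, model: str = "llama3") -> str:
    # Ask a local model (Ollama's /api/generate endpoint) to pick one label.
    prompt = (
        "Classify this meeting transcript as exactly one of: "
        + ", ".join(MEETING_TYPES)
        + ". Reply with the label only.\n\nTranscript:\n"
        + transcript_text[:8000]  # keep the prompt within context limits
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    label = resp.json()["response"].strip().lower()
    # Fall back to a generic label if the model answers off-list.
    return label if label in MEETING_TYPES else "general"
```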

Summarizer MCP


  • Generates the actual summary using meeting-type-specific prompt templates (file-backed prompts that can be versioned without code changes); the loading logic is sketched below.
  • Supports different summary styles (action items, decisions, key points, executive, detailed) depending on the user intent.
  • Outputs structured summary artifacts (typically Markdown packaged in JSON) designed to be consumed by multiple delivery channels.
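
The file-backed idea is simple: prompt text lives on disk, keyed by meeting type and summary style, so iterating on a prompt is a text edit plus a git commit, not a code change. The directory layout below is illustrative:

```python
from pathlib import Path

PROMPT_DIR = Path("prompts")  # e.g. prompts/standup/action_items.md (illustrative layout)

def load_prompt(meeting_type: str, style: str) -> str:
    # Fall back to a generic template if no meeting-specific one exists,
    # so a new meeting type never breaks the pipeline.
    candidate = PROMPT_DIR / meeting_type / f"{style}.md"
    fallback = PROMPT_DIR / "general" / f"{style}.md"
    template = candidate if candidate.exists() else fallback
    return template.read_text(encoding="utf-8")

def build_summary_prompt(meeting_type: str, style: str, transcript_text: str) -> str:
    # Templates carry a {transcript} placeholder; versioning them in git
    # gives you prompt history and diffs for free.
    return load_prompt(meeting_type, style).format(transcript=transcript_text)
```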

Communicator MCP


  • Delivers results to real workflows: email, Slack, and PDF export (so summaries do not die in a terminal).
  • Applies channel-aware formatting (email subject/body, Slack-friendly Markdown, PDF rendering); see the sketch below.
  • Keeps delivery concerns separate from summarization, so adding Jira/Discord/other outputs does not pollute the core pipeline logic.
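
Channel adapters stay small because summarization already produced a clean Markdown artifact. A sketch of the Slack and email paths, where the webhook URL, SMTP host, and formatting details are placeholders:

```python
import smtplib
import requests
from email.message import EmailMessage

def deliver_to_slack(summary_md: str, webhook_url: str) -> None:
    # Slack incoming webhooks accept a simple JSON payload; note that
    # Slack's "mrkdwn" dialect differs slightly from standard Markdown.
    requests.post(webhook_url, json={"text": summary_md}, timeout=30).raise_for_status()

def deliver_to_email(summary_md: str, recipient: str, smtp_host: str, sender: str) -> None:
    # A plain-text email keeps the adapter trivial; richer rendering is a
    # formatting concern that can live entirely inside this service.
    msg = EmailMessage()
    msg["Subject"] = "Meeting summary"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(summary_md)
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```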

A simple example


Let’s walk through a quick example using a public meeting video from YouTube.

Once started, the agent asks for the input file and your instructions. For this demo, I downloaded the video's audio as an .mp3, but the agent can also handle other audio formats and even video files.

Here’s the prompt I used:
  • Summarize the key points of this meeting in a professional and friendly way, provide bullet points. Send to my email billiosifidis@gmail.com and also on slack
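
Note that the delivery channels ("send to my email ... and also on slack") live inside that free-text instruction, so the agent first has to extract the intent before it can route anything. A hedged sketch of that extraction with a local Ollama model; the JSON schema is my illustration, not the repo's exact format:

```python
import json
import requests

def extract_delivery_intent(instructions: str, model: str = "llama3") -> dict:
    # Ask the model to return machine-readable routing info.
    prompt = (
        "Extract delivery instructions from this request as JSON with keys "
        '"channels" (list of "email"/"slack"/"pdf") and "email_address" '
        "(string or null). Reply with JSON only.\n\nRequest: " + instructions
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False, "format": "json"},
        timeout=120,
    )
    resp.raise_for_status()
    # e.g. {"channels": ["email", "slack"], "email_address": "billiosifidis@gmail.com"}
    return json.loads(resp.json()["response"])
```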


A few minutes later (runtime depends on your hardware and the LLM you choose), the summary is delivered automatically to my predefined Slack channel and to my email.

Slack: [screenshot: the summary posted to the Slack channel]

Email: [screenshot: the summary delivered to the inbox]


Technical highlights (for fellow geeks)


  • Orchestrator-driven pipeline: a central orchestrator coordinates the end-to-end flow (ingest → transcribe → classify → summarize → deliver), with clear step boundaries and artifacts passed between stages for traceability and debugging.
  • MCP-based modular services: the system is split into dedicated MCP servers (audio processing, topic classification, summarization, communication), so each component can evolve, scale, and be swapped independently.
  • Document-as-implementation prompts: summarization behavior is driven by versioned prompt text files (not hardcoded strings), enabling prompt iteration and A/B evaluation without touching core code.
  • Open-source LLM support: designed to run with local OSS models (for example via Ollama), giving you a private, self-hosted path for intent extraction and summarization.
  • MLflow-friendly evaluation mindset: the structure (artifacts + prompt versioning + presets) is set up to integrate naturally with MLflow experiment tracking for prompt/model comparisons and regression testing (a sketch follows this list).
  • ADR documentation: architecture decisions are intended to be captured as ADRs so design trade-offs (like prompt versioning, modular services, delivery channels) are explicit and reviewable over time.
  • Repo: https://github.com/iosifidisvasileios/sum-it-up-agent
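
On the MLflow point: because prompts and summaries are plain files, wiring in experiment tracking is mostly bookkeeping. A sketch of what a tracked run could look like, as an intended direction rather than code already in the repo:

```python
import mlflow

def log_summarization_run(meeting_type: str, style: str, prompt_path: str,
                          model_name: str, summary_path: str) -> None:
    # One MLflow run per pipeline execution makes prompt/model
    # comparisons and regression checks queryable later.
    with mlflow.start_run(run_name=f"{meeting_type}-{style}"):
        mlflow.log_param("meeting_type", meeting_type)
        mlflow.log_param("summary_style", style)
        mlflow.log_param("model", model_name)
        # The prompt file itself is the "implementation": log it verbatim
        # so every run is reproducible against an exact prompt version.
        mlflow.log_artifact(prompt_path)
        mlflow.log_artifact(summary_path)
```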


Summary


Sum-It-Up Agent is my take on making meeting intelligence practical and maintainable: modular MCP services, a clear orchestrator, and prompts treated like real versioned assets. The full codebase is open source at https://github.com/iosifidisvasileios/sum-it-up-agent. Next, I plan to explore UI integration so using the agent becomes as simple as uploading a file and choosing a few options, not just running it from the CLI.

If you try it out, I would genuinely love feedback, issues, feature requests, or PRs. Contributions are welcome, whether it’s fixing a small bug, improving docs, adding a new meeting template, building another integration (Jira and more are on the roadmap), or helping shape the upcoming UI.

