
(Part 1) How to Connect Your LLM Agent to an MCP Server

The Model Context Protocol (MCP) is quickly emerging as a foundational layer in the AI engineering stack — yet most conversations around it remain either too shallow or overly speculative. This blog series aims to cut through the noise.

I’ll dive into what makes MCP unique, where it fits in the AI stack, and why some teams are quietly treating it as infrastructure. Along the way, we’ll explore topics like connecting LLMs to MCP servers, benchmarking tool retrieval, enabling human-in-the-loop workflows, multi-tenancy, addressing security vulnerabilities, and more.

If you’re building serious AI systems, or are just curious about where things are headed, stay tuned. In Part 1 of the MCP series, we will talk about how to connect your LLM to an MCP server.

Model Context Protocol (MCP)

Model Context Protocol (MCP) is an open protocol introduced by Anthropic in late 2024 to standardize how AI models connect to external data sources and tools. At its core, it defines a common format for a client to invoke operations on an external tool or service in a predictable way. By providing a uniform interface, MCP is intended to do for AI-agent integrations what traditional APIs did for web services – enable any AI application to talk to any service using a predictable, well-defined contract.

MCP Workflow

What MCP Fixes in the AI Agent Stack?

  • Predictable Behavior with Tools - In the past, integrations with agents relied on hard-coded API calls or tools inferred from documentation. MCP provides a well-defined interface for tool invocation with standardized discovery, invocation patterns, and error handling. This determinism makes agent behavior more predictable and safe, solving integration challenges with services like GitHub, Jira, and other tools.

  • Reusability of Tools - Tool developers can implement an MCP server once, making it available to any MCP-compatible client. This creates an ecosystem of reusable components, similar to how npm packages work for JavaScript. For example, once someone builds a Google Drive MCP server, many different applications can access data from Google Drive without each developer needing to build custom connections.

  • Consistent Agent Interface - MCP is the action-taking interface for agents. It removes the need for custom glue logic per integration. For a developer, the steps are straightforward: implement an MCP server, then register it with your agent platform.
  • Flexible Integration - Clients and servers can dynamically discover each other's capabilities during initialization, allowing for flexible and extensible integrations that can evolve over time.

Quickstart: Connecting Your LLM to an MCP Server

Let's look at an example of how to connect your LLM to an MCP server using an agentic library like Pydantic AI. If you want to create your own MCP server, take a look at fastmcp; for this example, we will use the SQLite MCP server. This particular MCP server lets you interact with any SQLite database and exposes operations like 'read_query', 'write_query', 'create_table', etc.

import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

server = MCPServerStdio(
    "uvx",
    args=[
        "mcp-server-sqlite",
        "--db-path",
        "db.sqlite",
    ],
)
agent = Agent("openai:gpt-4o", mcp_servers=[server])


async def main():
    async with agent.run_mcp_servers():
        result = await agent.run(
            "see if the table animal exists. If it exists, give description of the table"
        )

    print(result.output)
    # The table `animals` exists and has the following structure:
    # - `name` (TEXT): The name of the animal.
    # - `type` (TEXT): The type or species of the animal.
    # - `age` (INTEGER): The age of the animal.

if __name__ == "__main__":
    asyncio.run(main())

Under the Hood: How LLMs Actually Connect to MCP Tools

Pydantic AI hides a lot of the complexity of the LLM workflow with MCP tools. Let's unravel some of the important steps that happen under the hood.

Access tools through MCP Client

import asyncio

from fastmcp import Client

config = {
    "mcpServers": {
        "sqlite": {
            "command": "uvx",
            "args": ["mcp-server-sqlite", "--db-path", "db.sqlite"],
        }
    }
}
client = Client(config)
async def example():
    async with client:
        tools = await client.list_tools()
        print([tool.name for tool in tools])
        # > ['read_query', 'write_query', 'create_table', 'list_tables', 'describe_table', 'append_insight']
        print(tools[1].inputSchema)
        # >  {'type': 'object', 'properties': {'query': {'type': 'string', 'description': 'SQL query to execute'}}, 'required': ['query']}

asyncio.run(example())

Here, once connected to the SQLite MCP server, the client can fetch all the tools on the server along with their input schemas.

Choosing the Right Tools at Runtime

For a given user query, the LLM dynamically decides which tools to call based on the tool descriptions from the MCP server.

from typing import List

import instructor
from openai import AsyncOpenAI
from pydantic import BaseModel

async_client = instructor.from_openai(AsyncOpenAI())


class FunctionList(BaseModel):
    """A model representing a list of function names."""

    func_names: List[str]

## modified example function

async def example(user_query: str):
    async with client:
        tools = await client.list_tools()
        print([tool.name for tool in tools])
        # > ['read_query', 'write_query', 'create_table', 'list_tables', 'describe_table', 'append_insight']
        print(tools[1].inputSchema)
        # >  {'type': 'object', 'properties': {'query': {'type': 'string', 'description': 'SQL query to execute'}}, 'required': ['query']}
        try:
            response = await async_client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {
                        "role": "system",
                        "content": f"""Identify the tools that will help you answer the user's question.
                        Respond with the names of 0, 1 or 2 tools to use. The available tools are
                        {tools}.

                        Don't make unnecessary function calls.
                        """,
                    },
                    {"role": "user", "content": f"{user_query}"},
                ],
                temperature=0.0,
                response_model=FunctionList,
            )
            print(response)
        except Exception as e:
            print(f"Error in API call: {str(e)}")
            return None

Structured Tool Definitions for LLM Use

Once the LLM selects the relevant tools, we convert them into ToolDefinition objects. These definitions are later used to generate Pydantic models for structured parameter generation by the LLM.

from pydantic_ai.tools import ToolDefinition

tool_definitions = []
for func_name in response.func_names:
    # Find the tool by name
    tool = next((t for t in tools if t.name == func_name), None)
    if tool:
        tool_def = ToolDefinition(
            name=tool.name,
            description=tool.description,
            parameters_json_schema=tool.inputSchema,
        )
        tool_definitions.append(tool_def)
    else:
        print(f"Tool {func_name} not found in tools list.")

Invoking Tool Calls with LLM-Generated Parameters

For each selected tool, we dynamically generate a Pydantic model based on its input schema. This model acts as a response template for the LLM, guiding it to produce structured parameters tailored to the tool. Once the parameters are generated, they’re used to invoke the tool via the MCP client — completing the reasoning-to-execution loop.
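
The helper create_model_from_tool_schema used below isn't from any library here; a minimal sketch built on pydantic.create_model might look like this (the JSON-type mapping is deliberately simplified, and nested objects/arrays are ignored):

from typing import Any, Optional

from pydantic import create_model

JSON_TO_PY = {"string": str, "integer": int, "number": float, "boolean": bool}

def create_model_from_tool_schema(tool_def):
    """Build a Pydantic model from a tool's JSON schema for structured parameter generation."""
    schema = tool_def.parameters_json_schema or {}
    required = set(schema.get("required", []))
    fields = {}
    for field_name, prop in schema.get("properties", {}).items():
        py_type = JSON_TO_PY.get(prop.get("type"), Any)
        if field_name in required:
            fields[field_name] = (py_type, ...)  # required field
        else:
            fields[field_name] = (Optional[py_type], None)  # optional field
    return create_model(tool_def.name, **fields)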

gpt_response = []
for tool_calls in tool_definitions:
    x = create_model_from_tool_schema(tool_calls) ## creates Pydantic model for structured output 
    response1 = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"""Figure out if the {user_query} has been resolved without the tool call or not.If not, execute the tool call with the right parameters for the {tool_calls.name} tool with description {tool_calls.description}.
                """,
            },
            {"role": "user", "content": f"{gpt_response}"},
        ],
        temperature=0.0,
        response_model=x,
    )
    res = None
    if response1 is not None:
        ## Call MCP tool with inferred parameters from the LLM
        res = await client.call_tool(
            tool_calls.name, response1.model_dump()
        )
        gpt_response.append(
            {
                "tool_name": tool_calls.name,
                "parameters": response1.model_dump(),
                "response": res[0],
            }
        )

Agentic Flow of Tool Calls

Synthesizing Results into Final Response

Once all tool calls are complete, we pass their responses back to the LLM. The LLM then synthesizes these results into a final, human-readable response to the original user query.

    final_llm_response = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"""Consolidate all the responses from the tool calls and give the user an appropriate answer to the {user_query}.
                """,
            },
            {"role": "user", "content": f"{gpt_response}"},
        ],
        temperature=0.0,
        response_model=LLMResponse,  # a simple Pydantic model for the final answer (not shown above)
    )
    return final_llm_response

if __name__ == "__main__":
    tools_called = asyncio.run(
        example(
            "see if the table animal exists. If it exists, give description of the table"
        )
    )

User Query : "see if the table animal exists. If it exists, give description of the table"

Response : The table "animals" does exist in the database. It has the following structure:
1. name: Type - TEXT, Nullable - Yes
2. type: Type - TEXT, Nullable - Yes
3. age: Type - INTEGER, Nullable - Yes
This table does not have any primary key defined

DAG-based Execution for Multi-Step Workflows

So far, we've focused on a single-turn tool interaction, but many real-world queries require multi-step, dependent tool interactions between your LLM agent and an MCP server.

Suppose your LLM agent receives a user query: "List all tables in the database and describe each one."

With a DAG-based approach:

  • The agent calls the list_tables tool (root node).
  • For each table returned, the agent creates a node to call the describe_table tool.
  • The results are gathered and passed to the LLM for summarization or further reasoning.

This pattern generalizes to any scenario where:

  • Agents dynamically decide the number of tool calls based on data.
  • Tool calls can be made only if certain conditions are met.
  • The output of one tool call influences the flow of the next.
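
Below is a rough sketch of that fan-out, reusing the fastmcp client from earlier. The result-parsing details and the describe_table parameter name are assumptions about this particular SQLite MCP server, so treat it as illustrative rather than exact.

import asyncio
import json

async def list_and_describe():
    async with client:
        # Root node: a single list_tables call.
        listed = await client.call_tool("list_tables", {})
        table_names = [row["name"] for row in json.loads(listed[0].text)]  # assumed result shape

        # Dependent nodes: one describe_table call per table, run concurrently.
        described = await asyncio.gather(
            *(client.call_tool("describe_table", {"table_name": name}) for name in table_names)
        )

        # Gather the results so the LLM can summarize them in a final step.
        return {name: desc[0].text for name, desc in zip(table_names, described)}

asyncio.run(list_and_describe())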

I’ll explore this DAG-based approach more thoroughly in a later part of this series.

What If Your Agent Picks the Wrong Tool?

As we’ve seen, MCP brings determinism to tool execution — but the decision to invoke the right tool with the right parameters still rests on the LLM, which is inherently non-deterministic. That’s where things get tricky.

In the next part of this series, we’ll dive into the overlooked but critical problem: How do you know if your agent is choosing the right tool — and how do you fix it when it doesn’t? We’ll explore evaluation techniques, failure patterns, and practical ways to debug and improve tool selection, so your agent not only runs, but runs smart.

Stop Tweaking Prompts—Fix Your Retrieval and Win

The Hidden Business Metrics Driving AI Success (or Costly Failure)

Are your AI generation systems consistently underperforming despite endless prompt tweaking? You're likely optimizing the wrong part of your pipeline. After analyzing generation systems across industries, I've discovered an uncomfortable truth: most teams waste resources fine-tuning models and prompts while ignoring the foundation everything depends on—retrieval quality.

The Costly Mistake Most AI Teams Make

Your generation task is only as good as the information it works with. Yet most teams spend weeks optimizing LLM prompts while dedicating minimal effort to their retrieval systems. This fundamental mistake costs companies thousands in wasted compute and engineering hours.

Common Tradeoffs Between Content Generation & Retrieval

  • Speed: Content generation takes 1-10 seconds per test, while retrieval takes only 10-800ms.
  • Cost: Content generation can cost hundreds per run, whereas retrieval is negligible.
  • Objectivity: Content evaluation is subjective, while retrieval metrics are quantitative.
  • Iteration Speed: Content tuning takes hours, retrieval tuning takes minutes.
  • Scalability: Content evaluation is limited, retrieval can be automated.

By focusing on retrieval first, you build a solid foundation that enhances everything else. Here’s why it matters and how to implement it systematically.

Framing Generation as a Retrieval Problem

Consider text-to-SQL generation. Before generating SQL, the system must first retrieve relevant context efficiently:

The fundamental step is to create a dataset where each entry consists of an input (e.g., a prompt or a question) paired with its desired output (e.g., an answer, a summary, or a SQL snippet). The task of the retrieval system is then to retrieve the right SQL snippet for a given query efficiently and accurately. By defining and tracking retrieval metrics like recall and MRR, you can identify areas where the retrieval system fails to find the correct SQL snippets for certain types of queries, which informs optimization efforts such as fine-tuning the embedding model or experimenting with different indexing strategies.
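
As a rough sketch, such an evaluation set might look like this (field names are hypothetical):

# Each entry pairs a natural-language question with the SQL snippet
# the retriever should surface for it.
eval_dataset = [
    {
        "question": "How many users signed up after January 1, 2024?",
        "expected_snippet": "SELECT COUNT(*) FROM users WHERE signup_date > '2024-01-01'",
    },
    {
        "question": "Show all sales transactions where the amount exceeds 1000.",
        "expected_snippet": "SELECT * FROM sales WHERE amount > 1000",
    },
]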

Key Benefits of Retrieval-First Approach

  1. Early edge case detection: Identify potential failure points by analyzing retrieval errors before they impact production performance.
  2. Business logic integration: Ensure that company-specific calculations, such as custom pricing rules or compliance constraints, are accurately retrieved and incorporated into query generation. For example, when handling tax computations, retrieving company-specific tax codes before generating SQL ensures compliance and accuracy:
# Example tax-code lookup for a specific company
def get_tax_rate(tax_code):
    tax_rates = {"A": 0.05, "B": 0.1, "C": 0.15}  # Different tax rules per company
    return tax_rates.get(tax_code, 0.08)  # Default tax rate if code is unknown

# Applying the retrieved tax rate in query generation
final_query = f"SELECT total_price, total_price * (1 + {get_tax_rate(tax_code)}) AS total_with_tax FROM orders"
  3. Metrics-driven optimization: Precision, recall, and MRR provide objective benchmarks and help ensure accurate and comprehensive information retrieval.
  4. Few-shot enhancement: A generation task can be greatly improved by including few-shot examples, especially those covering business-specific edge cases or hard-to-retrieve queries.

The Three-Step System for Optimizing Retrieval

Step 1: Create Synthetic Test Data

Generate diverse test questions reflecting real user queries:

Example input:

challenging_examples = ["SELECT * FROM sales WHERE amount > 1000", "SELECT COUNT(*) FROM users WHERE signup_date > '2024-01-01'"]

Example output:

[{
    "question": "How many users signed up after January 1, 2024?",
    "relevance": "high"
}, {
    "question": "Show all sales transactions where the amount exceeds 1000.",
    "relevance": "high"
}]

from pydantic import BaseModel
import openai
import instructor
import random

class SyntheticQuestion(BaseModel):
    question: str
    relevance: str

client = instructor.from_openai(openai.OpenAI())
questions = []

for sql_example in challenging_examples:
    constraints = random.choice(["time", "amount", "limit", "comparison"])
    question = client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=SyntheticQuestion,
        messages=[
            {"role": "user", "content":f"Generate a question for this SQL: {sql_example}. Add {constraints} constraint."}
            ]
    )
    questions.append(question)

This ensures comprehensive test coverage without needing real user data.

Step 2: Define Meaningful Retrieval Metrics

  • Precision@K: How many of the retrieved items are relevant? Improving precision ensures we are not wasting resources or adding latency by processing irrelevant information.
  • Recall@K: How many of the relevant items did we retrieve? By optimizing for recall, we ensure the language model has access to all the necessary information, which leads to more accurate and reliable generated responses.
  • MRR@K (Mean Reciprocal Rank): How highly ranked is the first relevant item? This matters when retrieved results are shown to users as citations, since only the top-ranked results are displayed.

For most generation tasks, recall is critical—your LLM needs all necessary information to generate accurate responses.
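
As a rough sketch, assuming each query has exactly one known relevant item, these metrics can be computed as follows:

# `results` maps each query id to the ranked list of retrieved item ids;
# `relevant` maps each query id to the single item id that should be retrieved.
def evaluate_at_k(results, relevant, k=10):
    recall_hits, precision_sum, reciprocal_ranks = 0, 0.0, []
    for query_id, ranked_ids in results.items():
        top_k = ranked_ids[:k]
        target = relevant[query_id]
        hit = target in top_k
        recall_hits += hit
        precision_sum += (1 / k) if hit else 0.0  # at most one relevant item per query
        reciprocal_ranks.append(1 / (top_k.index(target) + 1) if hit else 0.0)
    n = len(results)
    return {
        "recall@k": recall_hits / n,
        "precision@k": precision_sum / n,
        "mrr@k": sum(reciprocal_ranks) / n,
    }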

Step 3: Build Your Evaluation Pipeline

For example, say you want to improve the performance and accuracy of the text-to-SQL system behind your analytics engine.

You want to evaluate whether adding a reranker is worth it. While it promises better results, it adds 200 ms of latency and $0.01 per query.

You need a way to systematically improve your system with data-driven methods so that you are not wasting months on the wrong thing.

Set up an automated pipeline to test different retrieval strategies:

# Compare different retrieval approaches systematically
for embedding_model in ["text-embedding-3-small", "text-embedding-3-large"]:
    for search_type in ["vector", "hybrid"]:
        for use_reranker in [True, False]:
            results = evaluate_retrieval(
                questions,
                embedding_model=embedding_model,
                search_type=search_type,
                use_reranker=use_reranker
            )
            log_metrics(results)
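
Here evaluate_retrieval and log_metrics are stand-ins for your own pipeline. A hedged sketch, reusing the evaluate_at_k helper sketched above and assuming the synthetic questions are kept as (question, source SQL) pairs so each question is labelled with the snippet it should retrieve:

# Placeholder retrieval call: swap in your real stack (embed the question, run vector or
# hybrid search, optionally rerank) and return a ranked list of SQL snippets.
def run_search(question_text, embedding_model, search_type, use_reranker):
    return list(challenging_examples)

def evaluate_retrieval(questions, embedding_model, search_type, use_reranker, k=20):
    results, relevant = {}, {}
    for i, (question_text, source_sql) in enumerate(questions):
        results[i] = run_search(question_text, embedding_model, search_type, use_reranker)
        relevant[i] = source_sql
    return evaluate_at_k(results, relevant, k=k)

def log_metrics(metrics):
    print(metrics)  # or log to your experiment tracker of choice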

Findings from the evaluation pipeline:

  1. Upgrading from text-embedding-3-small to text-embedding-3-large provided better performance than adding a re-ranker.
  2. The larger embedding model achieved near-perfect recall (0.98) at k=20 without a re-ranker.
  3. The re-ranker showed minimal statistical improvement, confirmed using bootstrapping and t-tests. We avoided an unnecessary cost while improving overall performance.

This allows for evidence-based decisions about retrieval architecture.

Validating Your Improvements with Statistical Rigor

Once you see performance improvements, validate them statistically:

  • Bootstrap sampling: Simulate multiple runs to estimate result stability.
  • Confidence intervals: Visualize performance ranges to identify overlapping results.
  • Statistical testing: Use t-tests to confirm if differences are significant.

When comparing embedding models, t-statistics below -10 and p-values < 0.001 confirmed that the larger model's superior performance was statistically significant.
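
A rough sketch of that kind of check, using scipy.stats for the paired t-test and simple resampling for the bootstrap (the per-query scores below are synthetic placeholders standing in for the output of the evaluation pipeline):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Per-query recall@20, paired by query; in practice these come from the evaluation pipeline.
scores_small = rng.binomial(1, 0.90, size=200)  # smaller embedding model (placeholder data)
scores_large = rng.binomial(1, 0.98, size=200)  # larger embedding model (placeholder data)

# Bootstrap the mean improvement to see how stable it is.
diffs = scores_large - scores_small
boot_means = [rng.choice(diffs, size=len(diffs), replace=True).mean() for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

# Paired t-test on the per-query differences.
t_stat, p_value = stats.ttest_rel(scores_large, scores_small)

print(f"95% CI for mean improvement: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")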

Three Critical Mistakes to Avoid

  1. Starting with generation evaluation: This is slow, expensive, and subjective. Start with retrieval metrics for faster iteration.
  2. Using narrow test data: Generate diverse synthetic questions to uncover blind spots.
  3. Making decisions without statistical validation: Ensure improvements are real, not just random variation.

Start Measuring What Matters

Retrieval isn’t just a nice-to-have—it’s the backbone of effective generation tasks. Whether you’re summarizing, reporting, or extracting insights, it’s the step that turns potential into performance.

The most profitable AI systems aren’t built on hype—they’re built on systematic measurement and optimization that drive real business impact. By prioritizing retrieval, you will create AI generation systems that are faster, more reliable, and cost-effective.

Don't let poor retrieval undermine your AI projects. Start optimizing today.