Building an Intelligent OSPF Troubleshooting Agent with LangChain and Netmiko in an ISP Network

By Admin | 28-06-2026

Abstract

LangChain-based agents can change how network operations teams work by automating OSPF troubleshooting and analysis. This article covers the basics of LangChain, AI agents, and tool integration with network devices through an automation framework such as Netmiko. It then takes a deep dive into an AI troubleshooting assistant that automatically collects router output, analyzes OSPF neighbor states, identifies root causes such as an MTU mismatch, an authentication failure, a hello/dead timer mismatch between neighbors, or a DR/BDR election problem, and recommends remediation. By combining a large language model with real-time network telemetry, organizations can reduce mean time to resolution (MTTR), build repeatable troubleshooting workflows, and improve the productivity of Network Operations Center (NOC) and Technical Assistance Center (TAC) teams. This blog provides practical workflows, decision trees, and real-life scenarios that show how AI agents can help engineers diagnose OSPF issues more quickly and accurately.

How LangChain Works

LangChain is an open-source framework that extends GPT and other Large Language Models (LLMs) so they can interact with external APIs, databases, network devices, and other systems. When you submit a query, LangChain first passes it to the LLM, which interprets the request and decides whether it needs more information before it can answer accurately. If the agent needs live network data, it calls the relevant tool (here, a Netmiko-based function) to SSH to the routers and run CLI commands such as show ip ospf neighbor or show ip ospf interface. The LLM then analyzes the outputs and applies networking knowledge and logical reasoning to narrow down what went wrong: an MTU mismatch, a timer mismatch, a DR/BDR election issue, or something else. Finally, the agent returns a clear explanation along with suggested remedial measures. By combining reasoning with real-time network data, LangChain turns a static LLM use case into a network troubleshooting assistant that can automate diagnosis, reduce troubleshooting time, and improve operational efficiency.
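The decide, call tool, then reason loop described above can be illustrated without any LLM or router. The following is a minimal sketch only: fake_llm_decide, fake_llm_answer, and fake_show_ospf_neighbor are hypothetical stand-ins (not LangChain or Netmiko APIs) that return canned values, so the control flow is visible in isolation.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a Netmiko-backed tool: returns canned CLI text
def fake_show_ospf_neighbor(router: str) -> str:
    return "2.2.2.2  1  EXSTART/BDR  00:00:39  192.168.184.132  Gig0/0"

# Tool registry: the agent can only call what is registered here
TOOLS: Dict[str, Callable[[str], str]] = {
    "show_ospf_neighbor": fake_show_ospf_neighbor,
}

def fake_llm_decide(question: str) -> dict:
    """Stub for the LLM's tool-selection step (a real LLM returns a structured tool call)."""
    text = question.lower()
    if "ospf" in text or "neighbor" in text:
        return {"tool": "show_ospf_neighbor", "args": "R1"}
    return {"tool": None, "args": None}

def fake_llm_answer(question: str, observation: str) -> str:
    """Stub for the LLM's final reasoning step over the tool output."""
    if "EXSTART" in observation:
        return "Neighbor stuck in EXSTART: check MTU on both sides."
    return "No issue found in the collected output."

def run_agent(question: str) -> str:
    # 1. Decide whether live data is needed and which tool provides it
    decision = fake_llm_decide(question)
    if decision["tool"] is None:
        return "No live data needed."
    # 2. Call the selected tool and capture its output (the "observation")
    observation = TOOLS[decision["tool"]](decision["args"])
    # 3. Reason over the observation to produce the final answer
    return fake_llm_answer(question, observation)

print(run_agent("Why is the OSPF neighbor on R1 not FULL?"))
# prints: Neighbor stuck in EXSTART: check MTU on both sides.
```

In the real agent, both stubbed steps are performed by the LLM, and the registry entry is a function decorated as a LangChain tool.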

 

Let’s dive into how LangChain's internal workflow helps verify the DR/BDR election in a production network and resolve stuck OSPF states.

 

Scenario

Andrew is a Senior Network Engineer at Cogent ISP, managing an internal OSPF network with over 70 interconnected routers. During production, an unexpected DR/BDR election and OSPF neighbors stuck in the EXSTART state impact network stability. Instead of manually troubleshooting multiple routers, Andrew uses a LangChain-powered AI Agent integrated with Netmiko to automatically collect OSPF data, analyse neighbor states and DR/BDR election parameters, identify the root cause, and recommend the appropriate fix. This AI-driven approach significantly reduces troubleshooting time, improves operational efficiency, and ensures faster restoration of OSPF services in a large-scale ISP environment.

To illustrate the logic, this example uses four routers.

 

Let’s dive into automating this with a LangChain agent.

import os
import re

from dotenv import load_dotenv
from netmiko import ConnectHandler
from langchain.tools import tool
from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate

# Load OPENAI_API_KEY, DEVICE_USERNAME and DEVICE_PASSWORD from a .env file
load_dotenv()

llm = ChatOpenAI(
    model="gpt-4o"
)

# ============================================
# Router Inventory
# ============================================

routers = {
    "R1": "192.168.184.131",
    "R2": "192.168.184.132",
    "R3": "192.168.184.133",
    "R4": "192.168.184.134"
}
# ============================================
# Credentials
# ============================================

username = os.getenv("DEVICE_USERNAME")
password = os.getenv("DEVICE_PASSWORD")

# ============================================
# OSPF Collection Tool
# ============================================
@tool
def ospf_election_check(router_name: str) -> str:
    """
    Collect OSPF election information.
    """

    router_name = router_name.upper()

    if router_name not in routers:
        return f"Router {router_name} not found."

    device = {
        "device_type": "cisco_ios",
        "ip": routers[router_name],
        "username": username,
        "password": password,
    }

    try:
        # The context manager closes the SSH session even if a command fails
        with ConnectHandler(**device) as conn:
            neighbor_output = conn.send_command("show ip ospf neighbor")
            interface_output = conn.send_command("show ip ospf interface")

        # Note: re.search returns the first match only, so when several OSPF
        # interfaces are listed these values describe the first one.

        # --------------------
        # Router ID
        # --------------------
        rid_match = re.search(r"Router ID\s+([\d\.]+)", interface_output)
        router_id = rid_match.group(1) if rid_match else "Not Found"

        # --------------------
        # Priority
        # --------------------
        priority_match = re.search(r"Priority\s+(\d+)", interface_output)
        priority = priority_match.group(1) if priority_match else "Not Found"

        # --------------------
        # Interface State
        # (\w+ so that a trailing comma, as in "State DR,", is not captured)
        # --------------------
        state_match = re.search(r"State\s+(\w+)", interface_output)
        interface_state = state_match.group(1) if state_match else "Not Found"

        # --------------------
        # Network Type
        # ([A-Z_]+ also covers values such as POINT_TO_POINT)
        # --------------------
        network_match = re.search(r"Network Type\s+([A-Z_]+)", interface_output)
        network_type = network_match.group(1) if network_match else "Not Found"

        # --------------------
        # DR
        # --------------------
        dr_match = re.search(
            r"Designated Router \(ID\)\s+([\d\.]+)",
            interface_output
        )
        dr = dr_match.group(1) if dr_match else "Not Found"

        # --------------------
        # BDR
        # --------------------
        bdr_match = re.search(
            r"Backup Designated Router \(ID\)\s+([\d\.]+)",
            interface_output
        )
        bdr = bdr_match.group(1) if bdr_match else "Not Found"

        # --------------------
        # Parse Neighbor States
        # --------------------

        neighbor_states = []

        for line in neighbor_output.splitlines():

            line = line.strip()

            if not line:
                continue

            if line.startswith("Neighbor ID"):
                continue

            parts = line.split()

            if len(parts) >= 3:

                neighbor_states.append({

                    "neighbor": parts[0],

                    "state": parts[2]

                })
        # --------------------
        # Neighbor Health
        # --------------------

        issues = []

        for nbr in neighbor_states:

            state = nbr["state"].upper()

            if "FULL" in state:
                continue

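            elif "2WAY" in state or "2-WAY" in state:
                # 2-WAY is the normal steady state between two DROTHER routers,
                # so it is reported for review rather than as a definite fault
                issues.append(
                    f"{nbr['neighbor']} is in 2-WAY state (normal between DROTHERs)."
                )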

            elif "EXSTART" in state:
                issues.append(
                    f"{nbr['neighbor']} is stuck in EXSTART."
                )

            elif "EXCHANGE" in state:
                issues.append(
                    f"{nbr['neighbor']} is stuck in EXCHANGE."
                )

            elif "LOADING" in state:
                issues.append(
                    f"{nbr['neighbor']} is stuck in LOADING."
                )

            elif "INIT" in state:
                issues.append(
                    f"{nbr['neighbor']} is stuck in INIT."
                )

            elif "DOWN" in state:
                issues.append(
                    f"{nbr['neighbor']} is DOWN."
                )

        if issues:
            health = "\n".join(issues)
        else:
            health = "All OSPF neighbors are healthy."

        result = f"""

 

=================================================

Router Name : {router_name}

Router ID : {router_id}

Priority : {priority}

Interface State : {interface_state}

Network Type : {network_type}

DR : {dr}

BDR : {bdr}

=================================================

Neighbor Health

{health}

=================================================

Neighbor Output

{neighbor_output}

=================================================

Interface Output

{interface_output}

=================================================
"""

        return result

    except Exception as e:

        return f"Error : {str(e)}"
# ============================================
# Prompt
# ============================================

prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        """

Analyze the OSPF output.

Explain:

1. Router ID
2. Router Priority
3. Interface State
4. Network Type
5. DR
6. BDR
7. Neighbor States
8. DR Election
9. Is the election correct?
10. Detect INIT, 2-WAY, EXSTART, EXCHANGE, LOADING and DOWN.
11. Explain root cause.
12. Recommend troubleshooting.
13. Mention whether OSPF is healthy.
"""
    ),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}")
])
# ============================================
# Agent
# ============================================

agent = create_tool_calling_agent(
    llm,
    [ospf_election_check],
    prompt
)

agent_executor = AgentExecutor(
    agent=agent,
    tools=[ospf_election_check],
    verbose=True
)
# ============================================
# User Query
# ============================================

query = input(
    "\nAsk Question (Example: Who is DR and BDR on R1?): "
)

response = agent_executor.invoke(
    {
        "input": query
    }
)

print("\n==============================")
print(response["output"])
print("==============================")

Output:

Step-1: Enter a question at the prompt

Who is DR and BDR on R1?
> Entering new AgentExecutor chain...

Invoking: `ospf_election_check` with `{'router_name': 'R1'}`

Step-2: The tool returns the output

=================================================

Router Name : R1

Router ID : 1.1.1.1

Priority : 100

Interface State : DR

Network Type : BROADCAST

DR : 1.1.1.1

BDR : 2.2.2.2

=================================================

Neighbor Health

All OSPF neighbors are healthy.

=================================================

Neighbor Output

Neighbor ID     Pri  State           Dead Time   Address      Interface
2.2.2.2          1   FULL/BDR        00:00:37    192.168.184.132     Gig0/0
3.3.3.3          1   FULL/DROTHER    00:00:38    192.168.184.133     Gig0/1

=================================================

Interface Output

GigabitEthernet0/0 is up, line protocol is up
Internet Address 192.168.184.131/32, Area 0
Process ID 1, Router ID 1.1.1.1, Network Type BROADCAST
Transmit Delay is 1 sec
State DR, Priority 100
Designated Router (ID) 1.1.1.1
Backup Designated Router (ID) 2.2.2.2
Timer intervals configured, Hello 10, Dead 40

=================================================
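The summary fields at the top of this tool output are extracted from the show ip ospf interface text with regular expressions, and that parsing can be checked offline against captured text before it is pointed at live routers. The sketch below assumes the sample text shown above; parse_ospf_interface and PATTERNS are hypothetical names, and the patterns are similar to those in the tool, with the state pattern limited to word characters so the comma in "State DR," is not captured.

```python
import re

# Captured text, copied from the sample tool output above
SAMPLE = """GigabitEthernet0/0 is up, line protocol is up
Internet Address 192.168.184.131/32, Area 0
Process ID 1, Router ID 1.1.1.1, Network Type BROADCAST
Transmit Delay is 1 sec
State DR, Priority 100
Designated Router (ID) 1.1.1.1
Backup Designated Router (ID) 2.2.2.2
Timer intervals configured, Hello 10, Dead 40
"""

# One pattern per field; each captures the value in group 1
PATTERNS = {
    "router_id": r"Router ID\s+([\d.]+)",
    "network_type": r"Network Type\s+([A-Z_]+)",
    "state": r"State\s+(\w+)",
    "priority": r"Priority\s+(\d+)",
    # The lookbehind keeps this from matching inside the BDR line
    "dr": r"(?<!Backup )Designated Router \(ID\)\s+([\d.]+)",
    "bdr": r"Backup Designated Router \(ID\)\s+([\d.]+)",
}

def parse_ospf_interface(text: str) -> dict:
    """Return the first match of each pattern, or None when a field is absent."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields

parsed = parse_ospf_interface(SAMPLE)
print(parsed["state"], parsed["priority"], parsed["dr"], parsed["bdr"])
# prints: DR 100 1.1.1.1 2.2.2.2
```

Keeping the parser separate from the SSH call makes it possible to test the patterns against saved output from different platforms or software versions, whose formatting may differ from this sample.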

Step-3: The LLM (GPT-4o) analyzes the output and prints the report

==============================
OSPF Analysis Report
Router ID
---------
Router ID is 1.1.1.1.
Router Priority
---------------
The router priority is 100, giving it the highest preference in the DR election.
Interface State
---------------
The interface state is DR, indicating that R1 is currently acting as the Designated Router.
Network Type
------------
The interface is configured as a Broadcast network.
Designated Router
-----------------
DR : 1.1.1.1
Backup Designated Router
------------------------
BDR : 2.2.2.2
Neighbor States
---------------
Neighbor 2.2.2.2 is in FULL/BDR state.
Neighbor 3.3.3.3 is in FULL/DROTHER state.
DR Election
-----------
The DR election appears correct because R1 has the highest OSPF priority (100) and is operating as the Designated Router.
Neighbor Health
---------------
All neighbors have reached FULL state.
Root Cause
----------
No OSPF issues detected.
Recommendation
--------------
• Continue monitoring OSPF adjacency.
• Verify Hello and Dead timers remain consistent.
• Periodically check LSDB synchronization.
• Monitor interface errors and CPU utilization.
Overall Status
--------------
OSPF is healthy and operating normally.
==============================
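The "is the election correct?" judgment in the report can be cross-checked with a small deterministic rule: on a segment where all routers start together, the highest priority wins, the highest Router ID breaks ties, and priority 0 makes a router ineligible. The sketch below is a simplified model under that cold-start assumption; expected_dr_bdr is a hypothetical helper, and the 4.4.4.4 entry with priority 0 is invented for illustration.

```python
import ipaddress
from typing import Dict, Optional, Tuple

def expected_dr_bdr(priorities: Dict[str, int]) -> Tuple[Optional[str], Optional[str]]:
    """Return (DR, BDR) Router IDs for a simultaneous cold start.

    priorities maps Router ID -> OSPF interface priority on the segment.
    """
    # Priority 0 means the router never becomes DR or BDR
    eligible = [rid for rid, prio in priorities.items() if prio > 0]
    # Highest priority first; the numerically highest Router ID breaks ties
    ranked = sorted(
        eligible,
        key=lambda rid: (priorities[rid], int(ipaddress.IPv4Address(rid))),
        reverse=True,
    )
    dr = ranked[0] if ranked else None
    bdr = ranked[1] if len(ranked) > 1 else None
    return dr, bdr

# Priorities for 1.1.1.1 to 3.3.3.3 follow the sample output; 4.4.4.4 is hypothetical
segment = {"1.1.1.1": 100, "2.2.2.2": 1, "3.3.3.3": 1, "4.4.4.4": 0}
print(expected_dr_bdr(segment))
# prints: ('1.1.1.1', '3.3.3.3')
```

OSPF does not preempt an existing DR or BDR, so a live network can legitimately differ from this cold-start ranking. In the sample output, 2.2.2.2 holds the BDR role even though 3.3.3.3 has the higher Router ID, which is consistent with 2.2.2.2 having come up first or with the two neighbors sitting on different interfaces. A mismatch is therefore a prompt to check uptime and segment membership, not proof of a fault.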

Case-1: An OSPF neighbor is stuck in the EXSTART state

Step-1: Run show ip ospf neighbor on router R1

 

Neighbor ID    Pri   State      Dead Time   Address
2.2.2.2         1    EXSTART    00:00:39   192.168.184.132

 

Step-2: The LLM (GPT-4o) analyzes the output and prints the report

==============================
OSPF Analysis Report
Router ID
---------
1.1.1.1
Priority
--------
100
Interface State
---------------
DR
Neighbor State
--------------
Neighbor 2.2.2.2 is stuck in EXSTART.
DR Election
-----------
DR election appears correct.
Root Cause
----------
The EXSTART state indicates Database Description (DBD) packets cannot be exchanged successfully.
Possible causes include:
• MTU mismatch
• Duplicate Router ID
• Interface packet drops
• Network type mismatch
Recommended Troubleshooting
---------------------------
1. Verify MTU on both routers.
2. Compare Router IDs.
3. Check interface errors.
4. Verify OSPF network type.
5. Use 'debug ip ospf adj' if required.
6. Consider 'ip ospf mtu-ignore' after confirming the root cause.
Overall Status
--------------
OSPF is NOT healthy because one neighbor is stuck in EXSTART.
==============================
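The first recommendation in the report, verifying the MTU on both routers, is easy to automate once the interface output from each side has been collected. The sketch below is illustrative only: parse_mtu and mtu_mismatch are hypothetical helpers, the two input lines are invented samples in the style of Cisco IOS show interfaces output, and the pattern assumes the MTU appears as "MTU 1500 bytes" or "MTU is 1500 bytes".

```python
import re
from typing import Optional

def parse_mtu(show_interface_output: str) -> Optional[int]:
    """Extract the MTU value, accepting 'MTU 1500 bytes' or 'MTU is 1500 bytes'."""
    match = re.search(r"MTU(?:\s+is)?\s+(\d+)\s+bytes", show_interface_output)
    return int(match.group(1)) if match else None

def mtu_mismatch(local_output: str, remote_output: str) -> Optional[str]:
    """Return a finding string when the two sides differ, or None when they agree."""
    local_mtu = parse_mtu(local_output)
    remote_mtu = parse_mtu(remote_output)
    if local_mtu is None or remote_mtu is None:
        return "MTU could not be parsed on one side."
    if local_mtu != remote_mtu:
        return f"MTU mismatch: local {local_mtu}, remote {remote_mtu}."
    return None

# Hypothetical captured lines from the two ends of the link
r1_output = "  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,"
r2_output = "  MTU 1400 bytes, BW 1000000 Kbit/sec, DLY 10 usec,"
print(mtu_mismatch(r1_output, r2_output))
# prints: MTU mismatch: local 1500, remote 1400.
```

A finding like this could be appended to the tool output so that the LLM reasons over an explicit comparison instead of inferring the mismatch from two separate blocks of raw text.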

DR/BDR Election and OSPF Stuck States in Production: How the LangChain Internal Workflow Comes into Play

Timely and accurate detection of OSPF issues is crucial for resolving service outages quickly in an operational network. LangChain's internal workflow improves on traditional troubleshooting by combining the reasoning power of large language models (LLMs) with live network data gathered from routers through automation tools such as Netmiko. For example, when an engineer reports a DR/BDR election problem or an OSPF neighbor stuck in the INIT, EXSTART, EXCHANGE, or LOADING state, the LangChain agent first interprets the request and determines which CLI commands are needed (show ip ospf neighbor, show ip ospf interface, show ip ospf database, show running-config). It then gathers and analyzes real-time information on OSPF priorities, Router IDs, interface MTU, hello/dead timers, authentication settings, network type, and LSA synchronization status. From these parameters, the AI agent can tell whether the observed OSPF behavior is caused by an MTU mismatch, a hello/dead timer mismatch, a wrong authentication type or credentials, duplicate Router IDs on a multi-access link, a network type that is misconfigured on one side, or a router that was not elected DR/BDR as intended. Rather than relying on engineers to sift through hundreds of devices and painstakingly correlate outputs, the agent produces a root cause analysis backed by evidence from the collected CLI data and recommends corrective actions. This reduces MTTR, keeps troubleshooting consistent between NOC and TAC, cuts down on human error, and helps bring the network back to a stable state faster in large production environments.
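Several of the parameters named above (hello/dead timers, network type, area) must agree on both ends of a link, so one way to give the LLM clear evidence is to compare the two sides field by field. The following is a minimal sketch under stated assumptions: extract and compare_sides are hypothetical helpers, the local text follows the sample show ip ospf interface output used earlier, and the remote text is invented with different timers. Authentication settings are not covered here.

```python
import re
from typing import Dict, List, Optional

# Fields that must match between two OSPF neighbors on the same link
FIELDS = {
    "hello": r"Hello\s+(\d+)",
    "dead": r"Dead\s+(\d+)",
    "network_type": r"Network Type\s+([A-Z_]+)",
    "area": r"Area\s+([\d.]+)",
}

def extract(text: str) -> Dict[str, Optional[str]]:
    """Pull each field's first occurrence out of show ip ospf interface text."""
    values = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, text)
        values[name] = match.group(1) if match else None
    return values

def compare_sides(local_text: str, remote_text: str) -> List[str]:
    """List every field whose value differs between the two routers."""
    local, remote = extract(local_text), extract(remote_text)
    return [
        f"{name}: local={local[name]}, remote={remote[name]}"
        for name in FIELDS
        if local[name] != remote[name]
    ]

LOCAL = """Internet Address 192.168.184.131/32, Area 0
Process ID 1, Router ID 1.1.1.1, Network Type BROADCAST
Timer intervals configured, Hello 10, Dead 40
"""
# Hypothetical remote side with mismatched timers
REMOTE = """Internet Address 192.168.184.132/32, Area 0
Process ID 1, Router ID 2.2.2.2, Network Type BROADCAST
Timer intervals configured, Hello 5, Dead 20
"""
print(compare_sides(LOCAL, REMOTE))
# prints: ['hello: local=10, remote=5', 'dead: local=40, remote=20']
```

An empty list means the compared fields agree, which lets the agent rule out those causes and focus on the remaining candidates such as MTU or authentication.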

Summary

This blog showed how LangChain-powered AI agents can improve OSPF troubleshooting in large ISP environments by combining the reasoning capabilities of Large Language Models (LLMs) with real-time network automation. It demonstrated an AI agent that automatically fetches live router information through Netmiko, evaluates DR/BDR elections in an OSPF routing domain, using a production-like scenario with more than 70 routers at Cogent ISP, and interprets OSPF neighbor states such as INIT, EXSTART, EXCHANGE, LOADING, and DOWN, with root cause analysis and recommended corrective actions. By automating data collection, protocol analysis, and decision support, LangChain helps NOC and TAC engineers reduce Mean Time to Resolution (MTTR), minimize manual troubleshooting, and resolve OSPF issues in production networks faster and more accurately.