The Promise of Generative AI in Telco Networks - Pitfalls and Opportunities

Introduction
Artificial Intelligence (AI) in Telco Operations is here. It is happening in phases, starting with the use of Large Language Models (LLMs) in the call center. The Return on Investment (RoI) there is obvious and immediate: it is straightforward to integrate LLMs with Customer Support, since both are language-based (text and voice). As a result, AI has improved the quality of customer support while reducing the time spent on each case.
AI is now moving into Network Operations. While the RoI is potentially huge, achieving it has proven to be a more complex, multi-faceted challenge. Initially, the focus was on combining anomaly detection with contextual documentation. The idea is to reduce time to triage by bringing telco standards, industry documentation, ticketing data and other sources together with the anomalies using an LLM. The integration is done via LLM fine-tuning, Retrieval Augmented Generation (RAG), or both; the strategies vary. While this has improved access to consolidated, relevant information about network anomalies, it is not accurate enough to determine the root cause of many network issues. Consequently, telcos are now taking a closer look at the data used during the troubleshooting process itself, and the escalation process more specifically. This is where experienced engineers spend their time investigating: deep understanding of network packet captures, coupled with the relevant vendor and standards documentation, is required. Using AI to automate network diagnostics and remediation is challenging, because packet capture (PCAP) data is not immediately suitable for use by an LLM, for several reasons.
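To make the RAG step concrete, here is a minimal sketch of retrieving documentation context for an anomaly. It uses TF-IDF similarity for simplicity; the document snippets and anomaly text are hypothetical, and a production system would use learned embeddings and a vector store rather than this toy index.

```python
# Minimal RAG-style retrieval sketch: given an anomaly description, find the
# most relevant documentation snippet to hand to an LLM as context.
# The snippets and anomaly text below are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "3GPP TS 23.502: PDU session establishment procedures and failure causes.",
    "Vendor guide: AMF overload protection and N2 interface congestion handling.",
    "Ticket 1042: registration storms after firmware upgrade on gNodeB cluster.",
]

anomaly = "Spike in PDU session establishment failures on the N2 interface"

vectorizer = TfidfVectorizer().fit(docs + [anomaly])
doc_vecs = vectorizer.transform(docs)
query_vec = vectorizer.transform([anomaly])

# Rank documents by cosine similarity and pick the best match as LLM context.
scores = cosine_similarity(query_vec, doc_vecs)[0]
best = scores.argmax()
print(f"Retrieved context (score {scores[best]:.2f}): {docs[best]}")
```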
Early attempts and lessons learned
The initial attempts to leverage an LLM with RAG centered on extracting error codes from PCAP files and using RAG to retrieve relevant documentation from other sources. While this brings documentation to the engineer faster, it can actually make triage worse: an error code is not the same as the root cause, so the engineer is inundated with irrelevant information and potentially led down the wrong path.
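A minimal sketch of that first approach, assuming SIP signaling in the capture, might look like the following. The file name and regex are illustrative only, and the closing comment captures the pitfall: the extracted code is a symptom, not a diagnosis.

```python
# Sketch of the naive "extract error codes from a PCAP" approach.
# Assumes SIP-over-UDP signaling; the file name is a placeholder.
import re
from scapy.all import rdpcap, Raw

ERROR_RE = re.compile(rb"SIP/2\.0 ([45]\d\d) ([^\r\n]+)")

error_codes = set()
for pkt in rdpcap("session_capture.pcap"):
    if pkt.haslayer(Raw):
        match = ERROR_RE.search(bytes(pkt[Raw].load))
        if match:
            code, reason = match.groups()
            error_codes.add((code.decode(), reason.decode()))

# These codes would then drive RAG document retrieval. The pitfall: a
# "503 Service Unavailable" here is a symptom; the root cause may lie
# several hops upstream, so the retrieved documents can mislead triage.
print(error_codes)
```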
Another approach is to translate the whole PCAP file into text, chunk it somehow, and then use it as a document in the LLM/RAG process. This approach proved ineffective for a few reasons: (1) the data remains raw even though it was translated into a text format; (2) while it is now raw text, it represents a language that the LLM was not trained on; (3) even if an LLM were trained (or fine-tuned) on the text, the text is fragmented and lacks key data points. Another way to put it: the data quality is poor.
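For contrast, here is roughly what the whole-file translation looks like in practice, using scapy's text dump and fixed-size chunking. The file name and chunk size are placeholders, and the arbitrariness of the chunk boundary is exactly the problem: packets from one flow end up scattered across chunks with no shared context.

```python
# Sketch of the "translate the whole PCAP to text and chunk it" approach.
from scapy.all import rdpcap

packets = rdpcap("session_capture.pcap")

# pkt.show(dump=True) returns the packet's text rendering instead of printing.
full_text = "\n".join(pkt.show(dump=True) for pkt in packets)

# Naive fixed-size chunking, as one might do for a RAG document store.
CHUNK_SIZE = 2000
chunks = [full_text[i:i + CHUNK_SIZE] for i in range(0, len(full_text), CHUNK_SIZE)]

# Failure mode: packets belonging to the same flow are split across chunks,
# the text is a protocol "language" the LLM was never trained on, and key
# correlations (e.g., request/response pairing) are lost at chunk boundaries.
print(f"{len(packets)} packets -> {len(chunks)} disconnected text chunks")
```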
Lessons from creating Large Language Models
Let's look at how an LLM is built (text only; we will leave multi-modal LLMs out for simplicity): you crawl the web, avoid blacklisted URLs (bad or poor-quality content), collect all the text, eliminate redundancy, further curate the content using various methods, and then structure it in one large text blob. Then you tokenize the text. Only then can you begin training your model. From this, we can easily see that the raw content is not useful for training as-is. If you decide to take an LLM base model and fine-tune it, the quality of the data becomes even more important: during fine-tuning, the training set must be near-perfect. This applies to PCAP data as well. It needs to be refined before you can use it with RAG, train a small language model to "speak PCAP", or fine-tune an LLM.
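A toy version of that pre-training pipeline is sketched below: filter blacklisted sources, deduplicate, concatenate into one blob, then tokenize. The URLs, pages, and word-level tokenizer are simplified placeholders; real pipelines use fuzzy deduplication and subword schemes such as BPE.

```python
# Toy pre-training data pipeline: filter blacklisted sources, deduplicate,
# concatenate into one large text blob, then tokenize.
import hashlib
import re

BLACKLIST = {"spam.example.com"}

crawled = [
    ("docs.example.com/page1", "Session setup uses a three-way handshake."),
    ("spam.example.com/page2", "CLICK HERE FOR FREE BANDWIDTH!!!"),
    ("mirror.example.com/page3", "Session setup uses a three-way handshake."),
]

seen_hashes = set()
corpus_parts = []
for url, text in crawled:
    if url.split("/")[0] in BLACKLIST:
        continue  # drop poor-quality sources
    digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
    if digest in seen_hashes:
        continue  # drop exact duplicates (real pipelines also dedup fuzzily)
    seen_hashes.add(digest)
    corpus_parts.append(text)

corpus = "\n".join(corpus_parts)  # one large curated text blob

# Simplistic word-level tokenization; real models use subword tokenizers.
tokens = re.findall(r"\w+|[^\w\s]", corpus.lower())
print(tokens[:10])
```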
Content refinement and curation
When you spend time refining the content, you make significant progress. The refinement involves parsing out key data points, correlating network flows, building the network topology, creating an ontology that describes the relationships between the various information elements in the data, and so on. In our version of tokenization, we perform feature engineering, and we use the resulting features to train a Machine Learning (ML) model. This lets us focus the output squarely on the root cause of any network issue; after all, that is the core job of this "troubleshooter". Imagine the refined context you now have. With it, you can combine other sources of curated documentation (e.g., industry standards, customer documentation, issue tickets) with the results of the troubleshooting. We combine these via RAG. The result is not just an answer about which errors were found: you get the root error, the root cause, insight into why the issue happened, and how to remediate it.
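To illustrate the shape of this refinement step (not our actual pipeline), the sketch below trains a classifier on features engineered from parsed, correlated flows, and uses its prediction as the root-cause signal that drives RAG retrieval. The feature names, labels, and training rows are hypothetical placeholders.

```python
# Sketch: engineer features from parsed, correlated flows and train an ML
# model to point at the likely root cause.
from sklearn.ensemble import RandomForestClassifier

# Each row: [retransmission_count, setup_time_ms, saw_5xx, interface_resets]
X_train = [
    [0, 45, 0, 0],
    [12, 900, 1, 0],
    [3, 120, 0, 4],
    [15, 1100, 1, 1],
]
y_train = ["healthy", "core_overload", "radio_link_instability", "core_overload"]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# A new, refined flow summary produced by the parsing/correlation stage.
new_flow = [[10, 850, 1, 0]]
root_cause = model.predict(new_flow)[0]

# The predicted root cause, not a raw error code, becomes the query that
# drives RAG retrieval over standards, vendor docs, and historical tickets.
print(f"Likely root cause: {root_cause}")
```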

Figure 1 – Process for producing troubleshooting results
Continuous feedback and refinement
Things change and evolve constantly, and networks are no exception. It is therefore important that the fidelity of the refinement process remains high. We achieve this via Active Learning and Similarity Analysis. These two capabilities allow network experts to weigh in when the system detects previously unseen data patterns. The similarity analysis compares a novel pattern with previously seen data flows to find the closest known match, and then proposes what the new pattern may be. This accelerates the experts' work as they decipher what is going on. The system takes their feedback and updates its model without the need for retraining.
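One simple way to realize such a loop, sketched below under assumed feature vectors, labels, and threshold: novel patterns are compared against a labeled memory by cosine similarity; close matches reuse the stored label, distant ones are routed to an expert with the nearest match as a proposal, and the expert's answer is appended to memory without retraining any model.

```python
# Similarity analysis with an active-learning loop over a labeled memory.
import numpy as np

memory_vectors = np.array([[0.0, 0.9, 0.1], [0.8, 0.1, 0.0]])
memory_labels = ["core_overload", "radio_link_instability"]
THRESHOLD = 0.95  # minimum cosine similarity to auto-label

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(pattern):
    global memory_vectors
    sims = [cosine(pattern, v) for v in memory_vectors]
    best = int(np.argmax(sims))
    if sims[best] >= THRESHOLD:
        return memory_labels[best]
    # Below threshold: propose the nearest known pattern and ask an expert.
    proposal = memory_labels[best]
    label = input(f"Unseen pattern, closest match '{proposal}'. Expert label: ")
    memory_vectors = np.vstack([memory_vectors, pattern])  # update, no retraining
    memory_labels.append(label)
    return label

print(classify(np.array([0.1, 0.85, 0.15])))  # close to a known pattern
```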
Looking forward to Agentic AI
As we enter 2025, there is excitement about Agentic AI. In network troubleshooting, Agentic AI can only be realized on top of a high-quality diagnosis of the network issues; otherwise, any AI agent will likely take incorrect action, possibly creating havoc in the network. Consequently, at B-Yond, our focus is on using the troubleshooting results from AGILITY as input to AI Agents that can perform automatic issue remediation. A step closer to the autonomous network!
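The hand-off could look as simple as the following sketch: a high-confidence root cause selects a remediation playbook, while low confidence escalates to a human instead of letting the agent act. All names here are hypothetical illustrations, not the AGILITY API.

```python
# Sketch of gating agentic remediation on diagnosis quality.
PLAYBOOKS = {
    "core_overload": "scale_out_amf_pool",
    "radio_link_instability": "adjust_handover_thresholds",
}

def remediate(diagnosis: dict) -> str:
    if diagnosis["confidence"] < 0.9:
        return "escalate_to_engineer"  # guard rail: never act on a weak diagnosis
    return PLAYBOOKS.get(diagnosis["root_cause"], "escalate_to_engineer")

print(remediate({"root_cause": "core_overload", "confidence": 0.96}))
```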
To learn more about B-Yond and AGILITY please contact us.