Why PCAPs are the Only and Best Source of Truth for Network Troubleshooting
Why is testing so important and what are the consequences of failure?
The 2022 Rogers nationwide outage in Canada was a significant event that revealed major vulnerabilities in the telecommunications network.
The problem originated from a network system failure that occurred immediately after a maintenance update in their core network. This seemingly routine update led to a malfunction in some of Rogers' routers, triggering a ripple effect that caused widespread disruptions.
Millions of customers across Canada felt the impact, experiencing interruptions not just in their personal communications but also in essential services. The outage lasted for over 12 hours, affecting various sectors including transit systems, emergency 911 services, banking systems, and general internet connectivity.
The incident was a huge blow to consumer confidence, casting doubt on the reliability of not only the Rogers brand but also its subsidiaries, including Fido, Cityfone, and Chatr. In response to the outrage and dissatisfaction expressed by its customers, Rogers was compelled to offer rebates, resulting in a financial hit estimated at $150 million.
The company recognized that it needed to take proactive measures to prevent such failures in the future.
Rogers has since pivoted towards the use of artificial intelligence (AI) to supervise and monitor its network. The scale of Rogers' investment in AI is significant, at CAD $10 billion. The introduction of AI aims to provide early detection of potential issues, allowing for quicker responses and potentially avoiding similar catastrophes in the future.
Why using PCAP (Packet Capture) is the gold standard for network traffic testing
PCAP (Packet Capture) files represent the gold standard for network testing and troubleshooting due to their ability to provide a detailed and accurate snapshot of network traffic. Capturing packets as they flow through the network, PCAP files provide a raw and unfiltered view of the entire communication process within the network.
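To make the "raw and unfiltered" point concrete, here is a minimal sketch of how packets sit inside a capture file. It assumes the classic libpcap file layout (not pcapng), little-endian byte order, and microsecond timestamps. The helper names `build_pcap` and `read_pcap` are illustrative, not part of any library.

```python
import struct

# Classic libpcap layout: a 24-byte global header, then one record per packet
# (a 16-byte record header followed by the captured bytes).
GLOBAL_HEADER = struct.Struct("<IHHiIII")  # magic, major, minor, thiszone, sigfigs, snaplen, linktype
RECORD_HEADER = struct.Struct("<IIII")     # ts_sec, ts_usec, incl_len, orig_len
PCAP_MAGIC = 0xA1B2C3D4                    # microsecond-resolution capture


def build_pcap(packets, snaplen=65535, linktype=1):
    """Serialize (ts_sec, ts_usec, payload) tuples into pcap bytes (linktype 1 = Ethernet)."""
    out = [GLOBAL_HEADER.pack(PCAP_MAGIC, 2, 4, 0, 0, snaplen, linktype)]
    for ts_sec, ts_usec, payload in packets:
        out.append(RECORD_HEADER.pack(ts_sec, ts_usec, len(payload), len(payload)))
        out.append(payload)
    return b"".join(out)


def read_pcap(data):
    """Yield (timestamp_in_seconds, raw_bytes) for every record in the capture."""
    magic = GLOBAL_HEADER.unpack_from(data, 0)[0]
    if magic != PCAP_MAGIC:
        raise ValueError("not a little-endian, microsecond-resolution pcap file")
    offset = GLOBAL_HEADER.size
    while offset < len(data):
        ts_sec, ts_usec, incl_len, _orig_len = RECORD_HEADER.unpack_from(data, offset)
        offset += RECORD_HEADER.size
        yield ts_sec + ts_usec / 1_000_000, data[offset:offset + incl_len]
        offset += incl_len


# Two made-up payloads stand in for real frames.
sample = build_pcap([(1, 0, b"\x01\x02\x03"), (1, 500_000, b"\xaa\xbb")])
records = list(read_pcap(sample))
```

Every record carries the bytes exactly as they were seen on the wire plus a timestamp, which is why the same file can be decoded by any tool that understands the format.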
Given the depth of information and insight they contain, PCAPs are the gold standard for network testing and interconnectivity troubleshooting. However, because of the complexity and scale of the data, packet capture analysis in production networks is typically reserved for the last 10%-20% of the troubleshooting cycle, when all else has failed.
- PCAPs are the only raw, vendor-agnostic standard for data capture, and the dominant structured format for decoding binary network data. Other data sources (PM, FM, CM) are less reliable, for the reasons outlined below.
Packet capture analysis provides unique capabilities in diagnosing network issues that are simply unattainable with other data sources. That’s why it’s such an indispensable tool for network troubleshooting.
PCAPs enable network engineers to pinpoint a more accurate root cause of a network problem. They allow a level of detail in analysis that simply doesn't exist elsewhere, going beyond surface-level symptoms to reveal underlying problems. This insight can lead to resolving certain parts of the troubleshooting process entirely, providing comprehensive solutions that are unachievable through other means.
Compared to alternative data sources, PCAPs are not vendor or OEM-specific, so standard procedures can be maintained even when switching vendors. Standardized procedures within PCAPs offer a reference against which the actual flow of messages can be checked, creating an efficient and accurate method of assessment. Other data sources may require cross-referencing against various counters to evaluate an issue, a process that can be cumbersome and less precise.
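Checking an observed message flow against a reference procedure can be sketched in a few lines. The example below is only an illustration: the reference flow (a basic SIP call setup) and the observed flow are hypothetical, and diffing the two sequences with Python's `difflib` is one simple way to surface deviations, not a description of any particular product's method.

```python
from difflib import SequenceMatcher

# Hypothetical reference flow for a basic SIP call setup.
REFERENCE_FLOW = ["INVITE", "100 Trying", "180 Ringing", "200 OK", "ACK"]


def compare_flow(observed, reference=REFERENCE_FLOW):
    """Return (operation, expected_messages, observed_messages) for each deviation."""
    matcher = SequenceMatcher(a=reference, b=observed, autojunk=False)
    return [
        (tag, reference[i1:i2], observed[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the parts that differ from the reference
    ]


# Hypothetical message sequence decoded from a capture of a failed call.
observed = ["INVITE", "100 Trying", "503 Service Unavailable", "ACK"]
deviations = compare_flow(observed)
```

Here `deviations` shows that the expected "180 Ringing" and "200 OK" were replaced by a "503 Service Unavailable", which points the engineer at the exact step where the flow departed from the procedure.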
Businesses that opt not to use PCAPs for network analysis often find themselves relying on data sources that aren't up to the task and leave them more vulnerable to network disruptions. The use of PCAPs, in contrast, offers a robust and reliable approach to maintaining network integrity.
PCAP: a benchmark amid mounting challenges
It’s clear that PCAP testing is the best way to ensure network stability, so why have PCAPs not reached ubiquitous adoption as the de facto troubleshooting data set in production networks?
This is due to several challenges, including the fact that modern network environments are becoming increasingly complex and dynamic.
Advanced Network Technologies
The integration of advanced technologies like virtualization, cloud computing, software-defined networking, and IoT adds layers of complexity to network configurations. These multifaceted structures require specialized understanding and tools, making network packet tracing a lot more difficult to achieve.
Higher Network Speeds and Volumes
With the exponential increase in the volumes of data generated across the network, streaming, capturing, processing, and assessing information in real time becomes a formidable task. Traditional packet tracing mechanisms may face bottlenecks or even drop packets, unable to keep up with the sheer quantity and speed of production network traffic. For these reasons, PCAPs are not captured continuously in production networks; rather, they are used reactively, as a last-mile data set for analysis and troubleshooting after network KPI degradation or subscriber impact has been observed.
- Customer Example: Suppose a client is struggling to reduce the size of a PCAP file (gigabytes or even terabytes) captured in a live setting. Their goal is to slice and dice signaling and user plane traffic for analysis and troubleshooting, which is no simple feat. The probable solution is to first save all the raw packets, then filter out a subset of the saved packets and create a PCAP file from it. Time-based filtering (for example, a 60-second PCAP of all C-plane traffic) may not be the best option, even though it offers the best opportunity to view all the traffic: at 2 Gbps, 60 seconds of C-plane data alone comes to roughly 15 GB. Adding user plane data on top increases the file size dramatically, making it virtually impossible to analyze.
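The 15 GB figure in the customer example follows from simple arithmetic: a 2 Gbps stream captured for 60 seconds. The sketch below assumes decimal units (1 GB = 10^9 bytes) and ignores the small per-packet overhead that the capture file format adds. The function name is illustrative.

```python
def capture_size_gb(rate_gbps, duration_s):
    """Approximate capture size in decimal gigabytes, ignoring file-format overhead."""
    total_bits = rate_gbps * 1e9 * duration_s
    return total_bits / 8 / 1e9  # bits -> bytes -> gigabytes


# 2 Gbps of C-plane traffic captured for 60 seconds.
c_plane_gb = capture_size_gb(2, 60)
```

The same function can be reused with the user plane rate of a given network to see how quickly a time-based capture grows once user traffic is included.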
Widespread Encryption Challenges
The rise of encryption in data and network protocol packets presents its own set of difficulties, including limited visibility into packet payloads such as NAS (Non-Access Stratum), decryption requirements for TLS and user plane encrypted messages, performance implications, and privacy and legal considerations. In such cases, PCAPs may struggle to provide adequate insights, and decrypting the information may lead to additional complications and potential legal risks.
Sensitive Content
Packet-level information can itself become a concern, both on the control plane, which exposes details of network topology, client locations, and peer contacts, and on the user plane, where the content is potentially the kind targeted by lawful-interception requests, specifically in cases of Emergency (E911) and Mission Critical services.
Diverse Traffic Types
Modern networks host a variety of applications and services, each with distinct characteristics and behaviors that need to be understood and interpreted accurately. A PCAP may not always contain data from all interfaces (Control plane and User Plane), therefore PCAPs need to be captured from multiple interfaces and stitched together for an end-to-end view of network behavior.
A capture taken on a single interface, domain, or layer may not be enough to recognize and analyze these varied traffic types, leading to potential misunderstandings of how different services interact within the network.
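Stitching captures from several interfaces into one end-to-end view amounts, at its simplest, to merging time-ordered packet streams. The following is a minimal sketch under stated assumptions: each capture is already sorted by time, the clocks at the capture points are synchronized, and the `Packet` fields and sample values are hypothetical.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Packet(NamedTuple):
    timestamp: float  # capture time in seconds
    interface: str    # hypothetical label for the capture point
    summary: str      # hypothetical one-line decode of the packet


def stitch(*captures: Iterable[Packet]) -> Iterator[Packet]:
    """Merge per-interface captures (each already time-sorted) into one timeline."""
    return heapq.merge(*captures, key=lambda p: p.timestamp)


control_plane = [
    Packet(0.10, "control", "session request"),
    Packet(0.40, "control", "session response"),
]
user_plane = [
    Packet(0.25, "user", "first data packet"),
    Packet(0.55, "user", "second data packet"),
]

timeline = list(stitch(control_plane, user_plane))
```

In practice, clock skew between capture points and correlating packets that belong to the same subscriber session are the hard parts; this sketch covers only the ordering step.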
The combination of these factors illustrates the challenges of using manual PCAP testing in today's intricate and fast-paced network landscapes. The traditional methods that may have been effective in simpler environments now struggle to adapt to the multifaceted nature of contemporary networks, underscoring the need for more advanced and automated tools capable of managing these complex scenarios.
What are the obstacles hindering the adoption of AI and ML-technology in the telecommunications industry?
The people problem
Effective PCAP analysis requires highly skilled engineers and data analysts, familiar with network protocols, analysis techniques, and tools. Finding and keeping such experts is hard, and turnover can be costly, causing "knowledge drain." When a skilled person leaves, they take unique insights that aren't easily replaced, impeding the successful implementation of PCAP analysis in telecommunications networks.
The process of implementing packet capture analysis
Beyond the human aspect, the process itself presents its challenges. Implementing PCAP analysis requires well-defined documentation along with clear, frequent communication and collaboration between network engineers and data analysts. The network's complexity requires a standardized process ensuring everyone knows their roles and goals. Without this cohesion, the analysis can become fragmented and less effective.
Combination of People and Process
The successful launch of new technologies and the enhancement of PCAP analysis programs require a combination of the right people and the right processes. Telecommunication companies must strike this delicate balance, ensuring that they not only have the skilled personnel in place but also the well-defined, collaborative processes that enable them to work effectively.
Redefining PCAP analysis with AI-driven tech
The integration of AI- and ML-driven technology into PCAP analysis represents a transformative shift. The benefits include:
Improved Network Health
By applying advanced algorithms and machine learning to PCAP analysis, engineers can glean more comprehensive insights into the network's operations. This, in turn, enables them to identify and resolve issues with increased speed and precision. The AI's ability to process vast amounts of data and detect patterns means that the underlying causes of network problems can be pinpointed and remedied more effectively.
Cost Savings and Improved Efficiency
Automation of protocol-level analyses, automatic detection of root errors, and instant assessment of network issues replace time-consuming manual efforts. This automation not only speeds up the process but also enhances accuracy. Considering that field and service operations account for 60-70% of most telcos' operating budgets, the ability to alleviate manual strain from PCAP analyses can dramatically increase field productivity, translating into substantial financial savings.
Enhanced Customer Service
By using AI and ML algorithms, it's possible to proactively identify and resolve network issues before customers even become aware of them. This preemptive problem-solving leads to fewer complaints and contributes to a more seamless user experience. Additionally, AI-powered solutions enable more personalized customer interactions, equipping support teams with information tailored to individual customer needs.
Increased Profitability
Top telco companies that have deployed these technologies have experienced a five-year revenue Compound Annual Growth Rate (CAGR) 2.1 times higher than that of their peers, along with a 2.5 times larger total return to shareholders. These figures underline the tangible financial benefits that AI-driven PCAP analysis can deliver, making it an attractive investment for companies looking to stay at the forefront of the industry.
AGILITY by B-Yond: Reshaping the future for 4G and 5G network operators
Enter AGILITY by B-Yond.
This groundbreaking platform offers a comprehensive suite of AI and ML-powered tools for network troubleshooting and analysis, focusing on efficiency, cost reduction, and accelerated deployment of new services for telecom operators worldwide.
Detect and Visualize End-to-End Call Flows
B-Yond's technology illustrates the sequence of messages on a ladder diagram, including protocol and network element level errors, and golden call flow comparison. This visualization helps in understanding the entire call flow process.
Isolate Network Failures More Quickly
AGILITY offers intelligent root cause identification, allowing for instant call flow classification as success vs failure. This helps to quickly isolate the network function where the error occurred, speeding up the troubleshooting process.
Detect the Root Cause of Network Errors
Through automated packet analysis and network failure knowledge bases, B-Yond's system can detect and identify errors, providing insights into the underlying causes of network issues.
Correlate Multiple Data Sources
The integration of machine learning and AI in network analysis enables AGILITY to correlate various data points for comprehensive analysis.
Integrated CI/CD/CT Pipeline
Benefits of AGILITY:
2x Accelerated Time-to-Market for New Services
AGILITY has transformed network testing, streamlining processes and significantly reducing time-to-market for new features. This acceleration helps operators stay competitive.
10x Increase in Test Capacity
The implementation of Continuous Validation (CV) and Continuous Testing (CT) with a comprehensive AI Repository has led to a 10x increase in test capacity, improving network quality and hastening resolution times.
75% Average RCA (Root Cause Analysis) Cost Reduction
Leveraging AI and ML, AGILITY reduces the time and costs of repeated investigations on recurring patterns of network failures, leading to significant cost savings.
AGILITY can be deployed on the B-Yond cloud as a SaaS solution or on an internal cluster with a simple, fully automated operator.
A promising new frontier for telecommunications is on the horizon. With the advent and expansion of 5G, the integration of AI and ML for PCAP analysis through AGILITY represents a pivotal advancement in network management.
By empowering telcos to scale testing, AGILITY ensures improved network health, significant cost savings, reduced OPEX, and increased profitability. This intelligent automation lays the foundation for robust connectivity and innovation, translating into a direct financial advantage.
In an era where customer expectations are high, seamless connectivity is paramount.
The adoption of AI and ML for PCAP analysis is not just an incremental improvement; it's a revolutionary step towards a future where network outages are a thing of the past, and telcos become partners in progress, driving technological advancement and societal well-being.
Are you ready to take a leap into the future of network testing and minimize network disruptions while saving money? Try AGILITY for free today at https://www.b-yond.com/agility-signup.
Interested in learning more? Click here to schedule your live guided demo of the AGILITY platform today.
Resources
- https://techsee.me/blog/artificial-intelligence-in-telecommunications-industry/
- https://www.netaxis.be/2023/05/04/artificial-intelligence-ai-is-rapidly-transforming-the-telecommunications-industry/
- https://technovert.com/blog/ai-in-telecom-automation-top-challenges-and-solutions/
- https://ts2.space/en/ai-in-the-future-of-ai-driven-telecommunication-networks-investing-in-technologies-for-high-speed-secure-connectivity/
- https://www.ctvnews.ca/business/rogers-to-spend-150-million-on-customer-credits-after-july-8-outage-1.6003851
- https://www.securityweek.com/all-eyes-pcap-gold-standard-traffic-analysis/