Network fault management

88% reduction in trouble ticketing
105 weeks of engineering capacity restored in Year 2
~90% of the solution was reusable IP for NOC operations

Background

A Communication Services Provider (CSP) with an international voice network wanted to maximise the quality of service to their customers whilst minimising the cost.

A set of network traffic monitoring systems reported incidents affecting traffic to specific destinations. In addition, a central routing engine helped engineers make changes to the routing as and when required.

Because correlating the alarms and evaluating the best routing alternatives was so complex, the customer's engineers struggled to keep up with the changes. They became overwhelmed, and their ability to apply fixes to the network diminished over time.

The fault management system had a huge list of alerts which were either not handled in time or were cleared by users without any analysis.

Objectives

The customer understood this to be a business-critical process that was steadily deteriorating: the wider business was growing around the team, but the department's capacity was not scaling to keep up with that growth.

An automation initiative was devised to increase the speed and accuracy with which this team could operate. Additionally, the business needed to evaluate and remedy all unhandled alerts, as well as those which were handled either slowly or incorrectly, to reduce the errors in the network. The end goal of the project was zero-touch automation of routing changes based on traffic alarms.

An internal automation team was selected to carry out the project, from knowledge extraction through to deployment in production. The team were not experienced in automation techniques and strategy, so equipping the customer to be self-sufficient was key to the selection of CORTEX. The customer needed to be able to make incremental changes and extend the implementation themselves, which in turn would continue to increase the value of the solution and provide robust long-term value for the investment.

Solution

CORTEX was integrated with key tools of the business, primarily the Fault Management System, so that it could gather the network alarms.

Using information collected from the inventory, CORTEX triggers diagnostics on the network through scripted task automation that the CSP had already implemented. Unlike other automation tools, CORTEX's legacy integration capability allowed this existing automation to be reused, saving time and money.

During the project it became clear that the operations team's documentation of the diagnostic process for a given alert type and its subtypes was inconsistent or missing. Whilst this can derail other automation projects, automation remained possible owing to the human-in-the-loop capability of CORTEX.

Where business-critical automation cannot fail and AI cannot yet be relied upon, a human is best placed to review an unhandled case and then rapidly deploy an automation flow to prevent the issue recurring. Human-in-the-loop dramatically reduces the time it takes to get solutions to production: you build what you can and then iterate as you learn or find edge cases. This is far more agile than the classic waterfall developments that often underpin automation for CSPs.
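The human-in-the-loop pattern described above can be illustrated with a minimal Python sketch. It is not CORTEX's API: the `Alarm` and `Dispatcher` classes, the alarm type names, and the `reroute` handler are all hypothetical, chosen only to show known alarm types being handled automatically while unknown ones are queued for an engineer, who can later register a new flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alarm:
    alarm_type: str   # hypothetical alert type label
    destination: str  # hypothetical affected destination


@dataclass
class Dispatcher:
    # Maps an alarm type to an automated handler; starts small and grows.
    handlers: Dict[str, Callable[[Alarm], str]] = field(default_factory=dict)
    # Alarms with no handler yet, waiting for an engineer to review.
    review_queue: List[Alarm] = field(default_factory=list)

    def register(self, alarm_type: str, handler: Callable[[Alarm], str]) -> None:
        self.handlers[alarm_type] = handler

    def dispatch(self, alarm: Alarm) -> str:
        handler = self.handlers.get(alarm.alarm_type)
        if handler is None:
            # No automation exists yet: hand the case to a human.
            self.review_queue.append(alarm)
            return "escalated to engineer"
        return handler(alarm)


def reroute(alarm: Alarm) -> str:
    # Placeholder for a real routing change (assumption for illustration).
    return f"rerouted traffic for {alarm.destination}"


dispatcher = Dispatcher()
dispatcher.register("quality_drop", reroute)

first = dispatcher.dispatch(Alarm("quality_drop", "dest-A"))      # automated
second = dispatcher.dispatch(Alarm("unknown_pattern", "dest-B"))  # escalated

# After reviewing the escalated case, an engineer deploys a new flow.
dispatcher.register("unknown_pattern", reroute)
third = dispatcher.dispatch(Alarm("unknown_pattern", "dest-B"))   # now automated
```

The design point is that the handler table can start incomplete: repetitive cases are automated on day one, and each escalation is an opportunity to add another flow.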

Because the solution immediately addressed the bulk of the highly repetitive situations, it created capacity to begin addressing the others. This not only enabled our client to achieve a shorter development and testing cycle, but also allowed them to go on gradually delivering automation of rarer problems and edge cases into production.

CORTEX's scalability allowed every alarm to be addressed in real time, clearing the backlog of alerts that had previously been handled late or cleared without analysis.

Later in the implementation, the existing Fault Management System was enhanced to receive alerts from a wider list of domains, increasing the reach of the automation. CORTEX's flexibility meant existing automations could be reused with only minor rework, so these new alarms were quickly brought under management.

The inventory system exposed additional services for automation so that CORTEX could run Service Impact Analysis on demand, minimise the number of tickets raised, and populate each trouble ticket with useful information.

Results from the automation were logged in CORTEX and also updated in the Fault Management System, which enabled users to follow the progress of the automation, intervene as required, and manage any changes the automation needed.

Outcomes

The Customer achieved all of their objectives and more.

The Fault Management System was enhanced to receive additional alerts from a wider list of domains, all of which the customer self-serviced in line with their original objectives.

One year after the first deployment, more than 50,000 events had been managed by the automation and distilled down to 6,000 automatically generated, actionable tickets, allowing for greater oversight of issues arising.
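As a quick arithmetic check, the event and ticket counts above can be turned into a percentage reduction. The sketch below assumes the reduction is measured as tickets raised relative to events managed, which is one plausible reading and is not stated explicitly in the case study; the variable names are illustrative.

```python
# Figures quoted in the case study (the "more than 50,000" is treated as 50,000).
events_managed = 50_000
tickets_raised = 6_000

# Share of events that did NOT become a ticket.
reduction = 1 - tickets_raised / events_managed
reduction_pct = round(reduction * 100)

print(f"{reduction_pct}% reduction")
```

Under that assumption the result comes out at 88%, which is consistent with the headline trouble-ticketing figure.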

The automation saved 85,000 minutes of manual effort, with the potential to increase this significantly (an estimated 3x) during the following year. This freed up NOC engineers to take on work that would not have been possible before CORTEX was implemented.

The NOC engineers were trained in automation skills and given a framework for delivering successful automation that yields value to the business.

The automation initiative has brought together the people, processes, and technology required for success: people able to implement and support automation, and appropriate processes underpinned by the right technologies.