OLO Network Traffic Management

90% reduction in time spent diagnosing faults in the NOC
88% of OLO faults automatically resolved

Background

Like many CSPs, the customer wanted to maximise the quality of their international voice network services whilst minimising costs.

They had a classic ticketing model supported by engineers, who worked through queues of tickets.

Some automation already existed. A set of network traffic monitoring systems reported incidents and created tickets for issues observed with OLOs affecting network traffic in and out of specific destinations or geographies. This information was shared with engineers, and a central routing engine helped them make changes to the routing as and when required.

However, stuttering connections created many false alarms, and engineers had to correlate all of the relevant alarms and then evaluate the best routing alternatives. The customer's engineers found it difficult to keep up with the changes and became overwhelmed. This materially impacted their ability to apply fixes: depending on the severity of the fault, it could take more than 30 minutes per alarm to address.

The end result was thousands of frustrated customers, and the problem continued to scale. The CSP needed to make a change or risk losing those same customers.

Objectives

An automation initiative was devised to increase the speed at which faults were diagnosed and changes made, and to reduce errors in evaluation and execution, i.e. bad decisions which didn't resolve the original issue.

Ultimately, the customer wanted to achieve a zero-touch automation of the routing changes based on traffic alarms.

To carry out the project, an internal automation team was selected. This team had limited experience in automation and needed to rely on CORTEX from inception to the initial deployment into production. After the initial work was completed, the team were then required to carry out small and incremental changes as necessary without reliance on any external party.

Solution

The team from We Are CORTEX designed a three-stage implementation of the route fault management.

Firstly, the solution gathered the alarms coming from the Traffic Management System. These alarms, amongst others, indicate low Network Efficiency Ratio (NER), low Average Call Duration (ACD) and low Answer Seizure Ratio (ASR). False alarms caused by stutters were ignored or correctly collated for more effective analysis.
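The collation of stuttering alarms can be pictured with a minimal Python sketch. It is an illustration only, not the CORTEX implementation: the `Alarm` fields, the `collate_alarms` helper and the 300-second window are all assumptions made for the example. Repeated alarms for the same destination and KPI that arrive within the window are merged into one incident instead of producing one ticket each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    # Hypothetical alarm shape; the real Traffic Management System format is not given.
    destination: str
    kpi: str          # e.g. "ASR", "NER" or "ACD"
    raised_at: float  # seconds since some reference time


def collate_alarms(alarms, window_s=300.0):
    """Merge repeated alarms for the same (destination, kpi) into one incident
    when each arrives within window_s seconds of the previous one."""
    incidents = []
    latest = {}  # (destination, kpi) -> index of the most recent incident
    for alarm in sorted(alarms, key=lambda a: a.raised_at):
        key = (alarm.destination, alarm.kpi)
        idx = latest.get(key)
        if idx is not None and alarm.raised_at - incidents[idx]["last_seen"] <= window_s:
            # Same fault still stuttering: extend the existing incident.
            incidents[idx]["last_seen"] = alarm.raised_at
            incidents[idx]["count"] += 1
        else:
            incidents.append({
                "destination": alarm.destination,
                "kpi": alarm.kpi,
                "first_seen": alarm.raised_at,
                "last_seen": alarm.raised_at,
                "count": 1,
            })
            latest[key] = len(incidents) - 1
    return incidents


sample = [
    Alarm("DestX", "ASR", 0.0),
    Alarm("DestX", "ASR", 60.0),
    Alarm("DestX", "ASR", 120.0),
    Alarm("DestY", "NER", 30.0),
    Alarm("DestX", "ASR", 1000.0),
]
incidents = collate_alarms(sample)
```

With this sample, the three closely spaced ASR alarms for `DestX` collapse into a single incident, while the later one at 1000 seconds starts a new incident.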

The CORTEX solution would then gather the call records for the faulty period and the specific destination. These records contained all the relevant information from each call, and they were analysed to identify multiple failure scenarios ranging from mass call events to partial and whole route failures, including tight loops and specific route issues. These diagnostics were done by the CORTEX platform at machine speed, resulting in a 90% reduction in diagnosis time compared with the original, largely manual system.
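As a rough illustration of this kind of call-record analysis, the Python sketch below computes ASR, NER and ACD per route using their usual definitions (answered calls over seizures, user-reached calls over seizures, and mean duration of answered calls). The record fields (`route`, `outcome`, `duration_s`), the outcome labels, the 20% ASR floor and the classification rule are assumptions for the example; the diagnostics actually used in the project are not described in detail and were certainly richer.

```python
from collections import defaultdict

# Outcomes counted as "network did its job" for NER (assumed labels).
USER_OUTCOMES = {"answered", "busy", "no_answer"}


def route_kpis(records):
    """Return {route: {"asr": %, "ner": %, "acd": seconds}} from call records."""
    stats = defaultdict(lambda: {"seizures": 0, "answered": 0, "user": 0, "duration": 0.0})
    for rec in records:
        s = stats[rec["route"]]
        s["seizures"] += 1
        if rec["outcome"] in USER_OUTCOMES:
            s["user"] += 1
        if rec["outcome"] == "answered":
            s["answered"] += 1
            s["duration"] += rec["duration_s"]
    kpis = {}
    for route, s in stats.items():
        kpis[route] = {
            "asr": 100.0 * s["answered"] / s["seizures"],
            "ner": 100.0 * s["user"] / s["seizures"],
            "acd": s["duration"] / s["answered"] if s["answered"] else 0.0,
        }
    return kpis


def classify_route(kpi, asr_floor=20.0):
    """Very simplified, hypothetical rule: no answered calls is treated as a
    whole route failure, a low ASR as a partial one."""
    if kpi["asr"] == 0.0:
        return "whole_route_failure"
    if kpi["asr"] < asr_floor:
        return "partial_route_failure"
    return "ok"


def call(route, outcome, duration_s=0.0):
    return {"route": route, "outcome": outcome, "duration_s": duration_s}


records = (
    [call("A", "answered", 60.0), call("A", "answered", 120.0), call("A", "busy"), call("A", "failed")]
    + [call("B", "failed"), call("B", "failed")]
    + [call("C", "answered", 30.0)] + [call("C", "failed") for _ in range(9)]
)
kpis = route_kpis(records)
diagnosis = {route: classify_route(k) for route, k in kpis.items()}
```

A real diagnosis would also need to distinguish mass call events and tight loops, which this sketch does not attempt.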

Phase one of the implementation focused on fault identification. Once the fault was identified, the system raised a ticket, adding the relevant information, and the user would manually fix the faults through the Central Routing Engine (CRE).

In Phase two, the user would log onto CORTEX and would be able to approve the mitigation action which the system would automatically implement.

For the final phase, the faults would be resolved without any user interaction. This would be a 100% automated solution to the problem.
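The three phases can be read as three operating modes of one fault handler. The Python sketch below is a hedged illustration of that idea, not the delivered design: the `Mode` names and the `raise_ticket`, `apply_fix` and `approve` callables are hypothetical stand-ins for the ticketing system, the Central Routing Engine and the user approval step. Falling back to a ticket when a user declines the action is also an assumption.

```python
from enum import Enum


class Mode(Enum):
    TICKET_ONLY = 1       # phase one: identify the fault, user fixes it manually
    APPROVE_THEN_FIX = 2  # phase two: user approves, system applies the fix
    ZERO_TOUCH = 3        # final phase: no user interaction


def handle_fault(fault, mode, raise_ticket, apply_fix, approve=None):
    """Route a diagnosed fault according to the current automation mode."""
    if mode is Mode.TICKET_ONLY:
        raise_ticket(fault)
        return "ticket_raised"
    if mode is Mode.APPROVE_THEN_FIX:
        if approve is not None and approve(fault):
            apply_fix(fault)
            return "fixed_after_approval"
        # Assumed fallback: an unapproved action is left for manual handling.
        raise_ticket(fault)
        return "ticket_raised"
    apply_fix(fault)
    return "fixed_automatically"


# Usage with in-memory stand-ins for the external systems.
tickets, fixes = [], []
fault = {"destination": "DestX", "diagnosis": "partial_route_failure"}

r1 = handle_fault(fault, Mode.TICKET_ONLY, tickets.append, fixes.append)
r2 = handle_fault(fault, Mode.APPROVE_THEN_FIX, tickets.append, fixes.append, approve=lambda f: True)
r3 = handle_fault(fault, Mode.APPROVE_THEN_FIX, tickets.append, fixes.append, approve=lambda f: False)
r4 = handle_fault(fault, Mode.ZERO_TOUCH, tickets.append, fixes.append)
```

Keeping the mode as a single switch means the same diagnosis logic can be reused unchanged as the customer moves from one phase to the next.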

Outcomes

The client achieved complete success with this project and, importantly, can continue to expand and re-use the solution across its operational estate.

By significantly reducing the time to diagnose issues, and by removing superfluous alerts, CORTEX helped this client reduce the event horizon by 90%.

Importantly, CORTEX then helped the client improve incident response times, reducing them from ~30 minutes to near machine speed, i.e. seconds.

Ultimately, the client achieved nirvana by having faults resolved without manual intervention.

Expensive, much slower human resources could be redeployed to higher-value activities. The engineers appreciated being focused where they added more value, and being taken away from frustrating, chaotic event management scenarios.