Technologies to automate IT systems and relieve over-stretched IT operations teams have been moving into the mainstream over the last few years. Several factors, driven by the digital era, have made this necessary. Firstly, digital transformation is creating ever-larger IT environments and volumes of data that cannot be managed by manual processes. These distributed systems are also becoming more complex, incorporating IoT, mobile, multi-cloud, containers, and APIs. Moreover, for digital businesses, the financial impact of an outage makes time to resolution critical. Identifying and remediating issues before they affect the user is now paramount. AIOps provides intelligence to the IT operations team that allows them to proactively resolve events before they become outages.
Augmenting IT Operations with AIOps
AIOps allows IT operations teams not only to ensure observability of their systems and reduce noise, but also to understand how events interact to affect performance and to take corrective action quickly. The primary features of AIOps are:
- Noise reduction. AIOps ingests systems data, surfaces priority anomalies, and correlates them. This brings the number of incidents to investigate back down to a level that humans can manage. Rackspace recently announced that AIOps helped it reduce alert noise by 99% during the initial stage of its rollout. Successful vendor references typically cite similar figures, between 95% and 99%.
- Root cause analysis. Once priority events have been correlated, AIOps identifies a root cause so that the operations team can focus its efforts on a resolution. Given the complexity of today’s systems, a human operator would struggle to perform this task at speed.
- Proactive response. A range of responses is available with AIOps, from directing issues to the appropriate people, to recommending actions that can be taken by operators directly in a collaboration tool, to rules-based workflows performed automatically, such as spinning up additional AWS EC2 instances.
- Learning. By evaluating past failures and successes, AIOps can learn over time which events are likely to become critical and how to respond to them. This brings us closer to the dream of NoOps, where operations are completely automated.
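To make the noise reduction step more concrete, here is a minimal sketch of one simple way alerts could be correlated into incidents: alerts from the same service that arrive close together in time are grouped. The `Alert` and `Incident` types, the `service` field, and the five-minute window are all assumptions made for illustration; commercial AIOps platforms use far richer signals, such as topology and learned patterns.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    service: str         # hypothetical field: the component that raised the alert
    message: str
    timestamp: datetime


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts by service, starting a new incident whenever the gap
    since the previous alert for that service exceeds `window`."""
    incidents = []
    latest = {}  # service -> most recent Incident for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.service)
        if current is not None and alert.timestamp - current.alerts[-1].timestamp <= window:
            current.alerts.append(alert)
        else:
            current = Incident(service=alert.service, alerts=[alert])
            incidents.append(current)
            latest[alert.service] = current
    return incidents


def noise_reduction(alerts, incidents):
    """Fraction of alerts that no longer need to be triaged individually."""
    if not alerts:
        return 0.0
    return 1 - len(incidents) / len(alerts)
```

With this grouping, five raw alerts that collapse into three incidents would count as a 40% reduction. The much higher figures quoted by vendors depend on correlating across services and suppressing known-benign events, which this sketch does not attempt.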
The Impact of COVID-19 on IT Operations
The Ecosystm Digital Priorities in the New Normal study, launched this month, asks technology users how their digital priorities have shifted during the pandemic. Despite pressure to shift to digital delivery, almost 40% of participants reported that their organisations cut headcount in the IT department (Figure 1). Furthermore, more than one-third had been forced to cut their employees’ salaries. As we have seen in previous crises, IT operations teams are being asked to do more with less and will need automation to bridge the gaps.
As we begin to move into the next phase of the COVID-19 reality and businesses continue to open, we will see many launch digital services that were conceived of during the crisis. One of the greatest challenges that IT departments face will be scalability as digital businesses grow. AIOps will be a go-to tool for IT operations to ensure uptime and improve user experience. It is likely that the next 12-18 months will be a watershed moment for AIOps.
NLP and the Democratisation of Data
Natural Language Processing (NLP) will be the next string in the bow of AIOps. While the ultimate goal of IT operations is to identify and remediate situations before they have an impact on the user, it is often the service desk that generates the initial barrage of alerts. AIOps equipped with NLP can extract relevant data from user tickets, correlate them with other system events and potentially even suggest a resolution to the user. Here, ChatOps can help to reduce the workload on the service desk and bring relevant events to the attention of the operations team faster. NLP will also help democratise IT operations data within the organisation. As they digitalise, lines of business (LoBs) besides IT will need access to system health and user experience data, but business managers may not have the technical skills needed to extract them. Chatbots that can return these metrics to non-technical users will begin to proliferate.
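The idea of linking tickets to system events can be illustrated with a deliberately simple sketch. It uses plain keyword matching as a stand-in for NLP, which a real product would replace with a trained language model. The service names, the `(service, description)` event format, and both function names are hypothetical.

```python
import re

# Hypothetical service names; a real deployment would take these from its own topology.
KNOWN_SERVICES = {"vpn", "email", "payroll"}


def extract_services(ticket_text):
    """Return the known service names mentioned in a free-text ticket."""
    words = set(re.findall(r"[a-z]+", ticket_text.lower()))
    return words & KNOWN_SERVICES


def match_events(ticket_text, events):
    """Return the system events whose service is mentioned in the ticket.

    `events` is assumed to be a list of (service, description) tuples.
    """
    mentioned = extract_services(ticket_text)
    return [event for event in events if event[0] in mentioned]
```

A ticket reading "I can't connect to the VPN" would be matched to any open event tagged `vpn`, giving the operations team an early signal that a user-facing problem and a system event may be related.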
Most IT departments will have discovered the limitations of their current systems during the upheaval caused by recent lockdowns. Only about 7% of organisations in our study reported that they were well-prepared across all areas of IT to handle the COVID-19 crisis. For those organisations that have yet to invest in AIOps, we recommend starting now but starting small. Develop a topology map to understand where you have reliable data sources that could be analysed by AIOps. Then select a domain by assessing the present level of observability and automation, IT skills gap, frequency of outages, and business criticality. As you add domains and the system learns, the value you realise from AIOps will grow.
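One way to structure the domain selection is a simple weighted scorecard, sketched below. The five criteria come from the recommendation above; the equal weights, the 1-5 scale, and the convention that a higher score means a stronger case for piloting AIOps in that domain are assumptions that each organisation would need to set for itself.

```python
# Criteria are taken from the text; the equal weights are an illustrative assumption.
WEIGHTS = {
    "observability": 0.2,
    "automation": 0.2,
    "skills_gap": 0.2,
    "outage_frequency": 0.2,
    "business_criticality": 0.2,
}


def score_domain(scores, weights=WEIGHTS):
    """Weighted sum of 1-5 scores, where higher means a stronger case for a pilot."""
    return sum(weight * scores[criterion] for criterion, weight in weights.items())


def rank_domains(domains):
    """Return domain names ordered from strongest to weakest candidate."""
    return sorted(domains, key=lambda name: score_domain(domains[name]), reverse=True)
```

Calling `rank_domains` with a dictionary that maps each candidate domain to its five scores returns the domains in priority order, which can serve as a starting point for discussion rather than a final decision.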
The power of collaborative AIOps tools became undeniable as the COVID-19 crisis began and IT departments were forced to work in a distributed manner. When evaluating a system, carefully consider how it will integrate into your organisation’s preferred collaboration suite, whether it be the AIOps vendor’s proprietary situation tool or a third-party provider like Slack or Microsoft Teams. The ability of operations teams to collaborate effectively reduces time to resolution.