From chaos to proactivity with cross-incident analysis

By Eduardo Crespo, VP EMEA, PagerDuty.


As systemic complexity grows, companies must use cross-incident analysis to do a better job of identifying patterns across their technology, tools, and teams, enabling continuous improvement and the development of more resilient systems. Without the right analysis, problems within digital operations will cause increasingly complex and troublesome issues, given the number of integrations and dependencies involved.

Organisations often find themselves in the unenviable position of struggling to balance the day-to-day needs of the business with their desired large-scale, long-term digital transformations. Switching from fighting fires to modernising operations is hard. That’s why codified processes and new software categories have grown up to replace ad hoc or in-house incident management and digital operations monitoring processes. At the peak of digital maturity sits the ability to use cross-incident analysis to drive learning and improvement.

Modern problems require modern solutions

It is no longer sustainable to expect your tech team to manage every new technology, integration, and change without the support of automation. The recent PagerDuty 2024 State of Digital Operations Report uncovered that organisations are seeing a 16% YoY increase in incident volume, indicating rising operational complexity and risk. This is being recognised: 57% said their IT operations budgets would rise in the new year. Part of this is an acknowledgement that AI is becoming indispensable, as 71% of leaders say budgets are growing for AI and machine learning, and 76% plan to automate IT and business operations workflows.

If an organisation hasn’t already moved to the highest stages of operational maturity, then these added opportunities - and challenges - will not make the process easier. As a recap, the five stages of operational maturity are manual, reactive, responsive, proactive, and preventative. We tend to see organisations with higher digital maturity in their operations acknowledge incidents and mobilise responders faster, resolve incidents faster, and experience fewer hours of monthly downtime, by a double-digit margin.

Achieving higher digital maturity relies on modernising processes, helping staff train and supporting them with automation, and methodically transitioning from any traditional ops management practices that would now be seen as reactive, or even chaotic. Automation is key, supporting human expertise by removing ‘noise’ and allowing talent to focus where it’s most needed - on strategy and customer impact.

Learning lessons: Moving from chaos to proactivity

Firstly, getting the right mind-set in place puts organisations on track to use their people, processes, assets, and technologies more intelligently. As a foundation, any stigma around incidents should be replaced with an acceptance that it is better to get ahead of them early, dealing with them transparently and without emotion. It’s more adaptive for organisations not to aim to eliminate incidents, but to see them as part of the process of improvement. Focussing on the process of solving and learning will increase the productivity of the business as well as the employee experience of the ops team at the sharp end.

Secondly, how organisations collate data from the disparate systems involved in incident response has a significant impact on how well they build expertise and make improvements. Ensuring the right data is on hand and can be interrogated enhances organisational insight - a competitive advantage that allows rapid and effective change.

Thirdly, understanding how to identify patterns using infrastructure data allows for continuous improvement and the development of both resilient systems and teams. Given the data volumes at play in even a modest tech stack, automation is again critical throughout every stage of this process.
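
To make the idea of pattern-finding concrete, here is a minimal sketch in Python of one very simple form of cross-incident analysis: grouping incident records by affected service and tagged cause, then counting how often each combination recurs and how long it takes to resolve on average. The record fields (service, cause, opened, resolved), the function names, and the sample data are all hypothetical and chosen for illustration only. They are not taken from any particular incident management product, and real analysis would draw on far richer data.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # All fields are hypothetical, for illustration only.
    service: str        # affected service
    cause: str          # contributing factor tagged during the incident review
    opened: datetime
    resolved: datetime


def summarise(incidents):
    """Group incidents by (service, cause); return count and mean time to resolve."""
    groups = defaultdict(list)
    for inc in incidents:
        groups[(inc.service, inc.cause)].append(inc.resolved - inc.opened)
    summary = {}
    for key, durations in groups.items():
        mean = sum(durations, timedelta()) / len(durations)
        summary[key] = (len(durations), mean)
    return summary


def recurring(summary, min_count=2):
    """Return (key, stats) pairs seen at least min_count times, most frequent first."""
    hits = [(key, stats) for key, stats in summary.items() if stats[0] >= min_count]
    return sorted(hits, key=lambda item: item[1][0], reverse=True)


# Invented sample records, not real data.
T0 = datetime(2024, 1, 1, 9, 0)
SAMPLE = [
    Incident("checkout", "config change", T0, T0 + timedelta(minutes=30)),
    Incident("checkout", "config change",
             T0 + timedelta(days=7), T0 + timedelta(days=7, minutes=50)),
    Incident("search", "capacity",
             T0 + timedelta(days=2), T0 + timedelta(days=2, minutes=20)),
]

if __name__ == "__main__":
    for (service, cause), (count, mean) in recurring(summarise(SAMPLE)):
        print(f"{service} / {cause}: {count} incidents, mean time to resolve {mean}")
```

Even a grouping this basic depends on each incident having been reviewed and tagged consistently, which is why the quality of individual incident analysis matters before any cross-incident view is attempted.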

Embedding a learning and improvement culture

Cross-incident analysis works best once the organisation has reached a high level of operational maturity in its processes and feels confident in the depth of its analysis of each individual incident. Without proper hygiene in individual incident analysis, cross-incident data will be inconclusive. Both the use of automation in supporting the humans at the centre of the ops management process, and the use of AI to predict and rapidly react to changes, are core to the modern digital business. This is how insights are surfaced, shared, and acted on in the timeframes organisations demand.

But to underline the point, codifying the understanding and mind-set that setbacks are a sign of progress - not failure - is as important as the technology side. If organisations are to ensure the ops team works as effectively as it needs to in this fast-paced and complex environment, people matter as much as their incident management and analysis solutions. Assessing and acting on the team's recommendations requires trust from the business. Making changes to the tech stack always carries risk, but where decisions can be backed up with data, then tracked and evaluated, leaders have more context to be confident.

Organisations looking to build for success and growth will do so in part by learning the lessons of their incidents - a challenge that is part people, part process, part technology, and part mind-set. Digital operations are at the heart of every part of business delivery, but digital plus human skill is the most powerful combination.
