How machine learning can enable more consistent IT change success

Many software failures can be traced back to changes recently delivered through CI/CD. Research from Gartner finds that as many as 85% of all IT problems stem from recently implemented changes.

Reducing the risk of negative outcomes after a change requires diligent change management practices. However, excessive gating around change approvals slows down the CI/CD process, decreasing the velocity of value delivered to customers. Organizations, therefore, need a way to enable change success and reduce the risk of change failures while keeping the flow of releases constant.

This is a situation where machine learning (ML) and AI-powered analytics can excel. Using historical data, ML models can correlate change-related factors with past defects and failures, revealing which factors best predict high change risk. On the level of individual changes, ML models can let organizations know which changes to approve and which to scrutinize further, based on specific factors like the configuration items (CIs) impacted. On the more global level of change processes in general, modeling and KPI feedback allow organizations to address systemic causes of change failure, leading to higher-quality change deployments capable of consistently delivering more value to customers and end-users.

In this way, machine learning empowers a feedback loop where the organization’s own data gives it the insights needed to take action. These insights allow the company to address root causes of change failures and, in turn, implement needed improvements to people, processes, and technology.

ML modeling isolates the biggest change risk factors

Unfortunately, there are no “silver bullets” when it comes to reducing the number of failed changes and the volume of change-related problems for all organizations at once. There are a number of best practices worth following, such as ensuring broad automated testing coverage. But when it comes to individual organizations, every one has different root causes behind change failures — as well as different goals and objectives that define success.

To uncover the patterns in when and why changes tend to fail in a given organization, ML algorithms can be given access to historical data from primary systems of record. Data can be sourced from Application Performance Monitoring (APM), issue tickets, release orchestration tools, deployment platforms, and so forth. Using this data, ML will discover patterns, which are unique to each organization, in why certain changes tend to fail more often than others.

Example correlative factors that could be surfaced include:

  • Low test coverage
  • Negative test outcomes
  • Poor code quality
  • Change category
  • Assignment group building the change for deployment
  • CIs impacted
  • CI dependencies

Data can also be included from after deployment, including automated APM alerts signaling performance degradation and customer-reported problems.
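To make the idea of surfacing correlative factors concrete, here is a minimal sketch in plain Python. It is not a trained ML model: it simply compares the failure rate of changes that share a factor value against the overall failure rate (often called "lift"). The records, field names, and the failure_lift helper are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical historical change records; field names are illustrative only.
HISTORY = [
    {"low_test_coverage": True,  "category": "database", "failed": True},
    {"low_test_coverage": True,  "category": "network",  "failed": True},
    {"low_test_coverage": False, "category": "database", "failed": True},
    {"low_test_coverage": False, "category": "network",  "failed": False},
    {"low_test_coverage": False, "category": "network",  "failed": False},
    {"low_test_coverage": True,  "category": "network",  "failed": False},
]


def failure_lift(records, outcome="failed"):
    """Return {(factor, value): lift}.

    lift = failure rate among changes with that factor value,
           divided by the overall failure rate.
    A lift above 1.0 means the factor value co-occurs with more failures
    than average; it is a correlation, not proof of cause.
    """
    if not records:
        return {}
    overall = sum(r[outcome] for r in records) / len(records)
    if overall == 0:
        return {}
    counts = defaultdict(lambda: [0, 0])  # (factor, value) -> [failures, total]
    for r in records:
        for factor, value in r.items():
            if factor == outcome:
                continue
            bucket = counts[(factor, value)]
            bucket[0] += int(r[outcome])
            bucket[1] += 1
    return {key: (fails / total) / overall
            for key, (fails, total) in counts.items()}


lifts = failure_lift(HISTORY)
# Highest-lift factor values first.
ranked = sorted(lifts.items(), key=lambda kv: kv[1], reverse=True)
for (factor, value), lift in ranked:
    print(f"{factor}={value}: lift {lift:.2f}")
```

With so few records the numbers mean little. A real model would need far more history and would have to account for factors that overlap with one another.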

Once patterns are established, the ML model can reveal in a dashboard which changes have a high risk of failure based on the presence of factors correlated with past change-related issues. Specific changes in the CI/CD or CAB queue can be highlighted in green, yellow, orange, or red depending on the presence of modeled risk factors. Using this information, change groups or the CAB can quickly know, at a glance, not just that a change could be risky but why it’s considered risky.
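The color coding described above could be sketched as follows. The factor weights, the band cut-offs, and the assess function are assumptions made for this example, not values taken from any product. In practice the weights would come from the fitted model. Integer points are used so that band boundaries are not affected by floating-point rounding.

```python
# Hypothetical per-factor weights in points (they sum to 100).
FACTOR_WEIGHTS = {
    "low_test_coverage": 30,
    "failed_tests": 35,
    "poor_code_quality": 15,
    "high_risk_category": 10,
    "many_ci_dependencies": 10,
}

# Assumed cut-offs: a score below the bound gets that color; anything else is red.
BANDS = [(25, "green"), (50, "yellow"), (75, "orange")]


def assess(change_factors):
    """Return the score, color band, and contributing factors for one change."""
    present = [f for f in FACTOR_WEIGHTS if change_factors.get(f)]
    score = sum(FACTOR_WEIGHTS[f] for f in present)
    color = "red"
    for upper, name in BANDS:
        if score < upper:
            color = name
            break
    # List the heaviest factors first so reviewers see why the change is risky.
    reasons = sorted(present, key=lambda f: FACTOR_WEIGHTS[f], reverse=True)
    return {"score": score, "color": color, "reasons": reasons}


print(assess({"low_test_coverage": True, "failed_tests": True}))
```

Returning the reasons alongside the color is what lets a reviewer see why a change is flagged, not just that it is.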

ML change risk modeling allows teams to focus on the changes that are most likely to cause problems or fail entirely. “Red flag” factors that emerge can be addressed to bring the risk of the change down to an acceptable level. Approval for lower-risk changes can be automated, hastening not just CI/CD but the creation of value for customers.
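As a rough illustration of automating approval for lower-risk changes, the sketch below splits a queue into auto-approved changes and changes that need review. The Change class, the route function, and the 0.2 threshold are hypothetical; the risk score is assumed to come from a model like the one described above.

```python
from dataclasses import dataclass


@dataclass
class Change:
    change_id: str
    risk_score: float  # assumed scale: 0.0 (safe) to 1.0 (risky)


AUTO_APPROVE_BELOW = 0.2  # hypothetical threshold; each organization would tune its own


def route(changes, threshold=AUTO_APPROVE_BELOW):
    """Split changes into (auto_approved, needs_review).

    The review list is sorted riskiest first so reviewers start
    with the changes most likely to cause problems.
    """
    auto, review = [], []
    for change in changes:
        (auto if change.risk_score < threshold else review).append(change)
    review.sort(key=lambda c: c.risk_score, reverse=True)
    return auto, review


queue = [
    Change("CHG-1", 0.05),
    Change("CHG-2", 0.60),
    Change("CHG-3", 0.35),
    Change("CHG-4", 0.10),
]
auto, review = route(queue)
print("auto-approved:", [c.change_id for c in auto])
print("needs review: ", [c.change_id for c in review])
```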

Other actions can be taken in light of signals of high possible risk, including:

  • Instituting preemptive rollback procedures
  • Having the right group of subject matter experts on call
  • Scheduling to deploy the change after business hours

Overall, building a change risk model allows IT leaders to not just be aware and informed but also ready to take the appropriate actions in response.

ML models empower action in light of risk and performance insights

In the short run, change risk analysis driven by ML modeling allows organizations to evaluate and address individual changes. They can do so using the measures above, including splitting high-risk changes into lower-risk components so that the root issue can be addressed in isolation.

But, in the long run, by looking at patterns of change failures, organizations are better positioned to address the underlying factors that contribute the most to change risk.

One of the most obvious examples that may arise is when release builds developed by a specific engineering team or approved by a specific change assignment group repeatedly fail. In this case, the model will reveal the elevated risk associated with these teams' changes and signal a need to evaluate their practices. Change risk metrics, like a change risk credit score, can hold these teams accountable for their own performance. Groups can set targets to improve their score and its contributing KPIs over time, addressing one possible source of repeated change failure.
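One simple way such a per-team score might be computed is sketched below. The formula is an assumption made for illustration, not the scoring method of any particular product. It is a 0 to 100 score based on each team's smoothed change failure rate, and team_scores and the sample data are hypothetical.

```python
from collections import Counter


def team_scores(changes):
    """changes: iterable of (team, failed) pairs. Returns {team: score 0-100}.

    The failure rate is smoothed as (failures + 1) / (total + 2) so that a
    team with only one or two changes is not scored at an extreme.
    """
    total, failed = Counter(), Counter()
    for team, did_fail in changes:
        total[team] += 1
        failed[team] += int(did_fail)
    return {
        team: round(100 * (1 - (failed[team] + 1) / (total[team] + 2)))
        for team in total
    }


# Hypothetical change outcomes per team.
history = [
    ("platform", False), ("platform", False), ("platform", False), ("platform", False),
    ("payments", True), ("payments", True), ("payments", True), ("payments", False),
]
scores = team_scores(history)
print(scores)
```

Tracking the score over time, rather than comparing a single snapshot, matches the goal of teams setting targets and improving their contributing KPIs.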

Other root causes can also be addressed using the data signals produced by the algorithm. For change-related problems that seem to cluster around a specific change category, the organization may need to make underlying fixes. Alternatively, it may invest in a total overhaul of that part of the product architecture to make it more stable.

Change-related problems that are correlated with poor quality code or escaped defects can be met with an effort to retrain teams on best practices. Feedback from the production environment can also allow development and change teams to close the loop on persistent user-facing problems that can be traced back to changes.

Without ML modeling, an organization may resort to “putting out fires” after changes are deployed, which forces it to be reactive. This not only leads to poorer customer experiences, with customers potentially enduring outage after outage, but also creates unplanned work and consumes budgetary resources that could be spent on proactive improvements.

A root cause analysis engine can use a similarity algorithm to correlate across hundreds of fields to discover the root cause and trace incidents to specific changes. After examining repeating root causes, IT teams can implement long-term improvements to people, processes, and technology while raising the quality of releases without hurting CI/CD velocity.
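The article does not say which similarity algorithm such an engine uses. As a stand-in, the sketch below uses Jaccard similarity over (field, value) pairs to rank recent changes by how closely they match an incident. The field names, change IDs, and both functions are hypothetical, and field values are assumed to be simple hashable values such as strings.

```python
def field_similarity(a, b):
    """Jaccard similarity over the (field, value) pairs of two records."""
    pairs_a, pairs_b = set(a.items()), set(b.items())
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


def likely_causes(incident, changes, top_n=3):
    """Rank changes by similarity to the incident, most similar first.

    Ties are broken by change ID so the ordering is deterministic.
    """
    scored = [(field_similarity(incident, fields), change_id)
              for change_id, fields in changes.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(change_id, round(sim, 2)) for sim, change_id in scored[:top_n]]


# Hypothetical incident and recent changes.
incident = {"ci": "payments-db", "service": "checkout",
            "environment": "prod", "region": "us-east"}
recent_changes = {
    "CHG-101": {"ci": "payments-db", "service": "checkout",
                "environment": "prod", "region": "eu-west"},
    "CHG-102": {"ci": "web-frontend", "service": "catalog",
                "environment": "prod", "region": "us-east"},
    "CHG-103": {"ci": "search-index", "service": "search",
                "environment": "staging", "region": "eu-west"},
}
candidates = likely_causes(incident, recent_changes)
print(candidates)
```

A high similarity only marks a change as a candidate worth investigating. Confirming the root cause still takes follow-up analysis.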

Get proactive to seek consistent change success

Using the methods described above and the technology provided by Digital.ai Change Risk Prediction, a Fortune 20 healthcare organization was able to reduce change-related problems by 64% and reduce lengthy CAB meetings to just an hour a week.

Outcomes like these can offer quick returns, but ML insights also signal ways for the entire organization to work together to make structured, proactive improvements to people, processes, and technology. 

This continuous improvement drive tracks with Opensource.com’s suggestion for “building in the ability to amplify feedback” in order to accelerate and improve CI/CD. Feedback generated by data can be amplified by AI-powered analytics in combination with actionable data models and informative dashboards, for example. Using amplified feedback, teams can swarm on root cause issues, resolving persistent problems while generating new knowledge that could form the basis of new best practices. IT leaders can then explicitly share this knowledge and broadcast it throughout the organization, ensuring all teams can learn lessons from the challenges and failures of the past — amplifying feedback even further.

Through this process, ML algorithms can empower not only continuous integration and deployment but also continuous learning and continuous improvement. The end result is that AI-powered analytics helps build a more robust organization, one capable of learning from its mistakes and growing more consistently exceptional with each new release.