What Can Knowledge Management Do For Site Reliability Engineering (SRE)?

  • 30 Oct 2022
  • Sergey Bloom

If everyone were an expert with a company’s systems, productivity would skyrocket. Wouldn’t that be the best outcome of all? So, how can we help teams become deeply knowledgeable fast?

Knowing systems like the back of your hand usually takes delving into the intricacies and dealing with troubleshooting. This kind of experience builds knowledge about the various system quirks and potential anomalies. It results in deep expertise with the various parts of a system. However, this can take a long time. There has to be a way to make this knowledge accessible faster.

Incorporating knowledge management into troubleshooting processes has tremendous productivity benefits. As noted by the Consortium for Service Innovation, Knowledge-Centered Service (KCS®) (https://www.serviceinnovation.org/kcs/) has the potential to improve incident resolution time by 50% or more. New team members can reach proficiency in roughly a third of the time, and support costs can go down by half. To get there, we need to focus on:

  1. Incorporating AI-driven knowledge tools into the workflow
  2. Leveraging human expertise alongside AI
  3. Effective knowledge sharing


Incorporating AI-driven knowledge tools into the workflow

Knowledge is the key ingredient of progress, including in technology. When looking for experts, knowledge (experience) is one of the main factors. The more experts an organization has, the better. However, experts are few and far between, and productivity suffers because of this shortage. It’s easy to see why elevating internal team expertise quickly is crucial.

Speeding up learning involves using innovative technologies to capture the knowledge from experts and spreading it to the rest of the team in a way that’s easy to follow.

There are traditional barriers to learning such as time, resources, individual skills, and more. Overcoming these is no small feat. However, there are ways to speed up learning. In the workplace, this involves harnessing cumulative expertise through technological innovation. This innovation includes capturing, structuring, surfacing, and automating knowledge. 

Knowledge graphs play a key role in knowledge management. They help users absorb knowledge faster due to the simplicity and visual representation of the connectivity in information nuggets. 

When disaster strikes and we have to act immediately to solve a problem, it’s no time to start reading blobs of text in various documents looking for specific knowledge. This is the time when information needs to be at our fingertips and tailored to the problem at hand. To achieve that, we need the cumulative expertise presented to us in a quickly digestible format. And that’s where knowledge graphs shine.

Traditional knowledge management tools are not very effective. According to IDC, people spend more than 80% of their time searching for and organizing data and less than 20% deriving value from it (https://blogs.idc.com/2018/08/23/time-crunch-equalizing-time-spent-on-data-management-vs-analytics/). We need to be innovative so that we can build reusable and reliable knowledge quickly. Knowledge is typically distributed across multiple systems and users. But the pieces are connected, and knowledge graphs can capture that connectivity while preserving the distributed nature of the knowledge.

The key is to employ AI-driven context-aware tools that make knowledge management quick and intuitive. Instead of writing lengthy documents after the fact, these tools allow for recording knowledge instantly one nugget at a time. That way knowledge capture, structuring, surfacing, and automation become a seamless part of the workflow. Thus, AI-driven tools can build reliable, contextual, easily traversable, and reusable knowledge graphs on-the-fly. With this approach, internal teams have a much easier time learning and applying expertise under varying conditions. As a consequence, more team members become experts much faster.

For example, in the SRE world, such graphs can capture expert troubleshooting and remediation knowledge. When a problem occurs, even the less experienced team members can follow a simple resolution path clearly visible in a graph. Moreover, this can be applied to a variety of situations, not just the same problem over and over. And the simplicity of this approach also serves as a quick learning experience.
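
As a rough illustration of the idea above, the following Python sketch stores troubleshooting knowledge as a small directed graph and finds a resolution path from a symptom to a remediation. The node names, the GRAPH and REMEDIATIONS structures, and the resolution_path helper are all hypothetical and invented for this example; they do not describe any particular product's data model.

```python
from collections import deque

# Hypothetical troubleshooting graph: each node is one knowledge "nugget",
# and each edge points from an observation to the next thing to check or do.
GRAPH = {
    "checkout latency high": ["check database connections", "check recent deploys"],
    "check database connections": ["connection pool exhausted"],
    "check recent deploys": ["no deploy in last 24h"],
    "connection pool exhausted": ["restart pool and raise max connections"],
    "no deploy in last 24h": [],
    "restart pool and raise max connections": [],
}

# Nodes that an expert has marked as remediations (an assumption of this sketch).
REMEDIATIONS = {"restart pool and raise max connections"}


def resolution_path(graph, start, remediations):
    """Breadth-first search from a symptom to the nearest remediation.

    Returns the list of nodes to follow, or None if no remediation is reachable.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in remediations:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = resolution_path(GRAPH, "checkout latency high", REMEDIATIONS)
for step in path:
    print(step)
```

Because every expert contribution is just another node or edge, the same graph can serve several related symptoms, which is what makes the knowledge reusable beyond a single incident.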

Because troubleshooting is all about finding the root causes, capturing “causal” knowledge is vital. Unfortunately, while today’s AI systems are good at many things, causal knowledge mapping is not one of them. It has proven very challenging. Kevin Hartnett quoted Judea Pearl, one of the pioneers of AI, as saying that modern deep learning amounts to nothing more than curve fitting (https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/), that is, finding patterns in data.

This is why we need humans to step in.


Leveraging human expertise alongside AI

Human beings are good at causal reasoning. That’s the true form of intelligence. While AI can do many things for us – automatically analyze documents and categorize inquiries, classify large amounts of text, organize bodies of knowledge, improve SEO, map out an organization’s jargon and expertise location, and more (https://blogs.iadb.org/conocimiento-abierto/en/natural-language-processing/) – humans can step in and add causal knowledge to it. In the SRE world, when we combine AI-driven knowledge management with human expertise, together they make a perfect hybrid system that can solve problems and point to root causes.

Human knowledge curation combined with AI ensures precision and accuracy. That way, experts can maintain knowledge relevance.

But, as we mentioned earlier, knowledge is distributed across systems and users. The very nature of this requires collaboration. Knowledge needs to be sourced from multiple people and it needs to be shared to elevate everyone’s expertise in the organization.


Effective knowledge sharing

DevOps culture has brought a level of alignment between developers, operations, and the business. With it came the need for knowledge sharing across teams. It takes a better grasp of each other’s function to be on “the same page.” And user productivity is correlated with business success. Today, it is incredibly easy to share knowledge with a click of a button thanks to collaboration tools.

SLAs and SLOs are extremely important communication tools between the functional and technical teams. Management needs to share knowledge with SREs regarding the impact of services that go down. This includes management letting SREs know the potential size of lost revenue, lost customers, and more. Conversely, SREs need to let management know which regions may be affected, how many people, how long it would take to bring the services back up, and so on. Sharing such knowledge effectively and efficiently can help organizations recover quickly.
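
To make this exchange concrete, here is a minimal Python sketch of the arithmetic each side might bring to the conversation: how much downtime an availability SLO permits, and what an outage might cost. The 99.9% target, the 30-day window, the outage length, and the revenue-per-minute figure are made-up numbers for illustration only.

```python
def allowed_downtime_minutes(slo_percent, period_days=30):
    """Downtime that an availability SLO permits over the period, in minutes."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def estimated_revenue_loss(outage_minutes, revenue_per_minute):
    """Simple linear estimate; real impact models are usually more involved."""
    return outage_minutes * revenue_per_minute


# Hypothetical inputs: a 99.9% monthly SLO and a 30-minute outage
# on a service assumed to earn 200 (in any currency) per minute.
budget = allowed_downtime_minutes(99.9)  # about 43.2 minutes per 30 days
loss = estimated_revenue_loss(outage_minutes=30, revenue_per_minute=200)

print(f"Downtime allowed by the SLO: {budget:.1f} minutes")
print(f"Estimated revenue loss: {loss}")
```

SREs would typically supply the outage duration and scope, while management supplies the revenue figure, so a shared calculation like this gives both sides the same picture.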

Also, SRE teams must share technical knowledge across all levels to avoid expertise silos. It is far more effective to have an entire team of highly competent engineers rather than relying on a few experienced SMEs. Business reliability will be easier to maintain if expertise gets shared with everybody.

Tools that capture and present distributed knowledge in a simple manner can go a long way toward helping organizations align over a common goal.

Such tools take SRE to a whole new level by making knowledge management easy. Making more people deeply knowledgeable about systems is one of the many benefits. Another is improved employee satisfaction and retention, by at least 20% and, in some cases, up to 35-40% (https://www.serviceinnovation.org/kcs/). And the resulting productivity boost can be easily measured via the mean time to resolution (MTTR).
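
For readers who want to track this metric, a minimal Python sketch of an MTTR calculation follows. The incident timestamps are invented sample data, and mttr_minutes is a hypothetical helper; in practice the records would come from an incident tracker.

```python
from datetime import datetime

# Hypothetical incidents as (opened, resolved) ISO 8601 timestamps.
incidents = [
    ("2022-10-01T10:00", "2022-10-01T11:30"),  # 90 minutes
    ("2022-10-05T14:00", "2022-10-05T14:45"),  # 45 minutes
    ("2022-10-12T09:15", "2022-10-12T10:00"),  # 45 minutes
]


def mttr_minutes(records):
    """Mean time to resolution in minutes: average of (resolved - opened)."""
    if not records:
        raise ValueError("at least one incident is required")
    durations = [
        (datetime.fromisoformat(resolved) - datetime.fromisoformat(opened)).total_seconds() / 60
        for opened, resolved in records
    ]
    return sum(durations) / len(durations)


mttr = mttr_minutes(incidents)
print(f"MTTR: {mttr:.1f} minutes")  # prints "MTTR: 60.0 minutes"
```

Comparing this number before and after a change in knowledge management practice is one simple way to check whether the change is paying off.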



Knowledge management can completely transform SRE. Incorporating AI-driven knowledge tools into the workflow helps capture expertise on-the-fly and speed up learning through knowledge graphs. Leveraging human expertise alongside AI preserves causal knowledge, which makes automated root cause analysis possible. Effective knowledge sharing promotes collaboration by consolidating expertise distributed across systems and people. Together, these factors help teams become deeply knowledgeable fast and drastically improve productivity.

We thank you for reading, and we hope that we’ve made a solid case for better knowledge management in SRE. While much of the DevOps and SRE tooling today is around knowledge discovery from “machine data”, the realm of curated human knowledge has received less attention. Our DagKnows platform is all about collaborating over, capturing, curating, and consuming structured knowledge required for resolving technical issues. Please take a look at it when you get a chance: https://dagknows.com