15 min read

Evolving business expectations are being duly automated through a host of delectable developments in the IT space. These improvements elegantly empower business houses to deliver newer and premium business offerings fast. Businesses are insisting on reliable business operations.  IT pundits and professors are therefore striving hard and stretching further to bring forth viable methods and mechanisms toward reliable IT. Site Reliability Engineering (SRE) is a promising engineering discipline, and its key goals include significantly enhancing and ensuring the reliability aspects of IT.

In this tutorial, we will focus on the various ways and means of bringing up the reliability assurance factor by embarking on some unique activities in the post-production/deployment phase. Monitoring, measuring, and managing the various operational and behavioral data is the first and foremost step toward reliable IT infrastructures and applications.

This tutorial is an excerpt from a book titled Practical Site Reliability Engineering written by Pethuru Raj Chelliah, Shreyash Naithani, Shailender Singh. This book will teach you to create, deploy, and manage applications at scale using Site Reliability Engineering (SRE) principles.
All the code files for this book can be found at GitHub.

Monitoring clouds, clusters, and containers

The cloud centers are being increasingly containerized and managed. That is, there are going to be well-entrenched containerized clouds soon. The formation and managing of containerized clouds get simplified through a host of container orchestration and management tools. There are both open source and commercial-grade container-monitoring tools. Kubernetes is emerging as the leading container orchestration and management platform. Thus, by leveraging the aforementioned toolsets, the process of setting up and sustaining containerized clouds is accelerated, risk-free, and rewarding.

The tool-assisted monitoring of cloud resources (both coarse-grained as well as fine-grained) and applications in production environments is crucial to scaling the applications and providing resilient services. In a Kubernetes cluster, application performance can be examined at many different levels: containers, pods, services, and clusters. Through a single pane of glass, the operational team can provide the running applications and their resource utilization details to their users. These will give users the right insights into how the applications are performing, where application bottlenecks may be found, if any, and how to surmount any deviations and deficiencies of the applications. In short, application performance, security, scalability constraints, and other pertinent information can be captured and acted upon.

Cloud infrastructure and application monitoring

The cloud idea has disrupted, innovated, and transformed the IT world. Yet, the various cloud infrastructures, resources, and applications ought to be minutely monitored and measured through automated tools. The aspect of automation is gathering momentum in the cloud era. A slew of flexibilities in the form of customization, configuration, and composition are being enacted through cloud automation tools. A bevy of manual and semi-automated tasks are being fully automated through a series of advancements in the IT space. In this section, we will understand the infrastructure monitoring toward infrastructure optimization and automation.

Enterprise-scale and mission-critical applications are being cloud-enabled to be deployed in various cloud environments (private, public, community, and hybrid). Furthermore, applications are being meticulously developed and deployed directly on cloud platforms using microservices architecture (MSA). Thus, besides cloud infrastructures, there are cloud-based IT platforms and middleware, business applications, and database management systems. The total IT is accordingly modernized to be cloud-ready. It is very important to precisely and perfectly monitor and measure every asset and aspect of cloud environments.

Organizations need to have the capability for precisely monitoring the usage of the participating cloud resources. If there is any deviation, then the monitoring feature triggers an alert to the concerned to ponder about the next course of action. The monitoring capability includes viable tools for monitoring CPU usage per computing resource, the varying ratios between systems activity and user activity, and the CPU usage from specific job tasks. Also, organizations have to have the intrinsic capability for predictive analytics that allows them to capture trending data on memory utilization and filesystem growth. These details help the operational team to proactively plan the needed changes to computing/storage/network resources before they encounter service availability issues. Timely action is essential for ensuring business continuity.

Not only infrastructures, but also applications’ performance levels have to be closely monitored in order to embark on fine-tuning application code, as well as the infrastructure architectural considerations. Typically, organizations find it easier to monitor the performance of applications that are hosted at a single server, as opposed to the performance of composite applications that are leveraging several server resources. This becomes more tedious and tough when the underlying computer resources are spread across multiple and are distributed. The major worry here is that the team loses its visibility and controllability of third-party data center resources. Enterprises, for different valid reasons, prefer multi-cloud strategy for hosting their applications and data. There are several IT infrastructure management tools, practices, and principles. These traditional toolsets become obsolete for the cloud era. There are a number of distinct characteristics being associated with software-defined cloud environments. It is expected that any cloud application has to innately fulfill the non-functional requirements (NFRs) such as scalability, availability, performance, flexibility, and reliability.

Research reports say that organizations across the globe enjoy significant cost savings and increased flexibility of management by modernizing and moving their applications into cloud environments.

The monitoring tool capabilities

It is paramount to deploy monitoring and management tools to effectively and efficiency run cloud environments, wherein thousands of computing, storage, and network solutions are running.

The key characteristics of this tool are vividly illustrated through the following diagram:

Here are some of the key features and capabilities we need to properly monitor for modern cloud-based applications and infrastructures:

Firstly, the ability to capture and query events and traces in addition to data aggregation is essential. When a customer buys something online, the buying process generates a lot of HTTP requests. For proper end-to-end cloud monitoring, we need to see the exact set of HTTP requests the customer makes while completing the purchase.

Any monitoring system has to have the capability to quickly identify bottlenecks and understand the relationships among different components. The solution has to give the exact response time of each component for each transaction. Critical metadata such as error traces and custom attributes ought to be made available to enhance trace and event data. By segmenting the data via the user and business-specific attributes, it is possible to prioritize improvements and sprint plans to optimize for those customers.

Secondly, the monitoring system has to have the ability to monitor a wide variety of cloud environments (private, public, and hybrid).

Thirdly, the monitoring solution has to scale for any emergency.

The benefits

Organizations that are using the right mix of technology solutions for IT infrastructure and business application monitoring in the cloud are to gain the following benefits:

  • Performance engineering and enhancement
  • On-demand computing
  • Affordability

Prognostic, predictive, and prescriptive analytics

Any operational environment is in need of data analytics and machine learning capabilities to be intelligent in their everyday actions and reactions.

As data centers and server farms evolve and embrace new technologies (virtualization and containerization), it becomes more difficult to determine what impacts these changes have on the server, storage, and network performance. By using proper analytics, system administrators and IT managers can easily identify and even predict potential choke points and errors before they create problems.  To know more about prognostic, predictive, and prescriptive analytics; head over to our book Practical Site Reliability Engineering.

Log analytics

Every software and hardware system generates a lot of log data (big data), and it is essential to do real-time log analytics to quickly understand whether there is any deviation or deficiency. This extracted knowledge helps administrators to consider countermeasures in time. Log analytics, if done systematically, facilitates preventive, predictive, and prescriptive maintenance. Workloads, IT platforms, middleware, databases, and hardware solutions all create a lot of log data when they are working together to complete business functionalities. There are several log analytics tools on the market.

Open source log analytics platforms

If there is a need to handle all log data in one place, then ELK is being touted as the best-in-class open source log analytics solution. There are an application as well as system logs. Logs are typically errors, warnings, and exceptions. ELK is a combination of three different products, namely Elasticsearch, Logstash, and Kibana (ELK). The macro-level ELK architecture is given as follows:

  • Elasticsearch is a search mechanism that is based on the Lucene search to store and retrieve its data. Elasticsearch is, in a way, a NoSQL database. That is, it stores multi-structured data and does not support SQL as the query language. Elasticsearch has a REST API, which uses either PUT or POST to fetch the data. If you want real-time processing of big data, then Elasticsearch is the way forward.  Increasingly, Elasticsearch is being primed for real-time and affordable log analytics.
  • Logstash is an open source and server-side data processing pipeline that ingests data from a variety of data sources simultaneously and transforms and sends them to a preferred database. Logstash also handles unstructured data with ease. Logstash has more than 200 plugins built in, and it is easy to come out on our own.
  • Kibana is the last module of the famous ELK toolset and is an open source data visualization and exploration tool mainly used for performing log and time-series analytics, application monitoring, and IT operational analytics (ITOA). Kibana is gaining a lot of market and mind shares, as it makes it easy to make histograms, line graphs, pie charts, and heat maps.
  • Logz.io, the commercialized version of the ELK platform, is the world’s most popular open source log analysis platform. This is made available as an enterprise-grade service in the cloud. It assures high availability, unbreakable security, and scalability.

Cloud-based log analytics platforms

The log analytics capability is being given as a cloud-based and value-added service by various cloud service providers (CSPs). The Microsoft Azure cloud provides the log analytics service to its users/subscribers by constantly monitoring both cloud and on-premises environments to take correct decisions that ultimately ensure their availability and performance.  The Azure cloud has its own monitoring mechanism in place through its Azure monitor, which collects and meticulously analyze log data emitted by various Azure resources. The log analytics feature of the Azure cloud considers the monitoring data and correlates with other relevant data to supply additional insights.

The same capability is also made available for private cloud environments. It can collect all types of log data through various tools from multiple sources and consolidate them into a single and centralized repository. Then, the suite of analysis tools in log analytics, such as log searches and views a collaborate with one another to provide you with centralized insights of your entire environment. The macro-level architecture is given here:

This service is being given by other cloud service providers. AWS is one of the well-known providers amongst many others.  The paramount contributions of log analytics tools include the following:

  • Infrastructure monitoring: Log analytics platforms easily and quickly analyze logs from bare metal (BM) servers and network solutions, such as firewalls, load balancers, application delivery controllers, CDN appliances, storage systems, virtual machines, and containers.
  • Application performance monitoring: The analytics platform captures application logs, which are streamed live and takes the assigned performance metrics for doing real-time analysis and debugging.
  • Security and compliance: The service provides an immutable log storage, centralization, and reporting to meet compliance requirements. It has deeper monitoring and decisive collaboration for extricating useful and usable insights.

AI-enabled log analytics platforms

Algorithmic IT Operations (AIOps) leverages the proven and potential AI algorithms to help organizations to make the path smooth for their digital transformation goals. AIOps is being touted as the way forward to substantially reduce IT operational costs. AIOps automates the process of analyzing IT infrastructures and business workloads to give right and relevant details to administrators about their functioning and performance levels. AIOps minutely monitors each of the participating resources and applications and then intelligently formulates the various steps to be considered for their continuous well being. AIOps helps to realize the goals of preventive and predictive maintenance of IT and business systems and also comes out with prescriptive details for resolving issues with all the clarity and confidence. Furthermore, AIOps lets IT teams conduct root-cause analysis by identifying and correlating issues.


Loom is a leading provider of AIOps solutions. Loom’s AIOps platform is consistently leveraging competent machine-learning algorithms to easily and quickly automate the log analysis process. The real-time analytics capability of the ML algorithms enables organizations to arrive at correct resolutions for the issues and to complete the resolution tasks in an accelerated fashion. Loom delivers an AI-powered log analysis platform to predict all kinds of impending issues and prescribe the resolution steps. The overlay or anomaly detection is rapidly found, and the strategically sound solution gets formulated with the assistance of this AI-centric log analytics platform .

IT operational analytics

Operational analytics helps with the following:

  • Extricating operational insights
  • Reducing IT costs and complexity
  • Improving employee productivity
  • Identifying and fixing service problems for an enhanced user experience
  • Gaining end-to-end insights critical to the business operations, offerings, and outputs

To facilitate operational analytics, there are integrated platforms, and their contributions are given as follows:

  • Troubleshoot applications, investigate security incidents, and facilitate compliance requirements in minutes instead of hours or days
  • Analyze various performance indicators to enhance system performance
  • Use report-generation capabilities to indicate the various trends in preferred formats (maps, charts, and graphs)
  • and much more!

Thus, the operational analytics capability comes handy in capturing operational data (real-time and batch) and crunching them to produce actionable insights to enable autonomic systems. Also, the operational team members, IT experts, and business decision-makers can get useful information on working out correct countermeasures if necessary. The operational insights gained also convey what needs to be done to empower the systems under investigation to attain their optimal performance.

IT performance and scalability analytics

There are typically big gaps between the theoretical and practical performance limits. The challenge is how to enable systems to attain their theoretical performance level under any circumstance. The performance level required can suffer due to various reasons like poor system design, bugs in software, network bandwidth, third-party dependencies, and I/O access. Middleware solutions can also contribute to the unexpected performance degradation of the system. The system’s performance has to be maintained under any loads (user, message, and data). Performance testing is one way of recognizing the performance bottlenecks and adequately addressing them. The testing is performed in the pre-production phase.

Besides the system performance, application scalability and infrastructure elasticity are other prominent requirements. There are two scalability options, indicated as follows:

  • Scale up for fully utilizing SMP hardware
  • Scale-out for fully utilizing distributed processors

It is also possible to have both at the same time. That is, to scale up and out is to combine the two scalability choices.

IT security analytics

IT infrastructure security, application security, and data (at rest, transit, and usage) security are the top three security challenges, and there are security solutions approaching the issues at different levels and layers. Access-control mechanisms, cryptography, hashing, digest, digital signature, watermarking, and steganography are the well-known and widely used aspects of ensuing impenetrable and unbreakable security. There’s also security testing, and ethical hacking for identifying any security risk factors and eliminating them at the budding stage itself. All kinds of security holes, vulnerabilities, and threats are meticulously unearthed in to deploy defect-free, safety-critical, and secure software applications. During the post-production phase, the security-related data is being extracted out of both software and hardware products, to precisely and painstakingly spit out security insights that in turn goes a long way in empowering security experts and architects to bring forth viable solutions to ensure the utmost security and safety for IT infrastructures and software applications.

The importance of root-cause analysis

The cost of service downtime is growing up. There are reliable reports stating that the cost of downtime ranges from $100,000-$72,000 per minute. Identifying the root-cause (mean-time-to-identification (MTTI) generally takes hours. For a complex situation, the process may run into days.

OverOps analyzes code in staging and production to automatically detect and deliver the root-causes for all errors with no dependency on logging. OverOps shows you a stack trace for every error and exception. However, it also shows you the complete source code, objects, variables, and values that caused that error or exception to be thrown. This assists in identifying the root-cause of when your code breaks. OverOps injects a hyperlink into the exception’s link, and you’ll be able to jump directly into the source code and actual variable state that cause it. OverOps can co-exist in production alongside all the major APM agents and profilers. Using OverOps with your APM allows monitoring server slowdowns and errors, along with the ability to drill down into the real root-cause of each issue.


There are several activities being strategically planned and executed to enhance the resiliency, robustness, and versatility of enterprise, edge, and embedded IT. This tutorial described the various post-production data analytics to allow you to gain a deeper understanding of applications, middleware solutions, databases, and IT infrastructures to manage them effectively and efficiently.

In order to gain experience on working with SRE concepts and be able to deliver highly reliable apps and services, check out this book Practical Site Reliability Engineering.

Read Next

Site reliability engineering: Nat Welch on what it is and why we need it [Interview]

Key trends in software infrastructure in 2019: observability, chaos, and cloud complexity

5 ways artificial intelligence is upgrading software engineering

A Data science fanatic. Loves to be updated with the tech happenings around the globe. Loves singing and composing songs. Believes in putting the art in smart.