Application Performance Monitoring in DevOps: What is it, how to perform it, and top tools for automated Application Performance Monitoring
We learned about Release Management in DevOps in our last tutorial.
Another key aspect of DevOps is monitoring the application's performance in the live environment, carried out by both the Development and Operations teams, whereas earlier it was the responsibility of the Operations team alone.
Check out => List of DevOps Tutorials
So, live monitoring and fixing production issues is a priority for the entire team.
VIDEO Part 4 Block 3: Application Performance Monitoring – 27 minutes 35 seconds
In this block, we will learn about Application Performance Monitoring in DevOps.
We have mentioned throughout our discussion on this topic that software development in the DevOps context is incomplete until the software is coded, tested, deployed, and monitored for its success in the live environment over a specified period.
Here, we will learn:
- What is application performance monitoring all about?
- How do we carry out application performance monitoring, APM for short, in the live environment?
- What metrics do we gather to ensure that the software is performing successfully?
- What are the benefits of application monitoring?
- What tools do we use to carry out these tasks?
What Is Application Performance Monitoring?
We know that deploying software to production no longer completes the software life cycle; we must also ensure that the software is up and running successfully, performs its functions, and that end users can use the application without any hassles.
So far, in our DevOps delivery pipeline, we have deployed our software to production successfully. Now it is our responsibility to ensure that it performs to expectations in the live environment.
Thus, application monitoring, or application performance monitoring, is in simple words a way to monitor our production site after deployment of the software, and to understand the problems, issues, and improvement areas before our customers ever notice them.
Hence, APM is also about understanding how customers use our software and engaging actively with them to understand their requirements, ensuring that we build the right things for them.
Earlier, application monitoring was never the task or responsibility of the Dev team but of the Operations team alone, and the Dev team would support Operations whenever issues were reported or tickets raised.
But this has changed in DevOps. Monitoring the application in the live environment is also the responsibility of Development, or rather of the entire team.
The entire team constantly focuses on the live system, picks up issues on priority, and fixes them together.
The next question is: how long do you monitor?
It generally depends on the application. Ideally, the software's performance is monitored closely until no issues are reported, either through the logs or by customers and users via the customer service representatives.
So, close monitoring might take anywhere from one or two days to a week, until everything settles down and application usage becomes normal.
That does not mean application monitoring stops after that. Once the system and its usage stabilize and users get accustomed to the new software or features, close monitoring may no longer be required, but monitoring of the logs continues throughout.
So, application performance monitoring starts rigorously as soon as the application is deployed and continues throughout, perhaps at a slower pace.
How Do You Monitor The Live System Or The Live Application?
Well, how to monitor is the biggest aspect of application monitoring, as the speed and accuracy expected here are very high and crucial for addressing issues on time.
There are three main ways of monitoring:
- Through the application monitoring tools.
- By constantly going through the logs written out by the applications, which is a manual process.
- By configuring notifications and alarms on the logs to alert the team.
Methods 1 and 3 are the ones quite often used as automated methods, whereas the second is mainly for the team members' internal purposes.
#1) Monitoring through tools:
There are many tools available in the marketplace to carry out application performance monitoring. These tools provide the configured metrics to the team in an automated way.
All the team needs to do is identify the right tool, then install and configure it in the live environment, so that the tool does the job of monitoring and raises alarms or alerts whenever there is a peak, a dip, an error, or a warning.
#2) Monitoring through logs:
This is the manual way of monitoring the application, which people used to follow earlier.
Earlier, developers used to keep a window open on their screen connected to the live environment and constantly run the `tail -f` command to see the latest, real-time happenings. If any issues were found, they would take care of them by changing settings or configurations based on readings from the log.
Now, this practice is not used for application monitoring as such, but a few team members, such as developers or those on the software upgrade team, sometimes still do this just to understand what is happening on the live site.
#3) Monitoring through Notifications and Alarms:
This is quite simple.
Simply configure notifications and alarms to be triggered whenever warnings or errors appear in the logs written out by the live site, and route them to the mobile numbers of the people monitoring the application, so that they receive an SMS or a call in case of emergency.
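The alarm routing described here boils down to pattern-matching log lines and handing matches to a notification channel. Below is a minimal sketch; the `notify` callback stands in for a real SMS gateway or paging service, and the "LEVEL message" log format is an assumption.

```python
import re

# Severity levels that should trigger an alert (assumed log convention;
# adjust the pattern to match your application's log format).
ALERT_PATTERN = re.compile(r"\b(ERROR|FATAL|WARN(ING)?)\b")

def scan_for_alerts(lines, notify):
    """Call `notify(line)` for each log line matching an alert level.

    `notify` is a placeholder for the real channel (SMS gateway, pager,
    chat webhook, ...). Returns the number of alerts raised.
    """
    count = 0
    for line in lines:
        if ALERT_PATTERN.search(line):
            notify(line)
            count += 1
    return count
```

In a real setup this scan would run on each batch of new log lines, and `notify` would rate-limit itself so a flood of identical errors does not page the team hundreds of times.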
Thus, among the above three methods, the first, using sophisticated APM tools available in the marketplace, is the most popular method adopted in DevOps practice.
Now let us understand what metrics we gather, how they benefit the team, and what we are going to do with them.
First, let us list a few of the metrics we would generally collect. This is not an exhaustive list; the metrics collected and the tools used to collect them vary based on organizational requirements.
- Failures, core dumps, error messages, and warnings
- Usage patterns
- Performance metrics
- Availability and scalability metrics
- Custom telemetry
Now let us understand the benefit of collecting each of these metrics.
#1) Failures, core dumps, error messages, and warnings:
Any failures, error messages, and warnings related to servers, networks, databases, websites, and even application functionality give a clear picture of the quality of the system and of the code delivered.
So, this metric helps the team to identify the bugs/issues in the application.
The details about a system crash help us understand how the system crashed and then recovered, and what the downtime or recovery time to restore the system was.
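As a small illustration of the recovery-time side of this metric, total downtime and mean time to recover (MTTR) can be derived from crash/recovery timestamp pairs. The incident data shown is hypothetical; real APM tools derive it automatically from health-check events.

```python
from datetime import datetime

def recovery_stats(incidents):
    """Compute (total_downtime_seconds, mttr_seconds) from incidents.

    `incidents` is a list of (crashed_at, recovered_at) datetime pairs,
    e.g. parsed from monitoring events.
    """
    durations = [(end - start).total_seconds() for start, end in incidents]
    total = sum(durations)
    mttr = total / len(durations) if durations else 0.0
    return total, mttr
```
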
#2) System and Application Usage Pattern:
This metric aims at finding patterns in the usage of computing resources and the application.
It helps in identifying issues in the configuration of firewalls, load balancing, server memory, etc., which the team can then fix and optimize.
It also helps the team learn how users click through our applications, so that the most used features, most visited screens, websites, etc. can be fine-tuned or enhanced.
So, overall, usage patterns and trends help in better understanding the stability of the system and the application.
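A tiny sketch of the usage-pattern idea: counting which screens users visit most, so the team knows what to fine-tune first. The (user, screen) log format is an assumption made for illustration.

```python
from collections import Counter

def top_screens(access_log, n=3):
    """Return the `n` most visited screens from access-log entries.

    Each entry is assumed to be a (user_id, screen) pair; real access
    logs would need parsing into this shape first.
    """
    counts = Counter(screen for _, screen in access_log)
    return counts.most_common(n)
```
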
#3) Performance Metrics:
The system and application performance metric is quite important, as it provides the performance details of the application: whether the system is too slow, whether the TPS (transactions per second) SLA is being met, whether the system can handle the peak load in the live environment, how the application recovers from a stressed state to normal, and, overall, whether the application is performing consistently and reliably.
This metric contributes directly to customer satisfaction, as speed is key for customers; they do not want slow transactions.
It also affects revenue generation: the more transactions that happen, the more revenue is generated. Slow-performing systems also add cost through excess consumption of infrastructure resources.
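To make the TPS and SLA check concrete, here is a small sketch that summarizes one monitoring window: throughput, a simple nearest-rank p95 latency, and whether illustrative SLA thresholds are met. The thresholds are placeholders, not recommendations.

```python
def perf_summary(latencies_ms, window_seconds, sla_tps, sla_p95_ms):
    """Summarize throughput and latency for one monitoring window.

    `latencies_ms` holds the per-transaction latency samples for the
    window; the SLA thresholds are illustrative placeholders.
    """
    tps = len(latencies_ms) / window_seconds
    ordered = sorted(latencies_ms)
    # nearest-rank 95th percentile (simplest estimator; APM tools use
    # more refined ones)
    p95 = ordered[max(0, int(len(ordered) * 0.95) - 1)]
    return {
        "tps": tps,
        "p95_ms": p95,
        "sla_met": tps >= sla_tps and p95 <= sla_p95_ms,
    }
```
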
#4) Availability Metrics:
Another important metric that the team gathers is availability. Keeping the system up and highly available at all times is what customers expect.
It records the number of times the system has gone down over a period and the time taken to recover, and hence helps in assessing the business loss due to non-availability. We clearly know that high availability directly contributes to more revenue generation.
Studying the availability pattern gives the team an opportunity, based on the collected data, to improve their systems for disaster recovery. We also know that ensuring 100% availability is quite challenging in DevOps, as deployments happen hourly and no downtime is expected.
This metric is thus a double-edged sword to manage: on one side, total availability; on the other, optimizing infrastructure cost through load balancing. A detailed study of this metric's pattern helps the team plan their infrastructure accordingly.
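The availability figure itself is simple arithmetic over uptime, as this sketch shows; real availability metrics come from health checks across every instance, not from a single downtime total.

```python
def availability_pct(total_seconds, downtime_seconds):
    """Availability over a period, as a percentage (the classic 'nines')."""
    up = total_seconds - downtime_seconds
    return round(100.0 * up / total_seconds, 3)
```

For example, roughly 43 minutes of downtime in a 30-day month puts you at "three nines" (99.9%).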
#5) Scalability Metrics:
Well, these days, with cloud infrastructure adoption being a costly affair and the pay-per-use model in practice, nobody wants to leave their infrastructure resources idle.
At the same time, they do not want any of their customers to feel the pinch of any shortfall in these resources. So the intention is to keep the system up at all times, while not paying too much for resources unless they are optimally used, by adopting a scaling process.
So, scaling metrics, which record scale-up and scale-down (or scale-in and scale-out) events, indicate whether scaling has been performed successfully on the system.
Hence, this metric helps the team identify any glitches in the scaling process and aids better planning.
It also gives an idea of the optimization plan the team can adopt to save infrastructure costs and the costs of running the application.
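A minimal sketch of the scale-in/scale-out decision most autoscalers apply: size the fleet proportionally to utilization against a target. The thresholds and bounds here are illustrative; real policies add cooldowns and hysteresis so the fleet does not flap.

```python
import math

def desired_instances(current, avg_cpu_pct, target_pct=60, min_n=2, max_n=20):
    """Proportional scaling decision: grow or shrink the fleet so that
    average utilization moves toward `target_pct`.

    All numbers here are illustrative defaults, not recommendations.
    """
    wanted = math.ceil(current * avg_cpu_pct / target_pct)
    return max(min_n, min(max_n, wanted))  # clamp to fleet bounds
```
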
#6) Custom Telemetry:
Custom telemetry basically means inserting code to collect a few specific metrics that give insight into application usage and help diagnose issues while the application is being used.
It is akin to running the application in debugging mode. Since it slows down application performance, this option should be switched on or off as required.
Actions like tracking events, metrics, exceptions, traces, and the timing of events are captured using the available standard APIs. This telemetry helps in understanding the complete behavior of the system, which has a direct bearing on business and financial benefits.
This metric also helps in comparing telemetry across servers and working on improvements.
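To illustrate custom telemetry with the on/off switch discussed above, here is a minimal hypothetical client in the spirit of standard telemetry APIs (track events, track metrics); it is not any specific vendor's SDK.

```python
import time

class Telemetry:
    """Minimal custom-telemetry client sketch (hypothetical API).

    `enabled` mirrors the on/off switch: when off, tracking calls are
    no-ops, so instrumented code adds no measurable overhead.
    """

    def __init__(self, enabled=True):
        self.enabled = enabled
        self.records = []  # a real client would batch and ship these

    def track_event(self, name, **props):
        if self.enabled:
            self.records.append(
                {"type": "event", "name": name, "props": props, "ts": time.time()}
            )

    def track_metric(self, name, value):
        if self.enabled:
            self.records.append(
                {"type": "metric", "name": name, "value": value, "ts": time.time()}
            )
```
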
#7) Other Metrics:
In addition to these, there are a lot more metrics which can be gathered in order to find out how customers are using our software.
- Traffic patterns, the number of users created, the effectiveness of load balancing, etc.
- A few other metrics, like end-user experience, SLAs, etc.
- Certain metrics for product analysis, like the most used features, the most user-friendly screens, etc.
So, collecting these metrics and real-time feedback, which come directly from production, helps improve ease of use and the user experience. It thereby helps enhance the product, translates into a roadmap for product enhancement, and results in increased customer satisfaction and a bigger customer base.
A constant study of the pattern of errors from the live site, along with user telemetry, helps the team learn from mistakes and take proactive rather than reactive action in the future.
All these metrics also help in achieving hypothesis-driven development, which is a key aspect of DevOps practice.
Metrics gathering, especially on usage data, enables the team to come up with innovative ideas, implement them, and improve the product based on live feedback.
So, the team can try out a totally new, innovative idea: develop it, roll it out on a few sites, get the feedback, and then implement it in full.
Thus, the overall application performance monitoring is not just monitoring but also learning from the ‘Live site’.
As DevOps practice becomes more and more prominent, and live metrics collection becomes more important for meeting customers' expectations and business needs, many sophisticated tools have come onto the market; they just need to be installed on the server, and the metrics can then be configured and gathered in real time.
So, all these aspects of metrics collection that we just spoke about are handled by a good monitoring tool that provides accurate details. The selection of the right tool thus determines the success of APM.
Datadog APM helps you to easily analyze and isolate dependencies, eliminate bottlenecks, reduce latency, track errors, and increase code efficiency to optimize your applications.
By correlating distributed traces with logs, browser sessions, profiles, synthetic tests, and infrastructure metrics, you can achieve full visibility into the health of your applications across all hosts, containers, proxies, and serverless functions.
Additional APM Tools:
New Relic is mostly used for performance monitoring of the application: response time, throughput, the most time-consuming transactions, etc.
ManageEngine is another tool that supports server monitoring, DB monitoring, and cloud monitoring.
AppDynamics supports managing the performance and availability of applications across cloud computing environments and inside the data center.
Dynatrace, in addition to APM, supports user experience management solutions and allows monitoring performance globally by emulating real user behavior via PCs across the world.
IBM's performance management tool helps in efficiently managing applications, both on-premises and hybrid, and the IT infrastructure.
So, ideally, the majority of these APM tools provide details on diagnostics, error reporting, usage patterns, and trends. Notifications on application performance help the team catch early warnings in real time and quickly provide a fix, ensuring that the user experience is not disturbed.
With this, we are completing our discussion of application performance monitoring, and completing our Part 4 series as well.
In our upcoming tutorial, we will revise and summarize what we learned in this entire series of tutorials.