Paperdrip

A dashboard, or HUD, is where you find operational metrics. Unsurprisingly, the dashboard was first introduced in the automobile. Values from the speedometer, fuel gauge, and tachometer give the driver a sense of “what’s going on”. Imagine driving a car without any of these readings: you would constantly worry about whether you are driving too fast or about to run out of gas.

Let’s draw a line in the sand

The invention of the dashboard not only provides a means of displaying the important metrics, but also forces the engineer to quantify the operation in numbers. These numbers become a common language between the user and the engineer, cutting down on discussion of abstract notions (fast versus slow, for instance). They also provide precision, replacing rough guesses based on experience.

As just mentioned, we want to understand “what is going on” by glancing at the dashboard. A value alone cannot do this: 100 km/h by itself does not tell you whether it is fast or slow. A value only becomes meaningful when it is compared against a boundary.
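To make the idea of a boundary concrete, here is a minimal Python sketch. The function name and the two limits (100 and 120 km/h) are hypothetical, chosen only for illustration.

```python
def classify(value, warn_at, critical_at):
    """Turn a raw reading into a status by comparing it against two boundaries."""
    if value >= critical_at:
        return "critical"
    if value >= warn_at:
        return "warning"
    return "ok"


# Hypothetical boundaries: warn at 100 km/h, critical at 120 km/h.
status = classify(100, warn_at=100, critical_at=120)
print(status)
```

The same reading of 100 km/h would be "ok" with a higher warning boundary, which is exactly why the number needs the boundary next to it.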

When we deliver a system, we need to know whether it is performing as desired. As with a car, we should make sure that metrics indicating the operational status of the system are captured. The first metric I would capture is whether the system is up and running.

Did you know the site has been inaccessible for the past hour?

It is so embarrassing when it is a user who reports an outage to my boss before I am even aware of it. Being the first to notice any outage is the number one thing you should aim for. It gives users the impression that you are on top of what is happening, which is an important building block of trust. It is also very easy to implement; you just need to be careful about how you define “up and running”. Being able to ping the server is not enough: if it is a web site, you should make sure the site itself is accessible (accessible and operational can mean much more than that, but I will leave it there).
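A minimal sketch of such a check in Python, using only the standard library, might look like the following. It treats the site as up only when an HTTP request returns a 2xx status, which goes one step beyond a ping. The function name, the timeout, and the `opener` parameter (added so the check can be exercised without a real network) are my own assumptions, not part of any particular monitoring tool.

```python
import urllib.error
import urllib.request


def is_site_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True only if the URL answers with a 2xx HTTP status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts,
        # and the HTTPError that urlopen raises for 4xx/5xx responses.
        return False
```

Run it on a schedule from somewhere outside your own network, and alert when it returns False several times in a row rather than on a single failure.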

The New Arrival page is not showing any products. Do you know that, in a business sense, this is a site outage?

This is what I was told while I had a big smile on my face, feeling that I was on top of things. Apparently, knowing the site is up is not enough if it cannot serve its purpose. So what you should do next is identify ONE business performance metric to monitor. For a transaction site, it could be the time of the last order placed. For a search site, it could be the time of the last successful query. For me, it is making sure there are products showing on the New Arrival page. Since the page content is driven by a search result, the way I do this is to look for a magic word that is shown only when the query returns products.
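The magic-word check can be sketched in a few lines of Python. The marker text "Add to cart" is a hypothetical stand-in; in practice you would pick a string that your own page renders only when at least one product is listed.

```python
def page_has_products(html, marker="Add to cart"):
    """Return True if the marker text appears anywhere in the page body.

    The default marker is a made-up example, not the real word I use.
    """
    return marker.casefold() in html.casefold()


# Two tiny hand-written pages, for illustration only.
healthy_page = "<ul><li>Blue mug <button>Add to Cart</button></li></ul>"
empty_page = "<p>No products found.</p>"

print(page_has_products(healthy_page))
print(page_has_products(empty_page))
```

Note that the second page would still pass a plain "is the site up" check, which is the whole point of adding a business-level probe.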

The many facets of a system

Even if your system is very simple, it will usually be composed of several components: the network, the application, and probably a database too. When an outage arises, one or several of these components may be having issues. So once you have probes monitoring the key metrics, you should start rolling out monitoring probes on these components, too.

Someone reported that the site is very slow

There was a time when I was greeted with this rather than “Good morning”. Knowing that the application is up and serving its purpose is, again, not enough. We should be aware of how efficiently the application is running and, again, have ourselves alerted when performance falls outside the “acceptable region”.

Defining the “acceptable region” is an art. You can’t just pick a value out of thin air without considering whether it is physically achievable. For instance, because of network latency, it is unfair to use a single threshold for all locations (edge acceleration can help with this, but that is not the point of my discussion).
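One simple way to express per-location limits is a lookup table with a fallback, sketched below in Python. The location names and the limits in seconds are invented for illustration; real values should come from what you actually measure from each location.

```python
# Hypothetical response-time limits in seconds, one per probe location.
THRESHOLDS = {
    "hong-kong": 1.0,
    "london": 2.5,
    "new-york": 2.0,
}
DEFAULT_THRESHOLD = 3.0  # used for locations without their own limit


def is_acceptable(location, response_time):
    """Compare a response time against the limit for its own location."""
    limit = THRESHOLDS.get(location, DEFAULT_THRESHOLD)
    return response_time <= limit
```

With this table, a 2.2 second response is acceptable from London but not from Hong Kong.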

Notify yourself proactively before things turn sour.

We have talked a lot about notification; now let’s look at the importance of metrics for diagnosis. When I notice a slowdown in performance, the first thing I do is open my monitoring tools and go through every single component looking for abnormalities. Most of the time, the cause can be identified from a single metric that deviates from its average.
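Going through every metric by eye can be partly automated. The Python sketch below flags any metric whose latest value is more than a few standard deviations away from its own recent average. This is one simple way to define "abnormal" and not necessarily what your monitoring tool does; the metric names and the limit of three standard deviations are assumptions.

```python
from statistics import mean, stdev


def find_abnormal(history, latest, max_sigma=3.0):
    """Return the names of metrics whose latest value is far from their average."""
    abnormal = []
    for name, samples in history.items():
        if len(samples) < 2 or name not in latest:
            continue  # not enough data to judge this metric
        average, spread = mean(samples), stdev(samples)
        if spread == 0:
            # A perfectly flat history: any change at all is unusual.
            if latest[name] != average:
                abnormal.append(name)
        elif abs(latest[name] - average) / spread > max_sigma:
            abnormal.append(name)
    return abnormal


# Hypothetical recent samples and current readings.
history = {"db_ms": [20, 22, 19, 21, 20], "cpu_pct": [40, 42, 41, 39, 40]}
latest = {"db_ms": 95, "cpu_pct": 41}
print(find_abnormal(history, latest))
```

Here only the database time stands out, which tells you where to start looking, though not yet why.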

What you are seeing might only be a consequence, not a cause

One day, I got a degradation alert. Both the application and page load Apdex scores had dropped sharply.
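For readers unfamiliar with Apdex: it condenses response times into a score between 0 and 1, given a target time T. Responses at or under T count as satisfied, those up to 4T count as tolerating (half credit), and anything slower counts as frustrated. A minimal Python sketch of the calculation follows; the sample times and the 0.5 second target are made up.

```python
def apdex(response_times, target):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= target)
    tolerating = sum(1 for t in response_times if target < t <= 4 * target)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical samples in seconds: two satisfied, one tolerating, one frustrated.
score = apdex([0.1, 0.2, 0.6, 3.0], target=0.5)
print(score)  # 0.625
```

A sharp drop in this score means many requests moved from the satisfied bucket into the tolerating or frustrated ones.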

From the graph, the page load degradation was due to an application performance issue (the section in purple).

So I drilled down further and noticed that the time spent on database operations had increased.

Naturally, I looked at the DB performance and noticed an increase in response time, but throughput remained the same.

So what is wrong?

Next, I moved on to the infrastructure metrics and spotted the following:

One of the networks was saturated, and this is where the databases sit! From there, I noticed abnormal behavior in one of the databases, which kept pumping data out to one of the backend applications.

Restarting this backend application resolved the issue, and the site performed normally again. It took quite a bit of time to investigate, and one of the reasons is that the issue spanned multiple components.

To make life easier, we have set up a dashboard that summarizes issues all in one place. It flashes to draw attention, and from this screen we can see when things are not right in several places, so we know where to start digging.

This is just the beginning. With the probes in place, you can develop tools to streamline your work. Happy Monitoring!