IBM Cloud Monitoring with Sysdig: A Developer's Perspective

IBM Cloud Monitoring with Sysdig understands Kubernetes.

The advanced integration allows you to define how to monitor your systems using Kubernetes labels. Labeling in Kubernetes is incredibly powerful, so there are countless ways to slice and dice your infrastructure. Of course, all the usual views are there too, but you’ll soon want to move to a more advanced view of your services and deployments.

In addition, the topology view in IBM Cloud Monitoring with Sysdig automatically maps the interactions between components, which makes it easy to understand where things like network traffic are going. You can even define the grouping in the topology; later, you’ll see how we watch traffic between namespaces, for example.

Let’s begin by taking a look at my team’s API and seeing how it’s performing without having to worry about machines and networks. We’re running a fairly traditional API over a database to give access to important business entities and using Kubernetes to make it robust and easy to scale.

Great namespace support

Starting at the simplest level, you can see the Kubernetes pods grouped by namespace. That’s really convenient because namespaces provide a great way to organize an application or service in Kubernetes.

A list of namespaces and pod counts in Sysdig.

Figure 1: Group by namespace.

My team and I practice DevSecOps, meaning the developers on our team act as operators. This makes Sysdig’s strength even more important because it allows you to look at the things you are influencing. For example, you’re going to want to know which things are using available resources and how much. Sysdig makes it easy to see this over time, which is much more powerful than simply showing a snapshot.

Here’s a quick overview of which things are using noteworthy amounts of memory.

A chart showing the memory used by multiple namespaces over an hour.

Figure 2: Total resources by namespace.

Already, you’re seeing how useful Sysdig is, without even talking about machines, containers, or processes. So far, you have a glimpse at things that the operators are naturally interested in.

In Figure 2, you can also see I chose to clear some items manually to remove them from the chart. I’ll share how to apply filters later.

Great label support

Now, you may be wondering, “Did my change break production?” There are several ways I could answer that, but often you want to see how a system changed when you pushed a change to GitHub. Of course, that change flowed straight into production with a zero-touch rolling deployment after it passed its gating tests. Among other things, I’ve labeled each deployment with its Git commit, so now Sysdig can shine and show resource utilization by Git commit.

A chart showing memory usage rising for each version of the API.

Figure 3: Resources by Git commit.

Clearly, this service has a memory leak, but because memory rises for every version of the API, I can see it wasn’t my recent change that introduced it, so I can relax a little (for now).

By now, I’m sure you’re already thinking of other ways to use labels. It’s a fantastic setup and great that Kubernetes supports extensive labeling while Sysdig collects the metadata.
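As a rough illustration of the Git commit labeling described above, here is a minimal Python sketch that builds a merge patch adding a commit label to a Deployment and to its pod template. The label key `git-commit`, the helper name, and the sample SHA are assumptions made for this example; the article does not say which key or tooling my pipeline actually uses.

```python
import json

# Hypothetical label key; the article does not name the real one.
GIT_COMMIT_LABEL = "git-commit"


def git_commit_label_patch(commit_sha: str, short_len: int = 12) -> dict:
    """Build a merge patch that labels a Deployment and its pods with a commit.

    The SHA is shortened to keep the label readable in charts; Kubernetes
    label values are limited to 63 characters either way.
    """
    short_sha = commit_sha[:short_len]
    labels = {GIT_COMMIT_LABEL: short_sha}
    return {
        # Label on the Deployment object itself.
        "metadata": {"labels": labels},
        # Label on the pod template, so every pod carries the commit too.
        "spec": {"template": {"metadata": {"labels": dict(labels)}}},
    }


# Made-up SHA, for illustration only.
patch = git_commit_label_patch("0123456789abcdef0123456789abcdef01234567")
print(json.dumps(patch, indent=2))
```

The label is added alongside the existing labels rather than to the Deployment’s selector, which should stay unchanged. A patch like this would typically be applied by the deployment pipeline as part of the rolling update.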

Focus on what matters

You may have noticed in that last screenshot that I filtered what I was showing.

A simple scope filter limiting to the API namespace.

Figure 4: Simple scope.

Of course, you can tighten that scope and use simple drag-and-select to zoom in on a time window you care about. Ready to take a look at a specific Git commit?

A chart showing many pods with their memory rising up to 2 gigabytes and then rapidly falling.

Figure 5: Memory by pod.

What’s interesting here is that it wasn’t just one pod that was affected, but all of them due to the Kubernetes ingress balancing the troublesome traffic.

You probably noticed the peaks and are wondering why things stopped there. Again, Sysdig has captured the data needed to answer that question. Notably, you don’t have to go back to your YAML files and work out what the configuration might have been, because you now have the actual value.

A large number 2, a gigabyte symbol, and options for configuring the widget.

Figure 6: Memory limit.

Figure 6 illustrates the configuration page where I chose the “#” symbol to show a single number for simplicity (I could have just as easily chosen a chart to see how it changed over time).

Another feature of Sysdig is its ability to do the right thing by default. I could have set up the scale for this number, but I didn’t need to. The auto option worked beautifully!

So, now it’s clear that something wonderful is happening: when the memory reaches the limit, Kubernetes steps in and restarts the pod.

But, could this be causing issues for users?

A review of traffic in large numbers showing 2.33 million requests and 1430 errors.

Figure 7: HTTP error count.

Figure 7 shows that about 0.06% of requests (1,430 errors out of 2.33 million requests) are stumbling. Whilst I’m not happy about that, I know that by adding some new functionality I can help the business far more than by worrying about a few failed connections. As a developer, this is exactly the information you need to help guide you and your team.
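The percentage comes straight from the two numbers in Figure 7. A small Python sketch of the arithmetic follows; the function name is just for illustration, and the request count is the rounded 2.33 million shown in the widget, so the result is approximate.

```python
def error_rate_percent(total_requests: int, errors: int) -> float:
    """Return errors as a percentage of total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * errors / total_requests


# Values read from Figure 7 (the request count is rounded in the widget).
rate = error_rate_percent(total_requests=2_330_000, errors=1_430)
print(f"{rate:.2f}% of requests failed")  # prints "0.06% of requests failed"
```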

Peek inside the matrix

Sysdig’s topology view allows you to view traffic within the cluster as well as from outside of it. In Figure 8, I focus on the traffic from a namespace called chi-calc, which consumes the API. Being able to group network traffic by namespace is super useful.

Two highlighted namespace boxes showing traffic between a client and the API, plus two faded boxes showing other traffic.

Figure 8: API clients.

Excitingly, I can dig into each of those boxes to see what’s in them. In fact, this view covered a time when I ran three instances of the client job. When I expand the box, you can see each of the three client instances. You can expand the API box too and see the traffic between specific instances.

When I’m looking to see what changed, I often build a picture like the one above and then alter the time window. Sysdig has a wonderful way of showing the difference using dashed lines and boxes to show items that went away.

In the image below, I set the time window to be around one of the client job instances. Notice that the older two are marked with dashed boxes because they were previously shown but are no longer relevant.

Many boxes showing traffic flowing from client pods to API pods with some surrounded by dashes to indicate they have gone away.

Figure 9: Changes over time.

This shows the power of being able to group by arbitrary things. Initially, I didn’t care about specific instances and just wanted to look at the traffic between namespaces. In a dynamic infrastructure, you don’t want the little details of pods coming and going to distract you from the big picture.
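To make the idea of grouping concrete, here is a small Python sketch that rolls pod-to-pod traffic up to namespace-to-namespace totals. This is only an illustration of the concept, not how Sysdig implements its topology view. The `Flow` record, the `api` namespace name, the pod names, and the byte counts are all made up for the example; only the `chi-calc` namespace comes from my cluster.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Flow(NamedTuple):
    """One observed connection between two pods (hypothetical record shape)."""
    src_namespace: str
    src_pod: str
    dst_namespace: str
    dst_pod: str
    bytes_sent: int


def traffic_by_namespace(flows: Iterable[Flow]) -> dict:
    """Sum bytes per (source namespace, destination namespace) pair."""
    totals = defaultdict(int)
    for flow in flows:
        # Pod names are ignored on purpose: pods come and go,
        # while the namespace-level picture stays stable.
        totals[(flow.src_namespace, flow.dst_namespace)] += flow.bytes_sent
    return dict(totals)


# Illustrative data: three client job instances calling two API pods.
flows = [
    Flow("chi-calc", "client-1", "api", "api-a", 1200),
    Flow("chi-calc", "client-2", "api", "api-b", 800),
    Flow("chi-calc", "client-3", "api", "api-a", 500),
]
print(traffic_by_namespace(flows))  # {('chi-calc', 'api'): 2500}
```

Grouping by a different key, such as the pod name or a label, only means changing the tuple used as the dictionary key.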

Start your monitoring using Sysdig today

Thanks for coming along with me as I explored how my API was behaving. Clearly, the Sysdig team’s experience building useful technology is helping me and my team to do the same.

Start using IBM Cloud Monitoring with Sysdig right now and get labeling your objects in Kubernetes to make it shine.
