Logging and Error Handling in the Cloud

The importance of logging in the cloud

When an application in the cloud has a glitch or the customer sees the dreaded “Page not available” message, people want to know what happened and why. Management demands a root cause analysis (RCA) report as soon as the problem is fixed, and sometimes does not even have the patience to wait for the fix. If developers have not done a good job of logging and handling errors, RCA becomes much harder. Cloud-native application developers tend to forget error-logging 101 because they think redeploying the application fixes everything. Catching and logging errors and performing RCA are two sides of the same coin, and that is no different in the cloud.

Why logging?

A basic tenet of software development, logging is useful to developers, operations, security teams, and end users. Remember watching Star Trek and hearing, “Captain’s Log, Stardate 42073.1. There has been an outbreak of an unclassified plasma plague in the Rachelis system. We’re on an emergency run to collect specimens.”? As viewers, we knew where Captain Picard and the Starship Enterprise were heading and why.

Developers need the ability to find problems in their applications both during and after deployment. Debugging tools help, but log messages help in localizing the problem. If regular logs are not enough, many applications can also provide debug logs with different switches, which offer more details about what the application is doing.
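As a minimal sketch of such a debug switch, the Python example below reads the log level from an environment variable. The variable name `APP_LOG_LEVEL`, the logger name `orders`, and the log messages are assumptions made up for illustration; they are not taken from any particular product.

```python
import io
import logging

# Supported values for the hypothetical APP_LOG_LEVEL switch.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def build_logger(name, stream, env):
    """Create a logger whose verbosity is chosen by APP_LOG_LEVEL (default INFO)."""
    level = LEVELS.get(env.get("APP_LOG_LEVEL", "INFO").upper(), logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    logger.setLevel(level)
    return logger


def run_demo(env):
    """Emit one regular and one debug message; return the lines that were written."""
    stream = io.StringIO()
    logger = build_logger("orders", stream, env)
    logger.info("order accepted")
    logger.debug("payload validated")
    return stream.getvalue().splitlines()
```

With an empty environment, `run_demo({})` writes only the INFO line. Passing `{"APP_LOG_LEVEL": "debug"}` also writes the DEBUG line, with no change to the application code. In a real service you would pass `os.environ` instead of a plain dictionary.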

Operations teams oversee many systems, both on- and off-premises. Logs and dashboards provide the visual cues that let them know how systems and applications are running.

IT security teams constantly view logs and analyze them for different reasons, including audit and compliance purposes. HIPAA, PCI, GDPR, and other compliance regulations place huge requirements around how logs are gathered, stored, and managed.

Finally, it is good programming practice to let end users of applications know what happened and why in the case of a glitch. When these glitches are happening in the cloud, there can be a sense of helplessness.

What to log?

According to the 12-factor application guidelines, logs are streams of aggregated, time-ordered events. That raises the question of what should be logged, keeping in mind the mantra of logging neither too much nor too little. The primary objective of logging is troubleshooting, but logs are also generated for other purposes, such as alerting and business intelligence.

At a minimum, each log entry needs to include enough information for the intended monitoring and analysis. It could be full content data, but it is more likely to be an extract or summary of properties. The application logs must record “when, where, who, and what” for each event. The properties for these will be different depending on the architecture, class of application, and host system/device, but often include what is shown in the table:

Property Description
When
  • Log date and time (international format)
  • Event date and time (the event time stamp may be different to the time of logging, e.g., server logging where the client application is hosted on a remote device that is only periodically or intermittently online)
Where
  • Application identifier (e.g., name and version)
  • Application address (e.g., cluster/host name or server IPv4 or IPv6 address and port number, workstation identity, local device identifier)
  • Service (e.g., name and protocol)
  • Geolocation
  • Window/form/page (e.g., entry point URL and HTTP method for a Web application, dialog box name)
  • Code location (e.g., script name, module name)
What
  • Type of event
  • Severity of event
  • Security relevant event flag (if the logs contain non-security event data too)
  • Description
Who (human or system)
  • Source address (e.g., user’s device/machine identifier, user’s IP address, cell/RF tower ID, mobile telephone number)
  • User identity (if authenticated or otherwise known, e.g., user database table primary key value, user name, license number)
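One way to picture these properties is as a single structured log entry. The sketch below groups a subset of them under "when", "where", "what", and "who" and serializes the result as one line of JSON. The function name, the field names, and the sample values are illustrative assumptions, not a prescribed schema.

```python
import json
from datetime import datetime, timezone


def make_log_entry(event_at, event_type, severity, description,
                   app, host, code_location, source_address,
                   user_id=None, security_relevant=False, logged_at=None):
    """Build a log entry that answers when, where, what, and who."""
    if logged_at is None:
        logged_at = datetime.now(timezone.utc)
    return {
        "when": {
            # ISO 8601 timestamps; the event time may precede the log time.
            "logged_at": logged_at.isoformat(),
            "event_at": event_at.isoformat(),
        },
        "where": {
            "application": app,            # e.g., name and version
            "host": host,                  # e.g., cluster/host name or IP address
            "code_location": code_location,
        },
        "what": {
            "type": event_type,
            "severity": severity,
            "security_relevant": security_relevant,
            "description": description,
        },
        "who": {
            "source_address": source_address,
            "user_id": user_id,            # None if the user is not known
        },
    }


entry = make_log_entry(
    event_at=datetime(2019, 2, 26, 11, 59, 58, tzinfo=timezone.utc),
    logged_at=datetime(2019, 2, 26, 12, 0, 0, tzinfo=timezone.utc),
    event_type="login_failure",
    severity="WARNING",
    description="Invalid password",
    app="storefront 1.4.2",
    host="web-03",
    code_location="auth.login",
    source_address="203.0.113.7",
    user_id="42",
    security_relevant=True,
)
line = json.dumps(entry, sort_keys=True)  # one event per line
```

Keeping the event time separate from the log time covers the case in the table where a client is only intermittently online and its events reach the server later.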

Logging best practices

Here are some guidelines we follow when advising clients who are developing or moving applications to the cloud:

  • Don’t use different formats in the same file; stick to one common format.
  • Don’t log everything at the same log level. Logging libraries offer several log levels.
  • Allow the log level to be changed dynamically and remotely. For example, switch a production service from INFO to DEBUG while troubleshooting, without redeploying it.
  • Add standardized timestamps while keeping them as granular as possible, and include time zone information.
  • Add transaction identifiers that will help track distributed transactions across composite applications, middleware, and microservices.
  • Keep multi-line events to a minimum.
  • Add the exact method name and line number to the error message. This makes it much easier to find the issue in code.
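Several of these guidelines can be sketched with Python's standard `logging` module: one common format, UTC timestamps with millisecond granularity, a transaction identifier, the function name and line number, and a log level that changes at run time. The logger name `payments`, the `txn=` field, the `TransactionFilter` class, and the messages are assumptions chosen for illustration.

```python
import io
import logging
import time


class TransactionFilter(logging.Filter):
    """Attach a transaction identifier to every record that passes the handler."""

    def __init__(self, transaction_id):
        super().__init__()
        self.transaction_id = transaction_id

    def filter(self, record):
        record.transaction_id = self.transaction_id
        return True


def build_logger(stream, transaction_id):
    """Return a logger that uses one format for all messages."""
    formatter = logging.Formatter(
        fmt="%(asctime)s.%(msecs)03dZ %(levelname)s txn=%(transaction_id)s "
            "%(funcName)s:%(lineno)d %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    formatter.converter = time.gmtime  # render timestamps in UTC, hence the trailing Z
    handler = logging.StreamHandler(stream)
    handler.setFormatter(formatter)
    handler.addFilter(TransactionFilter(transaction_id))
    logger = logging.getLogger("payments")
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(logging.INFO)
    return logger


def charge_card(logger):
    logger.debug("calling payment gateway")
    logger.error("gateway timeout")


def demo():
    stream = io.StringIO()
    logger = build_logger(stream, "txn-123")
    charge_card(logger)              # only the ERROR line is written at INFO
    logger.setLevel(logging.DEBUG)   # raise verbosity at run time
    charge_card(logger)              # now both DEBUG and ERROR lines are written
    return stream.getvalue().splitlines()
```

Each line written by `demo()` starts with a UTC timestamp, followed by the level, the transaction identifier, and the function name and line number of the logging call. In a real service the transaction identifier would come from the incoming request rather than a fixed string, and the `setLevel` call would be triggered by a remote configuration change.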

Lastly, log files should be copied and moved to permanent storage. They should be kept safe and confidential, even when backed up.

We would like to know what practices other cloud application development teams follow.

Executive IT Specialist, WW Hybrid Cloud Services

Ashok Iyengar

Executive Cloud Architect - Cloud Adoption and Transformation
