
AWS Observability Best Practices For Your Application

"Attaining application observability, which empowers organizations to gain valuable insights into the inner workings of their applications, presents a prevalent business challenge. This challenge is often compounded by a common obstacle: the effective implementation of instrumentation. In an era where observability holds increasing significance, achieving it can be impeded by the complexities of instrumentation.

This article aims to guide you through Amazon's recommended strategies for surmounting this challenge. By doing so, organizations can efficiently collect and analyze their application's data, enabling them to gain actionable insights and ensure optimal application performance. After reading, you'll be equipped with the knowledge to:

  • Understand the significance of instrumentation in achieving observability.
  • Address the complexities of handling cardinality within your visibility framework.
  • Explore the array of tools and services provided by AWS for effective instrumentation.
  • Implement Amazon's best practices for attaining comprehensive visibility."


While businesses span various industries and geographic locations, they all have common priorities: delivering value through a secure, high-performance application, even in the face of inherent technological limitations. Application observability essentially revolves around the pursuit of optimal service functionality, finely tuned to meet customer needs and stakeholder expectations. This pursuit stems from a deep comprehension of how a particular application functions.

The concept of observability is rapidly gaining traction, particularly in the context of complex systems like microservices or distributed systems. When addressing the intricacies of application observability, one can draw upon established best practices, which we will delve into extensively throughout this article. These practices empower businesses to dissect their application's data, enabling them to proactively address issues, enhance performance, and prevent future problems.

We invite you to join us on this journey of exploring best practices in instrumentation. Incorporating these practices into your observability strategy will sharpen your understanding of an application and simplify its ongoing maintenance.

Navigating the Choices in Instrumentation

Instrumentation in the realm of application observability involves the deliberate incorporation of code, tools, or agents into your software applications and systems. Its aim is to gather data and extract insights regarding their performance. By strategically embedding monitoring points within your application's code or infrastructure, instrumentation empowers the collection of vital information, including metrics, logs, and traces.

Instrumentation offers a range of choices, but it's important to consider that it's a collaborative endeavor. Optimal results are attained when there's a mutual contribution from both the cloud provider and the customer.


Instrumenting with AWS Lambda - A Serverless Perspective

Let's delve into some of the services recommended by Amazon for effective instrumentation, starting with a serverless computing service like AWS Lambda. Amazon's documentation for each service provides essential information, including:

Available Statistics: AWS clarifies which statistics are accessible for each metric, such as sums, percentiles, or averages, allowing you to focus on what matters.

Metric Explanation: Amazon provides straightforward descriptions of what each metric measures, ensuring clarity even for non-technical users.

Dimensions: Many metrics involve multiple dimensions, which are crucial to understand. For example, in CloudWatch Synthetics, a canary has an overall duration metric, and each step of the canary has its own duration metric.

In terms of data collection, various services can send logs to CloudWatch or S3; Lambda automatically sends metrics and logs. Setting up tracing in Lambda is also straightforward: enabling AWS X-Ray tracing captures a trace each time your Lambda function runs.

Instrumenting Lambda functions is user-friendly and can be compared to taking notes. You use your runtime's logging facility (for example, Python's logging module or console.log in Node.js) to record essential data, such as important information or warnings. These logs, essentially measurements of your function's behavior, are sent to Amazon CloudWatch for review.
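To make the note-taking analogy concrete, here is a minimal sketch of a Lambda handler in Python. The function name, event shape, and field names are hypothetical; the point is that structured log lines written through the standard logger land in CloudWatch Logs, where they can be searched or later turned into metrics.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    """Hypothetical Lambda handler that logs structured data.

    Anything written through the logger ends up in CloudWatch Logs
    automatically; emitting JSON keeps the entries queryable.
    """
    order_id = event.get("order_id", "unknown")
    logger.info(json.dumps({"message": "processing order", "order_id": order_id}))
    try:
        # ... business logic would go here ...
        result = {"status": "ok", "order_id": order_id}
    except Exception:
        # Log the stack trace before letting Lambda report the failure.
        logger.exception("order processing failed")
        raise
    return result
```

Logging JSON rather than free-form text pays off later: structured entries are what CloudWatch Logs Insights queries and metric filters operate on.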

Instrumenting with Containers

Moving beyond serverless computing, let's explore instrumentation in the context of containers. Containers offer a wide range of possibilities, but a one-size-fits-all solution for logs, metrics, and traces doesn't exist. Here's a breakdown of some options:

  • AWS Distro for OpenTelemetry: Collects metrics and traces, forming a solid foundation for observability, but doesn't handle logs directly.
  • CloudWatch Agent: Effectively collects logs and metrics but doesn't cover trace collection.
  • Fluent Bit and FireLens: Primarily focus on collecting logs, with no metrics or trace capabilities.


As there isn't a single solution that seamlessly handles all three aspects of instrumentation, we'll provide recommendations for choosing the right instrumentation agent. For now, we suggest starting with OpenTelemetry for metrics and traces and selecting additional tools based on your log management needs.

Leveraging OpenTelemetry for Metrics and Traces

OpenTelemetry is an open standard for gathering observability data, including metrics and traces (with logs support still in draft at the time of writing). It simplifies the collection of these signals, whether performed manually or automatically, across system logs, infrastructure metrics, application logs, tracing data, and application metrics.

OpenTelemetry's flexibility allows you to collect data once and distribute it to multiple destinations, improving correlation and giving you options for your observability setup. You can use collectors such as the AWS Distro for OpenTelemetry alongside logging agents such as FireLens, Fluent Bit, or the CloudWatch agent.
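The "collect once, send everywhere" idea can be sketched in plain Python (this is an illustration of the pattern, not the OpenTelemetry SDK; the `Signal` and `Pipeline` names and the two sink destinations are invented for the example):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Signal:
    """One piece of telemetry: a metric, a trace span, or a log record."""
    kind: str                 # "metric" | "trace" | "log"
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)

class Pipeline:
    """Collects each signal once and delivers it to every registered
    exporter, mimicking how an OpenTelemetry collector fans data out
    to several backends."""
    def __init__(self) -> None:
        self.exporters: List[Callable[[Signal], None]] = []

    def add_exporter(self, exporter: Callable[[Signal], None]) -> None:
        self.exporters.append(exporter)

    def emit(self, signal: Signal) -> None:
        for exporter in self.exporters:
            exporter(signal)

# One signal, two hypothetical destinations:
cloudwatch_sink, s3_sink = [], []
pipeline = Pipeline()
pipeline.add_exporter(cloudwatch_sink.append)
pipeline.add_exporter(s3_sink.append)
pipeline.emit(Signal("metric", "request.latency", {"service": "checkout"}))
```

Because the application emits each signal exactly once, swapping or adding a backend is a configuration change in the pipeline, not a code change in the application.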

Capturing Critical Telemetry

Amazon emphasizes a meticulous approach to collecting telemetry for every application and service. They record information for each unit of work or request, such as HTTP requests or messages from queues. These telemetry data are logged in a structured "request data log."

To effectively operate a large-scale e-commerce site, comprehensive visibility into code performance and issues is essential. Without it, basic questions, such as why requests fail or why responses are slow, become hard to answer when problems occur. Amazon addresses this challenge by using a common metrics library across the company. When a service receives a new request, it instantiates a metrics object, collecting vital data like request details, cache usage, database interactions, and timeout distinctions.

This approach enhances operational efficiency without significantly increasing the codebase, offering comprehensive insights for diagnosing and addressing issues.
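A minimal sketch of that pattern, assuming a hypothetical `RequestMetrics` class: one metrics object per unit of work accumulates counters and timings, then flushes a single structured "request data log" line when the request completes.

```python
import json
import time

class RequestMetrics:
    """Accumulates counters and timings for one unit of work, then
    flushes a single structured request-data-log line at the end."""
    def __init__(self, operation: str):
        self.operation = operation
        self.counters = {}
        self.start = time.monotonic()

    def count(self, name: str, value: int = 1) -> None:
        self.counters[name] = self.counters.get(name, 0) + value

    def flush(self) -> str:
        record = {
            "operation": self.operation,
            "duration_ms": round((time.monotonic() - self.start) * 1000, 2),
            **self.counters,
        }
        return json.dumps(record)

# One request's worth of telemetry, emitted as a single log line:
metrics = RequestMetrics("GetOrder")
metrics.count("cache_hit")
metrics.count("db_reads", 2)
line = metrics.flush()
```

Emitting one consolidated record per request, rather than scattered log statements, is what keeps the instrumentation footprint small while still capturing cache usage, database interactions, and timings together.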

The Embedded Metric Format Configuration

A structured log entry carries measurements alongside attributes represented as strings. The embedded metric format lets you select specific properties in those logs to be converted into metrics: it specifies the namespace used to group the metrics and indicates which properties should be transformed into metrics, which can then be visualized using tools like CloudWatch.

However, while this approach signals performance issues, it doesn't provide insights into their underlying causes. To understand the root causes, it's essential to examine the architecture, particularly when your application involves various operations with distinct dependencies. Metric dimensions become valuable here, allowing you to group metrics by the operation, resulting in more granular visibility.

Incorporating dimensions allows you to capture detailed telemetry data, including custom business metrics, which can be transformed into measurements for your dashboards and alarms, providing deeper visibility and understanding of your system's behavior.

It's important to note that various observability platforms have their methods for capturing metrics at scale. In this context, we recommend the open-source client library called the Embedded Metric Format for its flexibility and ability to work with different collectors and logging agents.
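As a sketch of what an embedded-metric-format log line looks like, the helper below builds one with only the standard library (the `MyApp` namespace, `Operation` dimension, and `Latency` metric are hypothetical; the `_aws` metadata structure follows CloudWatch's published EMF schema):

```python
import json
import time

def emf_record(namespace, dimensions, metrics, **values):
    """Build a log line in CloudWatch's embedded metric format.

    `dimensions` is a list of dimension names, `metrics` maps metric
    names to units; every referenced name must also appear in `values`.
    """
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [dimensions],
                "Metrics": [{"Name": n, "Unit": u} for n, u in metrics.items()],
            }],
        },
        **values,  # the actual property values, including the dimensions
    })

# A latency measurement grouped by the Operation dimension:
line = emf_record(
    "MyApp",                       # hypothetical namespace
    ["Operation"],
    {"Latency": "Milliseconds"},
    Operation="GetOrder",
    Latency=42,
)
```

Written to CloudWatch Logs, a line like this yields both a searchable log entry and a `Latency` metric sliced by `Operation`, which is exactly the dimension-based granularity described above.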

Dashboards for Enhanced Observability

Dashboards are essential tools in Amazon's observability ecosystem, providing a vital means to quickly and efficiently comprehend the status and performance of a system from a specific perspective. For instance, stakeholders or product managers might require dashboards tailored to focus on user experience.

However, a common pitfall is overloading dashboards with excessive information. This can hinder their effectiveness, particularly during critical operational situations when they are most needed. Dashboards are most valuable during operational events when different team members assume various roles to address and resolve issues. Thus, simplicity should be the guiding principle when designing dashboards.

Different types of dashboards serve unique purposes within an organization's operational and monitoring framework, catering to the diverse needs of various stakeholders and scenarios. Some of these types include:

Customer Experience Dashboards: Providing a high-level view of customer experience, these dashboards facilitate communication between business leaders, technical teams, and customers. They focus on key aspects of customer experience and highlight the impact of actions on end-users.

System-Level Dashboards: Dedicated to web-based services, these dashboards furnish engineers with data on system performance, particularly customer-facing endpoints accessed through UI or APIs.

Microservice Dashboards: These dashboards enable quick assessment of customer experience within individual services, helping engineers maintain focus during operational events. They also track dependencies between microservices.

Capacity Dashboards: Used for resource and service monitoring, capacity dashboards are valuable for long-term capacity planning, ensuring teams have sufficient computing and storage resources.

While these are some of the available dashboard types, it's important to note that visibility isn't about cramming dashboards with countless metrics or utilizing every possible kind of metric. What truly makes a difference in system observability is cultivating a culture of continuous improvement. This involves regularly integrating lessons learned from past events.

An effective approach to achieving this is managing dashboards programmatically using an infrastructure-as-code approach. For instance, AWS has a routine of refining dashboards as part of its continuous development culture. By randomly selecting services for auditing, AWS ensures that all teams are prepared to discuss their dashboards during operational reviews, promoting proactive readiness.
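Managing dashboards as code can be as simple as generating the dashboard body from plain data structures kept in version control. The sketch below builds one CloudWatch dashboard widget with the standard library only; the function name `checkout-fn` and the widget layout are made-up examples, and the resulting JSON would be deployed via the CloudWatch API or CloudFormation rather than hand-edited in the console.

```python
import json

def lambda_errors_widget(function_name: str, region: str) -> dict:
    """One widget of a CloudWatch dashboard body, expressed as plain
    data so it can be reviewed and versioned like any other code."""
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "title": f"{function_name} errors",
            "metrics": [["AWS/Lambda", "Errors", "FunctionName", function_name]],
            "stat": "Sum",
            "period": 300,
            "region": region,
        },
    }

# Dashboard body, ready to hand to the CloudWatch API or a template:
body = json.dumps({"widgets": [lambda_errors_widget("checkout-fn", "us-east-1")]})
```

Because the body is generated, an audit or operational review can diff dashboards over time, which supports the continuous-refinement routine described above.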

Best Practices for High Cardinality Dimension Management

Effective management of high cardinality dimensions is crucial for optimizing costs and maintaining observability. Here are some best practices:

Identify Potential High Cardinality Dimensions:

High cardinality dimensions include attributes with numerous unique values, such as user IDs, request paths, or resource names. Recognizing which attributes drive high cardinality is essential for making informed decisions about dimension management and analysis.
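A quick way to spot candidate high cardinality dimensions is to count unique values per attribute across a sample of telemetry. This stdlib sketch (the sample records and attribute names are invented) shows the idea:

```python
from collections import defaultdict

def cardinality_per_attribute(records):
    """Count unique values per attribute to spot dimensions that would
    explode into many distinct metric series."""
    uniques = defaultdict(set)
    for record in records:
        for key, value in record.items():
            uniques[key].add(value)
    return {key: len(values) for key, values in uniques.items()}

# Sample log records: user_id is high cardinality, region is not.
logs = [
    {"user_id": "u1", "region": "us-east-1"},
    {"user_id": "u2", "region": "us-east-1"},
    {"user_id": "u3", "region": "eu-west-1"},
]
counts = cardinality_per_attribute(logs)
```

Attributes whose unique-value count grows with traffic (user IDs, request paths) are the ones to keep in logs rather than promote to metric dimensions.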

Ingest Telemetry as Logs First:

A cost-effective approach to handling high cardinality data is to ingest telemetry as logs before converting them into metrics. Logs offer greater flexibility for capturing detailed data, including high cardinality dimensions, without incurring the storage and query costs associated with high cardinality metrics. Storing logs for extended periods is also more economical than storing metrics based on unique dimensions. Starting with logs allows you to align your decisions about which dimensions to elevate to high cardinality metrics with your specific analysis requirements.
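The logs-first approach can be sketched as follows: the full detail, including the high cardinality `user_id`, stays in the log lines, while the metric is aggregated only over the low cardinality `region` dimension (field names here are hypothetical):

```python
import json
from collections import Counter

# Structured log lines retain every high-cardinality detail.
log_lines = [
    json.dumps({"user_id": "u1", "region": "us-east-1", "latency_ms": 40}),
    json.dumps({"user_id": "u2", "region": "us-east-1", "latency_ms": 55}),
    json.dumps({"user_id": "u3", "region": "eu-west-1", "latency_ms": 70}),
]

# The derived metric uses only the low-cardinality dimension, so the
# number of metric series stays bounded no matter how many users exist.
requests_by_region = Counter(json.loads(line)["region"] for line in log_lines)
```

If a per-user investigation is ever needed, the logs still hold that detail; you have simply chosen not to pay for it as a metric.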

Create Metrics Using Appropriate Dimensions:

Once you've ingested telemetry as logs and identified the most relevant high cardinality dimensions, purposefully generate metrics with those dimensions. Avoid creating metrics for all dimensions immediately, as this can result in unnecessary expenses. Crafting metrics for dimensions that provide valuable insights and align with your observability goals allows you to prioritize crucial dimensions, optimize metric usage, and minimize the cost impact of high cardinality while still collecting relevant information.

Leverage the Embedded Metric Format (If Using CloudWatch):

If you use AWS CloudWatch, take advantage of the embedded metric format. This format allows you to create custom metrics with high cardinality dimensions directly from your log data. By defining custom metrics using structured logs, you can reduce the need to create high cardinality metrics in advance. This approach empowers you to manage costs by selectively deciding which dimensions to elevate to metrics, avoiding unnecessary metric generation. It offers the benefits of log flexibility while maintaining metric efficiency, helping you optimize costs while ensuring observability.

Mitigating Alert Fatigue: Best Practices for Effective Alarm Management

Strategizing your alarm alert system is a critical aspect of communication between your business and its technical operations. Alerts aren't just meant to be triggered; they should lead to actionable steps. It's essential to be intentional about your alerts, as they can significantly impact your business outcomes. A detrimental scenario involves alert fatigue and excessive alarm noise: having too many alerts can lead to overlooking critical issues.

To ensure clarity and effectiveness in your alarm management:

Define Expected Actions: It's crucial to clearly define the expected actions to take when alerts are triggered. Use playbooks to establish standardized procedures that guide anyone responding to an alert, regardless of their experience or familiarity with the business. This ensures that the right steps are followed precisely and consistently.

Remediate Issues Through Runbooks: Whenever possible, automate or semi-automate issue remediation through runbooks. Runbooks provide a set of pre-defined actions to address specific issues. By automating these actions, you can reduce the manual effort required to resolve common problems and expedite the response to alerts.

Implementing these best practices can help you manage alarms effectively, reduce alert fatigue, and ensure that alerts lead to actionable outcomes that benefit your business.

Best Practices for Effective Alarm Management

Alarm on Key Performance Indicators and Workload Metrics:

Identify the key performance indicators (KPIs) that align with your application's objectives and user expectations. Simultaneously, determine the critical metrics for your service or application. For example, a high-volume e-commerce website may prioritize tracking metrics such as order processing rates, page load speed, or search latency. Using KPIs and workload metrics as a guide ensures that your alarms are aligned with your strategic goals.

Alert When Workload Outcomes Are at Risk:

Configure alarms to be triggered by critical metrics only. Alarms should activate when your workload's performance is in jeopardy, preventing alert fatigue by minimizing unwarranted alerts and flagging only substantial deviations from desired results. CloudWatch Synthetics, which simulates user interactions with your system, is an effective method for achieving this.

Create Alarms on Dynamic Thresholds:

For application workloads with dynamic and fluctuating activity levels, consider using dynamic thresholds that adapt to the workload's typical behavior instead of static thresholds. Relevant thresholds that accommodate fluctuations can be set based on an analysis of historical data. This approach reduces false alarms and the volume of unnecessary notifications, ensuring that notifications are pertinent.
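One simple way to derive such a threshold from history is a band of a few standard deviations around the recent mean, a rough stand-in for what CloudWatch anomaly detection models do with far more sophistication (the latency values and the factor k below are illustrative):

```python
import statistics

def dynamic_threshold(history, k=3.0):
    """Threshold derived from historical data: mean plus k standard
    deviations, so it tracks the workload's typical behavior instead
    of a hand-picked static number."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean + k * stdev

# Recent latency samples in milliseconds (illustrative values):
latencies = [100, 105, 98, 110, 102, 99, 104]
threshold = dynamic_threshold(latencies)
breached = 180 > threshold  # a 180 ms observation would trip the alarm
```

Recomputing the threshold periodically keeps it aligned with daily or seasonal patterns, which is what suppresses the false alarms a static threshold would generate.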

Correlate Alarms and Notify When Correlation Occurs:

When dealing with numerous alarms, ensure that alarms within a composite alarm do not trigger notifications separately. Composite alarms, available in CloudWatch and similar features in other observability platforms, allow you to use operators like "and," "or," and "not" to merge multiple metric alarms. These composite alarms notify you when something unexpected occurs, effectively correlating alarms while avoiding excessive notifications.
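The effect of a composite rule can be sketched in a few lines. This is not the CloudWatch API, just the evaluation logic of a hypothetical rule combining three child alarms with "and" and "not": the children never page anyone on their own, only the combination does.

```python
def in_alarm(states, name):
    """True when the named child alarm is in the ALARM state."""
    return states.get(name) == "ALARM"

def should_notify(states):
    """Mimics a composite rule like:
    ALARM(high_latency) AND ALARM(high_errors) AND NOT ALARM(deployment)."""
    return (in_alarm(states, "high_latency")
            and in_alarm(states, "high_errors")
            and not in_alarm(states, "deployment"))

# Both symptoms fire, and no deployment is in progress, so we notify:
states = {"high_latency": "ALARM", "high_errors": "ALARM", "deployment": "OK"}
notify = should_notify(states)
```

The "not deployment" clause shows a common use: suppressing notifications during an event, such as a rollout, that is expected to perturb the metrics.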

Leverage Machine Learning Algorithms for Anomaly Detection:

Utilize machine learning algorithms to develop advanced anomaly detection models that understand your application's metric patterns and detect genuine deviations from expected behavior. These models adapt and comprehend intricate metric relationships, resulting in more precise and timely alerts. This approach helps mitigate alarm fatigue and enhances alert accuracy, ensuring that you're notified when significant anomalies occur.
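As a toy illustration of the banding idea behind such models (not the algorithm CloudWatch actually uses), a rolling window can flag points that fall outside a few standard deviations of recent behavior:

```python
import statistics
from collections import deque

class RollingAnomalyDetector:
    """Simple statistical stand-in for a learned anomaly model: flags
    points outside a band of k standard deviations around the rolling
    mean of recent observations."""
    def __init__(self, window: int = 50, k: float = 3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def observe(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 10:  # wait for a minimal baseline
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) > self.k * stdev
        self.history.append(value)
        return anomalous

# Twenty normal readings, then one spike:
detector = RollingAnomalyDetector()
flags = [detector.observe(v) for v in [100] * 20 + [500]]
```

A production model additionally learns periodicity and trend, which is why the managed, ML-based detectors produce fewer false positives than a fixed band like this one.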

Best Practices for Avoiding Dangling Traces in Application Monitoring

Understanding tracing and implementing effective tracing practices is crucial for maintaining observability in complex systems. Here are some best practices to avoid dangling traces:

Instrument All Your Code:

To achieve comprehensive visibility into every step of your application's functioning, instrument all of your code. This includes capturing and tracing each operation, such as functions, methods, and interactions with AWS services. Insert code snippets into your application to generate and transmit traces to the tracing system. Instrumenting thoroughly generates sufficient data to track the journey of requests and responses, facilitating the detection of bottlenecks, failures, or delays within your application.

Understand Which AWS Services Support Tracing:

Familiarize yourself with specific AWS services that inherently support tracing. Understand how these services interact with your chosen tracing system. Leveraging these services' native features, such as automatic trace creation and propagation, can reduce the risk of dangling traces in scenarios involving interactions between different services and tracing systems.

If a Service Does Not Support Tracing:

  • Pass Trace Context Across Service Boundaries:

When a service does not include inherent tracing support, manually transmit trace context across its boundaries to avoid interrupting trace continuity and losing essential data. Include the trace ID and relevant details within the request or message to preserve trace context.

  • Resume the Context on the Other Side:

When a non-tracing-supported service receives the trace context, extract it from the incoming request or message and resume the trace. Continuing the trace context by filling the gap in trace information allows tracking the path of a particular request within your app, even if segments of that journey involve services that do not provide native tracing.
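The two bullets above can be sketched as an inject/extract pair. This is a simplified illustration, not the X-Ray or W3C trace-context wire format: the `_trace_id` field name and the message shape are invented for the example.

```python
import uuid

def inject_trace_context(message: dict, trace_id: str) -> dict:
    """Producer side: attach the trace ID to the outgoing message so
    the trace survives a hop through a service with no native tracing."""
    return {**message, "_trace_id": trace_id}

def resume_trace_context(message: dict) -> str:
    """Consumer side: pull the trace ID back out, or start a fresh
    trace if the hop dropped it, and carry on the same trace."""
    return message.pop("_trace_id", None) or uuid.uuid4().hex

trace_id = uuid.uuid4().hex
outgoing = inject_trace_context({"body": "ship order 42"}, trace_id)
# ... message crosses a queue or service without tracing support ...
resumed = resume_trace_context(outgoing)
```

Because the consumer resumes with the same ID the producer injected, the segments on both sides of the untraced hop join into one end-to-end trace instead of two dangling fragments.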


13/04/2023
