
Telemetry with Identity integration

A balance of information

One recent challenge I have faced is finding the right mix of metrics and log events to solve the variety of problems we face in monitoring our services. It is a common problem, and it is usually supported by existing metrics and events, yet developers and operators can find themselves straddling request/response telemetry and identity. A requirement such as observing an account resource with respect to its usage of APIs means creating a common correlation between the two. The need to preserve privacy is a fundamental element that should be built into observability, and diligence should be practiced when picking the right platform: a careless implementation can create immense trouble for a product and a company.

Experienced observability and telemetry experts often decompose request and response telemetry into 3 availability speeds: Hot, Warm, and Cold. Without diving too much into details:

  • Hot: available within 5 minutes or less; numerical, count-based data with a set of dimensions. These are called metrics.
  • Warm: available within roughly 15 minutes; an object-based event which may represent state or a transaction at a point in time.
  • Cold: retained for 30 days or longer; often derived from the Warm path and optimized for longer-term storage and usage.

These 3 categories can typically make up all parts of observability in a mature product and service.

A common, established metric in the operation of the service might be a latency metric with a set of dimensions.

[latencyMs, OperationResult, OperationName, Region]
[100, "Success", "Search", "East US"]
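
As a rough sketch, emitting a sample of that metric from request-handling code might look like the following. The emit_metric helper and its signature are hypothetical placeholders for whichever metrics SDK you use, not a specific library.

def emit_metric(name, value, dimensions):
    # Hypothetical hot-path emitter; a real implementation would hand the
    # sample to a metrics SDK that aggregates it into series within minutes.
    print(name, value, dimensions)

emit_metric("latencyMs", 100, {
    "OperationResult": "Success",
    "OperationName": "Search",
    "Region": "East US",
})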

These dimensions may not include any customer-identifying information. It is also very likely that a request/response event (Warm path) is already being emitted.

{
	"host": "service.mydomain.com",
	"requestPath": "/api/search/",
	"queryString": "?query=abc%20def",
	"requestHeaders": "Accept: application/json
	Authorization: Bearer xxxxxx",
	"httpMethod": "GET",
	"responseStatusCode": 200,
	"responseHeaders": "…",
	"clientIpAddress": "xxx.xxx.xxx.xxx",
	"responseBody": "{…}",
	"correlationId": "[guid]"
}

This event may capture the request path, query string, HTTP method, headers, response status code, and more. It is specific to the request/response and does not currently indicate any form of account or identity information either. However, the headers and query strings might be capturing customer identity information; we will come back to how to handle that later, since it is a concern for the privacy and security of a customer’s account.

Understanding the pipeline

Conflating these metrics or events with additional information to accommodate identity needs can cause confusion and achieve the opposite of the intention. Authentication typically must execute before any route or individual operation name is available to our instrumentation code. This can lead to cases where authentication is denied and “OperationName” ends up empty or null. There may also be cases where some APIs are anonymous and do not require any authentication, which can result in a large influx of null or default values for any identity dimensions or fields. This is exacerbated when supporting multiple authentication protocols and complex authorization policies. In these cases, adding dimensions or properties may distract or mislead the Dev/Ops individual during troubleshooting.

Consider a very simplified process view of the pipeline for requests into a service: authentication, then authorization, then the API operation itself. If authentication or authorization is denied, the request never reaches the API, so there is no OperationName or OperationResult to record. Similarly, an anonymous API must carry a default or placeholder value for identity to clear up any possible confusion. If we did not provide placeholder values, we could see cases where the metric dimensions cannot helpfully distinguish identity. I mention this because we currently face this challenge: we want account information to be available concisely, but we also want to know if the customer is struggling with problems in our identity implementation.
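
As a minimal sketch (the field names and placeholder strings are illustrative assumptions, not our actual implementation), the instrumentation could resolve the identity-related dimensions with explicit placeholders when authentication is denied or the API is anonymous:

def resolve_identity_dimensions(request):
    # Placeholder values keep the metric dimensions meaningful when the request
    # never reaches an operation (denied) or carries no identity (anonymous).
    if request.get("authResult") == "Denied":
        return {"OperationName": "AuthenticationDenied", "AccountId": "Unknown"}
    if request.get("isAnonymous"):
        return {"OperationName": request.get("operationName", "Unknown"), "AccountId": "Anonymous"}
    return {"OperationName": request["operationName"], "AccountId": request["accountId"]}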

In production we have a case where we support Azure AD bearer token authentication with the addition of Azure role-based access control. One of our customers had created a website, had configured everything related to Azure AD App Registrations, and was able to successfully authenticate to Azure AD. The problem, however, was in the configuration of the access control. When authenticating with Azure AD, the OAuth 2.0 bearer token contains an ID for the user or application. This ID was not configured in the access control for the account, which resulted in authorization denials. The customer reached out on GitHub and I was able to diagnose the issue with a mix of metrics and logging events. This could be a better experience, though, so I’m currently working on how to make this quicker and easier for us to identify, as well as how to provide the necessary self-service help for customers.

Guidance to a solution

To solve these types of problems I’m approaching it from the point of view of identity as a contained responsibility. When a software engineer is developing an application, they care about identity as the necessary work to enable the actual business logic. When customers cannot access their accounts or use the APIs, it is very important to identify the cause separately, with clear error codes and messages.

When a customer receives an authentication denial, it means the customer could not be identified. We can assume there is a request correlationId on the existing request warm path event, where any identifying request information can be found, including authentication headers or query strings.

What we need is a metric count of authentication denials to quickly identify the cause. Adding a simple trace message event on the warm path, correlated on the same correlationId, gives an easy way to troubleshoot authentication for a given request. The metric will be used to aggregate and identify common user errors or potential attacks on the service.
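
A hedged sketch of what that could look like at the point where authentication fails; emit_metric and emit_trace are hypothetical stand-ins for the hot and warm path emitters, and the error code is an example only.

def emit_metric(name, value, dimensions):
    print("metric:", name, value, dimensions)  # placeholder for a metrics SDK

def emit_trace(event):
    print("trace:", event)  # placeholder for the warm path event pipeline

def on_authentication_denied(correlation_id, scheme, error_code):
    # Hot path: an aggregated count used to spot common user errors or attacks.
    emit_metric("AuthenticationDenied", 1, {
        "AuthenticationScheme": scheme,
        "ErrorCode": error_code,
    })
    # Warm path: a small trace event that joins the existing request/response
    # event on the same correlationId for per-request troubleshooting.
    emit_trace({
        "correlationId": correlation_id,
        "message": "Authentication denied",
        "authenticationScheme": scheme,
        "errorCode": error_code,
    })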

A metric count of authentication successes will show the most used authentication scheme(s), such as Azure AD or Shared Key, per AccountId. The success of authentication is now confirmed, and no additional information should be needed.

Finally, we need a metric count of authorization denials for complex authorization rules. If an account has specific policies, such as group membership requirements, SKU-based access (for example, free versus paid tiers), or role-based access control policies, then defining a metric like the one below will be critical for support.

[latencyMs, policy, errorCode, authenticationScheme, accountId]
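
As with the authentication counts, here is a minimal sketch of emitting the success and authorization-denial metrics; the dimensions follow the definition above and emit_metric is again a hypothetical placeholder for a real metrics SDK.

def emit_metric(name, value, dimensions):
    print(name, value, dimensions)  # placeholder for a metrics SDK

def on_authentication_succeeded(scheme, account_id):
    # Confirms which schemes are actually in use per account; no more detail needed.
    emit_metric("AuthenticationSucceeded", 1, {"authenticationScheme": scheme, "accountId": account_id})

def on_authorization_denied(latency_ms, policy, error_code, scheme, account_id):
    # Mirrors [latencyMs, policy, errorCode, authenticationScheme, accountId] above.
    emit_metric("AuthorizationDenied", latency_ms, {
        "policy": policy,
        "errorCode": error_code,
        "authenticationScheme": scheme,
        "accountId": account_id,
    })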

I recommend enabling any necessary auditing by using the existing warm path events. If there is a need to add more detail to authentication, you can extend this guidance at the places where you insert the metrics. With the combination of the new metrics and the existing telemetry, there is a good possibility that the data can be exported to cold storage for any necessary auditing.

On privacy

As someone who works on the fundamentals of an Azure service, I value compliance and hold a higher bar for security and privacy. Request headers and query strings can contain values such as full bearer tokens or symmetric keys. We created a library for header and query string redaction of customer-sent parameters. This is critical today, and it is necessary to maintain and build the trust of our customers. As an example, there was a case at Facebook in which millions of passwords were stored in plain text and exposed to roughly 20,000 employees. There was a series of errors that led to this, but as developers it is our responsibility to care about these issues. Redaction could have provided some basic identity protection that might have stopped this from happening.
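
Our redaction library is internal, but a minimal sketch of the idea, assuming example lists of sensitive header and query parameter names, could look like this:

from urllib.parse import parse_qsl, urlencode

SENSITIVE_HEADERS = {"authorization", "x-api-key"}   # example names only
SENSITIVE_QUERY_PARAMS = {"sig", "token", "key"}     # example names only

def redact_headers(headers):
    # Replace secret header values before the event leaves the process.
    redacted = {}
    for name, value in headers.items():
        redacted[name] = "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
    return redacted

def redact_query_string(query_string):
    # Keep parameter names for troubleshooting, drop the secret values.
    pairs = parse_qsl(query_string.lstrip("?"), keep_blank_values=True)
    redacted = []
    for key, value in pairs:
        redacted.append((key, "[REDACTED]" if key.lower() in SENSITIVE_QUERY_PARAMS else value))
    return "?" + urlencode(redacted)

Called from the point where the warm path event is built, redact_query_string("?query=abc&sig=secret123") would keep the query parameter visible while replacing the sig value with a redaction marker.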

For those concerned about losing the ability to troubleshoot authentication while using redaction, there are solutions available. One, mentioned above, is defining specific error codes to triage the actual issue. Adding a short-lived warm path event in a contained module may be a necessary work item to solve this problem. The most important thing is to not let that event be retained for a long period of time, and there should be proper production processes in place to ensure the least number of people have access to this data.
