We have previously discussed why system observability is important in distributed systems and the challenges that it can help to overcome – if you missed it, you can see Matt Ingenthron’s introduction to the problem here. Next, we will discuss why the Couchbase SDK team selected OpenTracing as their tracing API and why we think it’s a big deal. We’ll also cover a brief introduction to SDKs ThresholdLoggingTracer with the .NET SDK (In OpenTracing, .NET is used as the vehicle to demonstrate the tracer, similar functionality will be available in each of the SDKs).
Why OpenTracing?
OpenTracing is a standardised API that describes how to structure tracing information within a distributed system environment and how a requests can cross process boundaries. OpenTracing allows us to separate the data that is being collected from the mechanism that collects and does something with that data. Another major plus for us is that OpenTracing is vendor agnostic, which means we can provide maximum flexibility to use a custom or third party distributed tracing implementation without having to adhere to a prescribed network protocol or in-transit data structure.
At Couchbase, observability and OpenTracing integration are important to us. We know we’re part of a larger customer solution and want to help inform the view of what is going on with elements we know about. For the Couchbase SDKs we defined a collection of steps that describe the journey a request (Get, Upsert, etc) takes from being submitted to when a response is ready. In the OpenTracing world, these steps are called **Span**s and have timing information (start / end). Spans can also be ‘tagged’ with additional information, eg IDs, endpoint names, network details, etc. All of this information is useful to help take a snapshot of what was going on at the time.
Each of the SDKs define the following Spans for a request:
Step | Description |
---|---|
request encoding | Transcode the document content from a native object into a JSON string |
dispatch to server | Send the request over the network and wait for a response |
response decoding | Transcode the response document content from a JSON string into a native object |
NOTE: request encoding and response decoding are only required for some operations. eg Get would omit ‘request encoding’.*
In addition to all of the SDK specific work, Couchbase server has been enhanced to return the operation duration as part of KV responses to match other services such as N1QL, FTS and Analytics. This is encoded directly into the response packet and will be parsed using the SDK during processing. This helps to determine if an unusually long time was spent servicing an operation and can be used to help identify the cause of a slow operation (eg read from disc).
Next, the OpenTracing API describes a Tracer as the thing that creates and collects Spans and does something with them. The Couchbase SDKs provide a Tracer implementation out of the box called the ThresholdLoggingTracer (see below). This is a modest, out-of-the-box Tracer that aggregates requests that exceed a service specific threshold and then writes them to a log at a given interval, including a total number and a small sample of the worst offenders. Of course, the number and thresholds are tune-able, since there is such a wide variety of environments Couchbase is deployed to.
A key point to mention here is that the Couchbase SDKs intentionally allow the Tracer implementation to be replaced. This gives us the benefit of allowing the Tracer implementation to be swapped out (custom or Product) without requiring any internal logic to be modified.
More detail on the OpenTracing specification is here.
ThresholdLoggingTracer using C# .NET
Okay, lots of theoretical talking so far, let’s show it in action. The following is a simple example of the ThresholdLoggingTracer in action in the .NET SDK.
We have tried to select sensible defaults for the ThresholdLoggingTracer to balance capturing valuable details with not spamming log files. Also, starting with the next minor release for each SDK, this will all be enabled by default. That means you get it all for free, without doing anything extra!
An example of the output from the log looks like this (note this format may change between DP, beta and GA– feedback wanted!):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 |
[{ "service": "kv", "count": 8, "top": [ { "operaion_name": "get", "last_local_address": "10.211.55.3:52450", "last_remote_address": "10.112.180.101:11210", "total_duration_us": 18908, "encode_us": 256, "dispatch_us": 825, "decode_us": 16, "server_us": 11 }, { "operaion_name": "set", "last_local_address": "10.211.55.3:52450", "last_remote_address": "10.112.180.101:11210", "total_duration_us": 11468, "encode_us": 3832, "dispatch_us": 565, "decode_us": 15, "server_us": 15 }, ... ] }, { "service": "n1ql", "count": 7, "top": [ { "operation_id": "4bf5807b-1538-4420-a64d-a7829a2acf98", "statement": "select default.* from `default` limit 1;", "last_remote_address": "10.112.180.101:8093", "total_duration_us": 13674, "encode_us": 2311, "dispatch_us": 6883, "decode_us": 4551, "server_us": 4325 }, ... ] }] |
From the above example you can see we group the operations by type, eg “kv”, and give a total number of operations that exceeded the service defined threshold. There is also a sample of the slowest operations that provide additional details. This information could be very valuable when trying to diagnose a problem. The samples also include the “server_us” property which indicates how long the server was processing that operation for.
CNCF
Cloud Native Computing Foundation (CNCF) is a community of open source software projects built with containers and forming part of a microservices ecosystem from the start. It’s a valuable resource that enables projects to collaborate with talented engineers and gain visibility while they mature. This is extremely important to small projects to ensure they are supported and give them a springboard into the wild.
OpenTracing has been accepted into this program and is an incubating project. This will only help to improve its visibility and reach, which can only be good for both them and us as the project matures grows.
Of course, since it is still incubating and the OpenTracing standard has not yet reached 1.0, this means the interface we’ll offer from the Couchbase SDKs will track the standard as it evolves. That means our interface will be marked as uncommitted at the moment, and it may change even across micro releases. We will aim to upgrade it to committed when OpenTracing reaches 1.0.
Next Steps
Now we’ve explored the reasons why Couchbase selected OpenTracing as our tracing API and what it has enabled us to provide to our users, we can discuss how to use the out of the box tracer and how to integrate with other tracer solutions. Michael Nitschinger is going to do a deeper dive into Response Time Observability with the Couchbase Java SDK and then show how to integrate with the Jaeger tracing system.
Thanks Mike for this article.
You mention: The samples also includes the “server_us” property which indicates how long the server was processing that operation for.
I also see server_duration_us property, can you help understand what’s the difference between the 2?
This looks to be an error in the JSON which is now been fixed. There is just a server_us property that represents the amount of time taken to process the request on the server.
Thanks Mike.
Is similar tracing capability built in the C SDK as well and if yes, would it print similar traces/logs?