In Matthew’s blog the Sub-Document (subdoc) API feature is introduced with a short overview: In summary, subdoc allows efficient access to parts of documents (sub-documents) without requiring the transfer of the entire document over the network.
Throughout this blog, we’ll use a reference document. This document will then be accessed in various ways using the subdoc API. Note that for each subdocument operation, doc_size - op_size bytes of bandwidth are saved, where doc_size is the length of the document, and op_size is the length of the path and sub-document value.
The document below is 500 bytes. Performing a simple get() would consume 500 bytes (plus protocol overhead) on the server response. If you only care about the delivery address, you could issue a lookup_in('customer123', SD.get('addresses.delivery')) call. You would receive only about 120 bytes over the network, a savings of roughly 380 bytes, using about a quarter of the bandwidth of the equivalent full-document (fulldoc) operation.
    {
      "name": "Douglas Reynholm",
      "email": "douglas@reynholmindustries.com",
      "addresses": {
        "billing": {
          "line1": "123 Any Street",
          "line2": "Anytown ",
          "country": "United Kingdom"
        },
        "delivery": {
          "line1": "123 Any Street",
          "line2": "Anytown ",
          "country": "United Kingdom"
        }
      },
      "purchases": {
        "complete": [
          339, 976, 442, 666
        ],
        "abandoned": [
          157, 42, 999
        ]
      }
    }
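The doc_size - op_size estimate can be sanity-checked locally. The following is a rough sketch that measures the compact JSON encoding of the reference document and of its addresses.delivery sub-document. It ignores protocol overhead, and the exact byte counts depend on whitespace, so they will not match the rounded figures quoted above.

```python
import json

doc = {
    "name": "Douglas Reynholm",
    "email": "douglas@reynholmindustries.com",
    "addresses": {
        "billing": {"line1": "123 Any Street", "line2": "Anytown ",
                    "country": "United Kingdom"},
        "delivery": {"line1": "123 Any Street", "line2": "Anytown ",
                     "country": "United Kingdom"},
    },
    "purchases": {"complete": [339, 976, 442, 666],
                  "abandoned": [157, 42, 999]},
}

def encoded_size(value):
    """Byte length of the compact (no extra whitespace) JSON encoding."""
    return len(json.dumps(value, separators=(',', ':')).encode('utf-8'))

path = 'addresses.delivery'
doc_size = encoded_size(doc)
# op_size is the length of the path plus the sub-document value.
op_size = len(path) + encoded_size(doc['addresses']['delivery'])
saved = doc_size - op_size

print('full document: {0} bytes'.format(doc_size))
print('subdoc lookup: {0} bytes'.format(op_size))
print('saved:         {0} bytes'.format(saved))
```

The ratio between the two sizes is the interesting part: the larger the document relative to the path being read, the larger the saving.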
I’ll be demonstrating examples using a development branch of the Python SDK and Couchbase Server 4.5 Developer Preview.
[EDIT: An experimental version of the sub-document API is now available in the latest Python SDK, version 2.0.8, and the examples below have been updated to reflect the latest API]
You can read about the other new features in Couchbase 4.5 in Don Pinto’s blog post.
Subdoc operations
A subdoc operation is a single action for a single path in a document. This may be expressed as GET('addresses.billing') or ARRAY_APPEND('purchases.abandoned', 42). Some operations are lookups (they simply return data without modifying the document) while some are mutations (they modify the contents of the document).
Many of the subdoc operations are smaller scale equivalents of fulldoc operations. It helps to think of a single document as being itself a miniature key-value store. In the Python SDK, operations can be specified via special functions in the couchbase.subdocument module, which I will abbreviate in the rest of this blog as SD. This is done by:
    import couchbase.subdocument as SD
While looking at these operations, note that what is being transmitted over the network is only the arguments passed to the subdoc API itself, rather than the contents of the entire document (as would be with fulldoc). While the document itself may seem small, even a simple
Lookup Operations
Lookup operations query the document for a certain path and return the value at that path. You have a choice of actually retrieving the document path using the GET operation, or simply querying the existence of the path using the EXISTS operation. The latter saves even more bandwidth by not retrieving the contents of the path if they are not needed.
    rv = bucket.lookup_in('customer123', SD.get('addresses.delivery.country'))
    country = rv[0]  # => 'United Kingdom'
    rv = bucket.lookup_in('customer123', SD.exists('purchases.pending[-1]'))
    rv.exists(0)  # (check if path for first command exists): => False
In the second snippet, I also show how to access the last element of an array, using the special [-1] path component.
We can also combine these two operations:
    rv = bucket.lookup_in('customer123',
                          SD.get('addresses.delivery.country'),
                          SD.exists('purchases.pending[-1]'))
    rv[0]         # => 'United Kingdom'
    rv.exists(1)  # => False
    rv[1]         # => SubdocPathNotFoundError
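To make the path syntax more concrete, here is a minimal local sketch of how a dotted path with array indices might be resolved against an already-parsed document. It is a toy model for illustration only, not the server's parser; resolve_path and path_exists are hypothetical helpers, and the model assumes that -1 is the only negative index accepted.

```python
import re

# One path component: either a dictionary key or an [index].
_COMPONENT = re.compile(r'([^.\[\]]+)|\[(-?\d+)\]')

def resolve_path(doc, path):
    """Return the value at `path`, raising KeyError if the path is absent."""
    current = doc
    for key, index in _COMPONENT.findall(path):
        if key:
            if not isinstance(current, dict) or key not in current:
                raise KeyError(path)
            current = current[key]
        else:
            position = int(index)
            # Assumption: only -1 (last element) is allowed as a negative index.
            if (not isinstance(current, list) or not current
                    or position < -1 or position >= len(current)):
                raise KeyError(path)
            current = current[position]
    return current

def path_exists(doc, path):
    """EXISTS-style check: report presence without returning the value."""
    try:
        resolve_path(doc, path)
    except KeyError:
        return False
    return True

customer = {
    'addresses': {'delivery': {'country': 'United Kingdom'}},
    'purchases': {'complete': [339, 976, 442, 666]},
}
country = resolve_path(customer, 'addresses.delivery.country')
last_complete = resolve_path(customer, 'purchases.complete[-1]')
has_pending = path_exists(customer, 'purchases.pending[-1]')
```

With the real API this walk happens on the server, so only the path and the matched value cross the network.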
Mutation Operations
Mutation operations modify one or more paths in the document. These operations can be divided into several groups:
- Dictionary/Object operations: These operations write the value of a JSON dictionary key.
- Array/List operations: These operations add elements to a JSON array/list.
- Generic operations: These operations modify the existing value itself and are container-agnostic.
Mutation operations are all or nothing, meaning that either all the operations within mutate_in are successful, or none of them are.
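One way to picture the all-or-nothing behaviour is to apply every operation to a scratch copy and commit only if none of them fails. The sketch below models this locally with plain dictionaries; apply_all_or_nothing and the two operation functions are hypothetical, and this is not a description of how the server implements it.

```python
import copy

def apply_all_or_nothing(doc, operations):
    """Apply each operation to a copy; commit only if all of them succeed."""
    working = copy.deepcopy(doc)
    for operation in operations:
        operation(working)  # any exception aborts before the commit below
    doc.clear()
    doc.update(working)

def set_fax(d):
    d['fax'] = '775-867-5309'

def set_home_phone(d):
    # Fails with KeyError: 'phone' does not exist in the document.
    d['phone']['home'] = '555-0100'

customer = {'name': 'Douglas Reynholm'}
try:
    apply_all_or_nothing(customer, [set_fax, set_home_phone])
    failed = False
except KeyError:
    failed = True

# The first operation succeeded on the copy, but nothing was committed.
print(failed, customer)
```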
Dictionary operations
The simplest of these operations is UPSERT. Just like the fulldoc-level upsert, this will either modify the value of an existing path or create it if it does not exist:
    bucket.mutate_in('customer123', SD.upsert('fax', '775-867-5309'))
In addition to UPSERT, the INSERT operation will only add the new value to the path if it does not exist.
    bucket.mutate_in('customer123', SD.insert('purchases.complete', [42, True, None]))
    # SubdocPathExistsError
While the above operation will fail, note that anything valid as a full-doc value is also valid as a subdoc value, as long as it can be serialized as JSON. The Python SDK serializes the above value to [42, true, null].
Dictionary values can also be replaced or removed:
    bucket.mutate_in('customer123',
                     SD.remove('addresses.billing'),
                     SD.replace('email', 'doug96@hotmail.com'))
Array Operations
True array append (ARRAY_APPEND) and prepend (ARRAY_PREPEND) operations can also be performed using subdoc. Unlike fulldoc append/prepend operations (which simply concatenate bytes to the existing value), subdoc append and prepend are JSON-aware:
    bucket.mutate_in('customer123', SD.array_append('purchases.complete', 777))
    # purchases.complete is now [339, 976, 442, 666, 777]
    bucket.mutate_in('customer123', SD.array_prepend('purchases.abandoned', 18))
    # purchases.abandoned is now [18, 157, 42, 999]
You can make an array-only document as well, and then perform array_ operations using an empty path:
    bucket.upsert('my_array', [])
    bucket.mutate_in('my_array', SD.array_append('', 'some element'))
    # the document my_array is now ["some element"]
Limited support also exists for treating arrays like unique sets, using the ARRAY_ADDUNIQUE command. This will do a check to determine if the given value exists or not before actually adding the item to the array:
    bucket.mutate_in('customer123', SD.array_addunique('purchases.complete', 95))
    # => Success

    bucket.mutate_in('customer123', SD.array_addunique('purchases.abandoned', 42))
    # => SubdocPathExistsError
Array operations can also be used as the basis for efficient FIFO or LIFO queues. First, create the queue:
    bucket.upsert('my_queue', [])
Adding items to the end:
    bucket.mutate_in('my_queue', SD.array_append('', 'job:953'))
Consuming an item from the beginning:
    rv = bucket.lookup_in('my_queue', SD.get('[0]'))
    job_id = rv[0]
    bucket.mutate_in('my_queue', SD.remove('[0]'), cas=rv.cas)
    run_job(job_id)
The example above performs a GET followed by a REMOVE. The REMOVE is only performed once the application already has the job, and it will only succeed if the document has not since changed (to ensure that the first item in the queue is the one we’ve just removed).
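The consume snippet makes a single attempt. If another consumer changes the queue between the GET and the REMOVE, the CAS check fails and the caller would normally retry. The sketch below models that retry loop against an in-memory stand-in so it can run without a server. ToyQueueDoc, CasMismatch and consume are hypothetical names, not part of the Python SDK; with the real SDK you would catch whatever exception it raises for a stale CAS.

```python
class CasMismatch(Exception):
    """Raised by the toy store when the supplied CAS is stale."""

class ToyQueueDoc(object):
    """In-memory stand-in for one array document with a CAS value."""

    def __init__(self):
        self.items = []
        self.cas = 1

    def append(self, item):
        self.items.append(item)
        self.cas += 1  # every mutation changes the CAS

    def peek_first(self):
        # Like GET('[0]'): returns the head and the current CAS.
        return self.items[0], self.cas

    def remove_first(self, cas):
        # Like REMOVE('[0]') with cas=...: refuses if the document changed.
        if cas != self.cas:
            raise CasMismatch()
        del self.items[0]
        self.cas += 1

def consume(queue, max_attempts=5):
    """Fetch the head, then remove it only if nothing changed in between."""
    for _ in range(max_attempts):
        job_id, cas = queue.peek_first()
        try:
            queue.remove_first(cas)
        except CasMismatch:
            continue  # someone else modified the queue; look again
        return job_id
    raise RuntimeError('gave up after {0} attempts'.format(max_attempts))

queue = ToyQueueDoc()
queue.append('job:953')
queue.append('job:954')
first_job = consume(queue)
```

Because the job is read before it is removed, a dropped connection during the removal cannot lose the job; at worst it is processed twice.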
Counter Operations
Counter operations allow the manipulation of a numeric value inside a document. These operations are logically similar to the counter operation on an entire document.
    rv = bucket.mutate_in('customer123', SD.counter('logins', 1))
    cur_count = rv[0]  # => 1
The COUNTER operation performs simple arithmetic against a numeric value (the value is created if it does not yet exist). COUNTER can also decrement:
    bucket.upsert('player432', {'gold': 1000})
    rv = bucket.mutate_in('player432', SD.counter('gold', -150))
    print('player432 now has {0} gold remaining'.format(rv[0]))
    # => player432 now has 850 gold remaining
Note that the existing value for counter operations must be within range of a 64 bit signed integer.
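For reference, the signed 64-bit range is -2^63 to 2^63 - 1. The following sketch is a local toy model of a counter on a top-level field that enforces this range. toy_counter is a hypothetical helper, and rejecting a result that overflows is an assumption of the model rather than something stated above.

```python
INT64_MIN = -2 ** 63
INT64_MAX = 2 ** 63 - 1

def toy_counter(doc, key, delta):
    """Add `delta` to doc[key], creating the value from zero if missing."""
    current = doc.get(key, 0)
    if isinstance(current, bool) or not isinstance(current, int):
        raise TypeError('existing value is not an integer')
    if not INT64_MIN <= current <= INT64_MAX:
        raise OverflowError('existing value is outside the signed 64-bit range')
    result = current + delta
    # Assumption: the model also refuses results that leave the range.
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError('result is outside the signed 64-bit range')
    doc[key] = result
    return result

player = {'gold': 1000}
remaining = toy_counter(player, 'gold', -150)   # existing value is decremented
logins = toy_counter(player, 'logins', 1)       # missing value starts from zero

too_big = {'score': 2 ** 63}  # one past INT64_MAX
try:
    toy_counter(too_big, 'score', 1)
    rejected = False
except OverflowError:
    rejected = True
```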
Creation of Intermediates
All of the examples above refer to creating a single new field within an existing dictionary. Creating a new hierarchy however will result in an error:
    bucket.mutate_in('customer123',
                     SD.upsert('phone.home', {'num': '775-867-5309', 'ext': 16}))
    # => SubdocPathNotFoundError
Despite the operation being an UPSERT, subdoc will refuse to create missing hierarchies by default. The create_parents option, however, allows it to succeed. At the protocol level the option is called F_MKDIRP, like the -p option of the mkdir command on Unix-like platforms.
    bucket.mutate_in('customer123',
                     SD.upsert('phone.home',
                               {'num': '775-867-5309', 'ext': 16},
                               create_parents=True))
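The mkdir -p analogy can be modeled locally with nested dictionaries. In the sketch below, toy_upsert is a hypothetical helper that handles dotted dictionary paths only (no array indices): the last path component is always created or overwritten, while missing intermediate components are created only when create_parents is set.

```python
def toy_upsert(doc, path, value, create_parents=False):
    """Set `value` at a dotted `path` inside nested dictionaries."""
    *parents, leaf = path.split('.')
    current = doc
    for key in parents:
        if not isinstance(current, dict):
            raise TypeError('path component is not a dictionary')
        if key not in current:
            if not create_parents:
                # Mirrors the default refusal to create missing hierarchies.
                raise KeyError('missing intermediate: {0}'.format(key))
            current[key] = {}
        current = current[key]
    current[leaf] = value

customer = {'name': 'Douglas Reynholm'}
phone = {'num': '775-867-5309', 'ext': 16}

try:
    toy_upsert(customer, 'phone.home', phone)
    refused = False
except KeyError:
    refused = True  # 'phone' does not exist yet

toy_upsert(customer, 'phone.home', phone, create_parents=True)
```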
Subdocument and CAS
Subdoc mostly eliminates the need for tracking CAS. Subdoc operations are atomic and therefore if two different threads access two different sub-documents then no conflict will arise. For example the following two blocks can execute concurrently without any risk of conflict:
    bucket.mutate_in('customer123', SD.array_append('purchases.complete', 999))

    bucket.mutate_in('customer123', SD.array_append('purchases.abandoned', 998))
Even when modifying the same part of the document, operations will not necessarily conflict. For example, two concurrent ARRAY_PREPEND operations on the same array will both succeed, with neither overwriting the other.
This does not mean that CAS is no longer required. Sometimes it’s important to ensure the entire document didn’t change state since the last operation. This is especially important in the case of REMOVE operations, to ensure that the element being removed was not already replaced by something else.
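The difference from full-document updates can be illustrated with a small deterministic simulation (no threads, no server). It assumes two clients that each read the whole document, change a different array, and write the whole document back without CAS, and compares that with each client sending only its own append. All names here are made up for the illustration.

```python
import copy

server_doc = {'purchases': {'complete': [339], 'abandoned': [157]}}

# Full-document style without CAS: both clients read the same version,
# modify their private copy, then write the entire document back.
client_a = copy.deepcopy(server_doc)
client_b = copy.deepcopy(server_doc)
client_a['purchases']['complete'].append(999)
client_b['purchases']['abandoned'].append(998)
fulldoc_result = client_a  # A writes first...
fulldoc_result = client_b  # ...then B overwrites A's whole document

# Subdoc style: each client sends only its operation, and the operations
# are applied one after another to the current server-side document.
subdoc_result = copy.deepcopy(server_doc)
for array_name, value in [('complete', 999), ('abandoned', 998)]:
    subdoc_result['purchases'][array_name].append(value)

print('fulldoc:', fulldoc_result)  # client A's append has been lost
print('subdoc: ', subdoc_result)   # both appends are present
```

In the full-document case, CAS is what would normally detect the second write as stale; in the subdoc case there is nothing to detect because neither operation carries the other array.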
FAQ about Sub-Document Operations in Couchbase
Over the course of developing subdoc, I’ve been asked several questions about what it does, and I’ll respond in turn:
What’s the difference between Subdoc and N1QL?
N1QL is a rich, expressive query language which allows you to search for and possibly mutate multiple documents at once. Subdoc is a high performance API/implementation designed for searching within a single document.
Subdoc is a high performance set of simple, discrete APIs for accessing data within a single document, with a goal of reducing network bandwidth and increasing overall throughput. It is implemented as part of the KV service and is therefore strongly consistent with it.
N1QL is a rich query language capable of searching multiple documents within Couchbase which adhere to certain criteria. It operates outside the KV service, making optimized KV and index requests to satisfy incoming queries. Consistency with the KV service is configurable per query (for example, the USE KEYS clause and the scan_consistency option).
When should I use N1QL and when should I use subdoc?
N1QL answers questions such as "Find me all documents where X=42 and Y=77" whereas subdoc answers questions such as "Fetch X and Y from document Z". More specifically, subdoc should be used when all the Document IDs are known (in other words, if a N1QL query contains USE KEYS it may be a candidate for subdoc).
The two are not mutually exclusive however, and it is possible to use both N1QL and subdoc in an application.
Are mutate_in and lookup_in atomic?
Yes, they are atomic. Both these operations are guaranteed to have all their sub-commands (e.g. COUNTER, GET, EXISTS, ARRAY_ADDUNIQUE) operate on the same version of the document.
How do I access multiple documents with subdoc?
There is no bona fide multi operation for subdoc, as subdoc operates within the scope of a single document. Because documents are sharded across the cluster (this is common to Couchbase and all other NoSQL stores), multi operations would not be able to guarantee the same level of transactions and atomicity between documents.
I don’t like the naming convention for arrays. Why didn’t you use append, add, etc.?
There are many languages out there and it seems all of them have a different idea of what to call array access functions:
- Generic: add to end, add to front
- C++: push_back(), push_front()
- Python: append(), insert(0), extend
- Perl, Ruby, Javascript, PHP: push(), unshift()
- Java, C#: add()
The term append is already used in Couchbase to refer to the full-document byte concatenation, so I considered it inconsistent to use this term in yet a different manner in subdoc.
Why does COUNTER require 64 bit signed integers?
This is a result of the subdoc code being implemented in C++. Future implementations may allow a broader range of existing numeric values (for example, large values, non-integral values, etc.).
How do I perform a pop? Why is there no POP operation?
POP refers to the act of removing an item (e.g. from an array) and returning it, in a single operation.
POP may indeed be implemented in the future, but using it is inherently dangerous:
Because the operation is being done over the network, it is possible for the server to have executed the removal of the item but have the network connection terminated before the client receives the previous value. Because the value is no longer in the document, it is permanently lost.
Can I use CAS with subdoc operations?
Yes, with respect to CAS usage, subdoc operations are normal KV API operations, similar to upsert, get, etc.
Can I use durability requirements with subdoc operations?
Yes, with respect to durability requirements, mutate_in is treated like upsert, insert and replace.