How (and Why) to Use Couchbase as an XML Database

Couchbase and XML – no problem!

I’ve heard it said dozens of times: “Hey, Couchbase is great but I use XML.”

I recognize that countering with, “Couchbase can be your XML database” is pretty bold, especially for a JSON-oriented document database. Some folks in the Couchbase community might even do a double take, but I hope you see what I mean by the end of this post.

It’s a fact of life that many legacy applications still rely on XML, but is Couchbase really the right solution for storing and processing your XML data? Don’t worry: with respect to Couchbase and XML you can have your cake and eat it too.

After introducing a few concepts, this article shows you how to automatically and instantly convert XML into JSON equivalents within Couchbase with almost zero effort – all via an Eventing function. If you just think of JSON as the intermediate computer format that’s used to store your XML, you might already see where I’m headed.

When you transform your XML data into a native JSON representation, you take advantage of the rich ecosystem of Couchbase’s services including N1QL, Indexing, Full-Text Search, Eventing, and Analytics.

Couchbase is a purpose-built database you assemble to your needs. Just need a fast cache? You only need the Data Service or key-value store. Want SQL-like access? Add Query and Index services. Or maybe you need Analytics or Full-Text Search? Just add some nodes of the proper type.

Each of the above services independently scale so you end up building – and more importantly, paying for – only what you need. In the article below, I use Eventing to transform Couchbase into an XML-capable database. I know this sounds too good to be true, but trust me it really does work.

If you are familiar with Couchbase, XML, and JSON please feel free to skip ahead to the Prerequisites section.

The Couchbase Data Model

Couchbase Server is an open source, distributed data platform. It stores data as items, each of which has a key and a value. Sub-millisecond data operations are provided by powerful services for querying and indexing, as well as a feature-rich, document-oriented query language, N1QL. Multiple instances of Couchbase Server can be combined into a single cluster.

Keys

Each value (binary or JSON) is identified by a unique key, defined by the user or application when the item is saved. The key is immutable: once the item is saved, the key cannot be changed. Note that Couchbase also refers to an item’s key as its id.

Each key:

- Must be a UTF-8 string with no spaces. Special characters, such as (, %, /, " and _ are acceptable.
- May be no longer than 250 bytes.
- Must be unique within its bucket.

Values

The maximum size of a value is 20 MiB. A value can be either:

- Binary: Any form of binary is acceptable. Note that a binary value cannot be parsed, indexed or queried. It can only be retrieved by key.
- JSON: A JSON value, referred to as a document, can be parsed, indexed and queried. Each document consists of one or more attributes, each of which has its own value. An attribute’s value can be a basic type – such as a number, string or Boolean – or a complex, such as an embedded document or an array.

JSON Is the Database

Let’s talk more about JSON. You can parse, index, query and manipulate JSON documents. In fact, Couchbase introduced the N1QL query language (pronounced “nickel”) to meet the query needs of distributed document-oriented databases. N1QL is used for manipulating the JSON data in Couchbase, just like SQL manipulates data in a relational database (RDBMS). It has SELECT, INSERT, UPDATE, DELETE and MERGE statements to operate on JSON data.

You could store XML as a string or just a binary blob, but where’s the fun – or the utility – in that? Why not automatically convert your XML into JSON instead? I’m so glad you asked.

XML: Extensible Markup Language

XML is a markup language introduced in 1996 that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is a World Wide Web Consortium (W3C) recommendation.

The design goals of XML emphasize simplicity, generality and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services.

Here’s an example of XML:

<CD>
  <TITLE>EmpireBurlesque</TITLE>
  <ARTIST>BobDylan</ARTIST>
  <COUNTRY>USA</COUNTRY>
  <COMPANY>Columbia</COMPANY>
  <PRICE>10.90</PRICE>
  <YEAR>1985</YEAR>
</CD>

<CD>

<TITLE>EmpireBurlesque</TITLE>

<ARTIST>BobDylan</ARTIST>

<COMPANY>Columbia</COMPANY>

</CD>

JSON: JavaScript Object Notation

Created circa 2001, JSON is a fairly lightweight open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays (or other serializable values).

JSON is a very common data format, with a diverse range of applications, one example being web applications that communicate with a server.

Here’s a sample of JSON:

{
  "email": "testme@example.org",
  "friends": [
    {"name": "rick"},
    {"name": "cate"}
  ]
}

{

"email": "testme@example.org",

"friends": [

{"name": "rick"},

{"name": "cate"}

]

}

JSON Has Taken over the World

Today, when any two applications communicate with each other across the internet, odds are they do so using JSON, especially if they communicate in human-readable text.

If you’re an XML fan, don’t shoot the messenger here. I’m sure XML will never go away given HTML as a markup language, but I like to think statistics don’t lie when it comes to developing internet applications that need to communicate with one another.

Prerequisites: Learning about Eventing

In this article we will be using the latest version of Couchbase – version 6.6.2 – however, it should work just fine in prior versions as well.

If you are not familiar with Couchbase or the Eventing service please walk through the following resources, including at least one Eventing example:

- Setup a working Couchbase 6.6.2 server as per the directions under “Start Here!”
- Understand how to deploy a basic Eventing function as per the directions in the Data Enrichment example. Look at “Case 2” where we will only use the “source” bucket.
- Make sure you have a source bucket of at least 100MB in the Buckets view of the UI.
- Make sure you a bucket called metadata of at least 100MB in the Buckets view of the UI.
- See the documentation for detailed steps on how to create a bucket.

Inserting XML into Couchbase

Now you should have an application that loads data, or you may have used cbimport to load your XML data into your Couchbase cluster.

In the code below, I load my first document with a key xml::1 and a body consisting of a type and the id used in the key (these values are optional but useful) and finally a property in_xml which contains the XML string itself.

{
  "type": "xml",
  "id": 1,
  "in_xml": "<CD><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>"
}

{

"type": "xml",

"id": 1,

"in_xml": "<CD><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>"

}

We can’t do too much with the above XML string. Sure, we could index the property in_xml as a string, but we won’t get a lot of performance out of doing this unless we use a prefix search, which in my humble opinion, is most likely worthless.

We might pass it to the Full-Text Search (FTS) product. In this case, you would get more utility, but you’d have to index the entire XML payload (not just something important).

What we really want is the JSON representation of the XML string property in_xml, perhaps something like:

"out_json": {
  "CD": {
    "TITLE": "EmpireBurlesque",
    "ARTIST": "BobDylan",
    "COUNTRY": "USA",
    "COMPANY": "Columbia",
    "PRICE": "10.90",
    "YEAR": "1985"
  }
}

"out_json": {

"CD": {

"TITLE": "EmpireBurlesque",

"ARTIST": "BobDylan",

"COUNTRY": "USA",

"COMPANY": "Columbia",

"PRICE": "10.90",

"YEAR": "1985"

}

Using the new property out_json, we now see individual JSON properties in a proper structure: something that N1QL, Indexing, Full-Text Search (FTS), Eventing and Analytics all natively and efficiently work with.

With these preliminary steps behind us, you might think to just parse your data and transform it to JSON before putting it in Couchbase. But the downside is that we now have more baggage to carry around and more infrastructure (outside of Couchbase) to deploy as we scale out by adding more nodes.

If we can avoid it, we don’t want to write a custom parser. That isn’t fun, and it’s hard to maintain as your requirements and data formats change over time.

What you ideally want is an in-server solution that generically converts XML to JSON in real time with no impact on data loaders.

Converting XML to JSON on the Fly

Enter Couchbase’s Eventing Service.

Just think of Eventing as a “post trigger” on a document that responds to every change (or mutation) in your data. The main idea is that whenever you insert or update your document (refer to xml::1 above) a small JavaScript fragment, or lambda, is executed against the document and encapsulates the needed business logic to manipulate the document to ensure that the document is in the format you want.

In this case, we want to convert a document like the one below:

{
  "type": "xml",
  "id": 1,
  "in_xml": "<CD><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>"
}

{

"type": "xml",

"id": 1,

"in_xml": "<CD><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>"

}

In real time, we want to convert the above document into an enriched document like the one below:

{
  "type": "xml",
  "id": 2,
  "in_xml": "<CD><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>",
  "out_json": {
    "CD": {
      "TITLE": "EmpireBurlesque",
      "ARTIST": "BobDylan",
      "COUNTRY": "USA",
      "COMPANY": "Columbia",
      "PRICE": "10.90",
      "YEAR": "1985"
    }
  }
}

{

"type": "xml",

"id": 2,

"in_xml": "<CD><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>",

"out_json": {

"CD": {

"TITLE": "EmpireBurlesque",

"ARTIST": "BobDylan",

"COUNTRY": "USA",

"COMPANY": "Columbia",

"PRICE": "10.90",

"YEAR": "1985"

}

The enriched document shown above is searchable either via key-value or via an index on any given field using N1QL.

As before, we see individual JSON properties in a proper structure but under the out_json property, which enables the efficient use of Full-Text Search (FTS), Analytics, and even other Eventing functions.

Eventing Function: convertXMLtoJSON

Eventing allows you to write pure business logic and the Eventing service takes care of all the infrastructure needed to manage and scale your function (horizontally and vertically) across multiple nodes in a performant and reliable fashion.

All Eventing functions have two entry points OnUpdate(doc,meta) and OnDelete(meta,options). Note that we’re not worried about the latter entry point in this example.

When a document changes or mutates (insert, upsert, replace, etc.) a copy of the document and some metadata about the document will be passed to a small JavaScript entry point OnUpdate(doc,meta):

function OnUpdate(doc, meta) {
  // filter out non XML
  if (!meta.id.startsWith("xml:")) return;
  // The KEY started with "xml" try to process it
  // ===========================================================
  // *** Do other work required here on non .in_xml changes ***
  // ===========================================================
  var jsonDoc = parseXmlToJson(doc.in_xml);
  log(meta.id, "1. INPUT xml doc.in_xml :", doc.in_xml);
  log(meta.id, "2. OUTPUT doc.out_json :", jsonDoc);
  doc.out_json = jsonDoc;
  // ===========================================================
  // enrich the source bucket with .out_json
  src_bkt[meta.id] = doc;
}

function OnUpdate(doc, meta) {

// filter out non XML

if (!meta.id.startsWith("xml:")) return;

// The KEY started with "xml" try to process it

// ===========================================================

// *** Do other work required here on non .in_xml changes ***

// ===========================================================

var jsonDoc = parseXmlToJson(doc.in_xml);

log(meta.id, "1. INPUT xml doc.in_xml :", doc.in_xml);

log(meta.id, "2. OUTPUT doc.out_json :", jsonDoc);

doc.out_json = jsonDoc;

// ===========================================================

// enrich the source bucket with .out_json

src_bkt[meta.id] = doc;

}

The OnUpdate(doc,meta) logic above performs four steps on any mutation.

- First, the prefix of the key is checked, if it doesn’t start with xml:, we don’t do anything else. Note that meta.id is the key of the doc.
- Second, we just call a function parseXmlToJson(doc.in_xml) where we just pass the XML string into the function.
- Third, what comes back is added as a new field to the copy of the document as the property out_json.
- Fourth, we update the document in real time with the JSON representation.

The Core XML-to-JSON Conversion Logic

Here’s the core logic for the XML-to-JSON conversion:

// 6.6.0 version no String.matchAll need our own MatchAll function
function* MatchAll(str, regExp) {
  if (!regExp.global) {
    throw new TypeError('Flag /g must be set!');
  }
  const localCopy = new RegExp(regExp, regExp.flags);
  let match;
  while (match = localCopy.exec(str)) {
    yield match;
  }
}

// A simple XML to JSON parser
function parseXmlToJson(xml) {
    const json = {};
    for (const res of MatchAll(xml,/(?:<(\w*)(?:\s[^>]*)*>)((?:(?!<\1).)*)(?:<\/\1>)|<(\w*)(?:\s*)*\/>/gm)) {
        const key = res[1] || res[3];
        const value = res[2] && parseXmlToJson(res[2]);
        json[key] = ((value && Object.keys(value).length) ? value : res[2]) || null;
    }
    return json;
}

// 6.6.0 version no String.matchAll need our own MatchAll function

function* MatchAll(str, regExp) {

if (!regExp.global) {

throw new TypeError('Flag /g must be set!');

}

const localCopy = new RegExp(regExp, regExp.flags);

let match;

while (match = localCopy.exec(str)) {

yield match;

}

// A simple XML to JSON parser

function parseXmlToJson(xml) {

const json = {};

for (const res of MatchAll(xml,/(?:<(\w*)(?:\s[^>]*)*>)((?:(?!<\1).)*)(?:<\/\1>)|<(\w*)(?:\s*)*\/>/gm)) {

const key = res[1] || res[3];

const value = res[2] && parseXmlToJson(res[2]);

json[key] = ((value && Object.keys(value).length) ? value : res[2]) || null;

}

return json;

}

Thanks to user MasterPiece on Stack Overflow for sharing the method parseXmlToJson(xml) with the world. The method uses a cool regular expression on the XML string and returns a first-class JSON object. (Yes, it works and is both elegant and compact.)

Note that in Couchbase 7.0 the method MatchAll(str, regExp) in our JavaScript code above is not needed because Couchbase now has a more current v8 runner that includes the String.prototype.MatchAll() function.

Let’s Optimize the Eventing Function

What we have so far works just fine but there’s a problem.

Suppose your data has 100,000 mutations a second all with a key prefix of xml:. Even if the in_xml property never changes, our current Eventing function will a) perform 100,000 XML-to-JSON conversions, and b) make 100,000 writes to the data service (or key-value store). In addition, your XML is most likely much more complex that our sample in_xml string, and it might have deep nesting and a thousand fields.

Some thoughts we need to consider:

- Converting a large, nested XML document via a recursive routine takes compute resources. The method parseXmlToJson(xml) will be executed on every mutation even if the XML didn’t change.
- Writing to databases involves I/O, and we want to save cycles or writes for more important work. With the current Eventing function, we will perform a write or update the document into the key-value store on every mutation even if the XML didn’t change.

To fix the above issues, let’s add some logic to only update the document property out_xml if the in_xml data changes or if we don’t already have an out_xml property.

To implement this optimization, we will utilize a checksum via Eventing’s fast crc64() method on the in_xml property and store it in our document as a property: xmlchksum.

We can then use this field to minimize the extra work of needlessly converting XML to JSON or needlessly writing an unchanged document to the data service.

Here’s the optimization:

function OnUpdate(doc, meta) {
    // filter out non XML
    if (!meta.id.startsWith("xml:")) return;
    // The KEY started with "xml" try to process it
    // ===========================================================
    // *** Do other work required here on non .in_xml changes ***
    // ===========================================================
    // let's see if we need to re-create our json representation.
    var xmlchksum = crc64(doc.in_xml);
    // ===========================================================
    // Don't reprocess if the doc.in_xml has not changed this could be
    // a big performance win if the doc has other fields that mutate.
    // We do this via a checksum of the .in_xml property.
    if (doc.xmlchksum && doc.xmlchksum === xmlchksum) return;
    // Either this is the first pass, or the .in_xml property changed.
    var jsonDoc = parseXmlToJson(doc.in_xml);
    log(meta.id,"1. INPUT xml doc.in_xml :", doc.in_xml);
    log(meta.id,"2. CHECKSUM doc.in_xml :", xmlchksum);
    log(meta.id,"3. OUTPUT doc.out_json :", jsonDoc);
    doc.out_json = jsonDoc;
    doc.xmlchksum = xmlchksum;
    // ===========================================================
    // enrich the source bucket with .out_json and .xmlchksum
    src_bkt[meta.id] = doc;
}

function OnUpdate(doc, meta) {

// filter out non XML

if (!meta.id.startsWith("xml:")) return;

// The KEY started with "xml" try to process it

// ===========================================================

// *** Do other work required here on non .in_xml changes ***

// ===========================================================

// let's see if we need to re-create our json representation.

var xmlchksum = crc64(doc.in_xml);

// ===========================================================

// Don't reprocess if the doc.in_xml has not changed this could be

// a big performance win if the doc has other fields that mutate.

// We do this via a checksum of the .in_xml property.

if (doc.xmlchksum && doc.xmlchksum === xmlchksum) return;

// Either this is the first pass, or the .in_xml property changed.

var jsonDoc = parseXmlToJson(doc.in_xml);

log(meta.id,"1. INPUT xml doc.in_xml :", doc.in_xml);

log(meta.id,"2. CHECKSUM doc.in_xml :", xmlchksum);

log(meta.id,"3. OUTPUT doc.out_json :", jsonDoc);

doc.out_json = jsonDoc;

doc.xmlchksum = xmlchksum;

// ===========================================================

// enrich the source bucket with .out_json and .xmlchksum

src_bkt[meta.id] = doc;

}

Here’s what we added to the original OnUpdate(doc,meta) entry point.

- First, we calculate an in-memory checksum on the XML string accessed from doc.in_xml.
- Second, we compare this to a stored checksum: doc.xmlchksum (if it exists).
- Third, if the stored checksum is missing or differs, we convert the XML to JSON and store both the new checksum xmlchksum and the out_json properties in the document by writing it back to the Data Service.

The Final, Complete Eventing Function

Here’s our complete Eventing function for converting XML to JSON with all of the aforementioned optimizations:

function OnUpdate(doc, meta) {
    // filter out non XML
    if (!meta.id.startsWith("xml:")) return;
    // The KEY started with "xml" try to process it
    // ===========================================================
    // *** Do other work required here on non .in_xml changes ***
    // ===========================================================
    // let's see if we need to re-create our json representation.
    var xmlchksum = crc64(doc.in_xml);
    // ===========================================================
    // Don't reprocess if the doc.in_xml has not changed this could be
    // a big performance win if the doc has other fields that mutate.
    // We do this via a checksum of the .in_xml property.
    if (doc.xmlchksum && doc.xmlchksum === xmlchksum) return;
    // Either this is the first pass, or the .in_xml property changed.
    var jsonDoc = parseXmlToJson(doc.in_xml);
    log(meta.id,"1. INPUT xml doc.in_xml :", doc.in_xml);
    log(meta.id,"2. CHECKSUM doc.in_xml :", xmlchksum);
    log(meta.id,"3. OUTPUT doc.out_json :", jsonDoc);
    doc.out_json = jsonDoc;
    doc.xmlchksum = xmlchksum;
    // ===========================================================
    // enrich the source bucket with .out_json and .xmlchksum
    src_bkt[meta.id] = doc;
}

// 6.6.0 version no String.matchAll need our own MatchAll function
function* MatchAll(str, regExp) {
  if (!regExp.global) {
    throw new TypeError('Flag /g must be set!');
  }
  const localCopy = new RegExp(regExp, regExp.flags);
  let match;
  while (match = localCopy.exec(str)) {
    yield match;
  }
}

// A simple XML to JSON parser
function parseXmlToJson(xml) {
    const json = {};
    for (const res of MatchAll(xml,/(?:<(\w*)(?:\s[^>]*)*>)((?:(?!<\1).)*)(?:<\/\1>)|<(\w*)(?:\s*)*\/>/gm)) {
        const key = res[1] || res[3];
        const value = res[2] && parseXmlToJson(res[2]);
        json[key] = ((value && Object.keys(value).length) ? value : res[2]) || null;
    }
    return json;
}

function OnUpdate(doc, meta) {

// filter out non XML

if (!meta.id.startsWith("xml:")) return;

// The KEY started with "xml" try to process it

// ===========================================================

// *** Do other work required here on non .in_xml changes ***

// ===========================================================

// let's see if we need to re-create our json representation.

var xmlchksum = crc64(doc.in_xml);

// ===========================================================

// Don't reprocess if the doc.in_xml has not changed this could be

// a big performance win if the doc has other fields that mutate.

// We do this via a checksum of the .in_xml property.

if (doc.xmlchksum && doc.xmlchksum === xmlchksum) return;

// Either this is the first pass, or the .in_xml property changed.

var jsonDoc = parseXmlToJson(doc.in_xml);

log(meta.id,"1. INPUT xml doc.in_xml :", doc.in_xml);

log(meta.id,"2. CHECKSUM doc.in_xml :", xmlchksum);

log(meta.id,"3. OUTPUT doc.out_json :", jsonDoc);

doc.out_json = jsonDoc;

doc.xmlchksum = xmlchksum;

// ===========================================================

// enrich the source bucket with .out_json and .xmlchksum

src_bkt[meta.id] = doc;

}

// 6.6.0 version no String.matchAll need our own MatchAll function

function* MatchAll(str, regExp) {

if (!regExp.global) {

throw new TypeError('Flag /g must be set!');

}

const localCopy = new RegExp(regExp, regExp.flags);

let match;

while (match = localCopy.exec(str)) {

yield match;

}

// A simple XML to JSON parser

function parseXmlToJson(xml) {

const json = {};

for (const res of MatchAll(xml,/(?:<(\w*)(?:\s[^>]*)*>)((?:(?!<\1).)*)(?:<\/\1>)|<(\w*)(?:\s*)*\/>/gm)) {

const key = res[1] || res[3];

const value = res[2] && parseXmlToJson(res[2]);

json[key] = ((value && Object.keys(value).length) ? value : res[2]) || null;

}

return json;

}

Deploying the Eventing Function

Now it’s time to deploy the Eventing function. We’ve reviewed a bit of the code and the design of the XML-to-JSON translator, and now it’s time to see everything working together.

At this point, we have a function in JavaScript so we need to add it into your Couchbase cluster and deploy it into an active state.

This example requires two buckets: source (i.e., your document store) and metadata (i.e., a scratchpad for Eventing that can be shared with other Eventing functions). The source bucket should have a size of at least 100MB (if you want to test with 10M+ documents, you might want to make the source bucket bigger). The metadata bucket should have the minimum size of 100MB. Both these buckets need to already exist as per the Prerequisites above.

- Verify your current bucket configuration by accessing the Couchbase Web Console > Buckets page:

Manually Add The Function convertXMLtoJSON

To add the first Eventing function from the Couchbase Web Console > Eventing page, click ADD FUNCTION, to add a new function. The ADD FUNCTION dialogue then appears.

In the ADD FUNCTION dialogue, provide the below information for individual function elements:

For the Source Bucket dropdown, set to source.
For the Metadata Bucket dropdown, set to metadata.
Make sure convertXMLtoJSON is the name of the function you are creating in the Function Name text box.
[Optional Step] Enter text Generic XML conversion, in the Description text box.
For the Settings option, use the default values.

For the Bindings option, create two bindings:

For the binding, the bucket alias, specifies src_bkt as the alias name of the bucket, and select source as the associated bucket, and the mode should be read and write.
After configuring your settings, your dialogue should look like this:
After providing all the required information in the ADD FUNCTION dialogue, click Next: Add Code. The cron_impl_2func_651 dialogue appears. The convertXMLtoJSON dialogue initially contains a placeholder code block. You will need to substitute the JavaScript code we developed into this block.

Copy the following Eventing function JavaScript source (48 lines) and paste it in the placeholder code block of convertXMLtoJSON:

function OnUpdate(doc, meta) {
    // filter out non XML
    if (!meta.id.startsWith("xml:")) return;
    // The KEY started with "xml" try to process it
    // ===========================================================
    // *** Do other work required here on non .in_xml changes ***
    // ===========================================================
    // let's see if we need to re-create our json representation.
    var xmlchksum = crc64(doc.in_xml);
    // ===========================================================
    // Don't reprocess if the doc.in_xml has not changed this could be
    // a big performance win if the doc has other fields that mutate.
    // We do this via a checksum of the .in_xml property.
    if (doc.xmlchksum && doc.xmlchksum === xmlchksum) return;
    // Either this is the first pass, or the .in_xml property changed.
    var jsonDoc = parseXmlToJson(doc.in_xml);
    log(meta.id,"1. INPUT xml doc.in_xml :", doc.in_xml);
    log(meta.id,"2. CHECKSUM doc.in_xml :", xmlchksum);
    log(meta.id,"3. OUTPUT doc.out_json :", jsonDoc);
    doc.out_json = jsonDoc;
    doc.xmlchksum = xmlchksum;
    // ===========================================================
    // enrich the source bucket with .out_json and .xmlchksum
    src_bkt[meta.id] = doc;
}

// 6.6.0 version no String.matchAll need our own MatchAll function
function* MatchAll(str, regExp) {
  if (!regExp.global) {
    throw new TypeError('Flag /g must be set!');
  }
  const localCopy = new RegExp(regExp, regExp.flags);
  let match;
  while (match = localCopy.exec(str)) {
    yield match;
  }
}

// A simple XML to JSON parser
function parseXmlToJson(xml) {
    const json = {};
    for (const res of MatchAll(xml,/(?:<(\w*)(?:\s[^>]*)*>)((?:(?!<\1).)*)(?:<\/\1>)|<(\w*)(?:\s*)*\/>/gm)) {
        const key = res[1] || res[3];
        const value = res[2] && parseXmlToJson(res[2]);
        json[key] = ((value && Object.keys(value).length) ? value : res[2]) || null;
    }
    return json;
}

function OnUpdate(doc, meta) {

// filter out non XML

if (!meta.id.startsWith("xml:")) return;

// The KEY started with "xml" try to process it

// ===========================================================

// *** Do other work required here on non .in_xml changes ***

// ===========================================================

// let's see if we need to re-create our json representation.

var xmlchksum = crc64(doc.in_xml);

// ===========================================================

// Don't reprocess if the doc.in_xml has not changed this could be

// a big performance win if the doc has other fields that mutate.

// We do this via a checksum of the .in_xml property.

if (doc.xmlchksum && doc.xmlchksum === xmlchksum) return;

// Either this is the first pass, or the .in_xml property changed.

var jsonDoc = parseXmlToJson(doc.in_xml);

log(meta.id,"1. INPUT xml doc.in_xml :", doc.in_xml);

log(meta.id,"2. CHECKSUM doc.in_xml :", xmlchksum);

log(meta.id,"3. OUTPUT doc.out_json :", jsonDoc);

doc.out_json = jsonDoc;

doc.xmlchksum = xmlchksum;

// ===========================================================

// enrich the source bucket with .out_json and .xmlchksum

src_bkt[meta.id] = doc;

}

// 6.6.0 version no String.matchAll need our own MatchAll function

function* MatchAll(str, regExp) {

if (!regExp.global) {

throw new TypeError('Flag /g must be set!');

}

const localCopy = new RegExp(regExp, regExp.flags);

let match;

while (match = localCopy.exec(str)) {

yield match;

}

// A simple XML to JSON parser

function parseXmlToJson(xml) {

const json = {};

for (const res of MatchAll(xml,/(?:<(\w*)(?:\s[^>]*)*>)((?:(?!<\1).)*)(?:<\/\1>)|<(\w*)(?:\s*)*\/>/gm)) {

const key = res[1] || res[3];

const value = res[2] && parseXmlToJson(res[2]);

json[key] = ((value && Object.keys(value).length) ? value : res[2]) || null;

}

return json;

}

After pasting, the screen appears as displayed below:
Click Save.
To return to the Eventing screen, click the < back to Eventing link (below the editor) or click the Eventing option.

Deploy the Function

We are now ready to start the Eventing functions. From the Couchbase Web Console > Eventing screen, follow these steps:

- - - Click on the function name convertXMLtoJSON to expand and expose the function controls.
    - Click Deploy.
    - In the Confirm Deploy Function dialogue, select Deploy Function from the Feed boundary option.
    - The deployment of your function will take about 18 seconds to become active.

Test the Running Function

At this point we have a lambda running that will update any document with a key prefix of xml: and an XML property of in_xml.

To exercise the Eventing function convertXMLtoJSON, go to the Documents page in the UI and do the following:

Click ADD DOCUMENT.
Enter a key of xml::1 (just make sure the prefix is xml:)
Click Save.

Enter a body as follows:

{
  "type": "xml",
  "id": 1,
  "in_xml": "<CD><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>"
}

{

"type": "xml",

"id": 1,

"in_xml": "<CD><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>"

}

Click Save.

Edit the document you just saved; you will notice it has been updated in real-time

{
  "type": "xml",
  "id": 1,
  "in_xml": "<CD><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>",
  "out_json": {
    "CD": {
      "TITLE": "EmpireBurlesque",
      "ARTIST": "BobDylan",
      "COUNTRY": "USA",
      "COMPANY": "Columbia",
      "PRICE": "10.90",
      "YEAR": "1985"
    }
  },
  "xmlchksum": "02087b7be275d0d8"
}

{

"type": "xml",

"id": 1,

"in_xml": "<CD><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>",

"out_json": {

"CD": {

"TITLE": "EmpireBurlesque",

"ARTIST": "BobDylan",

"COUNTRY": "USA",

"COMPANY": "Columbia",

"PRICE": "10.90",

"YEAR": "1985"

}

"xmlchksum": "02087b7be275d0d8"

}

Adjust the in_xml. Perhaps add <MONTH>FEB</MONTH> before the title.

"in_xml": "<CD><MONTH>FEB</MONTH><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>",

1	"in_xml": "<CD><MONTH>FEB</MONTH><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>",

Click Save.

Again, edit the document you just saved. You will notice it has been updated in real time once again:

{
  "type": "xml",
  "id": 1,
  "in_xml": "<CD><MONTH>FEB</MONTH><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>",
  "out_json": {
    "CD": {
      "MONTH": "FEB",
      "TITLE": "EmpireBurlesque",
      "ARTIST": "BobDylan",
      "COUNTRY": "USA",
      "COMPANY": "Columbia",
      "PRICE": "10.90",
      "YEAR": "1985"
    }
  },
  "xmlchksum": "06b5b40b276f160b"
}

{

"type": "xml",

"id": 1,

"in_xml": "<CD><MONTH>FEB</MONTH><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>",

"out_json": {

"CD": {

"MONTH": "FEB",

"TITLE": "EmpireBurlesque",

"ARTIST": "BobDylan",

"COUNTRY": "USA",

"COMPANY": "Columbia",

"PRICE": "10.90",

"YEAR": "1985"

}

"xmlchksum": "06b5b40b276f160b"

}

Add a new property above type, such as "other": "this is fun",
Click Save.

Once again, edit the document. Note that the checksum will not be updated

{
  "other": "this is fun",
  "type": "xml",
  "id": 1,
  "in_xml": "<CD><MONTH>FEB</MONTH><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>",
  "out_json": {
    "CD": {
      "MONTH": "FEB",
      "TITLE": "EmpireBurlesque",
      "ARTIST": "BobDylan",
      "COUNTRY": "USA",
      "COMPANY": "Columbia",
      "PRICE": "10.90",
      "YEAR": "1985"
    }
  },
  "xmlchksum": "06b5b40b276f160b"
}

{

"other": "this is fun",

"type": "xml",

"id": 1,

"in_xml": "<CD><MONTH>FEB</MONTH><TITLE>EmpireBurlesque</TITLE><ARTIST>BobDylan</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>",

"out_json": {

"CD": {

"MONTH": "FEB",

"TITLE": "EmpireBurlesque",

"ARTIST": "BobDylan",

"COUNTRY": "USA",

"COMPANY": "Columbia",

"PRICE": "10.90",

"YEAR": "1985"

}

"xmlchksum": "06b5b40b276f160b"

}

Let’s Process 100K XML Records

Now that you’ve seen how things work, let’s dump a lot of data at the function. First, we need to generate some data.

I will use a simple Perl script called xml_data_gen.pl that I wrote for this blog post:

#!/usr/bin/perl 
use Getopt::Long qw(GetOptions);

sub randi {
    ($lower_limit,$upper_limit) = @_;
    return int(rand($upper_limit-$lower_limit)) + $lower_limit;
}

my $blk = 0;
my $num = 0;
my $help = 0;
GetOptions(
    'blk=i' => \$blk,
    'num=i' => \$num,
    'help' => \$help,
) or die "Usage: $0 --blk # --num #\n";

if ($num == 0 || $help != 0 || $blk < 1) {
    printf stderr "Usage: $0 --blk # --num #\n";
    printf stderr "examples:\n";
    printf stderr "./xml_data_gen.pl --blk 1 --num 100000 > data.json\n";
    printf stderr "cbimport json -c couchbase://127.0.0.1 -u \$CB_USERNAME " .
        "-p \$CB_PASSWORD -b source -d file://./data.json -f lines -t 4 -g xml::%%id%%\n";
    exit(1);
}

@artist = ("Elton John", "Lisa Lobb", "Sting", "Clash", "The Smiths", 
    "Bob Dylan", "The Yard Birds", "Journey", "Led Zeppelin", "Adele");
@title = ("Empire Burlesque", "Greatest Hits", "Songs Vol 1.", "Songs Vol 2.", 
    "Songs Vol 3.", "Classics", "Hidden Tracks");
@country = ("USA", "UK", "AU", "AR", "ES", "MO");
@company = ("Columbia", "Capital", "Dog Boys", "Epic", "Home Brew", "EMG");



$template = '{"type":"xml","id":_XXX_,"in_xml":"<CD><TITLE>_TTT_</TITLE>' .
    '<ARTIST>_AAA_</ARTIST><COUNTRY>_CCC_</COUNTRY><COMPANY>_OOO_</COMPANY>' .
    '<PRICE>_PPP_</PRICE><YEAR>_YYY_</YEAR></CD>"}';

my $beg = 1 + ($blk-1) * $num;
my $max = $beg + $num - 1;

my $wrk = "";
for ($j=$beg; $j<=$max; $j=$j+1) {

    $wrk = $template; $wrk =~ s/_XXX_/$j/;

    $tmp = $artist[randi(0,$#artist-1)]; $wrk =~ s/_AAA_/$tmp/;
    $tmp = $title[randi(0,$#title-1)]; $wrk =~ s/_TTT_/$tmp/;
    $tmp = $country[randi(0,$#country-1)]; $wrk =~ s/_CCC_/$tmp/;
    $tmp = $company[randi(0,$#company-1)]; $wrk =~ s/_OOO_/$tmp/;
    $tmp = randi(850,1765)/100; $wrk =~ s/_PPP_/$tmp/;
    $tmp = randi(1,30)+1981; $wrk =~ s/_YYY_/$tmp/;

    printf $wrk . "\n";
}

#!/usr/bin/perl

use Getopt::Long qw(GetOptions);

sub randi {

($lower_limit,$upper_limit) = @_;

return int(rand($upper_limit-$lower_limit)) + $lower_limit;

}

my $blk = 0;

my $num = 0;

my $help = 0;

GetOptions(

'blk=i' => \$blk,

'num=i' => \$num,

'help' => \$help,

) or die "Usage: $0 --blk # --num #\n";

if ($num == 0 || $help != 0 || $blk < 1) {

printf stderr "Usage: $0 --blk # --num #\n";

printf stderr "examples:\n";

printf stderr "./xml_data_gen.pl --blk 1 --num 100000 > data.json\n";

printf stderr "cbimport json -c couchbase://127.0.0.1 -u \$CB_USERNAME " .

"-p \$CB_PASSWORD -b source -d file://./data.json -f lines -t 4 -g xml::%%id%%\n";

exit(1);

}

@artist = ("Elton John", "Lisa Lobb", "Sting", "Clash", "The Smiths",

"Bob Dylan", "The Yard Birds", "Journey", "Led Zeppelin", "Adele");

@title = ("Empire Burlesque", "Greatest Hits", "Songs Vol 1.", "Songs Vol 2.",

"Songs Vol 3.", "Classics", "Hidden Tracks");

@country = ("USA", "UK", "AU", "AR", "ES", "MO");

@company = ("Columbia", "Capital", "Dog Boys", "Epic", "Home Brew", "EMG");

$template = '{"type":"xml","id":_XXX_,"in_xml":"<CD><TITLE>_TTT_</TITLE>' .

'<ARTIST>_AAA_</ARTIST><COUNTRY>_CCC_</COUNTRY><COMPANY>_OOO_</COMPANY>' .

'<PRICE>_PPP_</PRICE><YEAR>_YYY_</YEAR></CD>"}';

my $beg = 1 + ($blk-1) * $num;

my $max = $beg + $num - 1;

my $wrk = "";

for ($j=$beg; $j<=$max; $j=$j+1) {

$wrk = $template; $wrk =~ s/_XXX_/$j/;

$tmp = $artist[randi(0,$#artist-1)]; $wrk =~ s/_AAA_/$tmp/;

$tmp = $title[randi(0,$#title-1)]; $wrk =~ s/_TTT_/$tmp/;

$tmp = $country[randi(0,$#country-1)]; $wrk =~ s/_CCC_/$tmp/;

$tmp = $company[randi(0,$#company-1)]; $wrk =~ s/_OOO_/$tmp/;

$tmp = randi(850,1765)/100; $wrk =~ s/_PPP_/$tmp/;

$tmp = randi(1,30)+1981; $wrk =~ s/_YYY_/$tmp/;

printf $wrk . "\n";

}

To test the above generator script xml_data_gen.pl just run: ./xml_data_gen.pl --blk 1 --num 5

You should see something like:

{"type":"xml","id":1,"in_xml":"<CD><TITLE>Songs Vol 3.</TITLE><ARTIST>Elton John</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Epic</COMPANY><PRICE>10.64</PRICE><YEAR>2003</YEAR></CD>"}
{"type":"xml","id":2,"in_xml":"<CD><TITLE>Songs Vol 1.</TITLE><ARTIST>Clash</ARTIST><COUNTRY>AU</COUNTRY><COMPANY>Epic</COMPANY><PRICE>12.69</PRICE><YEAR>1989</YEAR></CD>"}
{"type":"xml","id":3,"in_xml":"<CD><TITLE>Empire Burlesque</TITLE><ARTIST>Journey</ARTIST><COUNTRY>AU</COUNTRY><COMPANY>Epic</COMPANY><PRICE>15.25</PRICE><YEAR>1999</YEAR></CD>"}
{"type":"xml","id":4,"in_xml":"<CD><TITLE>Songs Vol 1.</TITLE><ARTIST>Bob Dylan</ARTIST><COUNTRY>AR</COUNTRY><COMPANY>Dog Boys</COMPANY><PRICE>15.43</PRICE><YEAR>1997</YEAR></CD>"}
{"type":"xml","id":5,"in_xml":"<CD><TITLE>Greatest Hits</TITLE><ARTIST>Journey</ARTIST><COUNTRY>UK</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.71</PRICE><YEAR>1983</YEAR></CD>"}

{"type":"xml","id":1,"in_xml":"<CD><TITLE>Songs Vol 3.</TITLE><ARTIST>Elton John</ARTIST><COUNTRY>USA</COUNTRY><COMPANY>Epic</COMPANY><PRICE>10.64</PRICE><YEAR>2003</YEAR></CD>"}

{"type":"xml","id":2,"in_xml":"<CD><TITLE>Songs Vol 1.</TITLE><ARTIST>Clash</ARTIST><COUNTRY>AU</COUNTRY><COMPANY>Epic</COMPANY><PRICE>12.69</PRICE><YEAR>1989</YEAR></CD>"}

{"type":"xml","id":3,"in_xml":"<CD><TITLE>Empire Burlesque</TITLE><ARTIST>Journey</ARTIST><COUNTRY>AU</COUNTRY><COMPANY>Epic</COMPANY><PRICE>15.25</PRICE><YEAR>1999</YEAR></CD>"}

{"type":"xml","id":4,"in_xml":"<CD><TITLE>Songs Vol 1.</TITLE><ARTIST>Bob Dylan</ARTIST><COUNTRY>AR</COUNTRY><COMPANY>Dog Boys</COMPANY><PRICE>15.43</PRICE><YEAR>1997</YEAR></CD>"}

{"type":"xml","id":5,"in_xml":"<CD><TITLE>Greatest Hits</TITLE><ARTIST>Journey</ARTIST><COUNTRY>UK</COUNTRY><COMPANY>Columbia</COMPANY><PRICE>10.71</PRICE><YEAR>1983</YEAR></CD>"}

To create a 100K-record xml_data_gen.pl to later load into your Couchbase cluster just run: ./xml_data_gen.pl --blk 1 --num 100000 > data.json
Now we need to load the data. I will use the cbimport utility located in the ./bin directory of your Couchbase installation on one of your nodes. I assume you have shell access to your Couchbase cluster and the required credentials.

Shell

cbimport json -c couchbase://127.0.0.1 -u $CB_USERNAME -p $CB_PASSWORD -b source -d file://./data.json -f lines -t 4 -g xml::%id%

1

cbimport json -c couchbase://127.0.0.1 -u $CB_USERNAME -p $CB_PASSWORD -b source -d file://./data.json -f lines -t 4 -g xml::%id%
Take a look. All 100K items will have been converted instantly as they load (on the mutation caused by the insert).

If you want more performance, you can scale up vertically by raising the number of workers in the functions settings or by adding more Eventing nodes. If you want to load more data and maintain 100% memory residency, you may need to increase the size of your bucket.

Final Thoughts

I hope you found this walkthrough educational and have developed more appreciation for the Couchbase Eventing Service as a whole.

Previously we optimized the Function to use less compute resources. We could also optimize for storage space. Note that we have both an in_xml and an out_json properties in our documents after we enrich them. With just a minor tweak (adding two lines) we could remove the in_xml and eliminate the redundant storage (saves 30% of my disk space for me). I chose to leave this out of the example to better highlight how the function ran, but I invite you to experiment with it yourself using the following:

function OnUpdate(doc, meta) {
    // filter out non XML
    if (!meta.id.startsWith("xml:")) return;
    if (!doc.in_xml) return; // optimize for storage space
    
    // ALL OTHER LINES THE SAME
    
    delete doc.in_xml; // optimize for storage space
    src_bkt[meta.id] = doc;
}

function OnUpdate(doc, meta) {

// filter out non XML

if (!meta.id.startsWith("xml:")) return;

if (!doc.in_xml) return; // optimize for storage space

// ALL OTHER LINES THE SAME

delete doc.in_xml; // optimize for storage space

src_bkt[meta.id] = doc;

}

I have placed this example is on Couchbase’s main documentation site as the example Scriptlet: Function: convertXMLtoJSON. It will not translate XML attributes, but that was easy to add (see next paragraph). Yet the simple method parseXmlToJson() I used was a perfect fit for the initial data.

During development, I started simple and tested some massive XML documents but of course I might not have covered the full gamut of XML constructs. I developed the second Scriptlet convertAdvXMLtoJSON (not shown in this article) as I promptly found out that a lot of XML documents utilize attributes and have some interesting syntax variants.

<DOC_WITH_ATTRS>
  <ADVELEM1 adv_attrA="adv_valA" />
  <ADVELEM2 adv_attrA="adv_valA" adv_attrB="adv_valB">
    <SUB>
      SUBDATA
    </SUB>
  </ADVELEM2>
</DOC_WITH_ATTRS>

<DOC_WITH_ATTRS>

<SUB>

SUBDATA

</SUB>

</ADVELEM2>

</DOC_WITH_ATTRS>

This led me to create a more extensive Eventing function that translates XML attributes and the syntax variants I encountered. You will also find the source for this longer, more capable Eventing function on Couchbase’s main documentation site as the example Scriptlet: Function: convertAdvXMLtoJSON. This more advanced version has successfully converted large XML documents including “OpenStreetmap Geometry files” and also “Huawei Cellular Operational Measurements”.

I challenge you to try your hand at the following:

- - - Write a reverse function: Instead of convertXMLtoJSON, say convertJSONtoXML
    - Alter convertAdvXMLtoJSON to prefix any property names such as attr_ to help you identify XML attributes in the final JSON conversion.
    - Add an optional JavaScript map and routine to generically convert properties by name to real numbers and booleans (instead of the default string type)
    - Implement an enhancement to apply a “Document Type Definition” or DTD to the input data, where you read the DTD from another document.
    - Create a translator for XPath queries to address (point to) different parts of a JSON representation of the converted document.

Don’t forget to experiment with SQL++ (N1QL) in the Query Workbench and use the Index Advisor to query and create optimal indexes to access your XML data (transformed to JSON).

Resources

- - - Download: Download Couchbase Server 6.6.2
    - Eventing Scriptlet: Function: convertXMLtoJSON.
    - Eventing Scriptlet: Function: convertAdvXMLtoJSON.

References

I hope I’ve convinced you that Couchbase via Eventing is an XML database.

I would love to hear from you on how you liked the capabilities of Couchbase and Eventing and how they benefit your business going forward. Please share your feedback via the comments below or in the Couchbase forums.

Madhuram Gupta

Platform

Self-Managed

Services

Capabilities

Why Couchbase?

Vector Search

By Use Case

By Industry

By Application Need

Popular Docs

By Developer Role

Capella Playground

Start A Free Capella Trial

Resource Center

Education

Compare

About

Partnerships

Our Services

Partners: Register a Deal

Ready to register a deal with Couchbase?

Marriott

All Posts

How (and Why) to Use Couchbase as an XML Database

Couchbase and XML – no problem!

The Couchbase Data Model

Keys

Values

JSON Is the Database

XML: Extensible Markup Language

JSON: JavaScript Object Notation

JSON Has Taken over the World

Prerequisites: Learning about Eventing

Inserting XML into Couchbase

Converting XML to JSON on the Fly

Eventing Function: convertXMLtoJSON

The Core XML-to-JSON Conversion Logic

Let’s Optimize the Eventing Function

The Final, Complete Eventing Function

Deploying the Eventing Function

Manually Add The Function convertXMLtoJSON

Deploy the Function

Test the Running Function

Let’s Process 100K XML Records

Final Thoughts

Resources

References

Author

Posted by Madhuram Gupta

Leave a reply Cancel reply