A data construct that often appears in business application is the hierarchical data structure. Hierarchy captures the parent-child relationship often between the same object. For instance a company structure captures the reporting line between employees. Business organization captures the relationship between parent companies and subsidiaries. Territory hierarchies in Sales. Book of accounts in financial applications.

Due the self referencing nature of hierarchy, querying the structure efficiently along with its associated data can be a challenge for RDBMS, particularly from a performance perspective. In this article, I will discuss how traditional RDBMS handles hierarchical queries. The challenges that it has to deal with, and how this issue can be similarly addressed with Couchbase N1QL and Couchbase GSI.

Hierarchy Structure in Applications

The main reason for gathering information in a hierarchical structure is to improve the understanding of the information. Company reporting structure is designed, not only to help with how the organization is managed, but also provide a structure to measure and optimize the effectiveness of each group. In a large organization, sales performance is often assessed, not only at the individual level, but at sales team. In short, businesses organize information in a hierarchical structure so that they can get a better understanding of the business performance. In order to achieve that goal, business needs an efficient means to query the hierarchical data.

Representation of company hierarchy

While database data model is capable of efficiently capturing the hierarchical structure, the difficulty arises when you need to query the hierarchical data, and the related information.

Consider this requirement: Get a total sales order value for all sales reps who report up to ThomasH-Sales VP

While the data model is relatively simple. The hierarchical nature of the sales organization suggests an inherent dynamic structure in the reporting hierarchy. AjayW, who heads the Region1 territory, also manages the sales team members directly. Whereas, in Region2, Liz L manages two Managers, who then manages the sales team. This is typical in most application data hierarchies.

RDMBS approach

In order to query hierarchical data, more established RDBMS such as Oracle and SQLSever provide the CONNECT BY / START WITH construct to allow for a single query to recursively traverse the hierarchy employee structure.

While the above query may look simple, the query performance is difficult to improve with indexes due to the recursive nature of the CONNECT BY implementation. For this reason, this technique is not popular in application development for systems with large volume of data. Instead, enterprise applications rely on a pre-flattened object structure for a more predictable query performance. The flattened hierarchy technique is described in the Couchbase N1QL section below.

Couchbase N1QL

In order to get the best query performance, N1QL applications should use the flattened hierarchy structure. The approach  provides more predictable performance as well as better at leveraging Couchbase GSI. The diagram below shows an example of the flattening transformation of a self-referencing hierarchical structure, such as the employee object. I also include a python code snippet below that you can use to flatten the hierarchical structure.  The flatten_hierarchy function takes a self referencing hierarchical JSON object and generates two new objects in the same keyspace but with different type values

  • The object_hier structure works with aggregated BI queries, where query results can be rolled up for each level.
  • The object_hier_level structure provides the same result as the CONNECT BY/START WITH clause, which returns all of the child objects for any given node. This is the object that we will use in our N1QL query to provide the query solution.

Recommended GS Index:

Notes:

  1. The main query retrieves all sales order in the crm bucket with type value = ‘salesorder’
  2. The query performs a HASH JOIN with another query (N1QL 6.5 feature) that retrieves all the employee ids who report to user101, i.e ThomasH-SalesVP
  3. Additional covering indexes can also improve query performance
  4. The query uses N1QL 6.5 ANSI JOIN Support for Expression and Subquery Term

Resources

We would love to hear from you on how you liked the 6.5 features and how it’ll benefit your business going forward. Please share your feedback via the comments or in the forum.

Author

Posted by Madhuram Gupta

5 Comments

  1. N1QL Query with INNER JOIN no longer works in CB6 as written.

    [
    {
    “code”: 3000,
    “msg”: “ANSI JOIN must be done on a keyspace. – at where \n Error while parsing: runtime error: invalid memory address or nil pointer dereference – at where”,
    “query_from_user”: “SELECT uhl.id FROM test uhl \r\nINNER JOIN (SELECT e.id,sum(a.value)\r\nFROM test a\r\n WHERE a.type=’sales_order’ ) e ON a.owner = e.id\r\nwhere uhl.type =’_employee_hier_level’\r\n AND uhl.parent=’101’\r\nGROUP BY e.id”
    }
    ]

    Also the keyword value needs to be quoted or syntax error. Something like

    SELECT e.id,sum(a.value), (SELECT uhl.id FROM test uhl WHERE uhl.type =’_employee_hier_level’
    AND uhl.parent=’101′) e
    FROM test a
    WHERE a.type=’sales_order’ and a.owner = e.id
    GROUP BY e.id

    works but not sure about the USE HASH…

    1. The query uses a new N1QL 6.5 feature – ANSI JOIN Support for Expression Term and Subquery Term. 6.5 is planned for beta in late May. Please let me know if you are interested in an early build to try this out. I am also looking into a way to make dataset available.

      thanks,
      -binh

  2. Suggest some sample JSON data would be nice, too

  3. well, either Oracle ATP – as well as MS SQL Server offers a hierarchy function and a special hierarchy data type, too. Could you clarify differences to CouchBase? Does ANSI SQL and/or CouchBase adds these SQL key words: “CONNECT BY” and “LEVEL”? TIA.

  4. Hi CM27,

    Couchbase query does not support the “CONNECT BY” keyword. Relationship traversal capability is on our plan. On that same note, CONNECT BY is an Oracle specific feature, whereas Recursive CTE is the ANSI approach for relationship query.

    -binh

Leave a reply