“Nothing is certain except for death and taxes.”

This isn’t a dataset made with a bed of roses or manicured green grass. A bit more serious. Let’s see if we can quickly learn anything here. The dataset is the following.

“name” : “NCHS – Leading Causes of Death: United States”,
“attribution” : “National Center for Health Statistics”,

The public dataset is available at https://data.cdc.gov/api/views/bi63-dtpu/rows.json?accessType=DOWNLOAD

Step 1: Download the file into a local file (e.g. health.json). Upload this file to one of the nodes in the Couchbase cluster.

Step 2: import the data into a bucket called cause.  After you create the bucket, create the primary index. You’ll need this for querying.

/opt/couchbase/bin/cbimport json -c couchbase://127.0.0.1 -u Administrator -p password -b cause -d file://health.json -g cause:0 -f sample

CREATE PRIMARY INDEX ON cause;

Step 3. Inspect the structure of the data.

All of the data is provided in a SINGLE JSON document.  Because of this, INFER doesn’t help.  You’ll have to inspect and understand the structure manually. This data in typical government dataset with a lot of data in simple arrays with the meaning of each entity given at in the metadata.

Simple array:

 

 

 

 

 

 

 

 

This simply contains an array of data without the schema.  For the public datasets, the schema is in the meta field.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Let’s transform the structure into simple JSON key-value pairs so we can handle these bit more effectively.  You can learn more about how this magic happened in this article.

 

 

 

 

 

 

 

 

Task1: Find out the cause for most deaths in a state, by year.

The common table expression (CTE) in the WITH clause (csdata) transforms the complex json data into flat JSON.   You can do this dynamically or do this once and INSERT back into a bucket, as I’ve discussed in the article on New York baby names. In this article, I use CTEs.

 

 

 

 

 

 

 

 

In this case, all the deaths in California come on top, mainly due to its population.

Task 2. Find out leading causes of death in each state for the year 2016.

Query 2: Use the resultset from the previous query and then use the FIRST_VALUE() window function to determine the top cause.  Partitioning by state (in the OVER BY clause) will give you the partitions by state and ORDER BY dx.totdeaths within the OVER BY clause will give you the top cause in every state.

 

 

 

 

 

 

 

 

Task 3.  Find out how the top reason has changed by the year, from 1999 to 2016 by state.

Query 3:  Simply generate the report for all the years (199-2016) and then determine the top reason and finally get the highest reason by grouping by state, year and getting MAX(topcount) for the topreason cause.

Here’s the partial result.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Visualizing this gives us the following histogram.

 

 

Author

Posted by Madhuram Gupta

Leave a reply