Introduction to external datasets
Couchbase is very excited to announce its new “External Datasets” Analytics Service feature in the latest Couchbase Server 6.6 release. External datasets empower customers to access externally stored data in real-time from Amazon Web Services (AWS) Simple Storage Service (S3) and to combine S3-resident data with existing Couchbase data for analysis.
Customer use case
Some customers use AWS S3 to reduce storage costs and store data (e.g., multiple years of historical data, offline business data for machine learning, product reviews, etc.). They have expressed a desire to combine, query, and utilize S3 data in real-time to make this data available to business users for analytics. You can read more about other Analytics use cases here.
How do external datasets work?
External datasets provide the ability to dynamically query and analyze data residing in AWS S3, allowing users to easily combine data in real-time from both inside and outside their Couchbase analytics nodes. This is achieved in three simple steps:
- Set up an S3 link by using a REST API call or the command-line interface (CLI)
- Create an external dataset on the S3 link
- Query the dataset using SQL++ (or your favorite BI tool)
Let’s walk through a simple example. iMaz, an e-commerce company, sells consumer products online. Their order, product, and user data are stored on a Couchbase cluster with both data and analytics services (on separate sets of nodes in the cluster). They use the Analytics Service to run ad hoc and complex queries to analyze their business. iMaz also stores their product reviews on AWS S3, and they would like to combine and analyze the top 3 most highly rated products using the Couchbase Analytics Service.
Sample product data:
```json
[
  {
    "id": "Product_1",
    "docType": "Product",
    "productId": 1,
    "price": 811.76,
    "salePrice": 70.14,
    "productName": "Ergonomic Cotton Ball",
    "desc": "Plastic fused metallic Ergonomic Cotton Ball"
  }
]
```
Sample review data:
```json
{
  "id": "Review_0001764a17a844279a2227e137cc4e36",
  "docType": "Review",
  "reviewId": "0001764a17a844279a2227e137cc4e36",
  "productId": 1,
  "userId": 5862,
  "reviewerName": "M. Schaefer",
  "reviewerEmail": "...@mmail.com",
  "rating": 5,
  "title": "Works well and meets expectations.",
  "review": "Product works great and will buy one more for my extended family.",
  "reviewDate": 1597273484
}
```
Let’s follow the three steps from above with sample setup code along with a SQL++ query.
Step 1: Set up S3 link
We’ll create an S3 link using a REST API call. (Alternatively, you can use the CLI to create S3 links.) We’ll need to provide:
- Analytics Service hostname
- Analytics user credentials
- S3 link name (in this case myS3Link)
- Dataverse name (if different from default)
- Link type (S3)
- AWS S3 access key ID (required)
- AWS S3 secret access key (required)
- AWS S3 region (required; e.g., us-west-2)
```bash
curl -u <username>:<pwd> -X POST "http://<analytics_hostname>/analytics/link" \
  -d dataverse=Default \
  -d name=myS3Link \
  -d type=S3 \
  -d accessKeyId=... \
  -d secretAccessKey=... \
  -d region=us-west-2
```
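If you prefer to script link creation, the same REST call can be prepared programmatically. The sketch below only builds the request (endpoint path and parameter names are taken from the curl example above; the host, port, and credentials are placeholders) rather than sending it:

```python
# Sketch: building the /analytics/link POST request programmatically.
# Parameter names mirror the curl example; host and credentials are placeholders.
from urllib.parse import urlencode
from urllib.request import Request

def build_create_link_request(host, access_key, secret_key):
    """Build (but do not send) the POST request that creates the S3 link."""
    params = {
        "dataverse": "Default",
        "name": "myS3Link",
        "type": "S3",
        "accessKeyId": access_key,
        "secretAccessKey": secret_key,
        "region": "us-west-2",
    }
    body = urlencode(params).encode("utf-8")
    # In real use you would also attach basic-auth headers and call urlopen(req).
    req = Request(f"http://{host}/analytics/link", data=body, method="POST")
    return req, params

req, params = build_create_link_request("localhost:8095", "KEY", "SECRET")
print(req.full_url)   # http://localhost:8095/analytics/link
```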
Step 2: Create an External Dataset
Using the Analytics workbench, we’ll now create an external dataset named “S3productreviews”. We’ll need to specify:
- S3 bucket name (in this case cb-analytics-6.6-demo)
- Dataverse name (if different from default)
- (Optional) directory inside the bucket from which files will be read and recursively collected (in this case product reviews are stored in a “reviews” folder)
- File format (in this case we’ll use JSON) with the ability to specify a search pattern (in this case *.json indicates that all JSON files will be included when querying data)
```sql
CREATE EXTERNAL DATASET S3productreviews
ON `cb-analytics-6.6-demo` AT myS3Link
USING "reviews"
WITH { "format": "json", "include": "*.json" };
```
Currently, the external datasets feature supports the json, csv (comma-separated values), and tsv (tab-separated values) file formats, including compressed gzip files (filenames ending with .gz or .gzip). Both the csv and tsv formats require you to specify an inlined type definition (more about this shortly). Additional file formats will be supported in future releases. You can read more about that here.
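The "include" option acts as a filename filter over the objects under the chosen directory. Assuming standard shell-glob semantics for the pattern (and using made-up object names), its effect can be sketched in Python:

```python
# Sketch of glob-style "include" filtering, assuming standard shell-glob
# semantics for the pattern. Object names below are invented for illustration.
from fnmatch import fnmatch

objects = [
    "reviews/2019/part-000.json",
    "reviews/2020/part-001.json",
    "reviews/2020/part-001.json.gz",  # compressed files need a matching pattern
    "reviews/notes.txt",
]

def matched(pattern):
    """Return the objects whose filename matches the include pattern."""
    return [o for o in objects if fnmatch(o.rsplit("/", 1)[-1], pattern)]

print(matched("*.json"))   # only the plain .json files
print(matched("*.json*"))  # also picks up the gzipped file
```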
Step 3: Query using SQL++
As the last step, we can now run the SQL++ query listed below (which looks exactly like SQL :)). It joins the existing products dataset from the Couchbase Analytics Service and the product reviews data from AWS S3 to get the top 3 highly rated products.
```sql
SELECT p.productName, AVG(s.rating) AS `Rating`
FROM S3productreviews s, products p
WHERE s.productId = p.productId
GROUP BY p.productName
ORDER BY AVG(s.rating) DESC
LIMIT 3;
```
Here are the JSON query results:
```json
[
  { "Rating": 4.33, "productName": "Licensed Rubber Tuna" },
  { "Rating": 4.29, "productName": "Gorgeous Plastic Salad" },
  { "Rating": 3.86, "productName": "Intelligent Cotton Bike" }
]
```
This is great – we’re now able to combine and analyze external data located in AWS S3 from the Couchbase Analytics Service. Notice how few steps it took to enable us to analyze our data; no ETL was involved, and the data was immediately available!
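To make the query's logic concrete, here is an illustrative re-implementation of its join, group-by, and ordering in plain Python over a handful of in-memory documents (the product names and ratings below are invented, not the S3 data):

```python
# Illustrative Python equivalent of the SQL++ join / GROUP BY / ORDER BY.
# Sample documents are invented; only the query logic mirrors the article.
from collections import defaultdict

products = [
    {"productId": 1, "productName": "Ergonomic Cotton Ball"},
    {"productId": 2, "productName": "Licensed Rubber Tuna"},
]
reviews = [
    {"productId": 1, "rating": 3},
    {"productId": 1, "rating": 4},
    {"productId": 2, "rating": 5},
]

def top_rated(products, reviews, limit=3):
    names = {p["productId"]: p["productName"] for p in products}
    sums = defaultdict(lambda: [0, 0])           # productId -> [total, count]
    for r in reviews:                            # WHERE s.productId = p.productId
        if r["productId"] in names:
            sums[r["productId"]][0] += r["rating"]
            sums[r["productId"]][1] += 1
    rows = [{"productName": names[pid], "Rating": total / count}
            for pid, (total, count) in sums.items()]      # GROUP BY + AVG
    rows.sort(key=lambda row: row["Rating"], reverse=True)  # ORDER BY ... DESC
    return rows[:limit]                          # LIMIT

print(top_rated(products, reviews))
```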
You might now be wondering: How would this have worked if the S3 reviews file format had been of type csv instead of JSON? The answer is simple: you would have constructed your external dataset accordingly. Below, we show what the create external dataset statement from above would look like to support csv:
```sql
CREATE EXTERNAL DATASET S3productreviews (
  id            STRING NOT UNKNOWN,
  docType       STRING NOT UNKNOWN,
  reviewId      STRING NOT UNKNOWN,
  productId     BIGINT,
  userId        BIGINT,
  reviewerName  STRING NOT UNKNOWN,
  reviewerEmail STRING NOT UNKNOWN,
  rating        BIGINT,
  title         STRING NOT UNKNOWN,
  review        STRING NOT UNKNOWN,
  reviewDate    BIGINT
)
ON `cb-analytics-6.6-demo` AT myS3Link
USING "reviews"
WITH { "format": "csv", "include": "*.csv", "header": false };
```
Notice how the create statement now includes inlined type information. This is needed to tell Analytics how to interpret the csv data (e.g., not just as strings).
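The effect of the inlined type definition can be pictured as a per-column cast applied while parsing each row. The sketch below illustrates the idea in Python (column names and types follow the CREATE statement above; the sample row and email address are invented):

```python
# Sketch: casting a raw CSV row according to an inlined type definition.
# Column order mirrors the CREATE statement above; the sample row is invented.
import csv
import io

COLUMNS = ["id", "docType", "reviewId", "productId", "userId",
           "reviewerName", "reviewerEmail", "rating", "title",
           "review", "reviewDate"]
TYPES = {"productId": int, "userId": int, "rating": int, "reviewDate": int}

def parse_review(line):
    """Parse one CSV line into a dict, casting the BIGINT columns to int."""
    raw = next(csv.reader(io.StringIO(line)))
    return {col: TYPES.get(col, str)(val) for col, val in zip(COLUMNS, raw)}

row = parse_review(
    "r1,Review,r1,1,5862,M. Schaefer,user@example.com,5,Great,Works well,1597273484"
)
print(type(row["rating"]))  # <class 'int'> -- a number, not just a string
```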
The SQL++ query remains exactly the same. That’s right, no change at all! External datasets are easy to set up, flexible, and simple to use thanks to the power of the SQL++ language. Users can develop complex ad hoc queries for further data exploration, answer new business questions, and combine external data with data from Remote Links to bring in other Couchbase data sources as well.
Benefits
Here are key benefits that come from using external datasets:
- Data enrichment. Couchbase data can now be enriched with additional information obtained from files that reside in an enterprise’s existing S3-based data lake.
- Dynamic data access. The latest data can be dynamically retrieved, streamed, combined, and analyzed from any S3 bucket in any AWS region during Analytics query execution.
- Parallel query processing. Users can configure and arrange access to S3 data using Analytics’ massively parallel processing (MPP) query processing architecture for fast response to queries involving external data.
Summary
External Datasets unlock the value of external live and archived data residing in S3-based data lakes. Users can combine and analyze data in real-time, sourced from both AWS S3 and Couchbase Analytics Service. This enables faster and more comprehensive data analysis and agile decision making.
Resources
You can learn more about External Datasets statements here. Register here for our upcoming “What’s new in Couchbase Server release 6.6” webinar.
Explore Couchbase Server 6.6 resources
Thanks to Till Westmann for co-authoring and to Michael Carey for valuable contributions and review of this post.
Co-author
Till Westmann, Engineering Director at Couchbase
Till Westmann is an Engineering Director at Couchbase working on the Analytics Service. Before joining Couchbase Till built data management software at Oracle, 28msec, SAP, BEA Systems, XQRL, and Xyleme. He is a member of the Apache Software Foundation and the Vice President of the Apache AsterixDB project. Till holds a PhD from the University of Mannheim in Germany.