Introduction to external datasets
Couchbase is very excited to announce its new “External Datasets” Analytics Service feature in the latest Couchbase Server 6.6 release. External datasets empower customers to access externally stored data in real-time from Amazon Web Services (AWS) Simple Storage Service (S3) and to combine S3-resident data with existing Couchbase data for analysis.
Customer use case
Some customers use AWS S3 to reduce storage costs and store data (e.g., multiple years of historical data, offline business data for machine learning, product reviews, etc.). They have expressed a desire to combine, query, and utilize S3 data in real-time to make this data available to business users for analytics. You can read more about other Analytics use cases here.
How do external datasets work?
External datasets provide the ability to dynamically query and analyze data residing in AWS S3, allowing users to easily combine data in real-time from both inside and outside their Couchbase analytics nodes. This is achieved in three simple steps:
- Set up an S3 link by using a REST API call or the command-line interface (CLI)
- Create an external dataset on the S3 link
- Query the dataset using SQL++ (or your favorite BI tool)
Let’s walk through a simple example. iMaz, an e-commerce company, sells consumer products online. Their order, product, and user data are stored on a Couchbase cluster with both data and analytics services (on separate sets of nodes in the cluster). They use the Analytics Service to run ad hoc and complex queries to analyze their business. iMaz also stores their product reviews on AWS S3, and they would like to combine and analyze the top 3 most highly rated products using the Couchbase Analytics Service.
Sample product data:
```json
[
  {
    "id": "Product_1",
    "docType": "Product",
    "productId": 1,
    "price": 811.76,
    "salePrice": 70.14,
    "productName": "Ergonomic Cotton Ball",
    "desc": "Plastic fused metallic Ergonomic Cotton Ball"
  }
]
```
Sample review data:
```json
{
  "id": "Review_0001764a17a844279a2227e137cc4e36",
  "docType": "Review",
  "reviewId": "0001764a17a844279a2227e137cc4e36",
  "productId": 1,
  "userId": 5862,
  "reviewerName": "M. Schaefer",
  "reviewerEmail": "...@mmail.com",
  "rating": 5,
  "title": "Works well and meets expectations.",
  "review": "Product works great and will buy one more for my extended family.",
  "reviewDate": 1597273484
}
```
Let’s follow the three steps from above with sample setup code along with a SQL++ query.
Step 1: Set up S3 link
We’ll create an S3 link using a REST API call. (Alternatively, you can use the CLI to create S3 links.) We’ll need to provide:
- Analytics Service hostname
- Analytics user credentials
- S3 link name (in this case myS3Link)
- Dataverse name (if different from default)
- Link type (S3)
- AWS S3 access key ID (required)
- AWS S3 secret access key (required)
- AWS S3 region (required; e.g., us-west-2)
```bash
curl -u <username>:<pwd> -X POST "http://<analytics_hostname>/analytics/link" \
  -d dataverse=Default \
  -d name=myS3Link \
  -d type=S3 \
  -d accessKeyId=... \
  -d secretAccessKey=... \
  -d region=us-west-2
```
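If you prefer to script link creation, the same REST call can be prepared programmatically. The sketch below only builds the request (endpoint path and parameter names are taken from the curl example above; the host, port, and credentials are placeholders) rather than sending it:

```python
# Sketch: building the /analytics/link POST request programmatically.
# Parameter names mirror the curl example; host and credentials are placeholders.
from urllib.parse import urlencode
from urllib.request import Request

def build_create_link_request(host, access_key, secret_key):
    """Build (but do not send) the POST request that creates the S3 link."""
    params = {
        "dataverse": "Default",
        "name": "myS3Link",
        "type": "S3",
        "accessKeyId": access_key,
        "secretAccessKey": secret_key,
        "region": "us-west-2",
    }
    body = urlencode(params).encode("utf-8")
    # In real use you would also attach basic-auth headers and call urlopen(req).
    req = Request(f"http://{host}/analytics/link", data=body, method="POST")
    return req, params

req, params = build_create_link_request("localhost:8095", "KEY", "SECRET")
print(req.full_url)   # http://localhost:8095/analytics/link
```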
Step 2: Create an External Dataset
Using the Analytics workbench, we’ll now create an external dataset named “S3productreviews”. We’ll need to specify:
- S3 bucket name (in this case cb-analytics-6.6-demo)
- Dataverse name (if different from default)
- (Optional) directory inside the bucket from which files will be read and recursively collected (in this case product reviews are stored in a “reviews” folder)
- File format (in this case we’ll use JSON) with the ability to specify a search pattern (in this case *.json indicates that all JSON files will be included when querying data)
```sql
CREATE EXTERNAL DATASET S3productreviews
ON `cb-analytics-6.6-demo` AT myS3Link
USING "reviews"
WITH { "format": "json", "include": "*.json" };
```
Currently, the external datasets feature supports the json, csv (comma-separated values), and tsv (tab-separated values) file formats, including compressed gzip files (filenames ending with .gz or .gzip). Both the csv and tsv formats require you to specify an inlined type definition (more about this shortly). Additional file formats will be supported in future releases. You can read more about that here.
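The "include" option acts as a filename filter over the objects under the chosen directory. Assuming standard shell-glob semantics for the pattern (and using made-up object names), its effect can be sketched in Python:

```python
# Sketch of glob-style "include" filtering, assuming standard shell-glob
# semantics for the pattern. Object names below are invented for illustration.
from fnmatch import fnmatch

objects = [
    "reviews/2019/part-000.json",
    "reviews/2020/part-001.json",
    "reviews/2020/part-001.json.gz",  # compressed files need a matching pattern
    "reviews/notes.txt",
]

def matched(pattern):
    """Return the objects whose filename matches the include pattern."""
    return [o for o in objects if fnmatch(o.rsplit("/", 1)[-1], pattern)]

print(matched("*.json"))   # only the plain .json files
print(matched("*.json*"))  # also picks up the gzipped file
```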
Step 3: Query using SQL++
As the last step, we can now run the SQL++ query listed below (which looks exactly like SQL :)). It joins the existing products dataset from the Couchbase Analytics Service and the product reviews data from AWS S3 to get the top 3 highly rated products.
```sql
SELECT p.productName, AVG(s.rating) AS `Rating`
FROM S3productreviews s, products p
WHERE s.productId = p.productId
GROUP BY p.productName
ORDER BY AVG(s.rating) DESC
LIMIT 3;
```
Here are the JSON query results:
```json
[
  { "Rating": 4.33, "productName": "Licensed Rubber Tuna" },
  { "Rating": 4.29, "productName": "Gorgeous Plastic Salad" },
  { "Rating": 3.86, "productName": "Intelligent Cotton Bike" }
]
```
This is great – we’re now able to combine and analyze external data located in AWS S3 from the Couchbase Analytics Service. Notice how few steps it took to enable us to analyze our data; no ETL was involved, and the data was immediately available!
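To make the query's logic concrete, here is an illustrative re-implementation of its join, group-by, and ordering in plain Python over a handful of in-memory documents (the product names and ratings below are invented, not the S3 data):

```python
# Illustrative Python equivalent of the SQL++ join / GROUP BY / ORDER BY.
# Sample documents are invented; only the query logic mirrors the article.
from collections import defaultdict

products = [
    {"productId": 1, "productName": "Ergonomic Cotton Ball"},
    {"productId": 2, "productName": "Licensed Rubber Tuna"},
]
reviews = [
    {"productId": 1, "rating": 3},
    {"productId": 1, "rating": 4},
    {"productId": 2, "rating": 5},
]

def top_rated(products, reviews, limit=3):
    names = {p["productId"]: p["productName"] for p in products}
    sums = defaultdict(lambda: [0, 0])           # productId -> [total, count]
    for r in reviews:                            # WHERE s.productId = p.productId
        if r["productId"] in names:
            sums[r["productId"]][0] += r["rating"]
            sums[r["productId"]][1] += 1
    rows = [{"productName": names[pid], "Rating": total / count}
            for pid, (total, count) in sums.items()]      # GROUP BY + AVG
    rows.sort(key=lambda row: row["Rating"], reverse=True)  # ORDER BY ... DESC
    return rows[:limit]                          # LIMIT

print(top_rated(products, reviews))
```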
You might now be wondering: How would this have worked if the S3 reviews file format had been of type csv instead of JSON? The answer is simple: you would have constructed your external dataset accordingly. Below, we show what the create external dataset statement from above would look like to support csv:
```sql
CREATE EXTERNAL DATASET S3productreviews (
  id            STRING NOT UNKNOWN,
  docType       STRING NOT UNKNOWN,
  reviewId      STRING NOT UNKNOWN,
  productId     BIGINT,
  userId        BIGINT,
  reviewerName  STRING NOT UNKNOWN,
  reviewerEmail STRING NOT UNKNOWN,
  rating        BIGINT,
  title         STRING NOT UNKNOWN,
  review        STRING NOT UNKNOWN,
  reviewDate    BIGINT
)
ON `cb-analytics-6.6-demo` AT myS3Link
USING "reviews"
WITH { "format": "csv", "include": "*.csv", "header": false };
```
Notice how the create statement now includes inlined type information. This is needed to tell Analytics how to interpret the csv data (e.g., not just as strings).
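The effect of the inlined type definition can be pictured as a per-column cast applied while parsing each row. The sketch below illustrates the idea in Python (column names and types follow the CREATE statement above; the sample row and email address are invented):

```python
# Sketch: casting a raw CSV row according to an inlined type definition.
# Column order mirrors the CREATE statement above; the sample row is invented.
import csv
import io

COLUMNS = ["id", "docType", "reviewId", "productId", "userId",
           "reviewerName", "reviewerEmail", "rating", "title",
           "review", "reviewDate"]
TYPES = {"productId": int, "userId": int, "rating": int, "reviewDate": int}

def parse_review(line):
    """Parse one CSV line into a dict, casting the BIGINT columns to int."""
    raw = next(csv.reader(io.StringIO(line)))
    return {col: TYPES.get(col, str)(val) for col, val in zip(COLUMNS, raw)}

row = parse_review(
    "r1,Review,r1,1,5862,M. Schaefer,user@example.com,5,Great,Works well,1597273484"
)
print(type(row["rating"]))  # <class 'int'> -- a number, not just a string
```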
The SQL++ query remains exactly the same. That’s right, no change at all! External datasets are easy to set up, flexible, and simple to use thanks to the power of the SQL++ language. Users can develop complex ad hoc queries for further data exploration, answer new business questions, and combine external data with data from Remote Links to bring in other Couchbase data sources as well.
Benefits
Here are key benefits that come from using external datasets:
- Data enrichment. Couchbase data can now be enriched with additional information obtained from files that reside in an enterprise’s existing S3-based data lake.
- Dynamic data access. The latest data can be dynamically retrieved, streamed, combined, and analyzed from any S3 bucket in any AWS region during Analytics query execution.
- Parallel query processing. Users can configure and arrange access to S3 data using Analytics’ massively parallel processing (MPP) query processing architecture for fast response to queries involving external data.
Summary
External Datasets unlock the value of external live and archived data residing in S3-based data lakes. Users can combine and analyze data in real-time, sourced from both AWS S3 and Couchbase Analytics Service. This enables faster and more comprehensive data analysis and agile decision making.
Resources
You can learn more about External Datasets statements here. Register here for our upcoming “What’s new in Couchbase Server release 6.6” webinar.
Explore Couchbase Server 6.6 resources
Thanks to Till Westmann for co-authoring and to Michael Carey for valuable contributions and review of this post.
Co-author
Till Westmann, Engineering Director at Couchbase
Till Westmann is an Engineering Director at Couchbase working on the Analytics Service. Before joining Couchbase Till built data management software at Oracle, 28msec, SAP, BEA Systems, XQRL, and Xyleme. He is a member of the Apache Software Foundation and the Vice President of the Apache AsterixDB project. Till holds a PhD from the University of Mannheim in Germany.