Many statistical data analysts and data scientists use the R programming language to crunch their numbers outside of a database. Likewise, database analysts try to do everything in the same database whenever possible to maintain a single source. Couchbase provides a common basis for bridging the gap between JSON datasets, powerful query languages, and analytic toolsets like R.
The distributed, fault-tolerant nature and flexible schema are only some of the reasons businesses turn to Couchbase. To learn more, see Why Enterprises use NoSQL to deliver next-generation products.
The Couchbase SDK covers the most popular programming languages: Java, .NET, Node.js, Go, and more. But does not need to provide a specific drive for R because it can easily use the REST API.
In this post, I resurrect an R programming tutorial from an older blog and see how R works with the latest Couchbase version. Examples show how to run a tabular N1QL/SQL query on the JSON data, as well as mapping the resulting locations of some of the geospatial data.
What is R?
R fills the need for an open source statistical computing and graphics toolset similar to proprietary products like SAS and SPSS. This covers statistical analysis and graphical visualization geared especially well for publication-ready graphical plots for scientific documentation.
Is R better than Python for statistical analysis? That’s open for debate, and yet there are plenty of advocates. In fact, for many developers, it may be the most popular analytical environment you’ve never used. In one global R programming jobs search project, the author found R programming in 5th place behind Python, SQL, Java, and Amazon ML. It was moderately higher than C/C++, Tableau, Hadoop, and even SAS for statistical data analysis toolsets.
The ecosystem that it spawned provides tens of thousands of packages that add powerful analytics, time series analysis, visualization capabilities and more. Several IDEs exist for building solutions, for this post I use R Studio (which I easily used for the first time while writing this article).
Setting up Couchbase
The default Couchbase server install has everything needed for this tutorial. I suggest installing it on a single node, ideally the PC you are using with R Studio so the connection URL will be localhost. I’m using the Couchbase 7.x beta version.
Once installed, set up a new bucket with the travel-sample dataset and create a new user that has access to that bucket.
Setting up R Studio
Next, install R and then R Studio Desktop (free version) with all the default options. Once installations are complete, launch R Studio and select Tools -> Install Packages and install these four: httr, jsonlite, ggplot2, leaflet.
Alternatively, you can run the following code in the R coding console:
1 |
install.packages(c("httr", "jsonlite", "ggplot2", "leaflet")) |
R Studio will go away and install a bunch of underlying dependencies for you. Meanwhile, you continue to the next step while that runs.
Testing N1QL Query in Couchbase
Before we start the R programming code, let’s test the query using the Query tab in the Couchbase web console.
1 2 3 4 5 6 7 8 9 |
SELECT a.name, count(*) as total_flights FROM `travel-sample` r JOIN `travel-sample` a ON KEYS r.airlineid WHERE r.type ="route" AND a.type="airline" GROUP BY a.name ORDER BY total_flights DESC LIMIT 20 |
If you’ve had to work JSON before, you’ll see how simple it can be to query from documents in Couchbase using a familiar syntax.
The result shows the top 20 airlines and the number of flights each has in the sample database.
R Coding N1QL Query
Set up your R project by setting the library requirements at the top of the script in the console:
1 2 3 4 |
library(httr) library(jsonlite) library(ggplot2) library(leaflet) |
To build the query and handle the data responses, we create a few variables.
First, the connection info for Couchbase is provided including the URL, port, and the username and password you created earlier:
1 |
cbServer <- "http://localhost:8093/query/service" |
Then encode the query into its own variable as well. Note the escaped double quotes that are needed (\”).
1 |
query <- "SELECT a.name, count(*) as total_flights FROM `travel-sample` r JOIN `travel-sample` a ON KEYS r.airlineid WHERE r.type =\"route\" AND a.type=\"airline\" GROUP BY a.name ORDER BY total_flights DESC LIMIT 20" |
Next is setting up the HTTP request:
1 |
req <- httr::POST(cbServer, httr::add_headers("Content-Type" = "application/x-www-form-urlencoded;charset=UTF-8"), body = paste("statement=", query), authentication("public","public")) |
And then receiving the actual response from the request and assigning it to a variable to hold the data:
1 2 |
res <- fromJSON(httr::content(req, "text")) airlineFlights <- res$results |
That’s it!
Visualize Query Results from Couchbase
Now we can see the results of the query by entering the data variable:
R-Studio, with the ggplot extension and a few settings, is a simple way to rapidly view data in a variable:
1 |
ggplot(data=airlineFlights, aes(x=name, y=total_flights)) + theme(axis.text.x=element_text(angle=90,hjust=1)) + geom_bar(stat="identity") |
Now you can tweak the query or ggplot code and experiment with other approaches.
Map Geospatial Data from Couchbase using R Coding
The power of some of the other packages is incredible. For the next example, we use the R coding that leverages the Leaflet web mapping library.
Let’s change the query to list hotels that are pet-friendly. Because the hotel data includes a geographic latitude and longitude object (“geo”) we can easily make a map of the results.
Adjust your query to include name, latitude, and longitude along with a couple of simple filters:
1 2 3 |
SELECT name,geo.lat as lat,geo.lon as long FROM `travel-sample` WHERE type="hotel" AND city="London" AND pets_ok=true |
Test it in the Couchbase web console and you get 37 results. Then put the query into the variables as before:
1 2 3 |
query <- "SELECT name, geo.lat as lat, geo.lon as long FROM `travel-sample` WHERE type=\"hotel\" AND city=\"London\" AND pets_ok=true" req <- httr::POST(cbServer, httr::add_headers("Content-Type" = "application/x-www-form-urlencoded;charset=UTF-8"), body = paste("statement=", query), authentication("public","public")) res <- fromJSON(httr::content(req, "text")) petFriendlyHotelsLondon <- res$results |
Results will look similar to this:
To make the map, we use the leaflet package and provide a few settings to identify the fields to use for locating the marker symbols on the map.
1 |
leaflet(data = petFriendlyHotelsLondon) %>% addTiles() %>% addMarkers(~long, ~lat, popup = ~as.character(name), label = ~as.character(name)) %>% addProviderTiles(providers$OpenStreetMap) |
R Studio is a great information-rich IDE allowing code editing, interactive console, and built-in viewer components, including the map, just like ggplot did with the earlier chart.
Naturally, you can expand any of the window panes accordingly:
Conclusion
The flexibility of Couchbase and R, with its powerful set of 3rd party packages, are a great combination. Application developers can leave data in the central database and take advantage of failover, distributed processing, full-text search, and SQL analytics.
Meanwhile, the same database can serve analysts who need an easy-to-use API to access data without making multiple offline copies.
This comprehensive data platform approach continues to attract enterprises of all sizes that need to simplify their architecture while also making it more stable and scalable.
Thank you for this useful intro. How do you send the percentage sign, e.g., like in:
…
where meta(d).id LIKE ‘%something%’
…
Queries containing % all fails stating there is no statement. Escaping with \ and double %% does not seem to solve it either. Is that an R thing and is there no way around it?
Content-Type=”application/x-www-form-urlencoded”, so the body/query should be url-encoded.