MonetDB.R TutorialMonetDB.R Tutorial dnedev Fri, 06/20/2014 - 18:00
The MonetDB.R tutorial is designed to familiarise R users with the functionality of the MonetDB.R connector and dplyr library for data manipulation in R.
Before stating with this tutorial, make sure you have a running instance on MonetDB to which you can connect. As such, the MonetDB/SQL tutorial is a recommended reading, even more, given that this page was adapted from the SQL tutorial.
To start, in case you have not done it yet, create a DB farm using the MonetDB daemon monetdbd and a new database called 'voc'. The daemon is controlled by the application monetdb. See its documentation for all details. Note: The monetdbd is not available on Windows. For that reason, on these systems a default database called demo is available by default (Windows-specific MonetDB tutorial is available here).
At this time you will be prompted to select a mirror.
After MonetDB.R is installed you must load it.
> library(MonetDB.R) Loading required package: DBI Loading required package: digest
Now, let's create a connection to the database.
Note that by default MonetDB runs on port 50000, but MonetDB.R already knows that, thus there is no need to set the port. Also, in the tutorial we've use the demo database, which is the default on Windows.
> conn <- dbConnect(MonetDB.R(), host="localhost", dbname="demo", user="monetdb", password="monetdb")
Let's execute a dummy 'hello world' SQL query and in our database directly from R.
> dbGetQuery(conn,"SELECT 'hello world'") single_value 1 hello world
To illustrate the use of the newly created account and schema, the following example connects and creates a table, which is dropped afterwards by aborting the transaction.
> dbTransaction(conn) > dbSendUpdate(conn,"CREATE TABLE test (id int, data varchar(30))") > dbListTables(conn)  "test" > dbExistsTable(conn,"test")  TRUE > dbRollback(conn)  TRUE > dbListTables(conn) character(0)
For the purpose of the tutorial we have prepared a TSV (tab-separated values) version of the VOC dataset, which contains data for around 8000 voyages. After unpacking the .zip you can import the data into the directly from R.
Using the monetdb.read.csv function we can load the specified file as a new table 'craftsmen', specifying the upper bound on rows to import (2500), the delimiter (\t). The last parameter indicates that at least 1474 rows need to be read, before before the function can safely auto-deretmine the datatype of the columns.
> file <- "voc_craftsmen.tsv" > monetdb.read.csv(conn, file, "craftsmen", 2350, delim="\t", nrow.check=1474) voc_craftsmen.tsv /Users/monetdb/voc_craftsmen.tsv L1 1 2348
Now, let's repeat the same for the other files in the dataset.
> file <- "voc_impotenten.tsv" monetdb.read.csv(conn, file, "impotenten", 1000, delim="\t", nrow.check=160) file <- "voc_invoices.tsv" monetdb.read.csv(conn, file, "invoices", 4000, delim="\t", nrow.check=1000) file <- "voc_passengers.tsv" monetdb.read.csv(conn, file, "passengers", 3000, delim="\t", nrow.check=1000) file <- "voc_seafarers.tsv" monetdb.read.csv(conn, file, "seafarers", 4500, delim="\t", nrow.check=4500) file <- "voc_soldiers.tsv" monetdb.read.csv(conn, file, "soldiers", 4500, delim="\t", nrow.check=4000) file <- "voc_total.tsv" monetdb.read.csv(conn, file, "total", 2500, delim="\t", nrow.check=2000) file <- "voc_voyages.tsv" monetdb.read.csv(conn, file, "voyages", 8200, delim="\t", nrow.check=5000)
Next, let's install and load dplyr, which is designed to facilitate efficient datasets manipulation in R.
> install.packages("dplyr") > library(dplyr) Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union
Make a new connection to the database for dplyr, specifying that the source is a MonetDB instance.
> monetdb_conn <- src_monetdb("demo")
Now, lets create non-matirialized views for all tables from the database we are going to need later in R. Note that none of these will actaully move data from MonetDB into R, rather we are just creating a reference to the tables.
> craftsmen <- tbl(monetdb_conn, "craftsmen") impotenten <- tbl(monetdb_conn, "impotenten") invoices <- tbl(monetdb_conn, "invoices") passengers <- tbl(monetdb_conn, "passengers") seafarers <- tbl(monetdb_conn, "seafarers") soldiers <- tbl(monetdb_conn, "soldiers") total <- tbl(monetdb_conn, "total") voyages <- tbl(monetdb_conn, "voyages")
Let's check out the data we just loaded. First, let's see the dimensions and the column names of the voyages table.
> dim(voyages)  8131 22 > colnames(voyages)  "number" "number_sup" "trip"  "trip_sup" "boatname" "master"  "tonnage" "type_of_boat" "built"  "bought" "hired" "yard"  "chamber" "departure_date" "departure_harbour"  "cape_arrival" "cape_departure" "cape_call"  "arrival_date" "arrival_harbour" "next_voyage"  "particulars"
As you can see, the respose is actaully an R array.
Since we have more than 8000 rows, let's have a look at the first few.
If you are interested in what queries are executed in MonetDB you can turn on the query debug
> options(monetdb.debug.query=T) > head(voyages) QQ: 'SELECT "number", "number_sup", "trip", "trip_sup", "boatname", "master", "tonnage", "type_of_boat", "built", "bought", "hired", "yard", "chamber", "departure_date", "departure_harbour", "cape_arrival", "cape_departure", "cape_call", "arrival_date", "arrival_harbour", "next_voyage", "particulars" FROM "voyages" LIMIT 6' QQ: Query result for query 29 with 6 rows and 22 cols, 6 rows.
As you can see, the connector uses the native LIMIT SQL function for the head query.
Disable the debug view for now.
Let's see how many impotenten we have in the dataset:
> summarise(impotenten, count = n()) count 1 938 .. ...
It is possible to play with the set in many ways, to find out several things that took place during the voyages of the ships, or the money that was earned. A few examples are shown below:
> invoices_grouped_per_chamber <- group_by(invoices, chamber) > average_invoices_per_chamber <- summarise(invoices_grouped_per_chamber, average = mean(invoice)) > arrange(average_invoices_per_chamber, desc(average)) Arrange: desc(average) chamber average 1 A 282996.6 2 Z 259300.5 3 H 150182.4 4 R 149628.3 5 D 149522.8 6 E 149518.0 7 NA 83309.0 .. ... ...
The MonetDB also supports embedded R queries in the database. This feature was introduced in the Oct2014 release. Check out this blog post for more information.
About the VOC dataset
The data for the tutorial is published in the book J.R. Bruijn, F.S. Gaastra and I. Schaar Dutch-Asiatic Shipping in the 17th and 18th Centuries, which gives an account of the trips made to the East and ships returned safely (or wrecked on the way) by the VOC (or Dutch East India Company). A total of 8000 records are provided. They include information about ship name and type, captain, the arrival/departure of harbors along the route, personnel accounts, and anecdotal information. You can find more about the VOC on Wikipedia.