Skip to main content

XQuery VOC

Introduction on the VOC data set

Exploring the wealth of functionality offered by MonetDB is best started using a toy database. An example of such database is the VOC data set that provides a peephole view into the administrative system of a multi-national company, the Vereenigde geoctrooieerde Oostindische Compagnie (VOC for short - The (Dutch) East Indian Company).

The VOC was granted a monopoly on the trade in the East Indies on March 20, 1602 by the representatives of the provinces of the Dutch republic. Attached to this monopoly was the duty to fight the enemies of the Republic and prevent other European nations to enter the East India trade. During its history of over 200 years, the VOC became the largest company of its kind, trading spices like nutmeg, cloves, cinnamon, pepper, and other consumer products like tea, silk and Chinese porcelain. Her factories or trade centers were world famous: Desjima in Japan, Mokha in Yemen, Surat in Persia and of course Batavia, the Company's headquarters on Java.

The history of the VOC is an active area of research and a focal point for multi-country heritage projects, e.g. TANAP, which includes a short historic overview of the VOC written by world expert on the topic F. Gaastra. The archives of the VOC are spread around the world, but a large contingent still resides in the National Archive, The Hague. The archives comprise over 25 million historical records. Much of which has not (yet) been digitized.

The MonetDB/XQuery tutorial is based on the material published in the book J.R. Bruijn, F.S. Gaastra and I. Schaar, Dutch-Asiatic Shipping in the 17th and 18th Centuries, which gives an account of the trips made to the East and ships returned safely (or wrecked on the way). Details on over 8000 voyages are provided. They include information about ship name and type, captain, the arrival/departure of harbors along the route, personnel accounts, and anecdotal information.

Data set

Download the gzipped data-set voc.xml.gz (0.6MB)
and the packaged XQuery scripts voc_xq.tar.gz (746B)

The dataset has been created by scanning old ship-logs. These ship-logs were scanned per page and then an OCR-application produced the XML-output. First the left pages were scanned and after that the right pages were scanned. Information about a single voyage is therefore split into two separate records in the data-set. This may look as a disadvantage, but in fact it represents the real world quite well; we often need to query multiple sources. Once you start executing queries you will also note that the dataset has many handicaps (names are sometimes spelled different, information is missing or placed in other fields, or punctuation is incorrect).

The typical voyage record looks something like the example voyage below. It does however not have a fixed structure. For example: the onboard node may sometimes be split into soldiers, seafarers, etc nodes, but it might as well just give the totals.

<VOC>

  ...
 
  <voyage>
    <number>4411</number>
    <trip>1</trip>
    <leftpage>
      <boatname>GERECHTIGHEID</boatname>
      <master>Isaak Segon</master>
      <tonnage>1150</tonnage>
      <built>1779</built>
      <yard>A</yard>
      <chamber>A</chamber>
      <departure>1783-03-04</departure>
      <harbour>Rammekens</harbour>
      <callatcape><arrival>1783-07-19</arrival><departure>1783-08-22</departure></callatcape>
      <destination><arrival>1783-11-08</arrival><harbour>Batavia</harbour></destination>
    </leftpage>
    <rightpage>
      <onboard>
	<seafarers><one>149</one><two>3</two><three>3</three><six>155</six></seafarers>
	<soldiers><one>135</one><six>118</six></soldiers>
	<craftsmen><one>6</one><six>6</six></craftsmen>
	<passengers><one>6</one><six>5</six></passengers>
      </onboard>
      <particulars><next>8103</next> On 24-11-1782 the GERECHTIGHEID departed 
        from Texel for Rammekens. At Texel a mutiny broke out, but was suppressed.
      </particulars>
    </rightpage>
  </voyage>
  
  ...
  
</VOC>

Example Queries

Once you have added the voc.xml data set into your collection, you can start executing queries. If you don't know how to add documents, please have a look at Data Management. Lets start with a simple query: counting the number of voyages in the dataset:

voc1.xq:
  count(doc("voc.xml")//voyage)

We could list all the harbors the ship ZEELANDIA departed from in a time interval:

voc2.xq:
  for $t in doc("voc.xml")//voyage[leftpage/boatname="ZEELANDIA"]
  let $l := $t/leftpage
  order by zero-or-one($l/departure)
  return 
    <voyage> {
      $l/departure/text(), text { ", " }, $l/harbour/text()
    } </voyage>

We can display the details of the ship differently, for example by summing and listing some values:

voc3.xq:
  let $a := doc("voc.xml")//voyage[leftpage/boatname="ZEELANDIA"]
  return 
    <ship>
      <name> { distinct-values($a/leftpage/boatname) }</name>
      <visited> { 
	distinct-values($a//harbour/text())
      } </visited>
      <crew> {
	for $t in $a
	let $c := sum(
            for $m in $t/rightpage/onboard//*/text()
            let $s := exactly-one($m) cast as xs:integer
            return $s)
	where $c > 0
	return 
	  <voyage> { 
            $t/leftpage/departure,
            <onboard> { $c } </onboard> 
	  } </voyage>
      } </crew>
    </ship>

More interesting would it be to see how many ships sunk, or in other words, did not arrive at their destination:

voc4.xq:
 count(doc("voc.xml")//voyage[not(.//destination/arrival)])

Note that the results of the previous query might be influenced by 'missing' arrival information. We could try to be more precise, by requiring that the word 'wrecked' or 'sunk' should be in the particulars:

voc5.xq:
  for $t in doc("voc.xml")//voyage
  let $p := zero-or-one($t/rightpage/particulars/text())
  where not($t//destination/arrival)
          and (contains($p,"wrecked")
          or contains($p,"sunk"))
  order by zero-or-one($t/leftpage/boatname/text())
  return 
    element { "boat" } { 
      element { "name" } { $t/leftpage/boatname/text() }, 
      text { ": " },
      element { "description" } { $t/rightpage/particulars/text() } 
    } 

Many more interesting queries could be asked. Hopefully the above has given you some inspiration to try some yourself.