Document Management

Document management in MonetDB/XQuery is done using XQuery queries. To make this possible, we extended XQuery with new built-in functions, under the namespace pf (i.e. http://pathfinder-xquery.org/).

We recommend reading this section to learn about the difference between documents and collections, as well as between read-only and updatable collections. Once you understand these concepts, note that an alternative way of adding, deleting, and inspecting XML databases is to use the Administrative GUI.

Additional information can be found in the Document Management section of the Reference Manual.

Adding a Document

Suppose we want to add the XML document HelloWorld.xml to our database. Of course, you first need to start your Mserver (see the Hello World example).

Open an interactive mclient XQuery session (mclient -lx) and type:

   > pf:add-doc("http://monetdb.cwi.nl/XQuery/files/HelloWorld.xml", 
                "HelloWorld.xml")

Send this query to the server by closing with CTRL-D (CTRL-Z on Windows) on a new (empty) line at the more> prompt.

The "HelloWorld.xml" document is now stored persistently in the database!

The first argument of pf:add-doc is the URL where the XML file can be found. The second argument is the logical name by which we can reference the document. A logical name has to be unique: adding a document under a name that is already present leads to an error.

XQuery Access to the Database: we can now run XQuery queries that retrieve the document (see Hello World Example 8) by its logical name, directly from the database:

  > doc("HelloWorld.xml")

  <?xml version="1.0" encoding="utf-8"?>
  <doc>
   <greet kind="informal">Hi </greet>
   <greet kind="casual">Hello </greet>
   <location kind="global">World</location>
   <location kind="local">Amsterdam</location>
  </doc>

For large documents, you can see a strong performance difference between querying web documents and querying documents added to the database. The former always have to be fetched (e.g. using HTTP) and read in full, whereas the latter are directly available, and MonetDB/XQuery exploits the database indices automatically created on them.

You can also reference the document stored in the database by its original URL (i.e. "http://monetdb.cwi.nl/XQuery/files/HelloWorld.xml"). URLs in the database do not need to be unique, unlike document names, i.e. the same URL may be stored multiple times under different names.
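
As a small illustration of the two ways of addressing a stored document, the following sketch runs the same path expression once against the logical name and once against the original URL. It assumes HelloWorld.xml was added as shown above; the path //greet is chosen only for illustration, based on the document contents listed earlier.

   > doc("HelloWorld.xml")//greet

   > doc("http://monetdb.cwi.nl/XQuery/files/HelloWorld.xml")//greet

According to the description above, both forms address the copy stored in the database.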

A difference between documents that were explicitly added with pf:add-doc() and those retained in the XML Cache is that the former remain in the database indefinitely, regardless of the caching rules. Cached documents may be thrown away at any time, and cached file:// URLs are always deleted if the file on disk has changed.

Web Access to the Database: If you have MonetDB/XQuery running on your local machine, you can also access the stored document in your browser at http://localhost:50001/xrpc/doc/HelloWorld.xml. That is, all documents are accessible on the built-in HTTP server of MonetDB/XQuery, by prefixing the document name with xrpc/doc. This built-in HTTP server runs on port 50001 by default (mapi_port+1).

If MonetDB/XQuery is running on a different machine than the one you use to browse this documentation, point your browser to http://machine:50001/xrpc/doc/HelloWorld.xml instead, where machine is the name of that host.

Deleting a Document

Start an mclient XQuery session and type:

  > pf:del-doc("HelloWorld.xml")

After this, the query doc("HelloWorld.xml") no longer works and returns an error stating that the document is not in the collection.

The pf:add-doc() and pf:del-doc() functions can be used anywhere in an XQuery query. Note, however, that they are not allowed in updating queries. If a user-defined function uses a document management function, it cannot return a value, and it must be declared as a document management function (similar to an updating function).
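
Because these functions can appear anywhere in a query, several documents can be removed in one go. The following is a minimal sketch; the names a.xml and b.xml are hypothetical and stand for documents that were previously added under those logical names.

   > for $name in ("a.xml", "b.xml")
     return pf:del-doc($name)

This executes pf:del-doc() once per name, so each name must refer to a document that is actually stored.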

Document Collections

MonetDB/XQuery groups documents in so-called collections. This is useful for organizing large numbers of documents. Storing documents together in the same collection also makes opening and querying many (small) documents much more efficient.

As an example, we may have stored the XML documents book.xml and bib.xml together in a collection "my-collection" using the following XQuery:

   > for $name in ("book.xml", "bib.xml")
     let $url := concat("http://monetdb.cwi.nl/XQuery/files/", $name)
     return pf:add-doc($url, $name, "my-collection")

Note that this XQuery executes the pf:add-doc() function twice, and it gets passed a third parameter (the collection name "my-collection").

If a document is added to the database without this third parameter (collection name), it is in fact stored in a new collection that has the same name as the document. If a collection by that name already exists, the new document is added to that collection.
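
In other words, assuming the default behaviour described above, the two-argument call from the beginning of this section should have the same effect as the following three-argument call, which names the single-document collection explicitly. It is meant to be run instead of the earlier call, not in addition to it, since logical names must be unique.

   > pf:add-doc("http://monetdb.cwi.nl/XQuery/files/HelloWorld.xml", 
                "HelloWorld.xml", "HelloWorld.xml")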

Opening All Documents in a Collection

There are two similar functions by the name collection() that take a collection name as parameter and provide access to the XML nodes of all documents in the collection.

Typing pf:collection("my-collection")//author gives exactly the same result as:

   > fn:collection("my-collection")//author

   <author>Serge Abiteboul</author>
   <author>Peter Buneman</author>
   <author>Dan Suciu</author>
   <author><last>Stevens</last><first>W.</first></author>
   <author><last>Stevens</last><first>W.</first></author>
   <author><last>Abiteboul</last><first>Serge</first></author>
   <author><last>Buneman</last><first>Peter</first></author>
   <author><last>Suciu</last><first>Dan</first></author>

Note that we get all authors from document bib.xml (i.e. the first three) before those of book.xml. This is because "bib.xml" was shredded before "book.xml", hence all node identifiers of "bib.xml" precede those of "book.xml".

The difference between pf:collection() and fn:collection() is that the former returns a single node, the so-called collection node, whereas the latter returns many nodes, namely the document nodes of all documents in a collection.

   > count(fn:collection("my-collection"))

   2

   > count(pf:collection("my-collection"))

   1 

pf:collection() returns a single special collection node, whose immediate children are the document nodes. Therefore, fn:collection("my-collection") is roughly equivalent to pf:collection("my-collection")/*.
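
Although the two functions return a different number of nodes, path expressions that descend from either starting point reach the same elements. As a sketch, assuming my-collection holds exactly the two documents loaded above, both of the following queries count the <author> elements from the earlier listing and should therefore agree (given the eight elements shown there, both should return 8):

   > count(fn:collection("my-collection")//author)

   > count(pf:collection("my-collection")//author)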

On collections that contain many (more than 1000) documents, we advise using pf:collection() to start XML navigation.

Inspecting the XML Database

We can list all collections in our database using the following XQuery:

   > pf:collections()

   <collection updatable="false" 
               size="0 MB" 
               numDocs="2">my-collection</collection>

We get information in the form of <collection> elements. As attributes, we get information on the size of the collection, the number of documents in it, and whether it is updatable (see below).
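
Assuming the result can be navigated like ordinary XML elements, it can be processed further with XQuery. The following is a minimal sketch that turns each <collection> element into a one-line summary; it relies only on the attribute names visible in the output above (numDocs, size, updatable).

   > for $c in pf:collections()
     return concat(string($c), ": ", string($c/@numDocs), 
                   " document(s), ", string($c/@size),
                   ", updatable=", string($c/@updatable))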

We can also list all documents in the database:

   > pf:documents()

   <document updatable="false" 
             url="http://monetdb.cwi.nl/XQuery/files/book.xml" 
             collection="my-collection">book.xml</document>,
   <document updatable="false" 
             url="http://monetdb.cwi.nl/XQuery/files/bib.xml" 
             collection="my-collection">bib.xml</document>,

In the <document> elements, we also see the URL from where the document was loaded.

It is also possible to view the documents in a specific collection, by passing the collection name as parameter to pf:documents().
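
For example, the first query below restricts the listing to the collection created earlier. The second is a further sketch that extracts only the source URLs, assuming the url attribute appears as in the output shown above.

   > pf:documents("my-collection")

   > for $d in pf:documents("my-collection")
     return string($d/@url)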

Updatable vs. Read-Only Collections

From MonetDB/XQuery 0.16 on, it is possible to update XML documents. Each document belongs to one collection, even if a document was added without mentioning a collection (in which case it is stored in a single-document collection by the same name as the document).

Collections can be either updatable or read-only, where read-only is the default. Thus, all the collections we created so far (i.e. "HelloWorld.xml" and "my-collection") were read-only. Once a collection is created, its mode (read-only or updatable) cannot be changed anymore.

To get an updatable collection, pf:add-doc() must be called with an additional parameter, percentage, which should be an integer between 1 and 99:

   > pf:add-doc("http://monetdb.cwi.nl/XQuery/files/HelloWorld.xml", 
                "greetings.xml", "greetings.xml", 10)

The extra parameter indicates the percentage of free space that will be left on the table pages internally, to cheaply accommodate inserts.
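
To check the result, the inspection functions described earlier can be used. As a sketch, assuming the call above succeeded, the following query looks up the updatable attribute of the new collection "greetings.xml":

   > for $c in pf:collections()
     where string($c) = "greetings.xml"
     return string($c/@updatable)

For a collection created with a percentage parameter, this attribute is expected to differ from the "false" value shown for the read-only collections above.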

We now invite you to continue reading the tutorial on XML Updates.
