A Layman's Dive Into Database Management Systems

Start with something you know

To the untrained eye, databases can be intimidating: niche terminology, impenetrable numbers and bare-bones user-interfaces can leave even the more tech-savvy among us scratching our heads. In commercial/business contexts, navigating a database will likely require thorough industry-specific knowledge in order to utilise it for adequate decision-making. A more effective learning trajectory for SQL and the MonetDB suite (or any database management system for that matter) should comprise something the user is already familiar with. To illustrate this point, this blog post explores how one can navigate a particular database, within the scope of the MonetDB suite, that you are likely already familiar with: The Internet Movie Database (IMDb).

Non-commercial datasets have been made available by IMDb in the tab-separated values (.tsv) file format. For access to and descriptions of the datasets, click on the following link: https://developer.imdb.com/non-commercial-datasets/

After downloading the compressed files and extracting the datasets, create a new database imdb using the shell commands displayed below, assuming you have already installed MonetDB and a MonetDB database server is already running on your system (if not, first go to https://www.monetdb.org/easy-setup/):

monetdb create imdb
monetdb release imdb

Now start the mclient command-line interface by executing mclient -d imdb in the shell. Then, create the schema by executing the following:

CREATE TABLE name_basics(
    nconst              TEXT NOT NULL,
    primaryName         TEXT,
    birthYear           INTEGER,
    deathYear           INTEGER,
    primaryProfession   TEXT,
    knownForTitles      TEXT
);

CREATE TABLE title_akas(
    titleId             TEXT NOT NULL,
    ordering            INTEGER,
    title               TEXT,
    region              TEXT,
    language            TEXT,
    types               TEXT,
    attributes          TEXT,
    isOriginalTitle     BOOLEAN
);

CREATE TABLE title_basics(
    tconst              TEXT NOT NULL,
    titleType           TEXT,
    primaryTitle        TEXT,
    originalTitle       TEXT,
    isAdult             BOOLEAN,
    startYear           INTEGER,
    endYear             INTEGER,
    runtimeMinutes      INTEGER,
    genres              TEXT
);

CREATE TABLE title_crew(
    tconst              TEXT NOT NULL,
    directors           TEXT,
    writers             TEXT
);

CREATE TABLE title_episode(
    tconst              TEXT NOT NULL,
    parentTconst        TEXT,
    seasonNumber        INTEGER,
    episodeNumber       INTEGER
);

CREATE TABLE title_principals(
    tconst              TEXT NOT NULL,
    ordering            INTEGER,
    nconst              TEXT,
    category            TEXT,
    job                 TEXT,
    characters          TEXT
);

CREATE TABLE title_ratings(
    tconst              TEXT NOT NULL,
    averageRating       DECIMAL,
    numVotes            INTEGER
);

Now it is possible to load the data using the statements provided below. Make sure to change the file paths to the correct location(s) where the datasets have been saved.

COPY OFFSET 2 INTO name_basics FROM '/path/to/name.basics.tsv' 
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';

COPY OFFSET 2 INTO title_akas FROM '/path/to/title.akas.tsv' 
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';

COPY OFFSET 2 INTO title_basics FROM '/path/to/title.basics.tsv' 
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';

COPY OFFSET 2 INTO title_crew FROM '/path/to/title.crew.tsv' 
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';

COPY OFFSET 2 INTO title_episode FROM '/path/to/title.episode.tsv' 
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';

COPY OFFSET 2 INTO title_principals FROM '/path/to/title.principals.tsv' 
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';

COPY OFFSET 2 INTO title_ratings FROM '/path/to/title.ratings.tsv' 
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';

Assuming a successful outcome, it is now time to explore and find something to watch!

Let’s start with a simple task: Making a list of all the works of a big-name director. This summer in the Netherlands, where MonetDB Solutions is based, many cinemas have been showing the works of acclaimed Japanase director Akira Kurosawa. To create a list comprising each film he has directed, its year of release and the unique identification key of the director, execute the following query:

SELECT
        startyear AS "Year",
        primarytitle AS "Film Title",
        directors AS "Director(s)"
FROM
        name_basics,
        title_basics,
        title_crew
WHERE
        primaryname ILIKE 'Akira Kurosawa' AND
        nconst = directors AND
        title_crew.tconst = title_basics.tconst
ORDER BY
        startyear,
        primarytitle
;

You should receive the following output:

+------+---------------------------------------+-------------+
| Year | Film Title                            | Director(s) |
+======+=======================================+=============+
| 1943 | Sanshiro Sugata                       | nm0000041   |
| 1944 | The Most Beautiful                    | nm0000041   |
| 1945 | Sanshiro Sugata, Part Two             | nm0000041   |
| 1945 | The Men Who Tread on the Tiger's Tail | nm0000041   |
| 1946 | No Regrets for Our Youth              | nm0000041   |
| 1947 | One Wonderful Sunday                  | nm0000041   |
| 1948 | Drunken Angel                         | nm0000041   |
| 1949 | Stray Dog                             | nm0000041   |
| 1949 | The Quiet Duel                        | nm0000041   |
| 1950 | Rashomon                              | nm0000041   |
| 1950 | Scandal                               | nm0000041   |
| 1951 | The Idiot                             | nm0000041   |
| 1952 | Ikiru                                 | nm0000041   |
| 1954 | Seven Samurai                         | nm0000041   |
| 1955 | I Live in Fear                        | nm0000041   |
| 1957 | The Lower Depths                      | nm0000041   |
| 1957 | Throne of Blood                       | nm0000041   |
| 1958 | The Hidden Fortress                   | nm0000041   |
| 1960 | The Bad Sleep Well                    | nm0000041   |
| 1961 | Yojimbo                               | nm0000041   |
| 1962 | Sanjuro                               | nm0000041   |
| 1963 | High and Low                          | nm0000041   |
| 1965 | Red Beard                             | nm0000041   |
| 1970 | Dodes'ka-den                          | nm0000041   |
| 1975 | Dersu Uzala                           | nm0000041   |
| 1980 | Kagemusha: The Shadow Warrior         | nm0000041   |
| 1985 | Ran                                   | nm0000041   |
| 1991 | Rhapsody in August                    | nm0000041   |
+------+---------------------------------------+-------------+
28 tuples

As you can see, it is a rather extensive body of work spanning decades. Let’s verify the accuracy of this outcome by comparing it to the director’s filmography displayed on the Wikipedia page List of works by Akira Kurosawa. Upon closer inspection, it becomes evident that our list is incomplete. Namely, three films are missing: Those Who Make Tomorrow (1946), Dreams (1990) and Madadayo (1993). From this, we learn an important lesson: Never underestimate the utility of verification using external sources when possible.

Now let’s investigate.

Using the absent titles and their years of release as search terms, use the following query to pull the missing data:

SELECT
        startyear AS "Year",
        primarytitle AS "Film Title",
        directors AS "Director(s)"
FROM
        title_basics,
        title_crew
WHERE
        (primarytitle ILIKE 'Madadayo' OR
        (primarytitle ILIKE 'Dreams' AND startyear = 1990 AND directors IS NOT NULL) OR
        primarytitle ILIKE 'Those Who Make Tomorrow') AND
        title_basics.tconst = title_crew.tconst
ORDER BY
        startyear,
        primarytitle
;

The output should be:

+------+-------------------------+-------------------------------+
| Year | Film Title              | Director(s)                   |
+======+=========================+===============================+
| 1946 | Those Who Make Tomorrow | nm0000041,nm0782846,nm0945428 |
| 1990 | Dreams                  | nm0000041,nm0393094           |
| 1993 | Madadayo                | nm0000041,nm0393094           |
+------+-------------------------+-------------------------------+
3 tuples

Now we see a pattern: all the missing titles have more than one director, one of whom is of course Akira Kurosawa himself as we can see his identification key nm0000041. But given that the identification keys of other directors are lumped into the same string, the initial query used to pull the first list does not pick this up. Unfortunately, this is a common occurrence in the world of data science/engineering; it is often necessary to restructure a dataset before analyzing it per the ETL methodology.

Let’s look for solutions.

A quick fix to this problem is to replace the condition nconst = directors with nconst = splitpart(directors, ',', 1) in the initial query:

SELECT
        startyear AS "Year",
        primarytitle AS "Film Title",
        directors AS "Director(s)"
FROM
        name_basics,
        title_basics,
        title_crew
WHERE
        primaryname ILIKE 'Akira Kurosawa' AND
        nconst = splitpart(directors, ',', 1) AND
        title_crew.tconst = title_basics.tconst
ORDER BY
        startyear,
        primarytitle
;

As a result, the director’s collaborations in which he is credited first as the primary director will also be pulled along with the works in which he is the sole contributor:

+------+---------------------------------------+-------------------------------+
| Year | Film Title                            | Director(s)                   |
+======+=======================================+===============================+
| 1943 | Sanshiro Sugata                       | nm0000041                     |
| 1944 | The Most Beautiful                    | nm0000041                     |
| 1945 | Sanshiro Sugata, Part Two             | nm0000041                     |
| 1945 | The Men Who Tread on the Tiger's Tail | nm0000041                     |
| 1946 | No Regrets for Our Youth              | nm0000041                     |
| 1946 | Those Who Make Tomorrow               | nm0000041,nm0782846,nm0945428 |
| 1947 | One Wonderful Sunday                  | nm0000041                     |
    :               :                                :
    :               :                                :
    :               :                                :
| 1985 | Ran                                   | nm0000041                     |
| 1990 | Dreams                                | nm0000041,nm0393094           |
| 1991 | Rhapsody in August                    | nm0000041                     |
| 1993 | Madadayo                              | nm0000041,nm0393094           |
+------+---------------------------------------+-------------------------------+
31 tuples

To catch a glimpse of the works in which he has been credited as an assistant director, one needs to change the index n in the string function splitpart(directors, ',', n) to an integer quantity above 1. Whilst the simplicity of this fix is appealing, it should not be understated that a more lasting solution is warranted if one intends to use this database for more substantial analyses.

Nevertheless, now that we have a complete list of his major works, let’s perform some statistical analyses.

Sitting through the entire filmography of the acclaimed director would require a lot of dedication, especially considering the lengthy running times of many of his titles by today’s standards. So if you plan to indulge, of course you want to see his best work. Let’s pull the highest rated titles, all with a weighted average rating above or equal to 8 according to the IMDb rating system and with enough votes to be eligible for the Top 250 list, which as of writing is 25000.

Execute the following query:

SELECT
        startyear AS "Year",
        primarytitle AS "Film Title",
        averagerating AS "Weighted Average >= 8.0",
        numvotes AS "Number of Votes",
        runtimeminutes AS "Running Time [minutes]"
FROM
        name_basics,
        title_basics,
        title_crew,
        title_ratings
WHERE
        primaryname ILIKE 'Akira Kurosawa' AND
        nconst = splitpart(directors, ',', 1) AND
        title_crew.tconst = title_basics.tconst AND
        title_ratings.tconst = title_basics.tconst AND
        averagerating >= 8 AND
        numvotes >= 25000
ORDER BY
        startyear,
        primarytitle
;

The result should be the following:

+------+---------------------+-------------------------+-----------------+------------------------+
| Year | Film Title          | Weighted Average >= 8.0 | Number of Votes | Running Time [minutes] |
+======+=====================+=========================+=================+========================+
| 1950 | Rashomon            |                   8.100 |          191347 |                     88 |
| 1952 | Ikiru               |                   8.300 |           98135 |                    143 |
| 1954 | Seven Samurai       |                   8.600 |          388943 |                    207 |
| 1957 | Throne of Blood     |                   8.000 |           58614 |                    110 |
| 1958 | The Hidden Fortress |                   8.000 |           44532 |                    126 |
| 1961 | Yojimbo             |                   8.200 |          139272 |                    110 |
| 1962 | Sanjuro             |                   8.000 |           43163 |                     96 |
| 1963 | High and Low        |                   8.400 |           62440 |                    143 |
| 1975 | Dersu Uzala         |                   8.200 |           35372 |                    142 |
| 1985 | Ran                 |                   8.200 |          145874 |                    160 |
+------+---------------------+-------------------------+-----------------+------------------------+
10 tuples

Now let’s see how much time it would take to watch all of the titles above via the following query:

SELECT
        SUM( CASE WHEN 
                primaryname ILIKE 'Akira Kurosawa' AND
             	averagerating >= 8 AND
             	numvotes >= 25000 THEN
             	runtimeminutes
            END ) AS "Total Viewing Time [minutes]"
FROM
        name_basics,
        title_basics,
        title_crew,
        title_ratings
WHERE
        nconst = splitpart(directors, ',', 1) AND
        title_crew.tconst = title_basics.tconst AND
        title_ratings.tconst = title_basics.tconst
;

The total viewing time for a dedicated cinephile will be:

+------------------------------+
| Total Viewing Time [minutes] |
+==============================+
|                         1325 |
+------------------------------+
1 tuple

This is just over 22 hours of viewing time, so quite a commitment. Of course, viewings can be spread out over a longer period of time and the same way we advise you not to embark on a sleepless binge of the great director’s masterworks, we also encourage you to take the time to learn about SQL and the MonetDB suite if you intend to use it for more complex analyses. For now, we hope this introductory quick dive into the MonetDB suite has given you a taste of its capabilities and inspired you to further explore.

For more updates, keep an eye out on our blog and GitHub ;)