To the untrained eye, databases can be intimidating: niche terminology, impenetrable numbers and bare-bones user-interfaces can leave even the more tech-savvy among us scratching our heads. In commercial/business contexts, navigating a database will likely require thorough industry-specific knowledge in order to utilise it for adequate decision-making. A more effective learning trajectory for SQL and the MonetDB suite (or any database management system for that matter) should comprise something the user is already familiar with. To illustrate this point, this blog post explores how one can navigate a particular database, within the scope of the MonetDB suite, that you are likely already familiar with: The Internet Movie Database (IMDb).
Non-commercial datasets have been made available by IMDb in the tab-separated values (.tsv) file format. For access to and descriptions of the datasets, click on the following link: https://developer.imdb.com/non-commercial-datasets/
After downloading the compressed files and extracting the datasets, create a new database imdb using the shell commands displayed below, assuming you have already installed MonetDB and a MonetDB database server is already running on your system (if not, first go to https://www.monetdb.org/easy-setup/):
monetdb create imdb
monetdb release imdb
Now start the mclient command-line interface by executing mclient -d imdb
in the shell. Then, create the schema by executing the following:
CREATE TABLE name_basics(
nconst TEXT NOT NULL,
primaryName TEXT,
birthYear INTEGER,
deathYear INTEGER,
primaryProfession TEXT,
knownForTitles TEXT
);
CREATE TABLE title_akas(
titleId TEXT NOT NULL,
ordering INTEGER,
title TEXT,
region TEXT,
language TEXT,
types TEXT,
attributes TEXT,
isOriginalTitle BOOLEAN
);
CREATE TABLE title_basics(
tconst TEXT NOT NULL,
titleType TEXT,
primaryTitle TEXT,
originalTitle TEXT,
isAdult BOOLEAN,
startYear INTEGER,
endYear INTEGER,
runtimeMinutes INTEGER,
genres TEXT
);
CREATE TABLE title_crew(
tconst TEXT NOT NULL,
directors TEXT,
writers TEXT
);
CREATE TABLE title_episode(
tconst TEXT NOT NULL,
parentTconst TEXT,
seasonNumber INTEGER,
episodeNumber INTEGER
);
CREATE TABLE title_principals(
tconst TEXT NOT NULL,
ordering INTEGER,
nconst TEXT,
category TEXT,
job TEXT,
characters TEXT
);
CREATE TABLE title_ratings(
tconst TEXT NOT NULL,
averageRating DECIMAL,
numVotes INTEGER
);
Now it is possible to load the data using the statements provided below. Make sure to change the file paths to the correct location(s) where the datasets have been saved.
COPY OFFSET 2 INTO name_basics FROM '/path/to/name.basics.tsv'
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';
COPY OFFSET 2 INTO title_akas FROM '/path/to/title.akas.tsv'
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';
COPY OFFSET 2 INTO title_basics FROM '/path/to/title.basics.tsv'
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';
COPY OFFSET 2 INTO title_crew FROM '/path/to/title.crew.tsv'
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';
COPY OFFSET 2 INTO title_episode FROM '/path/to/title.episode.tsv'
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';
COPY OFFSET 2 INTO title_principals FROM '/path/to/title.principals.tsv'
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';
COPY OFFSET 2 INTO title_ratings FROM '/path/to/title.ratings.tsv'
ON CLIENT USING DELIMITERS '\t', E'\n' NO ESCAPE NULL AS '\\N';
Assuming a successful outcome, it is now time to explore and find something to watch!
Let’s start with a simple task: Making a list of all the works of a big-name director. This summer in the Netherlands, where MonetDB Solutions is based, many cinemas have been showing the works of acclaimed Japanase director Akira Kurosawa. To create a list comprising each film he has directed, its year of release and the unique identification key of the director, execute the following query:
SELECT
startyear AS "Year",
primarytitle AS "Film Title",
directors AS "Director(s)"
FROM
name_basics,
title_basics,
title_crew
WHERE
primaryname ILIKE 'Akira Kurosawa' AND
nconst = directors AND
title_crew.tconst = title_basics.tconst
ORDER BY
startyear,
primarytitle
;
You should receive the following output:
+------+---------------------------------------+-------------+
| Year | Film Title | Director(s) |
+======+=======================================+=============+
| 1943 | Sanshiro Sugata | nm0000041 |
| 1944 | The Most Beautiful | nm0000041 |
| 1945 | Sanshiro Sugata, Part Two | nm0000041 |
| 1945 | The Men Who Tread on the Tiger's Tail | nm0000041 |
| 1946 | No Regrets for Our Youth | nm0000041 |
| 1947 | One Wonderful Sunday | nm0000041 |
| 1948 | Drunken Angel | nm0000041 |
| 1949 | Stray Dog | nm0000041 |
| 1949 | The Quiet Duel | nm0000041 |
| 1950 | Rashomon | nm0000041 |
| 1950 | Scandal | nm0000041 |
| 1951 | The Idiot | nm0000041 |
| 1952 | Ikiru | nm0000041 |
| 1954 | Seven Samurai | nm0000041 |
| 1955 | I Live in Fear | nm0000041 |
| 1957 | The Lower Depths | nm0000041 |
| 1957 | Throne of Blood | nm0000041 |
| 1958 | The Hidden Fortress | nm0000041 |
| 1960 | The Bad Sleep Well | nm0000041 |
| 1961 | Yojimbo | nm0000041 |
| 1962 | Sanjuro | nm0000041 |
| 1963 | High and Low | nm0000041 |
| 1965 | Red Beard | nm0000041 |
| 1970 | Dodes'ka-den | nm0000041 |
| 1975 | Dersu Uzala | nm0000041 |
| 1980 | Kagemusha: The Shadow Warrior | nm0000041 |
| 1985 | Ran | nm0000041 |
| 1991 | Rhapsody in August | nm0000041 |
+------+---------------------------------------+-------------+
28 tuples
As you can see, it is a rather extensive body of work spanning decades. Let’s verify the accuracy of this outcome by comparing it to the director’s filmography displayed on the Wikipedia page List of works by Akira Kurosawa. Upon closer inspection, it becomes evident that our list is incomplete. Namely, three films are missing: Those Who Make Tomorrow (1946), Dreams (1990) and Madadayo (1993). From this, we learn an important lesson: Never underestimate the utility of verification using external sources when possible.
Now let’s investigate.
Using the absent titles and their years of release as search terms, use the following query to pull the missing data:
SELECT
startyear AS "Year",
primarytitle AS "Film Title",
directors AS "Director(s)"
FROM
title_basics,
title_crew
WHERE
(primarytitle ILIKE 'Madadayo' OR
(primarytitle ILIKE 'Dreams' AND startyear = 1990 AND directors IS NOT NULL) OR
primarytitle ILIKE 'Those Who Make Tomorrow') AND
title_basics.tconst = title_crew.tconst
ORDER BY
startyear,
primarytitle
;
The output should be:
+------+-------------------------+-------------------------------+
| Year | Film Title | Director(s) |
+======+=========================+===============================+
| 1946 | Those Who Make Tomorrow | nm0000041,nm0782846,nm0945428 |
| 1990 | Dreams | nm0000041,nm0393094 |
| 1993 | Madadayo | nm0000041,nm0393094 |
+------+-------------------------+-------------------------------+
3 tuples
Now we see a pattern: all the missing titles have more than one director, one of whom is of course Akira Kurosawa himself as we can see his identification key nm0000041. But given that the identification keys of other directors are lumped into the same string, the initial query used to pull the first list does not pick this up. Unfortunately, this is a common occurrence in the world of data science/engineering; it is often necessary to restructure a dataset before analyzing it per the ETL methodology.
Let’s look for solutions.
A quick fix to this problem is to replace the condition nconst = directors
with nconst = splitpart(directors, ',', 1)
in the initial query:
SELECT
startyear AS "Year",
primarytitle AS "Film Title",
directors AS "Director(s)"
FROM
name_basics,
title_basics,
title_crew
WHERE
primaryname ILIKE 'Akira Kurosawa' AND
nconst = splitpart(directors, ',', 1) AND
title_crew.tconst = title_basics.tconst
ORDER BY
startyear,
primarytitle
;
As a result, the director’s collaborations in which he is credited first as the primary director will also be pulled along with the works in which he is the sole contributor:
+------+---------------------------------------+-------------------------------+
| Year | Film Title | Director(s) |
+======+=======================================+===============================+
| 1943 | Sanshiro Sugata | nm0000041 |
| 1944 | The Most Beautiful | nm0000041 |
| 1945 | Sanshiro Sugata, Part Two | nm0000041 |
| 1945 | The Men Who Tread on the Tiger's Tail | nm0000041 |
| 1946 | No Regrets for Our Youth | nm0000041 |
| 1946 | Those Who Make Tomorrow | nm0000041,nm0782846,nm0945428 |
| 1947 | One Wonderful Sunday | nm0000041 |
: : :
: : :
: : :
| 1985 | Ran | nm0000041 |
| 1990 | Dreams | nm0000041,nm0393094 |
| 1991 | Rhapsody in August | nm0000041 |
| 1993 | Madadayo | nm0000041,nm0393094 |
+------+---------------------------------------+-------------------------------+
31 tuples
To catch a glimpse of the works in which he has been credited as an assistant director, one needs to change the index n
in the string function splitpart(directors, ',', n)
to an integer quantity above 1. Whilst the simplicity of this fix is appealing, it should not be understated that a more lasting solution is warranted if one intends to use this database for more substantial analyses.
Nevertheless, now that we have a complete list of his major works, let’s perform some statistical analyses.
Sitting through the entire filmography of the acclaimed director would require a lot of dedication, especially considering the lengthy running times of many of his titles by today’s standards. So if you plan to indulge, of course you want to see his best work. Let’s pull the highest rated titles, all with a weighted average rating above or equal to 8 according to the IMDb rating system and with enough votes to be eligible for the Top 250 list, which as of writing is 25000.
Execute the following query:
SELECT
startyear AS "Year",
primarytitle AS "Film Title",
averagerating AS "Weighted Average >= 8.0",
numvotes AS "Number of Votes",
runtimeminutes AS "Running Time [minutes]"
FROM
name_basics,
title_basics,
title_crew,
title_ratings
WHERE
primaryname ILIKE 'Akira Kurosawa' AND
nconst = splitpart(directors, ',', 1) AND
title_crew.tconst = title_basics.tconst AND
title_ratings.tconst = title_basics.tconst AND
averagerating >= 8 AND
numvotes >= 25000
ORDER BY
startyear,
primarytitle
;
The result should be the following:
+------+---------------------+-------------------------+-----------------+------------------------+
| Year | Film Title | Weighted Average >= 8.0 | Number of Votes | Running Time [minutes] |
+======+=====================+=========================+=================+========================+
| 1950 | Rashomon | 8.100 | 191347 | 88 |
| 1952 | Ikiru | 8.300 | 98135 | 143 |
| 1954 | Seven Samurai | 8.600 | 388943 | 207 |
| 1957 | Throne of Blood | 8.000 | 58614 | 110 |
| 1958 | The Hidden Fortress | 8.000 | 44532 | 126 |
| 1961 | Yojimbo | 8.200 | 139272 | 110 |
| 1962 | Sanjuro | 8.000 | 43163 | 96 |
| 1963 | High and Low | 8.400 | 62440 | 143 |
| 1975 | Dersu Uzala | 8.200 | 35372 | 142 |
| 1985 | Ran | 8.200 | 145874 | 160 |
+------+---------------------+-------------------------+-----------------+------------------------+
10 tuples
Now let’s see how much time it would take to watch all of the titles above via the following query:
SELECT
SUM( CASE WHEN
primaryname ILIKE 'Akira Kurosawa' AND
averagerating >= 8 AND
numvotes >= 25000 THEN
runtimeminutes
END ) AS "Total Viewing Time [minutes]"
FROM
name_basics,
title_basics,
title_crew,
title_ratings
WHERE
nconst = splitpart(directors, ',', 1) AND
title_crew.tconst = title_basics.tconst AND
title_ratings.tconst = title_basics.tconst
;
The total viewing time for a dedicated cinephile will be:
+------------------------------+
| Total Viewing Time [minutes] |
+==============================+
| 1325 |
+------------------------------+
1 tuple
This is just over 22 hours of viewing time, so quite a commitment. Of course, viewings can be spread out over a longer period of time and the same way we advise you not to embark on a sleepless binge of the great director’s masterworks, we also encourage you to take the time to learn about SQL and the MonetDB suite if you intend to use it for more complex analyses. For now, we hope this introductory quick dive into the MonetDB suite has given you a taste of its capabilities and inspired you to further explore.
For more updates, keep an eye out on our blog and GitHub ;)