Limit memory usage of MonetDB on Linux with CGROUPS

MonetDB is well-known for its eagerness to “eat up” all your memory. Fortunately, on Linux, there is a memory resource controller called cgroups, with which one can rein in that behaviour a bit. Let me share some experiences of MonetDB under cgroups here.

About CGROUP

Perhaps it is a less known fact that Docker, LXC and other container technologies are implemented using Linux Namespace Isolation and Linux Control Groups, aka cgroups. The latter is a Linux kernel feature introduced in 2008. The main idea behind cgroups is to manage hardware and operating system resources for different groups of processes. Let’s for example say we want to run two database servers on a single host which has 64GB of RAM. If we want each database server to have some guaranteed access to half of this memory capacity, we can set up two special memory cgroups, group_1 and group_2, and put a memory limit of 32 GB on each of them and associate the database server processes with their respective cgroup.

Besides the memory controller there are other subsystems that are useful for database systems (although limiting CPU usage of MonetDB is much easier). For example the cpusets CGroup subsystem is a very useful subsystem to assign specific CPUs to specific processes. This can be used to divide the total number of threads equally among multiple database servers. There are CGroup subsystems that deal with labelling and prioritising network packages, (block) devices, snapshotting, CPU usage accounting, etc.

So why would you choose to use a direct cgroup approach instead of a managed container layer like Docker? Well I can imagine that in enterprise production environments certain, (overly) strict security policies prevent the usage of third party tools outside of the tools that come with the OS distribution and some well defined software inventory. reason could be that for simple use cases, manual cgroups configurations offers a more lightweight, transparent and dynamic way of implementing resource management. On the other hand, if you need high customisability, then again cgroups offers the most fine-grained access on Linux systems to resource management.

Setting up a CGROUP for MonetDB

To keep it simple, we will only focus on the memory controller subsystem here, and how to get it done on my Ubuntu 18.04 (Bionic) machine which runs Linux kernel from version 4.15. Furthermore I am using cgroups version 1. There is also a more recent version 2 of cgroups which has a different architecture.

Let’s say that our MonetDB user is running mserver5 locally on his/her desktop. And the user wants to prevent MonetDB from eating all of the machine’s 4 GB RAM. So the user decides to limit mserver5 to cap physical memory usage to 2 GB.

My Ubuntu machine already comes with a couple of cgroup hierarchies for specific tasks. These hierarchies are just a set of directory trees. You can find the special file system called “cgroups” mounted on the root of each of these trees. The hierarchy that we are interested in is found at /sys/fs/cgroup/memory which is the preconfigured root of a cgroup hierarchy that controls memory limits. we can do something like

$ cd /sys/fs/cgroup/memory
$ sudo mkdir monetdb-cgroup
$ cd monetdb-cgroup

After the creation of the cgroup, notice that it is already filled with special files:

$ ls
-rw-r--r-- 1 root root 0 Oct 30 11:19 cgroup.clone_children
--w--w--w- 1 root root 0 Oct 30 11:19 cgroup.event_control
-rw-r--r-- 1 root root 0 Oct 30 11:19 cgroup.procs
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.failcnt
--w------- 1 root root 0 Oct 30 11:19 memory.force_empty
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.failcnt
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.limit_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.max_usage_in_bytes
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.slabinfo
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.tcp.failcnt
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.tcp.limit_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.tcp.max_usage_in_bytes
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.tcp.usage_in_bytes
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.usage_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.move_charge_at_immigrate
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.numa_stat
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.oom_control
---------- 1 root root 0 Oct 30 11:19 memory.pressure_level
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.soft_limit_in_bytes
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.stat
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.swappiness
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.usage_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.use_hierarchy
-rw-r--r-- 1 root root 0 Oct 30 11:19 notify_on_release
-rw-r--r-- 1 root root 0 Oct 30 11:19 tasks

To put a limit on the memory usage, perform

$ sudo $(echo 2G > memory.limit_in_bytes)

Assuming that a single monetdb instance was already running, we can then simply add the corresponding mserver5 process to our newly created monetdb-cgroup:

$ sudo $(pgrep mserver5 > tasks)

Voila! The mserver5 process’ private physical memory usage is now bound by above commands to 2 GB. Even if mserver5 was using more than 2 GB of memory at the time of its cgroup association, all incoming allocations are then swapped out until the process is using less than its cgroup physical memory limit. This even works for sharable memory: as long as only a single process uses the potentially sharable memory pages (e.g. pages associated with memory mapped files), those pages are considered by the kernel to be private and charged to the process that uses them (the most). You can easily see the difference in memory usage using something like the top command.

I have to put up a disclaimer that this is the desired behaviour according to the official Linux kernel documentation and it is produced on the aforementioned Ubuntu system. But before this successful implementation, I have done a similar experiment with a slightly older Ubuntu distribution (16.04 xenial with kernel version 4.4; I believe). On that set up, the cgroup limit didn’t correctly limit the memory usage of MonetDB, because somehow it wasn’t accounting the sharable memory to the private memory of the database server process. I have no clue why this is the case and I am suspecting a bug of some sort in the older kernel.

In case your system doesn’t have a predefined cgroup hierarchy for configuring the memory controller, the steps to create a cgroup hierarchy are pretty straightforward and outlined in the above mentioned kernel documentation.

I have noticed that the directory that contains the roots of the cgroup hierarchies, always has the tmpfs file system mounted. I haven’t found a reason why this is the case or if it also works with another/inherited file system

$ sudo mkdir /sys/fs/cgroup # Doesn’t necessarily have to be this path
$ sudo mount -t tmpfs none /sys/fs/cgroup
$ cd /sys/fs/cgroup
$ sudo mount -t cgroup none /sys/fs/cgroup/memory -o memory

Some gotcha’s

To configure cgroups, you need root privileges. Either because you need to use the mount command or because predefined cgroups like /sys/fs/cgroup/memory are owned by root and have write-restrictions. However, after you have configured a (parent) cgroup you can change the ownership of that cgroup to a non-root user using chown.

Be aware that with cgroups version1, only one cgroup hierarchy can exist for any unique set of cgroups subsystems. So if we would create another hierarchy, say /sys/fs/cgroup/another-memory, that has only the memory controller as a subsystem (-o memory), then the corresponding mount command for another-memory will fail, since there is already an existing hierarchy with the exact same subsystem(s). But if we would mount it with two subsystems, say the memory AND cpuset controllers (-o cpuset,memory), the mount should succeed.

CGroups and OOM-killer

Sometimes when MonetDB hits the (memory) limit of its CGroup, it might get killed by the OOM-killer. Triggering the OOM-killer seems to be one strategy used by CGroups to enforce the memory limitation on the processes (see also [1] and [2]). To prevent MonetDB from being killed by the OOM-killer, one can set “oom_score_adj” to a very low value, e.g. -1000. Valid values of “oom_score_adj” are integers in the range of -1000 to 1000. The lower the value is, the lower the chance is that a process is going to be killed (see also [3]).

Notes for Ubuntu 20.04 (2020-04-29)

Tried out the above instructions on a freshly installed Ubuntu 20.04. Stumbled upon several problems. Here they are, and their solutions.

First, on Ubuntu 20.04 the commands “sudo $(echo 2G > memory.limit_in_bytes)” and “sudo $(pgrep mserver5 > tasks)” might not work. A workaround is to first become root (i.e. “sudo su”), then execute those commands without “sudo”.

Second, if you want to set “memory.memsw.limit_in_bytes”, but it is not in the “monetdb-cgroup” directory just created, you can try the following:

STEP 0\. BEFORE:
0.1\. BOOT_IMAGE doesn't have the expected values:
# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.4.0-28-generic root=UUID=aa167f0b-53ed-4449-b157-15238f0e5644 ro quiet
0.2\. GRUB_CMDLINE_LINUX doesn't have the expected values:
# grep GRUB_CMDLINE_LINUX /etc/default/grub
GRUB_CMDLINE_LINUX=""

STEP 1\. Change GRUB_CMDLINE_LINUX into the following:
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"

STEP 2\. make the grub config file definitive:
# sudo update-grub

STEP 3\. restart the machine

STEP 4\. check we have the expected values:
# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.4.0-28-generic root=UUID=aa167f0b-53ed-4449-b157-15238f0e5644 ro cgroup_enable=memory swapaccount=1 quiet

Now you should be able to set memory.memsw.limit_in_bytes.


[1] https://stackoverflow.com/questions/27439910/why-cgroup-s-memory-subsystem-use-oom-killer-instead-of-return-memory-allocation

[2] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-memory

[3] https://askubuntu.com/questions/60672/how-do-i-use-oom-score-adj