Arjen,

Thanks for your input. It's all 100% correct. 

I should stress that we don't use containers as "lightweight VMs". We use them mainly because of the advantages they bring in terms of ease of deployment.
So we do not expect containers to simulate a VM.

The only point I disagree with is that the difference between containers and VMs is the fundamental problem here. Yes, they are different and I'm aware of that.

To me, the fundamental problem is that there is no means to prevent each MonetDB instance from claiming all available memory. This has nothing to do with containers; the same happens on bare metal or on VMs.

Roberto

On Thu, 20 Jun 2019 at 11:34, Arjen de Rijke <arjen.de.rijke@cwi.nl> wrote:
Hi All,

To add to this from a "sysadmin" perspective, one fundamental problem is that "containers" and "virtual machines" appear to be similar, but fundamentally they are very different. And that manifests itself in the problems that you are describing.

This means that things like memory limits in containers look a lot like the amount of memory in a (virtual) machine, but they are really different. If a real machine runs out of memory, the kernel tries to schedule different processes in such a way that the machine continues to work, for example by swapping to disc. Only as a last resort will it kill a process. With containers this is different. There the container runtime will kill a container that exceeds the memory limit. Because the primary use case for containers is stateless applications, such as webservers, this behaviour is perfectly fine. But if you are running stateful applications, such as databases, in containers, this creates problems.

Containers "promise" to make maintenance easier, compared to virtual machines. But this comes at a price. Building and maintaining a virtual machine is not as easy as building and maintaining a container, but the advantage of a virtual machine is that it is "exactly" the same as a real machine. So a database running inside a virtual machine behaves (more or less) the same as on physical hardware. Under normal circumstances running a database inside a container is fine, but in the more extreme cases, you run into the problems you describe.

In his email Roberto writes: "... So far, the only mechanism I know to obtain the correct behavior is to run actual VMs for MonetDB. ..." and he is right. That seems to imply that the container behaviour is wrong, but that is not true. The container behaves exactly as it is supposed to. The problem is that it does not match Roberto's intended use case. He also says: "... this is very cumbersome and I want to avoid that as much as possible. ..." and he is not wrong, although there are tools available to build VM images relatively easily. Maintaining a set of VMs with a database on a machine is not as easy as running a set of containers, that is certainly true. But it is a trade-off. You can try to work around the limitations of containers, but that will make managing the containers more difficult and more error-prone, which is exactly what you wanted to avoid in the first place. You could also try to make the database aware of the container's memory limit, but then the database has to guarantee it will not exceed this limit under any circumstance, which is impossible (I guess).

Some of Jennie's remarks can help with running MonetDB in containers, and maybe some changes to MonetDB might help as well. But the fundamental problem remains: containers and (virtual) machines are different, and in certain cases you will notice the difference. In order to make the right choice, you need to know what your requirements are and how the different solutions are implemented. For many cases it is perfectly fine to run databases in containers, but unfortunately I don't think it will always work.

Arjen

PS: I am only talking about Linux here.


----- Original Message -----
> From: "Ying Zhang" <Y.Zhang@cwi.nl>
> To: "Communication channel for MonetDB users" <users-list@monetdb.org>
> Sent: Wednesday, June 19, 2019 11:26:37 PM
> Subject: Re: Guidelines for MonetDB in production environments

>> On 14 Jun 2019, at 15:05, Roberto Cornacchia <roberto.cornacchia@gmail.com>
>> wrote:
>>
>> Hi all,
>>
>> I'm struggling with optimizing resource sharing of MonetDB in production
>> environments (see also:
>> https://www.monetdb.org/pipermail/users-list/2018-June/010276.html).
>
> Hai Roberto,
>
> We don’t have a good solution yet to share resources among MonetDB instances, but
> recently we have gathered some information on this topic.  Let me share it
> here.
>
>>
>> We run MonetDB instances for several projects / customers on each of our
>> servers.
>> Each MonetDB instance is a docker container (used mainly because of ease of
>> deployment and environment isolation). It is not unusual to have 5-10 MonetDB
>> containers on the same server.
>> In principle, Docker does not even have much to do with this, but it puts
>> everything in a realistic context.
>>
>>
>> ** Memory
>> mserver5 checks the system memory and calibrates on that. When 10 instances are
>> running, they all assume they have the whole memory for themselves.
>>
>> Docker allows setting limits on the container memory. It does that by using
>> cgroups (so Docker just makes things easier, but it's really about cgroups).
>> However, memory limits set by cgroups are not namespaced
>> (https://ops.tips/blog/why-top-inside-container-wrong-memory/#memory-limits-set-by-cgroups-are-not-namespaced).
>
> We’ve often used tools such as numactl and cgroups to limit hardware resources
> MonetDB can use.  They indeed limit the resources available to mdb, but we only
> realised it recently that mdb is not aware of those limits, so it can cause
> various problems.  This is an open issue reported here:
> https://www.monetdb.org/bugzilla/show_bug.cgi?id=6710
>
> FYI, depending on the system, mdb uses sysctl or GlobalMemoryStatusEx for
> memory, the former with system-dependent arguments.  For number of cores mdb
> uses sysconf, sysctl, or GetSystemInfo.  See gdk_utils.c (MT_init()) and
> gdk_system.c (MT_check_nr_cores()).
>
>> This means that each container will still see the whole memory and will simply
>> get killed when the container limit has been reached (definitely not a
>> solution).
>
> It doesn’t solve the problem, but in this blog (especially at the end), we gave
> some ideas on how to avoid the OOM-killer:
> https://www.monetdb.org/blog/limit_memory_usage_on_linux_with_cgroups
>
> However, please be aware that lowering the OOM-killer priority would just make
> OOM-killer choose a different victim, which can be a disaster on a production
> server.
> Docker even has an option to disable OOM-killer on a container. But the
> consequences may be even worse, as without a victim processes can just freeze
> forever.
>
>
> For windows, we have actually added an option *inside* mdb to limit its memory
> usage.  I think with that one, mdb is actually aware of the limits…  The code
> is not released yet.
>
>> So far, the only mechanism I know to obtain the correct behavior is to run
>> actual VMs for MonetDB. But this is very cumbersome and I want to avoid that as
>> much as possible.
>> Should we let 10 instances believe they each have the whole memory, and let them
>> fight for it? (well, that's what's happening now, and I know for sure it's
>> bad).
>> Perhaps the solution can be as easy as allowing an explicit max memory setting,
>> together with some documentation on the consequences of using low / high
>> values.
>
> I’m also thinking about an explicit max-memory setting.  One that’s similar to
> --set gdk_nr_threads = N, so that one can set it to the same amount of MEM as
> the limit in the external tools.  It’s a bit hacky, but is probably the easiest
> to implement.  Let me check with the others if this is something we can do in
> the short term.
>
> An ideal solution would be to let MonetDB also check for the resource limits
> set by cgroups, numactl, Docker, etc.
> Perhaps what we need to do is look at the resource limits (getrlimit function
> call) to get the (soft) limit.  If they are lower than what we found by using
> sysctl/sysconf, we should use the lower value. Actually, the Linux cgroups
> manual refers to getrlimit, so they may have to do with each other.
>
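The "use the lower value" idea above can be sketched roughly as follows. This is only an illustration in Python, not MonetDB code: the function names are hypothetical, and choosing RLIMIT_AS as the limit to compare against is an assumption. To my knowledge getrlimit reports ulimit-style limits only, so a cgroup limit would probably still need a separate lookup.

```python
import os
import resource


def physical_memory_bytes():
    """Total physical memory as reported by sysconf (Linux and most Unixes)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def effective_memory_bytes(physical, soft_limit):
    """Return the lower of the physical memory and the soft resource limit.

    RLIM_INFINITY (or any negative value) means "no limit", in which case
    the physical memory size is used unchanged.
    """
    if soft_limit == resource.RLIM_INFINITY or soft_limit < 0:
        return physical
    return min(physical, soft_limit)


if __name__ == "__main__":
    # Soft limit on the address space of this process (assumed choice).
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    print(effective_memory_bytes(physical_memory_bytes(), soft))
```

Started under something like "ulimit -v", the printed value should drop to that limit; without it, the full physical memory is reported.
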
> For cgroups on linux one can do amongst others: cat /proc/<PID>/cgroup
> to get the cgroup of the process with a specific pid. Once one knows the cgroup,
> one can look up the memory limits in the cgroup directory assuming sufficient
> permissions.
>
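A minimal sketch of that lookup, in Python for brevity. The helper names are hypothetical, and the /sys/fs/cgroup mount point is an assumption (it is the conventional location, but a system may mount it elsewhere). The file names memory.limit_in_bytes (cgroup v1) and memory.max (cgroup v2) are the standard kernel ones.

```python
import os


def parse_proc_cgroup(text):
    """Map controller name -> cgroup path from /proc/<PID>/cgroup content.

    Each line is "hierarchy-ID:controller-list:path". The cgroup v2 entry
    has an empty controller list and is stored under the key "".
    """
    result = {}
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        _hier_id, controllers, path = parts
        for ctrl in controllers.split(","):
            result[ctrl] = path
    return result


def limit_file_candidates(cgroups, root="/sys/fs/cgroup"):
    """Return the files that may hold the memory limit (v1 first, then v2)."""
    candidates = []
    if "memory" in cgroups:  # cgroup v1 memory controller
        rel = cgroups["memory"].lstrip("/")
        candidates.append(os.path.join(root, "memory", rel, "memory.limit_in_bytes"))
    if "" in cgroups:  # cgroup v2 unified hierarchy
        rel = cgroups[""].lstrip("/")
        candidates.append(os.path.join(root, rel, "memory.max"))
    return candidates


def parse_limit(text):
    """Return the limit in bytes, or None when no limit is set."""
    value = text.strip()
    if value == "max":  # cgroup v2 spelling of "unlimited"
        return None
    number = int(value)
    # Heuristic: cgroup v1 reports a huge number when no limit is set.
    return None if number >= 1 << 62 else number


def memory_limit_for_pid(pid="self"):
    """Look up the cgroup memory limit of a process, or None if unknown."""
    try:
        with open("/proc/%s/cgroup" % pid) as f:
            cgroups = parse_proc_cgroup(f.read())
    except OSError:
        return None
    for path in limit_file_candidates(cgroups):
        try:
            with open(path) as f:
                return parse_limit(f.read())
        except OSError:
            continue  # missing file or insufficient permissions
    return None
```

Inside a container the path listed in /proc/<PID>/cgroup may not match what is mounted under /sys/fs/cgroup, so a None result here means "could not determine", not "unlimited".
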
>>
>> ** CPU
>> Again, Docker allows setting quotas per container. I think cgroups CPU limits are
>> namespaced, so perhaps this would just work well, I haven't really tried yet.
>
> I wonder if --set gdk_nr_threads = N can be of any help here.
>
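One way the thread setting could be tied to a container CPU quota is sketched below. It assumes cgroup v2, where the cpu.max file holds "<quota> <period>" or "max <period>". The function names are hypothetical, and rounding the quota up to whole threads is just one possible policy.

```python
import math


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max content into a number of CPUs.

    Returns quota / period as a float, or None when there is no quota
    ("max") or the content cannot be interpreted.
    """
    fields = text.split()
    if len(fields) != 2 or fields[0] == "max":
        return None
    quota, period = int(fields[0]), int(fields[1])
    if quota <= 0 or period <= 0:
        return None
    return quota / period


def threads_for_quota(cpus, host_cores):
    """Pick a thread count: at least 1, at most the number of host cores."""
    if cpus is None:
        return host_cores
    return max(1, min(host_cores, math.ceil(cpus)))
```

The resulting number could then be passed as N in --set gdk_nr_threads = N when starting the server inside the container.
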
>>
>> ** I/O
>> Same issue. It would be ideal to be able to set priorities, so that mserver5
>> instances that do background work get a lower I/O priority than instances
>> serving online queries.
>
> This is probably even more difficult than MEM and CPU limitations, since MonetDB
> heavily relies on mmapped files and lets the OS decide what’s best.  And so far,
> we have barely received any user requests on this particular topic...
>
> I know about some research work on improving mmapped files, which allows
> applications to assign a priority to each page.
>
> Maybe madvise can help a bit here.
>
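For reference, this is roughly what giving the OS a hint on a mapped file looks like. It is a small Python sketch (mmap.madvise needs Python 3.8 or later, and the MADV_* constants depend on the platform), not a description of what MonetDB does. The helper names are hypothetical. Note that madvise only describes the expected access pattern; it is not an I/O priority.

```python
import mmap
import tempfile


def advise_mapping(mapped, advice_name):
    """Apply a madvise hint by name; return True only if it was applied.

    Returns False when the platform or Python version lacks the hint.
    """
    advice = getattr(mmap, advice_name, None)
    if advice is None or not hasattr(mapped, "madvise"):
        return False
    mapped.madvise(advice)
    return True


def demo():
    """Map a small temporary file and hint that it will be read sequentially."""
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * mmap.PAGESIZE * 4)
        f.flush()
        with mmap.mmap(f.fileno(), 0) as mapped:
            # A background scan could declare sequential access so the
            # kernel can read ahead and drop pages soon after use.
            applied = advise_mapping(mapped, "MADV_SEQUENTIAL")
            first = mapped[0]
    return applied, first
```
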
>> Also, recommendations on swap settings would be interesting. How much swap? How
>> to tune swappiness kernel settings?
>>
>> I am very aware that there is no simple answer to most of these questions. Many
>> variables are in the picture.
>> Still, some general thoughts from the developers would be appreciated.
>>
>> I think I have read pretty much everything that has ever been written about MonetDB,
>> but when it comes to resource utilization I have always bumped into the very
>> unrealistic assumption that each MonetDB instance has a whole server for
>> itself.
>> As I mentioned above, things could already get much better with simple
>> improvements, like a setting for the maximum memory usable by each instance.
>>
>> But more in general, I feel there is much need for some guidelines for
>> production environments. Or at least, to start the discussion.
>
> Let’s try to keep this discussion more active.
>
> Just my ¥0.02
>
> Jennie
>
>>
>> Best regards,
>> Roberto
>> _______________________________________________
>> users-list mailing list
>> users-list@monetdb.org
>> https://www.monetdb.org/mailman/listinfo/users-list