Thanks again Sjoerd, I appreciate your thoughts on this.

Indeed, after the problem started I tried playing with kernel parameters (swappiness, overcommit_memory, etc.), unfortunately with no improvement. I have since moved all parameters back to their defaults (thoroughly double-checked).
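For reference, these are the knobs I had been poking at (a read-only sketch; the paths are the standard Linux sysctl files, and both values are back at their defaults on this VM):

```shell
# How eagerly the kernel swaps out anonymous pages (default 60):
cat /proc/sys/vm/swappiness

# Overcommit policy: 0 = heuristic, 1 = always overcommit, 2 = strict accounting:
cat /proc/sys/vm/overcommit_memory
```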

What is funny is that I have two more VMs that are exact copies of this one (same OS, same running containers, same limits assigned to the containers, same data, same versions of everything; they were created programmatically).
The one difference is that those two have less RAM assigned to the VM. And they don't fail. Ironic, isn't it?
The extra physical memory makes me suspect that more of it is being used by the OS as page cache. In principle that should not affect what is counted as RSS, but it is really the only difference I can see.
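One way to check that hunch (a sketch: the host-wide numbers come from /proc/meminfo, while the per-container cgroup path below is an assumption that depends on the container runtime):

```shell
# Host-wide view: how much memory the kernel is holding as page cache.
grep -E '^(MemTotal|MemFree|Cached):' /proc/meminfo

# Per-container view (cgroup v1, as used in this setup): the memory
# controller's own breakdown of cache vs. rss. The path is hypothetical --
# substitute the container's actual cgroup directory:
# grep -E '^(cache|rss|swap) ' /sys/fs/cgroup/memory/docker/<container-id>/memory.stat
```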

For the record, yesterday I tried the oom_kill_disable option on the container as a workaround, to prevent a kill that seemed unjustified.
That didn't go well. The OOM killer was indeed not invoked, but the cgroup's hard limit was still in place, which means the kernel still refused to let the process allocate more than 16g.
That resulted in:

!ERROR:MALException:mal.interpreter:GDK reported error: GDKload: cannot read: name=02/25/22515, ext=theap, 8192 bytes missing.
!ERROR:!OS: Cannot allocate memory
!ERROR:!ERROR: GDKload: cannot read: name=01/53/15346, ext=theap, 8192 bytes missing.
!ERROR:!OS: Cannot allocate memory
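For completeness, the workaround amounted to something like this (a sketch using Docker's documented memory flags; the container name and image are placeholders, not the actual setup):

```shell
# Hard limit still enforced by the cgroup; only the OOM killer is disabled,
# so allocations beyond 16g fail with ENOMEM instead of triggering a kill.
docker run -d --name mdb-test \
  --memory=16g \
  --oom-kill-disable \
  <image>
```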

I could try to set *only* the soft limit on the container. That is in principle a bit risky, as the container would be allowed to go rogue with memory, but let's at least see what happens.
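In Docker terms, the soft-limit-only experiment would look something like this (again a sketch; --memory-reservation is Docker's soft limit, and omitting --memory leaves no hard cap):

```shell
# Soft limit only: under memory pressure the kernel reclaims from this
# container first, but when memory is plentiful it may exceed 16g
# without being killed.
docker run -d --name mdb-test \
  --memory-reservation=16g \
  <image>
```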

To be clear: I understand the kernel's role in this, and I don't expect the MonetDB developers to debug kernel behaviour. However, correct memory management is crucial for MonetDB, so the two are closely related in practice. I ask and comment here in the hope that unraveling this issue will be useful to other users in similar conditions.


On Thu, 11 Mar 2021 at 09:05, Sjoerd Mullender <sjoerd@monetdb.org> wrote:
What I think is happening now is this.  I assume that your
hardware has plenty of memory available and that the system as a whole
is not under memory pressure.  It is just this one container that is
using up its allocated space.  I think the kernel is here at fault in
that it prefers to kill the process in the container to moving some of
its memory pages to swap.  I have no idea whether you can tell the
kernel to either use more memory for that container (since there is
probably still plenty around) or to swap pages of that container out.
In either case, that is not something that mserver5 can do or control.

The memory settings available to mserver5 are just how much virtual
memory (or address space) it is allowed to use (hard limit), and an
indication of how soon to switch from allocating memory for BATs using
malloc to using memory-mapped files.  In either case, physical memory is
used for those BATs, and if the kernel is unwilling to relinquish that
memory (sending pages to swap/disk) to make space within the container,
then you will run into this problem.

On 09/03/2021 21.08, Roberto Cornacchia wrote:
> Thanks again for these details and for the gdk estimates to look at.
>
> To me what is important here is not so much to understand why swap wasn't
> used, but the fact that no swap used actually simplifies things.
>
>  > This 16g is a combination of what was allocated (and used) through
>  > malloc and through mmap.  Forget about memory being a malloc thing.
>  > Mmap also uses memory.
>
> The kernel seems to tell me that in this case the 16g used are physical
> RAM. Not swapped, not mmapped:
>
> memory: usage 16777216kB, limit 16777216kB, failcnt 244063804
> memory+swap: usage 16777964kB, limit 9007199254740988kB, failcnt 0
> kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
> Memory cgroup out of memory: Kill process 975803 (mserver5) score 1003
> or sacrifice child
> Killed process 4134544 (mserver5), UID 0, total-vm:28382372kB,
> anon-rss:16776396kB, file-rss:13284kB, shmem-rss:0kB
>
>  > GDK_mem_maxsize is not a hard limit.  It's a value on which some
>  > decisions having to do with allocating memory are based.
>
> This is actually what I was asking confirmation about.
> I'm not complaining of anything, but this is not really documented and
> I'm just trying to understand exactly which tools I have: is it a soft
> limit in the sense that it actually tries to stay more or less below
> this limit, or is it really only used to estimate other values?
>
> The point is not to force some exact behaviour, but to apply some
> resource management *without* having the server killed, and that is a
> big problem.
>
> On Tue, 9 Mar 2021 at 20:26, Sjoerd Mullender <sjoerd@monetdb.org> wrote:
>
>     On 09/03/2021 17.44, Roberto Cornacchia wrote:
>      > Sjoerd,
>      >
>      > Thanks for these details.
>      >
>      > Let me focus on these two concepts:
>      >
>      >  > Allocated address space may or may not reside in physical memory.
>      >  > The kernel decides that.
>      >
>      > Absolutely.
>      > Still, you can decide what's the max that malloc() can use:
>      >
>      >  > gdk_mem_maxsize is the maximum amount of address space we want to
>      >  > allocate using malloc and friends.
>
>     GDK_mem_maxsize is not a hard limit.  It's a value on which some
>     decisions having to do with allocating memory are based.
>     GDK_vm_maxsize
>     is a fairly hard limit.  We may go over it in critical code (during
>     transaction commit), but otherwise allocations (both malloc and mmap)
>     will fail if you were to go over this limit.
>
>      >
>      > Isn't malloc() using RSS + swap to back its allocations? Does
>     that mean
>      > that gdk_mem_maxsize should be a cap to what we want to be able to
>      > allocate on RSS + swap?
>      > In this case, actually, swap usage was 0.
>      >
>      > So I still don't understand why mallocs for 16g happened,
>      > when gdk_mem_maxsize was 14g.
>
>     This 16g is a combination of what was allocated (and used) through
>     malloc and through mmap.  Forget about memory being a malloc thing.
>     Mmap also uses memory.
>
>     Why swap is unused I don't know.  As I said, it's the kernel that does
>     that.  We have nothing to do with that.  It may be you have kernel
>     parameters set that cause the kernel to not use swap.
>
>     By using the debugger you can check how much MonetDB thinks it has
>     allocated.  Look at the values of the variables
>     GDK_mallocedbytes_estimate and GDK_vm_cursize.  But again, that is
>     allocated address space, not memory.  And there may be fragmentation as
>     well, so the amount of address space in use by the process may well be
>     higher (of course these numbers don't take space for declared variables
>     into account).
>
>
>
>
>      >
>      >
>      >      > On Tue, 9 Mar 2021 at 17:27, Sjoerd Mullender <sjoerd@monetdb.org> wrote:
>      >
>      >     We do not in any way control RSS (resident set size).  That
>     is fully
>      >     under control of the kernel.
>      >
>      >     gdk_mem_maxsize is the maximum amount of address space we want to
>      >     allocate using malloc and friends.
>      >     gdk_vm_maxsize is the maximum amount of address space we want to
>      >     allocate (malloc + mmap).
>      >     Neither value has anything to do with how much actual,
>     physical memory
>      >     is being used.  They are just measures of how much address
>     space is
>      >     used, allocated either through malloc or malloc+mmap.  Allocated
>      >     address
>      >     space may or may not reside in physical memory.  The kernel
>     decides
>      >     that.
>      >
>      >     Of course, if you're using the address space (however you got
>     it),
>      >     there
>      >     must be physical memory to which the address space is mapped.
>      >
>      >     The difference between malloc and mmap is mostly where the
>     physical,
>      >     disk-spaced backing (if any) for the virtual memory is
>     located, i.e.
>      >     where the kernel can copy the memory to if it needs space.
>     In the case
>      >     of mmap (our use of it, anyway) it is files in the file
>     system, and in
>      >     the case of malloc it is swap (if you have it) or physical
>     memory (if
>      >     you don't).
>      >
>      >     On 09/03/2021 16.58, Roberto Cornacchia wrote:
>      >      > Hi,
>      >      >
>      >      > I would appreciate some help interpreting the following
>      >     memory-related
>      >      > issues.
>      >      >
>      >      > I've got a mserver5 instance running with cgroups v1
>     constraints
>      >      >
>      >      > - memory.limit_in_bytes = 17179869184 (16g)
>      >      > - memory.memsw.limit_in_bytes = 9223372036854771712
>      >      >
>      >      > gdk_mem_maxsize is initialised as 0.815 * 17179869184
>     = 14001593384.
>      >      >
>      >      > So I get:
>      >      > sql>select * from env() where name in ('gdk_mem_maxsize',
>      >     'gdk_vm_maxsize');
>      >      > +-----------------+---------------+
>      >      > | name            | value         |
>      >      > +=================+===============+
>      >      > | gdk_vm_maxsize  | 4398046511104 |
>      >      > | gdk_mem_maxsize | 14001593384   |
>      >      > +-----------------+---------------+
>      >      >
>      >      > That looks good.
>      >      >
>      >      >
>      >      > To my surprise, this instance gets frequently OOM-killed for
>      >     reaching
>      >      > 16g of RSS (no swap used):
>      >      >
>      >      > memory: usage 16777216kB, limit 16777216kB, failcnt 244063804
>      >      > memory+swap: usage 16777964kB, limit 9007199254740988kB,
>     failcnt 0
>      >      > kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
>      >      > Memory cgroup out of memory: Kill process 975803
>     (mserver5) score
>      >     1003
>      >      > or sacrifice child
>      >      >
>      >      > Now, there are two different aspects: giving a process a
>     memory
>      >     cap and
>      >      > making the process respect that cap without getting killed.
>      >      >
>      >      > - if the process allocates more than defined with cgroups,
>     then
>      >     it gets
>      >      > killed. That is fine, it doesn't surprise me
>      >      > - the question is: why did monetDB surpass the 16g limit?
>      >      >
>      >      > Even more surprising, given that it "prudently" initialises
>      >     itself at
>      >      > 80% of the available memory.
>      >      >
>      >      > Perhaps I was under the wrong assumption that MonetDB
>     would never
>      >      > allocate more than gdk_mem_maxsize, but now I seem to realise
>      >     that it
>      >      > simply uses this value to optimise its memory management
>     (e.g. to
>      >     decide
>      >      > how early to mmap).
>      >      >
>      >      > So, am I correct that setting gdk_mem_maxsize (indirectly via
>      >     gcroups or
>      >      > directly via memmaxsize parameter) does not guarantee rss
>     memory
>      >     will
>      >      > stay under that value?
>      >      >
>      >      > If that is true, I am back at square 1 in my quest for how
>     to cap
>      >     rss
>      >      > usage (without getting the process killed).
>      >      >
>      >      > Thanks for your help.
>      >      > Roberto
>      >      >
>      >      > _______________________________________________
>      >      > users-list mailing list
>      >      > users-list@monetdb.org
>      >      > https://www.monetdb.org/mailman/listinfo/users-list
>      >      >
>      >
>      >     --
>      >     Sjoerd Mullender
>      >
>      >
>      >
>
>     --
>     Sjoerd Mullender
>
>
>

--
Sjoerd Mullender