tail heap hash found to be invalid on repeated DB loads

Eyal Rozenberg E.Rozenberg at cwi.nl
Thu Mar 29 11:02:35 CEST 2018



On 03/29/2018 09:47 AM, Sjoerd Mullender wrote:
>> 2. If it is expected, why isn't a correct hash written back to the disk?
>> Doesn't the memcmp(newhash, h->base, sizeof(newhash)) command
>> eventually/immediately get translated into a write?
> 
> It doesn't happen all the time

If I have an mmapped column and the memcmp() comes out false, then
there's a memcpy() from the new hash to h->base. Why would this not
always result in a write, assuming MonetDB is stopped normally?
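To make my expectation concrete, here is an illustration in Python (not GDK code; the file name and layout are made up) of why a memcpy()-style update through a shared mapping should reach the file after a clean flush:

```python
# Illustration (hypothetical layout, not GDK): a byte-wise update through
# a shared mmap is persisted to the file once the mapping is flushed and
# closed -- which is why, after a normal shutdown, one would expect the
# on-disk hash to match the recomputed one.
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "heap")
with open(path, "wb") as f:
    f.write(b"old_hash" + b"\0" * 56)   # stand-in for the heap's hash header

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 0)        # shared, writable mapping
    newhash = b"new_hash"
    if m[:len(newhash)] != newhash:     # the memcmp()
        m[:len(newhash)] = newhash      # the memcpy() into h->base
    m.flush()                           # a normal shutdown flushes dirty pages
    m.close()

with open(path, "rb") as f:
    on_disk = f.read(8)
print(on_disk)                          # b'new_hash' -- the write reached disk
```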

> and you don't want to be writing to
> files when there is no need.  We've had complaints about that.

You're obviously right about not wanting to write without good reason, 
but one would expect that in the simple case of data being inserted once 
and then never modified, the hash on disk would be consistent with the 
hash recomputed on load.

> The problem is that during normal database operations, string heaps can
> be appended to, but if the change doesn't actually get committed, the
> appends need to be undone.  This can happen e.g. when a transaction is
> aborted due to server shutdown (crash or regular).  The committed data
> indicates how large the string heap is (BBP.dir), but it may be that
> there is more data, so that gets truncated.  That is fine.  But the hash
> may have been changed as well.  If in the aborted run the file had been
> loaded through GDKmalloc, that change wouldn't have gone to disk, but if
> it had been memory mapped, the change may well have gone to disk.  That
> is what needs to be repaired.
> If a string heap is larger than GDK_ELIMLIMIT (64KiB), we can't be sure
> where strings start, and anyway, it becomes expensive to run through the
> whole heap, so we only rebuild the hash for the first 64 KiB.  This
> means that it is likely the hash entry will be different if the heap is
> larger since the newer hash values will be different.
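If I follow, the prefix-only rebuild described above could be sketched like this (a toy Python model with hypothetical names, not the GDK implementation; the real GDK_ELIMLIMIT is 64 KiB, shrunk here for the demo):

```python
# Toy sketch: rebuild a string-heap hash over only the first PREFIX bytes.
# Strings appended beyond the prefix never enter the rebuilt table, so a
# hash table written by the run that appended them will compare unequal.
import zlib

PREFIX = 16  # stand-in for GDK_ELIMLIMIT (really 64 KiB)

def rebuild_prefix_hash(heap: bytes, nbuckets: int = 1024):
    table = [None] * nbuckets
    off = 0
    while off < min(len(heap), PREFIX):
        end = heap.index(b"\0", off)            # strings are NUL-terminated
        table[zlib.crc32(heap[off:end]) % nbuckets] = off
        off = end + 1                           # next string starts after the NUL
    return table

heap = b"alpha\0beta\0gamma\0"
t1 = rebuild_prefix_hash(heap)
# Strings appended past the prefix don't change the rebuilt table...
t2 = rebuild_prefix_hash(heap + b"delta\0")
print(t1 == t2)   # True
# ...so a full hash written before the abort won't match the rebuilt one.
```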

So it seems the result is that even when all changes have been 
committed, and even in the simplest case of executing COPY INTO once 
with no other writes, we can still get (sometimes? often? always?) a 
recomputation mismatch.

> Note though, that
> the hash table is completely valid, even if different.  The entries just
> point at different strings, but since it is opportunistic, that's fine.

But if that's the case, why even bother with the memcmp()? If there's no 
good reason to expect the hash to be identical, it can simply always be 
recomputed. (Or alternatively, one could play with flags and the order 
of flushing mmapped pages to disk, so that on load either the hash is 
known to be valid or a persisted dirty flag is set.)
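A minimal sketch of the flag-based alternative I have in mind (hypothetical on-disk layout, not GDK's): persist a dirty byte that is raised before the first in-place hash update and lowered only after a clean flush, so the loader knows whether the stored hash can be trusted at all.

```python
# Sketch: header = 1 dirty-flag byte + 4-byte stored hash + payload.
# On load, a raised flag (or a mismatch) means "recompute unconditionally";
# a lowered flag means the stored hash is trustworthy and the memcmp()
# can be skipped.
import os, tempfile, zlib

def store(path, payload: bytes):
    digest = zlib.crc32(payload).to_bytes(4, "little")
    with open(path, "wb") as f:
        f.write(b"\x00" + digest + payload)     # clean flag + hash + data

def load(path):
    with open(path, "rb") as f:
        dirty = f.read(1)
        digest = f.read(4)
        payload = f.read()
    trusted = (dirty == b"\x00"
               and zlib.crc32(payload).to_bytes(4, "little") == digest)
    return payload, trusted

path = os.path.join(tempfile.mkdtemp(), "col")
store(path, b"some string heap")
print(load(path)[1])                 # True: clean flag, hash trusted
with open(path, "r+b") as f:
    f.write(b"\x01")                 # an aborted run left the flag raised
print(load(path)[1])                 # False: recompute instead
```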


More information about the developers-list mailing list