tail heap hash found to be invalid on repeated DB loads
E.Rozenberg at cwi.nl
Thu Mar 29 11:02:35 CEST 2018
On 03/29/2018 09:47 AM, Sjoerd Mullender wrote:
>> 2. If it is expected, why isn't a correct hash written back to the disk?
>> Doesn't the memcmp(newhash, h->base, sizeof(newhash)) call
>> eventually/immediately get translated into a write?
> It doesn't happen all the time
If I have a mmapped column, and the memcmp() comes out false, then
there's a memcpy() from the new hash to h->base. Why would this not
always result in a write, assuming monetdb is stopped normally?
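To make the pattern under discussion concrete, here is a minimal sketch (an assumption about the shape of the code, not the actual GDK source; HASHSIZE and refresh_stored_hash are made-up names): the hash is recomputed on load, compared against what is stored at the start of the mapped heap, and copied in if it differs. With a memory-mapped heap the memcpy() only dirties the page; the OS guarantees write-back only on msync() or a clean unmap.

```c
#include <assert.h>
#include <string.h>

#define HASHSIZE 16  /* placeholder size for the stored hash */

/* Compare the freshly computed hash with the one stored in the mapped
 * file; overwrite it if they differ.  Returns 1 if the stored hash was
 * stale, 0 if it already matched. */
static int refresh_stored_hash(unsigned char *base /* mapped file */,
                               const unsigned char newhash[HASHSIZE])
{
    if (memcmp(newhash, base, HASHSIZE) != 0) {
        memcpy(base, newhash, HASHSIZE); /* dirties the mapped page */
        return 1;
    }
    return 0;
}
```

Whether that dirtied page reaches disk before the next load is exactly the question: without an explicit msync(), write-back timing is up to the kernel.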
> and you don't want to be writing to
> files when there is no need. We've had complaints about that.
You're obviously right about not wanting to write without good reason,
but one would expect that in the simple case of data being inserted once
and then never modified, the hash on disk will be consistent with the
hash recomputation on load.
> The problem is that during normal database operations, string heaps can
> be appended to, but if the change doesn't actually get committed, the
> appends need to be undone. This can happen e.g. when a transaction is
> aborted due to server shutdown (crash or regular). The committed data
> indicates how large the string heap is (BBP.dir), but it may be that
> there is more data, so that gets truncated. That is fine. But the hash
> may have been changed as well. If in the aborted run the file had been
> loaded through GDKmalloc, that change wouldn't have gone to disk, but if
> it had been memory mapped, the change may well have gone to disk. That
> is what needs to be repaired.
> If a string heap is larger than GDK_ELIMLIMIT (64KiB), we can't be sure
> where strings start, and anyway, it becomes expensive to run through the
> whole heap, so we only rebuild the hash for the first 64 KiB. This
> means that it is likely the hash entry will be different if the heap is
> larger since the newer hash values will be different.
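The 64 KiB bound described above amounts to rebuilding over min(heap size, GDK_ELIMLIMIT) bytes; a sketch (hash_rebuild_extent is an illustrative name, only the 64 KiB constant is taken from the mail):

```c
#include <assert.h>
#include <stddef.h>

#define GDK_ELIMLIMIT (1 << 16)  /* 64 KiB, per the explanation above */

/* How many bytes of the string heap the hash rebuild covers: the whole
 * heap if it is small, otherwise only the first 64 KiB, since beyond
 * that string starts are unknown and a full scan is too expensive. */
static size_t hash_rebuild_extent(size_t heap_free)
{
    return heap_free < GDK_ELIMLIMIT ? heap_free : GDK_ELIMLIMIT;
}
```

This is why a heap larger than 64 KiB is likely to fail the memcmp(): the rebuilt prefix hash simply never sees the later entries.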
So it seems the result is that even when all changes have been
committed, and even in the simplest case of only executing COPY INTO
once and no other writes - we can still get (sometimes? often? always?)
a hash on disk that fails the comparison on the next load.
> Note though, that
> the hash table is completely valid, even if different. The entries just
> point at different strings, but since it is opportunistic, that's fine.
But if that's the case, why even bother with the memcmp()? If there's no
good reason to expect the hash to be identical, it can simply be
recomputed always. (Or alternatively, there could be some play with
flags and the order of flushing mmapped pages to disk, to ensure that
either the hash is valid or a dirty flag read from disk is set.)
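The flag-ordering idea floated here might look like the following sketch (purely an assumption about a possible scheme, not existing GDK code; the flush callback stands in for msync() on the header's page): set the dirty flag and flush it before touching the hash, so that a crash at any point leaves either a valid hash or a set flag on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write a new hash into a mapped header with crash-safe ordering:
 * 1. set the dirty flag and flush it,
 * 2. write the hash and flush it,
 * 3. clear the flag and flush again.
 * flush() abstracts msync() on the header's page.  If we crash while
 * the hash may be stale, the flag is guaranteed to still be set. */
static int write_hash_ordered(unsigned char *hdr,
                              const unsigned char *newhash, size_t hs,
                              int (*flush)(void))
{
    hdr[0] = 1;                        /* dirty: hash may be stale */
    if (flush() != 0) return -1;
    memcpy(hdr + 1, newhash, hs);
    if (flush() != 0) return -1;
    hdr[0] = 0;                        /* hash consistent again */
    return flush();
}

/* stand-in flush for demonstration: always succeeds, counts calls */
static int flush_count;
static int fake_flush(void) { flush_count++; return 0; }
```

The cost is three flushes per hash update, which may be exactly the kind of extra writing the earlier complaints were about; always recomputing on load avoids that trade-off entirely.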