[Monetdb-developers] [Monetdb-checkins] MonetDB/src/gdk gdk_posix.mx, Feb2010, 1.176.2.21, 1.176.2.22 gdk_storage.mx, Feb2010, 1.149.2.32, 1.149.2.33

Peter Boncz P.Boncz at cwi.nl
Fri Feb 19 01:46:23 CET 2010


Hi Stefan,

Thanks, indeed improvements are needed in all areas:
1) indeed (scary use of free!) this should be corrected
2) typically yes. I do recall now that BATfetchjoin heap sharing will
invalidate the order correlation that otherwise always applies. If we have a way
to detect that a heap is shared, we should treat those shared string heaps
as WILLNEED.
3) also correct. MT_mmap_find() could easily find entries by range
overlap; MT_mmap_inform() would then find the relevant heap.

Finally, sequential advice will currently not trigger preloading, but I actually
think it can help (if you have enough memory). Maybe prefetch sequential
heaps up to some limit, as Martin suggests, e.g. 1/(4*threads) of memory.
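In case it helps the discussion, here is a minimal C sketch of what such a budget cap could look like; the helper name and parameters are hypothetical (not GDK API), assuming `npages` is the physical page count and the limit is memory/(4*threads) as suggested:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: cap how many bytes of a sequential heap we would
 * prefetch (i.e. pass to posix_madvise). budget_pages is derived from
 * available memory and thread count, here npages / (4 * nthreads). */
static size_t preload_limit(size_t heap_bytes, size_t page_size,
                            size_t npages, int nthreads)
{
    size_t budget_pages = npages / (4 * (size_t) nthreads);
    size_t heap_pages = (heap_bytes + page_size - 1) / page_size;
    size_t use = heap_pages < budget_pages ? heap_pages : budget_pages;
    return use * page_size; /* bytes to advise/prefetch */
}
```

With 1024 pages of 4KB and 4 threads, a 1MB heap would be capped at 64 pages (256KB), while a small 8KB heap would be prefetched entirely.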

Peter

-----Original Message-----
From: Stefan Manegold [mailto:Stefan.Manegold at cwi.nl] 
Sent: Friday, 19 February 2010 1:34
To: monetdb-developers at lists.sourceforge.net; Peter Boncz
Cc: monetdb-checkins at lists.sourceforge.net
Subject: Re: [Monetdb-checkins] MonetDB/src/gdk gdk_posix.mx, Feb2010,
1.176.2.21, 1.176.2.22 gdk_storage.mx, Feb2010, 1.149.2.32, 1.149.2.33

Peter,

I have some questions to make sure I understand your new code correctly:

1)
I don't see any place in the hash code (at least not in gdk_search.mx)
where the "free" element of a hash heap is set (or used) other than the
initialization to 0 in HEAPalloc;
thus, I guess, "free" for hash heaps is always 0;
hence, shouldn't we use "size" instead of "free" for the madvise & preload
size of hash heaps (as we did in the original BATpreload/BATaccess code)?
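To illustrate the point in isolation: a toy sketch (the `MiniHeap` struct and helper are hypothetical, not the real GDK Heap) of why an always-0 "free" degenerates the advised length to nothing, and the "size" fallback for index heaps:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of the issue: if a hash heap's "free" field is
 * never updated after its 0-initialization, using it as the madvise
 * length advises zero bytes; "size" is the safe extent for such heaps. */
typedef struct { size_t free, size; } MiniHeap;

static size_t advise_len(const MiniHeap *h, int is_hash)
{
    /* for hash heaps fall back to size, as the original BATpreload did */
    return (is_hash || h->free == 0) ? h->size : h->free;
}
```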

2)
Am I right that for string heaps you conclude from a strong order
correlation between the offset heap and the string heap (due to sequential
load/insertion) that the first and last BUN in the offset heap also point to the
"first" and "last" string in the string heap?
Well, indeed, since access is to be considered at page-size granularity,
this might be reasonable ...
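At page-size granularity, the byte range derived from the first and last referenced offsets would simply be rounded out to page boundaries; a small sketch (hypothetical helper, assuming a power-of-two page size), where small local disorder within a page is absorbed by the rounding:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: given the string-heap offsets referenced by the
 * first and last BUN, compute the page-rounded byte range to advise. */
static size_t advised_bytes(size_t first_off, size_t last_off, size_t page)
{
    size_t lo = first_off & ~(page - 1);         /* round down to page */
    size_t hi = (last_off + page) & ~(page - 1); /* round up past last */
    return hi - lo;
}
```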


3)
(This was the same in the previous version of the code)
For BUN heaps, in case of views (slices), the base pointer of the view's
heap might not be the same as the parent's heap; in fact, it might not be
page-aligned.
If I understand the MT_mmap_tab[] array correctly, it identifies heaps by
the page-aligned base pointer of the parent's heap.
Hence, BATaccess() on a slice view BAT with a non-aligned heap->base
pointer calls MT_mmap_inform() (through access_heap()) with a non-aligned
heap->base, which is not found in MT_mmap_tab[], and hence MT_mmap_inform()
does nothing with that heap. With preload==1 it thus does not register the
posix_madvise() call that access_heap() does. Consequently, with
preload==-1, MT_mmap_inform() will never reset the advice set via slice
views, unless there is (also) access to the original parent's heap (i.e.,
with a page-aligned heap->base pointer).
I just noticed this, but do not yet understand whether, and if so which,
consequences this might have ...
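A tiny sketch of the mismatch (hypothetical helper, not GDK code): MT_mmap_tab[] is keyed on the parent's page-aligned base, so a view's interior pointer can only match after rounding down to a page boundary — unless the lookup is changed to a range-overlap search, which would avoid this entirely:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: a slice view's heap->base may point into the
 * middle of the parent's mapping; rounding it down to the page boundary
 * recovers the key under which the mapping was registered. */
static uintptr_t page_align_down(uintptr_t p, uintptr_t page)
{
    return p & ~(page - 1); /* page must be a power of two */
}
```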


Stefan


On Thu, Feb 18, 2010 at 10:39:22PM +0000, Peter Boncz wrote:
> Update of /cvsroot/monetdb/MonetDB/src/gdk
> In directory sfp-cvsdas-1.v30.ch3.sourceforge.com:/tmp/cvs-serv28734
> 
> Modified Files:
>       Tag: Feb2010
> 	gdk_posix.mx gdk_storage.mx 
> Log Message:
> did experimentation with sequential mmap I/O.
> - on very fast subsystems (such as 16xssd) it is three times slower than optimally tuned direct I/O (1GB/s vs 3GB/s)
> - with fewer disks the difference is smaller (e.g. 140 vs 200MB/s)
> regrettably, nothing helped to get it higher.
> 
> the below checkin makes the following changes:
> - simplified BATaccess code by separating out a routine
> - made BATaccess more precise in what to preload (only BUNfirst-BUNlast)
> - observe that large string heaps have a high sequential correlation,
>   hence always-WILLNEED fetching is overkill
> - move the madvise() call back to BATaccess at the start of the access, but removing
>   the advise is done in vmtrim, as you need the overview when the last user is away.
> - the basic advise is SEQUENTIAL (ie decent I/O)
> 
> 
> 
> Index: gdk_storage.mx
> ===================================================================
> RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_storage.mx,v
> retrieving revision 1.149.2.32
> retrieving revision 1.149.2.33
> diff -u -d -r1.149.2.32 -r1.149.2.33
> --- gdk_storage.mx	18 Feb 2010 01:04:11 -0000	1.149.2.32
> +++ gdk_storage.mx	18 Feb 2010 22:39:08 -0000	1.149.2.33
> @@ -697,156 +697,95 @@
>  	return BATload_intern(i);
>  }
>  @- BAT preload
> -To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload
> -request. 
> -Of course, it does not make sense to touch more then we can physically accomodate.
> +To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload request. 
> +Of course, it does not make sense to touch more then we can physically accomodate (budget).
>  @c
> -size_t 
> -BATaccess(BAT *b, int what, int advise, int preload) {
> -	size_t *i, *limit;
> -	size_t v1 = 0, v2 = 0, v3 = 0, v4 = 0;
> -	size_t step = MT_pagesize()/sizeof(size_t);
> -	size_t pages = (size_t) (0.8 * MT_npages());
> -
> -	assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
> -
> -	/* VAR heaps (inherent random access) */
> -	if ( what&USE_HEAD && b->H->vheap && b->H->vheap->base ) {
> -		if (b->H->vheap->storage != STORE_MEM && b->H->vheap->size > MT_MMAP_TILE) {
> -			MT_mmap_inform(b->H->vheap->base, b->H->vheap->size, preload, MMAP_WILLNEED, 0);
> -		}
> -		if (preload > 0 && pages > 0) {
> -			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->vheap\n", BATgetId(b), advise);
> -			limit = (size_t *) (b->H->vheap->base + b->H->vheap->free) - 4 * step;
> -			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> -			i = (size_t *) (((size_t)b->H->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> -			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
> -				v1 += *i;
> -				v2 += *(i + step);
> -				v3 += *(i + 2*step);
> -				v4 += *(i + 3*step);
> -			}
> -			limit += 4 * step;
> -			for (; i <= limit  && pages > 0; i+= step, pages--) {
> -				v1 += *i;
> -			}
> +/* modern linux tends to use 128K readaround  = 64K readahead
> + * changes have been going on in 2009, towards true readahead
> + * http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/mm/readahead.c 
> + * 
> + * Peter Feb2010: I tried to do prefetches further apart, to trigger multiple readahead
> + *                units in parallel, but it does improve performance visibly 
> + */
> +static size_t access_heap(str id, str hp, Heap *h, char* base, size_t sz, int touch, int preload, int advise) {
> +	size_t v0 = 0, v1 = 0, v2 = 0, v3 = 0, v4 = 0, v5 =0, v6 = 0, v7 = 0, page = MT_pagesize();
> +	int t = GDKms();
> +	if (h->storage != STORE_MEM && h->size > MT_MMAP_TILE) {
> +		MT_mmap_inform(h->base, h->size, preload, advise, 0);
> +		if (preload > 0) {
> +			void* alignedbase = (void*) (((size_t) base) & ~(page-1));
> +			size_t alignedsz = (sz + (page-1)) & ~(page-1);
> +			int ret = posix_madvise(alignedbase, sz, advise);
> +			if (ret) THRprintf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n", 
> +					h->filename, PTRFMTCAST alignedbase, alignedsz >> 20, advise, errno);
>  		}
>  	}
> -	if ( what&USE_TAIL && b->T->vheap && b->T->vheap->base ) {
> -		if (b->T->vheap->storage != STORE_MEM && b->T->vheap->size > MT_MMAP_TILE) {
> -			MT_mmap_inform(b->T->vheap->base, b->T->vheap->size, preload, MMAP_WILLNEED, 0);
> -		}
> -		if (preload > 0 && pages > 0) {	
> -			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->vheap\n", BATgetId(b), advise);
> -			limit = (size_t *) (b->T->vheap->base + b->T->vheap->free - sizeof(size_t)) - 4 * step;
> -			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> -			i = (size_t *) (((size_t)b->T->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> -			for (; i <= limit  && pages > 3; i+= 4*step, pages-= 4) {
> -				v1 += *i;
> -				v2 += *(i + step);
> -				v3 += *(i + 2*step);
> -				v4 += *(i + 3*step);
> -			}
> -			limit += 4 * step;
> -			for (; i <= limit  && pages > 0; i+= step, pages--) {
> -				v1 += *i;
> -			}
> +	if (touch && preload > 0) {
> +		/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> +		size_t *lo = (size_t *) (((size_t) base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> +		size_t *hi = (size_t *) (base + sz);
> +		for (hi -= 8*page; lo <= hi; lo += 8*page) {
> +			/* try to trigger loading of multiple pages without blocking */
> +			v0 += lo[0*page]; v1 += lo[1*page]; v2 += lo[2*page]; v3 += lo[3*page];
> +			v4 += lo[4*page]; v5 += lo[5*page]; v6 += lo[6*page]; v7 += lo[7*page];
>  		}
> +		for (hi += 7*page; lo <= hi; lo +=page) v0 += *lo;
>  	}
> +	IODEBUG THRprintf(GDKout,"#BATpreload(%s->%s,preload=%d,sz=%dMB,%s) = %dms \n", id, hp, preload, (int) (sz>>20), 
> +		(advise==BUF_WILLNEED)?"WILLNEED":(advise==BUF_SEQUENTIAL)?"SEQUENTIAL":"UNKNOWN", GDKms()-t);
> +	return v0+v1+v2+v3+v4+v5+v6+v7;
> +}
>  
> -	/* BUN heaps (no need to preload for sequential access) */
> -	if ( what&USE_HEAD && b->H->heap.base ) {
> -		if (b->H->heap.storage != STORE_MEM && b->H->heap.size > MT_MMAP_TILE) {
> -			MT_mmap_inform(b->H->heap.base, b->H->heap.size, preload, advise, 0);
> -		}
> -		if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
> -			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->heap\n", BATgetId(b), advise);
> -			limit = (size_t *) (Hloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
> -			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> -			i = (size_t *) (((size_t)Hloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> -			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
> -				v1 += *i;
> -				v2 += *(i + step);
> -				v3 += *(i + 2*step);
> -				v4 += *(i + 3*step);
> -			}
> -			limit += 4 * step;
> -			for (; i <= limit  && pages > 0; i+= step, pages--) {
> -				v1 += *i;
> -			}
> -		}
> -	}
> -	if ( what&USE_TAIL && b->T->heap.base ) {
> -		if (b->T->heap.storage != STORE_MEM && b->T->heap.size > MT_MMAP_TILE) {
> -			MT_mmap_inform(b->T->heap.base, b->T->heap.size, preload, advise, 0);
> +size_t 
> +BATaccess(BAT *b, int what, int advise, int preload) {
> +	ssize_t budget = (ssize_t) (0.8 * MT_npages());
> +	size_t v = 0, sz;
> +	str id = BATgetId(b);
> +	BATiter bi = bat_iterator(b);
> +
> +	assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
> +	if (BATcount(b) == 0) return 0;
> +
> +	/* HASH indices (inherent random access). handle first as they *will* be access randomly (one can always hope for locality on the other heaps) */
> +	if ( what&USE_HHASH || what&USE_THASH ) {
> +		gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
> +		if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base) {
> +			budget -= sz = (b->H->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
> +			v += access_heap(id, "hhash", b->H->hash->heap, b->H->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
>  		}
> -		if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
> -			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->heap\n", BATgetId(b), advise);
> -			limit = (size_t *) (Tloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
> -			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> -			i = (size_t *) (((size_t)Tloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> -			for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
> -				v1 += *i;
> -				v2 += *(i + step);
> -				v3 += *(i + 2*step);
> -				v4 += *(i + 3*step);
> -			}
> -			limit += 4 * step;
> -			for (; i <= limit  && pages > 0; i+= step, pages--) {
> -				v1 += *i;
> -			}
> +		if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base) {
> +			budget -= sz = (b->T->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
> +			v += access_heap(id, "thash", b->T->hash->heap, b->T->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
>  		}
> +		gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
>  	}
>  
> -	/* HASH indices (inherent random access) */
> -	if ( what&USE_HHASH || what&USE_THASH )
> -		gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
> -	if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base ) {
> -		if (b->H->hash->heap->storage != STORE_MEM && b->H->hash->heap->size > MT_MMAP_TILE) {
> -			MT_mmap_inform(b->H->hash->heap->base, b->H->hash->heap->size, preload, MMAP_WILLNEED, 0);
> +	/* we only touch stuff that is going to be read randomly (WILLNEED). Note varheaps are sequential wrt to the references, or small */
> +	if ( what&USE_HEAD) {
> +		if (b->H->heap.base) {
> +			char *lo = BUNhloc(bi, BUNfirst(b)), *hi = BUNhloc(bi, BUNlast(b)-1);
> +			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
> +			v += access_heap(id, "hbuns", &b->H->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>  		}
> -		if (preload > 0 && pages > 0) {
> -			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->hash\n", BATgetId(b), advise);
> -			limit = (size_t *) (b->H->hash->heap->base + b->H->hash->heap->size - sizeof(size_t)) - 4 * step;
> -			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> -			i = (size_t *) (((size_t)b->H->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> -			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
> -				v1 += *i;
> -				v2 += *(i + step);
> -				v3 += *(i + 2*step);
> -				v4 += *(i + 3*step);
> -			}
> -			limit += 4 * step;
> -			for (; i <= limit  && pages > 0; i+= step, pages--) {
> -				v1 += *i;
> -			}
> +		if (b->H->vheap && b->H->vheap->base) {
> +			char *lo = BUNhead(bi, BUNfirst(b)), *hi = BUNhead(bi, BUNlast(b)-1);
> +			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
> +			v += access_heap(id, "hheap", b->H->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>  		}
>  	}
> -	if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base ) {
> -		if (b->T->hash->heap->storage != STORE_MEM && b->T->hash->heap->size > MT_MMAP_TILE) {
> -			MT_mmap_inform(b->T->hash->heap->base, b->T->hash->heap->size, preload, MMAP_WILLNEED, 0);
> +	if ( what&USE_TAIL) {
> +		if (b->T->heap.base) {
> +			char *lo = BUNtloc(bi, BUNfirst(b)), *hi = BUNtloc(bi, BUNlast(b)-1);
> +			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
> +			v += access_heap(id, "tbuns", &b->T->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>  		}
> -		if (preload > 0 && pages > 0) {
> -			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->hash\n", BATgetId(b), advise);
> -			limit = (size_t *) (b->T->hash->heap->base + b->T->hash->heap->size - sizeof(size_t)) - 4 * step;
> -			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> -			i = (size_t *) (((size_t)b->T->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> -			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
> -				v1 += *i;
> -				v2 += *(i + step);
> -				v3 += *(i + 2*step);
> -				v4 += *(i + 3*step);
> -			}
> -			limit += 4 * step;
> -			for (; i <= limit  && pages > 0; i+= step, pages--) {
> -				v1 += *i;
> -			}
> +		if (b->T->vheap && b->T->vheap->base) {
> +			char *lo = BUNtail(bi, BUNfirst(b)), *hi = BUNtail(bi, BUNlast(b)-1);
> +			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
> +			v += access_heap(id, "theap", b->T->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>  		}
>  	}
> -	if ( what&USE_HHASH || what&USE_THASH )
> -		gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
> -
> -	return v1 + v2 + v3 + v4;
> +	return v;
>  }
>  @}
>  
> 
> Index: gdk_posix.mx
> ===================================================================
> RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_posix.mx,v
> retrieving revision 1.176.2.21
> retrieving revision 1.176.2.22
> diff -u -d -r1.176.2.21 -r1.176.2.22
> --- gdk_posix.mx	18 Feb 2010 01:03:55 -0000	1.176.2.21
> +++ gdk_posix.mx	18 Feb 2010 22:38:53 -0000	1.176.2.22
> @@ -909,10 +909,8 @@
>  		unload = MT_mmap_tab[i].usecnt == 0;
>  	}
>  	(void) pthread_mutex_unlock(&MT_mmap_lock);
> -	if (i >= 0 && preload > 0) 
> -		ret = posix_madvise(base, len, advise);
> -	else if (unload)
> -		ret = posix_madvise(base, len, MMAP_NORMAL);
> +	if (unload)
> +		ret = posix_madvise(base, len, BUF_SEQUENTIAL);
>  	if (ret) {
>  		stream_printf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, fd=%d, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
>  			      (i >= 0 ? MT_mmap_tab[i].path : ""), (i >= 0 ? MT_mmap_tab[i].fd : -1),
> 
> 
>
> _______________________________________________
> Monetdb-checkins mailing list
> Monetdb-checkins at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/monetdb-checkins
> 
> 

-- 
| Dr. Stefan Manegold | mailto:Stefan.Manegold at cwi.nl |
| CWI,  P.O.Box 94079 | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4199       |




