[Monetdb-developers] BAT sizes

Stefan Manegold Stefan.Manegold at cwi.nl
Mon Oct 15 16:29:32 CEST 2007


[I once again felt free to share this with the community ...]

Henning,

in case BAT capacities are significantly larger than their actual content
(count), this might indeed have a negative influence on performance.
(1) in case the BAT is memory-mapped when loaded, it "only" blocks some more
address space than strictly necessary (no problem on 64-bit systems,
potentially a problem on 32-bit systems);
(2) in case the BAT is malloced when loaded, it also occupies some more
memory than strictly necessary (potential problem on both 64- & 32-bit
systems).

However, unless there is some accurate estimation, it is often hard (or
virtually impossible) to "guess" a BAT's size before filling it; hence, a
"generous" initial size allocation is good to avoid expensive BAT extends.

In your case, I'm lost concerning which BATs you're talking about. The
shredder-generated pre_* (actually rid_*) BATs need to be allocated before
reading the document; hence, there is no knowledge about the number of
nodes in the document, and as far as I can tell no trivial way to estimate
this accurately. Hence, the shredder needs to guess something --- JanF can
tell more, I guess...
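
As a rough illustration of why a generous guess can pay off, here is a toy
model in Python (not MonetDB code). It counts how often a growing array has
to be extended, and how many entries get copied, for a given initial
capacity. The doubling growth factor, the full copy on every extend, and the
name fill_cost are all assumptions made for this sketch; the actual BAT
extend policy is not described in this thread.

```python
def fill_cost(n_entries, initial_capacity, growth=2.0):
    """Toy model: cost of filling an array whose final size is unknown.

    Returns (extends, entries_copied, final_capacity).
    ASSUMPTION: every extend multiplies the capacity by `growth` and
    copies all entries held so far (worst case for a realloc).
    """
    capacity = max(1, initial_capacity)
    extends = 0
    copied = 0
    while capacity < n_entries:
        copied += capacity  # the array is full when it gets extended
        capacity = max(capacity + 1, int(capacity * growth))
        extends += 1
    return extends, copied, capacity


# Small initial guess: many extends, but little slack at the end.
print(fill_cost(1_000_000, 1024))        # (10, 1047552, 1048576)
# Generous initial guess: no extends, at the price of unused capacity.
print(fill_cost(1_000_000, 1_200_000))   # (0, 0, 1200000)
```

In this model the small guess copies about as many entries as the array
finally holds, while the generous guess copies nothing but leaves 200000
slots unused, which is the trade-off discussed above.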

In case of the TIJAH indices, I have no clue at all, how/where/when they are
built and whether there might be better information available to not
overallocate but allocate only just enough space. You or some of your
colleagues in Twente should know all the details.

Finally, is there any concrete case where you actually experienced any
problems due to "over-allocation", or are you just wondering?

Stefan

ps: in the case given below, the batsize just fits the BAT's capacity; only
    the count is smaller than the capacity (obviously, it cannot be larger)
    --- if you want/need to know why, you better ask him/her who
    allocated/created/filled the "tj_DFLT_FT_INDEX_size1" BAT ...
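
To make the ps concrete, the following Python sketch (Python rather than MIL)
redoes the arithmetic on the figures Henning reports in the quoted session
below. The 4-byte entry width is an assumption, inferred only from the ratio
batsize / capacity; it is not stated anywhere in this thread. The helper name
leftover is made up for illustration.

```python
# Figures reported for the "tj_DFLT_FT_INDEX_size1" BAT.
count = 203_091_470        # t.count()
capacity = 260_898_816     # t.capacity()
batsize = 1_043_599_360    # t.batsize(), in bytes
batdsksize = 812_366_848   # t.batdsksize(), in bytes

ENTRY_BYTES = 4  # ASSUMPTION: inferred from batsize / capacity, not documented here


def leftover(total_bytes, entries, width=ENTRY_BYTES):
    """Bytes remaining after accounting for `entries` entries of `width` bytes."""
    return total_bytes - entries * width


unused_entries = capacity - count
unused_bytes = unused_entries * ENTRY_BYTES

print(leftover(batsize, capacity))   # 4096: batsize tracks the capacity
print(leftover(batdsksize, count))   # 968: batdsksize tracks the count
print(unused_entries)                # 57807346 allocated but unused entries
print(unused_bytes)                  # 231229384 bytes, roughly 220 MiB
```

Under that assumption, batsize follows the capacity and batdsksize follows
the count, each up to a small remainder that the thread does not explain.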


On Mon, Oct 15, 2007 at 02:36:46PM +0200, Henning Rode wrote:
> hej stefan,
> 
> sorry, that i did not answer earlier. i just wanted to report the
> actual sizes of pf/tijah indices in a paper. so that is done now.
> 
> still, i was asking myself, whether it might have any kind of
> performance influences, that BAT capacities are so much higher than the
> actual BAT counts. This is of course handy, when we still want to add
> new entries, but once we indexed a collection, we usually only query it.
> 
> in case of our "pre_size" BAT this difference between BATsize and
> BATdsksize can easily be 250MB or more.
> 
> best -henning
> 
> mil>var t := bat("tj_DFLT_FT_INDEX_size1");
> mil>t.count().print();
> +-----------+
> | 203091470 |
> +-----------+
> mil>t.capacity().print();
> +-----------+
> | 260898816 |
> +-----------+
> mil>t.batsize().print();
> +------------+
> | 1043599360 |
> +------------+
> mil>t.batdsksize().print();
> +-----------+
> | 812366848 |
> +-----------+
>
> mil>var x := t.copy();
> mil>x.count().print();
> +-----------+
> | 203091470 |
> +-----------+
> mil>x.capacity().print();
> +-----------+
> | 260898816 |
> +-----------+
> mil>x.batsize().print();
> +------------+
> | 1043599360 |
> +------------+
> mil>x.batdsksize().print();
> +-----------+
> | 812366848 |
> +-----------+
> mil>t.access(BAT_READ);
> 
> 
> Stefan Manegold wrote:
> > Henning,
> > 
> > you should also check & report b.capacity(), i.e.,
> > 
> > b.count();
> > b.capacity();
> > b.info().reverse().like("batBuns").like("size").print();
> > b.batsize();   
> > b.batdsksize();
> > 
> > var c := b.copy();
> > 
> > c.count();
> > c.capacity();
> > c.info().reverse().like("batBuns").like("size").print();
> > c.batsize();   
> > c.batdsksize();
> > 
> > Stefan
> > 
> > 
> > On Sat, Oct 06, 2007 at 07:43:28PM +0200, Stefan Manegold wrote:
> >> [felt free to cc the monetdb-developers list as more people might be
> >>  interested or want to contribute]
> >>
> >> Henning,
> >>
> >> are you just "concerned" or are you having concrete problems with the bat
> >> sizes?
> >>
> >> In any case, to give any reasonable answer we'd need to know more about the
> >> details. In particular, how large is the BAT you're talking about?
> >>
> >> I.e., with "b" being your BAT and "c := b.copy()", please check & report
> >>
> >> b.count();
> >> b.info().reverse().like("batBuns").like("size").print();
> >> b.batsize();
> >> b.batdsksize();
> >>
> >> c.count();
> >> c.info().reverse().like("batBuns").like("size").print();
> >> c.batsize();
> >> c.batdsksize();
> >>
> >> Stefan
> >>
> >>
> >> On Fri, Oct 05, 2007 at 01:47:01PM +0200, Henning Rode wrote:
> >>> hej stefan,
> >>>
> >>> thanks for the answer. so in conclusion, the over-allocation of memory
> >>> is quite normal, and nothing to worry about.
> >>>
> >>> i was more surprised that the copied BAT still has this considerable
> >>> over-allocation of memory, though it exactly knows how many entries it
> >>> needs to hold.
> >>>
> >>> groeten -henning
> >> -- 
> >> | Dr. Stefan Manegold | mailto:Stefan.Manegold at cwi.nl |
> >> | CWI,  P.O.Box 94079 | http://www.cwi.nl/~manegold/  |
> >> | 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
> >> | The Netherlands     | Fax : +31 (20) 592-4312       |
> > 
> 

-- 
| Dr. Stefan Manegold | mailto:Stefan.Manegold at cwi.nl |
| CWI,  P.O.Box 94079 | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4312       |



