I have now tried the same bulk load with COPY on the latest version (v5.10.0) with a single-thread setting.

Failed again!

System:
OS X
MonetDB compiled in 64-bit
Number of files to load: 322
Total size of files: 15GB
Maximum rows per file: 3 million
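
For reference, the per-file load statements are of roughly this form (the
table and file names below are placeholders, not the actual schema): the
first file of a batch gives a record count large enough for the whole
load, and every later file gives only its own line count, as discussed
further down this thread.

  -- first file of a batch: claim space for the entire load up front
  COPY 650000000 RECORDS INTO mytable
  FROM '/data/part_001.csv' USING DELIMITERS ',', '\n';

  -- every subsequent file: N = the number of lines in that file
  COPY 3000000 RECORDS INTO mytable
  FROM '/data/part_002.csv' USING DELIMITERS ',', '\n';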

On Wed, Mar 18, 2009 at 11:51 AM, Martin Kersten <Martin.Kersten@cwi.nl> wrote:
Yue Sheng wrote:
I'm not sure how "The (parallel) load used scratch area as well" relates to the question.
If you look at the code, you will notice that there is a two-phase
loading process involving (possibly) multiple threads.

Sorry if I'm a bit slow.

On Wed, Mar 18, 2009 at 11:25 AM, Martin Kersten <Martin.Kersten@cwi.nl> wrote:

   Yue Sheng wrote:

       Sorry, if I wasn't clear on the first question:

       (1) We ramp up N for the first insert to claim sufficient space.
       Sure, I understand that one.

       But:

       The claimed space gets "given back" *right after* the first
       insert. (This is the part I don't understand.)

   The (parallel) load used scratch area as well

       Question: how do the second, third, ... inserts get the
       "benefit" of the ramp-up that we did for the first insert?

       Does this make it clearer what my question is about?

       Thanks.


       On Wed, Mar 18, 2009 at 10:26 AM, Martin Kersten
       <Martin.Kersten@cwi.nl> wrote:

          Yue Sheng wrote:

              Three questions that bother me are:
              (1) Why do we need to ramp N up to the total number of lines
              in the first insert?

          to let the kernel claim sufficient space

              The reason I ask is that right after the first insert, the
              allocation drops right down from, say, 100GB to 35GB, and
              stays roughly there for *all* subsequent inserts. I totally
              do not understand this.
              (2) In your opinion, based on this experience, what could be
              the potential problem here?

          little to none, as the files are memory mapped, which may only
          cause I/O on some systems

              (3) In your opinion, would the newer version cure the
              problem?

          a system can never correctly guess what will come,
          especially since the source of a COPY command need not be a file
          but standard input, i.e. a stream.


              Thanks.


              On Tue, Mar 17, 2009 at 10:51 PM, Martin Kersten
              <Martin.Kersten@cwi.nl> wrote:

                 Yue Sheng wrote:

                     Martin,

                     It almost worked...

                     This is what I did and what have happened:

                     I have 322 files to insert into the database,
                     totaling 650 million rows.

                     I divided the file list into two, then for each
                     sub-list:

                     (a) I insert the first file in the list with N set to
                     650 million rows; (b) all subsequent files have N set
                     to the number of lines in *that* file.

                     Once the first list is done, then

                     (c) I insert the first file in the second list with N
                     set to 650 million rows;
                     (d) all subsequent files have N set to the number of
                     lines in *that* file.

                     Then the same problem happened: it got stuck at file
                     number 316.


                 ok. using the 650M enables MonetDB to allocate enough
                 space, so it does not have to fall back on guessing.
                 Guessing is painful, because when a file of N records has
                 been created and it needs more, it makes a file of size
                 1.3xN. This leads to memory fragmentation.

                 in your case i would have been a little more generous and
                 used 700M as a start, because a miscalculation of 1 gives
                 a lot of pain. Such advice is only needed in (a)


                     Note: This is farther than previous tries, which all
                     stopped in the region of file number 280 +/- a few.

                     My observation:
                     (i) at (a), the VSIZE went up to around 46GB, then
                     after the first insert, it dropped to around 36GB

                 ok fits

                     (ii) at (c), the VSIZE went up to around 130GB, then
                     after the first insert, it dropped to around 45GB

                 you tell the system to extend the existing BATs to
                 prepare for another 650 M, which means it allocates
                 2*36 GB; plus room for the old one, that gives 108GB.
                 Then during processing some temporary BATs may be needed,
                 e.g. to check integrity constraints after each file.
                 Then it runs out of swap space.

                     (iii) the "Free Memory", as reported by Activity
       Monitor,
              just
                     before it failed at file number 316, dipped to as
       low as
              7.5MB!

                 yes, you are running out of swap space on your system.
                 This should not have happened, because the system uses
                 mmapped files; it may be an issue with MacOS or related
                 to a problem we fixed recently



                     My question:
                     (1) why do we need to ramp N up to the total number
                     of lines (it takes a long time to do that), only to
                     have it drop down to 30GB-40GB right after

                 this might indicate that on MacOS, just like on Windows,
                 mmapped files need to be written to disk. With a disk
                 bandwidth of 50MB/sec that still takes several minutes

                     the first insert and stay roughly there? Does that
                     mean we're giving all the pre-allocated space back to
                     the OS? Then should we always set N to the total
                     number of lines? If so, it would take much, much
                     longer to process all the files...
                     (2) How come RSIZE never goes above 4GB?
                     (3) Does the sql log file size have some limit that
                     we need to tweak?

                 no limit

                     (4) Has anyone successfully implemented the 64-bit
                     version of MonetDB and successfully inserted more
                     than 1 billion rows?

                 your platform may be the first, but Amherst has worked
                 with Macs for years

                     (5) when you say "...The VSIZE of 44G is not too
                     problematic, i am looking at queries letting it
                     tumble between 20-80 GB...," what does that mean?
                     Mine went up to as high as 135GB...

                 explained above.

                 regards, Martin


                     Thanks, as always.