[Monetdb-developers] MonetDB: default - The sample module for the monetdb5 server.

Roberto Cornacchia roberto.cornacchia at gmail.com
Tue Sep 27 12:10:55 CEST 2011


Yes, I agree about the difference in efficiency.
My considerations were only about usability and in particular from the point
of view of *generated* SQL.

The syntax I suggested works well in MonetDB, so yes, both can be useful,
with different focus.
They actually achieve the same result, in two different ways.
The SAMPLE clause is a special top-k, that draws k tuples in random order
The syntax I'm using is a special ranking. It ranks all tuples randomly and
then takes the first k.

To allow the SAMPLE syntax in subqueries would also imply to allow LIMIT in
subqueries, because they really boil down to the same semantics. Although I
do think SQL should allow this, I do not think that MonetDB/SQL should
deviate from the standard.

Roberto

On Mon, Sep 26, 2011 at 15:30, Lefteris <lsidir at gmail.com> wrote:

> Roberto,
>
> we were also puzzled with which syntax to adopt. The first issue is that
> the syntax is also dictating in a way implementation (although that sounds
> not correct:)). A syntax that has a order by rand() implies that you are
> attaching random numbers to the rows and then sort on that. This sorting is
> very coslty and it is the worst case since you have random order to begin
> with. Something like that would cost NlogN where N is the size of rows.
> Moreover, that syntax implies that you are sampling the base table (or some
> intermediate result anyway) and then "continue" the query evaluation.
> however sampling is not a very kind operator and cannot by easily pushed
> down: a sample of a join is not equal with the join of 2 samples. These are
> in my view the implications (or benefits!) of such a syntax. So MonetDB
> could as well support the syntax you propose and should!
>
> On the other hand, the syntax SAMPLE we just introduce, is applied solely
> on the final result set of the query, after or query evaluation is done. The
> algorithm has a complexity of S, where S is the size of the sample (as
> opposed to NlogN of the previous). Also , the sample operator has the same
> semantics of limits, it is an operator that shows a limited (random) portion
> of the entire (usually computed) result. that means that all joins,
> predicated etc. have been evaluated before a sample is taken. Of course we
> would love to have that functionality in sub-queries too, hence the
> workaround. What is really missing now is a function where the table
> definition is a parameter, indeed!
>
> With this 2 views in mind, I would suggest that both syntax are necessary
> to exist! Each of them serving a different purpose (and different underline
> implementation eventually).
>
> the other solution would be to allow SAMPLE is subqueries and "depart" from
> the SQL spirit.
>
> What do you think?
>
> lefteris
>
> On Mon, Sep 26, 2011 at 10:41 AM, Roberto Cornacchia <
> roberto.cornacchia at gmail.com> wrote:
>
>> Lefteris, I would like to add something about the syntax (and semantics)
>> of this feature.
>> See below
>>
>>
>>> +/*
>>> + * @- Uniform Sampling.
>>> + *
>>> + * A new SQL operator has been added to support sampling the result of a
>>> query.
>>> + * The syntax for sampling is:
>>> + * SELECT ... FROM ... WHERE ... SAMPLE s
>>> + *
>>> + * where s if is an integer greater than 1, it defines the number of
>>> rows to be
>>> + * in the sample. If s is a double between [0.0,1.0] the it refers to
>>> the
>>> + * percentage of the result to be sampled. That is if s=0.3 then the
>>> sample
>>> + * will be 30% the size of the query result.
>>> + *
>>>
>>
>> It has been discussed already whether this choice or an explicit PERCENT
>> is more intuitive.
>> Personally, I don't care much about that.
>>
>>
>>> + * SAMPLE is been treated as LIMIT, ORDER BY, etc., that means that it
>>> can only
>>> + * be in the outer most SELECT clause, i.e., SAMPLE cannot appear in a
>>> + * subquery.
>>
>>
>> This is much more interesting to me.
>> I see that SAMPLE fits in the same semantics as LIMIT. In fact, you can
>> think of it as a special LIMIT.
>> What I don't like about it, is that it then shares the same limitations
>> (which are absurd, as the whole SQL often is, from my point of view), which
>> is what you mention above about the subquery.
>> You suggest a workaround here:
>>
>> However, if this is needed, then one may define a function, for
>>> + * example
>>> + *
>>> + * CREATE FUNCTION mysample ()
>>> + * RETURNS TABLE(col a,...)
>>> + * BEGIN
>>> + *    RETURN
>>> + *      SELECT a,...
>>> + *      FROM name_table
>>> + *      SAMPLE 100;
>>> + * end;
>>> + *
>>> + * and then use function mysample() for example to populate a new table
>>> with
>>> + * the sample. E.g.,
>>> + *
>>> + * INSERT INTO sample_table (SELECT * FROM mysample());
>>> + *
>>> + *
>>
>>
>> To me, writing a function with an hard-coded table name in it is... weird!
>> Like writing a function that can only do 2+2.
>> (This could be much better if table identifiers were allowed as function
>> parameters, a feature that I'd love to see implemented!).
>>
>> To this, I much prefer a solution based on the window functions RANK()
>> OVER()  and ROW_NUMBER() OVER() which luckily are implemented in MonetDB,
>> and the function rand().
>>
>> Your example above becomes a regular query that has no nesting-issue:
>>
>> SELECT a,b,c
>> FROM (
>>   SELECT a,b,c, ROW_NUMBER() OVER(ORDER BY rand()) as my_rank
>>   FROM name_table
>> ) as x
>> where x.my_rank <= 100;
>>
>> Replace ROW_NUMBER() with RANK() to allow duplicates in the random sample.
>>
>> I'm not comparing the efficiency of the two solutions here, your solution
>> could well be more efficient.
>> I'm only comparing them from a usability point of view.
>>
>> I noticed that my own usage of the LIMIT clause is becoming less and less
>> frequent (in favour of the syntax I report here) because of the nesting
>> issue.
>> What I am wondering is, how much useful is to introduce a new syntax for
>> the sampling that has this (important, in my view) limitation? How often
>> would you need a sample in the outermost select clause only?
>>
>> Cheers,
>> Roberto
>>
>>
>> _______________________________________________
>> Checkin-list mailing list
>> Checkin-list at monetdb.org
>> http://mail.monetdb.org/mailman/listinfo/checkin-list
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.monetdb.org/pipermail/developers-list/attachments/20110927/066679a4/attachment.html>


More information about the developers-list mailing list