Bug 6763 - COPY INTO tables concurrently: the second round of import tasks takes a particularly long time or sometimes blocks
Summary: COPY INTO tables concurrently: the second round of import tasks takes a particularly long time or sometimes blocks
Status: NEW
Alias: None
Product: SQL
Classification: Unclassified
Component: all
Version: -- development
Hardware: x86_64 (amd64/em64t) Linux
Importance: High critical
Assignee: SQL devs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-20 13:03 CEST by xavier
Modified: 2019-09-26 11:38 CEST



Description xavier 2019-09-20 13:03:27 CEST
User-Agent:       Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36
Build Identifier: 

COPY INTO tables concurrently: the second round of import tasks takes a particularly long time or sometimes blocks.

The table:
create table tb1(
    linenum1 bigint,
    linenum2 bigint,
    linenum3 bigint,
    linenum4 bigint,
    linename1 varchar(128) null,
    time1 timestamp,
    time2 timestamp,
    linenum5 bigint,
    linename2 varchar(256),
    linename3 varchar(256),
    linename4 varchar(256),
    linename5 varchar(24) null,
    linenum6 bigint,
    linenum7 bigint,
    linenum8 bigint,
    linename6 varchar(256),
    linename7 varchar(256),
    linename8 varchar(256),
    linename9 varchar(64) null,
    linename10 varchar(64) null,
    linename11 varchar(64) null,
    linename12 varchar(64) null,
    linename13 varchar(64) null,
    linename14 varchar(64) null,
    linename15 text,
    linename16 varchar(64) null,
    linename17 varchar(64) null,
    mulrouflag boolean,
    intflag boolean,
    linenum9 bigint,
    linenum10 bigint,
    linenum11 bigint,
    linename18 varchar(256),
    linename19 varchar(256),
    linename20 varchar(256),
    linename21 varchar(4) null,
    linenum12 bigint,
    linenum13 bigint,
    linenum14 bigint,
    linename22 varchar(32) null,
    linename23 varchar(32) null,
    time3 timestamp,
    linenum15 bigint,
    linename24 char(2),
    linename25 char(2),
    linename26 varchar(256) null,
    linenum16 bigint not null
);
Environment:
  Dell R730: 2 x Intel Xeon E5-2630 v4 @ 2.2 GHz, 64 GB memory.

Data files: each CSV file contains about 10 million rows and is 8.6 GB in size.

Tasks: a single task imports 10 CSV files into one database table one by one; 10 such tasks run concurrently, each importing into its own table. The scripts read as follows:

[xavier@server1 bin]$ cat 1.sh 
#!/bin/bash
tablename=$1
dataname=$2
for k in {1..10}
do
	time ./mclient -d testdb -s "copy 10000000 records into $tablename from '/data/$dataname' using delimiters ',';"
done

[xavier@server1 bin]$ cat 2.sh 
#!/bin/bash
for i in {1..10}
do 
	./1.sh gre$i data_$i &
done


Reproducible: Always

Steps to Reproduce:
1. Execute script 2.sh
Actual Results:  
I use an HDD.
Importing the first CSV file takes 20 to 30 minutes per concurrent task.

When the 10 concurrent tasks start importing the second CSV file, a single CSV file takes more than 700 minutes. Sometimes it gets blocked entirely.

At this point, a large number of read operations can be seen on the disk holding the database, while the disk holding the import data shows no read or write activity at all. The CPU is mostly in a wait state (I/O wait).


But when I used an SSD, the same workload (10 concurrent persistent import tasks) ran without problems. The first CSV file import takes 20 to 30 minutes; after that the import time stays stable at 30 to 40 minutes. System resource monitoring: memory utilization is around 90%, and no abnormality is seen in CPU or disk.
Comment 1 Sjoerd Mullender 2019-09-24 15:03:00 CEST
For various reasons it's not a very good idea to run multiple COPY INTO queries concurrently.
Each COPY INTO query uses multiple threads (up to the number of cores in the system), and those threads do not take into account what else might be running in parallel.
Each COPY INTO necessarily works on all columns of the table in parallel (that's how the data is provided), so the complete table is basically in memory.  If multiple large tables are being worked on in parallel, this puts a lot of pressure on the memory subsystem.
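
For a rough sense of scale (a back-of-the-envelope estimate, assuming 8 bytes per bigint and per timestamp value and ignoring the string and boolean columns): your table has 16 bigint and 3 timestamp columns, so

    19 columns * 8 bytes            =  152 bytes of fixed-width data per row
    152 bytes * 100,000,000 rows    ~  15.2 GB per fully loaded table
    15.2 GB * 10 concurrent tables  ~  152 GB

which is far beyond the 64 GB of RAM in this machine, so the concurrent load inevitably ends up paging, and that hurts far more on an HDD than on an SSD.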

In addition to this, if you COPY INTO the same table sequentially, the columns likely have to be resized.  A resize operation can be expensive in that it may result in having to copy the data (and then you temporarily have two copies in memory).

My advice: do it sequentially, and for the first COPY INTO provide a count of the total number of rows (not just the number of rows in the first file) so that the columns can be preallocated; see the sketch below.
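
For example, your 1.sh could be revised roughly like this (only a sketch, keeping your layout of ten loads of 10 million rows each, i.e. 100 million rows in total):

#!/bin/bash
# Sketch: the first COPY INTO announces the total row count
# (10 * 10,000,000 = 100,000,000) so the columns are preallocated once;
# the remaining nine loads then append without forcing resizes.
tablename=$1
dataname=$2
time ./mclient -d testdb -s "copy 100000000 records into $tablename from '/data/$dataname' using delimiters ',';"
for k in {2..10}
do
	time ./mclient -d testdb -s "copy into $tablename from '/data/$dataname' using delimiters ',';"
done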

If you make sure that the COPY INTO queries are the only queries running, you can also use the LOCKED keyword, which may result in less copying of the data.
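
For example (a sketch using the table and file names from your 2.sh; LOCKED goes after the delimiters clause):

	copy 100000000 records into gre1 from '/data/data_1' using delimiters ',' locked;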
Comment 2 xavier 2019-09-26 11:38:01 CEST
The number of rows in each file is the same; I just divided one large file into 10 parts. The 10 files are imported into one table in order, and I perform this import for ten tables at the same time.


When I use the HDD, all 10 concurrent import tasks become particularly slow as soon as they reach the second file. However, when I switched to the SSD, there was no such problem.

Could it be that resizing the columns is what consumes so much time?