Huge spill files from the query that is running insert.
The parameter gp_workfile_* even doesn't succeed in stopping the query from running out of space and it also seems like these parameter has no effect on the query.
xxxx=# show gp_workfile_limit_files_per_query; gp_workfile_limit_files_per_query ----------------------------------- 100000 (1 row) Time: 0.810 ms xxxx=# show gp_workfile_limit_per_query xxxx-# ; gp_workfile_limit_per_query ----------------------------- 10GB (1 row) Time: 0.472 ms xxxx=# show gp_workfile_limit_per_segment; gp_workfile_limit_per_segment ------------------------------- 10GB (1 row) Time: 0.516 ms xxxx=# INSERT INTO tmp_verd_s_prik_0003 select t.ag , t.vsnr , t.aend_guelt_ab , a.* dssprod-# , case when (upper(trim(substr(a.aend_sachb,1,2)))) not in ('AV', 'BV', 'CV', 'DV', 'XV') then 'MASCH' when (upper(trim(a.aend_sachb))) like 'XV88%' then 'AVO' xxxx-# else upper(t193.orga_symbol) end orga_symb xxxx-# FROM xxxx a xxxx-# INNER JOIN xxxxx t xxxx-# ON (trunc(a.ver_id/100000)*100000) = t.pol_nr xxxx-# LEFT OUTER JOIN xxxx t193 xxxx-# ON upper(trim(t193.benutzer_nummer)) = case when (upper(trim(substr(a.aend_sachb,1,2)))) in ('AV', 'BV', 'CV', 'DV', 'XV') then 'XV'||trim(substr(a.aend_sachb,3,5)) xxxx-# else (upper(trim(a.aend_sachb))) end xxxx-# and t193.guelt_ab <= date(a.aend_beg_dat) xxxx-# and date(a.aend_beg_dat) < t193.guelt_bis xxxx-# WHERE mod(trunc(a.ver_id/10000000),1)= 0; ERROR: could not write to temporary file: No space left on device (seg192 sdw25:1025 pid=405640)
The size of the spill files overall was above the parameter gp_workfile_limit_per_query.
xx$ ssh sdw27 du -sh /data2/primary/gpseg14/base/3174952/pgsql_tmp/
xx$ ssh sdw27 "ls -R /data2/primary/gpseg14/base/3174952/pgsql_tmp/ | wc -l"
This is due to how the query operator (such as hash join / sort join) and the gp_workfile_limit_per_segment works. When you have a query with large gp_workfile_limit_files_per_query, which in our case is 100000 , the query plan operator ( like HashJoin / SortJoin etc) operator creates 65536 files at each segment, so thats a total of 65536 * (number of segments) files per host. Most of these files are very small or empty initially
[gpadmin@sdw27 pgsql_tmp]$ find -type f -size 0 | wc -l ; find -type f -size -1M | wc -l 39195 39127
Workfile reports usage (i.e from view gp_toolkit.gp_workfile_* ) of ~ 500MB per segment, but the OS reports high disk usage usage per segment (ie. 225GB in the above issue). The difference is coming from the internal allocation that the OS does for XFS for even small files (allocsize=16mb).
/dev/sdc on /data1 type xfs (rw,nodev,noatime,inode64,allocsize=16m) /dev/sdf on /data2 type xfs (rw,nodev,noatime,inode64,allocsize=16m)
If you allocate 16MB for each file and you create 64k files you need 1TB from the OS.
64k * 16MB = 1024 GB
So the RCA is that creating many small files on XFS filesystem with allocsize=16mb allocates huge empty blocks, and the workfile accounting does not take these into account , it only counts actually bytes written to the files, not the padding.
Reduce the gp_workfile_limit_files_per_query to a lower value like 10000, HashJoin now starts off with a small number of files (it knows that if it tries to create 65k files, it will fail). So it creates only 8k files per segment. This still uses a lot of extra disk space, but it fits on the current disk. Since we have fewer files and they contain more data, the accounting of workfile works, and the query is stopped when its usage reaches gp_workfile_limit_per_query (i.e 10GB in the above example).
Do not change the allocation size of xfs to below 16m as if we lower the allocsize we would have disk fragmentation and slow scan rate , so 16m is a recommended value to avoid such situation.