帳票開発の諸々 (Oracle BI Publisher Blog): AWKによる集計性能その２ (AWK performance, Part 2)

前回に続き、AWKの性能検証を続けます。
今回はレコード（行）のサイズを256バイトに減らして計測を行います。

【検証環境】
前回と同様です。

【事前準備】
データ生成のソースは以下の通りです。
前回からの変更は、固定文字 "x" の数を減らしてサイズを調整する点のみです。

List 4: createData.pl

#!/usr/bin/perl

use strict;
use warnings;

foreach my $i ( 1 .. 1000000 ){
    print sprintf( "%010d,",     $i);                   # row number
    print "x" x 211 . ",";                              # fixed text (dummy)
    print sprintf( "id%02d,",  int( rand(100) ));       # Key-1: eg) country id
    print sprintf( "id%05d,",  int( rand(1000) ));      # Key-2: eg) branch id
    print sprintf( "id%010d,", int( rand(100000000) )); # Key-3: eg) customer id
    print sprintf( "%7d\n",    int( rand(1000000) ));   # value
}

生成結果は、1000万行で約2.4GBとなります。今回は4000万行、および1億行まで増やして検証します。

【検証結果】
各サイズのファイルに対し、検証を5回行った平均は以下の通りです。

Table 2: Result

File size	Records (rows)	Elapsed time (average, mm:ss)
2.4GB	10,000,000	1:25
9.6GB	40,000,000	5:39
24GB	100,000,000	14:22

Figure 2: Result (256 bytes record)

1億行まで、単位時間当たりの性能劣化はほとんどありません。

【多重実行】
続いて、多重実行の検証を行います。スクリプト例は以下の通りです。

List 5: parallel.sh

#!/usr/bin/bash
cat sampledata_a.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.a &
cat sampledata_b.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.b &
cat sampledata_c.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.c &
cat sampledata_d.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.d &
cat sampledata_e.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.e &
cat sampledata_f.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.f &
cat sampledata_g.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.g &
cat sampledata_h.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.h &

wait

cat result.sub.[a-h] > result.txt

データは先ほどの256バイト長の2.4GBのテキストファイル（1ファイルあたり1千万行）を使用します。実行結果は以下の通りです。

Table 3: Result

Parallelism	Records (rows)	Elapsed time (average, mm:ss)
4	10,000,000 * 4 = 40 Million	1:27
8	10,000,000 * 8 = 80 Million	1:30
12	10,000,000 * 12 = 120 Million	1:36
16	10,000,000 * 16 = 160 Million	2:08

Figure 3: Result (Parallel Processing)

マシンの物理コア数（12コア）まではほぼリニアに性能が向上しています。概ね良好な結果が得られました。
16多重では処理時間が大幅に増加していますが、単位時間当たりの処理性能（処理行数）はほぼ変わっていません。

1億2千万件（12多重）の集計に1分30秒程度という処理性能は非常に魅力的であると言えます。

[Summary]
Again, AWK performance test. The record size is reduced to 256 bytes.

[Hardware]
Same as the previous post.

[Source code]
Please see List 4.

[Result]
Table 2 and Figure 2 show the number of target records and the average of process time.

[Parallel Processing]
List 5, the script performs parallel processing. Its results are shown in Table 3 and Figure 3.
120 million records are summarized in one and half minute (1:36). Quite nice.

帳票開発の諸々 (Oracle BI Publisher Blog)

2013/01/07

AWKによる集計性能その２ (AWK performance, Part 2)

0 件のコメント:

コメントを投稿

2013/01/07

AWKによる集計性能 その２ (AWK performance, Part 2)

0 件のコメント:

コメントを投稿

AWKによる集計性能その２ (AWK performance, Part 2)