帳票開発の諸々 (Oracle BI Publisher Blog): 2013

2013/03/16

MySQLでの小計と合計 (MySQL: Subtotal and Grand Total)

過去の「SQLで小計と合計を取得する (SQL: Subtotal and Grand Total)」では、オラクルDBでの集計を行いました。
たまたまMySQLを使う機会があったため、今回はMySQLを使用して小計と合計を求めます。構文と結果は以下の通りです。

List 1: createData.pl

mysql> select deptno, job, sum(sal) as sum_sal
    ->   from EMP
    ->   group by deptno, job with rollup;
+--------+-----------+---------+
| deptno | job       | sum_sal |
+--------+-----------+---------+
|     10 | CLERK     |    1300 |
|     10 | MANAGER   |    2450 |
|     10 | PRESIDENT |    5000 |
|     10 | NULL      |    8750 |
|     20 | ANALYST   |    6000 |
|     20 | CLERK     |    1100 |
|     20 | MANAGER   |    2975 |
|     20 | NULL      |   10075 |
|     30 | CLERK     |     950 |
|     30 | MANAGER   |    2850 |
|     30 | SALESMAN  |    5600 |
|     30 | NULL      |    9400 |
|   NULL | NULL      |   28225 |
+--------+-----------+---------+
13 rows in set (0.00 sec)

Dev Shed Forumsではcoalesceを使用したサンプルが紹介されています。
http://forums.devshed.com/mysql-help-4/group-by-with-rollup-replace-null-value-580155.html

List 2: Workaround 1

select coalesce( mth , 'GRAND TOTAL' ) as dates,
       sum( salary ) as sum_salary
  from ( select month(date) as mth,
                salary
           from tb1 ) as dt
 group by mth with rollup

stackoverflow では、ifnullを使用したサンプルが紹介されています。この例では複数列の集計を一度に行っています。
http://stackoverflow.com/questions/6822666/how-can-i-replace-null-category-titles-in-mysql-rollup-function

List 3: Workaround 2

SELECT IFNULL(name, 'MainTotal') AS name,
       IFNULL(location, 'SubTotal') AS location,
       SUM(IF(division='OEM',totalHours,NULL)) AS OEM,
       SUM(IF(division='A/M',totalHours,NULL)) AS AM,
       SUM(IF(division='SKF',totalHours,NULL)) AS SKF,
       SUM(IF(division='RE',totalHours,NULL)) AS RE,
       location as location
  from $databasetable
 group by location, name with rollup

なお、明記はされていませんが、これらの回避策は集計キーとなる列にNull値が含まれていない事を前提としていますので注意が必要です。上記のサンプルを使用する際には、「元データにNULLが含まれていない」という前提を確認して下さい。

なお、MySQLには、集計列を判別する為の関数（Oracleのgrouping関数に相当）が提供されていないようです。

参考資料：マニュアル（日本語: MySQL 5.1）
http://dev.mysql.com/doc/refman/5.1/ja/group-by-modifiers.html

[Summary]
Not only Oracle, but MySQL also provides rollup function. List 1, List 2 and List 3 shows the basic usage.
You may need to determine the summarized row, like Oracle "grouping" function. But MySQL does not provide that kind of function.

References:

Manual (English: MySQL 5.6)
http://dev.mysql.com/doc/refman/5.6/en/group-by-modifiers.html

Workaround (Dev Shed Forums)
http://forums.devshed.com/mysql-help-4/group-by-with-rollup-replace-null-value-580155.html

Workaround (stackoverflow)
http://stackoverflow.com/questions/6822666/how-can-i-replace-null-category-titles-in-mysql-rollup-function

2013/01/28

BI PublisherからMySQLに接続する (Connect to MySQL)

WebLogicにはあらかじめMySQL用のJDBCドライバが組み込まれています。

List 1: MySQL JDBC Driver

[oracle@server1 ~ ]$ find $WL_HOME/server/lib/ -name mysql*.jar
/opt/oracle/middleware/wlserver_10.3/server/lib/mysql-connector-java-commercial-5.1.14-bin.jar

パスも通されている為、別段の準備の必要もなく使用できます。これを利用して、Oracle BI PublisherからMySQLへの接続を確認します。

設定は以下の通りです。

Driver Class: com.mysql.Driver
Connection String: jdbc:mysql//{server_name}:{port}/{db_name}

Figure 1: Connection succeeded

問題なく接続できます。

[Summary]
MySQL JDBC driver is registered with WebLogic library by default, and is ready to connect always.
List 1 shows the connection settings.

2013/01/07

AWKによる集計性能その２ (AWK performance, Part 2)

前回に続き、AWKの性能検証を続けます。
今回はレコード（行）のサイズを256バイトに減らして計測を行います。

【検証環境】
前回と同様です。

【事前準備】
データ生成のソースは以下の通りです。
前回からの変更は、固定文字 "x" の数を減らしてサイズを調整する点のみです。

List 4: createData.pl

#!/usr/bin/perl

use strict;
use warnings;

foreach my $i ( 1 .. 1000000 ){
    print sprintf( "%010d,",     $i);                   # row number
    print "x" x 211 . ",";                              # fixed text (dummy)
    print sprintf( "id%02d,",  int( rand(100) ));       # Key-1: eg) country id
    print sprintf( "id%05d,",  int( rand(1000) ));      # Key-2: eg) branch id
    print sprintf( "id%010d,", int( rand(100000000) )); # Key-3: eg) customer id
    print sprintf( "%7d\n",    int( rand(1000000) ));   # value
}

生成結果は、1000万行で約2.4GBとなります。今回は4000万行、および1億行まで増やして検証します。

【検証結果】
各サイズのファイルに対し、検証を5回行った平均は以下の通りです。

Table 2: Result

File size	Records (rows)	Elapsed time (average, mm:ss)
2.4GB	10,000,000	1:25
9.6GB	40,000,000	5:39
24GB	100,000,000	14:22

Figure 2: Result (256 bytes record)

1億行まで、単位時間当たりの性能劣化はほとんどありません。

【多重実行】
続いて、多重実行の検証を行います。スクリプト例は以下の通りです。

List 5: parallel.sh

#!/usr/bin/bash
cat sampledata_a.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.a &
cat sampledata_b.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.b &
cat sampledata_c.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.c &
cat sampledata_d.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.d &
cat sampledata_e.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.e &
cat sampledata_f.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.f &
cat sampledata_g.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.g &
cat sampledata_h.txt | awk -F "," '{ s[$3 $4] += $6; c[$3 $4] += 1} END { for( x in s) print x,s[x],c[x] }' | sort > result.sub.h &

wait

cat result.sub.[a-h] > result.txt

データは先ほどの256バイト長の2.4GBのテキストファイル（1ファイルあたり1千万行）を使用します。実行結果は以下の通りです。

Table 3: Result

Parallelism	Records (rows)	Elapsed time (average, mm:ss)
4	10,000,000 * 4 = 40 Million	1:27
8	10,000,000 * 8 = 80 Million	1:30
12	10,000,000 * 12 = 120 Million	1:36
16	10,000,000 * 16 = 160 Million	2:08

Figure 3: Result (Parallel Processing)

マシンの物理コア数（12コア）まではほぼリニアに性能が向上しています。概ね良好な結果が得られました。
16多重では処理時間が大幅に増加していますが、単位時間当たりの処理性能（処理行数）はほぼ変わっていません。

1億2千万件（12多重）の集計に1分30秒程度という処理性能は非常に魅力的であると言えます。

[Summary]
Again, AWK performance test. The record size is reduced to 256 bytes.

[Hardware]
Same as the previous post.

[Source code]
Please see List 4.

[Result]
Table 2 and Figure 2 show the number of target records and the average of process time.

[Parallel Processing]
List 5, the script performs parallel processing. Its results are shown in Table 3 and Figure 3.
120 million records are summarized in one and half minute (1:36). Quite nice.