Optimasi Data Analytic/OLAP Menggunakan Tabel Agregasi pada Pentaho Schema Workbench Part II


Seperti Starwars, pembahasan optimasi data analytic/OLAP menggunakan agregasi ini dimulai dari tengah. InsyaAllah kalau ada waktu akan ditulis bagaimana membuat query hasil agregasi menggunakan aggregation designer (part I nya). Untuk tulisan kali ini, saya copy paste dari blog http://herwinrayen.wordpress.com/.

Selamat membaca :)

Introduction

Aggregation is one way of speeding up the Mondrian report performance by improving the performance of the query. Measures are summed up for selected dimensions from the original base fact table. An aggregate table coexists with the base fact table, and contains pre-aggregated measures build from the fact table. It is registered in Mondrian’s schema, so that Mondrian can choose to use whether to use the aggregate table rather than the fact table.

For example,for this post I am using one of my design (Mondrian.PIFCurrent.xm) analyzer file for implementing aggregation cube. For this cube I have created aggregate table for NAIC dimension and if we pull NAIC dimension with Policy Count (Measure), then the Mondrian uses NAIC aggregate table which holds fewer rows instead of PIF Current Analyzer (FactPIFCurrent) fact table which hold millions of rows. This will improve the performance of the Analyzer report.

 Aggregation Techniques

Mondrian supports two aggregation techniques:

  • Lost dimension
  • Collapsed dimension

Lost Dimension:

The dimensions that are not used in Aggregate table but used in fact table are called as Lost Dimension

For Example:

Consider fact table “FactPIFCurrent” with column NIAC,fk_agencyagent_id,PolicyCount and we created aggregate table in name “agg_NAIC_FactPIFCurrent” with columns NAIC and PolicyCount only. Here, fk_agencyagent_id is lost, so it is referred as lost dimension.

Collapsed Dimension:

In “collapsed” dimension, the dimension key in the aggregate table is replaced with a set of dimension levels; the dimension key column is replaced with a set of columns; a fully renormalized summary table for that dimension.

For Example:

Consider fact table “FactPIFCurrent” with column NIAC, fk_agencyagent_id, PolicyCount and we created aggregate table in name “agg_Agent_FactPIFCurrent” with columns NAIC, Head, Branch, Division, Agent,PolicyCount. Here, fk_agencyagent_id is replaced with Head, Branch, Division, Agent column from dimAgencyAgent table, so it is referred as collapsed dimension.

Defining Aggregation Table

A fact table can have zero or more aggregate tables. Every aggregate table is associated with just one fact table. There are predefined rules while creating aggregate tables.

  • The aggregate table’s name must comply with the aggregate table name rule. Represented as a template regular expression the rule is:

agg_+_(fact table name)

As an example, if aggregation table is created for using NAIC dimension, then it can be named as “agg_NAIC_factPiffCurrent”.

  • One measure column in name “fact_count” should be created by default in all aggregation tables.
  • If foreign key is used in aggregate table, then it should be same name as used in fact table. If the fact table had foreign key columns “agent_id” and “time_id”, then the following aggregate table columns (for example) would match:

 time_id
agent_id
TIME_ID
AGENT_ID
time_ID
AGENT_id

  • Aggregation dimension column name should be same as cube Hierarchy name/ Level name / Level column name.  If any of these have space characters, ‘ ‘, these are mapped to underscore characters, ‘_’, and, similarly, dot characters, ‘.’, are also mapped to underscores. So, if the hierarchy_name is “Time”, level_name is “Month” and level_column_name is month_of_year, the possible aggregate table column names are:

 time_month

time_month_of_year

month_of_year

For this rule, the “hierarchy_name” and “level_name” are converted to lower case while the “level_column_name” must match exactly.

  • Aggregation measure column name should be same as fact table cube’s measure name.
  • All measures column from fact table should be included in each aggregated tables.

Defining Aggregation Cube

Mondrian uses all aggregation tables by default which coexists with the base fact table, without aggregation tables the fact table cube’s does not exist. On a per cube basis, in a schema file a user can both include and exclude aggregate tables. Following are the elements used for designing aggregation Cube.

  • AggName element which identifies the name of an aggregate table.
  • AggFactCount element holds the default fact_count column in aggregated table
  • AggExclude element used for excluding aggregate table.
  • AggIgnoreColumn element is used to specifically state to Mondrian that the given aggregate columns should be ignored.
  • AggForeignKey elements define mappings from the base fact table foreign key column names into the aggregate table column names.
  • AggMeasure elements map names defined in the cube’s schema, to the aggregate table’s column names.
  • AggLevel elements map names defined in the cube’s schema, to the aggregate table’s column names.
  • AggPattern elements match the candidate aggregate table names using a regular expression.

In a given Table element, all of the AggExclude are applied first, followed by the AggName element rules and then the AggPattern rules.

Note: With a single line in the cubes definition in the schema file, one can force Mondrian not to recognize any aggregate tables for the cube’s fact table. As an example, for the PIFCurrent cube the following excludes all aggregate tables because the regular expression pattern “.*” matches all candidate aggregate table names.

<Table name=”factPIFCurrent” >     <AggExclude pattern=”.*” /></Table>

Configuration

Need to change the following property in Mondrian.properties file.

  • mondrian.rolap.aggregates.Use=true
  • mondrian.rolap.aggregates.Read=true
  • mondrian.native.crossjoin.enable=true
  • mondrian.native.nonempty.enable=false

Aggregation Design

For this post as said in Introduction, I have used two aggregation tables in “Mondrian.PIFCurrent.xml” analyzer file for implementing aggregation cube.

  • One with NAIC/NEW RENEWAL ROLLOVER/POLICY EFF MONTH/PROPERTY TYPE/POLICY COUNT/WRITTEN PREMIUM combination and its aggregation table name is agg_Premium_FactPIFCurrent.

Table Script
CREATE TABLE [dbo].[agg_Premium_FactPIFCurrent](
[NAIC] [varchar](10)NOTNULL,
[NewRenewalRollover Name] [varchar](100)NOTNULL,
[policy eff month] [varchar](100)NOTNULL,
[Property Type Hirearchy] [varchar](1000)NOTNULL,
[Policy Count] NUMERIC(20, 0)NULL,
[Building Coverage] NUMERIC(38,2)NULL,
[Contents Coverage] NUMERIC(38,2)NULL,
[Total Premium] NUMERIC(38,2)NULL,
[Written Premium] NUMERIC(38,2)NULL,
[Policy Fee] NUMERIC(38,2)NULL,
[fact_count] [int] NOTNULL,
)ON [PRIMARY] 

Load Script:
INSERTINTO [agg_Premium_FactPIFCurrent]
(
[NAIC],
[NewRenewalRollover Name],
[policy eff month],
[Property Type Hirearchy],
[Policy Count],
[Building Coverage],
[Contents Coverage],
[Total Premium],
[Written Premium],
[Policy Fee],
[fact_count]
)

SELECT
fact.NAIC,
dother.[NewRenewalRollover Name],
dother.[Policy Eff Month Name],
dtype.[Property Type Description],
count(fact.[PolicyRowId]),
SUM(fact.[Building Coverage]),
SUM(fact.[Contents Coverage]),
SUM(fact.[Total Premium]),
SUM(fact.[Written Premium]),
SUM(fact.[Policy Fee]),
COUNT(*)AS fact_count
FROM
FACTPIFCURRENT as fact
join dimother as dother on(fact.FK_Other_ID=dother.ID)
join dimType as dtype on(fact.FK_Type_ID = dtype.ID)
GROUPBY
fact.NAIC,
dother.[NewRenewalRollover Name],
dother.[Policy Eff Month Name],
dtype.[Property Type Description]

  • The other with HEAD/BRANCH/DIVISION/AGENT/POLICY COUNT combination by referring its key value “FK_AgencyAgent_ID“ from factPIFCurrent fact table and its aggregation table name is agg_HBDAPCnt_FactPIFCurrent.

Table Script:

CREATE TABLE [dbo].[agg_HBDAPCnt_FactPIFCurrent](
[NAIC] [varchar](10)NOTNULL,
[FK_AgencyAgent_ID] [int] NOTNULL,
[Policy Count] NUMERIC(20, 0)NULL,
[Building Coverage] NUMERIC(38,2)NULL,
[Contents Coverage] NUMERIC(38,2)NULL,
[Total Premium] NUMERIC(38,2)NULL,
[Written Premium] NUMERIC(38,2)NULL,
[Policy Fee] NUMERIC(38,2)NULL,
[fact_count] [int] NOTNULL,
)ON [PRIMARY]

Load Script:

INSERTINTO agg_HBDAPCnt_FactPIFCurrent
(
[NAIC],
[FK_AGENCYAGENT_ID],
[Policy Count],
[Building Coverage],
[Contents Coverage],
[Total Premium],
[Written Premium],
[Policy Fee],
[fact_count]
)
1
SELECT
fact.NAIC,
fact.[FK_AGENCYAGENT_ID],
count(fact.[PolicyRowId]),
SUM(fact.[Building Coverage]),
SUM(fact.[Contents Coverage]),
SUM(fact.[Total Premium]),
SUM(fact.[Written Premium]),
SUM(fact.[Policy Fee]),
COUNT(*)AS fact_count
FROM
FACTPIFCURRENT as fact
groupby
fact.NAIC,
fact.[FK_AGENCYAGENT_ID]
After creating aggregate table, follow the below steps to modify “Mondrian.PIFCurrent.xml” analyzer file using SWB (workbench tool) for including aggregation table.

Step 1:
Right click on the fact table, add aggregate name and include aggregate table name “agg_Premium_FactPIFCurrent”.

Step 2:
Default “fact_count” measure column in aggregate table should be used in AggFactCount attribute in design.

Step 3:
Add aggregate measure column,

Step 4:
Similarly, add all Measure columns created in aggregate table “agg_Premium_FactPIFCurrent”.
Step 5:
Add aggregate dimension column,

Step 6:
Similarly, add all dimension columns created in aggregate table “agg_Premium_FactPIFCurrent”.
Step 7:
Follow Step 1 to Step 5 for including aggregate table “agg_HBDAPCnt_FactPIFCurrent”.

Step 8:
Publish the cube and test the Analyzer report.

Output

Following is the screen shot and SQL generated for each aggregate table.

  • NAIC/NEW RENEWAL ROLLOVER/POLICY EFF MONTH/PROPERTY TYPE/POLICY COUNT/WRITTEN PREMIUM

Screen Shot:

SQL:

Select
“agg_Premium_FactPIFCurrent”.”NAIC” as “c0″, “agg_Premium_FactPIFCurrent”.”NewRenewalRollover Name” as “c1″, “agg_Premium_FactPIFCurrent”.”policy eff month” as “c2″, “agg_Premium_FactPIFCurrent”.”Property Type Hirearchy” as “c3″, “agg_Premium_FactPIFCurrent”.”Policy Count” as “m0″, “agg_Premium_FactPIFCurrent”.”Written Premium” as “m1″
From
“agg_Premium_FactPIFCurrent” as “agg_Premium_FactPIFCurrent”
Where
“agg_Premium_FactPIFCurrent”.”NAIC” in (36560, 99999) and “agg_Premium_FactPIFCurrent”.”policy eff month” in (‘Apr 2011′, ‘Aug 2010′, ‘Dec 2010′, ‘Feb 2011′, ‘Jan 2011′, ‘Jul 2011′, ‘Jun 2011′, ‘Mar 2011′, ‘May 2011′, ‘Nov 2010′, ‘Oct 2010′, ‘Sep 2010′) and “agg_Premium_FactPIFCurrent”.”Property Type Hirearchy” in (‘Commercial’, ‘Residential’)

  • HEAD/BRANCH/DIVISION/AGENT/POLICY COUNT

Screen Shot:

SQL:
Select
“_ActiveCompany”.”NAIC” as “c0″, “_dimAgencyAgent”.”Head” as “c1″, “_dimAgencyAgent”.”Branch” as “c2″, “_dimAgencyAgent”.”Division” as “c3″, “_dimAgencyAgent”.”Agent Number” as “c4″,
“agg_HBDAPCnt_FactPIFCurrent”.”Policy Count” as “m0″
From
“refActiveCompanyList” as “_ActiveCompany”, “dimAgencyAgent” as “_dimAgencyAgent”, “agg_HBDAPCnt_FactPIFCurrent” as “agg_HBDAPCnt_FactPIFCurrent”
Where
“agg_HBDAPCnt_FactPIFCurrent”.”FK_AgencyAgent_ID” = “_dimAgencyAgent”.”ID” and “_dimAgencyAgent”.”NAIC” = “_ActiveCompany”.”NAIC” and “_ActiveCompany”.”NAIC” in (36560, 99999) and “_dimAgencyAgent”.”Head” = ’10049′ and “_dimAgencyAgent”.”Branch” = ’00000′ and “_dimAgencyAgent”.”Division” = ’000′

Miscellaneous Topics

  • Mondrian query fact table for those dimension which is not used in aggregate table.
  • We can improve the performance of the aggregation by indexing the aggregate table.
  • We can able to create 80-100 aggregate tables per cube.
About these ads
Comments
One Response to “Optimasi Data Analytic/OLAP Menggunakan Tabel Agregasi pada Pentaho Schema Workbench Part II”
  1. Gurjeet says:

    nice post …. thanks for sharing knowledge :-)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: