KEMBAR78
Learning Open Source Business Intelligence | PDF
Open Source BI
Deep Dive




      Ravi Samji
      21/04/2011
Agenda
Business Intelligence – Why, What & Who?
Open Source BI – Introduction, Tech Stack
OLAP Engine – Mondrian
UI Layer – JPivot
Performance & Scalability
Constraints
About Yodlee
Business Intelligence – Why?
Data is the biggest asset
   Structured and Unstructured format
Most of our assets are buried
Helps us understand customer behavior
Helps us deliver better business value
Measure performance
Business Intelligence – What?
Reporting
Analytics
Data/Text Mining
ETL
Predictive Analytics
Business Intelligence – Who?
Open Source BI – Introduction
Mondrian – OLAP Engine
   Initially Independent Open Source Initiative
   Now Part of Pentaho Open Source BI Suite
100% Pure Java
Supports MDX and XML/A
Bundled With Other Open Source Packages
Open Source BI – Tech Stack

    JFreeChart              WCF




                                  log4j
                                  log4j
                  JPivot

                 Mondrian




                  RDBMS
OLAP Engine – Mondrian
Cube Definition – schema.xml
MDX – Query language to access multi dimensional data
Operates on normalized relational database
Mondrian – schema.xml
Logical model of a multi dimensional database
Cube, VirtualCube
Dimensions, Hierarchies, Levels
Measure, CalculatedMember
Logical Model – Multi Dimensional
<Schema>
                  Database
 <Cube name="Sales">
  <Table name="sales_fact_1997"/>
  <Dimension name="Gender" foreignKey="customer_id">
   <Hierarchy hasAll="true" allMemberName="All Genders" primaryKey="customer_id">
    <Table name="customer"/>
    <Level name="Gender" column="gender" uniqueMembers="true"/>
   </Hierarchy>
  </Dimension>
  <Dimension name="Time" foreignKey="time_id">
   <Hierarchy hasAll="false" primaryKey="time_id">
    <Table name="time_by_day"/>
    <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/>
    <Level name="Quarter" column="quarter" uniqueMembers="false"/>
    <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/>
   </Hierarchy>
  </Dimension>
  <Measure name="Unit Sales" column="unit_sales" aggregator="sum" formatString="#,###"/>
  <Measure name="Store Sales" column="store_sales" aggregator="sum" formatString="#,###.##"/>
  <Measure name="Store Cost" column="store_cost" aggregator="sum" formatString="#,###.00"/>
  <CalculatedMember name="Profit" dimension="Measures" formula="[Measures].[Store Sales] - [Measures].[Store Cost]">
   <CalculatedMemberProperty name="FORMAT_STRING" value="$#,##0.00"/>
  </CalculatedMember>
 </Cube>
</Schema>
Dimensions & Shared Dimensions
 <Schema>

 <Dimension name="Time">
  <Hierarchy hasAll="false" primaryKey="time_id">
   <Table name="time_by_day"/>
   <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/>
   <Level name="Quarter" column="quarter" uniqueMembers="false"/>
   <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/>
  </Hierarchy>
 </Dimension>

  <Cube name="Sales">
   <Table name="sales_fact_1997"/>

  <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/>

   <Measure …/>
   <CalculatedMember …/>
  </Cube>

  <Cube name=“Warehouse">
   <Table name="sales_fact_1997"/>

  <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/>

  <Measure …/>
   <CalculatedMember …/>
  </Cube>

 </Schema>
Hierarchies
<Schema>

<Dimension name="Time">
 <Hierarchy hasAll="false" primaryKey="time_id">
  <Table name="time_by_day"/>
  <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/>
  <Level name="Quarter" column="quarter" uniqueMembers="false"/>
  <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/>
 </Hierarchy>

 <Hierarchy name=“Fiscal Calendar” hasAll="false" primaryKey="time_id">
  <Table name="time_by_day"/>
  <Level name="Year" column=“fiscal_year" type="Numeric" uniqueMembers="true"/>
  <Level name="Quarter" column=“fiscal_quarter" uniqueMembers="false"/>
  <Level name="Month" column=“fiscal_month_of_year" type="Numeric" uniqueMembers="false"/>
 </Hierarchy>
</Dimension>

 <Cube name="Sales">
  <Table name="sales_fact_1997"/>

  <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/>

 <Measure …/>
  <CalculatedMember …/>
 </Cube>
</Schema>
Schema.xml – Extensions
Plug-in classes
In-line tables
Views
User defined functions
Extensions – Plug-in Classes
Member Reader
Member Formatter
Cell Reader
Cell Formatter
Property Formatter
Extensions – In-line Tables
<Dimension name="Severity">
 <Hierarchy hasAll="true" primaryKey="severity_id">

 <InlineTable alias="severity">
  <ColumnDefs>
   <ColumnDef name="id" type="Numeric"/>
   <ColumnDef name="desc" type="String"/>
  </ColumnDefs>
  <Rows>
   <Row>
    <Value column="id">1</Value>
    <Value column="desc">High</Value>
   </Row>
   <Row>
    <Value column="id">2</Value>
    <Value column="desc">Medium</Value>
   </Row>
   <Row>
    <Value column="id">3</Value>
    <Value column="desc">Low</Value>
   </Row>
  </Rows>
 </InlineTable>

  <Level name="Severity" column="id" nameColumn="desc" uniqueMembers="true"/>
 </Hierarchy>
</Dimension>
Extensions – Views
<Cube name="Operations">

 <View alias="StateCountyCity">
  <SQL dialect="generic">
   <![CDATA[
SELECT s.state_name, c.county_name, t.city_name, s.state_id, c.county_id, t.city_id
FROM state s
LEFT JOIN county c ON (c.state_id = s.state_id)
LEFT JOIN city t ON (c.county_id = t.county_id)
  ]]>
  </SQL>
 </View>

</Cube>
Extensions – User Defined Functions
  Must implement mondrian.spi.UserDefinedFunction
  Implementation must be available in classpath
  UDF Definition in schema.xml
<Schema>
 ...
 <UserDefinedFunction name="PlusOne" className=“my.udf.PlusOne" />
</Schema>

  MDX Usage
WITH MEMBER [Measures].[Unit Sales Plus One]
    AS 'PlusOne([Measures].[Unit Sales])'
SELECT
    {[Measures].[Unit Sales]} ON COLUMNS,
    {[Gender].MEMBERS} ON ROWS
FROM [Sales]
MDX / JDBC Parallels
Mondrian                                JDBC
Connection – mondrian.olap.Connection   Connection – java.sql.Connection
Query – mondrian.olap.Query             Statement – java.sql.Statement
Result – mondrian.olap.Result           ResultSet – java.sql.ResultSet
Access Axis & Cell from Result          Access Rows & Columns from ResultSet
UI Layer – JPivot
Performance & Scalability
Enable SQL statement logging to analyze
mondrian generated SQL statements
Index on foreign/join keys
Use Aggregate Tables & Materialized Views
Query results in session
Constraints
Composite key joins are not supported
Uniqueness within a level is not based on id
Have had issues re-using same table with a
different alias
Make mondrian happy schema – must be
normalized
Requires dedicated Time dimension table
Summary
100% Pure Java BI tool
Not too difficult to work with
Extensible for different front-end layers
Scalable
Viable alternative to proprietary tools
    No vendor lock-in – Open Source
    Less TCO
    Quicker Time To Market

Learning Open Source Business Intelligence

  • 1.
    Open Source BI DeepDive Ravi Samji 21/04/2011
  • 2.
    Agenda Business Intelligence –Why, What & Who? Open Source BI – Introduction, Tech Stack OLAP Engine – Mondrian UI Layer – JPivot Performance & Scalability Constraints
  • 3.
  • 4.
    Business Intelligence –Why? Data is the biggest asset Structured and Unstructured format Most of our assets are buried Helps us understand customer behavior Helps us deliver better business value Measure performance
  • 5.
    Business Intelligence –What? Reporting Analytics Data/Text Mining ETL Predictive Analytics
  • 6.
  • 7.
    Open Source BI– Introduction Mondrian – OLAP Engine Initially Independent Open Source Initiative Now Part of Pentaho Open Source BI Suite 100% Pure Java Supports MDX and XML/A Bundled With Other Open Source Packages
  • 8.
    Open Source BI– Tech Stack JFreeChart WCF log4j log4j JPivot Mondrian RDBMS
  • 9.
    OLAP Engine –Mondrian Cube Definition – schema.xml MDX – Query language to access multi dimensional data Operates on normalized relational database
  • 10.
    Mondrian – schema.xml Logicalmodel of a multi dimensional database Cube, VirtualCube Dimensions, Hierarchies, Levels Measure, CalculatedMember
  • 11.
    Logical Model –Multi Dimensional <Schema> Database <Cube name="Sales"> <Table name="sales_fact_1997"/> <Dimension name="Gender" foreignKey="customer_id"> <Hierarchy hasAll="true" allMemberName="All Genders" primaryKey="customer_id"> <Table name="customer"/> <Level name="Gender" column="gender" uniqueMembers="true"/> </Hierarchy> </Dimension> <Dimension name="Time" foreignKey="time_id"> <Hierarchy hasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column="quarter" uniqueMembers="false"/> <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> </Dimension> <Measure name="Unit Sales" column="unit_sales" aggregator="sum" formatString="#,###"/> <Measure name="Store Sales" column="store_sales" aggregator="sum" formatString="#,###.##"/> <Measure name="Store Cost" column="store_cost" aggregator="sum" formatString="#,###.00"/> <CalculatedMember name="Profit" dimension="Measures" formula="[Measures].[Store Sales] - [Measures].[Store Cost]"> <CalculatedMemberProperty name="FORMAT_STRING" value="$#,##0.00"/> </CalculatedMember> </Cube> </Schema>
  • 12.
    Dimensions & SharedDimensions <Schema> <Dimension name="Time"> <Hierarchy hasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column="quarter" uniqueMembers="false"/> <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> </Dimension> <Cube name="Sales"> <Table name="sales_fact_1997"/> <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/> <Measure …/> <CalculatedMember …/> </Cube> <Cube name=“Warehouse"> <Table name="sales_fact_1997"/> <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/> <Measure …/> <CalculatedMember …/> </Cube> </Schema>
  • 13.
    Hierarchies <Schema> <Dimension name="Time"> <HierarchyhasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column="the_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column="quarter" uniqueMembers="false"/> <Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> <Hierarchy name=“Fiscal Calendar” hasAll="false" primaryKey="time_id"> <Table name="time_by_day"/> <Level name="Year" column=“fiscal_year" type="Numeric" uniqueMembers="true"/> <Level name="Quarter" column=“fiscal_quarter" uniqueMembers="false"/> <Level name="Month" column=“fiscal_month_of_year" type="Numeric" uniqueMembers="false"/> </Hierarchy> </Dimension> <Cube name="Sales"> <Table name="sales_fact_1997"/> <DimensionUsage name=“Time" source=“Time" foreignKey="time_id”/> <Measure …/> <CalculatedMember …/> </Cube> </Schema>
  • 14.
    Schema.xml – Extensions Plug-inclasses In-line tables Views User defined functions
  • 15.
    Extensions – Plug-inClasses Member Reader Member Formatter Cell Reader Cell Formatter Property Formatter
  • 16.
    Extensions – In-lineTables <Dimension name="Severity"> <Hierarchy hasAll="true" primaryKey="severity_id"> <InlineTable alias="severity"> <ColumnDefs> <ColumnDef name="id" type="Numeric"/> <ColumnDef name="desc" type="String"/> </ColumnDefs> <Rows> <Row> <Value column="id">1</Value> <Value column="desc">High</Value> </Row> <Row> <Value column="id">2</Value> <Value column="desc">Medium</Value> </Row> <Row> <Value column="id">3</Value> <Value column="desc">Low</Value> </Row> </Rows> </InlineTable> <Level name="Severity" column="id" nameColumn="desc" uniqueMembers="true"/> </Hierarchy> </Dimension>
  • 17.
    Extensions – Views <Cubename="Operations"> <View alias="StateCountyCity"> <SQL dialect="generic"> <![CDATA[ SELECT s.state_name, c.county_name, t.city_name, s.state_id, c.county_id, t.city_id FROM state s LEFT JOIN county c ON (c.state_id = s.state_id) LEFT JOIN city t ON (c.county_id = t.county_id) ]]> </SQL> </View> </Cube>
  • 18.
    Extensions – UserDefined Functions Must implement mondrian.spi.UserDefinedFunction Implementation must be available in classpath UDF Definition in schema.xml <Schema> ... <UserDefinedFunction name="PlusOne" className=“my.udf.PlusOne" /> </Schema> MDX Usage WITH MEMBER [Measures].[Unit Sales Plus One] AS 'PlusOne([Measures].[Unit Sales])' SELECT {[Measures].[Unit Sales]} ON COLUMNS, {[Gender].MEMBERS} ON ROWS FROM [Sales]
  • 19.
    MDX / JDBCParallels Mondrian JDBC Connection – mondrian.olap.Connection Connection – java.sql.Connection Query – mondrian.olap.Query Statement – java.sql.Statement Result – mondrian.olap.Result ResultSet – java.sql.ResultSet Access Axis & Cell from Result Access Rows & Columns from ResultSet
  • 20.
  • 21.
    Performance & Scalability EnableSQL statement logging to analyze mondrian generated SQL statements Index on foreign/join keys Use Aggregate Tables & Materialized Views Query results in session
  • 22.
    Constraints Composite key joinsare not supported Uniqueness within a level is not based on id Have had issues re-using same table with a different alias Make mondrian happy schema – must be normalized Requires dedicated Time dimension table
  • 23.
    Summary 100% Pure JavaBI tool Not too difficult to work with Extensible for different front-end layers Scalable Viable alternative to proprietary tools No vendor lock-in – Open Source Less TCO Quicker Time To Market