[e]
[e]
[e]

Data Warehousing with MySql

Data Warehousing is the idea of setting up a separate database structure for purely reporting purposes, so you can stop compromising between reporting and transactional demands.
MySql is a database that is fast, easy to use, and (generally) free.

> Aggregation
> Benchmarks
> ClickstreamData
> DatesAndTimes
> Home
> Indexing
> KPIs
> Links
> MathML
> Money
> MysqlCompared
> PhpPlugins
> Recipe
> Regression
> Scripting
> Sudoku
> SummaryData
> TableTypes
> TuningMysql
New Page:

search:

>
edit css
show debugging
contact: tom (.) cunningham (at) gmail (dot) com
[e]
[e]

Benchmarks



Personal experience. In 2005 I have worked with a 3Gb d-warehouse, feeding off a 5Gb transactional DB. Timings:
  • 1 hour to SQL-snapshot & remount the transactional DB.
  • 1/2-hour to do ETL (100M of new data).
  • 2 hours to generate KPIs (~1200 of them).
  • 60-seconds to do a query over a month's data from the fact table (~4M rows, 200Mb).

    Cox Communications. Mysql.com has a white-paper on Cox Communications using MySql for their database: they say 2 billion rows; 3,600 tables; 27 servers; 600GB data.
  • Cox Communications powers massive data warehouse with MySql [mysql.com] (2004)

    Travelocity (Sabre). "Each month about 100 Gb of Travelocity data is captured for analysis. Sabre keeps a year's worth of data on hand for a total of 1.2Tb, made up of 3.8 billion rows of individual data elements." (Note: Travelocity also uses MySql for fare-searching, but that is a different installation). They use a commercial ETL tool, and a stats package from SAS, & run it on a $ US25K Linux server.
  • Data Warehousing with Open Source [techtarget.com] (Nov 04)

    OReilly. Under 10Gb data. Use perl & the PTK for reporting.
  • Data Warehousing with Open Source [techtarget.com] (Nov 04)

    Various. MySQL Customers With Data Warehouse Applications [mysql.com].

    A Script for Testing Variations




    #####################################################
    #---------------- MERGE FACT TABLE TEST -------------
    #-------------tom.cunningham-@-gmail.com--23-Dec-05--


    CREATE DATABASE IF NOT EXISTS sandbox;
     USE sandbox;

    CREATE TABLE big_table LIKE mysql.help_keyword;
    ALTER TABLE big_table ENGINE=MERGE UNION=(mysql.help_keyword);

    DROP TABLE IF EXISTS dimension_1;
    CREATE TABLE dimension_1 (
         key_1       INT PRIMARY KEY NOT NULL,
         attribute_1 VARCHAR(255) NOT NULL,
         INDEX attribute_1 (attribute_1(10))
    );
     SET @A:=1;
    INSERT INTO dimension_1
    SELECT @A:=@A+1,
         SHA(RAND())
    FROM big_table b1, big_table b2, big_table b3
    LIMIT 100000;

    DROP TABLE IF EXISTS dimension_2;
    CREATE TABLE dimension_2 (
         key_2       INT PRIMARY KEY NOT NULL,
         attribute_2 VARCHAR(255) NOT NULL,
         INDEX attribute_1 (attribute_2(10))
    );
     SET @A:=1;
    INSERT INTO dimension_2
    SELECT @A:=@A+1,
         SHA(RAND())
    FROM big_table b1, big_table b2, big_table b3
    LIMIT 100000;


    DROP TABLE IF EXISTS facts;
    CREATE TABLE facts (
         key_1 INT UNSIGNED,
         key_2 INT UNSIGNED,
         fact_1 INT UNSIGNED,
         fact_2 INT UNSIGNED,
         PRIMARY KEY key_1_key_2 (key_1,key_2)
    );
    INSERT IGNORE INTO facts
    SELECT CEIL(RAND()*100000),
         CEIL(RAND()*100000),
         CEIL(RAND()*1000000),
         CEIL(RAND()*1000000)
    FROM big_table b1, big_table b2, big_table b3
    LIMIT 1000000;


    ###### Typical query: (takes about 1.5seconds)
    SELECT SQL_NO_CACHE attribute_1, sum(fact_1), avg(fact_2)
    FROM facts
    JOIN dimension_1 ON dimension_1.key_1=facts.key_1
    JOIN dimension_2 ON dimension_2.key_2=facts.key_2
    WHERE attribute_1 LIKE 'ff%'
     AND attribute_2 LIKE 'a3%'
    GROUP BY attribute_1
    LIMIT 1;


    ###### Variation 1: Merged fact table;

    DROP TABLES f1, f2, f3, f4, f5;
    CREATE TABLE f1 LIKE facts; INSERT INTO f1 SELECT * FROM facts WHERE key_1 BETWEEN 0 AND 20000;
    CREATE TABLE f2 LIKE facts; INSERT INTO f2 SELECT * FROM facts WHERE key_1 BETWEEN 20001 AND 40000;
    CREATE TABLE f3 LIKE facts; INSERT INTO f3 SELECT * FROM facts WHERE key_1 BETWEEN 40001 AND 60000;
    CREATE TABLE f4 LIKE facts; INSERT INTO f4 SELECT * FROM facts WHERE key_1 BETWEEN 60001 AND 80000;
    CREATE TABLE f5 LIKE facts; INSERT INTO f5 SELECT * FROM facts WHERE key_1 BETWEEN 80001 AND 100000;
    RENAME TABLE facts TO facts_old;
    CREATE TABLE facts LIKE facts_old;
    ALTER TABLE facts ENGINE=MERGE UNION=(f1,f2,f3,f4,f5);


    ###### Simple vs. Composite Indexes:

    CREATE INDEX simple ON facts (key_1);
    CREATE INDEX composite ON facts (key_1, key_2);
    CREATE UNIQUE INDEX composite_unique ON facts (key_1, key_2);

        # Simple: ~0.9s
    SELECT SQL_NO_CACHE AVG(fact_1)
     FROM facts USE INDEX (simple)
     JOIN dimension_1 using (key_1)
    WHERE attribute_1 BETWEEN 'a3' AND 'g2';

        # composite: ~1.0s
    SELECT SQL_NO_CACHE AVG(fact_1)
     FROM facts USE INDEX (composite)
     JOIN dimension_1 using (key_1)
    WHERE attribute_1 BETWEEN 'a3' AND 'g2';

        # composite and unique: ~9.9s
    SELECT SQL_NO_CACHE AVG(fact_1)
     FROM facts USE INDEX (composite_unique)
     JOIN dimension_1 using (key_1)
    WHERE attribute_1 BETWEEN 'a3' AND 'g2';



  • Total Time: 0.06s
    REQUEST:Array
    (
        [link_body/body] => {incl:Benchmarks}
        [PHPSESSID] => b9a289e2959005ab5c102380940272d1
    )
    SESSION:Array
    (
        [body/body] => {incl:Benchmarks}
    )
    MESSAGES:PLUGINS: