[e]
[e]
[e]

Data Warehousing with MySql

Data Warehousing is the idea of setting up a separate database structure for purely reporting purposes, so you can stop compromising between reporting and transactional demands.
MySql is a database that is fast, easy to use, and (generally) free.

> Aggregation
> Benchmarks
> ClickstreamData
> DatesAndTimes
> Home
> Indexing
> KPIs
> Links
> MathML
> Money
> MysqlCompared
> PhpPlugins
> Recipe
> Regression
> Scripting
> Sudoku
> SummaryData
> TableTypes
> TuningMysql
New Page:

search:

>
edit css
show debugging
contact: tom (.) cunningham (at) gmail (dot) com
[e]
[e]

Aggregation



The Aggregation Problem


Eventually a table gets too big to query on-demand, and a server upgrade would be too expensive. Then you have two options: (a) give up and do batch queries overnight; or (b) generate summary data, which is able to answer most queries (and then do the rest in batch queries).

Generating summary data is sensible, but is a delicate task.

Automatic Summaries: OLAP


An OLAP server (OnLine Analytical Processing) can handle aggregation for you. You feed them a lot of flat data, and some information about which columns are attributes and which are facts, and about how the facts should be summarised. The server then stores the summaries of each column for each different combination of attributes in a cube.

The summary data needn't be comprehensive, e.g. the cube may contain total sales of shoes on the 13th of June; and it may store the total number of customers who were from Poland in June; but it may not store how many customers from Poland bought shoes on the 13th of June.

Once the cube is calculated it can respond to queries very quickly; and if it was carefully built it should store sufficient information to answer most ad-hoc queries.

Oracle 10g has a function to transparently rewrite SQL queries to use a cube instead of the tables it references. However, it has to be told in exactly what circumstances to do this. (REF)

MS Excel versions 2000+ have built-in OLAP functions, which let you generate cubes from raw data (eg., a MySql DB via ODBC), and then query the cube using standard Excel pivot tables. This works OK, but it doesn't seem well prepared to take million-row data sets.

Other products which support cubes are expensive.

Automatic Summaries: Materialized Views


(...)

Manual Summaries: Aggregate Tables


If you're too cheap to afford an OLAP server, or wish finer control of aggregation, it is not too complicated to implement a summary-data system yourself with aggregate tables.

The basic way you build an aggregate table is this: take a fact table, join it every dimension that you can, and then add every single attribute to the group-by clause. The result will be a summary table that does absolutely no summarising -- it has just as many records as your original fact table.

The decision then is which attributes to take out of the "group by" clause. By removing attributes from that clause you reduce the number of rows in the aggregate table - making it quicker - but also reduce the number of queries you can use the aggregate table to answer  - making it less useful.

This is the basic form:

CREATE TABLE fact_sales_aggregate
SELECT date,
      ds.store_region, ...,
      dp.product_name, ...,
      SUM(fs.num_sold) AS num_sold,
      SUM(fs.total_price) AS total_price

 FROM fact_sales fs
 JOIN dim_store ds ON sd.store_key=fs.store_key
 JOIN dim_product dp ON dp.product_key=fs.product_key
GROUP BY
      fs.date,
      ds.store_region, ...,
      dp.product_name, ...;


Once you have the query like this, try commenting out different combinations of attributes to find a compromise where (a) you can answer most interesting questions; (b) you have many fewer rows than the original base table.

A slightly quicker way of doing this is with the query:

SELECT COUNT(DISTINCT ds.store_region, dp.product_name, ...)
 FROM ...


In some cases no single compromise will be ideal, and it will be sensible to make multiple aggregate tables. The extra overhead of managing a second aggregate table after a first should not be much.

COUNT(*)


One query that will obviously work differently on the base table and the summary table is this:
SELECT COUNT(*) FROM table WHERE ...

A single row in the summary table represents many rows in the base table, but in the current state we don't know how many. This information is likely to be important, so we need to make a new column in the summary table, e.g. "agg_count", which simply stores a COUNT(*) of the number of rows in the underlying table.

Cut-Down Dimensions


There are two options for a summary table: either you can make it a standard fact table, with references to dimension tables; or you can make it a self-contained flat table which contains a mixture of attributes and facts (this is what will be created by the query above).

The first option requires creating all new dimension tables, which contain less columns than the original dimension tables.

The second option is all-round simpler, but makes a bigger, slower table.

[can't use indexes the same on 2nd type of table]

Once you've decided on a structure for your summary table it's best to do a proper "CREATE TABLE", rather than "CREATE TABLE ... SELECT ...", because the latter form will tend to use inefficient column types.

Aggregate Table Growth


If you plot the size of the base table against the size of the aggregate table, you hopefully will get a convex shape: i.e., the aggregate table grows at less than O(n). This is because the aggregate table only needs a row for each combination of distinct attributes, and it is reasonable to expect this variety of facts to grow less slowly than the quantity of facts.


Rewriting Queries


Obviously you can now query the summary table in ad-hoc reports, and get answers back more quickly. However, it's much nicer if your query tools can transparently query either the summary table, or the underlying fact table when necessary.

My algorithm for query rewrite is this:
  • Check the query for certain strings that are difficult to rewrite, if the query contains any, send it to the base table. (see below)

  • Replace the entire "FROM ... JOIN ..." section with a reference just to your summary table.

  • Try sending this query to the database with "EXPLAIN" on the front. If the DB complains (eg., because it mentions a column which doesn't exist in your summary table), then use the base table.

  • Otherwise, use the base table.


  • Estimating Query Length


    Transparently rewriting queries is good, but it can then be surprising and annoying to the user that some queries take so much longer than others.

    If the queries go through your own query tool, then you can tell the user how long a query is expected to take before they execute it. The number of rows MySql expects to examine can be found by running an EXPLAIN on the query, and multiplying together all the numbers in the "rows" column. Then multiply this number by a number (my server does around 100,000 rows / second).

    The estimated query-length is generally within an order of magnitude of the real length, which is good enough. There are a lot of extra features you could add to get a more accurate estimate.

    Manual Aggregation Problems


    Some bits of SQL will work on the summary table, but will return the wrong answer, which is obviously very bad. One option is to try to parse the SQL to get all the column names, and only rewrite if the parsing is successful (this is what Ralph Kimball is suggesting in his 1993 essay). A problem with this is that you may have to update the rules each time you use slightly different SQL.

    The simpler alternative is to have a black-list of SQL, such that if it appears then do not rewrite the query.

  • COUNT(*) This obviously doesn't work. You can rewrite this to: "SUM(agg_count)". However, a slightly safer alternative is to write "SUM(agg_count)" in the original query, and then when not using the summary table replace "agg_count" with "1", which will work in the same way as COUNT(*).

  • SUM(IF(a=b,1,0)) This is the same problem: here "1" needs to be replaced with "agg_count".

  • AVG(a) This needs to be manually rewritten in the original query as: SUM(a)/SUM(agg_num).

  • COUNT DISTINCT (useful with ROLLUP)

  • status-change subqueries

  • attribute/facts


  • [list of all aggregate functions to be added to the blacklist]


    Other Notes


  • indexes

  • fine-grained attributes

  • keeping data up-to-date

  • base table schema changes

  • materialized views with query rewrite (Oracle, DB2)
  • indexed views (MS-SQL)
  • view simulation with triggers (PostgreSQL)


    Links


  • system for MySql query-rewriting. [stoev.org]
  • Aggregate Navigation With (Almost) No Metadata Richard Kimball.



  • Total Time: 0.57s
    REQUEST:Array
    (
        [link_body/body] => {incl:Aggregation}
        [PHPSESSID] => b9a289e2959005ab5c102380940272d1
    )
    SESSION:Array
    (
        [body/body] => {incl:Aggregation}
    )
    MESSAGES:PLUGINS: