> rRoot Entry` FPyFwWordDocumentp
{|&SLprObjectPool"C"CDSummaryInformation$|,4(u
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnox{|}~SummaryInformation(
DocumentSummaryInformation8
CompObjjz
> s
FMicrosoft Word Picture
MSWordDocWord.Picture.69qL?L2$+@E> s
FMicrosoft Word Document
MSWordDocWord.Document.69qata trends and anomalies. Many tools represent the dataset as an N-dimensional space. Two and three-dimensional sub-slabs of this space are rendered as 2D or 3D objects. Color and time (motion) add two more dimensions to the display giving the potential for a 5D display.
How do traditional relational databases fit into this picture? How can flat files (SQL tables) possibly model an N-dimensional problem? Relational systems model N-dimensional data as a relation with N-attribute domains. For example, 4-dimensional earth-temperature data is typically represented by a Weather table shown below. The first four columns represent the four dimensions: x, y, z, t. Additional columns represent measurements at the 4D points such as temperature, pressure, humidity, and wind velocity. Often these measured values are aggregates over time (the hour) or space (a measurement area).
Table 1: WeatherTime (UCT)LatitudeLongitudeAltitude
(m)Temp
(c)Pres
(mb)27/11/94:150037:58:33N122:45:28W10221100927/11/94:150034:16:18N 27:05:55W 10231024
The SQL standard provides five functions to aggregate the values in a table: COUNT(), SUM(), MIN(), MAX(), and AVG(). For example, the average of all measured temperatures is expressed as:
SELECT AVG(Temp)
FROM Weather;
In addition, SQL allows aggregation over distinct values. The following query counts the distinct number of reporting times in the Weather table:
SELECT COUNT(DISTINCT Time)
FROM Weather;
Many SQL systems add statistical functions (median, standard deviation, variance, etc.), physical functions (center of mass, angular momentum, etc.), financial analysis (volatility, Alpha, Beta, etc.), and other domain-specific functions.
Some systems allow users to add new aggregation functions. The Illustra system, for example, allows users to add aggregate functions by adding a program with the following three callbacks to the database system [Illustra]:
Init (&handle): Allocates the handle and initializes the aggregate computation.
Iter (&handle, value): Aggregates the next value into the current aggregate.
value = Final(&handle): Computes and returns the resulting aggregate by using data saved in the handle. This invocation deallocates the handle.
Consider implementing the Average() function. The handle stores the count and the sum initialized to zero. When passed a new non-null value, Iter()increments the count and adds the sum to the value. The Final() call deallocates the handle and returns sum divided by count.
Aggregate functions return a single value. Using the GROUP BY construct, SQL can also create a table of many aggregate values indexed by a set of attributes. For example, The following query reports the average temperature for each reporting time and altitude:
SELECT Time, Altitude, AVG(Temp)
FROM Weather
GROUP BY Time, Altitude;
GROUP BY is an unusual relational operator: It partitions the relation into disjoint tuple sets and then aggregates over each set as illustrated in Figure 1.
Figure 1: The GROUP BY relational operator partitions a table into groups. Each group is then aggregated by a function. The aggregation function summarizes some column of groups returning a value for each group.
Red Brick systems added some interesting aggregate functions that enhance the GROUP BY mechanism [Red Brick]:
Rank(expression): returns the expressions rank in the set of all values of this domain of the table. If there are N values in the column, and this is the highest value, the rank is N, if it is the lowest value the rank is 1.
N_tile(expression, n): The range of the expression (over all the input values of the table) is computed and divided into n value ranges of approximately equal population. The function returns the number of the range holding the value of the expression. If your bank account was among the largest 10% then your rank(account.balance,10) would return 10. Red Brick provides just N_tile(expression,3).
Ratio_To_Total(expression): Sums all the expressions and then divides the expression by the total sum.
To give an example:
SELECT Percentile,MIN(Temp),MAX(Temp)
FROM Weather
GROUP BY N_tile(Temp,10) as Percentile
HAVING Percentile = 5;
returns one row giving the minimum and maximum temperatures of the middle 10% of all temperatures. As mentioned later, allowing function values in the GROUP BY is not yet allowed by the SQL standard.
Red Brick also offers three cumulative aggregates that operate on ordered tables.
Cumulative(expression): Sums all values so far in an ordered list.
Running_Sum(expression,n): Sums the most recent n values in an ordered list. The initial n-1 values are null.
Running_Average(expression,n): Averages the most recent n values in an ordered list. The initial n-1 values are null.
These aggregate functions are optionally reset each time a grouping value changes in an ordered selection.
2. Problems With GROUP BY:
SQL's aggregation functions are widely used. In the spirit of aggregating data, Table 2 shows how frequently the database and transaction processing benchmarks use aggregation and GROUP BY. Surprisingly, aggregates also appear in the online-transaction processing TPC-C query set. Paradoxically, the TPC-A and TPC-B benchmark transactions spend most of their energies maintaining aggregates dynamically: they maintain the summary bank account balance, teller cash-drawer balance, and branch balance. All these can be computed as aggregates from the history table [TPC].
Table 2: SQL Aggregates in Standard BenchmarksBenchmarkQueriesAggregatesGROUP BYsTPC-A, B100TPC-C1840TPC-D162715Wisconsin1832AS3AP23202SetQuery751
The TPC-D query set has one 6D GROUP BY and three 3D GROUP BYs. 1D and 2D GROUP BYs are the most common.
Certain forms of data analysis are difficult if not impossible with the SQL constructs. As explained next, three common problems are: (1) Histograms, (2) Roll-up Totals and Sub-Totals for drill-downs, (3) Cross Tabulations.
The standard SQL GROUP BY operator does not allow a direct construction of histograms (aggregation over computed categories.) For example, for queries based on the Weather table, it would be nice to be able to group times into days, weeks, or months, and to group locations into areas (e.g., US, Canada, Europe,...). This would be easy if function values were allowed in the GROUP BY list. If that were allowed, the following query would give the daily maximum reported temperature.
SELECT day, nation, MAX(Temp)
FROM Weather
GROUP BY Day(Time) AS day,
Country(Latitude,Longitude)
AS nation;
Some SQL systems support histograms but the standard does not. Rather, one must construct a table-valued expression and then aggregate over the resulting table. The following statement demonstrates this SQL92 construct.
SELECT day, nation, MAX(Temp)
FROM ( SELECT Day(Time) AS day, Country(Latitude, Longitude)
AS nation,
Temp
FROM Weather
) AS foo
GROUP BY day, nation;
A second problem relates to roll-ups using totals and sub-totals for drill-down reports. Reports commonly aggregate data at a coarse level, and then at successively finer levels. The car sales report in Table 3 shows the idea. Data is aggregated by Model, then by Year, then by Color. The report shows data aggregated at three levels. Going up the levels is called rolling-up the data. Going down is called drilling-down into the data.
Table 3: Sales Roll Up by Model by Year by Color
Model
Year
ColorSales
by Model
by Year
by Color Sales
by Model
by Year
Sales
by Model
Chevy 1994black50 white40 901995black85 white115 200 290
Table 3 is not relational null values in the primary key are not allowed. It is also not convenient -- the number of columns grows as the power set of the number of aggregated attributes. Table 4 is a relational and more convenient representation. The dummy value "ALL" has been added to fill in the super-aggregation items.:
Table 4: Sales SummaryModelYearColorUnitsChevy 1994black50Chevy 1994white40Chevy 1994ALL90Chevy 1995black85Chevy 1995white115Chevy 1995ALL200Chevy ALLALL290
The SQL statement to build this SalesSummary table from the raw Sales data is:
SELECT Model, ALL, ALL, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model
UNION
SELECT Model, Year, ALL, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year
UNION
SELECT Model, Year, Color, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year, Color;
This is a simple 3-dimensional roll-up. Aggregating over N dimensions requires N such unions.
Roll-up is asymmetric notice that the table above does not aggregate the sales by year. It lacks the rows aggregating sales by color rather than by year. These rows are:
ModelYearColorUnitsChevy ALLblack135Chevy ALLwhite155
These additional rows could be captured by adding the following clause to the SQL statement above:
UNION
SELECT Model, ALL, Color, SUM(Sales)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Color;
The symmetric aggregation result is a table called a cross-tabulation, or cross tab for short (spreadsheets and some desktop databases call them pivot tables.) Cross tab data is routinely displayed in the more compact format of Table 5.
Table 5: Chevy Sales Cross TabChevy 19941995 total (ALL)black 5085135white40115155 total (ALL)90200290
This cross tab is a two-dimensional aggregation. If other automobile models are added, it becomes a 3D aggregation. For example, data for Ford products adds an additional cross tab plane.
Table 5a: Ford Sales Cross TabFord 19941995 total (ALL)black 5085135white107585 total (ALL)60160220
The cross tab array representation is equivalent to the relational representation using the ALL value. Both generalize to an N-dimensional cross tab.
The representation of Table 4 and unioned GROUP BYs solve the problem of representing aggregate data in a relational data model. The problem remains that expressing histogram, roll-up, drill-down, and cross-tab queries with conventional SQL is daunting. A 6D cross-tab requires a 64-way union of 64 different GROUP BY operators to build the underlying representation. Incidentally, on most SQL systems this will result in 64 scans of the data, 64 sorts or hashes, and a long wait.
Building a cross-tabulation with SQL is even more daunting since the result is not a really a relational object the bottom row and the right column are unusual. Most report writers build in a cross-tabs feature, building the report up from the underlying tabular data such as Table 4 and its extension. See for example the TRANSFORM-PIVOT operator of Microsoft Access [Access].
3. The Data CUBE Operator
The generalization of these ideas seems obvious: Figure 2 shows the concept for aggregation up to 3-dimensions. The traditional GROUP BY can generate the core of the N-dimensional data cube. The N-1 lower-dimensional aggregates appear as points, lines, planes, cubes, or hyper-cubes hanging off the core data cube.
The data cube operator builds a table containing all these aggregate values. The total aggregate is represented as the tuple:
ALL, ALL, ALL,..., ALL, f(*)
Points in higher dimensional planes or cubes have fewer ALL values. Figure 3 illustrates this idea with an example.
We extend SQLs SELECT-GROUP-BY-HAVING syntax to support histograms, decorations, and the CUBE operator.
Currently the SQL GROUP BY syntax is:
GROUP BY
{ [collate clause] ,...}
Figure 2: The CUBE operator is the N-dimensional generalization of simple aggregate functions. The 0D data cube is a point. The 1D data cube is a line with a point. The 2D data cube is a cross tabulation, a plane, two lines, and a point. The 3D data cube is a cube with three intersecting 2D cross tabs.
To support histograms, extend the syntax to:
GROUP BY
{ ( | )
[ AS ]
[ ]
,...}
The next step is to allow decorations, columns that do not appear in the GROUP BY but that are functionally dependent on the grouping columns. Consider the example:
SELECT department.name, sum(sales)
FROM sales JOIN department
USING (department_number)
GROUP BY sales.department_number;
The department.name column in the answer set is not allowed in current SQL, it is neither an aggregation column (appearing in the GROUP BY list) nor is it an aggregate. It is just there to decorate the answer set with the name of the department. We recommend the rule that if a decoration column (or column value) is functionally dependent on the aggregation columns, then it may be included in the SELECT answer list.
These extensions are independent of the CUBE operator. They remedy some pre-existing problems with GROUP BY. Some systems already allow these extensions, for example Microsoft Access allows function-valued GROUP BYs.
Creating the CUBE requires generating the power set (set of all subsets) of the aggregation columns. We propose the following syntax to extend SQLs GROUP BY operator:
GROUP BY (
{ ( | )
[ AS ]
[ ]
,...}
[ WITH ( CUBE | ROLLUP ) ]
)
EMBED Word.Picture.6
Figure 3: A 3D data cube (right) built from the table at the left by the CUBE statement at the top of the figure.
Figure 3 has an example of this syntax. To give another, here follows a statement to aggregate the set of temperature observations:
SELECT day, nation, MAX(Temp)
FROM Weather
GROUP BY Day(Time) AS day,
Country(Latitude, Longitude)
AS nation
WITH CUBE;
The semantics of the CUBE operator are that it first aggregates over all the