Total Pageviews

Thursday, September 8, 2011

MDX: Multi-Dimensional eXtensions



MDX is a language that allows to describe multidimensional queries. As a matter of fact, the name MDX stands for Multi-Dimensional eXtensions. As this name suggests, MDX is an extension of SQL, so you will find its structure to be similar to this of SQL, and the same keywords serving similar functions. There will be some differences however. MDX builds multidimensional view of the data when SQL builds relational view.
The following table helps to draw analogies between relational and multidimensional terms and it is suggested, that you’ll refer to this table when MDX and SQL queries are compared
Multidimensional termRelational analog
CubeTable
LevelColumn (string or discrete number)
DimensionSeveral related columns or dimension table
MeasureColumn (discrete or continuous numeric)
Dimension memberThe value in the specific row and column of dimension table

Sample cube

All examples in the following sections will be built around sample cube. This cube is sales analysis cube for hypothetical company which sells computer accessories world-wide. This cube contains typical dimensions such as Time, Geographic, Products, Customers etc. The actual cube is attached to this book and can be viewed using Microsoft OLAP software (such as MDXSample). All queries in the book will work immediately against this cube, except the cases where it specially noted.

Our first MDX query.

Let’s start to build our first MDX query. The basic MDX query has the following structure:
SELECT axis1 ON COLUMNS, axis2 ON ROWS FROM cube
Let’s compare that to the similar SQL statement:
SELECT column1, column2, …, columnnFROM table
You may find some of MDX keywords familiar. Indeed, SELECT and FROM serve exactly the same purpose they do in SQL. The SELECT keyword tells us that we want to execute a query and get data results back. The FROM keyword specifies the source of the data. In the SQL case we receive the data from the table, and in MDX case we receive data from the cube. However, there is something different between above two queries. SQL basically gives us relational view of the data, which is always two-dimensional. The result of the SQL query is a table which has two dimensions: rows and columns. And these dimensions are not symmetrical. Rows are all of the same structure, and this structure is defined by the columns. Columns, of course may be of different types and have different meanings.
In the MDX world, we can specify any number of dimensions to form result of our query. We will call them axes to prevent the confusion with cube’s dimensions. We can have zero, one, two, three or any other number of axes. And all these axes are perfectly symmetric.
There is no special meaning to rows and to columns in MDX other than for display purposes. You should note however, that most OLAP tools (such as Microsoft’s MDX Sample) are not capable in visualizing of MDX query results containing more than two axes. The list of columns in SQL defines the structure of the result data. In MDX, you have to define the structure of every axis. Therefore in our example axis1 will be definition of the axis displayed horizontally, and axis2 will be definition of the axis displayed vertically. Axis is a collection of dimension members, or more generally tuples. We will discuss tuples later, so for now we’ll restrict ourselves to the axes containing members only.
There are many versatile ways to define an axis in MDX. We will start with simple one. Suppose we want to present on the axis all members of certain dimension. The syntax for such an axis definition would be
<dimension name>.MEMBERS
And similarly, if we want to see all dimension members belonging to the certain level of this dimension, the syntax would be
<dimension name>.<level name>.MEMBERS
Now let’s construct our first MDX query using all information we learned so far. We’ll use as an example cube ‘Sales’. (The definition of this cube can be found in appendix). Our first MDX query will show sales in different continents over years. Therefore the query will look like
SELECT
  Years.MEMBERS ON COLUMNS,
  Regions.Continent.MEMBERS ON ROWS
FROM Sales
The result of this query visualized through the table will look like the following:

1994
1995
1996
1997
N. America
120,000
200,000
400,000
600,000
S. America
-
10,000
30,000
70,000
Europe
55,000
95,000
160,000
310,000
Asia
30,000
80,000
220,000
200,000
Let’s try to analyze the results of this query. We have a table with two axes. Horizontal axis (i.e. columns) contains members of dimension ‘Years’, and vertical axis (i.e. rows) contains members of level ‘Continent’ of dimension ‘Regions’. All these are explicitly specified in the query. However, one can ask: “And what are numbers in this table mean?”. Indeed, if you look into the query, you will find no reference to any quantitative data. So what happened ? We know, that in the cube we have to always specify measures, which are our quantitative data. Since, we didn’t explicitly tell in our MDX query which measure we want to use, the default measure was used. In our example, he default measure was ‘Sales’ (not to be confused with cube name ‘Sales’!). So, for instance, the interpretation of leftmost top number in the table is “In the year 1994, the sales in N.America were 120,000”. Now for comparison, let’s construct SQL query which produces the same data.
SELECT Continent, Year, Sales FROM Sales_Agg
The result is in the following table:
Continent
Year
Sales
N.America
1994
120,000
N.America
1995
200,000
N.America
1996
400,000
N.America
1997
600,000
S.America
1995
10,000
S.America
1996
30,000
S.America
1997
70,000
Asia
1997
200,000
(We assume, that the table ‘Sales_Agg’ is an aggregation table, so we don’t need to do GROUP BY’s and we receive already aggregated data).

Axis as enumeration

OK, so far we know how to build a basic MDX query and specify axes containing all members of the dimension or level. However, sometimes, we want to be more restrictive in our axes. Dimensions or levels often contain thousands of members and we don’t wanna see all of them. Some members might be irrelevant for our current analysis. For example, if we look into our query, we find out that it produces information about all years. And as we know, Wall Street always looks into last two years for comparison. So, we only want to see sales by continent for 1996 and 1997. Fortunately, MDX designers (see their names in appendix) thought about such scenarios, and we can specify a list of members in curly braces as an axis definition. For example:
{ dim.member1, dim.member2, … , dim.membern}
Let’s rewrite our MDX query to take advantage of this syntax and represent only years 1996 and 1997:
SELECT
 { Years.[1996], Years.[1997] } ON COLUMNS,
  Regions.Continent.MEMBERS      ON ROWS
FROM Sales
You may notice, that names of the members 1996 and 1997 are enclosed in square brackets. This quoting is done to prevent MDX parser confusion over the nature of 1996 and 1997. Without quotes, MDX parser might decide that they are numbers rather than member names. In general, square brackets quoting allows you to have member names containing any symbol such as spaces, dots etc. It is always good idea to enclose your member name in square brackets. It will help you to eliminate some syntax errors, that might take a lot of time to troubleshoot.
MDX syntax allows you to define empty axis by writing { }. This is precisely legal, however it is hard to think about useful example with explicit empty set.

Slicing

Slicing is one of the basic operations in data analysis. We’ll explain the concept and it’s implementation in MDX through examples. Let’s take our previous query. So far only two dimensions, Regions and Years, were involved. So what happens if we add other dimensions? Perspicacious reader might guess the answer.
We already discussed what happens with measures if no measure was explicitly specified in the query. But measures are treated in MDX exactly the same as any other dimension. So for all dimension that were omitted from the MDX query, the default members will be assumed. When a dimension has level ‘All’ (i.e. has top level consisting of only one member which aggregates the whole dimension) this member will be chosen as default. Otherwise, the first member of the first level will be chosen as default (we’ll discuss ordering of members in the level later). So going back to our query we realize, that it gave us sales figures for all products, all customers etc. Suppose, now we are very interested only in sales of computers. Of course, we can build new axis containing only one element:
{ Products.[Product Group].Computers }
And then we can add this axis to the query (we said query can contain any number of axes). However, this is not preferable solution. The problem is that then the query will contain three axes, and not all OLAP tools will be able to display it. Luckily, designers of MDX had some experience in OLAP, and they recognized the importance of slicing in OLAP applications. Therefore, MDX supports slices through the syntax of WHERE keyword. To rewrite our query using slice on computers, we’ll use
SELECT
 { Years.[1996], Years.[1997] } ON COLUMNS,
  Regions.Continent.MEMBERS ON ROWS
FROM Sales
WHERE ( Products.[Product Group].[Computers] )
Attention SQL gurus ! WHERE clause in SQL means filtering of rows by specified criteria. The WHERE clause in MDX means slicing the multidimensional query. While these concepts are somehow similar, they by no means are equivalent.
OK, now let’s present the formal syntax of WHERE clause
WHERE ( member-of-dim1, member-of-dim2, …, member-of-dimn )
As you can see, you can have more than one member in slice. However, you should remember, that these members should come from different dimensions. This is an obvious restriction. You can not slice by two different products, say, computers and printers. You can only slice by one of them.  Let’s build a query slieces by specific product (computers) and  specific customer (AT&T).
SELECT
 { Years.[1996], Years.[1997] }  ON COLUMNS,
 Regions.Continent.MEMBERS       ON ROWS
FROM Sales
WHERE ( Products.[Product Group].[Computers],Customers.[AT&T] )
To summarize again: This query shows the sales of computers to AT&T in years 1996 and 1997 by all continents.
What if we don’t want to see sales of computers, but rather number of units shipped ? This requires us to change the default measure ‘Sales’ to another measure, ‘Units’. And how we are gonna do that ? Yes, again using slice. We just need to add to the slice measure units.
SELECT
     { Years.[1996], Years.[1997] } ON COLUMNS,
     Regions.Continent.MEMBERS ON ROWS
FROM Sales
WHERE
( 
  Products.[Product Group].[Computers],
  Customers.[AT&T],
  Measures.[Units]
)

First exercises.

Well, we have learned the basic MDX constructions. We know how to construct axes in different ways, we know how to slice. Even though it was relatively simple, it already gives you ability to answer many interesting questions. So, take a challenge, and try to answer the following questions:
1.      Build the MDX query which shows sales of products over the years.
2.      Build the MDX query which shows sales of products to customers in 1996.
3.      Build the MDX query which compares the number of printers shipped in Asia against Europe over the years.

Tuples

We mentioned earlier that axes can contain members or tuples, and promised to explain about tuples later. Well, it's time to fulfill that promise. Consider the following example: You want to see sales of computers and printers in Europe and Asia over years 1996 and 1997. Well, we can do that by writing an MDX query with three axes:
SELECT
  { Continent.[Europe], Continent.[Asia] }      ON AXIS(0),
  { Product.[Computers], Product.[Printers] }   ON AXIS(1),
  { Years.[1996], Years.[1997] }                ON AXIS(2)
FROM Sales
This will work, but the result of this query is not a two-dimensional table, but three-dimensional cube. This might be fine, if this query was used internally by data mining application. But what if our boss asked you to report these numbers. How you are going to print this three-dimensional thing? The solution would be to rewrite this query in such a way, that even though it will encapsulate three dimensions, it will still contain two axes. We want the query which will produce the following result:

1996
1997
Europe
Computers


Printers


Asia
Computers


Printers



One special thing about this table is that its rows axis contains two dimensions. The dimension Products is nested into dimension Regions, such that we produced all four combinations between {Europe, Asia} and {Computers, Printers}. Every such a combination is called tuple in MDX. So the tuple is a combination of dimension members, coming from different dimensions. Syntax for specifying a tuple in MDX is
( member-of-dim1, member-of-dim2, …,
member-of-dimk )
Now, don’t you have some kind of déjà vu here? Doesn’t it remind you something you’ve already seen in this book ? Yeah, sure. We have had exactly the same construction in WHERE clause. As a matter of fact, official syntax of MDX specification defines tuple to follow WHERE clause. We are already familiar with tuples. We used them in the WHERE clause. We just didn’t call them tuples. A tuple can contain only one member. Unlike axis, tuples cannot be empty. However, you have to clearly distinguish between member and tuple containing only one member.
member is not equivalent to member )
We’ll talk more about member and their properties later. Now, let’s write our MDX query to use tuples.
SELECT
  { Year.[1996], Year.[1997] } ON COLUMNS,
  {
    ( Continent.Europe, Product.Computers ),
    ( Continent.Europe, Product.Printers  ),
    ( Continent.Asia,   Product.Computers ),
    ( Continent.Asia,   Product.Printers  )
  } ON ROWS
FROM Sales
It is important to remember, that just like the case when we put the same dimension into the axis members, when we construct an axis from tuples, all these tuples have to be of the same structure. It means, all tuples have to contain the same number of members, and members on the same positions have to be from the same position. For example the following tuples pairs don’t have same structure:
( Europe, Computers ) and ( Europe, Computers, Sales )
( Europe, Computers ) and ( Europe, Sales )
( Europe, Computers ) and ( Computers, Europe )
Well, thoughtful reader may say now: “Everything is fine, but in this example we build the axis which contains only four tuples, and it was relatively easy to enumerate them. What if you want to see all products against all continents against all years. Now what ? MDX has an answer for this problem. MDX has a lot of built-in functions which allow to manipulate with axes and produce new axes. We’ll discuss these function in one of the following chapters.

Calculated members

So far we saw the various ways to query the existing information inside the cube. However almost any OLAP application needs the ability to perform some calculations over the existing data. Let’s take classic example. We have predefined measures Cost and Sales which describe how much did the product cost to us and for how much did we sell it. The next natural thing that one wants to ask is, well, what was our profit out of it. If the cube creator included measure Profit inside the cube – we are safe. However, what we are gonna do if such a measure doesn’t exist ? Luckily, MDX allows us to extend the cube by defining new members to dimensions, which values are calculated out of existing ones. The syntax for calculated members is to put the following construction in front of SELECT statement:
WITH MEMBER parent.name AS 'expression'
Here, “parent” refers to the parent of the new calculated member “name”. Since dimension is organized as a tree, when we add new member, we should specify its position inside this tree. Fast-thinking reader will immediately point out to the fact, that not all dimensions are organized as classic trees, but rather as multiroot trees. Indeed, even in our example, we wanted to add new member to the measures dimension, and measures don’t have common root. Well, in such a situation, the dimension itself serves as virtual root. I.e. dimension name will be specified as parent. So, let’s go ahead and create calculated member Profit in the dimension Measures.
WITH MEMBER Measures.Profit AS 'Measures.Sales – Measures.Cost'
As we just discussed, the parent of the new member is measures dimension itself, the name is Profit, and the expression is pretty straight-forward: Subtract your costs from your sales, and you’ll get the profit. Let’s put the whole query now, which allows to see the profits of products along years:
WITH
  MEMBER Measures.Profit AS 'Measures.Sales – Measures.Cost'
SELECT
  Products.MEMEBERS ON COLUMNS,
  Year.MEMEBERS ON ROWS
FROM Sales
WHERE ( Measures.Profit )
As you can see, the calculated members is treated in exactly same manner as normal members. They can be used anywhere, where the normal member can be used, i.e. in axis definition, in WHERE condition, you can even define calculated members using the other calculated members. As example, let’s add to the calculated member Profit another calculated member – ProfitPercent, which will exhibit the profit as percent of the costs. To calculate it, we’ll take the profit and divide it by cost.
WITH
 MEMBER Measures.Profit AS 'Measures.Sales – Measures.Cost'
 MEMBER Measures.ProfitPercent AS 'Measures.Profit / Measures.Cost',  FORMAT_STRING = '#.#%'
SELECT
  { Measures.Profit, Measures.ProfitPercent } ON COLUMNS
FROM Sales
We see several interesting things in this example. First is the fact, that to define calculated member ProfitPercent we used another calculated member. Second is the FORMAT_STRING statement right after the definition of ProfitPercent expression. As a matter of fact, MDX allows you to specify several properties of calculated member. The expression is one, probably most important one (and the only mandatory one). However there are other. One of the most useful is FORMAT_STRING. It allows you to specify how do you want to get the result cell value formatted for the purposes of nice display. While this property has nothing to do with semantics of MDX, we feel it is still quite important for OLAP applications, so we included it into this book. MDX standard doesn’t define how this format strings should look like. Developers of Microsoft MDX implementation have chosen to use familiar VB format standard. Since ProfitPercent is a fraction which meaning is percentage, we used here #.#% string which automatically converts fraction to percent (i.e. multiplies by 100).
Well, after these examples, you may get wrong feeling, that calculated members are used only in measures dimension. Of course, this is not true. Suppose we want to compare the performance of the company between year 1997 and 1998 and we want to see the changes. One way to do it is to build a query which has an { [1997], [1998] } axis, and always look into the pair of numbers for every measure. Another way, though, is to define calculated member in the level Year, parallel to 1997 and 1998, which will hold the difference between them. I.e.
WITH MEMBER Time.[97 to 98] AS 'Time.[1998] – Time.[1997]'
Now if we follow that by statement
SELECT 
  { Time.[97 to 98] } ON COLUMNS,
  Measures.MEMBERS ON ROWS 
FROM Sales
Note, that the row of Sales will show the difference in sales (likely positive), and the row of Cost will show difference in cost (hopefully negative). Member [97 to 98] belongs to the time dimension and behaves exactly as any other member in the time dimension. If we want to see difference between months December and October of 1997, we’ll create calculated member under member [1997], i.e.
WITH MEMBER Time.[1997].[Dec To Oct] AS 'Time.[1997].[12] – Time.[1997].[10]'
We’ll see many other interesting examples in following chapters as we advance in studying powerful MDX functions and not just simple math. Now, we would like to go a little bit back and recall our first example with Profit. Suppose we want to see how profit has been changed from 97 to 98. Well, the natural thing to write will be
WITH 
  MEMBER Measures.Profit AS 'Measures.Sales – Measures.Cost' 
  MEMBER Time.[97 to 98] AS 'Time.[1998] – Time.[1997]'
SELECT
  { Measures.Sales, Measures.Cost, Measures.Profit } ON COLUMNS,
  { Time.[1997], Time.[1998], Time.[97 to 98] }      ON ROWS
FROM Sales
We’ll get 3x3 matrix:

Sales
Cost
Profit
1997
300
220
80
1998
350
210
140
97 to 98
50
-10
60
It looks OK for the first glance, but if we look deeper we’ll see a certain problem here. There is no doubt about what we get in every cell of this grid, except for the low right corner. The cell in the low right corner is in the intersection of two calculated members: Profit and [97 to 98]. So what is the meaning of this cell: Is it grow of the profit between 97 and 98, or is it difference between relative sales and relative cost. In this simple example it doesn’t really matters, you can look into it one way or another, and the result will be still the same. But let’s imagine, that [97 to 98] was defined as
WITH MEMBER Time.[97 to 98] AS 'Time.[1998] / Time.[1997]'

Now it is clear, that the decision of what formula is applied to this cell really matters. Indeed to we want to calculate
([97 to 98],[Sales]) – ([97 to 98],[Cost])

or
([1998], [Profit]) / ([1997], [Profit])

Well, the MDX parser cannot read your mind. Sometimes you will want to do former, sometimes later. That’s why MDX allows to specify explicitly what do you want. For every calculated member you define, you can specify optional property SOLVE_ORDER This property allows you to control which calculated member formula will be applied if you have conflict on the cell, just like in the example above. The higher value you assign to SOLVE_ORDER property, the higher priority of this member. I.e. when several formulas pretend to be calculated for single cell, the one with highest priority is chosen. (As a matter of fact, the author of this book had very strong arguments with other spec writers to name this propertySOLVE_PRIORITY rather than SOLVE_ORDER, because it is much easier to understand it that way). Well, let’s get back to our example. Since we want to see difference between profits, we’ll give to the [97 to 98] calculated member higher priority. The standard says, that if you omit SOLVE_ORDER property, the default value will be zero. So, it’ll be enough to say
WITH
 MEMBER Measures.Profit AS 'Measures.Sales – Measures.Cost'
 MEMBER Time.[97 to 98] AS 'Time.[1998] / Time.[1997]', FORMAT_STRING = '#.00%', SOLVE_ORDER = 1
The interesting question is what will happen if both calculated members have the same solve order property. Well, MDX standard specifies, that than the outcome is implementation dependent, and generally unpredictable. To summarize: It has been long, but very important chapter. Calculated members are extremely useful in any OLAP application, and we’ll see them a lot in following chapters.

Members and related functions

As you know by now, in MDX we operate in the multidimensional hierarchized space. I.e. we have many dimensions, and every dimension has hierarchical structure (MDX allows to same dimension to have more than one hierarchical structure, but we’ll talk about it later). Obviously we need set of functions to navigate in this hierarchical multidimensional space. By defining the query’s axes, you specify lower-dimensionality subspace of your query (optionally sliced by WHERE condition). You also want to have functions to navigate in this subspace, knowing where you are, moving to the right direction etc. This chapter is dedicated to this type of functions. If you have dimension member, you may apply several methods on it: Parent, NextMember, PrevMember, FirstSibling, LastSibling, FirstChild, LastChild etc. The outcome will be another dimension member with obvious relationship to original one. For example the writing [WA].Parent  is equivalent to write [USA]. Of course nobody will use [WA].Parent instead of [USA], and we’ll see much more useful application of Parent in a moment. Often, you want to know the coordinates of the cell by some specific dimensions inside the query’s subspace. This can be achieved by function
<dimension name>.CurrentMember
<level name>.CurrentMember
Now, we can use CurrentMember and Parent to solve the following problem: Suppose we want to calculate the percentage of the sales in every city relative to its state. To calculate this percentage, we’ll divide the sales in the city by sales in its state. And how will we deduct the state by city ? Of course by using Parent !
WITH
 MEMBER Measures.PercentageSales AS '(Regions.CurrentMember,   Sales) /  (Regions.CurrentMember.Parent, Sales)'  ,FORMAT_STRING = '#.00%'
SELECT
   { Sales, PercentageSales } ON COLUMNS,
   Regions.Cities.MEMBERS ON ROWS
FROM Sales
This query effectively allows to see the required percentage.
While the functions Parent, FirstChild, LastChild operate in hierarchy in vertical manner, i.e. applying such a function changes level, functions PrevMember, NextMember, FirstSibling, LastSibling move you horizontally, inside the same level. PrevMember returns the member preceding to the given one in thelevel.MEMBERS collection, possibly crossing hierarchy lines. Example:
[Aug].PrevMember returns [Jul]. Both members belong to [Qtr 3]
[Oct].PrevMember returns [Sep]. We crossed the hierarchy changing parent from [Qtr 4] to [Qtr 3].
Let’s see interesting example in using PrevMember. We want to define new calculated member – Sales Growth, which will show the growth or decline in the sales compared with previous month, quarter or year. Naturally, the formula definition will look like:
WITH MEMBER Measures.[Sales Growth] AS '(Sales) – (Sales, Time.PrevMember)'
Now, we can use this definition in the query like that:
SELECT
     { [Sales], [Sales Growth] } ON COLUMNS,
     Month.MEMBERS ON ROWS
FROM Sales
Another example will be to see breakdown of sales for different products and how did these sales grew for every product in the last month.
SELECT
     { [Sales], [Sales Growth] } ON COLUMNS,
   Product.MEMBERS ON ROWS
FROM Sales
When we’ll talk about set functions, we’ll see many interesting things to do with this query, like filtering only those products, for which sales declined, or for which we had unusual growth. In the calculated members properties chapter we’ll see techniques of implementing what is known as exceptions coloring or ‘traffic-lights’. I.e. we will be able to change presentation color of negative sales growth to red, and those that beat expectations to green. And all this only using MDX.

NULL members and EMPTY cells

Let’s recall the example from the previous chapter.
WITH MEMBER Measures.[Sales Growth] AS '(Sales) – (Sales, Time.PrevMember)'
SELECT
   { [Sales], [Sales Growth] } ON COLUMNS,
  Month.MEMBERS ON ROWS
FROM Sales
We understand now, that this query will calculate for every month it’s growth in sales compared to the previous month. I.e. for September we will have difference in sales between September and August. However, one may wonder what happens with the first month? There is no previous month for it in our cube (that’s why we call it first month). So what’s gonna happen with MDX? Is it gonna return an error? Well, even though it looks like reasonable thing from the mathematical point of view to return an error in such a case, it would be complete disaster from pragmatic, business point of view. Let’s try to understand, what system analyst expects to see for Sales Growth in first month. Well, in many the cases he just doesn’t really care. He want to see how the business is doing over the time, it is not important what to do for the first one. For sure, he doesn’t want the query to have an error because of the first month. One of the most common things is to assume that Sales Growth is equal to Sales in first month, because Sales were non-existent (i.e. zero) before the first month. MDX designers had experience in real world OLAP implementations, and they were aware of  this problem of non-existing members. Therefore, rather then return an error, MDX has notion of NULL members which represents non-existing members. And here is very important definition: By non-existing members we mean here multidimensional coordinates have that are out of scope of multidimensional subspace defined by the cube. I.e. invalid member name, such as [Dummy] in level of months doesn’t make it NULL member. If you’ll use such a name in MDX query, it will fail with error. However if by navigating in multidimensional space you suddenly run out of boundaries of the cube (like saying [January].PrevMember), you will get NULL member. The semantic of NULL members is following:
1.      When the cell contains as one of its coordinates NULL member, the numeric value of the cell will be zero. Therefore (Sales, [January].PrevMember) = 0
The reasoning behind this is very simple: If the cell has non-existing coordinates, this cell doesn’t exist in this cube. In the business language it means that there were no sale at that moment of time, at this place of this product etc. And if there were no sale, than we sold zero units, received zero dollars etc. Of course, if MDX had support for VARIANTs, it would be logical to define the value of NULL member as VARIANT’s NULL value. I heard rumors, that the next version of MDX will support variants.
2.      Any member function applied to NULL returns NULL.
This means, that NULLs don’t remember their history. I.e. one might expect, that [January].PrevMember.NextMember would return [January] back. No. Once NULL – forever NULL. If you left the cube’s subspace and entered the outer space, you are lost there. There is no way back to the cube. For my opinion, this is not perfect behavior, but given the difficulty in implementing “history” of  NULL members (think about [January].PrevMember.PrevMember.Parent.FirstChild.NextMember.NextMember), and little business value associated with implementing such a history, we decided about the described above behavior.
3. When NULL member is included in set, it is just ignored. Therefore
{[September], [January].PrevMember} = {[September]}
This rule seems to be obvious, but it has strong motivation behind it. Jumping a little bit ahead, suppose you were to calculate moving average of last three months. The straight-forward approach is to define
([Time] + [Time].PrevMember + [Time].PrevMember.PrevMember) / 3

(remember, that CurrentMember is default property of dimension !).
The formula looks OK, but think what will happen for January and February (first two months). Suppose, that sales in January were 90 and in February 120. Then, the moving average for January will be (90+0+0)/3 = 30 instead of 90/1 = 90, and for February it will be (120+90+0)/3 = 70 instead of (120+90)/2 = 105. To solve this problem, MDX has statistical function called AVG, which calculates the average of the members in the set. Then you can apply
AVG( {[Time], [Time].PrevMember, [Time].PrevMember.PrevMember},  Sales)
Since sets automatically eliminate NULL members, AVG will do the right thing, because it asks set about count of elements in it.
Of course, PrevMember function is not the only one which can result in NULL member. Asking child (first or last) from member on the lowest level, asking Parent from member on highest level, doing NextMember on last member in the level – all those are example of NULL members.
The notion of NULL members is tightly connected to the notion of EMPTY cells. As a matter of fact, we already talked about EMPTY cells, when we talked about cell which contains NULL member at least in one of its coordinates. This cell is out of the scope of the cube, therefore it is empty, and numeric value calculated for this cell will be zero. However, EMPTY cells may occur not only when out of boundaries of the cube. One of the most broad problems in OLAP is data sparsity problem. I.e. real life cubes tend to populate only tiny fraction ( <0.01% ) of their theoretical volume. More the dimensionality of the cube, more the sparsity of data. The problem gets worse when dimensions are not orthogonal - for example there is strong correlation between “Customer Location” and “Store Location” dimensions, since people tend to buy where they live. But even when dimensions are orthogonal, sparsity still exist – for example “Customer” dimension, “Product” dimension and “Time” dimension, since there are not too many customers who buy something every day (this is not true even for my wife J). And even if they do, then they don’t buy every possible product ! As a result, there are a lot of cells in the cube, which don’t contain any data, because this data just doesn’t exist. Naturally, we’ll call such cells EMPTY cells. The numeric value for such cells will be zero for the reasons we explained above. However, sometimes it is important to distinguish between the situation where the data exist and it’s value equal to zero, and situation where data doesn’t exist resulting in EMPTY cell whose value again will be evaluated to zero. To accomplish that, MDX supports predicate IsEmpty, which can be applied to cell, and returns boolean value of TRUE or FALSE. We’ll see how useful boolean expression can be in later chapters.

Algebra of sets

So far we talked about axes and in this chapter we are gonna talk about sets. As we will see, sets can be used in many different fashions in MDX. For now we just assume that set is just a synonym word for axis. As we all remember from high school’s class “Set Theory”, sets have all bunch of associated operations on them such as union, intersection, difference, cartesian composition and decomposition etc. MDX offers all these classical set functions. The input and/or output sets might be sets of members or sets of tuples. Main restriction on set is that all its elements have to be of the same structure. By structure we mean the following: If set is set of members, all members have to come from the same dimension (even though they can be from different levels). Therefore the following set { [1997], [August] } is valid set, but { [August], [Sales] } is not valid set. If the set is set of tuples, then the dimensionality should be the same and the corresponding members of the tuples have to be from the same dimension. I.e. { ([1997], [USA]), ([August], [WA]) } is valid set, but { ([August], [WA]), ([1997], [USA]) } is not valid, because order of dimensions in the tuple is reversed. One, who didn’t read the previous chapters may wonder why this restriction ? However, we remember, that primary use of sets is to axes in the query, and it doesn’t make any sense at all to mix dimensions on the axis. Most common OLAP analysis is to put set from one dimension against set from another dimension. Even when two dimensions on the same axis are required, it is as we saw for nesting purposes, and tuples are used to achieve that. There is no sense to put on axis tuples with different dimensionality or with incompatible structures.
Well, let’s start looking into set functions defining the set algebra. The first two functions are pretty self-explaining:
UNION( set1, set2 )
INTERSECT( set1, set2 )
While INTERSECT is relatively rarely used function, UNION is usually heavily used. It is common practice to combine more than two sets together, sometimes, one wants to union sets with members etc. In these situations UNION becomes clumsy to use, so there is another, nire easy way to unite sets and elements. As a mater of fact, we already know that way: the same way we enumerated members and tuples into set, we can also have union, i.e. the following is valid MDX:
{ set1, set2, …, setn }
More, than that. One can freely mix sets and elements in enumeration, i.e.
{ member1, set1, set2, member2,
…, memberk, setn }

We’ll see soon, that still function UNION offers something, that set enumeration doesn’t offer, but in most cases enumeration is more easy and intuitive (even though, people who used to program in the languages with strong typing, might feel uncomfortable with free mixing of scalars and arrays. But it is so convinient, so it was worth to define it this way). Let’s consider the following example: we want to see on the axis results for Europe, USA, all states in USA, all cities in state WA and for Asia (typical drilldown scenario). We can define the set as
{
 [Europe],
 [USA],
 [USA].Children,
 [WA].Children,
 [Asia]
}
The equivalent with UNION will look like
UNION(
  { [Europe], [USA] },
  UNION(
    [USA.Children],
    UNION(
      [WA].Children,
      { [Asia] }
    )
  )
)
Some MDX implementations may also define operator + (plus) as a way to do union of sets, i.e. to allow
set1 + set2 + … + setn

This might be nice algebraic extension, but it is not part of current MDX standard.

No comments:

Post a Comment