Monday, July 16, 2007

DW Data Model

This post is going to describe the data model for the OpenMRS data warehouse. It will be edited frequently to add documentation for the model and to modify it.

Star Schemas

Conventions used:
  • wk - (warehouse key) used for the surrogate (primary) keys of the dimension tables
  • nk - (natural key) used for the natural keys of the dimension tables
  • dd - (degenerate dimension) used for the degenerate dimensions in fact tables
  • bridge - used for tables that serve as bridge between dimension tables that have a many-to-many relationship

Observations fact table and associated dimensions


Orders fact table with associated dimensions


Encounter fact table and associated dimensions


Drug order fact tables and associated dimensions
Patient dimension and its subdimensions

Concept dimension with bridge tables and subdimensions
Experimental: Monthly cohort snapshot, using a sample cohort minidimension.



Concepts and definitions
  • Dimensions
    Dimension tables are used to describe, filter and aggregate the data that is collected for a specific activity or event. Practically they can be considered as the by words when defining the information content of a report. For example number of encounters by month by provider. Dimensions have relatively few rows compared to the fact tables, but have many descriptive textual attributes. Many times they contain hierarchies, for example location, province, country and are highly denormalized, storing even multiple hierarchies in a single table. A dimension has a primary key, that is usually a meaningless, unique integer, called a surrogate key. These keys are handled by the ETL process and are used to join dimensions to fact tables. In addition, a dimension has one or more fields that serve as the natural key of the table and are based on fields that come from the source system. When we track changes in a dimension there is a one-to-many relationship between a natural key and a surrogate key. Finally, a dimension contains many (hopefully) descriptive attributes.
  • Surrogate Keys
    Surrogate keys are used to join fact tables to dimensions. Surrogate keys isolate the data warehouse from changes to the operational systems, help in integrating data into dimensions coming from different sources and offer good performance when used in joins as they are simple integers. Lastly, they are used in tracking changes in dimension tables (see slowly changing dimensions).
  • Natural Keys
    The natural key is based on one or more fields from the source systems. The natural key identifies uniquely a row and it doesn`t change over time. A good example is the identifier in a patient record.
  • Date Dimensions
    The Date dimension is joined with most fact tables and it gives time-related information about the measurement process that is represented with the fact table. We can define date dimensions at different granularities, depending on our needs (for example day or month granularity). These date dimensions are defined as subsets (in terms of rows and columns) of each other to maintain consistency. Moreover, date dimensions are used multiple times in the same fact table with a different role (see dimensional roles).
  • Dimensional Roles
    Sometimes a dimension is linked with a fact table multiple times. Each of these is called a dimension role. Although the dimension is stored as one physical table, each of its roles is implemented as a separate view, ideally with unique column names. This enhances the usability of the schema when querying.
  • Outriggers (subdimensions)
    An outrigger (or subdimension) is a secondary dimension table linked to a dimension table. Although we try to have all the related attributes of a dimension in a single table, there are some sets of attributes that are used again and again in many dimensions. The same way that dimensions are used as an entry point in fact tables, subdimensions are used as an entry point in dimensions. The most usual subdimension is the date dimension, for example a last encounter date in a patient dimension.
  • Degenerate Dimensions
    We sometimes want to store identifiers coming from source systems transaction tables in the data warehouse, for purposes of grouping rows, tracking where the data come from or using them as primary keys in fact tables. We don`t create a separate dimension for these attributes as they would contain only one field, so we store them directly on the fact table and call them degenerate dimensions (for example observation id, encounter id).
  • Slowly Changing Dimensions
    We also need to handle changes to dimension attributes. There are three approaches:
    1. Overwrite the existing dimension record (type 1 slowly changing dimension). In this case we use an update else insert (merging) functionality.
    2. A new dimension record is created with a new surrogate key, but the same natural key (type 2 slowly changing dimensions). If a dimension contains attributes that are handled as above, we`ll add two fields: a row effective datetime (exact datetime of change) and a row end datetime (exact datetime of next change). This way, we know the interval a record was valid.
    3. Two columns are created for each attribute that we want to track changes. The first column contains the most recent value and the second column contains the old value.
  • Multivalued Dimensions
    In some cases, there are certain attributes that are associated with a many-to-many relationship with a dimension or simply take multiple values (e.g. patient identifiers, concept sets). When the number of these multiple values is open-ended, one way to handle this is to use a bridge table to associate them (for example a concept with a concept set).
  • Minidimensions
    In large dimensions (patient for example), there are certain attributes that are more interesting than others. When these attributes change frequently over time and we want to track these changes, we can use a technique called minidimension. We put these attributes (for example weight, age, cd4 count) in separate dimensions, when it makes sense to analyze them together. A row is created for each unique combination of these attributes that is found in the data (not one row for each patient). When the attributes take continuous values they are converted to banded ranges. The raw values are kept on the fact table. The fact table maintains foreign keys to both the main dimension and the minidimension(s) and is usually a periodic (weekly, monthly, e.t.c) snapshot fact table, so that we can track changes to these attributes.
  • Arbitrary Value Banding
    When there is a need to do value-band reporting on numeric facts, but the predefined bands (e.g. age group, weight group) don`t hold the ranges (lower and upper values) that we want, we can use a band definition table (here called band_group) that holds the needed sets of reporting bands. This way, we have the flexibility to redefine bands at query time.
  • Patient Study Group (Cohort)
    In addition to the minidimension technique, we could create patient study groups for more specialized analysis. To create a patient study group we run queries or execute an algorithm that uses a set of criteria to select a subset of the patients (e.g. HIV positive patients). The study group`s patient identifiers can be stored in a separate table, joined with the patient dimension table and hidden under a view (e.g. hiv_patients). We can then create reports combining this view with our fact tables.
  • Positional Attributes to Represent Text Facts
    In case we want to monitor the values of an important qualitative attribute through time (e.g. weight group or cd4 count group for HIV patients), in order to use it as a constraint to group patients based on this, we can store these values directly on the patient dimension. For example, we could have 15 attributes for a 15-month time series. This way, SQL queries are simplified as all the filtering can be done in one row. If we were to use a fact table to store these values, we would have to formulate constraints across records, which is difficult.

  • Fact Tables
    Fact tables store the measurements of the processes/activities/work flows/events of the system we are trying to model. There is one row in the fact table for each measurement. The fields of the fact table that hold the result of the measurements are called facts (or measures). Dimensions (and their attributes) describe these measurements and contain the context surrounding them. This is known as the grain of the fact table (e.g. a test to determine the temperature of a patient on certain day at a certain location). A fact table contains foreign keys that point to the dimension tables, have one or more numerical fields (the facts) and may contain one or more degenerate dimensions. The primary key of these tables is defined by a subset of the foreign keys.
  • Transaction Fact Tables
    These fact tables represent events that occurred at a specific point in time and space. They let us analyze data in extreme detail. Rows in these tables are found only if events take place. For example if there is no observation for a patient a given period then there is no information available for him and we won`t get any data for him in that period. Because of their unpredictable sparseness, we use other types of fact tables as described below.
  • Factless Fact Tables/Coverage Tables
    These tables have no facts. They are used to represent event tracking and coverage. Those that are used for events, relate different dimensions values at a point in time and space (e.g. encounters, orders). However, when designing such tables we add a dummy fact, which is usually called table_name_count that takes the value 1, just to make the SQL queries more clean. These tables can also be used to track coverage (for example have a fact table with one row for every test, location, day combination with a single fact denoting if it was used or not).
  • Periodic Snapshot Fact Tables
    When we need to track the values of some measurements at regular, predictable time intervals, we can use periodic snapshots. Instead of having a row for each event we take a picture (snapshot) of the activity at the end of the day, week, or month. Sometimes, periodic snapshots are just an aggregation of the transactions that occurred during a time period, however they have more predictable sparseness as there is one row of facts generated for each combination of the dimensions that it refers to. Periodic snapshots are suited for tracking long-running processes.
  • Accumulating Snapshot Fact Table
    These fact tables represent the complete life of an activity or process, that has a definite beginning and end. They have multiple date dimensions (roles) to refer to the events that occur during a process. These kind of tables are used for to the spans of time between different events (lag calculations).

  • Star Schema
    When fact tables and dimension tables are joined their structure looks like a star, so this why this kind of schema is called a star schema. Dimension attributes supply the report labeling, whereas the fact tables supply the report`s numeric values. The goal of this schema is simplicity, performance and extensibility. It is easier for users to understand and navigate then a normalized schema. Also, it is easier for database optimizers to handle these schemas, which are usually heavily indexed, resulting in improved performance. Finally, these schemas are extensible in these ways: we can add new dimensions, add new facts to fact tables, add attributes to dimension tables. These changes can be made using an ALTER TABLE command.
  • Dimensional Design Process
    There are four steps in designing a dimensional model are:
    1. Select the business process to model. A process is a natural business activity performed in an organization that is supported by a source system.
    2. Declare the grain of the business process. This means, specifying exactly what an individual fact table row represents and the level of detail associated with the fact table measurements.
    3. Choose the dimensions that apply to each fact table row. Dimensions represent all possible descriptions that take on single values in the context of each measurement.
    4. Identify the numeric facts that will populate each fact table row. Facts in a design must be true to the grain defined in step 2.
    In order to make decisions about the above steps, we need to consider both our users` requirements and the source data. Having an existing normalized ER diagram, the first step in converting it into a set of dimensional models is to separate the ER diagram into its discrete business processes and then model each one separately. The second step is to select those many-to-many relationships in the ER diagrams that contain numeric and additive nonkey facts and designate them as fact tables. The final step is to denormalize all the remaining tables into flat tables with keys that join directly to the fact tables. These tables become the dimension tables.
  • Dimensional Models for Medical Records
Dimension Tables
To do
Fact Tables
To do


References
  1. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition) by Ralph Kimball and Margy Ross
  2. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleanin by Ralph Kimball and Joe Caserta
  3. SQL Cookbook (Cookbooks (O'Reilly)) by Anthony Molinaro
  4. Oracle Data Warehousing Guide

3 comments:

Justin Miranda said...

Gjergji

Great start. Could you please describe each table in more detail. Discuss the terminology (fact table vs dimension table, band group, band, etc). Then describe the tables in each diagram, whether the table is a fact table or dimension table, how we might actually query the table, and what type of data we can get from each. You can probably concentrate on a single star schema where Patient and Observations are in the fact table and dimensions are Location and Date (from Encounter). We should discuss this some more over Skype.

Anonymous said...

I agree. Great start. I would like to work with you on the use of ontologies like SNOMED to link off the Concept table to provide hierarchical aggregation, etc. This would be through the concept_source_map table

Anonymous said...

An excellent job in explaining the concept. Very good job.