Monday, July 30, 2007

ETL Process

The ETL (Extract, Transform, Load) process comprises several steps and its architecture depends on the specific data warehouse system. In this post, an outline of the process is given, along with the choices that are, or could be, used for OpenMRS.

Data sources, staging area and data targets

Data sources: The only data source for the moment is the OpenMRS database.
Staging area: This refers to an intermediate area between the source database and the DW database. This is where the extracted data from the source systems are stored and manipulated through transformations. At this time, there is no need for a sophisticated staging area, other than a few independent tables (called orphans), which are stored in the DW database.
Data Targets: The DW database. This is the database that will be used for querying and data extraction, and it is the foundation for OLAP cubes.

Extraction

The extraction of the data is done using SQL queries on the OpenMRS db. These SQL scripts are mostly responsible for flattening the normalized data model, handling NULL values (which are generally avoided in the DW) and performing simple conversions (e.g. boolean values to textual descriptions).
The extraction process will extract only new and changed records. This is supported by the audit columns in most of the source tables. When the audit column values are not properly maintained (e.g. there are NULL values), we can use a MINUS set operator to compare the natural keys of the dimension in the data warehouse against the natural keys in the source system table. MINUS is not supported in MySQL, so we'll use NOT EXISTS or an equivalent construct to achieve the same result.
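As a rough sketch of both ideas (flattening plus NOT EXISTS based change detection), the query below extracts patients whose natural key is not yet present in the warehouse dimension. The table and column names (patient, patient_identifier, dw.patient_dim and their fields) are only illustrative and do not necessarily match the actual OpenMRS or DW schema:

    SELECT pi.identifier                                        AS patient_nk,
           COALESCE(p.gender, 'Unknown')                        AS gender,
           CASE p.dead WHEN 1 THEN 'Deceased' ELSE 'Alive' END  AS vital_status,
           p.birthdate
    FROM   patient p
           JOIN patient_identifier pi ON pi.patient_id = p.patient_id
    WHERE  NOT EXISTS (SELECT 1                      -- emulates MINUS: keep only natural keys
                       FROM   dw.patient_dim d       -- not yet present in the dimension
                       WHERE  d.patient_nk = pi.identifier);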

Cleaning and conforming

Although this step is usually where most of the transformations are needed, we'll mostly rely on the features of the ETL tool (Kettle).
During this step, data cleaning, data quality checks and data integration are performed. This means that individual columns are checked for valid data (e.g. ranges for numeric values) and business rules are enforced. Also, data from different sources are integrated. There should also be a policy established for error handling.
For this project these processes will be kept to a minimum. It is assumed that the OpenMRS system implements proper validation logic, so we won't try to duplicate that. We use Kettle's auditing features for auditing and reporting on the ETL process: there are two tables, one for transformation executions and one for job executions.
The errors we will face will mostly be database (MySQL) errors, probably caused by unexpected data, and these should be handled manually by an administrator. Another approach would be to handle these errors and keep error event tables to log the problematic records.
In addition, it is assumed that the data will be loaded from a single OpenMRS instance, so we don't take into account possible issues that might come up when trying to integrate data from multiple databases.

Delivering Dimension Tables

At this step the dimension tables are loaded with the new and changed data. A dimension table consists of: a primary key (a meaningless integer, the surrogate key) that is used to join to fact tables; a natural key, based on fields of the source system, that doesn't change over time or across systems (e.g. for a patient dimension we use the patient identifier as the natural key, not the internal patient id); and descriptive attributes.
We also need to handle changes to dimension attributes. There are three approaches:
1. Overwrite the existing dimension record (type 1 slowly changing dimension)
In this case we use an update else insert (merging) functionality.
2. A new dimension record is created with a new surrogate key, but the same natural key (type 2 slowly changing dimensions).
If a dimension contains attributes that are handled this way, we'll add two fields: a row effective datetime (exact datetime of the change) and a row end datetime (exact datetime of the next change). This way, we know the interval during which a record was valid.
3. Two columns are created for each attribute whose changes we want to track. The first column contains the most recent value and the second column contains the previous value.
For the moment, type 1 is used, as sketched below.
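A minimal sketch of the type 1 "update else insert" merge in MySQL follows, assuming a patient_dim table with a unique index on its natural key (all names and values here are illustrative):

    INSERT INTO patient_dim (patient_nk, gender, vital_status, birthdate)
    VALUES ('AMRS-12345-6', 'F', 'Alive', '1970-01-01')
    ON DUPLICATE KEY UPDATE
        gender       = VALUES(gender),        -- overwrite in place: no history is kept
        vital_status = VALUES(vital_status),
        birthdate    = VALUES(birthdate);

In practice Kettle provides steps for this kind of merging (and for the type 2 variant), so the statement above only shows the equivalent logic.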
Other Issues:
  • Generating surrogate keys
  • Date Dimension
  • Dimensional Roles
Delivering Fact Tables
  • Structure
  • Referential Integrity
  • Surrogate Key Pipeline
  • Preparing the fact tables for loading
  • Incremental Loading
  • Aggregations
  • OLAP Cubes
Database Features for Improved Performance
To do

Scheduling
To do

Monday, July 16, 2007

DW Data Model

This post is going to describe the data model for the OpenMRS data warehouse. It will be edited frequently to add documentation for the model and to modify it.

Star Schemas

Conventions used:
  • wk - (warehouse key) used for the surrogate (primary) keys of the dimension tables
  • nk - (natural key) used for the natural keys of the dimension tables
  • dd - (degenerate dimension) used for the degenerate dimensions in fact tables
  • bridge - used for tables that serve as bridge between dimension tables that have a many-to-many relationship
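The sketch below shows how these conventions could look in DDL; the specific tables and columns are examples only, not the final model:

    CREATE TABLE patient_dim (
        patient_wk    INT NOT NULL AUTO_INCREMENT,  -- wk: surrogate (warehouse) key
        patient_nk    VARCHAR(50) NOT NULL,         -- nk: natural key from the source system
        gender        VARCHAR(20),
        PRIMARY KEY (patient_wk),
        UNIQUE KEY (patient_nk)
    );

    CREATE TABLE encounter_fact (
        encounter_id_dd    INT NOT NULL,            -- dd: degenerate dimension (source encounter id)
        patient_wk         INT NOT NULL,            -- joins to patient_dim
        encounter_date_wk  INT NOT NULL,            -- joins to the date dimension
        encounter_count    TINYINT NOT NULL DEFAULT 1
    );

    CREATE TABLE concept_set_bridge (               -- bridge: many-to-many between concepts and sets
        concept_set_wk  INT NOT NULL,
        concept_wk      INT NOT NULL,
        PRIMARY KEY (concept_set_wk, concept_wk)
    );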

Observations fact table and associated dimensions


Orders fact table with associated dimensions


Encounter fact table and associated dimensions


Drug order fact tables and associated dimensions
Patient dimension and its subdimensions

Concept dimension with bridge tables and subdimensions
Experimental: Monthly cohort snapshot, using a sample cohort minidimension.



Concepts and definitions
  • Dimensions
    Dimension tables are used to describe, filter and aggregate the data that is collected for a specific activity or event. Practically, they can be thought of as the "by" words when defining the information content of a report, for example the number of encounters by month by provider. Dimensions have relatively few rows compared to the fact tables, but have many descriptive textual attributes. They often contain hierarchies (for example location, province, country) and are highly denormalized, storing even multiple hierarchies in a single table. A dimension has a primary key, which is usually a meaningless, unique integer called a surrogate key. These keys are handled by the ETL process and are used to join dimensions to fact tables. In addition, a dimension has one or more fields that serve as the natural key of the table and are based on fields that come from the source system. When we track changes in a dimension there is a one-to-many relationship between a natural key and its surrogate keys. Finally, a dimension contains many (hopefully) descriptive attributes.
  • Surrogate Keys
    Surrogate keys are used to join fact tables to dimensions. Surrogate keys isolate the data warehouse from changes to the operational systems, help in integrating data into dimensions coming from different sources and offer good performance when used in joins as they are simple integers. Lastly, they are used in tracking changes in dimension tables (see slowly changing dimensions).
  • Natural Keys
    The natural key is based on one or more fields from the source systems. The natural key uniquely identifies a row and doesn't change over time. A good example is the identifier in a patient record.
  • Date Dimensions
    The Date dimension is joined with most fact tables and gives time-related information about the measurement process represented by the fact table. We can define date dimensions at different granularities, depending on our needs (for example day or month granularity). These date dimensions are defined as subsets (in terms of rows and columns) of each other to maintain consistency. Moreover, a date dimension is often used multiple times in the same fact table, each time with a different role (see dimensional roles).
  • Dimensional Roles
    Sometimes a dimension is linked with a fact table multiple times. Each of these is called a dimension role. Although the dimension is stored as one physical table, each of its roles is implemented as a separate view, ideally with unique column names. This enhances the usability of the schema when querying.
  • Outriggers (subdimensions)
    An outrigger (or subdimension) is a secondary dimension table linked to a dimension table. Although we try to have all the related attributes of a dimension in a single table, there are some sets of attributes that are used again and again in many dimensions. In the same way that dimensions are used as entry points into fact tables, subdimensions are used as entry points into dimensions. The most common subdimension is the date dimension, for example a last encounter date in a patient dimension.
  • Degenerate Dimensions
    We sometimes want to store identifiers coming from the source systems' transaction tables in the data warehouse, for purposes of grouping rows, tracking where the data come from or using them as primary keys in fact tables. We don't create a separate dimension for these attributes, as it would contain only one field, so we store them directly on the fact table and call them degenerate dimensions (for example observation id, encounter id).
  • Slowly Changing Dimensions
    We also need to handle changes to dimension attributes. There are three approaches:
    1. Overwrite the existing dimension record (type 1 slowly changing dimension). In this case we use an update else insert (merging) functionality.
    2. A new dimension record is created with a new surrogate key, but the same natural key (type 2 slowly changing dimensions). If a dimension contains attributes that are handled this way, we'll add two fields: a row effective datetime (exact datetime of the change) and a row end datetime (exact datetime of the next change). This way, we know the interval during which a record was valid.
    3. Two columns are created for each attribute whose changes we want to track. The first column contains the most recent value and the second column contains the previous value.
  • Multivalued Dimensions
    In some cases, certain attributes have a many-to-many relationship with a dimension or simply take multiple values (e.g. patient identifiers, concept sets). When the number of these multiple values is open-ended, one way to handle this is to use a bridge table to associate them (for example a concept with a concept set).
  • Minidimensions
    In large dimensions (patient, for example), certain attributes are more interesting than others. When these attributes change frequently over time and we want to track these changes, we can use a technique called a minidimension. We put these attributes (for example weight, age, cd4 count) in separate dimensions, when it makes sense to analyze them together. A row is created for each unique combination of these attributes that is found in the data (not one row for each patient). When the attributes take continuous values they are converted to banded ranges. The raw values are kept on the fact table. The fact table maintains foreign keys to both the main dimension and the minidimension(s) and is usually a periodic (weekly, monthly, etc.) snapshot fact table, so that we can track changes to these attributes.
  • Arbitrary Value Banding
    When there is a need to do value-band reporting on numeric facts, but the predefined bands (e.g. age group, weight group) don't hold the ranges (lower and upper values) that we want, we can use a band definition table (here called band_group) that holds the needed sets of reporting bands. This way, we have the flexibility to redefine bands at query time (see the query sketch after this group of definitions).
  • Patient Study Group (Cohort)
    In addition to the minidimension technique, we could create patient study groups for more specialized analysis. To create a patient study group we run queries or execute an algorithm that uses a set of criteria to select a subset of the patients (e.g. HIV positive patients). The study group's patient identifiers can be stored in a separate table, joined with the patient dimension table and hidden under a view (e.g. hiv_patients). We can then create reports combining this view with our fact tables.
  • Positional Attributes to Represent Text Facts
    In case we want to monitor the values of an important qualitative attribute through time (e.g. weight group or cd4 count group for HIV patients) and use it as a constraint for grouping patients, we can store these values directly on the patient dimension. For example, we could have 15 attributes for a 15-month time series. This way, SQL queries are simplified, as all the filtering can be done within one row. If we were to use a fact table to store these values, we would have to formulate constraints across records, which is difficult.
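As a sketch of the value-banding idea described above, the query below counts observations per band using the band_group table; the band_group columns and the observation fact table are assumptions for illustration:

    SELECT b.band_name,
           COUNT(*) AS obs_count
    FROM   obs_fact f
           JOIN band_group b
             ON  b.group_name     = 'cd4 count bands'   -- which set of reporting bands to use
             AND f.value_numeric >= b.lower_value
             AND f.value_numeric <  b.upper_value
    GROUP BY b.band_name;

Because the bands live in a separate table, new reporting ranges can be added or redefined without touching the fact table.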

  • Fact Tables
    Fact tables store the measurements of the processes/activities/workflows/events of the system we are trying to model. There is one row in the fact table for each measurement. The fields of the fact table that hold the results of the measurements are called facts (or measures). Dimensions (and their attributes) describe these measurements and contain the context surrounding them. The level of detail represented by a single fact table row is known as the grain of the fact table (e.g. a test to determine the temperature of a patient on a certain day at a certain location). A fact table contains foreign keys that point to the dimension tables, has one or more numeric fields (the facts) and may contain one or more degenerate dimensions. The primary key of these tables is defined by a subset of the foreign keys.
  • Transaction Fact Tables
    These fact tables represent events that occurred at a specific point in time and space. They let us analyze data in extreme detail. Rows in these tables exist only if events take place. For example, if there is no observation for a patient in a given period, then there is no information available and we won't get any data for that patient in that period. Because of their unpredictable sparseness, we also use the other types of fact tables described below.
  • Factless Fact Tables/Coverage Tables
    These tables have no facts. They are used to represent event tracking and coverage. Those used for events relate different dimension values at a point in time and space (e.g. encounters, orders). However, when designing such tables we add a dummy fact, usually called table_name_count, that takes the value 1, just to make the SQL queries cleaner. These tables can also be used to track coverage (for example a fact table with one row for every test, location and day combination, with a single fact denoting whether it was used or not).
  • Periodic Snapshot Fact Tables
    When we need to track the values of some measurements at regular, predictable time intervals, we can use periodic snapshots. Instead of having a row for each event, we take a picture (snapshot) of the activity at the end of the day, week or month. Sometimes periodic snapshots are just an aggregation of the transactions that occurred during a time period; however, they have more predictable sparseness, as one row of facts is generated for each combination of the dimensions they refer to. Periodic snapshots are suited for tracking long-running processes (a loading sketch is given after this group of definitions).
  • Accumulating Snapshot Fact Table
    These fact tables represent the complete life of an activity or process that has a definite beginning and end. They have multiple date dimensions (roles) to refer to the events that occur during the process. This kind of table is used to analyze the spans of time between different events (lag calculations).
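A minimal sketch of how a monthly periodic snapshot could be loaded from a transaction-grain fact table follows (all table and column names are illustrative):

    INSERT INTO patient_monthly_snapshot (month_wk, patient_wk, encounter_count)
    SELECT d.month_wk,
           f.patient_wk,
           COUNT(*) AS encounter_count
    FROM   encounter_fact f
           JOIN date_dim d ON d.date_wk = f.encounter_date_wk
    WHERE  d.year_month = '2007-07'               -- the month being snapshotted
    GROUP BY d.month_wk, f.patient_wk;

A real snapshot would normally be driven from the patient dimension (e.g. with a LEFT JOIN) so that patients with no activity in the month still get a row, which is what gives the snapshot its predictable sparseness.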

  • Star Schema
    When fact tables and dimension tables are joined, their structure looks like a star, which is why this kind of schema is called a star schema. Dimension attributes supply the report labeling, whereas the fact tables supply the report's numeric values. The goals of this schema are simplicity, performance and extensibility. It is easier for users to understand and navigate than a normalized schema. Also, it is easier for database optimizers to handle these schemas, which are usually heavily indexed, resulting in improved performance. Finally, these schemas are extensible: we can add new dimensions, add new facts to fact tables and add attributes to dimension tables. These changes can be made with an ALTER TABLE command. (A sample star-join query is sketched at the end of this section, just before the references.)
  • Dimensional Design Process
    The four steps in designing a dimensional model are:
    1. Select the business process to model. A process is a natural business activity performed in an organization that is supported by a source system.
    2. Declare the grain of the business process. This means specifying exactly what an individual fact table row represents and the level of detail associated with the fact table measurements.
    3. Choose the dimensions that apply to each fact table row. Dimensions represent all possible descriptions that take on single values in the context of each measurement.
    4. Identify the numeric facts that will populate each fact table row. Facts in a design must be true to the grain defined in step 2.
    In order to make decisions about the above steps, we need to consider both our users' requirements and the source data. Having an existing normalized ER diagram, the first step in converting it into a set of dimensional models is to separate the ER diagram into its discrete business processes and then model each one separately. The second step is to select those many-to-many relationships in the ER diagrams that contain numeric and additive nonkey facts and designate them as fact tables. The final step is to denormalize all the remaining tables into flat tables with keys that join directly to the fact tables. These tables become the dimension tables.
  • Dimensional Models for Medical Records
Dimension Tables
To do
Fact Tables
To do
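To illustrate how a star schema is queried (see the Star Schema entry above), here is a sketch of the "number of encounters by month by provider" example; the table and column names are illustrative:

    SELECT d.year_month,
           p.provider_name,
           SUM(f.encounter_count) AS encounters
    FROM   encounter_fact f
           JOIN date_dim     d ON d.date_wk     = f.encounter_date_wk
           JOIN provider_dim p ON p.provider_wk = f.provider_wk
    GROUP BY d.year_month, p.provider_name
    ORDER BY d.year_month, p.provider_name;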


References
  1. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition) by Ralph Kimball and Margy Ross
  2. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data by Ralph Kimball and Joe Caserta
  3. SQL Cookbook by Anthony Molinaro (O'Reilly)
  4. Oracle Data Warehousing Guide

Thursday, June 14, 2007

Pentaho Evaluation for OpenMRS

This post lists some of the Pentaho tools, how they can be used to support building a data warehouse system (a reporting database), and what one needs to know in order to create reports and analyze data.

- ETL (Kettle)
Kettle can be used in OpenMRS in two ways:
  1. To load the data warehouse tables
  2. On-the-fly transformations and exports of data
1. For the moment, our only data source is the OpenMRS database. For that reason we can use SQL to extract the needed data. We also need a way to extract only the changed data. That means we should fetch only records that are new or updated since the last successful loading of the DW (since the last time the transformation ran without error). We can use the audit columns in the source database (date_created, date_changed) and the logging feature of Kettle to handle this. Kettle lets you specify a table that logs the details of transformation executions. This table contains fields to log the number of records read from the source database, the number of records loaded into the DW and the timestamp of the last successful transformation execution; a query sketch based on this is given a little further below. The last feature can also be used to publish reports (a web page for example) with information on the loading process of the DW.
Most transformations can also be handled with SQL. For example, we will use SQL to filter results, join tables, handle NULL values, map abbreviated field values to user-friendly values, convert from one data type to another, apply formatting, map continuous values to pre-defined bands and generally prepare the data for querying.
After the previous steps are done, we will use Kettle to handle surrogate keys (these keys are used as primary keys in the DW and are different from the source system keys) when loading dimension tables. Moreover, Kettle will be used to merge the extracted data into the DW. A mapping is defined between the input fields and the DW fields; records that changed in the source system are updated in the DW and new records are inserted.
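As a sketch of the audit-column approach, the extraction query below picks up encounters created or changed since the last successful load. The log table name (etl_trans_log) and its columns are assumptions, standing in for whatever table Kettle is configured to log into:

    SELECT e.*
    FROM   encounter e,
           (SELECT MAX(enddate) AS last_load     -- end time of the last successful run
            FROM   etl_trans_log
            WHERE  status = 'end') t
    WHERE  e.date_created > t.last_load
       OR  e.date_changed > t.last_load;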

Depending on the requirements Kettle can be used, in the future, to handle "slowly changing dimensions", integrate external data and execute other processes (jobs).

2. Kettle makes it easy to export a dataset in various formats such as Excel, Access, delimited files, etc.

After preparing the data for querying, we need query tools to write reports and generally explore and analyze the data. These should be able to present the query results in tables, crosstabs and charts. The formats that should generally be supported are: HTML, pdf and Excel. The user should be able to access the reports through web applications, portals, desktop clients (e.g. Excel) and/or email.

In Pentaho the above use cases translate into something like this:

The user's point of view
The user uses a browser that points to a Pentaho web application.
The user chooses a "solution" which can be seen as a collection of folders.
Each folder contains one or more "reports" for a specific subject area. These reports can be one of the following:
  • JPivot pages, that let the user navigate data on a cube. The user can drill down/up, pivot (swap axes) and export the data in pdf and excel.
  • JFreeReport pages, where the user enters one or more parameters to filter the results and then gets a report in HTML, Excel or pdf format.

  • Dashboard pages, which combine gauges, maps, charts and tables in order to show multiple results together.


The developer's point of view

- Action Sequences
In order to create these "solutions", or in other words the screens from which the user will access the reports, the developer must create a set of XML documents. These XML documents define action sequences, which are used by the Pentaho engine (or framework if you prefer) to call back-end services and components (e.g. run a report, generate a PDF).
To help in creating these XML documents, there is Design Studio, which is a GUI tool based on Eclipse.

Some of the available actions and their corresponding components are:

- JFreeReport
In order to create a report, the developer must specify the SQL query (or MDX query if the data source is an OLAP server) to fetch the report's data, and the report definition, which can be created through a GUI tool like the Report Design Wizard, or the Report Designer for reports with more complex layouts.
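As a sketch of what the SQL behind such a report might look like, the query below counts encounters by encounter type between two report parameters; the table names and the ${start_date}/${end_date} placeholders are illustrative and are not Pentaho's actual parameter syntax:

    SELECT et.encounter_type_name,
           COUNT(*) AS encounters
    FROM   encounter_fact f
           JOIN date_dim           d  ON d.date_wk            = f.encounter_date_wk
           JOIN encounter_type_dim et ON et.encounter_type_wk = f.encounter_type_wk
    WHERE  d.calendar_date BETWEEN '${start_date}' AND '${end_date}'
    GROUP BY et.encounter_type_name;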

- Kettle
This action lets you call a Kettle transformation or job, which is a good example of how Kettle can be embedded in an application.

- JPivot
The developer can create a JPivot page using a PivotViewComponent. Apart from the XML that specifies the action sequence, he/she must write the MDX query that will fetch the data from the OLAP server.

- Dashboards
To create a dashboard one should combine actions and components that produce charts and reports. This is done through a JSP page that the developer must create.

Monday, June 4, 2007

Another approach for reporting: A Data Warehouse System

The motivation for a data warehouse system

Why would we want to build a data warehouse system? We might consider doing this for some of the following reasons:
  • It is difficult to get the information that users need out of the current system. For example, in order to find the number of encounters between given dates, categorized by provider and/or encounter type, and maybe see the percentage change of this measure between time periods, one would have to be an expert in SQL and write huge, complex SQL queries. What's more, we would have to do that for every single question a user might ask.
  • It takes a lot of time for the system to return the results of queries (reports). When someone executes a query that uses many joins or aggregations (sum, avg, etc.) on a database that contains hundreds of thousands or millions of records, it takes a lot of time for the DBMS to return a result. This is because the data model usually used for transaction processing applications is a normalized one (3NF), which is optimized for collecting data and maintaining their integrity and consistency, not for querying and analyzing huge amounts of data.
  • We want to use external data sources to help us analyze our current data. For example, we could use statistical or demographic data for a region and combine that with OpenMRS data. More specifically, relate patient age groups with general information for that age group, find the ratio of the number of health providers to the population, compare death causes for the general population to death causes for a clinic, etc.
  • We want to keep historical information. For example, a patient's attributes might change over time; concepts, drugs and diseases might change their names or their category names; and it could be useful to take these changes into account when we analyze our data over time.
  • We want to include information in our reports, that is produced from complex domain logic. For example, we could associate patient data with rules (as described in the Logic Service Project), by running the domain logic once during the ETL process of the data warehouse. Then, this information would be easily accessible to a report writer.
  • We want to handle cleaning of data and checking their quality, before creating reports. It is possible that some fields of the source database could have missing data or values out of range. A data warehouse typically handles this, through the ETL process or by marking the offending rows using an audit dimension.
An overview of the data warehouse

How can the above requirements be met? What are the main components of such a system?

The approach chosen in this project is to build a data warehouse. Here is the definition given by R. Kimball in "The Data Warehouse ETL Toolkit":

"A data warehouse is a system that extracts, cleans, conforms and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making."

The data warehouse makes a copy of transaction and other source data in a separate database that is designed using dimensional models (or star schemas). The goals of this design are simplicity/user understandability and query performance. There are many tools that require this design, like OLAP servers, query tools, report writers, dashboards, etc., which let users access the needed information more easily.

From a user's point of view, the data warehouse is used to watch the performance of an organization and its activities through time. For example, how many encounters did we have this month and how does this compare to the previous month? From a developer's point of view, it deals with a large number of rows at a time to return a result to the user, as opposed to operational systems, which mostly deal with a few rows (or objects, if you prefer) at a time.

Some of the general use cases that the data warehouse is good at:
  • Summaries/aggregations of data, categorized and grouped in every way possible (slicing and dicing).
  • Navigating/exploring data in various detail levels (drilling up/down)
  • Track changes of interesting entities/attributes (slowly changing dimensions)
  • Use a query/report tool to quickly create reports (instead of building your own)
  • Can be the foundation for decision support systems and data mining/statistical applications
Some of the components of the system are:
  • The data sources (OpenMRS database and maybe other external data).
  • The ETL process. The ETL (Extract, Transform, Load) process, extracts data from the data sources, ensures data quality, integrates different sources, modifies the structure of data so they can be loaded in star schemas and generally does all the necessary preparation to make the data ready for querying. The ETL process is usually developed with the help of a tool (Kettle in our case).
  • The relational implementation (star schemas). This may be a separate schema on the same DBMS or on a different physical server. This includes the dimensionally modeled tables that are the target of the ETL process.
  • OLAP cubes. There are OLAP (Online Analytical Processing) servers (Mondrian in our case), that offer better performance (using aggregates and special indexes) and are better suited for analysis using specialized languages (e.g. MDX instead of SQL). The source for OLAP cubes are the dimensional tables.
  • Data Access Tools. These include: canned, prebuilt parameter-driven reports/data exports accessible through the OpenMRS web application, ad hoc query tools (could be a desktop tool or another web application - e.g. Pentaho BI), sophisticated analysis tools ( e.g. SAS).

The plan of the project
  • Define reporting/analysis use cases for OpenMRS that would benefit most from a data warehouse system. In this step we will define the needs of the users and try to categorize them into use cases and/or specific reports.
  • Design star schemas using the requirements/use cases and the operational data model as input. In this step we will design star schemas (fact tables + dimensions) that can satisfy the requirements.
  • Develop ETL functionality. The ETL functionality will be implemented within a module or through various scripts that run in a separate, automated and scheduled process. It will be used to load the star schemas and to implement data transformations in a generalized way for the operational system. This step will be the most code intensive part of the project.
  • Implement the data warehouse using relational implementation (MySQL). In this step the necessary physical database objects will be created and loaded with data from the ETL process.
  • Implement the data warehouse using OLAP implementation (Mondrian + MySQL). In this step Mondrian will be deployed and used for querying the database that was built on the previous step.
  • Data Access Tools. Various open source tools will be evaluated (starting from Pentaho) for report writing. Also, it may be necessary to custom code some needed reports.
The goal of the project is to have, by the end of the summer, a system integrated with OpenMRS that delivers visible results for users. In order to achieve this, the steps outlined above will be followed in an iterative way.

Friday, May 25, 2007

About OpenMRS

Our world continues to be ravaged by a pandemic of epic proportions, as over 40 million people are infected with or dying from HIV/AIDS -- most (up to 95%) are in developing countries. Prevention and treatment of HIV/AIDS on this scale requires efficient information management, which is critical as HIV/AIDS care must increasingly be entrusted to less skilled providers. Whether for lack of time, developers, or money, most HIV/AIDS programs in developing countries manage their information with simple spreadsheets or small, poorly designed databases...if anything at all. To help them, we need to find a way not only to improve management tools, but also to reduce unnecessary, duplicative efforts.

As a response to these challenges, OpenMRS formed in 2004 as an open source medical record system framework for developing countries -- a tide which rises all ships. OpenMRS is a multi-institution, nonprofit collaborative led by Regenstrief Institute, Inc. (http://regenstrief.org), a world-renowned leader in medical informatics research, and Partners In Health (http://pih.org), a Boston-based philanthropic organization with a focus on improving the lives of underprivileged people worldwide through health care service and advocacy. These teams nurture a growing worldwide network of individuals and organizations all focused on creating medical record systems and a corresponding implementation network to allow system development self-reliance within resource-constrained environments. To date, OpenMRS has been implemented in several African countries, including South Africa, Kenya, Rwanda, Lesotho, Uganda, and Tanzania. This work is supported in part by organizations such as the World Health Organization (WHO), the Centers for Disease Control (CDC), the Rockefeller Foundation, and the President's Emergency Plan for AIDS Relief (PEPFAR).