Monday, June 4, 2007

Another approach for reporting: A Data Warehouse System

The motivation for a data warehouse system

Why would we want to build a data warehouse system? We might consider doing this for some of the following reasons:
  • It is difficult to get the information users need out of the current system. For example, to find the number of encounters between given dates, categorized by provider and/or encounter type, and perhaps see the percentage change of this measure between time periods, one would have to be an SQL expert and write large, complex queries (a sketch of such a query follows this list). What's more, we would have to do that for every single question a user might ask.
  • It takes a lot of time for the system to return the results of queries (reports). When someone executes a query that uses many joins or aggregations (sum, avg, etc.) on a database that contains hundreds of thousands or millions of records, it takes the DBMS a long time to return a result. This is because the data model usually used for transaction processing applications is a normalized one (3NF), which is optimized for collecting data and maintaining its integrity and consistency, not for querying and analyzing huge amounts of data.
  • We want to use external data sources to help us analyze our current data. For example, we could use statistical or demographic data for a region and combine it with OpenMRS data: relate patient age groups to general information for those age groups, find the ratio of health providers to population, compare causes of death in the general population to causes of death at a clinic, etc.
  • We want to keep historical information. For example, a patient's attributes might change over time, and concepts, drugs, or diseases might change their names or categories; it could be useful to take these changes into account when we analyze our data over time.
  • We want to include information in our reports that is produced by complex domain logic. For example, we could associate patient data with rules (as described in the Logic Service Project) by running the domain logic once during the ETL process of the data warehouse. This information would then be easily accessible to a report writer.
  • We want to handle data cleaning and quality checking before creating reports. It is possible that some fields of the source database have missing data or values out of range. A data warehouse typically handles this through the ETL process or by marking the offending rows using an audit dimension.
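To give a feel for the first point, here is a rough sketch of the kind of query a user would have to write today. It assumes the OpenMRS 1.x tables encounter and encounter_type and their usual columns; names may differ in a given installation, and a real report would also need to join in provider and location names.

    -- Encounter counts per provider and encounter type for May 2007,
    -- with the percentage change relative to April 2007
    -- (NULL when there were no encounters in the previous period).
    SELECT cur.provider_id,
           et.name AS encounter_type,
           cur.cnt AS encounters_this_period,
           prev.cnt AS encounters_previous_period,
           (cur.cnt - prev.cnt) / prev.cnt * 100 AS pct_change
    FROM (SELECT provider_id, encounter_type, COUNT(*) AS cnt
            FROM encounter
           WHERE encounter_datetime >= '2007-05-01'
             AND encounter_datetime <  '2007-06-01'
           GROUP BY provider_id, encounter_type) cur
    LEFT JOIN (SELECT provider_id, encounter_type, COUNT(*) AS cnt
                 FROM encounter
                WHERE encounter_datetime >= '2007-04-01'
                  AND encounter_datetime <  '2007-05-01'
                GROUP BY provider_id, encounter_type) prev
           ON prev.provider_id = cur.provider_id
          AND prev.encounter_type = cur.encounter_type
    JOIN encounter_type et ON et.encounter_type_id = cur.encounter_type;

And this is one of the simpler questions; every new grouping or comparison means writing another query like it by hand.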
An overview of the data warehouse

How can the above requirements be met? What are the main components of such a system?

The approach chosen in this project is to build a data warehouse. Here is the definition given by R. Kimball in "The Data Warehouse ETL Toolkit":

"A data warehouse is a system that extracts, cleans, conforms and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making."

The data warehouse makes a copy of transactional and other source data in a separate database that is designed using dimensional models (star schemas). The goals of this design are simplicity/user understandability and query performance. Many tools, such as OLAP servers, query tools, report writers, and dashboards, expect this design and let users access the information they need more easily.
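As a minimal, purely illustrative sketch (the table and column names below are hypothetical, not part of any agreed design), a star schema for encounter analysis might look like this:

    -- Dimension tables: descriptive attributes users filter and group by.
    CREATE TABLE dim_date (
        date_key    INT PRIMARY KEY,   -- e.g. 20070604
        full_date   DATE,
        month_name  VARCHAR(20),
        quarter     TINYINT,
        year        SMALLINT
    );

    CREATE TABLE dim_provider (
        provider_key        INT PRIMARY KEY,   -- surrogate key
        source_provider_id  INT,               -- natural key from the operational system
        provider_name       VARCHAR(100)
    );

    CREATE TABLE dim_encounter_type (
        encounter_type_key        INT PRIMARY KEY,
        source_encounter_type_id  INT,
        encounter_type_name       VARCHAR(100)
    );

    -- Fact table: one row per encounter, with an additive measure.
    -- Further dimension keys (patient, location) would be added as needed.
    CREATE TABLE fact_encounter (
        date_key            INT,
        provider_key        INT,
        encounter_type_key  INT,
        encounter_count     INT DEFAULT 1
    );

The queries users care about then become simple joins from one large fact table to a handful of small, readable dimension tables.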

From a user's point of view, the data warehouse is used to monitor the performance of an organization and its activities over time: for example, how many encounters did we have this month, and how does that compare to the previous month? From a developer's point of view, it works on a large number of rows each time it produces a result for the user, as opposed to operational systems, which mostly deal with a few rows (or objects, if you prefer) at a time.

Some of the general use cases that the data warehouse is good at:
  • Summarizing/aggregating data, categorized and grouped in every way possible (slicing and dicing); see the example query after this list.
  • Navigating/exploring data at various levels of detail (drilling up/down).
  • Tracking changes of interesting entities/attributes (slowly changing dimensions).
  • Using a query/report tool to quickly create reports (instead of building your own).
  • Serving as the foundation for decision support systems and data mining/statistical applications.
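To illustrate the slicing and dicing mentioned above, here is what a typical report query could look like against the hypothetical star schema sketched earlier (again, table names are illustrative only):

    -- Total encounters by year, encounter type and provider.
    -- Drilling down to month only requires adding d.month_name to the grouping.
    SELECT d.year,
           et.encounter_type_name,
           p.provider_name,
           SUM(f.encounter_count) AS total_encounters
    FROM fact_encounter f
    JOIN dim_date d            ON d.date_key = f.date_key
    JOIN dim_encounter_type et ON et.encounter_type_key = f.encounter_type_key
    JOIN dim_provider p        ON p.provider_key = f.provider_key
    GROUP BY d.year, et.encounter_type_name, p.provider_name;

Every question of the form "measure X by dimensions Y and Z" follows this same simple pattern, which is also the kind of SQL that query and OLAP tools generate automatically.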
Some of the components of the system are:
  • The data sources (OpenMRS database and maybe other external data).
  • The ETL process. The ETL (Extract, Transform, Load) process extracts data from the data sources, ensures data quality, integrates the different sources, restructures the data so it can be loaded into the star schemas, and generally does all the preparation necessary to make the data ready for querying. The ETL process is usually developed with the help of a tool (Kettle in our case); a simplified sketch of the load step follows this list.
  • The relational implementation (star schemas). This may be a separate schema on the same DBMS or on a different physical server. This includes the dimensionally modeled tables that are the target of the ETL process.
  • OLAP cubes. OLAP (Online Analytical Processing) servers (Mondrian in our case) offer better performance (using aggregates and special indexes) and are better suited for analysis using specialized languages (e.g. MDX instead of SQL). The dimensional tables are the source for the OLAP cubes.
  • Data Access Tools. These include canned, prebuilt, parameter-driven reports/data exports accessible through the OpenMRS web application; ad hoc query tools (a desktop tool or another web application, e.g. Pentaho BI); and sophisticated analysis tools (e.g. SAS).
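Purely to illustrate the kind of transformation the ETL step performs (the real implementation will be a Kettle job with surrogate-key lookups, data quality checks and incremental extraction, not a single statement), loading the hypothetical fact table from the operational schema could look roughly like this:

    -- Illustrative load of the hypothetical fact_encounter table.
    -- Assumes the dimension tables have already been loaded and carry the
    -- natural keys (source_provider_id, source_encounter_type_id) for lookups.
    INSERT INTO fact_encounter (date_key, provider_key, encounter_type_key, encounter_count)
    SELECT CAST(DATE_FORMAT(e.encounter_datetime, '%Y%m%d') AS UNSIGNED) AS date_key,
           dp.provider_key,
           det.encounter_type_key,
           1
    FROM encounter e
    JOIN dim_provider dp        ON dp.source_provider_id = e.provider_id
    JOIN dim_encounter_type det ON det.source_encounter_type_id = e.encounter_type;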

The plan of the project
  • Define reporting/analysis use cases for OpenMRS that would benefit most from a data warehouse system. In this step we will define the needs of the users and try to categorize them into use cases and/or specific reports.
  • Design star schemas, using the requirements/use cases and the operational data model as input. In this step we will design star schemas (fact tables + dimensions) that can satisfy the requirements.
  • Develop ETL functionality. The ETL functionality will be implemented within a module or through various scripts that run in a separate, automated and scheduled process. It will be used to load the star schemas and to implement data transformations in a generalized way for the operational system. This step will be the most code intensive part of the project.
  • Implement the data warehouse using a relational implementation (MySQL). In this step the necessary physical database objects will be created and loaded with data by the ETL process.
  • Implement the data warehouse using an OLAP implementation (Mondrian + MySQL). In this step Mondrian will be deployed and used for querying the database built in the previous step.
  • Data Access Tools. Various open source tools will be evaluated (starting with Pentaho) for report writing. It may also be necessary to custom-code some reports.
The goal of the project is to have, by the end of the summer, a system integrated with OpenMRS that delivers visible results for users. To achieve this, the steps outlined above will be followed iteratively.

Comments:

Simon Kelly said...

HOORAY for this project! Literally minutes before I read this post I started downloading some of the Pentaho and JasperSoft products. We are facing some of the problems you mentioned and would benefit greatly from this work.

I look forward to seeing some results and can offer some use cases and the requirements we have.

Gjergji Strakosha said...

Simon,
Feel free to send me any ideas you might have, so that I can take them into account during the design phase. You can find contact information on this page.
