Thursday, June 14, 2007

Pentaho Evaluation for OpenMRS

This post lists some of the Pentaho tools, how they can be used to support building a data warehouse system (a reporting database), and what one needs to know in order to create reports and analyze data.

- ETL (Kettle)
Kettle can be used in OpenMRS in two ways:
  1. To load the data warehouse tables
  2. On the fly transformations and exports of data
1. For the moment, our only data source is the OpenMRS database. For that reason we can use SQL to extract the needed data. We also need a way to extract only the changed data, meaning we should fetch only the records that are new or updated since the last successful loading of the DW (since the last time the transformation ran without error). We can use the audit columns in the source database (date_created, date_changed) together with the logging feature of Kettle to handle this. Kettle lets you specify a table that logs the details of each transformation execution. This table contains fields for the number of records read from the source database, the number of records loaded into the DW and the timestamp of the last successful transformation execution. The logging table can also be used to publish reports (a web page, for example) with information on the loading process of the DW.
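A minimal sketch of such an incremental extract, assuming for illustration the encounter table and a hypothetical etl_log table in which the end time of each successful run is recorded:

    -- etl_log is a hypothetical logging table, e.g. (end_time DATETIME, status VARCHAR(20),
    -- records_read INT, records_written INT), updated after every transformation run
    SELECT e.*
    FROM   encounter e
    WHERE  e.date_created > (SELECT MAX(end_time) FROM etl_log WHERE status = 'OK')
       OR  e.date_changed > (SELECT MAX(end_time) FROM etl_log WHERE status = 'OK');
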
Most transformations can also be handled with SQL. For example, we will use SQL to filter results, join tables, handle NULL values, map abbreviated field values to user-friendly values, convert from one data type to another, format values, map continuous values to pre-defined bands and generally prepare the data for querying.
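A fragment of such a transformation query might look like the following rough sketch (the source view and its columns are hypothetical and only stand in for whatever attributes we actually extract):

    SELECT src.patient_id,
           COALESCE(src.civil_status, 'Unknown') AS civil_status, -- handle NULL values
           CASE src.gender WHEN 'M' THEN 'Male'
                           WHEN 'F' THEN 'Female'
                           ELSE 'Unknown' END     AS gender,       -- map abbreviated values
           CASE WHEN src.age < 15 THEN '0-14'
                WHEN src.age < 50 THEN '15-49'
                ELSE '50+' END                    AS age_band      -- map continuous values to bands
    FROM   extracted_patients src;
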
After the previous steps are done, we will use Kettle to handle surrogate keys (keys that are used as primary keys in the DW and are different from the source system keys) when loading dimension tables. Moreover, Kettle will be used to merge the extracted data into the DW: a mapping is defined from the input fields to the DW fields, records that changed in the source system are updated in the DW, and new records are inserted.
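In plain SQL terms, the dimension load that Kettle performs is roughly equivalent to the following sketch (patient_dim and stage_patient are hypothetical names used only for illustration):

    CREATE TABLE patient_dim (
        patient_key INT AUTO_INCREMENT PRIMARY KEY, -- surrogate key, generated in the DW
        patient_id  INT NOT NULL,                   -- natural key from the source system
        gender      VARCHAR(10),
        birthdate   DATE
    );

    -- Update the records that changed in the source system...
    UPDATE patient_dim d
    JOIN   stage_patient s ON s.patient_id = d.patient_id
    SET    d.gender = s.gender, d.birthdate = s.birthdate;

    -- ...and insert the new ones
    INSERT INTO patient_dim (patient_id, gender, birthdate)
    SELECT s.patient_id, s.gender, s.birthdate
    FROM   stage_patient s
    LEFT JOIN patient_dim d ON d.patient_id = s.patient_id
    WHERE  d.patient_id IS NULL;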

Depending on the requirements, Kettle can also be used in the future to handle "slowly changing dimensions", integrate external data and execute other processes (jobs).

2. Kettle makes it easy to export a dataset in various formats such as Excel, Access, delimited files etc.

After preparing the data for querying, we need query tools to write reports and generally explore and analyze the data. These should be able to present the query results in tables, crosstabs and charts. The formats that should generally be supported are HTML, PDF and Excel. The user should be able to access the reports through web applications, portals, desktop clients (e.g. Excel) and/or email.

In Pentaho the above use cases translate into something like this:

The user's point of view
The user points a browser to a Pentaho web application.
The user chooses a "solution", which can be seen as a collection of folders.
Each folder contains one or more "reports" for a specific subject area. These reports can be one of the following:
  • JPivot pages, which let the user navigate data on a cube. The user can drill down/up, pivot (swap axes) and export the data to PDF and Excel.
  • JFreeReport pages, where the user enters one or more parameters to filter the results and then gets a report in HTML, Excel or PDF format.
  • Dashboard pages, which combine gauges, maps, charts and tables in order to show multiple results together.

The developer's point of view

- Action Sequences
In order to create these "solutions", or in other words the screens from which the user will access the reports, the developer must create a set of XML documents. These XML documents define action sequences, which are used by the Pentaho engine (or framework, if you prefer) to call back-end services and components (e.g. run a report, generate a PDF).
To help in creating these XML documents, there is Design Studio, a GUI tool based on Eclipse.

Some of the available actions and their corresponding components are:

- JFreeReport
In order to create a report, the developer must specify the SQL query (or MDX query, if the data source is an OLAP server) that fetches the report's data, and the report definition, which can be created with a GUI tool such as the Report Design Wizard, or the Report Designer for reports with more complex layouts.

- Kettle
This action lets you call a Kettle transformation or job, and is a good example of how Kettle can be embedded in an application.

- JPivot
The developer can create a JPivot page using a PivotViewComponent. Apart from the XML that specifies the action sequence, he/she must write the MDX query that will fetch the data from the OLAP server.

- Dashboards
To create a dashboard, one combines actions and components that produce charts and reports. This is done through a JSP page that the developer must create.

Monday, June 4, 2007

Another approach for reporting: A Data Warehouse System

The motivation for a data warehouse system

Why would we want to build a data warehouse system? We might consider doing this for some of the following reasons:
  • It is difficult to get the information that users need out of the current system. For example, in order to find the number of encounters between given dates, categorized by provider and/or encounter type, and maybe see the percentage of change of this measure between time periods, one would have to be an SQL expert and write huge, complex SQL queries (see the sketch after this list). What's more, we would have to do that for every single question that a user might ask.
  • It takes a lot of time for the system to return the results of queries (reports). When someone executes a query that uses many joins or aggregations (sum, avg etc.) on a database that contains hundreds of thousands or millions of records, it takes a long time for the DBMS to return a result. This is because the data model that is usually used for transaction processing applications is a normalized one (3NF), which is optimized for collecting data and maintaining their integrity and consistency, not for querying and analyzing huge amounts of data.
  • We want to use external data sources to help us analyze our current data. For example, we could use statistical or demographic data for a region and combine that with OpenMRS data. More specifically, relate patient age groups to general information for that age group, find the ratio of health providers to population, compare death causes in the general population to death causes at a clinic etc.
  • We want to keep historical information. For example, a patient's attributes might change over time; concepts, drugs and diseases might change their names or their category names; and it could be useful to take these changes into account when we analyze our data over time.
  • We want to include information in our reports that is produced from complex domain logic. For example, we could associate patient data with rules (as described in the Logic Service Project) by running the domain logic once during the ETL process of the data warehouse. This information would then be easily accessible to a report writer.
  • We want to handle data cleaning and quality checking before creating reports. It is possible that some fields of the source database have missing data or values out of range. A data warehouse typically handles this through the ETL process, or by marking the offending rows using an audit dimension.
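To make the first point concrete, even the simple question "how many encounters per provider and encounter type in a given month" already needs something like the rough sketch below against the normalized schema (assuming the encounter and encounter_type tables and their usual columns); the period-over-period comparison would repeat this as a subquery or self-join and grow accordingly:

    SELECT e.provider_id,
           et.name  AS encounter_type,
           COUNT(*) AS encounters
    FROM   encounter e
    JOIN   encounter_type et ON et.encounter_type_id = e.encounter_type
    WHERE  e.voided = 0
      AND  e.encounter_datetime >= '2007-05-01'
      AND  e.encounter_datetime <  '2007-06-01'
    GROUP BY e.provider_id, et.name;
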
An overview of the data warehouse

How can the above requirements be met? What are the main components of such a system?

The approach chosen in this project is to build a data warehouse. Here is the definition given by R. Kimball in "The Data Warehouse ETL Toolkit":

"A data warehouse is a system that extracts, cleans, conforms and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making."

The data warehouse keeps a copy of transaction and other source data in a separate database that is designed using dimensional models (or star schemas). The goals of this design are simplicity/user understandability and query performance. Many tools (OLAP servers, query tools, report writers, dashboards etc.) require this design and let users access the needed information more easily.

From a user's point of view, the data warehouse is used to watch the performance of an organization and its activities through time. For example: how many encounters did we have this month, and how does that compare to the previous month? From a developer's point of view, it deals with a large number of rows at a time to get a result to the user, as opposed to operational systems, which mostly deal with a few rows (or objects, if you prefer) at a time.

Some of the general use cases that the data warehouse is good at:
  • Summarizing/aggregating data, categorized and grouped in every way possible (slicing and dicing)
  • Navigating/exploring data at various levels of detail (drilling up/down)
  • Tracking changes to interesting entities/attributes (slowly changing dimensions)
  • Using a query/report tool to quickly create reports (instead of building your own)
  • Serving as the foundation for decision support systems and data mining/statistical applications
Some of the components of the system are:
  • The data sources (OpenMRS database and maybe other external data).
  • The ETL process. The ETL (Extract, Transform, Load) process extracts data from the data sources, ensures data quality, integrates the different sources, modifies the structure of the data so it can be loaded into star schemas and generally does all the necessary preparation to make the data ready for querying. The ETL process is usually developed with the help of a tool (Kettle in our case).
  • The relational implementation (star schemas). This may be a separate schema on the same DBMS or on a different physical server. It includes the dimensionally modeled tables that are the target of the ETL process (a small sketch follows this list).
  • OLAP cubes. There are OLAP (Online Analytical Processing) servers (Mondrian in our case) that offer better performance (using aggregates and special indexes) and are better suited for analysis using specialized languages (e.g. MDX instead of SQL). The dimensional tables are the source for the OLAP cubes.
  • Data access tools. These include: canned, prebuilt, parameter-driven reports/data exports accessible through the OpenMRS web application, ad hoc query tools (a desktop tool or another web application, e.g. Pentaho BI) and sophisticated analysis tools (e.g. SAS).
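As a purely illustrative sketch of the relational implementation, an encounter fact table and one of its dimensions could look like this in MySQL (the table and column names are hypothetical; the real design comes out of the star schema step of the plan below):

    CREATE TABLE date_dim (
        date_key      INT PRIMARY KEY, -- surrogate key, e.g. 20070614
        full_date     DATE,
        month_name    VARCHAR(10),
        calendar_year SMALLINT
    );

    CREATE TABLE encounter_fact (
        date_key           INT NOT NULL, -- references date_dim
        provider_key       INT NOT NULL, -- references a provider dimension
        encounter_type_key INT NOT NULL, -- references an encounter type dimension
        patient_key        INT NOT NULL, -- references a patient dimension
        encounter_count    INT NOT NULL DEFAULT 1
    );

    -- A typical query then becomes a simple join plus GROUP BY:
    SELECT d.calendar_year, d.month_name, f.encounter_type_key, SUM(f.encounter_count)
    FROM   encounter_fact f
    JOIN   date_dim d ON d.date_key = f.date_key
    GROUP BY d.calendar_year, d.month_name, f.encounter_type_key;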

The plan of the project
  • Define reporting/analysis use cases for OpenMRS that would benefit most from a data warehouse system. In this step we will define the needs of the users and try to categorize them into use cases and/or specific reports.
  • Design star schemas, using the requirements/use cases and the operational data model as input. In this step we will design star schemas (fact tables + dimensions) that can satisfy the requirements.
  • Develop the ETL functionality. The ETL functionality will be implemented within a module or through various scripts that run in a separate, automated and scheduled process. It will be used to load the star schemas and to implement data transformations in a generalized way for the operational system. This step will be the most code-intensive part of the project.
  • Implement the data warehouse using a relational implementation (MySQL). In this step the necessary physical database objects will be created and loaded with data by the ETL process.
  • Implement the data warehouse using an OLAP implementation (Mondrian + MySQL). In this step Mondrian will be deployed and used for querying the database that was built in the previous step.
  • Data access tools. Various open source tools will be evaluated (starting with Pentaho) for report writing. It may also be necessary to custom-code some needed reports.
The goal of the project is to have, by the end of the summer, a system integrated with OpenMRS that delivers visible results for users. In order to achieve this, the steps outlined above will be followed iteratively.