Thursday, June 14, 2007

Pentaho Evaluation for OpenMRS

This post lists some of the Pentaho tools, how they can be used to support building a data warehouse system (a reporting database) and what is required for one to know in order to create reports and analyze data.

- ETL (Kettle)
Kettle can be used in OpenMRS in two ways:
  1. To load the data warehouse tables
  2. On the fly transformations and exports of data
1. For the moment, our only data source is the OpenMRS database. For that reason we can use SQL to extract the needed data. We also, need a way to extract only the changed data. That means we should fetch only records that are new or updated since the last successful loading of the DW (since the last time that the transformation ran without error). We can use the audit columns in the source database (date_created, date_changed) and the logging feature of Kettle to handle this. Kettle lets you specify a table that logs the details of tranformation execution. This table contains fields to log the number of records read from the source database, the number of records loaded in the DW and and the timestamp of the last successful transformation execution. The last feature can be used to publish reports (a web page for example) with information on the loading process of the DW.
Most transformations can also be handled with SQL. For example we will use SQL to filter results, join tables, handle NULL values, map abbreviated field values to user-friendly values, converting from one data type to another, formatting, map continuous values to pre-defined bands and generally prepare the data for querying.
After the previous steps are done, we will use Kettle to handle surrogate keys (these keys are used as primary keys in the DW and are different from the source system keys), when loading dimension tables. Moreover, Kettle will be used to merge the extracted data in the DW. A mapping should be defined between the input fields to the DW fields and the records that changed in the source system will be updated in the DW and the new records will be inserted.

Depending on the requirements Kettle can be used, in the future, to handle "slowly changing dimensions", integrate external data and execute other processes (jobs).

2. Kettle makes it easy to export a dataset in various formats as Excel, Access, delimeted files e.t.c.

After preparing the data for querying, we need query tools to write reports and generally explore and analyze the data. These should be able to present the query results in tables, crosstabs and charts. The formats that should generally be supported are: HTML, pdf and Excel. The user should be able to access the reports through web applications, portals, desktop clients (e.g. Excel) and/or email.

In Pentaho the above use cases translate into something like this:

The user`s point of view
The user uses a browser that points to a Pentaho web application.
The user chooses a "solution" which can be seen as a collection of folders.
Each folder contains one or more "reports" for a specific subject area. These reports can be one of the following:
  • JPivot pages, that let the user navigate data on a cube. The user can drill down/up, pivot (swap axes) and export the data in pdf and excel.
  • JFreeReport pages, where the user enters one or more parameters to filter the results and then gets a report in HTML, Excel or pdf format.

  • Dashboard pages, which combine gauges, maps, charts and tables in order to to show multiple results together.


The developer`s point of view

- Action Sequences
In order to create these "solutions", or in other words the screens from which the user will access the reports, the developer must create a set of XML documents. These XML documents define action sequences, which are used by the Pentaho engine (or framework if you prefer) to call back-end services and components (e.g. run report, genenerate PDF).
To help in creating these XML documents, there is Design Studio, which is a GUI tool based on Eclipse.

Some of the available actions and their corresponding components are:

- JFreeReport
In order to create a report, the developer must specify the SQL query (or MDX query if the data source is an OLAP server) to fetch the report`s data and the report definition, which can be created through a GUI tool, like the Report Design Wizard or Report Designer for reports with more complex layouts.

- Kettle
This action lets you call a Kettle tranformation or job, which is a good example of how Kettle can be embedded in an application.

- JPivot
The developer can create a JPivot page using a PivotViewComponent. Apart from the XML that specifies the action sequence, he/she must write the MDX query that will fetch the data from the OLAP server.

- Dashboards
To create a dashboard one should combine actions and components that produce charts and reports. This is done through a JSP page that the developer must create.

No comments: