Data sources, staging area and data targets
Data sources: The only data source for the moment is the OpenMRS database.
Staging area: This refers to an intermediate area between the source database and the DW database. This is where the extracted data from the source systems are stored and manipulated through transformations. At this time, there is no need for a sophisticated staging area, other than a few independent tables (called orphans), which are stored in the DW database.
Data Targets: The DW database. This is the database that will be used for querying and data extraction, and that serves as the foundation for OLAP cubes.
Extraction
The extraction of the data is done using SQL queries on the OpenMRS database. These SQL scripts are mostly responsible for flattening the normalized data model, handling NULL values (which are generally avoided in the DW), and performing simple conversions (e.g. boolean values to textual descriptions).
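For illustration, a minimal sketch of such an extraction query; the OpenMRS table and column names used here (person, patient_identifier, person_name) are assumptions and should be verified against the actual schema:

    -- Flatten patient data, replace NULLs and convert the boolean "dead" flag to text
    SELECT pi.identifier                                      AS patient_identifier,
           COALESCE(pn.given_name, 'Unknown')                 AS given_name,
           COALESCE(pn.family_name, 'Unknown')                AS family_name,
           COALESCE(p.gender, 'Unknown')                      AS gender,
           CASE WHEN p.dead = 1 THEN 'Dead' ELSE 'Alive' END  AS vital_status
    FROM   person p
           JOIN      patient_identifier pi ON pi.patient_id = p.person_id
           LEFT JOIN person_name        pn ON pn.person_id  = p.person_id;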
The extraction process will extract only new and changed records. This is supported by the audit columns in most of the source tables. When the audit column values are not properly maintained (e.g. there are NULL values), we can use a MINUS set operator to compare the natural keys of the dimension in the data warehouse against the natural keys in the source system table. MINUS is not supported in MySQL, so we'll use NOT EXISTS or an equivalent construct to achieve the same result.
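For example, the natural key comparison with NOT EXISTS could be sketched as follows (dim_patient and its patient_identifier column are hypothetical names for the patient dimension in the DW):

    -- Source records whose natural key is not yet present in the dimension
    SELECT pi.patient_id, pi.identifier
    FROM   patient_identifier pi
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dim_patient d
                       WHERE  d.patient_identifier = pi.identifier);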
Cleaning and conforming
Although this step is where most of the transformations are needed, we'll mostly rely on the features of the ETL tool (Kettle).
During this step, data cleaning, data quality checks and data integration are performed. This means that individual columns are checked for valid data (e.g. ranges for numeric values) and business rules are enforced. Also, data from different sources are integrated. There should also be a policy established for error handling.
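Purely as an illustration of the kind of check this step involves, a range check on numeric observations could look like the query below; the obs table comes from OpenMRS, but the accepted range is a made-up assumption:

    -- Flag numeric observations that fall outside an assumed plausible range
    SELECT o.obs_id, o.value_numeric
    FROM   obs o
    WHERE  o.value_numeric IS NOT NULL
      AND (o.value_numeric < 0 OR o.value_numeric > 1000);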
For this project these processes will be kept to a minimum. It is assumed that the OpenMRS system implements proper validation logic, so we won't try to duplicate it. We use Kettle's auditing features for auditing and reporting on the ETL process. There are two tables, one for transformation executions and one for job executions.
The errors we will face will mostly be database (MySQL) errors, probably caused by unexpected data, and these should be handled manually by an administrator. Another approach would be to trap these errors and keep error event tables that log the problematic records.
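If we chose the second approach, a minimal error event table might look like the sketch below; the table and column names are hypothetical, not part of the current design:

    CREATE TABLE etl_error_event (
        error_event_id INT AUTO_INCREMENT PRIMARY KEY,
        etl_run_id     INT          NOT NULL,  -- execution that produced the error
        source_table   VARCHAR(100) NOT NULL,  -- source table of the problematic record
        source_key     VARCHAR(100) NOT NULL,  -- natural key of the problematic record
        error_message  TEXT,                   -- MySQL / Kettle error text
        created_at     DATETIME     NOT NULL
    );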
In addition, it is assumed that the data will be loaded from a single OpenMRS instance, so we don't take into account issues that might come up when integrating data from multiple databases.
Delivering Dimension Tables
At this step the dimension tables are loaded with the new and changed data. A dimension table consists of: a primary key (a meaningless integer key, i.e. a surrogate key), which is used to join to fact tables; a natural key, based on fields of the source system, that doesn't change over time or across systems (e.g. for a patient dimension we use the patient identifier as the natural key, not the patient id); and descriptive attributes.
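A minimal sketch of such a dimension table follows; dim_patient and its columns are illustrative assumptions, not the final design:

    CREATE TABLE dim_patient (
        patient_key        INT AUTO_INCREMENT PRIMARY KEY,  -- surrogate key, used to join to fact tables
        patient_identifier VARCHAR(50) NOT NULL,            -- natural key from the source system
        gender             VARCHAR(10),                     -- descriptive attributes
        birthdate          DATE,
        vital_status       VARCHAR(10),
        UNIQUE KEY uk_dim_patient_natural (patient_identifier)
    );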
We also need to handle changes to dimension attributes. There are three approaches:
1. Overwrite the existing dimension record (type 1 slowly changing dimension)
In this case we use update-else-insert (merge) functionality.
2. A new dimension record is created with a new surrogate key, but the same natural key (type 2 slowly changing dimensions).
If a dimension contains attributes that are handled this way, we'll add two fields: a row effective datetime (the exact datetime of the change) and a row end datetime (the exact datetime of the next change). This way we know the interval during which a record was valid.
3. Two columns are created for each attribute whose changes we want to track (type 3 slowly changing dimensions). The first column contains the most recent value and the second column contains the previous value.
For the moment Type 1 is used.
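In Kettle this is typically handled with an Insert/Update step; in plain MySQL the same update-else-insert behaviour could be sketched as below, assuming the hypothetical dim_patient table above with a unique index on the natural key:

    -- Type 1: overwrite the descriptive attributes of an existing dimension record
    INSERT INTO dim_patient (patient_identifier, gender, birthdate, vital_status)
    VALUES ('1234-5', 'F', '1980-01-01', 'Alive')
    ON DUPLICATE KEY UPDATE
        gender       = VALUES(gender),
        birthdate    = VALUES(birthdate),
        vital_status = VALUES(vital_status);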
Other Issues:
- Generating surrogate keys
- Date Dimension
- Dimensional Roles
- Structure
- Referential Integrity
- Surrogate Key Pipeline
- Preparing the fact tables for loading
- Incremental Loading
- Aggregations
- OLAP Cubes
To do
Scheduling
To do