Data Warehouse Development

Data Warehouse Data Flow

The traditional approach to data warehouses is to use ETL scripts to convert source data to warehouse data. This approach has limitations. One issue is that it is costly to write ETL code, test the ETL code, and deliver it. The whole process is also slow, especially in responding to changing business requirements. Furthermore, ETL code is opaque – business users cannot directly see the logic and have to trust that ETL developers have prepared the proper data processing. There is a better way for building a data warehouse.

We use a faster, more flexible approach that we call agile analytics. The idea is to receive business requirements and still model the data. Then we quickly convert source data to warehouse data using SQL code. Carefully formatted SQL is partially self-documenting – skilled business analysts can see the underlying logic, detect errors, and suggest improvements. Since SQL can be readily changed, requirements can be quickly iterated-upon and fine-tuned. We can determine which warehouse structures are most useful to the business without a lengthy build. Business payoff can prioritize development. It is roughly 10 times faster to write SQL than to perform traditional ETL.

Once we determine the proper data representation and processing, the SQL can be hardened and converted to traditional ETL code or kept as SQL. In principle, core dimensions and facts that are especially important and heavily used are candidates for hardening. Peripheral objects can be left as SQL. Our experience is that about 25% of warehouse objects are best to harden. The other 75% can be left as SQL. The net result is that the combination of ETL + SQL code can speed warehouse development by a factor of 2.

See our O’Reilly video course for more information about data warehouse development.