picture by Sir.I via Flickr
We often perform database reverse engineering as part of our consulting work. We have found that it’s common for databases to contain derived data. Derived data is data that can be computed from other base data. Often, the storing of derived data is a mistake and it would have been better if developers instead computed on the fly.
Consider a bank account. Money is repeatedly deposited and withdrawn. Logically, one could compute the account balance by tallying transactions over time. However, the balance is such a fundamental piece of data that it makes sense to store it. Furthermore, many bank procedures involve account balances and it could be expensive to compute the balance each time it is needed. This example has good reasons for storing derived data.
Now consider a sales organization. Sales can be summarized at various organization levels, for different geographies, by product categories, and so forth. A developer could define attributes to store summary data, but in most cases it would be better to calculate summaries as needed. One reason is that there can be an arbitrary number of summaries. Also it’s messy to keep stored derived data in sync, if the underlying data is revised.
Derived Data Trade-offs
One advantage of storing derived data is faster reading. There is no calculation and values are simply accessed. Stored data also simplifies database queries; this is important for complex analytical queries.
However there are major downsides of storing derived data. Database writes are slower as base and derived data must be kept in sync. More importantly, development is more complex. There is a serious risk of bugs slipping into the database.
Derived data differs from database normalization, but our warning about being careful is in the same spirit. The whole premise of normalization is to store data once. Multiple copies increase the potential for error. You’re more likely to store data just once, if you forego derived data.
Analytical vs. Operating Applications
Many data warehouses have derived data. This simplifies the schema so that business users can write queries. The derived data in a warehouse does complicate writing. But the writing complexity is manageable as warehouses are populated by tightly controlled ETL scripts. End users do not update the warehouse and do not see update complexity.
In contrast, careless use of derived data can ruin a day-to-day operational application. End users concurrently read and write to the database from various interfaces. In addition, applications have ongoing maintenance and revisions that disrupt the code base. In these situations, derived data is just asking for trouble. Often the desired performance boost is illusory while the quality detriment is severe.
There are occasional exceptions where derived data makes sense for operational applications. Bank accounts are such an example.
The bottom line is that, except for data warehouses, you should store derived data sparingly. Instead store only fundamental, base data. It’s difficult to keep stored derived data in sync with base data. Any changes to base data must be propagated. If application logic misses updates, then you have bugs waiting to become manifest. Also it’s easier to extend applications as you have fewer side effects to consider.
Our experience is that it’s usually not worth the hassle and risk of storing derived data. Instead, computing on the fly often delivers adequate performance, while reducing development work. The calculation efficiency of the SQL engines on the market (such as SQL Server, Oracle, and DB2) is often quite good.
Sometimes it’s helpful to wrap a view around a table. The view definition can include derived data calculations. Then applications and interfaces can access views for a consistent implementation.