picture by nerovlvo via Flickr
Relational databases are often advertised as easy to use – that’s true for a single table. However a database of hundreds or thousands of interconnected tables is complex and difficult to use. It is not helpful for marketers to set such misleading expectations – all it does is lead to frustration when projects overrun on cost and time and sometimes fail.
Databases are an advanced topic at universities. First students have to learn about programming. Then they learn about data structures. And only then can they learn about databases. Furthermore, there are two aspects to databases – first understanding what they are and where they are useful and second understanding how to create and design them. Thus the teaching of use of databases has several prerequisites. Many curriculum overemphasize programming and slight databases.
In practice, there is a chasm between programmers and database practitioners. Most programming languages are imperative in contrast to the declarative query language of databases. The two fields have their own style, culture, and terminology. They both have complexity but this complexity occurs in different ways. Unfamiliarity often causes programmers to underestimate databases.
With large applications programming complexity comes from preparing many lines of code, coordinating multiple programmers, and mastering constituent libraries and subsystems. It is difficult to write programming code that is maintainable, extensible, and free from errors. In contrast, it is less difficult to build a database for an application. Rather database complexity comes from coordinating multiple applications and their databases. The same data multiply appears and there is a business need to keep it consistent. In addition data is long lived so an error in database logic is more insidious than a programming error.
The SQL language used to define and access databases has many subtleties, including:
- Referential integrity
- Transaction logic
- Stored procedures
- Access control
Further evidence of database complexity is provided by reverse engineering. Reverse engineering is the process of taking existing software artifacts (such as database schema, database code, and documentation) and working backwards – to understand the underlying intent as well as assess the quality of the database. Years ago I wrote two papers with extensive data – A Retrospective on Industrial Database Reverse Engineering Projects, Part 1 and Part 2, 8th working conference on reverse engineering, Stuttgart Germany, 2001.
These papers presented statistics that were gleaned from 35 databases that I reverse engineered during a decade of work. 14 databases were from in-house development efforts and 21 were from commercial products. The data shows the actual database technology being practiced, rather than the textbook ideal.
Using a typical academic grading scale of ‘A’ to ‘F’ (‘A’ is superior and ‘F’ is failing), the papers found that underlying database models and the database design, on average, graded about a ‘C’ A grade of ‘C’ means that the database has major flaws that are difficult to fix. Examples of flaws included missing primary key definitions, haphazard indexing, mismatching data types for foreign keys, and data modeling errors. The flaws are important because they lead to bugs, slow performance, and difficult maintenance. Clearly, database modeling and design are not easy or the grades would be higher.
Relational databases (the most common kind of database) consist of a network of interconnected tables. A foreign key in one table refers to a primary key in another. Robert Hillard has likened database connectivity to a graph and used graph analysis to estimate database complexity. I discuss Hillard’s work in my recent book – UML Database Modeling Workbook.
Hillard measures two aspects of complexity.
- Average number of connections for a table. To compute a table’s connections, add its foreign key count to the number of times it is referenced by other tables. Tables with many connections offer more choices about how to write queries and hence more complexity.
- Average traversal length. To compute a table’s traversal path, take each possible table pairing and compute the shortest foreign-key-to-primary-key path between them. A long traversal implies more code to write for queries and hence more complexity.
Thus graph theory lets you measure the complexity of a database. It is worth rethinking a database design to try to reduce the number of tables as well as its complexity score.
It helps to be aware of the complexity of databases when building applications. You can better allow for the difficulty in project plans estimating cost, personnel, and duration. Also your project will be more likely to deliver useful software.