picture by strange little woman on stream via Flickr
A many-to-many relationship is an intersection of two entities. A person may own stock in multiple companies. A company can have multiple stockholders. The combination of a person and a company stock has an attribute of the number of shares owned. A many-to-many relationship depends for its existence on the underlying entities.
Relational databases and SQL have robust support for many-to-many relationships. Referential integrity keeps relationships consistent with their underlying entities and SQL can efficiently combine them in queries. In contrast, programming languages lack support for keeping references consistent with referents and providing easy traversal. As a result programmers have to use workarounds to handle many-to-many relationships in their code. Sometimes they make the bad decision to contaminate a database with their programming representation.
Here are some of the problems that we have encountered.
Give a Many-to-Many Relationship Identity of its Own
Consider the following figure. The most natural representation is shown at the top. A many-to-many relationship simply combines the identifiers of its underlying entities and uses them as the primary key. This is the representation that most database developers use.
The middle model is more cumbersome from a database perspective but still correct. The stock_holding_ID is a single field that identifies individual records for programming convenience. The model maintains the dependence on the underlying entities and enforces that the entities combination is unique. This model meets both database and programming needs.
The bottom model is the incorrect design that we too often see. It satisfies programmers but has lost an important constraint – that the combination of Person and Company is unique. The bottom model permits multiple records for a combination of Person and Company which was not the intent of the Stock_holding relationship.
Combine the Primary Key of one Entity with a Sequence Number
Here’s another botched approach to many-to-many relationships that we sometimes find. The relationship identifier combines the primary key of one entity with an arbitrary sequence number.
There’s much to dislike about this approach. There’s an undesirable asymmetry – there is no reason to favor person as the basis for identity rather than company. Again there is no uniqueness constraint on the combination of person_id and company_id. The sequence number adds bulk to the model and database but no meaning. Briefly stated, there’s no advantage to this model and many drawbacks.
Use Parallel References
Another bad idea is to use parallel references. Programming often involves collections and some programmers try to force fit them into a database. They don’t realize how they can naturally obtain collections with SQL.
An obvious problem is that the model handles at most three stock holdings for a person. We could add more references, but the model still has an arbitrary upper limit. The model is a mess if we want to compute all the persons owning a particular stock. For example, we have to check company1_ID for GE stock, company2_ID for GE stock, and company3_ID for GE stock. The SQL is lengthy, error prone, and performs poorly.
Other Misguided Approaches
We believe there are additional misguided techniques. We would like to hear comments from readers.
We did not invent the approaches presented here. Rather we found them when performing database reverse engineering. We’ve learned to never underestimate the perverse creativity of some developers. If a mistake is possible, most certainly someone will make it.