“Big Data” has become a popular buzzword. Big Data is data so massive that it is difficult to manage. For example, the volume of search engine queries, online retail sales, and Twitter messages exceeds the capabilities of traditional databases.
There’s a complement to “Big Data” that we call “Big Schema”. Today’s data can not only arrive in vast quantities and at fast rates, but can also have diverse structure. Big Schema can arise with enterprise data models, large data warehouses, and scientific data.
Enterprise Data Models
An enterprise data model (EDM) describes the essence of an organization. It abstracts multiple apps, combining and reconciling their content. EDMs serve many purposes, such as integrating app data, driving consistency across apps, documenting enterprise scope, finding functional gaps and overlaps, and providing a vision for future apps. When you consider that many enterprises have dozens of apps, the size of the schema can be large.
The UK financial software vendor Avelo has been using an EDM to coordinate and integrate apps. Avelo was formed by the merger of four predecessor companies, so its apps align poorly. They have different abstractions, naming approaches, and development styles. As a result, it was difficult to construct an EDM.
We limited the scope of Avelo’s EDM to cope with the schema size. We started by seeding the EDM via rapid reverse engineering. We browsed each app’s schema to find core concepts – the tables with the most foreign key connections – and used only the top ten concepts. Business experts helped us reconcile the concepts to create a high-level EDM.
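As a rough illustration of ranking tables by foreign key connections, here is a minimal Python sketch. It is not the tooling used at Avelo. The function name `rank_core_tables`, the `(referencing table, referenced table)` input format, and the sample table names are all hypothetical, and counting both ends of each foreign key is only one reasonable reading of "most connections".

```python
from collections import Counter


def rank_core_tables(foreign_keys, top_n=10):
    """Rank tables by how many foreign key connections touch them.

    foreign_keys is an iterable of (referencing_table, referenced_table)
    pairs. This input shape is an assumption made for the sketch.
    """
    degree = Counter()
    for child, parent in foreign_keys:
        degree[child] += 1   # the table that holds the foreign key
        degree[parent] += 1  # the table that the foreign key points to
    # Highest connection count first; ties are broken alphabetically
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical foreign keys harvested from one app's schema
sample_foreign_keys = [
    ("policy", "client"),
    ("policy", "product"),
    ("claim", "policy"),
    ("payment", "policy"),
    ("address", "client"),
]

core = rank_core_tables(sample_foreign_keys, top_n=3)
print(core)
```

In practice the pairs would come from each database's catalog, and the ranked list would only be a starting point for the discussion with business experts.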
Large Data Warehouses
Data warehouses can also involve Big Schema. A data warehouse combines data from day-to-day operational apps and places it on a common basis for analysis and reporting. A large enterprise can have much data to analyze, leading to many data warehouse tables.
We can’t do much to restrain the size of a large data warehouse. But by using agile data modeling, we can make sure that payoff occurs incrementally as the warehouse is constructed.
We recently worked on a large data warehouse encompassing multiple departments that illustrates both good and bad approaches. One department focused on building its portion of the warehouse and deferred usage. After many months of work, it is still building. Another department chose to build incrementally according to business demand. This latter approach has been more successful, and continued funding for it has been easier to justify.
Scientific Data
Scientific data is a third source of Big Schema. Scientific apps have much complexity, such as time series, complex data types, and deep dependencies and constraints. A scientific schema is often not only large but also difficult to represent.
Many years ago we worked on the PDXI project sponsored by the AIChE. The purpose of PDXI was to produce a data model to serve as the basis for a data exchange standard for chemical engineering apps. Chemical plants have a wide variety of equipment, complex mixtures of substances, and a range of operating conditions, so there is much data to represent. The PDXI model ran to several hundred pages. This was too much to manage, too much to explain, and too much to understand.
In retrospect, we now realize that we should have used more generic data structures. For example, the PDXI model had fifty pages for equipment, such as tanks, reactors, pumps, and distillation columns. A better model would have avoided all this detail by combining data and metadata. Then the fine particulars of each kind of equipment could have been specified elsewhere.
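To make the idea of combining data and metadata concrete, here is a minimal Python sketch. It is not the PDXI model. The class names (`AttributeDef`, `EquipmentType`, `Equipment`), the attributes, and the units are all hypothetical. The point is that one small generic structure can describe any kind of equipment, while the particulars of pumps or tanks live in metadata instead of in separate model pages.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeDef:
    """Metadata: one attribute that a kind of equipment may carry."""
    name: str
    unit: str


@dataclass
class EquipmentType:
    """Metadata: a kind of equipment, such as a pump or a tank."""
    name: str
    attributes: dict = field(default_factory=dict)  # attribute name -> AttributeDef

    def define(self, name, unit):
        self.attributes[name] = AttributeDef(name, unit)


@dataclass
class Equipment:
    """Data: one piece of equipment, checked against its type's metadata."""
    tag: str
    equipment_type: EquipmentType
    values: dict = field(default_factory=dict)  # attribute name -> value

    def set_value(self, name, value):
        if name not in self.equipment_type.attributes:
            raise KeyError(f"{self.equipment_type.name} has no attribute {name!r}")
        self.values[name] = value


# Hypothetical equipment kinds, specified as metadata instead of as model classes
pump = EquipmentType("pump")
pump.define("flow_rate", "m3/h")
tank = EquipmentType("tank")
tank.define("volume", "m3")

# Hypothetical equipment item
p101 = Equipment("P-101", pump)
p101.set_value("flow_rate", 12.5)
```

Adding a new kind of equipment then means adding metadata rows, not new model structures. The trade-off is that constraints once visible in the schema must be enforced by code or by further metadata.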
So when you build applications, think not only about Big Data, but also about Big Schema. Where there is Big Data there is often Big Schema. And Big Schema can also arise by itself.