In a blog last year we discussed database archaeology, which is another name for database reverse engineering. Reverse engineering is the inverse to normal development. We start with an application and work backwards to understand the software and infer its content.
This month we’ll take a further look at database reverse engineering, from the perspective of a simple case study. We’ll reverse engineer the database beneath WordPress and populated with a snapshot of the data for this website. The case study illustrates mechanics and the kinds of insights that reverse engineering can provide.
WordPress is free and open-source software that makes it easy to create blogs and simple websites. WordPress is a framework in that it generalizes the process of building a website and parameterizes content. WordPress has an underlying MySQL database and data model that we will investigate.
Accessing WordPress Schema and Data
The first step was to access the host for the superdataguy website. We logged in, clicked on phpMyAdmin, and then clicked on Export. This downloaded a localhost.sql file with WordPress schema populated by data for the superdataguy website.
The localhost file is unformatted and effectively unreadable. We finessed this problem by importing the file into a local MySQL database. We then exported schema only from the local database. We manually edited the schema to delete incidental MySQL keywords that ERwin did not recognize and then reverse engineered the schema. Here is the initial ERwin model for WordPress tables. (Plugins can have additional tables that we do not consider.)
We’ve found that it’s often helpful to consider record counts. The counts indicate tables that are heavily used, tables that may hold descriptor data, and tables that are not used at all. All counts are for the data of the superdataguy website.
Note that the WordPress database has only twelve tables. There’s even less schema once we realize that the four ‘meta’ tables elaborate the corresponding base tables. This is striking and an architectural insight into WordPress. Clearly WordPress stores only some data in the database. For example, GUI details such as fonts and graphics are programmed into themes and not stored in the database. Images are stored as files.
The ‘posts’ table stores pages and blog posts. The table stores not only current posts, but also has a record for each previous version that was saved. The ‘comments’ table stores website comments – some meaningful and others yet to be deleted spam. The ‘term’ tables store categories for finding posts. Superdataguy currently has one defined user that is stored in the ‘users’ table.
Our next step was to add foreign keys. The WordPress schema does not define foreign keys. However, we can manually infer foreign keys based on name similarity and study of the database structure. We confirmed our foreign key suppositions with data analysis. Data analysis does not prove a supposition, but it does increase the likelihood. Here’s a typical query checking for foreign-key-to-primary-key dependency.
SELECT * FROM wp_fqir_commentmeta WHERE comment_ID NOT IN
(SELECT comment_ID FROM wp_fqir_comments);
We conjecture that wp_fqir_posts.post_author refers to wp_fqir_users.ID though we would need more data to confirm as there is only one ‘users’ record. The attribute wp_fqir_comments.user_id refers to wp_fqir_users.ID if the user is known; otherwise user_id is set to 0. The wp_fqir_term_relationships table is a many-to-many relationship between posts and assigned terms.
Here’s a revised ERwin model.
We learned a lot from this simple case study.
We found that WordPress has a very small database (apart from the additions of plug-ins). There are only twelve tables. This surprised us as we had expected 30-40 tables based on the breadth of WordPress functionality. Some WordPress functionality (such as GUI settings and images) are not in the database accounting for its small size.
WordPress lacks referential integrity (not a good idea) but we did not find any data errors. Clearly WordPress compensates with careful programming to avoid dangling data references. This matches our impression of WordPress as high-quality software.
WordPress has a clean naming style and a simple structure so it was easy to figure out the foreign keys. However, foreign keys can be difficult to determine for complex databases.