Single fixed table with multiple columns vs flexible abstract tables (data-modeling)

Accepted answer
Score: 90

Certain issues need to be clarified and resolved before we can enter into a reasonable discussion.

Pre-requisite Resolution

  1. Labels
    In a profession that demands precision, it is important that we use precise labels, to avoid confusion, and so that we can communicate without having to use long-winded descriptions and qualifiers.
    .
    What you have posted as FixedTables is Unnormalised. Fair enough, it may be an attempt at Third Normal Form, but in fact it is a flat file: Unnormalised (not "denormalised"). What you have posted as AbstractTables is, to be precise, Entity-Attribute-Value, which is almost, but not quite, Sixth Normal Form, and is therefore more Normalised than 3NF. Assuming it is done correctly, of course.

    • The Unnormalised flat file is not "denormalised". It is chock full of duplication (nothing has been done to remove repeating groups and duplicate columns, or to resolve dependencies) and Nulls; it is a performance hog in many ways, and it prevents concurrency.

    • In order to be Denormalised, it has to first be Normalised, and then the Normalisation backed off a little for some good reason. Since it is not Normalised in the first place, it cannot be Denormalised. It is simply Unnormalised.

    • It cannot be said to be denormalised "for performance", because, being a performance hog, it is the very antithesis of performance. Well, they need a justification for the lack of formalised design, and "for performance" is it. Even the smallest formal scrutiny exposes the misrepresentation (but very few people can provide that, so it remains hidden, until they get an outsider to address, you guessed it, the massive performance problem).

    • Normalised structures perform far better than Unnormalised structures. More Normalised structures (EAV/6NF) perform better than less Normalised structures (3NF/5NF).

    • I am agreeing with the thrust of OMG Ponies, but not their labels and definitions.

    • Rather than saying 'don't "denormalise" unless you have to', I am saying, 'Normalise faithfully, period' and 'if there is a performance problem, you have not Normalised correctly'.
      .
  2. Wiki
    The entries re Normal Forms and Normalisation are a complete joke. Specifically, the definitions are incorrect; they confuse the Normal Forms; they are clueless re the process of Normalisation; and they give equal weight to absurd or questionable NFs which have been debunked long ago. The result is, Wiki adds to an already confused and rarely understood subject. So don't waste your time.
    .
    However, in order to progress without that reference posing a hindrance, let me say this.

    • The definition of 3NF is stable, and has not changed.
    • There is a lot of confusion of the NFs between 3NF and 5NF. The truth is that this is an area that progressed over the last 15 years; and many orgs and academics, as well as vendors with their limited products, jumped to create a new "Normal Form" to validate their offerings. All serving commercial interests, and academically unsound. 3NF in its original untampered state intended, and guaranteed, certain attributes.
    • The sum total is: 5NF today is what 3NF was intended to be 15 years ago, and you can skip the commercial banter and the twelve or so "special" (commercial and pseudo-academic) NFs in-between, some of which are identified in Wiki, and even that in confusing terms.
      .
  3. Since you have been able to understand and implement the EAV in your post, you will have no problem understanding the following. Of course a true Relational Model is a pre-requisite: strong keys, etc. Fifth Normal Form (since we are skipping the Fourth) is:

    • Third Normal Form
      • which in simple definitive terms is, every non-key column in every table has a 1::1 relationship to the Primary Key of the table,
      • and to no other non-key columns
    • Zero data duplication (the result, if Normalisation is progressed diligently; not achieved by intelligence or experience alone, or by working toward it as a goal sans the formal process)
    • no Update Anomalies (when you update a column somewhere, you do not have to update the same column located somewhere else; the column exists in one and only one place).
      .
  4. Sixth Normal Form is of course Fifth Normal Form, plus:

    • Elimination of missing data (columns). This is the one true solution to the Null Problem (also called Handling Missing Values), and the result is a database without Nulls. (It can be done at 5NF with standards and Null substitutes but that is not optimal.) How you interpret and display the missing values is another story.
      .
  5. EAV vs Sixth Normal Form
    All the databases I have written, except one, are pure 5NF. I have worked with (administered, fixed up, enhanced) a couple of EAV databases, and I have implemented one true 6NF database. EAV is a loose implementation of 6NF, often done by people who do not have a good grasp on Normalisation and the NFs, but who can see the value in, and need the flexibility of, EAV. You are a perfect example. The difference is this: because it is loose, and because implementors do not have a reference (6NF) to be faithful to, they only implement what they need, and they write it all in code; that ends up being an inconsistent model.
    .
    Whereas, a pure 6NF implementation does have a pure academic reference point, and thus it is usually tighter, and consistent. Typically this shows up in two visible elements:
    • 6NF has a catalogue to contain metadata, and everything is defined in metadata, not code. EAV does not have one, everything is in code (implementers keep track of the objects and attributes). Obviously a catalogue eases the addition of columns, navigation, and allows utilities to be formed.
    • 6NF when understood, provides the true solution to The Null Problem. EAV implementers, since they are absent the 6NF context, handle missing data in code, inconsistently, or worse, allow Nulls in the database. 6NF implementers disallow Nulls, and handle missing Data consistently and elegantly, without requiring code constructs (for Null handling; you still have to code for missing data of course).
      .
      Eg. For 6NF databases with a catalogue, I have a set of procs that will [re]generate the SQL required to perform all SELECTs, and I provide Views in 5NF for all users, so they do not need to know or understand the underlying 6NF structure. They are driven off the catalogue. Thus changes are easy and automated. EAV types do that manually, due to the absence of the catalogue.
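
Before the discussion, a minimal sketch in SQL of the contrast described in points 1, 4 and 5. All table, column and type names here are illustrative assumptions, not taken from the post.

-- EAV: one generic triple table; the attributes exist only in code.
CREATE TABLE ObjectValue (
    object_id  INT          NOT NULL,
    field_id   INT          NOT NULL,
    value      VARCHAR(255) NOT NULL,
    PRIMARY KEY (object_id, field_id)
);

-- 6NF: one narrow table per attribute. There are no Nullable columns;
-- a missing value is simply an absent row.
CREATE TABLE Restaurant (
    restaurant_id INT NOT NULL PRIMARY KEY
);
CREATE TABLE RestaurantName (
    restaurant_id INT NOT NULL PRIMARY KEY REFERENCES Restaurant (restaurant_id),
    name          VARCHAR(60) NOT NULL
);
CREATE TABLE RestaurantAddress (
    restaurant_id INT NOT NULL PRIMARY KEY REFERENCES Restaurant (restaurant_id),
    address       VARCHAR(120) NOT NULL
);

-- The catalogue: attributes are defined in metadata, not code, so
-- SELECTs and 5NF Views can be generated from it.
CREATE TABLE Catalogue (
    entity    VARCHAR(30) NOT NULL,
    attribute VARCHAR(30) NOT NULL,
    data_type VARCHAR(30) NOT NULL,
    PRIMARY KEY (entity, attribute)
);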

Now, we can start the

Discussion

"Of course it can be more abstract if value's are predefined (Example: specialities could have their own list)"

Sure. But do not get too "abstract". Maintain consistency, and implement such lists in the same EAV (or 6NF) manner as other lists.
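
For example, a minimal sketch (continuing the hypothetical 6NF tables above): a predefined list such as specialities is just another pair of tables with Declarative Referential Integrity, not a hard-coded enumeration.

-- A value list kept in the same manner as any other list.
CREATE TABLE Speciality (
    speciality_id INT         NOT NULL PRIMARY KEY,
    name          VARCHAR(30) NOT NULL UNIQUE
);

-- Associative table: which restaurant has which speciality.
CREATE TABLE RestaurantSpeciality (
    restaurant_id INT NOT NULL REFERENCES Restaurant (restaurant_id),
    speciality_id INT NOT NULL REFERENCES Speciality (speciality_id),
    PRIMARY KEY (restaurant_id, speciality_id)
);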

"If I take the abstract approach it can be very flexible, but queries will be more complex with a lot of joins. But I don't know if this affects the performance, executing these 'more complex' queries."

  1. Joins are pedestrian in Relational databases. The problem is not the database, the problem is that SQL is cumbersome when handling joins, especially compound keys.
  2. EAV and 6NF databases have more Joins, which are just as pedestrian, no more, no less. If you have to code each SELECT manually, sure, the cumbersome gets really cumbersome.
  3. The entire problem can be eliminated by (a) going with 6NF over EAV and (b) implementing a catalogue, from which you can (c) generate all the basic SQL (see the sketch after this list). Eliminates an entire class of errors as well.
  4. It is a common myth that Joins somehow have a cost. Totally false. The join is implemented at compile time; there is nothing of substance to 'cost' CPU cycles. The issue is the size of the tables being joined, not the cost of the Join between those same tables. Joining two tables with millions of rows each, on a correct PK⇢FK relation, each of which has the appropriate indices (Unique on the parent [PK] side; Unique on the Child side), is instantaneous; where the Child index is not unique, but at least the leading columns are valid, it is slower; where there is no useful index, of course it is very slow. None of it has to do with Join cost. Where many rows are returned, the bottleneck will be the network and the disk layout, not the join processing.
  5. Therefore you can get as "complex" as you like, there is no cost, SQL can handle it.
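
To make points 3 and 4 concrete, here is a hedged sketch of the kind of SELECT a catalogue would generate, against the hypothetical 6NF tables above; the joins are mechanical equi-joins on the key, resolved via unique indices.

-- Generated 5NF view of the 6NF structure.
SELECT  r.restaurant_id,
        rn.name,
        ra.address
FROM      Restaurant        r
JOIN      RestaurantName    rn ON rn.restaurant_id = r.restaurant_id
LEFT JOIN RestaurantAddress ra ON ra.restaurant_id = r.restaurant_id;
-- LEFT JOIN where the attribute may be missing: an absent row, not a Null column.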

"I would be interested to know what are the up and downsides of both methods. I can just imagine for myself, but I don't have the experience to confirm this."

  1. 5NF (or 3NF for those who have not made the progression) is the easiest and best, in terms of implementation; ease of use (developers as well as users); and maintenance. The drawback is, every time you add a column, you have to change the database structure (table DDL). That is fine in some cases, but not in most cases, due to the change control in place, which is quite onerous. Second, you have to change existing code (code handling the new column does not count, because that is an imperative): where good standards are implemented, that is minimised; where they are absent, the scope is unpredictable.

  2. EAV (which is what you have posted) allows columns to be added without DDL changes (see the sketch after this list). That is the single reason people choose it. (Code handling the new column does not count, because that is an imperative.) If implemented well, it will not affect existing code; if not, it will. But you need EAV-capable developers. When EAV is implemented badly, it is abominable, a worse mess than 5NF done badly, but not any worse than Unnormalised, which is what most databases out there are (misrepresented as "Denormalised for performance"). Of course, it is even more important (than in 5NF/3NF) to hold a strong Transaction context, because the columns are far more distributed. Likewise, it is essential to retain Declarative Referential Integrity: the messes I have seen were due in large part to the developers removing DRI because it became "too hard to maintain"; the result was, as you can imagine, one mother of a data heap, with duplicate 3NF/5NF rows and columns all over the place. And inconsistent Null handling.

  3. There is no difference in performance, assuming that the server has been reasonably configured for the intended purpose. (Ok, there are specific optimisations that are possible only in 6NF, which are not possible in other NFs, but I think that is outside the scope of this thread.) And again, EAV done badly can cause unnecessary bottlenecks, but no more so than Unnormalised.

  4. Of course, if you go with EAV, I am recommending more formality; buy the full quid; go with 6NF; implement a catalogue; utilities to produce SQL; Views; handle Missing Data consistently; eliminate Nulls altogether. This reduces your vulnerability to the quality of your developers; they can forget about the EAV/6NF esoteric issues, use Views, and concentrate on the app logic.
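
As a hedged illustration of the difference in points 1 and 2 (continuing the hypothetical tables from the sketches above; all names are assumptions): adding an attribute in 5NF is a DDL change, while in 6NF it is a catalogue entry plus a new narrow table, and in EAV it is just rows.

-- 5NF: the entity is one wide table (here called RestaurantWide for
-- illustration); a new attribute is a structure (DDL) change, subject
-- to change control and migration.
CREATE TABLE RestaurantWide (
    restaurant_id INT         NOT NULL PRIMARY KEY,
    name          VARCHAR(60) NOT NULL
);
ALTER TABLE RestaurantWide ADD COLUMN opening_hours VARCHAR(60);

-- 6NF: register the attribute in the catalogue and add its narrow table;
-- existing tables and code are untouched, and the Views are regenerated.
INSERT INTO Catalogue (entity, attribute, data_type)
VALUES ('Restaurant', 'opening_hours', 'VARCHAR(60)');
CREATE TABLE RestaurantOpeningHours (
    restaurant_id INT NOT NULL PRIMARY KEY REFERENCES Restaurant (restaurant_id),
    opening_hours VARCHAR(60) NOT NULL
);

-- EAV: no DDL at all; the new attribute is only rows (field_id 42 is an assumed id).
INSERT INTO ObjectValue (object_id, field_id, value)
VALUES (1, 42, 'Mon-Fri 9:00-17:00');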

Pardon the long post.

Score: 9

In your question, you have presented at least two major issues at the same time. Those two issues are E-A-V and gen-spec.

First, let's talk about E-A-V. Your last table (object_id, field_id, value) is essentially an E-A-V. There is an upside to E-A-V and a downside to E-A-V. The upside is that the structure is so generic that it can accommodate almost any body of data describing almost any subject matter. That means that you can proceed to design and implementation with no data analysis and no understanding of the subject matter, and not worry about wrong assumptions. The downside is that at retrieval time, you have to do the data analysis that you skipped over before building the database, in order to come up with queries that mean anything. This is much more serious than just retrieval efficiency. But you are also going to have terrible problems with retrieval efficiency. There are only two ways to learn about this pitfall: live through it or read about it from those who have. I recommend the reading.
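
A hedged sketch of that retrieval problem, using the (object_id, field_id, value) table from the question (the table name and field ids here are assumptions for illustration): reassembling a single logical row takes one self-join per attribute, and every value comes back as an untyped string.

-- Hypothetical: field_id 1 = name, field_id 2 = address.
SELECT  n.object_id,
        n.value AS name,
        a.value AS address
FROM      ObjectValue n
LEFT JOIN ObjectValue a
       ON  a.object_id = n.object_id
       AND a.field_id  = 2
WHERE   n.field_id = 1;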

Second, you have a gen-spec case. Your table (object_id, type_id) captures a gen-spec (generalization-specialization) pattern, along with the related tables. If I had to generalize between hotels and restaurants, I might call it something like "public accommodations" or "venues". But I'm not sure I understand your case, and you may be driving for something even more general than those two names suggest. After all, you've included "events" in your list, and an event is not a type of venue in my mind.

I've referred other people to readings on gen-spec and the relational model in previous responses.
When two tables are very similar, when should they be combined?

But I hesitate to send you off in the same direction, because it's not clear to me that you want to come up with a relational model of the data before building your database. A relational model of a body of data and an E-A-V model of the same data are almost totally at odds with each other. It seems to me you have to make that choice before you even explore how to express gen-spec in the relational model of data.

Score: 3

When you start to require a large number of different entities (or even before...), a NoSQL solution would be vastly simpler than either choice. Just store each entity/record with the exact fields you require.

{
   "id": 1,
   "type":"Restaurant",
   "name":"Messy Joe",
   "address":"1 Main St.",
   "tags":["asian","fusion","casual"]
}
Score: 2

The "abstract" approach is better known 5 as "Normalization", looks like 3rd Normal 4 Form (3NF).

The other one is called "Denormalized", and can be a valid performance option... when you've encountered speed issues using the Normalized approach, not before.

Score: 1

How do you have the listings represented in code? I'd guess Listing as a supertype, with Shop, Restaurant, etc. as subtypes?

Assuming so, this is a case of how to map subtypes to a relational database. There are generally three choices:

  • Option 1: single table per subtype, with common attributes repeated in each table (name, id, etc).
  • Option 2: single table for all objects (your single table approach)
  • Option 3: table for the supertype and one for each subtype

There's no universally correct solution. My preference is generally to start with option 3; it provides an intuitive structure to work with, is pretty well normalised, and can easily be extended. It means a single join for retrieving each instance - but RDBMSs are well optimised for doing joins, so it doesn't really cause performance problems in practice.

Option 1 can be more performant for queries (no joins) but causes problems if other tables need to refer to all supertype instances (proliferation of foreign keys).

Option 2 appears at first sight to be the most performant, although there are 2 caveats: (1) It's not resilient to change. If you add a new subtype (and so different attributes) you'll need to change the table structure and migrate it. (2) It can be less efficient than it seems. Because the table population is sparse, some DBs don't store it particularly efficiently. As a consequence it can be less efficient than option 3, since the query engine can do joins faster than it can search bloated, sparse table spaces.
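
A minimal sketch of option 3, with hypothetical names (Listing as the supertype, as guessed above):

-- Supertype table holds the shared key and the common attributes.
CREATE TABLE Listing (
    listing_id INT          NOT NULL PRIMARY KEY,
    name       VARCHAR(60)  NOT NULL,
    address    VARCHAR(120) NOT NULL
);

-- One table per subtype, keyed by the supertype's key.
CREATE TABLE Shop (
    listing_id    INT         NOT NULL PRIMARY KEY REFERENCES Listing (listing_id),
    opening_hours VARCHAR(60) NOT NULL
);
CREATE TABLE Restaurant (
    listing_id INT         NOT NULL PRIMARY KEY REFERENCES Listing (listing_id),
    cuisine    VARCHAR(30) NOT NULL
);

-- Retrieving an instance is the single join mentioned above.
SELECT l.listing_id, l.name, l.address, r.cuisine
FROM   Listing    l
JOIN   Restaurant r ON r.listing_id = l.listing_id;

Other tables can then carry a single foreign key to Listing, which avoids the "proliferation of foreign keys" problem that option 1 runs into.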

Which to choose really comes down to knowing the details of your problem. I'd suggest reading up a bit on the options: this article is a good place to start.

hth
