mysqldatabasedatabase-design

Should I avoid duplicate entries in a database?


I'm designing a database that has a table called projects(pro_id,pro_name) and a table called categories(cat_id,cat_name). Project names will certainly be unique. A given project may have a number of categories associated with it and there will be a third table called procat(procat_id,pro_id,cat_id) to do this. There could be several projects that have categories with the same name. So projects 1 and 7 might both have a category called "Bathroom".

Should my categories table reflect just 1 entry for "Bathroom" or 2 separate entries with 2 ids?

These categories and what they do in the database will be different from each other. Project 1's bathroom may have one set of data associated with it (color=blue, size=400, etc) while Proect 7's bathroom may have different data associated with it (all recorded in different tables).

I'm leaning towards duplicate entries in the categories table, but I don't want to miss something that will cause problems later.


Solution

  • Whether values are duplicated in a column is determined by what the rows in the table mean and what possible table values can arise according to the business rules. Eg: Suppose you decide that every thing of some sort has its own unique id among things of that sort. Then ids will be unique in a table that has a row for each things with a given properties. But suppose rows in a table contain id & colour values that make "thing ID comes in colour COLOUR" into a true statement. Then if things come in only one colour then colour will be unique but if things can come in multiple colours then column colour will not be unique. Although id-colour pairs will be unique.

    You as DBA must decide on such a meaning or predicate for each table that determines that table's rows in a given situation. Ie a statement template parameterized by columns, where the table holds rows that make the template into a true statement.

    Your collection of tables must have sufficient predicates to be able to describe any situation that arises, while you don't want rows in a table or tables to be stating the same thing. Determine good designs by learning & applying an information modeling method & normalization.

    The thing is, there could be several projects that have categories with the same name.

    Make sure you clearly separate two distinct notions and don't call both of thm "category": a thing that exists separately in its own project (which is how you used "category" in the quote) vs a thing that is a kind of one of those, with its own property like a name, with multiple projects possibly being associated the same kind, but with other properties on a project-kind basis. It seems that you picked the word "category" as a synonym for kind (as is department) so maybe this is what you really mean. Ie there is a kind/category of room named "bathroom" and different projects have the same kind/category of room but different project-kind/category pairs, ie rooms out there in the world, have different sets of properties & different values for the same property.

    My question is should my categories table reflect just 1 entry for "Bathroom" or 2 separate entries with 2 ids?

    Suppose you have unique category & project ids. Consider table c(id, name) with predicate "category ID is called by name NAME". If a category is a thing associated with only has one name, id will be unique. Otherwise a thing can be associated with multiple names so id-name subrows will be unique. You have said that different categories can have the same name. On the other hand from what you have said a table for pc(pid, cid, name) "in project PID category CID is called by name CNAME" then since projects don't share categories CID is unique, although PID is not. Note that in this case c's predicate "category ID is called by name NAME" also means "for some project PID, in project PID category CID is called by name NAME". So c is just select cid as id from pc. So you don't need c if you have pc.

    On the other hand if you decided that category ids are not unique but instead categories are uniquely identified by a project-category id pair then in c name will not be unique and in cp PID-CID pairs will be unique but CID & PID won't be.