database-designsizevarcharnvarchar

What size to pick for a (n)varchar column?


In a slightly heated discussion on TDWTF a question arose about the size of varchar columns in a DB.

For example, take a field that contains the name of a person (just name, no surname). It's quite easy to see that it will not be very long. Most people have names with less than 10 characters, and few are those above 20. If you would make your column, say, varchar(50), it would definately hold all the names you would ever encounter.

However for most DBMS it makes no difference in size or speed whether you make a varchar(50) or a varchar(255).

So why do people try to make their columns as small as possible? I understand that in some case you might indeed want to place a limit on the length of the string, but mostly that's not so. And a wider margin will only be beneficial if there is a rare case of a person with an extremely long name.


Added: People want references to the statement about "no difference in size or speed". OK. Here they are:

For MSSQL: http://msdn.microsoft.com/en-us/library/ms176089.aspx

The storage size is the actual length of data entered + 2 bytes.

For MySQL: http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html

L + 1 bytes if column values require 0 – 255 bytes, L + 2 bytes if values may require more than 255 bytes

I cannot find documentation for Oracle and I have not worked with other DBMS. But I have no reason to believe it is any different there.


Solution

  • I can only speak for Oracle. A VARCHAR2(50) and a VARCHAR2(255) take up exactly the same amount of space and perform identically, if you enter the value 'SMITH'.

    However, the reason why it is generally not a good idea to go around declaring all your textual columns as VARCHAR2(4000) is that column length is, effectively, another constraint. And constraints are database implementation of business rules, so they are definitely something that should be defined on the database side of things.

    As a for-example. You define a CHECK constraint on a column so that the values it can accept are only 'Y' and 'N'. That saves your application from having to deal with 'y' and 'n' or even '1' and '0'. The check constraint ensures your data conforms to expected standards. Your application code can then make valid assumptions about the nature of the data it has to deal with.

    Column length definition is in the same boat. You declare something to be a VARCHAR2(10) because you don't want it accepting an entry of 'ABC123ZYX456' (for whatever reason!)

    In Australia, I define STATE columns to be a varchar2(3) because I don't want people typing in 'New South Wales' or 'South Australia'. The column definition pretty much forces them to be entered as 'NSW' and 'SA'. In that sense, a VARCHAR2(3) is almost as much a check constraint as actually specifying a CHECK IN ('NSW','SA','VIC' etc) constraint.

    In short, proper column lengths are a way of encoding business rules. They're another form of constraint. They bring all the advantages of constraints (and suffer from many of the same drawbacks). And they ensure, to a small extent, a degree of 'data cleanliness' that "proper" constraints help with, too.

    I don't buy the argument, either, that it's best to stick these sorts of things in the client app because it's easier to change there. You have 20,000 people using an app, that's 20,000 updates. You have one database, that's one update. The 'easier to change the client app' argument, if true, would potentially mean the database just gets treated as a giant bit bucket with all the clever logic being handled in client code. It's a big discussion to have, but since all RDBMSes let you define constraints and so on in the database itself, it's pretty clear that there's at least a worthwhile case to be made that such fundamental logic belongs in the backend.