There are an "official benchmark" or a simple rule of thumb to decide when space or performance will be affected?
My table have many simple and indexed fields,
CREATE TABLE t (
id serial PRIMARY KEY,
name varchar(250) NOT NULL,
...
xcontent xml, -- the NULL use disk space?? cut performance?
...
UNIQUE(name)
);
and it is a kind of "sparse content", many xcontent
values will be NULL... So, these XML NULLs consumes some disk space?
I can normalize, the table t
now will be nt
,
CREATE TABLE nt (
id serial PRIMARY KEY,
name varchar(250) NOT NULL,
...
UNIQUE(name)
);
CREATE TABLE nt2 (
t_id int REFERENCES nt(id),
xcontent xml NOT NULL
);
CREATE VIEW nt_full AS
SELECT nt.*, nt2.xcontnt FROM nt LEFT JOIN nt2 ON id=t_id;
So, I need this complexity? this new table arrange will consume less disk spacess. The use of
SELECT id, name FROM nt WHERE name>'john'; -- Q1A
SELECT id, name FROM nt_full WHERE name>'john'; -- Q1B
SELECT id, name FROM t WHERE name>'john'; -- Q1C
SELECT id, xcontent FROM nt_full WHERE name>'john'; -- Q2A
SELECT id, xcontent FROM t WHERE name>'john'; -- Q2B
So, in theory, all the performances of Q1A vs Q1B vs Q1C will be the same?
And Q2A vs Q2B?
The answer to the question "how much space does a null value take" is: no space at all - at least not in the "data" area.
For each nullable column in the table there is one bit in the row header that marks the column value as null (or not null). So the "space" that the null values takes is already present in the row header - regardless whether the column is null or not.
Thus the null "value" does not occupy any space in the data block storing the row.
This is documented in the manual: http://www.postgresql.org/docs/current/static/storage-page-layout.html
Postgres will not store long string values (xml, varchar, text, json, ...) in the actual data block if it exceeds a certain threshold (about 2000 bytes). If the value is longer than that, it will be stored in a special storage area "away" from your actual data. So splitting up the table into two tables with a 1:1 relationship doesn't really by you that much. Unless you are storing a lot of rows (hundreds of millions), I doubt you will be able to notice the difference - but this also depends on your usage patterns.
The data that is stored "out-of-line" is also automatically compressed.
Details about this can be found in the manual: http://www.postgresql.org/docs/current/static/storage-toast.html
One reason why the separate table might be an advantage is the necessary "vacuum" cleanup. If you update the XML data a lot but the rest of the table hardly ever changes, then splitting this up in two tables might improve the overall performance because "XML table" will need less "maintenance" and the "main" table won't be changed at all.