I'm starting to work with Netezza on Aginity to handle millions of records per table. I've been reading up online on how to optimize tables so that loading the data and running queries on it will be relatively quick and I have a few questions.
1) One thing I read was that the distribution you pick when creating a table will impact how fast you can query the table. For the most part, is it best to distribute on the primary key? And let's say there are no primary keys but instead there are foreign keys in a table, is it good to distribute on both foreign keys or just one?
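For context, this is the kind of DDL I mean (the table and column names here are made up; the DISTRIBUTE ON clause is standard Netezza syntax):

```sql
-- Hypothetical large fact table, distributed on the column it is
-- most often joined on (customer_id is an invented name):
CREATE TABLE sales_fact (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      NUMERIC(12,2)
)
DISTRIBUTE ON (customer_id);

-- Hypothetical small dimension table, spread round-robin:
CREATE TABLE small_dim (
    dim_id   INTEGER,
    dim_name VARCHAR(100)
)
DISTRIBUTE ON RANDOM;
```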
2) I read that you can see the skew of your table (which I guess depends a lot on your distribution) by running this query:
SELECT COUNT(*), datasliceid FROM SCHEMA.TableName GROUP BY datasliceid ORDER BY datasliceid
In doing this, I noticed that the minimum number of rows per datasliceid was 21,530 and the max was 22,456, across the 240 datasliceids that were returned. Is this good, or should the data be distributed much more evenly?
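In case it helps, here is a variant of that query that boils the spread down to one row (same SCHEMA.TableName placeholder; datasliceid is Netezza's built-in virtual column, and the PostgreSQL-style `::FLOAT` cast is what I'd try to avoid integer division):

```sql
-- Summarize per-dataslice row counts into min/max and a skew ratio.
SELECT MIN(cnt) AS min_rows,
       MAX(cnt) AS max_rows,
       MAX(cnt)::FLOAT / MIN(cnt) AS skew_ratio  -- close to 1.0 = even
FROM (
    SELECT datasliceid, COUNT(*) AS cnt
    FROM SCHEMA.TableName
    GROUP BY datasliceid
) per_slice;
```

With the numbers above, 22,456 / 21,530 works out to a ratio of about 1.04.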
3) Last but not least, I'm trying to run:
generate statistics on SCHEMA.TABLENAME
inside of Aginity, but when I do, I just get the message "The command completed successfully" and nothing else shows up. How do I view this information?
I'll give it a try:

1) Only worry about distribution for REALLY large tables (go RANDOM on all the small/medium ones). Apart from that: only distribute on one column, and only if you join on it a LOT.

2) That distribution is not skewed in my opinion. Don't worry.

3) When you update statistics in a database, it's primarily for the benefit of the database optimizer, so it can make better choices when you run (complex) SQL against it. You can, however, read that info back from the catalog tables afterwards. Queries for that can be found online :)
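For 3), something along these lines is a starting point. Treat the exact view and column names as assumptions on my part, since the catalog views vary a bit between Netezza releases; check your version's system view documentation:

```sql
-- Distribution key column(s) of a table (view name assumed to exist):
SELECT tablename, distattnum, attname
FROM _v_table_dist_map
WHERE tablename = 'TABLENAME';

-- Column-level catalog info for the table; after GENERATE STATISTICS
-- the optimizer's per-column estimates should be refreshed behind
-- views like this one (view and column names assumed):
SELECT attname, format_type
FROM _v_relation_column
WHERE name = 'TABLENAME';
```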