IBM PureData-Netezza Developer Network (NDN)
DeepashriKrishnaraja 270001C7Y3 Tags:  join broadcast plan jit execution database netezza generate statistics performance nzadmin optimizer 3 Comments 9,048 Views
Imagine that you have embarked on a journey and all that you know is the destination that you have to reach with no clue of the directions to reach it..At this point, your savior is the GPS that will give you the updated directions which will aid you in reaching the destination in the shortest possible route.. What GPS does to directions is what statistics will do to the Netezza optimizer…Confused??
Let me explain…The Netezza Optimizer which is the primary reason for the enhanced performance of Netezza internally creates an execution plan for any query you wish to execute. The main input to this Optimizer along with the actual query is the statistics. If you have the right updated statistics your optimizer will create the most optimal query execution plan resulting in improved performance. Now connecting our example above, statistics is like your GPS which gives you the right directions (optimal query plan) resulting in high performance (destination).
In this context many questions come to one’s mind. What are statistics? Types of statistics and who generates them? What do statistics contain? When and how are the statistics generated? Let me answer them one by one…
In the literary sense, database statistics are basically dynamic metadata which assist the query optimizer in making better decisions. User generated statistics are the most efficient form of statistics which is obtained by explicitly executing the nzsql GENERATE STATISTICS command or through NzAdmin GUI Database View . GENERATE STATISTICS reads every row in the table to determine the total number of rows in the table plus each table column’s maximum and minimum values,dispersion values, duplicate values, null values and unique values. It provides the most accurate and highest quality statistics enabling the Optimizer to choose the best execution plan from all possible plans. Statistics can be generated on all the columns for all the tables of the database(GENERATE STATISTICS),on all the columns for a specific table(GENERATE STATISTICS ON <table>) or on a set of columns for a specific table(GENERATE STATISTICS ON <table> (<col>, <col>,…).Stats are updated in system catalog tables.
Some can and cannot's you should keep in mind :
Along with the user, Netezza database has its own ways of generating and maintaining database and dispersion statistics when certain database operations are performed. Of course, these automatic statistics are not as accurate as the ones generated by GENERATE STATISTICS command, however when you perform some database operations like CREATE TABLE AS,INSERT and UPDATE Operations,GROOM TABLE /nzreclaim, TRUNCATE TABLE you can be a little relieved even if you have forgotten to give the GENERATE STATISTICS command as the database has automatically generated it for you. Don’t be surprised to see the output of GENERATE STATISTICS even if you have not run it, as Netezza has the capability of running this for small tables with fewer rows automatically.
Well, the Netezza optimizer cannot be left behind. The optimizer creates Just In Time (JIT) statistics on the fly during planning using sample scan functionality and zone map information. Based on estimations , JIT statistics improve selectivity performance for tables with data skew or complex column restrictions,and also avoid broadcast of large tables that were estimated to be small.
What are the other conditions when the Optimizer automatically runs JIT statistics?
– Tables that contain more than five million records
– Queries that contain at least one column restriction
– Restrictions that do not contain sub-query or sub-plan expressions
– Tables that participate in a join or have an associated materialized view
You might be wondering, that with JIT collecting statistics so intensively why do you need to run GENERATE STATISTICS?... JIT Stats do not compute dispersion values which are utmost necessary to compute JOIN selectivity and also do not run on system tables, external tables, or virtual tables. GENERATE STATISTICS on the other hand, runs on any kind of table along with the usual database user table. Always remember, if you don't have the updated directions to your destination chances are higher that you make the wrong choices..similarly to gain the highest performance you need the best query plan and that is only possible with the most accurate user generated statistics.
Here are some tips on when you should generate statistics which ensure you high performance :
At the end of the day, use the right statistics and enjoy the Netezza performance boost!!!
DeepashriKrishnaraja 270001C7Y3 Tags:  collocated key netezza distribution slice skew redistribute unique ampp data distribute random performance join response hash time 6,727 Views
Performance, thy name is Netezza. One of the key factors which contribute to this aspect is the intelligent data distribution owing to the AMPP architecture of Netezza. This efficient data distribution is determined by a good distribution key.
First of all why do we need an efficient data distribution? When the system creates records, it assigns them to a logical data slice based on their distribution key value. When you create a table and then load the data into the system, the rows of the table will be distributed among all the data slices. If some data slices have more rows of a table than others, the data slices with more data and the S-Blades that manage them have to work harder, longer, and need more resources and time to complete their jobs. These data slices and the S-Blades that manage them become a performance bottleneck for your queries.
Netezza system follows two distribution methods, hash and random (round robin method).Distribution key comes into picture in the hash distribution method. For every table, the distribution key is hashed to a numerical value. All rows with the same hash value are distributed to the same data slice.
Now consider the below example of a table for students of a class
CREATE TABLE sample (idnum integer, gender Boolean, Student-Name varchar(50)) distribute on gender;
Here the rows corresponding to gender column can have only 2 values - Male(M) or Female(F).So the system would map these rows with same values to the same hash values and in turn to the same data slices. So here we would end up distributing rows to only two data slices resulting in a data skew.
Now consider another example of a table for the new-joinees of a company with their joining date
CREATE TABLE sample1 (idnum integer, joiningdate date, Varchar(50) name) distribute on joiningdate;
Say 60% of the new-joinees joined in the month of June. With DATE as distribution key, rows may be distributed evenly across all S-Blades. However, most analysis (queries) is performed on a date range. In this example, all of the records to be processed for a given date range(June) are located on a single or a few data slices resulting in processing skew on that data slice.
Response time of a query is affected by the completion time for all of the data slices. In both the above cases, even though some data slices would have finished processing their data, the others with data and processing skews increase the response time thus degrading the performance.
So now the question is how do I choose a good distribution key? Here are some tips
You might ask me that with so many conditions, why do I need a distribution based on distribution key? Why not distribute using the RANDOM distribution method? It easily distributes data evenly across data slices using a random algorithm and hurray my objective is achieved..Well behold!!! We have a problem. What would be the situation when we have 2 tables joined on a common key with data distributed across several different data slices…Can you imagine the response time when the query which does the join will have to fetch data from these hundreds of data slices!! This is the catch in the random distribution method where an efficient distribution key in the hash distribution method gains the upper hand.
Collocated joins are the next front-runners for increased performance. For tables that have a relationship and are commonly joined (dimension tables and fact table of a star schema for example) always use the join keys as distribution keys. This will result in Collocated Joins where data being joined from both tables will reside in the same data slice thus reducing the data fetch cycles leading to higher degrees of performance.
In some cases it may not be possible to distribute both tables on the relationship key. If your database design is such that only one of the tables can be distributed by the join key you still don’t need to worry. Netezza will redistribute the needed columns of the other table to the suitable data slices to enable a collocated join. This process named as single redistribute will have a performance cost associated with it but is way ahead better than random distribution.
Few more hash distribution methods which boost performance in joins are double redistribute( when both tables cannot be distributed on join key) and broadcast (small tables are broadcasted to large tables when both joined tables have different distribution keys) which again rely that your distribution key is just the right one…Remember!! If your foundation is not right, your building cannot stand up tight.
Here are some good practices which one should never forget when setting the right distribution key.
Many a times we come across situations where in we want to accomplish a task for which there is an already built in function or a procedure. Most of the database vendors provide numerous such functions and procedures to achieve such tasks. But there are also occasions that none of the existing functions meet the actual goal that one desires. In such cases a user ends up in writing a function. Herein this article, I try to highlight as to how one can come up with his/her own C++ function which can be invoked using the Netezza SQL.
A user can create custom functions, aggregates, and shared libraries using the user-defined extensions feature. The custom functions can be run on the Netezza Performance Server (NPS) systems and used to perform specific types of analysis for business reporting and data queries. These objects leverage the massively parallel processing (MPP) environment to accelerate analysis of data. User-defined functions enable data processing directly on the system, leading to elimination or reduction of data movement to other systems for analysis, which ensures performance gains.
In other words, User defined Extensions (UDX) enhances Netezza functionality by enabling a user to write a C++ coded modules which can be invoked using the SQL queries.
There different types of UDX’s supported by Netezza are –
A typical UDX development cycle will include –
Let’s take a look on how UDF is created and used. UDF takes 0 or more input parameters and returns one output value. It is called once for every row and can be used in all SQL queries where built in functions are supported.
How to create the C++ file for UDF?
The file must include the header file udxinc.h in addition to any standard header file. The UDX classes and functions for API version 2 are defined in a namespace called nz:udx_ver2.
using namespace nz::udx_ver2;
Create a class object derived from the udf base class. The code will look something like this :
virtual nz::udx_ver2::ReturnValue evaluate()
// Function Logic
The instantiate method creates the object dynamically.
This CPP file will have to be compiled using the nzudxcompile utility.
The following object files are created
Once the UDF is compiled, it needs to be registered with the NPS system. The same utility can be used
Alternatively, the UDF can be registered using SQL, For ex –
These registered functions can be invoked like any function in a SQL statement.
Apoorv Kapse 270003FSX5 Tags:  connectivity for system odbc jdbc client oledb analytics puredata 3,349 Views
This post brings the insights on client connectivity for ‘PureData System for Analytics’ at one place. It does not cover the information on installation and configuration of client drivers, details of which can be found in:
Let us categorize the details as follows
Before moving on to the details, here’s a figure that shows the analogy between different types of drivers like ODBC, JDBC and OLEDB.
Coming back to the categories, let’s discuss in details
A session is a handshake protocol for negotiation between the client and server that starts with authentication. Session parameters like database, user, password, protocol etc, can be configured while creating the Data sources in these drivers or can be provided as part of connection string. There are few advanced options available which can be configured while initiating the connection. To take ODBC as example, refer below link
On the server (PureData System for Analytics) side, nzsession command can be used to view and manage sessions. Refer link below for details
Transactions refer to unit of work that operates on database-related objects, data or both.
Transactions provide the following benefits:
The transaction processing task contains five steps:
Client drivers provide explicit APIs to start/end transactions (e.g. SQLTransact, SQLEndTran), control on isolation level( e.g. SQLSetStmtAttr, SQLSetConnectAttr), query cancellation(e.g. SQLCancel), etc.
Once again, nzsession on server is handy to show and manage transactions.
Query execution from client driver involves multiple phases as listed below:
Minimal parsing is done is client drivers just to identify the type of query (parameter markers, multi-statement batch), normalization (white spaces, comments) and processing of escape sequences.
Preparation of statement involves processing of vendor specific escape clauses, translation based on server supported syntax, creating the access plan and getting some metadata information for result-sets (column information – types, lengths and others). These steps are relevant for SELECT queries and no-op for rest of the SQL.
This step deals with binding of data and actual execution of the query on the server. In case of multi-row operations (supported for INSERT/UPDATE/DELETE), column-wise or row-wise binding is done. This also involves any codepage conversion and fetching the result-sets (if generated) on client. Once the result-sets are available, it can be fetched in the application into bound variables.
Client drivers provides APIs to execute (e.g. SQLExecute, SQLExecDirect), fetch (e.g. SQLFetch, SQLFetchScroll), getting the row-counts (e.g. SQLRowCount) etc.
Everything is executed synchronously as there is wait/block for response from server. Also the server does not support for multiple queries to be active on a single session.
Metadata information refers to data about data. It describes the specific characteristics of an individual data item like table, column, stored procedure, etc which is usually stored in the system catalogs.
Client drivers provide APIs to retrieve metadata information, e.g.
There is no client API for creating/modifying the metadata. Server provides it’s own set of system views like _v_odbc_tables1/2/3, _v_jdbc_tables1/2/3 and _v_oledb_tables1/2/3 to be utilized by client having different version suffix for cross db and schema support.
As far as linguistics are concerned, client drivers provide full unicode support for queries and data with UTF8, UTF16 or UTF32 encoding though internally everything is in UTF8 (same as server encoding).
For locale specific encoding to unicode, drivers make use of ICU libraries (except JDBC).
Error messages are in english only which means that drivers are fully internalized but not localized.
Protocol can be talked about in two ways here
This refers to the custom protocol for initial negotiations while creating sessions.
This refers to protocol for any information to sent/received within a session. State transitions are part of it. It is partly inherited from postgres protocol and consists of two parts
Both the types of protocols are versioned for extensibility.
Hope this helps – questions/comments are welcome.
In it's simplest terms, a Zone Map is a persistent table maintained by the Netezza system that contains information about the data in a table created by a user. A single row entry in the Zone map table would correspond to an extent (Storage) of the User table. By default, Netezza would record information about integeral and temporal columns from the user's table in it's corresponding zone map. For each qualified column, Netezza would maintain the maximum and minimum value of the column that is stored in an extent (3MB).
Other than integral and temporal types, Netezza also maintains character column info (first 8 bytes) in zone maps if that column is mentioned in "order by" clause of a materalized view. If your table has hundreds of columns, remember that Netezza would maintain zone map info only for the first 200 qualified columns (its configurable).
The debug info would be written into the following virtual tables which you could query (thanks to Kapil Maheshwari for guiding me to the right tables) :
For older systems, this info could be accessed through the following virtual tables :
GauravSaxena 270003BV8Q 1,565 Views
Using Netezza ? If answer is yes, you have done the first step in choosing a powerful system for your analytics requirements. Speed, durability, resilience, time to value and most importantly “Simplicity” is built into it.
When you have such an appliance, you would want to take care of it more and use it in right way so as to continue benefitting from the power of Netezza. Although most of the things are automatically taken care of but still here are some tips that you might want to take care of in the following cases:
A) when everything is alright (will you not want it to remain that way):
Although, you might not want to worry about all of these points but my suggestion would be to at least follow points 6,7 and 8 as normally in Netezza type of system you must be having really big data.
B) when something you feel is wrong:
Well, should not happen but if it happens, your next actions really would be driven by the type of problem that you might encounter. More often than not, the best way is to contact IBM support if you think that you are following best practices and problem is directly related to netezza.
But yes, sometimes in my experience, problems that you encounter may be elsewhere and in that case it is always a good idea to review your analytics setup as a whole which mostly comprises of:
Sequencing of jobs,
Data Model in use,
SQL itself, and
All of these areas, play important role for performance and should be suspected when a problem related to performance occurs.
I will just focus on Netezza in this blog and my worry is that this blog is becoming too big so will try to wrap it up also. Key things in Netezza which should be focused on, are:
Affects auto vacuum which is run at frequent intervals by Nz and also my suggestion would be to do nz_manual_vacuum. In the case of nz_manual_vacuum not only the vacuum of data happens but also reindexing of the entire database happens. A regular vacuum does not and cannot do a reindex. That is why nz_manual_vacuum script is provided. Over a period of time the index files accumulate holes. A reindex cleans this up.
Check in QHD if you have one setup if the queries were running faster earlier or getting to know time window in which something looks wrong. Basically review the data present with you.
Also, at regular intervals, remove old data from the history database based on your need to keep the history
- nzhistcleanupdb -d histdb -t "<date>".
As mentioned earlier for few of the above, these are the main things that if not correctly used, affect query performance.
Review your Workload Management policy based on the users and groups that would want system access in a particular time window.
There might be some system variable and hardware related factors involved but one should not be reviewing/changing those as IBM support person needs to do that.
I will stop here for this topic – you might want to read my other blog for more information related to Nz:
santosh_hegde 060002755N Tags:  externaltable udf function transform user load etl extract defined 1 Comment 3,868 Views
Usually, for moving data into a Data Warehouse system like Netezza, there is an extract, transform and loading phase involved commonly referred to as ETL.
Extraction is extracting data from various transactional sources, transformation involves operations like computing new values based on some source value or translating coded values, selecting subset of data and loading is moving the data to the Data Warehouse.There are several tools available in the market to do ETL, however if extraction of data is already complete from source systems and the requirement is to perform transformations and loading to Netezza, Netezza provides a very flexible capability called External Tables.
External Tables provide a SQL interface to the data that exists in a flat file or a pipe. Since SQL is a very rich language with several transformational capabilities, External Table inherits most of the transformational capabilities of SQL. For example using External Tables, users can select only certain columns of data from the file, perform operations like casting, truncation, mathematical operations on the selected data and even join this data with other existing table before loading it into the data warehouse.
Further External Tables in conjunction with User Defined Functions(UDF) can be used to manipulate the data read in from the data file with custom user code. This custom user code can be similar to the rules specified in the ETL tools to transform the data.Because of these strong transformational capabilities, External Table is a very good choice to do the Transformation & Loading of an ETL process. In no way External Tables can solve all the ETL problems , however it may satisfy the needs of several applications which may find the ETL tools an overkill.
Let us consider an example of selecting certain columns, performing decoding transformations on that data, deriving new values from some of the columns and then loading this data into a Netezza Table.
CREATE EXTERNAL TABLE customer_ET(NAME CHAR(20),AGE INT4,GENDER CHAR(1),BALANCE INT4,CITY CHAR(20),ADDRESS CHAR(10))USING (dataobject('/home/mycustomer.dat'));
/* Load the data into Netezza Table customer_DW from External Table customer_ET */
Below is a sample of how the evaluate method for citycode() UDF which returns city code based on city name will look like. Similarly UDFs for generation() and rewards() can be defined. Refer to Netezza UDF development guide for more details.
These UDFs can do complex pre-processing on your data before its loaded into Netezza.
/* evaluate method for citycode() UDF. The code under this evaluate method is executed for invocation of citycode UDF. */
if (memcmp("Chicago",datas,7) == 0)
If your requirement is to perform some transformations on the data before loading it into Netezza, then I would recommend you to explore External Tables with UDF to perform these transformations along with loading the data. You would be surprised to see how powerful a simple SQL based interface can be for doing such transformations.
Too many '/'s in the title. That just denotes that this is not a new idea but a practice that is more often than not, forgotten. We have essentially become lazy and very dependent on what IT industry offers us but we forget that with very little effort, one can optimally use the product and really make efficient use of investment that was put in.
We have seen/practiced many methods of using the software/appliance/next to come, kind of tools to be used for easing our day to day operations and also telling us about what to do next (analytics), and handle big data to our use, as for past few years we are in a habit of generating a lot of data and not knowing what to do with it - nevertheless, all for our benefit. This statement kind of sums up the IT industry (or not ?).
Now coming to the topic of appliance and in particular Netezza, true claim is obviously to just plug and play and you get analytics at unbelievable speeds and when I myself tried Netezza, I find the claim to be true to the cent, BUT is where I want to focus in this blog. When you are using anything, there are set of unsaid rules that are to aid easy use and to never get below expectations and same is true for any appliance for example, microwave oven - you need to put it on right setting to get the right dish out of it. So there are some etiquette that one should follow in order to get most out of the appliance and I intend to tell just some more things about Netezza in next few paragraphs. I would however put a disclaimer that as in appliance do not change any parameters (based on the list of things I am going to tell you) yourself without IBM support's help, otherwise the warranty is void and stuff as is true in any other appliance.
Etiquettes (Nz DBA):
A) Do not let the NZ system get treated as OLTP system.
This simply means you knowing that your system is for real analytics and not for transaction processing. A lot can be written regarding this topic but I think I have conveyed my point. If not, please ask and I will respond.
B) Do run nz_responders frequently (mostly when problem is reported).
DBA's very good friend is "nz_responders".
Ask your support engineer for a script to get you information about possible problems.
Monitoring GUIs that Nz has also show enough information at high level. Please use this CUI method when you think a problem might occur and need point in time analysis.
E) Good old toolkit mostly installed in /nz/support/bin (directory structure my be different in different cases, ask your support to provide more info) -> please note that only files starting with "nz_" are script files and pretty self explanatory based on filenames itself. For example:
F) Know that you have best AMPP system with good backup mechanism in that your data is mirrored on a separate disk so that it's easier to get the data back even when a disk fails.
G) Best thing last:
There is a script for best practices nz_best_practices which is self explanatory. You would really like the output. If in doubt, ask.
I hope this blog helped in some way - there is some information that you will need to dig yourself and that is my intention. Also I will call Nz users and developers to add to the points I have mentioned but not going into details as after all NZ is an appliance.
Do you ever have to segment customers by decile? How about generate a histogram for a large set of data?
Typically processes like these are custom coded in ETL, or better yet answered by about 1000 queries auto generated by a BI tool. In either case, the multi-pass SQL for these types of questions can take hours to run (okay, maybe a couple of minutes).
What if there were pre-built function already in the database to execute the rules for these processes and generate the same results, but in minutes and seconds instead of hours and minutes?
In-database analytics are not just for "analytics". They can be used to solve everyday processing and query performance too. Don't disregard the in-database analytics that come with your system, take a look and start applying them to everyday problems, today.