Collect Stats in Teradata
Collect Stats in Teradata collect the statistics for one or even multiple columns of the base table specified. These statistics include the hash index, a join index, etc. Primarily its the collection of the statistical profile of the required columns and the same gets stored in a data dictionary. These statistics are usually utilized to optimize the SQL’s submitted on these tables, during the Execution plan by the Parsing Engine.
What is Collect stats in Teradata?
Collect Stats is used to gather the statistics on various columns as per the requirement. Thereafter the Teradata’s optimizer creates an execution strategy that is based on these statistics collected by the COLLECT STATS command.
This COLLECT STATS command gathers data demographics and environment information which the optimizer utilizes in order to optimize the plan for the SQL’s based on these tables.
Environment information consists of the below set of info:
- The amount of memory required
- The number of nodes
Data demographics consist of the below set of info:
- The row size of the table
- Number of rows in that table
- The entire range of values present in that table
- The number of rows available per table
- Number of Null values in that table
There is a variety of approaches available to collect statistics over a table. Some of those are mentioned below:
- Using the sample option: This includes unique index columns, nearly unique indexes or columns
- Full statistic collection methodology: This usually includes Non-Indexed columns, Partition for all tables whether permitted or not, Collection of full stats over relevant columns.
- Most NUPI’S, That is Non-unique primary indexed columns or UPI’s that is the unique primary indexed columns
- Apart from these, it also includes the single column join constraints along with NUSI’s
- The Random AMP Sampling: It involves the USI’s or UPI’s if only used with equality predicates.
How to Collect Stats in Teradata?
Below is the syntax of COLLECT STATS statement in Teradata:
COLLECT [SUMMARY] STATISTICS
Here the keyword SUMMARY is optional and the user may skip it if not required.
- Column names on which the statistics are to be saved are specified ahead of the keyword COLUMN, within the round parenthesis.
- The table on which the statistic is to be collected is specified ahead of the ON
Let’s take up some examples to understand how to collect stats in Teradata in detail:
COLLECT STATISTICS COLUMN(roll_number) ON Student_Table;
- This will collect stats on the roll_number column of the student_table
- When the above query is executed, the below kind of output is produced
- 10 Rows mean that the table had 10 records over which the stats have been collected over the column roll_number
- Thereafter the total elapsed time for the entire process to take place is mentioned
Once the stats have been collected, the subsequent SQL’s can be optimized by the Parsing Engine based on this table. Let’s take an example to understand the optimization in detail.
Suppose we have two tables table1 and table2.
table1 contains the details of the students like id, name, age, marks, etc
whereas table2 contains the geographic info of the students like address, location along with the primary key as ID
From table1 as a
left join table2 as b
on a.id = b.id
Let’s consider two cases wherein the above-mentioned query gets executed.
CASE 1: When we do not have any information regarding the statistics of any columns from table1 and table2. In this case, the execution plan for the above query will be more costly.
CASE2: When we do have the specific required information regarding the statistics of any columns from table1 and table2. In this case, the execution plan for the above query will be less costly.
Reason being, during the join which is based on the id column from table1 and table2, needs to be on the same AMP in order to join the data based on this column from table1 and table2
Suppose table1 contain 100 records having ID from 1 to 100 distributed evenly over 10 AMP in the below fashion.
- The records having the ID from 1 to 10 in AMP1
- The records having the ID from 11 to 20 in AMP2
- The records having the ID from 21 to 30 in AMP3
- The records having the ID from 31 to 40 in AMP4
- and so on…
- The records having the ID from 91 to 100 in AMP10
And table2 has only 80 records having ID from 1 to 100 with 20 missing ID’s
Suppose the data is distributed as below:
- The records having the ID from 1 to 8 in AMP1
- The records having the ID from 9 to 10 and 15 to 20 in AMP2
- The records having the ID from 21 to 28 in AMP3
- The records having the ID from 29 to 36 in AMP4
- and so on…
- The records having the ID from 92 to 100 in AMP8
Now for the join to happen, the ID’s should be available in the SAME AMP
This will require a redistribution of data.
Generally, the data from the smaller table is redistributed. So here the redistribution will happen for table2
And this redistribution will be faster if the statistics are collected beforehand for these two tables.
How can we see the statistics collected for the tables in Teradata?
The collected stats can be seen using the below query:
HELP STATISTICS <table_name>
Let’s see the stats collected on Student_table
HELP STATISTICS student_table
When the above query is executed, The result will be something like this:
- COLLECT STATS collects the statistics about the tables, that can be utilized further on to optimize the queries based on these tables.
- These collected stats can be viewed as well using the HELP STATISTICS command.
- Statistics can be collected on the required number of columns for the required tables.
This is a guide to Collect Stats in Teradata. Here we discuss What is Collect stats in Teradata? and How to Collect Stats in Teradata?. You can also go through our other related articles to learn more –