Topic
  • 10 replies
  • Latest Post - 2012-04-26T18:36:54Z by smithha
thilsn (4 Posts)

Pinned topic How do I size the IADB repository

2011-11-04T16:03:23Z
Hi,
I am running IA 8.5 on SQL Server. How do I size the IADB repository?

For example:
NO_OF_COLUMNS  TABLE_SIZE_KB  NUM_ROWS
19             2960           7981

Appreciate your feedback.
Updated on 2012-04-26T18:36:54Z by smithha
  • smithha (161 Posts)

    Re: How do I size the IADB repository

    2011-11-07T15:58:59Z
    Sizing of the IADB depends on quite a number of factors including what type of processing you are mostly intending to do.
    Will you mostly be profiling data? Or analyzing/monitoring data using IA rules?

    From your question, I suspect you are looking at the former. The bulk of storage required from a profiling perspective is for column analysis, where you are breaking down the records to look at column-by-column content. Profiling generates a frequency distribution (FD) table in the IADB for each column analyzed. Within each FD table you will have a unique row per distinct value found within the column. The minimum row size for each distinct value will be 116 bytes; the maximum row size will usually be 678 bytes.

    Calculation for FD table:
    Estimated Minimum Size = Number of Distinct Values * (116 bytes + Average Source Column Length * 2)
    Maximum Size = Number of Distinct Values * 678 bytes
    Number of Distinct Values = Number of Records in the Source Column * Cardinality

    In the case of the table you note, your row length averages about 380 bytes (2960 KB / 7981 rows), and an equal distribution across 19 columns would be ~20 bytes per field (though actual lengths are likely to be more variable).

    If all values in all columns were distinct, you would get a requirement of 7981 * (116 + 20 * 2) = 7981 * 156 bytes ≈ 1,216 KB per FD table (and you would need 19 FD tables if you profile all columns).

    If you were running rules rather than profiling, the requirements tend to be much smaller, as you typically output only exception data and only the fields relevant to the rule output. If you expected 10% of your data to fail your set of rules, you would end up with about 798 rows, with the number of columns equal to 19 or fewer depending on what you chose to output. You could, of course, run a number of distinct rules with output tables for each, but you are unlikely to require more space for rules output than you start with.
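
    To make these estimates easy to rerun against other tables, here is a minimal Python sketch of the formulas above (the function and parameter names are mine, not anything from IA):

    def fd_table_size_bytes(num_rows, cardinality, avg_col_len_bytes):
        """Minimum and maximum size, in bytes, of one frequency distribution (FD) table."""
        distinct = num_rows * cardinality
        minimum = distinct * (116 + avg_col_len_bytes * 2)  # 116-byte minimum row size
        maximum = distinct * 678                            # usual maximum row size
        return minimum, maximum

    def rule_output_size_bytes(num_rows, exception_rate, num_cols, avg_col_len_bytes):
        """Rough size, in bytes, of one rule's exception-output table."""
        return num_rows * exception_rate * num_cols * avg_col_len_bytes

    # The table above: 7981 rows, 19 columns of ~20 bytes each, worst case all distinct
    lo, hi = fd_table_size_bytes(7981, 1.0, 20)
    print(f"one FD table: {lo / 1024:,.0f} KB min, {hi / 1024:,.0f} KB max")
    print(f"all 19 FD tables: {19 * lo / 1024:,.0f} KB min")
    # Rules instead of profiling: 10% exceptions, all 19 columns in the output
    print(f"rule output: {rule_output_size_bytes(7981, 0.10, 19, 20) / 1024:,.0f} KB")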

    Hope this helps.
  • thilsn (4 Posts)

    Re: How do I size the IADB repository

    2011-11-07T21:34:25Z
    • smithha
    • 2011-11-07T15:58:59Z
    Hi Smithha,
    Thank you for the quick feedback. A few clarifications:

    1. We would be using it for monitoring purposes as well (we would be storing 12 snapshots for each table). Assuming this is done for 10 columns, how do we do the sizing?
    2. With regard to:

    "If all values in all columns were distinct, you would get a requirement of 7981 * (116 + 20 * 2) = 7981 * 156 bytes ≈ 1,216 KB per FD table (and you would need 19 FD tables if you profile all columns)."

    for 19 columns, am I right to assume ≈ 1,216 KB * 19?

    Appreciate it if you can confirm.

    Thanks
  • smithha (161 Posts)

    Re: How do I size the IADB repository

    2011-11-07T21:44:58Z
    • thilsn
    • 2011-11-07T21:34:25Z
    On your additional questions:

    1) Usually for monitoring purposes, you would define a set of rules with which you want to validate the data (for example, for 10 columns you might have 20 rules defining the valid conditions). For output, you would include those 10 columns (plus a record ID and perhaps the system date). You then need to approximate the typical % of data with issues.
    For 12 snapshots, the size works out to 12 * (number of rows * expected exception %) * (number of output columns * average column length). Since exceptions tend to be fairly low in a monitoring situation, the sizing should not be particularly high.
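
    As a rough sketch of that snapshot arithmetic in Python (the 2% exception rate and 30-byte average column length are illustrative assumptions, not numbers from your environment):

    def monitoring_size_bytes(num_rows, exception_rate, num_output_cols,
                              avg_col_len_bytes, num_snapshots):
        """Total retained exception output, in bytes, across all snapshots."""
        rows_per_snapshot = num_rows * exception_rate
        bytes_per_snapshot = rows_per_snapshot * num_output_cols * avg_col_len_bytes
        return bytes_per_snapshot * num_snapshots

    # 7981-row table, 2% exceptions, 10 columns + record ID + date = 12 output
    # columns of ~30 bytes each, 12 retained snapshots
    total = monitoring_size_bytes(7981, 0.02, 12, 30, 12)
    print(f"≈ {total / 1024:,.0f} KB across 12 snapshots")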

    2) Yes, you are correct: you would likely need roughly 1,216 KB * 19 columns. The big factor will be the cardinality of the respective columns -- the more unique the data across all columns, the more space you need.

    Harald
  • thilsn (4 Posts)

    Re: How do I size the IADB repository

    2011-11-08T16:00:06Z
    • smithha
    • 2011-11-07T21:44:58Z
    Thank you for the quick response. The information is very helpful.

    On a side note: if we are profiling 1-12 million records, I believe the IA jobs create temporary files in the IA file directory (please confirm). Is there file space that needs to be planned for IA?

    Thanks
  • smithha (161 Posts)

    Re: How do I size the IADB repository

    2011-11-09T15:50:00Z
    • thilsn
    • 2011-11-08T16:00:06Z
    Yes, you will need to plan for file and sort space during execution of IA jobs.

    If you are running IA rules, the following article should be of use:
    http://www.ibm.com/developerworks/data/library/techarticle/dm-1010infoanalyzerresources/index.html

    If you are running Column Analysis, here is some guidance:

    During runtime, an IA profiling job consumes disk space on the engine system to analyze the source data, and then stores the technical metadata results in the IADB database.

    Under default conditions, most of the space consumed at runtime on the engine system is reclaimed. However, if the user chooses to retain scripts and/or datasets, some files persist even after job completion. Retaining datasets is generally not recommended, but scripts can be retained for job monitoring and logging purposes.

    Disk space is consumed in the following locations:

    Retain Scripts: When this option is checked, scripts are retained. These scripts are the drivers of the underlying jobs and don't consume significant space. The size of the scripts is typically 3-4 KB but can vary slightly with the number of columns profiled in a given job.

    When chosen, scripts are persisted on the engine system under IBM_HOME/Server/Projects/<DATASTAGE_IA_PROJECT>/RT_SC99, where:
    IBM_HOME: the Information Server install location.
    <DATASTAGE_IA_PROJECT>: the DataStage project for which IA is configured to execute jobs.
    RT_SC99: an example folder name; folder names start with RT_SC followed by a sequential number.

    Retain Datasets: Checking this option retains the datasets. Datasets are the binary files in which the source data is temporarily held for analysis. If the Retain Datasets option is not chosen, these files are deleted after job completion. The size of the datasets depends on the size of the source data, and they can be large.

    Scratch Space: This is the space needed to sort the data so that it can be organized and interpreted before the technical metadata results are stored in the relational database.

    The locations of datasets and scratch space are specified in the IBM_HOME/Server/Configurations/DEFAULT.APT file on the engine machine. On a multi-node system, a new configuration file can be created for concurrent job processing, with dataset and scratch-space folder locations for each node. The new configuration file can be associated with Information Analyzer's DataStage project using the DataStage Administrator UI.
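
    For illustration, a minimal configuration file of this kind might look like the sketch below; the node name, host name, and directory paths are placeholders for your own environment, and your actual file may define more nodes and resources:

    {
      node "node1"
      {
        fastname "engine-host"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets" { pools "" }
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" { pools "" }
      }
    }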

    The following two formulas can be used to estimate the disk space required at runtime:

    Space calculation for datasets
    Estimated dataset size per column = U * (R + 4)
    Where:
    U = number of unique rows (distinct values)
    R = bytes per row
    If multiple columns are profiled, this formula needs to be summed across the columns.

    Space calculation for sort space
    Estimated sort space = 2 * N * (2 * R + 16)
    Where:
    R = bytes per row. Note that a row can have multiple columns.
    N = number of rows

    Example calculation for dataset and sort space
    As an example, say the user is profiling a single column in a table with 28 million records, and suppose there are 12 million distinct values with an average record size of 55 bytes.

    Estimated sort space:
    2 * N * (2 * R + 16) = 2 * 28,000,000 * (2 * 55 + 16) ≈ 6.6 GB, plus

    Estimated dataset space:
    12,000,000 * (55 + 4) ≈ 0.66 GB

    So the total space estimate at runtime to successfully execute this example job is 6.6 + 0.66 ≈ 7.2 GB.
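
    A short Python sketch that reproduces this example and can be reused for other tables (the function names are mine):

    GB = 1024 ** 3

    def dataset_space_bytes(unique_rows, row_bytes):
        """Estimated dataset space for one profiled column: U * (R + 4)."""
        return unique_rows * (row_bytes + 4)

    def sort_space_bytes(num_rows, row_bytes):
        """Estimated sort (scratch) space: 2 * N * (2*R + 16)."""
        return 2 * num_rows * (2 * row_bytes + 16)

    # The example above: 28 million records, 12 million distinct values, 55-byte rows
    sort = sort_space_bytes(28_000_000, 55)
    data = dataset_space_bytes(12_000_000, 55)
    print(f"sort ≈ {sort / GB:.1f} GB, datasets ≈ {data / GB:.2f} GB, "
          f"total ≈ {(sort + data) / GB:.1f} GB")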

    Harald
  • thilsn (4 Posts)

    Re: How do I size the IADB repository

    2011-11-09T16:40:08Z
    • smithha
    • 2011-11-09T15:50:00Z
    Thank you, Smithha.
  • JohnDAL (2 Posts)

    Re: How do I size the IADB repository

    2012-04-25T22:41:03Z
    • smithha
    • 2011-11-09T15:50:00Z
    Where do I set the "retain datasets" option mentioned? I see the checkbox for Retain Scripts on the Analysis Engine tab. This is on version 8.5.
  • RobertDickson (137 Posts)

    Re: How do I size the IADB repository

    2012-04-26T11:08:10Z
    • JohnDAL
    • 2012-04-25T22:41:03Z
    Take a look at http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=%2Fcom.ibm.swg.im.iis.ia.administer.doc%2Ftopics%2Ft_pa_modify_analysis_eng_sett.html
  • JohnDAL (2 Posts)

    Re: How do I size the IADB repository

    2012-04-26T17:08:21Z
    Thanks for the link. But these instructions lead me to the same Analysis Engine tab where it's possible to set the value for "retain scripts." I'm interested in setting "retain datasets" to no, which is possible according to the entry by smithha. But that doesn't appear to be visible in the UI. Is there some other way of setting that? Or am I just not seeing it in the UI?
  • smithha (161 Posts)

    Re: How do I size the IADB repository

    2012-04-26T18:36:54Z
    • JohnDAL
    • 2012-04-26T17:08:21Z
    When submitting a job for execution in the UI (e.g., Run Column Analysis), there is a series of tabs with specific execution options. One of those is the ENGINE tab, which includes the specific options for Retain Scripts and Retain Datasets. By default, Retain Datasets is set to 'no', and unless you need to do specific debugging there should never be a need to change it. The datasets are temporary and will be removed automatically.