Sparse and normal Lookup in DataStage®
In IBM DataStage, Lookup stage is used to enrich data by retrieving additional information from a reference dataset. Choosing the right lookup type, normal or sparse, can significantly impact performance and resource usage.
Normal Lookup stage
The normal Lookup stage loads the entire reference dataset into memory before it performs lookups. It is ideal for smaller datasets where speed is critical.
Key characteristics of normal Lookup stage:
- It provides high performance because lookups are done in-memory.
- It requires sufficient memory to hold the reference data.
- It is best suited for small, static reference datasets.
- It is ideal for high-speed batch processing where database usage is minimized.
Sparse Lookup stage
Sparse Lookup stage queries the database for each input record instead of loading the entire reference dataset into memory. It is recommended for large datasets.
Key characteristics of sparse Lookup stage:
- It performs lookups directly on the database (also called direct lookup).
- It sends individual SQL queries for each input record.
- The reference data remains in the database, not loaded into memory.
- It is recommended when the reference dataset is too large to fit in memory.
- Its performance depends on database response time and indexing.
- It is suitable for dynamic data change.
DataStage connectors that support sparse Lookup
- IBM Db2 for DataStage
- Oracle
- Netezza®
- Teradata
- Generic JDBC
- ODBC
- Apache Cassandra
- Google BigQuery
- Apache Hive
- Salesforce API for DataStage
- Snowflake