Data partitioning

When you create Datalake tables, you can optionally partition your table data. Datalake tables can be partitioned by one or more columns. You can also partition using expressions.

To partition a table, use the “PARTITIONED BY” clause in your “CREATE Datalake TABLE” statement. For additional information, see Create Datalake Table.

The goal of partitioning is to quickly reduce the amount of data that must be processed for queries in your workloads. Reducing the data that needs to be processed during query processing is called “partition elimination.” Partitioning keys should be chosen with the query workload in mind. Partition by one or more columns that will be referenced frequently in your SQL queries.

For example, if your table includes a DATA or TIMESTAMP column, and your SQL queries often filter your data based on some aspect of the DATA or TIMESTAMP, then consider partitioning by year, by month, or by day.

The number of partitions created (for a partitioned table) will depend on the number of unique values in the columns or expressions you specify in the PARTITIONED BY clause. In our example above, if you partition by year, you will have fewer partitions compared to partitioning by “year-month” or by “year-month-day".

Choose partitioning keys such that the resulting number of partitions is reasonable. Larger tables benefit from more partitions than small to medium tables. Very small tables may not benefit from partitioning.

Also, choose partitioning keys such that partitions will be roughly equal in size. There are tradeoffs to consider with table partitioning. Partitioning usually results in more files storing your table data. If the number of partitions is large compared to the table size, we end up with many small partitions, and many small files. Too many small files can adversely affect performance.

Iceberg Table Partitioning

Partitioning for Datalake Iceberg tables works differently than regular Datalake tables. Iceberg implements hidden partitioning. With hidden partitioning, there's no need to define a separate table column to store partition values. Iceberg generates partition values by taking a column value and optionally modifying it to create a partition value. It keeps track of the relationship between a column value and its partition value.

Hidden partitioning simplifies your queries allowing you to select the data you need while letting Iceberg determine which data files to skip when fetching the matching data. Queries on timestamp data with format YYYY-MM-DD hh:mm:ss partitioned by day, for example, do not have to specify the ho:mm:ss value on the query.