ComputeTableStats
The ComputeTableStats action collects statistics for an Iceberg table and writes them to Puffin files. These statistics help query engines make better optimization decisions.
Interface
Overview
Table statistics provide valuable information for query optimization, including:
- Column value distributions
- Distinct value counts (NDV)
- Null counts
- Min/max values
- Data sketches for cardinality estimation
The action:
- Analyzes data files in the table
- Computes statistics for specified columns
- Stores results in Puffin format
- Associates statistics with specific snapshots
Methods
columns
Specify which columns to collect statistics for.
Parameters:
- columns - Variable number of column names to analyze
Returns: this for method chaining
Example:
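A minimal sketch of the chaining pattern, assuming a hypothetical stand-in class named StatsColumnSelection written only for this illustration. It is not the Iceberg interface. It mirrors the documented behavior: the method takes a variable number of column names and returns this, and an empty selection falls back to every column in the table. The check for unknown column names is an assumption of the sketch.

```java
import java.util.List;

// Hypothetical stand-in, NOT the Iceberg API: models only the column selection.
final class StatsColumnSelection {
  private final List<String> tableColumns;
  private List<String> selected = List.of(); // empty means "not specified"

  StatsColumnSelection(List<String> tableColumns) {
    this.tableColumns = List.copyOf(tableColumns);
  }

  // Varargs setter that returns this so calls can be chained.
  StatsColumnSelection columns(String... columns) {
    for (String column : columns) {
      // Assumption of this sketch: unknown names are rejected up front.
      if (!tableColumns.contains(column)) {
        throw new IllegalArgumentException("Unknown column: " + column);
      }
    }
    this.selected = List.of(columns);
    return this;
  }

  // Columns that would be analyzed: the selection, or all columns if none was given.
  List<String> effectiveColumns() {
    return selected.isEmpty() ? tableColumns : selected;
  }
}
```

With a table that has columns id, name, and ts, calling columns("id", "ts") narrows the analysis to those two, while skipping the call leaves all three selected.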
If not specified, statistics are collected for all columns in the table.
snapshot
Specify which snapshot to compute statistics for.
Parameters:
- snapshotId - The ID of the snapshot to analyze
Returns: this for method chaining
Example:
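A minimal sketch of how the snapshot choice resolves, assuming a hypothetical stand-in class named StatsSnapshotSelection. It is not the Iceberg interface. It only models the documented rule: an explicit snapshot ID wins, otherwise the current snapshot is used.

```java
import java.util.OptionalLong;

// Hypothetical stand-in, NOT the Iceberg API: models only the snapshot selection.
final class StatsSnapshotSelection {
  private final long currentSnapshotId;
  private OptionalLong requested = OptionalLong.empty();

  StatsSnapshotSelection(long currentSnapshotId) {
    this.currentSnapshotId = currentSnapshotId;
  }

  // Pins the analysis to a specific snapshot and returns this for chaining.
  StatsSnapshotSelection snapshot(long snapshotId) {
    this.requested = OptionalLong.of(snapshotId);
    return this;
  }

  // Snapshot that would be analyzed: the requested one, or the current one by default.
  long targetSnapshotId() {
    return requested.orElse(currentSnapshotId);
  }
}
```

Pinning a snapshot ID makes a run reproducible even if new commits land on the table while statistics are being computed.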
If not specified, statistics are computed for the current snapshot.
Result
The Result interface provides information about the computed statistics.
Methods
Returns the statistics file that was written, or null if no statistics were collected.
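Because the result can carry null, callers should guard before using it. The sketch below shows one way to do that with Optional. The nested Result interface and its statisticsFilePath method are hypothetical stand-ins for this example, not the real result type.

```java
import java.util.Optional;

final class StatsResultHandling {
  private StatsResultHandling() {}

  // Hypothetical stand-in for the action result, NOT the Iceberg type.
  interface Result {
    String statisticsFilePath(); // null when no statistics were collected
  }

  // Turns the nullable value into a message without risking a NullPointerException.
  static String describe(Result result) {
    return Optional.ofNullable(result.statisticsFilePath())
        .map(path -> "Statistics written to " + path)
        .orElse("No statistics were collected");
  }
}
```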
Usage Examples
Basic Statistics Collection
Specific Columns
Specific Snapshot
With Column Selection
Periodic Statistics Update
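A scheduled job does not need to recompute statistics on every tick. The sketch below is one possible refresh policy, written under assumptions: the class name StatsRefreshPolicy, its parameters, and the minimum-interval throttle are all hypothetical and not part of the action. It recomputes when no statistics exist, skips when statistics already cover the current snapshot, and otherwise waits until a minimum interval has passed.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for a scheduler, NOT part of the Iceberg API.
final class StatsRefreshPolicy {
  private StatsRefreshPolicy() {}

  static boolean shouldRecompute(
      long currentSnapshotId,
      Long statsSnapshotId,   // snapshot the existing statistics belong to, or null
      Instant lastComputed,   // when statistics were last computed, or null
      Instant now,
      Duration minInterval) {
    // No statistics yet: compute them.
    if (statsSnapshotId == null || lastComputed == null) {
      return true;
    }
    // Statistics already describe the current snapshot: same data, nothing to do.
    if (statsSnapshotId.longValue() == currentSnapshotId) {
      return false;
    }
    // Table has changed: recompute only once the minimum interval has elapsed.
    return Duration.between(lastComputed, now).compareTo(minInterval) >= 0;
  }
}
```

A job run from cron or a ScheduledExecutorService could call this check first and invoke the action only when it returns true.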
Statistics File Format
The action stores statistics in Puffin files, which contain:
- Blob metadata: Information about each statistic
- Statistics data: Actual statistical values and sketches
- Snapshot association: Link to the analyzed snapshot
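As a quick sanity check on a statistics file, the Puffin specification defines a 4-byte magic, the ASCII characters PFA1, that appears at the very start of the file and again at the very end of the footer. The sketch below checks only those two markers on an in-memory byte array. The class and method names are hypothetical, and it does not parse blobs or the footer payload.

```java
import java.util.Arrays;

// Hypothetical helper: checks the Puffin magic bytes only, does not parse the file.
final class PuffinMagicCheck {
  private PuffinMagicCheck() {}

  // "PFA1" in ASCII, as defined by the Puffin spec.
  static final byte[] MAGIC = {0x50, 0x46, 0x41, 0x31};

  static boolean looksLikePuffin(byte[] content) {
    int n = MAGIC.length;
    // Too short to hold both the leading and the trailing magic.
    if (content.length < 2 * n) {
      return false;
    }
    byte[] head = Arrays.copyOfRange(content, 0, n);
    byte[] tail = Arrays.copyOfRange(content, content.length - n, content.length);
    return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
  }
}
```

For real files it would be cheaper to read just the first and last four bytes rather than loading the whole file into memory.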
Accessing Statistics
When to Compute Statistics
Consider running ComputeTableStats in these situations:
- After large data loads: New data may change value distributions
- Before important queries: Ensure optimizers have current information
- On a schedule: Regular updates for frequently changing tables
- After schema evolution: New columns need statistics
- When performance degrades: Outdated statistics may cause poor plans
Best Practices
- Select important columns: Focus on columns used in joins, filters, and aggregations
- Update regularly: Schedule periodic statistics updates
- Monitor statistics age: Track when statistics were last computed
- Consider cost vs. benefit: Statistics computation can be expensive for large tables
- Use with query optimization: Ensure your query engine uses Iceberg statistics
Performance Considerations
Costs
- Reads all data files for analyzed columns
- Computes aggregations and sketches
- Writes statistics files to storage
- Can be time-consuming for large tables
Optimization Tips
- Limit to frequently queried columns
- Run during off-peak hours
- Use the snapshot parameter to avoid re-computing statistics for data that has not changed
- Consider incremental statistics updates
Configuration Options
While not shown in the interface, implementations may support additional options.
Statistics and Query Optimization
Statistics help query engines:
- Estimate cardinalities: Choose optimal join orders
- Skip data files: Prune files based on min/max values
- Optimize aggregations: Pre-aggregate when beneficial
- Choose algorithms: Select hash vs. sort-based operations
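To make the cardinality point concrete, the sketch below applies a common textbook estimate for an equi-join: rows(R) × rows(S) / max(NDV(R.key), NDV(S.key)). This is an illustration of how distinct value counts can feed a planner, not a description of what any particular query engine does. The class and method names are hypothetical.

```java
// Hypothetical illustration of a textbook estimate, NOT any engine's actual planner logic.
final class JoinCardinalityEstimate {
  private JoinCardinalityEstimate() {}

  // Estimated output rows of an equi-join, given row counts and join-key NDVs.
  static double estimateEquiJoinRows(
      long leftRows, long leftNdv, long rightRows, long rightNdv) {
    if (leftNdv <= 0 || rightNdv <= 0) {
      throw new IllegalArgumentException("NDV must be positive");
    }
    // Assumes uniformly distributed keys and that the smaller key set is contained in the larger.
    return (double) leftRows * (double) rightRows / Math.max(leftNdv, rightNdv);
  }
}
```

A planner comparing candidate join orders could evaluate an estimate like this for each pair of inputs and prefer the order that keeps intermediate results small. Stale NDV values skew the estimate, which is why outdated statistics can lead to poor plans.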
Related
- Puffin File Format - Statistics file format specification
- Query Optimization - How statistics improve queries
- Table Metadata - Understanding table metadata structure
- RewriteDataFiles - Data optimization actions