ComputeTableStats

The ComputeTableStats action collects statistics for an Iceberg table and writes them to Puffin files. These statistics help query engines make better optimization decisions.

Interface

public interface ComputeTableStats extends Action<ComputeTableStats, ComputeTableStats.Result>

Overview

Table statistics provide valuable information for query optimization, including:

Column value distributions
Distinct value counts (NDV)
Null counts
Min/max values
Data sketches for cardinality estimation

The ComputeTableStats action:

Analyzes data files in the table
Computes statistics for specified columns
Stores results in Puffin format
Associates statistics with specific snapshots
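
To make the "data sketches for cardinality estimation" item above more concrete, here is a minimal sketch of a K-minimum-values (KMV) distinct-count estimator. It is not Iceberg or Apache DataSketches code: the Puffin specification defines a blob type based on DataSketches theta sketches, and KMV is only a simpler relative that shows the idea. The class name KmvSketch, the hash mixer, and the choice of k are assumptions made for illustration.

```java
import java.util.TreeSet;

/** Illustrative KMV distinct-count estimator; not Iceberg or DataSketches code. */
final class KmvSketch {
    private final int k;
    // The k smallest hash values seen so far, ordered as unsigned 64-bit integers.
    private final TreeSet<Long> smallest = new TreeSet<>(Long::compareUnsigned);

    KmvSketch(int k) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        this.k = k;
    }

    /** SplitMix64-style mixer so that nearby inputs map to well-spread hashes. */
    static long mix64(long value) {
        long z = value + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void add(long value) {
        long hash = mix64(value);
        if (smallest.size() < k) {
            smallest.add(hash);
        } else if (Long.compareUnsigned(hash, smallest.last()) < 0 && smallest.add(hash)) {
            smallest.pollLast(); // evict the largest retained hash to stay at k entries
        }
    }

    double estimate() {
        if (smallest.size() < k) {
            return smallest.size(); // fewer than k distinct hashes: the count is exact
        }
        // Map the k-th smallest hash into the unit interval using its top 53 bits.
        double kthFraction = (smallest.last() >>> 11) * 0x1.0p-53;
        return (k - 1) / kthFraction;
    }
}

final class KmvDemo {
    public static void main(String[] args) {
        KmvSketch sketch = new KmvSketch(1024);
        for (long userId = 0; userId < 100_000; userId++) {
            sketch.add(userId);
            sketch.add(userId); // duplicates do not change the estimate
        }
        System.out.printf("estimated NDV: %.0f (true NDV: 100000)%n", sketch.estimate());
    }
}
```

The sketch keeps only k hash values regardless of how many rows it sees, which is why an NDV estimate can be stored compactly in a statistics file. Larger k lowers the estimation error at the cost of a larger sketch.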

Methods

columns

Specify which columns to collect statistics for.

ComputeTableStats columns(String... columns)

Parameters:

columns - Variable number of column names to analyze

Returns: this for method chaining

Example:

// Compute stats for specific columns
action.columns("user_id", "event_type", "timestamp");

If not specified, statistics are collected for all columns in the table.

snapshot

Specify which snapshot to compute statistics for.

ComputeTableStats snapshot(long snapshotId)

Parameters:

snapshotId - The ID of the snapshot to analyze

Returns: this for method chaining

Example:

// Compute stats for a specific snapshot
action.snapshot(1234567890L);

If not specified, statistics are computed for the current snapshot.

Result

The Result interface provides information about the computed statistics.

Methods

interface Result {
  StatisticsFile statisticsFile();
}

statisticsFile()

Returns the statistics file containing the computed statistics, or null if no statistics were collected.

Usage Examples

Basic Statistics Collection

// Compute stats for all columns in current snapshot
ComputeTableStats.Result result = actions
  .computeTableStats(table)
  .execute();

StatisticsFile statsFile = result.statisticsFile();
if (statsFile != null) {
  System.out.println("Statistics written to: " + statsFile.path());
  System.out.println("Blob count: " + statsFile.blobMetadata().size());
}

Specific Columns

// Compute stats for specific columns only
ComputeTableStats.Result result = actions
  .computeTableStats(table)
  .columns("user_id", "product_id", "purchase_amount")
  .execute();

if (result.statisticsFile() != null) {
  System.out.println("Statistics computed for selected columns");
}

Specific Snapshot

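// Compute stats for an explicitly chosen snapshot.
// The current snapshot ID is used here for illustration; pass the ID of an
// older snapshot to analyze historical data.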
long snapshotId = table.currentSnapshot().snapshotId();

ComputeTableStats.Result result = actions
  .computeTableStats(table)
  .snapshot(snapshotId)
  .execute();

With Column Selection

// Compute stats for frequently queried columns
ComputeTableStats.Result result = actions
  .computeTableStats(table)
  .columns(
    "date",
    "customer_id",
    "region",
    "product_category"
  )
  .execute();

StatisticsFile statsFile = result.statisticsFile();
if (statsFile != null) {
  System.out.println("Stats file snapshot: " + statsFile.snapshotId());
  System.out.println("File size: " + statsFile.fileSizeInBytes() + " bytes");
}

Periodic Statistics Update

// Update statistics after significant data changes
public void updateTableStats(Table table, ActionsProvider actions) {
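  Snapshot current = table.currentSnapshot();
  if (current == null) {
    return; // Empty table: there is no snapshot to analyze
  }

  // Only compute if there are enough new records.
  // getRecordsSinceLastStats is a user-supplied helper, not part of the Iceberg API.
  long recordsSinceLastStats = getRecordsSinceLastStats(table);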
  if (recordsSinceLastStats > 1_000_000) {
    ComputeTableStats.Result result = actions
      .computeTableStats(table)
      .snapshot(current.snapshotId())
      .execute();
    
    if (result.statisticsFile() != null) {
      System.out.println("Statistics updated for " + recordsSinceLastStats + " new records");
    }
  }
}
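
The getRecordsSinceLastStats helper used above is not defined in the example. One possible approach, sketched below under stated assumptions, is to walk from the current snapshot back through its parents and sum the "added-records" snapshot summary property until reaching a snapshot that already has a statistics file. To keep the sketch self-contained, snapshots are modeled with a hypothetical SnapshotInfo class; in real code the same values would come from Snapshot.snapshotId(), Snapshot.parentId(), Snapshot.summary(), and the snapshot IDs of Table.statisticsFiles().

```java
import java.util.Map;
import java.util.Set;

/** Illustrative helper; class and method names are hypothetical. */
final class StatsFreshness {

    /** Hypothetical stand-in for the fields read from an Iceberg Snapshot. */
    static final class SnapshotInfo {
        final long id;
        final Long parentId; // null for the first snapshot in the table history
        final Map<String, String> summary;

        SnapshotInfo(long id, Long parentId, Map<String, String> summary) {
            this.id = id;
            this.parentId = parentId;
            this.summary = summary;
        }
    }

    /**
     * Sums records added by the current snapshot and its ancestors, stopping at the
     * first snapshot that already has a statistics file.
     */
    static long recordsSinceLastStats(
            Map<Long, SnapshotInfo> snapshotsById, long currentId, Set<Long> snapshotsWithStats) {
        long added = 0;
        Long cursor = currentId;
        while (cursor != null && !snapshotsWithStats.contains(cursor)) {
            SnapshotInfo snapshot = snapshotsById.get(cursor);
            if (snapshot == null) {
                break; // ancestor expired or unknown: stop counting
            }
            // Snapshots that add no rows may omit the property, so default to zero.
            added += Long.parseLong(snapshot.summary.getOrDefault("added-records", "0"));
            cursor = snapshot.parentId;
        }
        return added;
    }
}
```

This counts only added records. Deletes and overwrites can also shift value distributions, so a production trigger may want to consider other summary properties as well.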

Statistics File Format

The action stores statistics in Puffin files, which contain:

Blob metadata: Information about each statistic
Statistics data: Actual statistical values and sketches
Snapshot association: Link to the analyzed snapshot

Accessing Statistics

ComputeTableStats.Result result = actions
  .computeTableStats(table)
  .columns("user_id")
  .execute();

StatisticsFile statsFile = result.statisticsFile();
if (statsFile != null) {
  System.out.println("Path: " + statsFile.path());
  System.out.println("Snapshot ID: " + statsFile.snapshotId());
  System.out.println("Size: " + statsFile.fileSizeInBytes());
  
  // Access blob metadata
  statsFile.blobMetadata().forEach(blob -> {
    System.out.println("  Type: " + blob.type());
    System.out.println("  Fields: " + blob.fields());
  });
}

When to Compute Statistics

Consider running ComputeTableStats in the following situations:

After large data loads: New data may change value distributions
Before important queries: Ensure optimizers have current information
On a schedule: Regular updates for frequently changing tables
After schema evolution: New columns need statistics
Performance degradation: Outdated statistics may cause poor plans

Best Practices

Select important columns: Focus on columns used in joins, filters, and aggregations

// Prioritize columns used in query predicates
action.columns(
  "partition_date",    // Partition key
  "customer_id",       // Join key
  "status",            // Filter column
  "amount"             // Aggregation column
);

Update regularly: Schedule periodic statistics updates
Monitor statistics age: Track when statistics were last computed
Consider cost vs. benefit: Statistics computation can be expensive for large tables
Use with query optimization: Ensure your query engine uses Iceberg statistics
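
The "monitor statistics age" practice above can be approximated with a small staleness check. The sketch below is an assumption about how one might do it, not an Iceberg feature: it measures age as the time elapsed since the commit of the snapshot that the newest statistics file describes (available in Iceberg as Snapshot.timestampMillis()). The StatsAgePolicy name and the maximum-age threshold are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative staleness check; the class and method names are hypothetical. */
final class StatsAgePolicy {
    private final Duration maxAge;

    StatsAgePolicy(Duration maxAge) {
        this.maxAge = maxAge;
    }

    /**
     * @param statsSnapshotMillis commit time (epoch millis) of the snapshot described by
     *     the newest statistics file, or null when no statistics file exists yet
     * @param now the reference time, passed in so the check is easy to test
     */
    boolean isStale(Long statsSnapshotMillis, Instant now) {
        if (statsSnapshotMillis == null) {
            return true; // statistics were never computed
        }
        Duration age = Duration.between(Instant.ofEpochMilli(statsSnapshotMillis), now);
        return age.compareTo(maxAge) > 0;
    }
}
```

A scheduler could call isStale before launching the action and skip the run when it returns false, which also helps with the cost versus benefit trade-off listed above.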

Performance Considerations

Costs

Reads all data files for analyzed columns
Computes aggregations and sketches
Writes statistics files to storage
Can be time-consuming for large tables

Optimization Tips

Limit to frequently queried columns
Run during off-peak hours
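Use the snapshot method to pin the computation to one snapshot, and skip the run when that snapshot already has a statistics file
Recompute only after enough data has changed, as in the periodic update example, since the interface shown here exposes no incremental mode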

Configuration Options

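ComputeTableStats declares no options of its own, but it inherits option(String, String) from the parent Action interface, so implementations may accept implementation-specific settings. The option keys below are illustrative and may not exist in your implementation: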

ComputeTableStats.Result result = actions
  .computeTableStats(table)
  .columns("user_id", "event_type")
  .option("statistics-output-location", "s3://my-bucket/stats/")
  .option("max-statistics-size", "100MB")
  .execute();

Check your specific Actions implementation for available options.

Statistics and Query Optimization

Statistics help query engines:

Estimate cardinalities: Choose optimal join orders
Skip data files: Prune files based on min/max values
Optimize aggregations: Pre-aggregate when beneficial
Choose algorithms: Select hash vs. sort-based operations

Combine table statistics with partition pruning and manifest filtering for best query performance.
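
As a rough illustration of the "estimate cardinalities" point above, the sketch below applies the textbook equi-join estimate, rows(R) * rows(S) / max(NDV(R.key), NDV(S.key)), which assumes uniformly distributed keys. It is not the formula of any particular engine, and the JoinEstimator name and the sample numbers are made up for the example.

```java
/** Illustrative textbook join-size estimate; not taken from any specific query engine. */
final class JoinEstimator {

    /**
     * Estimates the output rows of an equi-join from row counts and the distinct-value
     * counts (NDV) of the join keys on each side.
     */
    static double estimateJoinRows(long leftRows, long leftNdv, long rightRows, long rightNdv) {
        // Guard against a zero NDV (for example, missing statistics) to avoid dividing by zero.
        long maxNdv = Math.max(Math.max(leftNdv, rightNdv), 1L);
        return (double) leftRows * (double) rightRows / maxNdv;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 10M orders referencing 500K distinct customers,
        // joined to a customers table with 500K rows and 500K distinct IDs.
        double ordersToCustomers = estimateJoinRows(10_000_000L, 500_000L, 500_000L, 500_000L);
        System.out.println("orders JOIN customers ~ " + ordersToCustomers + " rows");
    }
}
```

An optimizer can compare such estimates across candidate join orders and start with the join expected to produce the fewest rows. Without NDV statistics it has to fall back on guesses, which is where plans tend to go wrong.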

Related

Puffin File Format - Statistics file format specification
Query Optimization - How statistics improve queries
Table Metadata - Understanding table metadata structure
RewriteDataFiles - Data optimization actions
