Overview

Iceberg provides stored procedures for table maintenance and management. Procedures are available when the Iceberg SQL extensions are configured.
In Spark 4.0 and later, procedures are supported natively, but procedure names are case-sensitive.

Using Procedures

Call procedures from any configured catalog using the CALL statement:
CALL catalog_name.system.procedure_name(arguments);

Argument Passing

Arguments can be passed by position or by name; named arguments may appear in any order:
CALL catalog_name.system.rollback_to_snapshot(
    table => 'db.sample',
    snapshot_id => 1234
);
Mixing positional and named arguments in the same call is not supported.
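When arguments are passed positionally, they follow the order the procedure defines. The named-argument call above is equivalent to:
CALL catalog_name.system.rollback_to_snapshot('db.sample', 1234);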

Snapshot Management

rollback_to_snapshot

Roll back a table to a specific snapshot:
CALL catalog.system.rollback_to_snapshot('db.sample', 1);
Arguments:
  • table (required) - Table name
  • snapshot_id (required) - Target snapshot ID

rollback_to_timestamp

Roll back to a snapshot at a specific time:
CALL catalog.system.rollback_to_timestamp(
    'db.sample',
    TIMESTAMP '2021-06-30 00:00:00.000'
);
Arguments:
  • table (required) - Table name
  • timestamp (required) - Target timestamp

set_current_snapshot

Set the current snapshot (not limited to ancestors):
CALL catalog.system.set_current_snapshot('db.sample', 1);
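Recent Iceberg releases also accept a ref argument in place of snapshot_id, setting the current snapshot to the head of a named branch or tag (the branch name here is illustrative):
CALL catalog.system.set_current_snapshot(
    table => 'db.sample',
    ref => 'audit-branch'
);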

cherrypick_snapshot

Apply changes from a snapshot without removing the original:
CALL catalog.system.cherrypick_snapshot('my_table', 1);
Only append and dynamic overwrite snapshots can be cherry-picked.

fast_forward

Fast-forward a branch to another branch’s head:
CALL catalog.system.fast_forward(
    'my_table',
    'main',
    'audit-branch'
);

Metadata Management

expire_snapshots

Remove old snapshots and unreferenced data files:
CALL catalog.system.expire_snapshots('db.sample');
Arguments:
  • table (required) - Table name
  • older_than - Expiration timestamp (default: 5 days ago)
  • retain_last - Minimum snapshots to keep (default: 1)
  • max_concurrent_deletes - Thread pool size for deletions
  • stream_results - Stream results to prevent driver OOM
  • snapshot_ids - Specific snapshot IDs to expire
Output:
  • deleted_data_files_count
  • deleted_position_delete_files_count
  • deleted_equality_delete_files_count
  • deleted_manifest_files_count
  • deleted_manifest_lists_count
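A fuller call combining several of the named arguments listed above (the timestamp and retention values are illustrative):
CALL catalog.system.expire_snapshots(
    table => 'db.sample',
    older_than => TIMESTAMP '2023-06-30 00:00:00.000',
    retain_last => 10,
    stream_results => true
);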

remove_orphan_files

Remove files not referenced in table metadata:
CALL catalog.system.remove_orphan_files(
    table => 'db.sample',
    dry_run => true
);
Arguments:
  • table (required) - Table name
  • older_than - Remove files older than this (default: 3 days ago)
  • location - Specific directory to scan
  • dry_run - Preview without deleting (default: false)
  • max_concurrent_deletes - Thread pool size
  • stream_results - Stream results to prevent OOM
Orphan file removal is irreversible. Always run with dry_run => true first.

rewrite_data_files

Compact small files and optimize data layout:
CALL catalog.system.rewrite_data_files('db.sample');
Common Options:
  • target-file-size-bytes - Target output file size (default: 512 MB)
  • min-file-size-bytes - Files below this are rewritten (default: 75% of target)
  • max-file-size-bytes - Files above this are rewritten (default: 180% of target)
  • min-input-files - Minimum files to trigger rewrite (default: 5)
  • rewrite-all - Force rewrite all files (default: false)
  • remove-dangling-deletes - Remove orphaned delete files (default: false)
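Beyond the options map, recent releases also accept strategy, sort_order, and where arguments. A sketch of a sort-based rewrite restricted by a filter (column names and predicate are illustrative):
CALL catalog.system.rewrite_data_files(
    table => 'db.sample',
    strategy => 'sort',
    sort_order => 'id DESC NULLS LAST, name ASC',
    where => 'id = 3 AND name = "foo"',
    options => map('min-input-files', '2')
);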

rewrite_manifests

Optimize manifest files for better scan planning:
CALL catalog.system.rewrite_manifests('db.sample');
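An optional second argument, use_caching (default true), controls whether Spark caching is used during the rewrite; disabling it can help on memory-constrained clusters:
CALL catalog.system.rewrite_manifests('db.sample', false);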

rewrite_position_delete_files

Compact position delete files and remove dangling deletes:
CALL catalog.system.rewrite_position_delete_files('db.sample');

Table Migration

snapshot

Create a lightweight copy for testing:
CALL catalog.system.snapshot(
    'db.sample',
    'db.snap',
    '/tmp/temptable/'
);
Snapshot tables share data files with the source table. Use DROP TABLE to clean up when done testing.

migrate

Replace a Hive/Spark table with an Iceberg table:
CALL catalog.system.migrate(
    'spark_catalog.db.sample',
    map('foo', 'bar')
);
Arguments:
  • table (required) - Table to migrate
  • properties - Properties for the new Iceberg table
  • drop_backup - Don’t retain original table (default: false)
  • backup_table_name - Custom backup name (default: the original table name with a _BACKUP_ suffix)

add_files

Add files from external sources:
CALL spark_catalog.system.add_files(
    table => 'db.tbl',
    source_table => 'db.src_tbl',
    partition_filter => map('part_col_1', 'A')
);
Schema is not validated. Adding incompatible files will cause query failures.

register_table

Register an existing metadata file in a catalog:
CALL spark_catalog.system.register_table(
    table => 'db.tbl',
    metadata_file => 'path/to/metadata/file.json'
);
Registering the same metadata in multiple catalogs can cause data loss and corruption.

Change Data Capture

create_changelog_view

Create a view showing table changes:
CALL catalog.system.create_changelog_view(
    table => 'db.tbl',
    options => map(
        'start-snapshot-id', '1',
        'end-snapshot-id', '2'
    )
);
Query the changelog:
SELECT * FROM tbl_changes WHERE _change_type = 'INSERT';
CDC Metadata Columns:
  • _change_type - INSERT, DELETE, UPDATE_BEFORE, UPDATE_AFTER
  • _change_ordinal - Order of changes
  • _commit_snapshot_id - Snapshot where change occurred
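By default, an update appears as a DELETE and an INSERT pair. Recent releases can collapse these into UPDATE_BEFORE and UPDATE_AFTER rows when given the columns that identify a row (column names here are illustrative):
CALL catalog.system.create_changelog_view(
    table => 'db.tbl',
    options => map('start-snapshot-id', '1', 'end-snapshot-id', '2'),
    identifier_columns => array('id', 'name')
);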

Table Statistics

compute_table_stats

Calculate NDV statistics for columns:
CALL catalog.system.compute_table_stats('my_table');
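By default, statistics are computed for all columns; a subset can be requested with a columns argument (column names are illustrative):
CALL catalog.system.compute_table_stats(
    table => 'my_table',
    columns => array('id', 'data')
);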

compute_partition_stats

Compute partition statistics incrementally:
CALL catalog.system.compute_partition_stats('my_table');

Metadata Information

ancestors_of

Report snapshot ancestry:
CALL catalog.system.ancestors_of('db.tbl');
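An optional snapshot_id argument reports the ancestry of a specific snapshot rather than the table's current one (the ID here is illustrative):
CALL catalog.system.ancestors_of(
    table => 'db.tbl',
    snapshot_id => 1
);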

Best Practices

Run maintenance procedures on a schedule:
  • Daily: expire_snapshots for active tables
  • Weekly: rewrite_data_files for frequently updated tables
  • Monthly: remove_orphan_files for all tables
For tables with streaming writes:
  • Use longer trigger intervals (1+ minutes)
  • Regularly run rewrite_data_files to compact small files
  • Run rewrite_manifests to optimize metadata
Always use dry run first:
CALL catalog.system.remove_orphan_files(
    table => 'db.sample',
    dry_run => true
);

Next Steps

Writes

Learn about write operations and distribution

Configuration

Configure Spark for optimal performance

Queries

Query tables and inspect metadata

Structured Streaming

Maintain streaming tables