Overview

Iceberg provides stored procedures for table maintenance and management. Procedures are available when the Iceberg SQL extensions are configured.
In Spark 4.0 and later, procedures are supported natively, but procedure names are case-sensitive.

Using Procedures

Call procedures from any configured catalog using the CALL statement:
CALL catalog_name.system.procedure_name(arguments);

Argument Passing

Arguments can be passed by position or by name; named arguments may appear in any order:
CALL catalog_name.system.rollback_to_snapshot(
    table => 'db.sample',
    snapshot_id => 1234
);
Mixing positional and named arguments in the same call is not supported.
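When arguments are passed positionally, they follow the order the procedure defines. The named-argument call above is equivalent to:
CALL catalog_name.system.rollback_to_snapshot('db.sample', 1234);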

Snapshot Management

rollback_to_snapshot

Roll back a table to a specific snapshot:
CALL catalog.system.rollback_to_snapshot('db.sample', 1);
Arguments:
  • table (required) - Table name
  • snapshot_id (required) - Target snapshot ID

rollback_to_timestamp

Roll back to a snapshot at a specific time:
CALL catalog.system.rollback_to_timestamp(
    'db.sample',
    TIMESTAMP '2021-06-30 00:00:00.000'
);
Arguments:
  • table (required) - Table name
  • timestamp (required) - Target timestamp

set_current_snapshot

Set the current snapshot (not limited to ancestors):
CALL catalog.system.set_current_snapshot('db.sample', 1);
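Recent Iceberg releases also accept a ref argument in place of snapshot_id, setting the current snapshot to the head of a named branch or tag (the branch name here is illustrative):
CALL catalog.system.set_current_snapshot(
    table => 'db.sample',
    ref => 'audit-branch'
);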

cherrypick_snapshot

Apply changes from a snapshot without removing the original:
CALL catalog.system.cherrypick_snapshot('my_table', 1);
Only append and dynamic overwrite snapshots can be cherry-picked.

fast_forward

Fast-forward a branch to another branch’s head:
CALL catalog.system.fast_forward(
    'my_table',
    'main',
    'audit-branch'
);

Metadata Management

expire_snapshots

Remove old snapshots and unreferenced data files:
CALL catalog.system.expire_snapshots('db.sample');
Arguments:
  • table (required) - Table name
  • older_than - Expiration timestamp (default: 5 days ago)
  • retain_last - Minimum snapshots to keep (default: 1)
  • max_concurrent_deletes - Thread pool size for deletions
  • stream_results - Stream results to prevent driver OOM
  • snapshot_ids - Specific snapshot IDs to expire
Output:
  • deleted_data_files_count
  • deleted_position_delete_files_count
  • deleted_equality_delete_files_count
  • deleted_manifest_files_count
  • deleted_manifest_lists_count
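A fuller call combining several of the named arguments listed above (the timestamp and retention values are illustrative):
CALL catalog.system.expire_snapshots(
    table => 'db.sample',
    older_than => TIMESTAMP '2023-06-30 00:00:00.000',
    retain_last => 10,
    stream_results => true
);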

remove_orphan_files

Remove files not referenced in table metadata:
CALL catalog.system.remove_orphan_files(
    table => 'db.sample',
    dry_run => true
);
Arguments:
  • table (required) - Table name
  • older_than - Remove files older than this (default: 3 days ago)
  • location - Specific directory to scan
  • dry_run - Preview without deleting (default: false)
  • max_concurrent_deletes - Thread pool size
  • stream_results - Stream results to prevent OOM
Orphan file removal is irreversible. Always run with dry_run => true first.

rewrite_data_files

Compact small files and optimize data layout:
CALL catalog.system.rewrite_data_files('db.sample');
Common Options:
  • target-file-size-bytes - Target output file size (default: 512 MB)
  • min-file-size-bytes - Files below this are rewritten (default: 75% of target)
  • max-file-size-bytes - Files above this are rewritten (default: 180% of target)
  • min-input-files - Minimum files to trigger rewrite (default: 5)
  • rewrite-all - Force rewrite all files (default: false)
  • remove-dangling-deletes - Remove orphaned delete files (default: false)
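Beyond the options map, recent releases also accept strategy, sort_order, and where arguments. A sketch of a sort-based rewrite restricted by a filter (column names and predicate are illustrative):
CALL catalog.system.rewrite_data_files(
    table => 'db.sample',
    strategy => 'sort',
    sort_order => 'id DESC NULLS LAST, name ASC',
    where => 'id = 3 AND name = "foo"',
    options => map('min-input-files', '2')
);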

rewrite_manifests

Optimize manifest files for better scan planning:
CALL catalog.system.rewrite_manifests('db.sample');
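An optional second argument, use_caching (default true), controls whether Spark caching is used during the rewrite; disabling it can help on memory-constrained clusters:
CALL catalog.system.rewrite_manifests('db.sample', false);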

rewrite_position_delete_files

Compact position delete files and remove dangling deletes:
CALL catalog.system.rewrite_position_delete_files('db.sample');

Table Migration

snapshot

Create a lightweight copy for testing:
CALL catalog.system.snapshot(
    'db.sample',
    'db.snap',
    '/tmp/temptable/'
);
Snapshot tables share data files with the source table. Use DROP TABLE to clean up when done testing.

migrate

Replace a Hive/Spark table with an Iceberg table:
CALL catalog.system.migrate(
    'spark_catalog.db.sample',
    map('foo', 'bar')
);
Arguments:
  • table (required) - Table to migrate
  • properties - Properties for the new Iceberg table
  • drop_backup - Don’t retain original table (default: false)
  • backup_table_name - Custom backup name (default: the original table name with a _BACKUP_ suffix)

add_files

Add files from external sources:
CALL spark_catalog.system.add_files(
    table => 'db.tbl',
    source_table => 'db.src_tbl',
    partition_filter => map('part_col_1', 'A')
);
Schema is not validated. Adding incompatible files will cause query failures.

register_table

Register an existing metadata file in a catalog:
CALL spark_catalog.system.register_table(
    table => 'db.tbl',
    metadata_file => 'path/to/metadata/file.json'
);
Registering the same metadata in multiple catalogs can cause data loss and corruption.

Change Data Capture

create_changelog_view

Create a view showing table changes:
CALL catalog.system.create_changelog_view(
    table => 'db.tbl',
    options => map(
        'start-snapshot-id', '1',
        'end-snapshot-id', '2'
    )
);
Query the changelog:
SELECT * FROM tbl_changes WHERE _change_type = 'INSERT';
CDC Metadata Columns:
  • _change_type - INSERT, DELETE, UPDATE_BEFORE, UPDATE_AFTER
  • _change_ordinal - Order of changes
  • _commit_snapshot_id - Snapshot where change occurred
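By default, an update appears as a DELETE and an INSERT pair. Recent releases can collapse these into UPDATE_BEFORE and UPDATE_AFTER rows when given the columns that identify a row (column names here are illustrative):
CALL catalog.system.create_changelog_view(
    table => 'db.tbl',
    options => map('start-snapshot-id', '1', 'end-snapshot-id', '2'),
    identifier_columns => array('id', 'name')
);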

Table Statistics

compute_table_stats

Calculate NDV statistics for columns:
CALL catalog.system.compute_table_stats('my_table');
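By default, statistics are computed for all columns; a subset can be requested with a columns argument (column names are illustrative):
CALL catalog.system.compute_table_stats(
    table => 'my_table',
    columns => array('id', 'data')
);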

compute_partition_stats

Compute partition statistics incrementally:
CALL catalog.system.compute_partition_stats('my_table');

Metadata Information

ancestors_of

Report snapshot ancestry:
CALL catalog.system.ancestors_of('db.tbl');
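An optional snapshot_id argument reports the ancestry of a specific snapshot rather than the table's current one (the ID here is illustrative):
CALL catalog.system.ancestors_of(
    table => 'db.tbl',
    snapshot_id => 1
);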

Best Practices

Run maintenance procedures on a schedule:
  • Daily: expire_snapshots for active tables
  • Weekly: rewrite_data_files for frequently updated tables
  • Monthly: remove_orphan_files for all tables
For tables with streaming writes:
  • Use longer trigger intervals (1+ minutes)
  • Regularly run rewrite_data_files to compact small files
  • Run rewrite_manifests to optimize metadata
Always use dry run first:
CALL catalog.system.remove_orphan_files(
    table => 'db.sample',
    dry_run => true
);

Next Steps

Writes

Learn about write operations and distribution

Configuration

Configure Spark for optimal performance

Queries

Query tables and inspect metadata

Structured Streaming

Maintain streaming tables