Apache Hive tables stored in the ORC, Parquet, or Avro file formats can be migrated to Iceberg. When migrating data to an Iceberg table, which provides versioning and transactional updates, only the most recent data files need to be migrated.

Overview

Iceberg supports three migration actions for moving from Hive tables to Iceberg tables:
  • Snapshot Table
  • Migrate Table
  • Add Files
Since Hive tables do not maintain snapshots, the migration process essentially involves creating a new Iceberg table with the existing schema and committing all data files across all partitions to the new Iceberg table.
After the initial migration, any new data files are added to the new Iceberg table using the Add Files action.

Enabling Migration from Hive to Iceberg

The Hive table migration actions are supported by the Spark Integration module via Spark Procedures. The procedures are bundled in the Spark runtime jar, which is available in the Iceberg Release Downloads.
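Before calling the procedures, Spark must be started with the Iceberg runtime on the classpath and a Hive-backed session catalog configured. A minimal sketch of such a launch is below; the runtime, Spark, and Scala versions shown (1.5.2, 3.5, 2.12) are assumptions and should be matched to your deployment:

```shell
# Launch spark-sql with the Iceberg Spark runtime and the Iceberg SQL extensions.
# Pointing spark_catalog at SparkSessionCatalog with type=hive lets the same
# catalog resolve both Hive and Iceberg tables, which the migration procedures need.
# NOTE: the artifact version below is an example -- pick the one matching your Spark.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive
```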

Snapshot Hive Table to Iceberg

To snapshot a Hive table, users can run the following Spark SQL:
CALL catalog_name.system.snapshot('db.source', 'db.dest')
The snapshot action creates a new table with a different name, leaving the source Hive table unchanged. This allows for gradual migration with minimal disruption.

Example

-- Snapshot a Hive table to a new Iceberg table
CALL spark_catalog.system.snapshot('analytics.events', 'analytics.events_iceberg');
See Spark Procedure: snapshot for more details.
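The snapshot procedure also accepts an optional third argument giving the new table's location, which is useful when the snapshot should live outside the default warehouse path. A sketch, with an illustrative path:

```sql
-- Snapshot into a table rooted at a custom location (the path is an example)
CALL spark_catalog.system.snapshot('analytics.events', 'analytics.events_iceberg', '/tmp/events_iceberg/');
```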

Migrate Hive Table To Iceberg

To migrate a Hive table to Iceberg in-place, users can run the following Spark SQL:
CALL catalog_name.system.migrate('db.sample')
The migrate action replaces the source table with an Iceberg table of the same name. All writers must be stopped before running this command.

Example

-- Migrate a Hive table to Iceberg in-place
CALL spark_catalog.system.migrate('analytics.events');
See Spark Procedure: migrate for more details.
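The migrate procedure can also set table properties on the resulting Iceberg table through an optional map argument. A sketch, where the property name and value are illustrative:

```sql
-- Migrate in-place and attach a table property to the new Iceberg table
CALL spark_catalog.system.migrate('analytics.events', map('migrated-by', 'data-platform'));
```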

Add Files From Hive Table to Iceberg

To add data files from a Hive table to a given Iceberg table, users can run the following Spark SQL:
CALL spark_catalog.system.add_files(
  table => 'db.tbl',
  source_table => 'db.src_tbl'
)
This is useful for catching up files that were added to the Hive table after the initial migration.

Example

-- Add new files from Hive table to existing Iceberg table
CALL spark_catalog.system.add_files(
  table => 'analytics.events_iceberg',
  source_table => 'analytics.events'
);
See Spark Procedure: add_files for more details.
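For partitioned source tables, add_files accepts an optional partition_filter map so that only files from matching partitions are imported. A sketch, where the partition column and value are examples:

```sql
-- Import only the files belonging to a single source partition
CALL spark_catalog.system.add_files(
  table => 'analytics.events_iceberg',
  source_table => 'analytics.events',
  partition_filter => map('event_date', '2024-01-01')
);
```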

Migration Workflow

Here’s a typical workflow for migrating Hive tables to Iceberg:
1. Prepare for migration

Identify the Hive tables to migrate and ensure you have the necessary permissions and access to both the source and destination catalogs.
2. Choose migration strategy

Decide between:
  • Snapshot: Create a new table with a different name (safer, allows gradual migration)
  • Migrate: Replace the existing table in-place (requires downtime)
3. Run migration

Execute the appropriate Spark procedure:
-- For snapshot
CALL spark_catalog.system.snapshot('db.source', 'db.dest');

-- OR for in-place migration
CALL spark_catalog.system.migrate('db.table');
4. Verify migration

Query both tables to verify the data was migrated correctly:
-- Compare row counts
SELECT COUNT(*) FROM db.source;
SELECT COUNT(*) FROM db.dest;

-- Verify schema
DESCRIBE EXTENDED db.dest;
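Beyond row counts and schema, Iceberg's metadata tables can confirm that the data files were actually committed. A sketch using Spark SQL against the new table:

```sql
-- Inspect the commit history of the new Iceberg table
SELECT * FROM db.dest.snapshots;

-- List the data files now tracked by the table
SELECT * FROM db.dest.files;
```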
5. Handle incremental updates

If new data is added to the Hive table after migration, use the add_files procedure:
CALL spark_catalog.system.add_files(
  table => 'db.dest',
  source_table => 'db.source'
);
6. Switch workloads

Gradually switch read and write workloads to the new Iceberg table.

Supported File Formats

Iceberg supports migrating Hive tables with the following file formats:

  • ORC – Optimized Row Columnar format
  • Parquet – Apache Parquet columnar format
  • Avro – Apache Avro row-based format

Best Practices

  • Before migrating production tables, test the migration process on a copy or subset of your data to ensure everything works as expected.
  • Large tables with many partitions may take longer to migrate. Monitor the migration progress and plan accordingly.
  • For production tables with active workloads, use the snapshot approach to minimize disruption. This allows you to verify the migration before switching over.
  • After successful migration and verification, clean up the old Hive tables to free up storage space. Ensure you have backups before deletion.