Overview
Iceberg supports all three migration actions for migrating from Hive tables to Iceberg tables:
- Snapshot Table
- Migrate Table
- Add Files
Since Hive tables do not maintain snapshots, the migration process essentially involves creating a new Iceberg table with the existing schema and committing all data files across all partitions to the new Iceberg table.
Enabling Migration from Hive to Iceberg
The Hive table migration actions are supported by the Spark Integration module via Spark Procedures. The procedures are bundled in the Spark runtime jar, which is available in the Iceberg Release Downloads.
Snapshot Hive Table to Iceberg
To snapshot a Hive table, call the snapshot procedure from Spark SQL. The source Hive table is left unchanged.
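A minimal sketch of the snapshot call. The catalog name (catalog_name) and the table names (db.sample, db.sample_iceberg) are placeholders rather than values from this guide; substitute the Iceberg catalog configured in your Spark session.

```sql
-- Create an Iceberg table that references the Hive table's existing data files.
-- The source table db.sample is not modified.
CALL catalog_name.system.snapshot(
  source_table => 'db.sample',
  table => 'db.sample_iceberg'
);
```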
Migrate Hive Table To Iceberg
To migrate a Hive table to Iceberg in place, call the migrate procedure from Spark SQL.
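A minimal sketch of the migrate call, again with placeholder names (catalog_name, db.sample). Because the table is replaced under its existing name, writes to the Hive table should be paused while the procedure runs.

```sql
-- Replace the Hive table db.sample with an Iceberg table of the same name.
CALL catalog_name.system.migrate(table => 'db.sample');
```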
Add Files From Hive Table to Iceberg
To add data files from a Hive table to a given Iceberg table, call the add_files procedure from Spark SQL. This is useful for catching up on files that were added to the Hive table after the initial migration.
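A sketch of two alternative add_files calls. All names are placeholders: catalog_name is the Iceberg catalog, db.target_iceberg is an existing Iceberg table, db.sample is the Hive source, and the partition column ds with its value is hypothetical.

```sql
-- Option 1: register every data file of the Hive table with the Iceberg table.
CALL catalog_name.system.add_files(
  table => 'db.target_iceberg',
  source_table => 'db.sample'
);

-- Option 2: register only the files of a single partition, for example a day
-- that was written to the Hive table after the initial migration.
CALL catalog_name.system.add_files(
  table => 'db.target_iceberg',
  source_table => 'db.sample',
  partition_filter => map('ds', '2024-01-01')
);
```

The two calls are alternatives, not a sequence: adding the same files twice would register them twice or be rejected by the procedure's duplicate check, depending on configuration.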
Migration Workflow
Here’s a typical workflow for migrating Hive tables to Iceberg:
Prepare for migration
Identify the Hive tables to migrate and ensure you have the necessary permissions and access to both the source and destination catalogs.
Choose migration strategy
Decide between:
- Snapshot: Create a new table with a different name (safer, allows gradual migration)
- Migrate: Replace the existing table in-place (requires downtime)
Handle incremental updates
If new data is added to the Hive table after migration, use the add_files procedure to add the new data files to the Iceberg table.
Supported File Formats
Iceberg supports migrating Hive tables with the following file formats:
- ORC: Optimized Row Columnar format
- Parquet: Apache Parquet columnar format
- Avro: Apache Avro row-based format
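One way to check the file format before and after a migration, sketched with placeholder names (catalog_name, db.sample, db.sample_iceberg). DESCRIBE FORMATTED is standard Spark SQL, and the files metadata table is exposed by Iceberg for each table.

```sql
-- Before migration: the InputFormat / SerDe rows of the output show how the
-- Hive table stores its data.
DESCRIBE FORMATTED db.sample;

-- After migration: count the Iceberg table's data files by format.
SELECT file_format, COUNT(*) AS data_files
FROM catalog_name.db.sample_iceberg.files
GROUP BY file_format;
```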
Best Practices
Test migration on a copy first
Before migrating production tables, test the migration process on a copy or subset of your data to ensure everything works as expected.
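A possible rehearsal, assuming a Spark session with Hive support enabled. The scratch table db.sample_trial, the partition column ds, and its value are hypothetical; catalog_name and db.sample are placeholders.

```sql
-- Copy a slice of the production table into a scratch Hive table.
CREATE TABLE db.sample_trial STORED AS PARQUET AS
SELECT * FROM db.sample WHERE ds = '2024-01-01';

-- Rehearse the in-place migration on the scratch table only.
CALL catalog_name.system.migrate(table => 'db.sample_trial');
```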
Monitor table size and partition count
Large tables with many partitions may take longer to migrate. Monitor the migration progress and plan accordingly.
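A sketch of two queries that help size a migration, using placeholder names (catalog_name, db.sample, db.sample_iceberg). SHOW PARTITIONS applies to partitioned tables only.

```sql
-- Before migration: list the partitions of the Hive table.
SHOW PARTITIONS db.sample;

-- After migration: per-partition record and file counts from the Iceberg
-- partitions metadata table.
SELECT * FROM catalog_name.db.sample_iceberg.partitions;
```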
Use snapshot for gradual migration
For production tables with active workloads, use the snapshot approach to minimize disruption. This allows you to verify the migration before switching over.
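A simple verification sketch, assuming a snapshot table named db.sample_iceberg was created from db.sample (both names are placeholders) and that both resolve in the current catalog. Matching row counts are a necessary check, not a complete one.

```sql
-- Row count of the Hive source table.
SELECT COUNT(*) AS hive_rows FROM db.sample;

-- Row count of the Iceberg snapshot table; this should match the source
-- if nothing was written to the Hive table in between.
SELECT COUNT(*) AS iceberg_rows FROM db.sample_iceberg;
```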
Clean up old Hive tables
After successful migration and verification, clean up the old Hive tables to free up storage space. Ensure you have backups before deletion.