RewriteDataFiles
The RewriteDataFiles action rewrites data files according to a rewrite strategy to optimize file sizing and layout within a table. This is commonly used for compaction and data optimization.
Interface
Overview
Rewriting data files helps maintain optimal query performance by:
- Combining small files into larger ones (compaction)
- Splitting large files into smaller ones
- Reorganizing data by sort order or Z-order
- Aligning data with new partition specifications
Configuration Options
Partial Progress
partial-progress.enabled (default: false)
Enable committing groups of files before the entire rewrite completes.
partial-progress.max-commits (default: 10)
Maximum number of commits when partial progress is enabled.
partial-progress.max-failed-commits
Maximum number of failed commits tolerated before the rewrite fails. Applies only when partial progress is enabled.
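The interaction between file groups and partial-progress.max-commits can be sketched as follows. This is a minimal illustration, not Iceberg's code: it assumes the number of groups per commit is the total group count divided by the maximum number of commits, rounded up. The class and method names (`PartialProgressSketch`, `batch`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PartialProgressSketch {

    // Splits file groups into commit batches of ceil(total / maxCommits) groups each.
    // Assumption: batches are filled in planning order; the last one may be smaller.
    static <T> List<List<T>> batch(List<T> groups, int maxCommits) {
        if (maxCommits < 1) {
            throw new IllegalArgumentException("maxCommits must be at least 1");
        }
        int groupsPerCommit = (groups.size() + maxCommits - 1) / maxCommits;
        List<List<T>> commits = new ArrayList<>();
        for (int i = 0; i < groups.size(); i += groupsPerCommit) {
            int end = Math.min(i + groupsPerCommit, groups.size());
            commits.add(new ArrayList<>(groups.subList(i, end)));
        }
        return commits;
    }

    public static void main(String[] args) {
        List<String> groups = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            groups.add("group-" + i);
        }
        List<List<String>> commits = batch(groups, 10);
        System.out.println(commits.size() + " commits, first has "
            + commits.get(0).size() + " groups");
    }
}
```

Under this assumption, 25 file groups with a limit of 10 commits produce 9 commits of at most 3 groups each, so a failure late in the job loses only the groups that have not yet been committed.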
File Grouping
max-file-group-size-bytes (default: 100 GB)
Maximum data size to compact in a single file group. Helps prevent resource exhaustion when rewriting large partitions.
max-concurrent-file-group-rewrites (default: 5)
Maximum number of file groups to rewrite simultaneously.
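To make the effect of max-file-group-size-bytes concrete, the sketch below greedily packs a partition's files into groups whose total size stays within a limit. It is a simplified illustration under stated assumptions; the actual planner may group files differently, and `FileGroupingSketch` and `group` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class FileGroupingSketch {

    // Greedily packs file sizes (in bytes) into groups whose total stays within
    // maxGroupBytes. A single file larger than the limit gets a group of its own.
    static List<List<Long>> group(List<Long> fileSizes, long maxGroupBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current group if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxGroupBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Long> sizes = List.of(40L, 40L, 40L, 100L, 10L);
        System.out.println(group(sizes, 100L));
    }
}
```

Each resulting group is an independent unit of work, so a smaller limit yields more, cheaper groups, and max-concurrent-file-group-rewrites then bounds how many of them run at once.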
File Sizing
target-file-size-bytes
Target output file size. Defaults to the table’s write.target-file-size-bytes property.
Advanced Options
use-starting-sequence-number (default: true)
Use the sequence number from compaction start time instead of the new snapshot’s sequence number. This avoids conflicts with newer equality deletes.
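The reasoning behind this option can be shown with a small sketch. It relies on the Iceberg table spec rule that an equality delete applies to a data file only when the data file's sequence number is strictly lower than the delete file's. The concrete sequence numbers and the class name `SequenceNumberSketch` are illustrative assumptions.

```java
class SequenceNumberSketch {

    // An equality delete applies to a data file only when the data file's
    // sequence number is strictly lower than the delete file's sequence number.
    static boolean equalityDeleteApplies(long dataFileSeq, long deleteFileSeq) {
        return dataFileSeq < deleteFileSeq;
    }

    public static void main(String[] args) {
        long compactionStartSeq = 5;   // sequence number when the rewrite was planned
        long concurrentDeleteSeq = 6;  // equality delete committed while the rewrite ran
        long newSnapshotSeq = 7;       // sequence number of the rewrite's own commit

        // Rewritten file stamped with the new snapshot's number: the delete no longer applies.
        System.out.println(equalityDeleteApplies(newSnapshotSeq, concurrentDeleteSeq));     // false
        // Rewritten file stamped with the starting number: the delete still applies.
        System.out.println(equalityDeleteApplies(compactionStartSeq, concurrentDeleteSeq)); // true
    }
}
```

Keeping the starting sequence number means deletes written during the rewrite still cover the rewritten rows, which is why the rewrite does not have to conflict with them.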
remove-dangling-deletes (default: false)
Remove delete files that don’t apply to any live data files after compaction.
rewrite-job-order (default: none)
Order for processing file groups: bytes-asc, bytes-desc, files-asc, files-desc, or none.
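The five rewrite-job-order values can be read as comparators over planned file groups. The sketch below is one way to express that mapping; the `Group` record is a hypothetical stand-in for a planned file group, and the code is not taken from Iceberg.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class JobOrderSketch {

    // Hypothetical stand-in for a planned file group: total bytes and number of files.
    record Group(String name, long bytes, int files) {}

    // Maps a rewrite-job-order value to a comparator; null means "keep planning order".
    static Comparator<Group> comparatorFor(String order) {
        switch (order) {
            case "bytes-asc":  return Comparator.comparingLong(Group::bytes);
            case "bytes-desc": return Comparator.comparingLong(Group::bytes).reversed();
            case "files-asc":  return Comparator.comparingInt(Group::files);
            case "files-desc": return Comparator.comparingInt(Group::files).reversed();
            case "none":       return null;
            default:
                throw new IllegalArgumentException("Unknown rewrite-job-order: " + order);
        }
    }

    // Returns a copy of the groups in the order they would be processed.
    static List<Group> ordered(List<Group> groups, String order) {
        List<Group> result = new ArrayList<>(groups);
        Comparator<Group> cmp = comparatorFor(order);
        if (cmp != null) {
            result.sort(cmp);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Group> groups = List.of(
            new Group("a", 300, 2), new Group("b", 100, 5), new Group("c", 200, 1));
        System.out.println(ordered(groups, "bytes-asc"));
    }
}
```

Ordering matters mostly together with partial progress: processing the largest groups or the groups with the most files first commits the most impactful rewrites early.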
output-spec-id
Partition specification ID for rewritten files. Used to reorganize data with a new partitioning scheme.
Methods
Strategy Selection
binPack
Use the BINPACK strategy to combine small files.
sort
Use the SORT strategy with the table’s sort order.
sort (with custom order)
Use the SORT strategy with a custom sort order.
- sortOrder - Custom sort order to use
zOrder
Use the Z-ORDER strategy for multi-dimensional clustering.
- columns - Column names to use for Z-ordering
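The idea behind Z-ordering is to interleave the bits of several column values into a single sort key, so that rows close together in every dimension tend to sort near each other. The sketch below shows the textbook two-column case (a Morton code) for integer values. It illustrates the technique only and is not Iceberg's implementation, which has to handle many column types; `ZOrderSketch` and `interleave` are hypothetical names.

```java
class ZOrderSketch {

    // Interleaves the bits of two 32-bit values into one 64-bit Z-value:
    // bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
    static long interleave(int x, int y) {
        long z = 0;
        for (int i = 0; i < 32; i++) {
            z |= (((long) (x >>> i)) & 1L) << (2 * i);
            z |= (((long) (y >>> i)) & 1L) << (2 * i + 1);
        }
        return z;
    }

    public static void main(String[] args) {
        System.out.println(interleave(5, 3)); // 27
        System.out.println(interleave(2, 1)); // 6
    }
}
```

Sorting rows by such a key clusters them on both columns at once, which is why Z-ordering suits queries that filter on more than one column.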
Filtering
filter
Filter which files to consider for rewriting.
- expression - Iceberg expression to filter files
Result
The Result interface provides statistics about the rewrite operation.
Methods
FileGroupRewriteResult
Usage Examples
Basic Compaction
Sort Optimization
Z-Order Clustering
Partition-Specific Rewrite
With Partial Progress
Remove Dangling Deletes
Best Practices
- Choose the right strategy:
  - Use BINPACK for general compaction
  - Use SORT when query patterns favor sorted data
  - Use Z-ORDER for multi-dimensional range queries
- Set appropriate file sizes: Target sizes between 128 MB and 1 GB depending on your workload
- Use filters for incremental optimization: Rewrite only problematic partitions
- Enable partial progress for large tables: Prevents losing all work on failure
- Monitor resource usage: Adjust concurrency settings based on cluster capacity
Related
- RewriteManifests - Optimize manifest file organization
- ExpireSnapshots - Clean up old snapshots after rewriting
- DeleteOrphanFiles - Remove unreferenced files