DeleteOrphanFiles
The DeleteOrphanFiles action identifies and deletes orphan files: files under a table location that are not reachable from any valid snapshot. Running it reclaims storage left behind by failed writes and other operations that never committed.
Interface
public interface DeleteOrphanFiles extends Action<DeleteOrphanFiles, DeleteOrphanFiles.Result>
Overview
Orphan files can accumulate in a table for several reasons:
- Failed write operations that didn’t commit
- Interrupted jobs that wrote data but didn’t create snapshots
- Files from unsuccessful transactions
- Leftover files from testing or development
The DeleteOrphanFiles action:
- Lists all files in table storage
- Identifies files not referenced by any snapshot
- Safely deletes files older than a safety threshold
- Can process both data files and metadata files
This operation lists all files in the table location and is expensive for large tables. Use with caution.
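The steps above amount to a set difference with an age filter. The following is a minimal sketch of that idea using only the JDK; it is not Iceberg's implementation. The class name, the method name, and the in-memory map of path to last-modified time are assumptions made for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class OrphanSketch {
  /**
   * Returns paths that were listed in storage, are absent from the referenced set,
   * and were last modified strictly before the cutoff (epoch milliseconds).
   */
  static Set<String> findOrphans(
      Map<String, Long> listedFiles, Set<String> referenced, long olderThanMillis) {
    Set<String> orphans = new TreeSet<>();
    for (Map.Entry<String, Long> file : listedFiles.entrySet()) {
      boolean unreferenced = !referenced.contains(file.getKey());
      boolean oldEnough = file.getValue() < olderThanMillis;
      if (unreferenced && oldEnough) {
        orphans.add(file.getKey());
      }
    }
    return orphans;
  }

  public static void main(String[] args) {
    // Hypothetical listing: path -> last-modified time in epoch milliseconds
    Map<String, Long> listed = Map.of(
        "data/a.parquet", 1_000L,
        "data/b.parquet", 1_000L,
        "data/c.parquet", 9_000L);
    Set<String> referenced = Set.of("data/a.parquet");
    // c.parquet is unreferenced but newer than the cutoff, so it is kept
    System.out.println(findOrphans(listed, referenced, 5_000L)); // [data/b.parquet]
  }
}
```

The age filter is what protects files written by jobs that have not committed yet: they are unreferenced, but too recent to be treated as orphans.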
Methods
location
Specify a location to scan for orphan files.
DeleteOrphanFiles location(String location)
Parameters:
location - The path to scan for orphan files
Returns: this for method chaining
Example:
// Scan a specific data directory
action.location("s3://my-bucket/warehouse/db/table/data");
If not set, the root table location will be scanned, potentially removing both orphan data and metadata files.
olderThan
Only delete files older than the specified timestamp.
DeleteOrphanFiles olderThan(long olderThanTimestamp)
Parameters:
olderThanTimestamp - Timestamp in epoch milliseconds, the same unit returned by System.currentTimeMillis()
Returns: this for method chaining
Example:
// Only delete files older than 7 days
long sevenDaysAgo = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
action.olderThan(sevenDaysAgo);
Defaults to 3 days ago if not specified. This safety measure prevents deleting files from concurrent operations.
Never use a very recent timestamp. Always allow sufficient time for concurrent operations to complete.
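One way to enforce this advice is to compute the cutoff through a small helper that rejects thresholds that are too recent. The sketch below assumes a hypothetical helper class and a hypothetical one-day floor; neither is part of Iceberg, and the floor should be tuned to the longest-running writer on the table.

```java
import java.time.Duration;
import java.time.Instant;

final class OlderThanSketch {
  // Hypothetical floor chosen for illustration; not an Iceberg constant.
  static final Duration MIN_AGE = Duration.ofDays(1);

  /** Returns (now - age) as epoch milliseconds, rejecting ages shorter than MIN_AGE. */
  static long cutoffMillis(Instant now, Duration age) {
    if (age.compareTo(MIN_AGE) < 0) {
      throw new IllegalArgumentException("age " + age + " is shorter than " + MIN_AGE);
    }
    return now.minus(age).toEpochMilli();
  }

  public static void main(String[] args) {
    long cutoff = cutoffMillis(Instant.now(), Duration.ofDays(7));
    System.out.println("olderThan cutoff: " + Instant.ofEpochMilli(cutoff));
    // The value would then be passed to the action as action.olderThan(cutoff).
  }
}
```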
deleteWith
Provide a custom delete function.
DeleteOrphanFiles deleteWith(Consumer<String> deleteFunc)
Parameters:
deleteFunc - A function that accepts file paths to delete
Returns: this for method chaining
Example:
// Collect orphan files instead of deleting
Set<String> orphans = new HashSet<>();
action.deleteWith(orphans::add);
Use a custom delete function to preview orphan files before actually deleting them.
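If the delete function may be invoked from more than one thread (for example when an executor service is also supplied), a plain HashSet or ArrayList is not a safe sink. The following sketch shows a thread-safe record-only function; the class and method names are hypothetical, and the parallel stream merely stands in for the action calling the function concurrently.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.stream.IntStream;

final class PreviewCollectorSketch {
  /** Creates a set that tolerates concurrent add calls. */
  static Set<String> newSink() {
    return ConcurrentHashMap.newKeySet();
  }

  /** Returns a function that records each path instead of deleting it. */
  static Consumer<String> recordOnly(Set<String> sink) {
    return sink::add;
  }

  public static void main(String[] args) {
    Set<String> orphans = newSink();
    Consumer<String> deleteFunc = recordOnly(orphans);

    // Stand-in for the action invoking the delete function from several threads
    IntStream.range(0, 1_000).parallel()
        .forEach(i -> deleteFunc.accept("data/file-" + i + ".parquet"));

    System.out.println(orphans.size()); // 1000
  }
}
```

With Iceberg on the classpath, the same function would be passed as action.deleteWith(recordOnly(orphans)).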
executeDeleteWith
Provide an executor service for parallel deletion.
DeleteOrphanFiles executeDeleteWith(ExecutorService executorService)
Parameters:
executorService - The executor service for parallel deletes
Returns: this for method chaining
The executor service is only used if a custom delete function is provided or the FileIO doesn’t support bulk deletes.
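A minimal sketch of preparing such an executor follows. The pool size, thread naming, and the deleteAll helper are assumptions for illustration; deleteAll only imitates how paths could be fanned out to a delete function and is not how the action schedules work internally. The sketch also assumes the caller stays responsible for shutting the pool down.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class DeleteExecutorSketch {
  /** Creates a fixed pool of named daemon threads, so idle workers never block JVM exit. */
  static ExecutorService newDeletePool(int threads) {
    AtomicInteger id = new AtomicInteger();
    return Executors.newFixedThreadPool(threads, runnable -> {
      Thread thread = new Thread(runnable, "orphan-delete-" + id.getAndIncrement());
      thread.setDaemon(true);
      return thread;
    });
  }

  /** Hypothetical stand-in: runs deleteFunc for every path on the pool and waits for completion. */
  static void deleteAll(ExecutorService pool, List<String> paths, Consumer<String> deleteFunc) {
    CompletableFuture<?>[] tasks = paths.stream()
        .map(path -> CompletableFuture.runAsync(() -> deleteFunc.accept(path), pool))
        .toArray(CompletableFuture[]::new);
    CompletableFuture.allOf(tasks).join();
  }

  public static void main(String[] args) {
    ExecutorService pool = newDeletePool(4);
    try {
      // With Iceberg on the classpath, the pool would instead be passed to
      // action.executeDeleteWith(pool) before calling execute().
      deleteAll(pool, List.of("data/a.parquet", "data/b.parquet"),
          path -> System.out.println(Thread.currentThread().getName() + " -> " + path));
    } finally {
      pool.shutdown(); // the caller created the pool, so the caller shuts it down
    }
  }
}
```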
prefixMismatchMode
Control how to handle files with mismatched authority/scheme.
DeleteOrphanFiles prefixMismatchMode(PrefixMismatchMode newPrefixMismatchMode)
Parameters:
newPrefixMismatchMode - Mode for handling prefix mismatches
Returns: this for method chaining
Modes:
ERROR (default) - Throw an exception on mismatch
IGNORE - Skip files with mismatches
DELETE - Consider mismatched files as orphans
Example:
action.prefixMismatchMode(PrefixMismatchMode.IGNORE);
Use DELETE mode only after manually verifying all mismatches. Deleted files cannot be recovered.
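The three modes can be read as a decision made for each listed file whose path matches a referenced file but whose scheme or authority differs. The sketch below illustrates that decision with a local enum and a hypothetical method; it is not Iceberg's code, and the exception type is a stand-in.

```java
final class MismatchModeSketch {
  // Local stand-in for illustration; the real enum is Iceberg's PrefixMismatchMode.
  enum Mode { ERROR, IGNORE, DELETE }

  /**
   * Decides whether a listed file with a mismatched prefix is treated as an orphan.
   * ERROR aborts, IGNORE keeps the file, DELETE marks it for deletion.
   */
  static boolean treatAsOrphan(Mode mode, String listedLocation, String referencedLocation) {
    switch (mode) {
      case IGNORE:
        return false;
      case DELETE:
        return true;
      case ERROR:
      default:
        throw new IllegalStateException(
            "Prefix mismatch: " + listedLocation + " vs " + referencedLocation);
    }
  }

  public static void main(String[] args) {
    String listed = "s3a://my-bucket/warehouse/db/table/data/f.parquet";
    String referenced = "s3://my-bucket/warehouse/db/table/data/f.parquet";
    System.out.println(treatAsOrphan(Mode.IGNORE, listed, referenced)); // false
    System.out.println(treatAsOrphan(Mode.DELETE, listed, referenced)); // true
  }
}
```

In this example the two locations very likely name the same object, which is why DELETE is risky until the mismatch has been explained or declared equivalent.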
equalSchemes
Define schemes that should be considered equivalent.
DeleteOrphanFiles equalSchemes(Map<String, String> newEqualSchemes)
Parameters:
newEqualSchemes - Map of equivalent scheme groups
Returns: this for method chaining
Example:
// Treat s3, s3a, and s3n as equivalent
action.equalSchemes(Map.of("s3a,s3n", "s3"));
equalAuthorities
Define authorities that should be considered equivalent.
DeleteOrphanFiles equalAuthorities(Map<String, String> newEqualAuthorities)
Parameters:
newEqualAuthorities - Map of equivalent authority groups
Returns: this for method chaining
Example:
// Treat different service names as equivalent
action.equalAuthorities(Map.of("old-service,legacy-service", "new-service"));
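To make the effect of equalSchemes and equalAuthorities concrete, the sketch below expands the comma-separated keys into one entry per alias and rewrites a location to its canonical prefix before comparison. This is an illustration using java.net.URI, not Iceberg's implementation; the class and method names are hypothetical, and locations are assumed to be absolute URIs with both a scheme and an authority.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

final class PrefixNormalizationSketch {
  /** Expands comma-separated keys: {"s3a,s3n": "s3"} becomes {"s3a": "s3", "s3n": "s3"}. */
  static Map<String, String> flatten(Map<String, String> grouped) {
    Map<String, String> flat = new HashMap<>();
    grouped.forEach((keys, canonical) -> {
      for (String key : keys.split(",")) {
        flat.put(key.trim(), canonical.trim());
      }
    });
    return flat;
  }

  /** Rewrites scheme and authority to their canonical forms; the path is left untouched. */
  static String normalize(
      String location, Map<String, String> schemes, Map<String, String> authorities) {
    URI uri = URI.create(location);
    String scheme = schemes.getOrDefault(uri.getScheme(), uri.getScheme());
    String authority = authorities.getOrDefault(uri.getAuthority(), uri.getAuthority());
    return scheme + "://" + authority + uri.getPath();
  }

  public static void main(String[] args) {
    Map<String, String> schemes = flatten(Map.of("s3a,s3n", "s3"));
    Map<String, String> authorities =
        flatten(Map.of("old-service,legacy-service", "new-service"));
    System.out.println(
        normalize("s3a://old-service/db/table/data/f.parquet", schemes, authorities));
    // s3://new-service/db/table/data/f.parquet
  }
}
```

Two locations that normalize to the same string are treated as the same file in this sketch, so no prefix mismatch is reported for them.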
Result
The Result interface provides information about deleted files.
Methods
interface Result {
  Iterable<String> orphanFileLocations();
  long orphanFilesCount();
}
orphanFileLocations()
Returns the paths of all deleted orphan files.
orphanFilesCount()
Returns the total number of orphan files deleted.
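Because orphanFileLocations() is a plain Iterable<String>, it can be summarized with ordinary JDK code. The following sketch counts locations per parent directory, which can help spot where orphans accumulate; the helper class is hypothetical and the list in main stands in for a real result.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class OrphanReportSketch {
  /** Counts locations per parent directory (everything before the last '/'). */
  static Map<String, Long> countByDirectory(Iterable<String> locations) {
    Map<String, Long> counts = new TreeMap<>();
    for (String location : locations) {
      int slash = location.lastIndexOf('/');
      String directory = slash < 0 ? "" : location.substring(0, slash);
      counts.merge(directory, 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Stand-in for result.orphanFileLocations()
    List<String> locations = List.of(
        "s3://my-bucket/warehouse/db/table/data/a.parquet",
        "s3://my-bucket/warehouse/db/table/data/b.parquet",
        "s3://my-bucket/warehouse/db/table/metadata/x.avro");
    countByDirectory(locations).forEach((dir, n) -> System.out.println(n + "  " + dir));
  }
}
```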
Usage Examples
Basic Orphan File Deletion
// Delete orphan files older than default (3 days)
DeleteOrphanFiles.Result result = actions
    .deleteOrphanFiles(table)
    .execute();
System.out.println("Deleted " + result.orphanFilesCount() + " orphan files");
Custom Time Threshold
// Delete orphan files older than 7 days
long sevenDaysAgo = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
DeleteOrphanFiles.Result result = actions
    .deleteOrphanFiles(table)
    .olderThan(sevenDaysAgo)
    .execute();
Preview Mode
// Preview orphan files without deleting
List<String> orphanFiles = new ArrayList<>();
DeleteOrphanFiles.Result result = actions
    .deleteOrphanFiles(table)
    .deleteWith(orphanFiles::add)
    .execute();
System.out.println("Found " + orphanFiles.size() + " orphan files:");
orphanFiles.forEach(System.out::println);
Specific Location
// Delete orphans from a specific data directory
DeleteOrphanFiles.Result result = actions
    .deleteOrphanFiles(table)
    .location("s3://my-bucket/warehouse/db/table/data/year=2023")
    .olderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(14))
    .execute();
Handle Scheme Mismatches
// Handle different S3 schemes
DeleteOrphanFiles.Result result = actions
    .deleteOrphanFiles(table)
    .equalSchemes(Map.of("s3a,s3n", "s3"))
    .prefixMismatchMode(PrefixMismatchMode.IGNORE)
    .execute();
System.out.println("Deleted " + result.orphanFilesCount() + " files");
With Progress Tracking
// Track deletion progress
AtomicInteger deletedCount = new AtomicInteger(0);
DeleteOrphanFiles.Result result = actions
    .deleteOrphanFiles(table)
    .deleteWith(path -> {
      // Delete first so the counter only reflects completed deletes
      table.io().deleteFile(path);
      int count = deletedCount.incrementAndGet();
      if (count % 100 == 0) {
        System.out.println("Deleted " + count + " files...");
      }
    })
    .execute();
System.out.println("Total deleted: " + deletedCount.get());
Safety Considerations
Always follow these safety practices:
- Use appropriate time thresholds: Never delete recently written files
- Test in preview mode first: Use a custom delete function to review files
- Understand concurrent operations: Ensure no writes are in progress
- Handle scheme mismatches carefully: Use equalSchemes and equalAuthorities appropriately
- Monitor execution: Track deleted files for verification
Best Practices
- Run during maintenance windows: Minimize concurrent activity
- Use conservative time thresholds: 7+ days for production tables
- Preview before deleting: Always run in preview mode first
- Schedule regular cleanup: Run periodically to prevent accumulation
- Monitor storage savings: Track the result to measure impact
- Document scheme equivalences: Maintain a record of equal schemes/authorities
Costs
- Lists all files in the specified location (expensive for large tables)
- Requires reading table metadata
- May require multiple API calls to cloud storage
Optimization Tips
- Use location() to limit scope to specific directories
- Run during off-peak hours
- For very large tables, consider parallel deletion with executeDeleteWith
- Use bulk delete APIs when available
When to Run
Run DeleteOrphanFiles in the following situations:
- After failed operations: Jobs that crashed or were cancelled
- Storage costs are high: Significant orphan file accumulation
- After major migrations: Moving or restructuring tables
- During maintenance: Regular cleanup schedules
- Before decommissioning: Final cleanup before table removal