The Transforms class provides factory methods for creating partition transform functions in Apache Iceberg.
Overview
Transforms are used to:
- Partition data efficiently
- Create hidden partitions from column values
- Enable partition pruning during queries
Most users should create transforms using PartitionSpec.builderFor(Schema) rather than directly.
identity()
Returns an identity transform that passes values through unchanged.
<T> Transform<T, T> identity()
Example:
import org.apache.iceberg.transforms.Transforms;
import org.apache.iceberg.transforms.Transform;
Transform<String, String> idTransform = Transforms.identity();
Usage in PartitionSpec:
import org.apache.iceberg.PartitionSpec;
PartitionSpec spec = PartitionSpec.builderFor(schema)
.identity("category")
.identity("region")
.build();
bucket()
Returns a bucket transform that hashes values into a fixed number of buckets.
<T> Transform<T, Integer> bucket(int numBuckets)
Parameters:
numBuckets - The number of buckets to distribute values into
Example:
Transform<Long, Integer> bucketTransform = Transforms.bucket(16);
Usage in PartitionSpec:
PartitionSpec spec = PartitionSpec.builderFor(schema)
.bucket("user_id", 16) // 16 buckets
.build();
Common Bucket Sizes:
- 4, 8, 16 - For small to medium tables
- 32, 64 - For larger tables
- 128, 256 - For very large tables
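To make the bucket assignment concrete, here is a minimal sketch in plain Java. It assumes the rule given in the Iceberg table spec: hash the value with 32-bit Murmur3 (x86 variant, seed 0), clear the sign bit, and take the result modulo the bucket count. `BucketSketch` and its methods are hypothetical names, and the hash is a hand-written re-implementation for `long` values only; production code should use the library's transform.

```java
/** Illustrative only: BucketSketch is a hypothetical class, not part of Iceberg. */
final class BucketSketch {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    /** 32-bit Murmur3 (x86 variant, seed 0) of a long treated as 8 little-endian bytes. */
    static int murmur3Long(long value) {
        int h = 0;                             // seed
        h = mixBlock(h, (int) value);          // low 4 bytes
        h = mixBlock(h, (int) (value >>> 32)); // high 4 bytes
        h ^= 8;                                // input length in bytes
        // Final avalanche step.
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    private static int mixBlock(int h, int k) {
        k *= C1;
        k = Integer.rotateLeft(k, 15);
        k *= C2;
        h ^= k;
        h = Integer.rotateLeft(h, 13);
        return h * 5 + 0xe6546b64;
    }

    /** Clears the sign bit so the modulo result is never negative. */
    static int bucket(long value, int numBuckets) {
        return (murmur3Long(value) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        for (long userId : new long[] {1L, 42L, 12345L}) {
            System.out.println("user " + userId + " -> bucket " + bucket(userId, 16));
        }
    }
}
```

Because the bucket depends only on the hash, equal values always land in the same bucket, which is what lets equality predicates on the source column prune partitions.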
truncate()
Returns a truncate transform that truncates values to a specified width.
<T> Transform<T, T> truncate(int width)
Parameters:
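width - The truncation width; must be positive
- For strings: keeps at most the first width characters
- For integers/longs: rounds down to the nearest multiple of width (with width 10, 17 becomes 10 and -1 becomes -10)
- For decimals: applies the same rounding to the unscaled value, so on a decimal(10, 2) column a width of 100 yields steps of 1.00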
Example:
Transform<String, String> truncTransform = Transforms.truncate(10);
Usage in PartitionSpec:
PartitionSpec spec = PartitionSpec.builderFor(schema)
.truncate("name", 10) // First 10 chars
.truncate("value", 100) // Round down to a multiple of 100
.build();
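The truncation rules for integers and strings can be sketched in plain Java. This is an illustration under the assumption that numeric truncation rounds toward negative infinity and that string width is counted in Unicode code points, as the Iceberg table spec describes; `TruncateSketch` is a hypothetical class, not the library's implementation.

```java
/** Illustrative only: TruncateSketch is a hypothetical class, not part of Iceberg. */
final class TruncateSketch {
    /** Rounds down to the nearest multiple of width; floorMod keeps negatives rounding down. */
    static long truncateLong(long value, int width) {
        return value - Math.floorMod(value, (long) width);
    }

    /** Keeps at most the first width code points, so surrogate pairs are never split. */
    static String truncateString(String value, int width) {
        if (value.codePointCount(0, value.length()) <= width) {
            return value;
        }
        return value.substring(0, value.offsetByCodePoints(0, width));
    }

    public static void main(String[] args) {
        System.out.println(truncateLong(17, 10));         // 10
        System.out.println(truncateLong(-1, 10));         // -10
        System.out.println(truncateString("iceberg", 3)); // ice
    }
}
```

Rounding down rather than toward zero keeps every partition the same width, including the ones that straddle zero.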
year()
Extracts the year from dates or timestamps (as years since 1970).
<T> Transform<T, Integer> year()
Example:
Transform<Long, Integer> yearTransform = Transforms.year();
Usage in PartitionSpec:
PartitionSpec spec = PartitionSpec.builderFor(schema)
.year("event_time")
.build();
month()
Extracts the month from dates or timestamps (as months since epoch).
<T> Transform<T, Integer> month()
Example:
Transform<Long, Integer> monthTransform = Transforms.month();
Usage in PartitionSpec:
PartitionSpec spec = PartitionSpec.builderFor(schema)
.month("created_date")
.build();
day()
Extracts the day from dates or timestamps (as days since epoch).
<T> Transform<T, Integer> day()
Example:
Transform<Long, Integer> dayTransform = Transforms.day();
Usage in PartitionSpec:
PartitionSpec spec = PartitionSpec.builderFor(schema)
.day("event_date")
.build();
hour()
Extracts the hour from timestamps (as hours since epoch).
<T> Transform<T, Integer> hour()
Example:
Transform<Long, Integer> hourTransform = Transforms.hour();
Usage in PartitionSpec:
PartitionSpec spec = PartitionSpec.builderFor(schema)
.hour("event_timestamp")
.build();
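The four time transforms all produce an integer offset from the epoch (1970-01-01T00:00:00 UTC). The sketch below reproduces that arithmetic with `java.time` for a timestamp stored as microseconds since the epoch. `TimeTransformSketch` is a hypothetical class written for illustration, and it assumes UTC throughout.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

/** Illustrative only: TimeTransformSketch is a hypothetical class, not part of Iceberg. */
final class TimeTransformSketch {
    private static final long MICROS_PER_HOUR = 3_600_000_000L;
    private static final long MICROS_PER_DAY = 24 * MICROS_PER_HOUR;

    private static OffsetDateTime toUtc(long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        long nanos = Math.floorMod(micros, 1_000_000L) * 1_000L;
        return Instant.ofEpochSecond(seconds, nanos).atOffset(ZoneOffset.UTC);
    }

    /** Years since 1970. */
    static int year(long micros) {
        return toUtc(micros).getYear() - 1970;
    }

    /** Months since 1970-01. */
    static int month(long micros) {
        OffsetDateTime ts = toUtc(micros);
        return (ts.getYear() - 1970) * 12 + (ts.getMonthValue() - 1);
    }

    /** Days since 1970-01-01; floorDiv keeps pre-epoch values rounding down. */
    static int day(long micros) {
        return (int) Math.floorDiv(micros, MICROS_PER_DAY);
    }

    /** Hours since 1970-01-01T00:00 UTC. */
    static int hour(long micros) {
        return (int) Math.floorDiv(micros, MICROS_PER_HOUR);
    }

    public static void main(String[] args) {
        long micros = Instant.parse("2017-12-01T10:12:55Z").getEpochSecond() * 1_000_000L;
        System.out.println("year  = " + year(micros));  // 47
        System.out.println("month = " + month(micros)); // 575
        System.out.println("day   = " + day(micros));   // 17501
        System.out.println("hour  = " + hour(micros));  // 420034
    }
}
```

Since each finer unit determines the coarser ones (an hour value fixes its day, month, and year), a single time transform per column is enough.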
alwaysNull()
Returns a transform that always produces null (void transform).
<T> Transform<T, Void> alwaysNull()
Example:
Transform<String, Void> voidTransform = Transforms.alwaysNull();
fromString()
Parses a transform from a string representation.
Transform<?, ?> fromString(String transform)
Supported Formats:
"identity"
"year", "month", "day", "hour"
"bucket[N]" - e.g., "bucket[16]"
"truncate[N]" - e.g., "truncate[10]"
"void"
Example:
Transform<?, ?> transform1 = Transforms.fromString("bucket[16]");
Transform<?, ?> transform2 = Transforms.fromString("year");
Transform<?, ?> transform3 = Transforms.fromString("truncate[10]");
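The string formats listed above share a simple shape: a name, optionally followed by an integer in square brackets. As a rough illustration of that grammar (not how Iceberg itself parses these strings), the following sketch splits a transform string into its name and optional parameter. `TransformStringSketch` and its methods are hypothetical.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: TransformStringSketch is a hypothetical class, not part of Iceberg. */
final class TransformStringSketch {
    // Matches forms such as "bucket[16]" or "truncate[10]".
    private static final Pattern WITH_PARAM = Pattern.compile("(\\w+)\\[(\\d+)\\]");

    /** Returns the transform name, e.g. "bucket" for "bucket[16]" and "year" for "year". */
    static String name(String transform) {
        Matcher m = WITH_PARAM.matcher(transform);
        return m.matches() ? m.group(1) : transform;
    }

    /** Returns the bracketed integer if present, otherwise an empty OptionalInt. */
    static OptionalInt parameter(String transform) {
        Matcher m = WITH_PARAM.matcher(transform);
        return m.matches()
            ? OptionalInt.of(Integer.parseInt(m.group(2)))
            : OptionalInt.empty();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"bucket[16]", "truncate[10]", "year", "void"}) {
            System.out.println(s + " -> name=" + name(s) + ", parameter=" + parameter(s));
        }
    }
}
```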
Examples
Basic Partition Specs
import org.apache.iceberg.Schema;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.types.Types;
import static org.apache.iceberg.types.Types.NestedField.*;
// Create schema
Schema schema = new Schema(
required(1, "id", Types.LongType.get()),
required(2, "event_time", Types.TimestampType.withZone()),
required(3, "category", Types.StringType.get()),
required(4, "user_id", Types.LongType.get())
);
// Partition by date
PartitionSpec dateSpec = PartitionSpec.builderFor(schema)
.day("event_time")
.build();
// Partition by category (identity)
PartitionSpec categorySpec = PartitionSpec.builderFor(schema)
.identity("category")
.build();
// Partition by user bucket
PartitionSpec userSpec = PartitionSpec.builderFor(schema)
.bucket("user_id", 16)
.build();
Time-Based Partitioning
// Yearly partitions
PartitionSpec yearlySpec = PartitionSpec.builderFor(schema)
.year("event_time")
.build();
// Monthly partitions
PartitionSpec monthlySpec = PartitionSpec.builderFor(schema)
.month("event_time")
.build();
// Daily partitions
PartitionSpec dailySpec = PartitionSpec.builderFor(schema)
.day("event_time")
.build();
// Hourly partitions
PartitionSpec hourlySpec = PartitionSpec.builderFor(schema)
.hour("event_time")
.build();
Multi-Level Partitioning
// Partition by month only: month values already encode the year, and the
// builder rejects year() and month() on the same column as redundant
PartitionSpec yearMonthSpec = PartitionSpec.builderFor(schema)
.month("event_time")
.build();
// Partition by date and category
PartitionSpec dateCategorySpec = PartitionSpec.builderFor(schema)
.day("event_time")
.identity("category")
.build();
// Partition by date and user bucket
PartitionSpec dateUserSpec = PartitionSpec.builderFor(schema)
.day("event_time")
.bucket("user_id", 16)
.build();
String Truncation
Schema schema = new Schema(
required(1, "id", Types.LongType.get()),
required(2, "email", Types.StringType.get()),
required(3, "name", Types.StringType.get())
);
// Partition by email prefix
PartitionSpec emailSpec = PartitionSpec.builderFor(schema)
.truncate("email", 10) // First 10 characters
.build();
// Partition by name prefix
PartitionSpec nameSpec = PartitionSpec.builderFor(schema)
.truncate("name", 5) // First 5 characters
.build();
Numeric Truncation
Schema schema = new Schema(
required(1, "price", Types.DecimalType.of(10, 2)),
required(2, "quantity", Types.IntegerType.get())
);
// Partition by price in $1.00 increments (width applies to the unscaled value, so 100 means 1.00 at scale 2)
PartitionSpec priceSpec = PartitionSpec.builderFor(schema)
.truncate("price", 100)
.build();
// Partition by quantity in groups of 1000
PartitionSpec quantitySpec = PartitionSpec.builderFor(schema)
.truncate("quantity", 1000)
.build();
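Decimal truncation is easy to misread because, per the Iceberg table spec, the width applies to the unscaled value rather than to whole units. The sketch below shows the arithmetic with `BigDecimal`; `DecimalTruncateSketch` is a hypothetical class for illustration, not the library's code.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

/** Illustrative only: DecimalTruncateSketch is a hypothetical class, not part of Iceberg. */
final class DecimalTruncateSketch {
    /** Rounds the unscaled value down to a multiple of width and keeps the original scale. */
    static BigDecimal truncate(BigDecimal value, int width) {
        BigInteger unscaled = value.unscaledValue();
        // BigInteger.mod is never negative, so negative values also round down.
        BigInteger remainder = unscaled.mod(BigInteger.valueOf(width));
        return new BigDecimal(unscaled.subtract(remainder), value.scale());
    }

    public static void main(String[] args) {
        BigDecimal price = new BigDecimal("10.65"); // unscaled value 1065, scale 2
        System.out.println(truncate(price, 50));    // 10.50
        System.out.println(truncate(price, 100));   // 10.00
    }
}
```

To group a decimal(10, 2) column into $100 bands, the width would therefore need to be 10000.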
Hash-Based Distribution
// Distribute users evenly
PartitionSpec userDistribution = PartitionSpec.builderFor(schema)
.bucket("user_id", 32) // 32 buckets
.build();
// Combine with time partitioning
PartitionSpec timeUserSpec = PartitionSpec.builderFor(schema)
.day("event_time")
.bucket("user_id", 16)
.build();
Evolving Partition Specs
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
// Initial spec - daily partitions
PartitionSpec initialSpec = PartitionSpec.builderFor(schema)
.day("event_time")
.build();
// createTable is a placeholder for however your catalog creates tables
Table table = createTable(schema, initialSpec);
// Later - add category partitioning
table.updateSpec()
.addField("category")
.commit();
// Later - change to monthly partitions
// updateSpec() takes expression terms, so use Expressions.month rather than Transforms.month
table.updateSpec()
.removeField("event_time_day")
.addField(Expressions.month("event_time"))
.commit();
Custom Partition Values
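import org.apache.iceberg.transforms.Transform;
import org.apache.iceberg.transforms.Transforms;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.SerializableFunction;
// Get transform
Transform<Long, Integer> bucketTransform = Transforms.bucket(16);
// Transforms created without a type must be bound to the source type before use;
// calling apply() directly on them is deprecated in recent releases
SerializableFunction<Long, Integer> bucketFn = bucketTransform.bind(Types.LongType.get());
// Apply transform
Long userId = 12345L;
Integer bucket = bucketFn.apply(userId);
System.out.println("User " + userId + " -> bucket " + bucket);
// Year transform
Transform<Long, Integer> yearTransform = Transforms.year();
SerializableFunction<Long, Integer> yearFn = yearTransform.bind(Types.TimestampType.withZone());
Long timestamp = System.currentTimeMillis() * 1000; // microseconds
Integer year = yearFn.apply(timestamp); // years since 1970, not the calendar year
System.out.println("Timestamp " + timestamp + " -> years since 1970: " + year);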
Transform String Representations
import org.apache.iceberg.transforms.Transform;
Transform<?, ?> bucket16 = Transforms.bucket(16);
System.out.println(bucket16.toString()); // "bucket[16]"
Transform<?, ?> year = Transforms.year();
System.out.println(year.toString()); // "year"
Transform<?, ?> trunc10 = Transforms.truncate(10);
System.out.println(trunc10.toString()); // "truncate[10]"
Best Practices
- Time-based data: Use year(), month(), day(), or hour() based on query patterns
- High cardinality columns: Use bucket() to limit the number of partitions
- String prefixes: Use truncate() for prefix-based partitioning
- Low cardinality: Use identity() for direct partitioning
Partition Granularity
// Too fine - creates too many small files
PartitionSpec tooFine = PartitionSpec.builderFor(schema)
.hour("event_time")
.bucket("user_id", 1000)
.build();
// Better - balanced partition size
PartitionSpec balanced = PartitionSpec.builderFor(schema)
.day("event_time")
.bucket("user_id", 16)
.build();
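A quick upper bound on partition counts shows why the first spec is too fine. The sketch below multiplies the number of time slots by the number of buckets for one year of data; `PartitionCountSketch` is a hypothetical helper, and the one-year window is an assumption chosen for illustration.

```java
/** Illustrative only: PartitionCountSketch is a hypothetical helper, not part of Iceberg. */
final class PartitionCountSketch {
    /** Upper bound on distinct partitions: every time slot combined with every bucket. */
    static long maxPartitions(long timeSlots, int buckets) {
        return Math.multiplyExact(timeSlots, (long) buckets);
    }

    public static void main(String[] args) {
        long days = 365; // assumed retention window of one year
        long tooFine = maxPartitions(days * 24, 1000); // hour() x bucket(1000)
        long balanced = maxPartitions(days, 16);       // day() x bucket(16)
        System.out.println("hour x 1000 buckets: " + tooFine);  // 8760000
        System.out.println("day x 16 buckets:    " + balanced); // 5840
    }
}
```

Each partition that receives data holds at least one file, so the hourly spec can scatter the same rows across up to 1,500 times as many files.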
Bucket Count Selection
// Small table (< 1M rows)
PartitionSpec small = PartitionSpec.builderFor(schema)
.bucket("id", 4)
.build();
// Medium table (1M - 100M rows)
PartitionSpec medium = PartitionSpec.builderFor(schema)
.bucket("id", 16)
.build();
// Large table (> 100M rows)
PartitionSpec large = PartitionSpec.builderFor(schema)
.bucket("id", 64)
.build();