Transforms

The Transforms class provides factory methods for creating partition transform functions in Apache Iceberg.

Overview

Transforms are used to:

Partition data efficiently
Create hidden partitions from column values
Enable partition pruning during queries

Most users should create transforms using PartitionSpec.builderFor(Schema) rather than directly.

Identity Transform

identity()

Returns an identity transform that passes values through unchanged.

<T> Transform<T, T> identity()

Example:

import org.apache.iceberg.transforms.Transforms;
import org.apache.iceberg.transforms.Transform;

Transform<String, String> idTransform = Transforms.identity();

Usage in PartitionSpec:

import org.apache.iceberg.PartitionSpec;

PartitionSpec spec = PartitionSpec.builderFor(schema)
    .identity("category")
    .identity("region")
    .build();

Bucket Transform

bucket()

Returns a bucket transform that hashes values into a fixed number of buckets.

<T> Transform<T, Integer> bucket(int numBuckets)

Parameters:

numBuckets - The number of buckets to distribute values into

Example:

Transform<Long, Integer> bucketTransform = Transforms.bucket(16);

Usage in PartitionSpec:

PartitionSpec spec = PartitionSpec.builderFor(schema)
    .bucket("user_id", 16)  // 16 buckets
    .build();

Common Bucket Sizes (rules of thumb; the right count depends on data volume and target file size):

4, 8, 16 - For small to medium tables
32, 64 - For larger tables
128, 256 - For very large tables
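To make the hash-then-modulo step concrete, here is a minimal standalone sketch for long values. It assumes the behavior described in the Iceberg table spec: a 32-bit Murmur3 hash (seed 0) of the value's 8 little-endian bytes, with the sign bit cleared before taking the remainder. BucketSketch and its methods are hypothetical names used only for illustration, not Iceberg classes. The spec's appendix lists reference hash values, so check a sketch like this against them before relying on it.

```java
// Illustrative only: a stdlib re-creation of the bucket arithmetic for long values.
// BucketSketch and its methods are hypothetical names, not part of Iceberg.
final class BucketSketch {
    // 32-bit Murmur3 (x86 variant, seed 0) over the 8 little-endian bytes of a long.
    static int murmur3Long(long value) {
        int h = 0;
        h = mixH(h, mixK((int) value));          // low 4 bytes
        h = mixH(h, mixK((int) (value >>> 32))); // high 4 bytes
        h ^= 8;                                  // total input length in bytes
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    private static int mixK(int k) {
        k *= 0xcc9e2d51;
        k = Integer.rotateLeft(k, 15);
        k *= 0x1b873593;
        return k;
    }

    private static int mixH(int h, int k) {
        h ^= k;
        h = Integer.rotateLeft(h, 13);
        return h * 5 + 0xe6546b64;
    }

    // Clear the sign bit, then reduce modulo the bucket count.
    static int bucket(long value, int numBuckets) {
        return (murmur3Long(value) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        for (long userId : new long[] {1L, 42L, 12345L, -7L}) {
            System.out.println(userId + " -> bucket " + bucket(userId, 16));
        }
    }
}
```

Because the result depends only on the value and the bucket count, the same user_id always lands in the same bucket. Changing numBuckets later reassigns most values to different buckets.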

Truncate Transform

truncate()

Returns a truncate transform that truncates values to a specified width.

<T> Transform<T, T> truncate(int width)

Parameters:

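width - The truncation width; its meaning depends on the source type
- For strings: keeps at most the first width characters (Unicode code points)
- For integers/longs: rounds down to the nearest multiple of width (for example, width 100 maps 1234 to 1200)
- For decimals: rounds the unscaled value down to a multiple of width, keeping the scale (for example, with scale 2, width 50 maps 10.65 to 10.50)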
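The per-type rules above can be sketched with plain JDK arithmetic. This is a minimal illustration of the truncation formulas as the Iceberg table spec describes them, not Iceberg's own code. TruncateSketch and its method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Illustrative only: TruncateSketch is a hypothetical name, not an Iceberg class.
final class TruncateSketch {
    // Round down to a multiple of width; floorMod makes negatives round toward -infinity.
    static long truncateLong(long value, int width) {
        return value - Math.floorMod(value, (long) width);
    }

    // Keep at most `width` Unicode code points.
    static String truncateString(String value, int width) {
        int codePoints = value.codePointCount(0, value.length());
        if (codePoints <= width) {
            return value;
        }
        return value.substring(0, value.offsetByCodePoints(0, width));
    }

    // Apply the width to the unscaled value and keep the original scale.
    static BigDecimal truncateDecimal(BigDecimal value, int width) {
        BigInteger unscaled = value.unscaledValue();
        BigInteger remainder = unscaled.mod(BigInteger.valueOf(width)); // always non-negative
        return new BigDecimal(unscaled.subtract(remainder), value.scale());
    }

    public static void main(String[] args) {
        System.out.println(truncateLong(1234, 100));                      // 1200
        System.out.println(truncateLong(-1, 10));                         // -10
        System.out.println(truncateString("iceberg", 3));                 // ice
        System.out.println(truncateDecimal(new BigDecimal("10.65"), 50)); // 10.50
    }
}
```

Note the decimal case: the width counts unscaled units, so the effective step is width divided by 10 to the power of the scale.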

Example:

Transform<String, String> truncTransform = Transforms.truncate(10);

Usage in PartitionSpec:

PartitionSpec spec = PartitionSpec.builderFor(schema)
    .truncate("name", 10)      // First 10 chars
    .truncate("value", 100)    // Round down to a multiple of 100
    .build();

Temporal Transforms

year()

Extracts the year from dates or timestamps (as years since epoch, so 2017 becomes 47).

<T> Transform<T, Integer> year()

Example:

Transform<Long, Integer> yearTransform = Transforms.year();

Usage in PartitionSpec:

PartitionSpec spec = PartitionSpec.builderFor(schema)
    .year("event_time")
    .build();

month()

Extracts the month from dates or timestamps (as months since epoch).

<T> Transform<T, Integer> month()

Example:

Transform<Long, Integer> monthTransform = Transforms.month();

Usage in PartitionSpec:

PartitionSpec spec = PartitionSpec.builderFor(schema)
    .month("created_date")
    .build();

day()

Extracts the day from dates or timestamps (as days since epoch).

<T> Transform<T, Integer> day()

Example:

Transform<Long, Integer> dayTransform = Transforms.day();

Usage in PartitionSpec:

PartitionSpec spec = PartitionSpec.builderFor(schema)
    .day("event_date")
    .build();

hour()

Extracts the hour from timestamps (as hours since epoch).

<T> Transform<T, Integer> hour()

Example:

Transform<Long, Integer> hourTransform = Transforms.hour();

Usage in PartitionSpec:

PartitionSpec spec = PartitionSpec.builderFor(schema)
    .hour("event_timestamp")
    .build();
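All four temporal transforms produce integer ordinals counted from 1970-01-01 rather than calendar fields. The sketch below reproduces that arithmetic with java.time so the values are easier to reason about. It assumes UTC dates and microsecond timestamps. TemporalSketch and its methods are hypothetical names, not Iceberg APIs.

```java
import java.time.LocalDate;

// Illustrative only: TemporalSketch is a hypothetical name, not an Iceberg class.
final class TemporalSketch {
    static final long MICROS_PER_HOUR = 3_600_000_000L;

    // Years since 1970; negative for dates before the epoch.
    static int years(LocalDate date) {
        return date.getYear() - 1970;
    }

    // Months since 1970-01; the year is already encoded in the result.
    static int months(LocalDate date) {
        return (date.getYear() - 1970) * 12 + (date.getMonthValue() - 1);
    }

    // Days since 1970-01-01.
    static int days(LocalDate date) {
        return (int) date.toEpochDay();
    }

    // Hours since 1970-01-01T00:00 for a microsecond timestamp; floorDiv rounds
    // toward -infinity so pre-epoch timestamps fall in the correct hour.
    static int hours(long timestampMicros) {
        return (int) Math.floorDiv(timestampMicros, MICROS_PER_HOUR);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2017, 11, 16);
        System.out.println("year  -> " + years(date));              // 47
        System.out.println("month -> " + months(date));             // 574
        System.out.println("day   -> " + days(date));               // 17486
        System.out.println("hour  -> " + hours(1510871468000000L)); // 419686
    }
}
```

Because month, day, and hour values already include the year, a single temporal transform per column is enough to prune by any coarser time range.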

Void Transform

alwaysNull()

Returns a transform that always produces null (void transform).

<T> Transform<T, Void> alwaysNull()

Example:

Transform<String, Void> voidTransform = Transforms.alwaysNull();

Parsing Transforms

fromString()

Parses a transform from a string representation.

Transform<?, ?> fromString(String transform)

Supported Formats:

"identity"
"year", "month", "day", "hour"
"bucket[N]" - e.g., "bucket[16]"
"truncate[N]" - e.g., "truncate[10]"
"void"

Example:

Transform<?, ?> transform1 = Transforms.fromString("bucket[16]");
Transform<?, ?> transform2 = Transforms.fromString("year");
Transform<?, ?> transform3 = Transforms.fromString("truncate[10]");
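The string formats listed above follow a simple shape: a bare name, or a name followed by an integer parameter in square brackets. As a rough illustration of that shape (not how Iceberg itself parses these strings), the sketch below splits a transform string into its name and optional parameter. TransformStringSketch and its methods are hypothetical.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: TransformStringSketch is a hypothetical name, not an Iceberg class.
final class TransformStringSketch {
    // Matches parameterized forms such as "bucket[16]" or "truncate[10]".
    private static final Pattern PARAMETERIZED = Pattern.compile("(\\w+)\\[(\\d+)\\]");

    // Transform name without the bracketed parameter.
    static String name(String transform) {
        Matcher m = PARAMETERIZED.matcher(transform);
        return m.matches() ? m.group(1) : transform;
    }

    // The integer parameter, or empty for bare names such as "year" or "identity".
    static OptionalInt parameter(String transform) {
        Matcher m = PARAMETERIZED.matcher(transform);
        return m.matches() ? OptionalInt.of(Integer.parseInt(m.group(2))) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"bucket[16]", "truncate[10]", "year", "void"}) {
            System.out.println(s + " -> name=" + name(s) + ", parameter=" + parameter(s));
        }
    }
}
```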

Examples

Basic Partition Specs

import org.apache.iceberg.Schema;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.types.Types;
import static org.apache.iceberg.types.Types.NestedField.*;

// Create schema
Schema schema = new Schema(
    required(1, "id", Types.LongType.get()),
    required(2, "event_time", Types.TimestampType.withZone()),
    required(3, "category", Types.StringType.get()),
    required(4, "user_id", Types.LongType.get())
);

// Partition by date
PartitionSpec dateSpec = PartitionSpec.builderFor(schema)
    .day("event_time")
    .build();

// Partition by category (identity)
PartitionSpec categorySpec = PartitionSpec.builderFor(schema)
    .identity("category")
    .build();

// Partition by user bucket
PartitionSpec userSpec = PartitionSpec.builderFor(schema)
    .bucket("user_id", 16)
    .build();

Time-Based Partitioning

// Yearly partitions
PartitionSpec yearlySpec = PartitionSpec.builderFor(schema)
    .year("event_time")
    .build();

// Monthly partitions
PartitionSpec monthlySpec = PartitionSpec.builderFor(schema)
    .month("event_time")
    .build();

// Daily partitions
PartitionSpec dailySpec = PartitionSpec.builderFor(schema)
    .day("event_time")
    .build();

// Hourly partitions
PartitionSpec hourlySpec = PartitionSpec.builderFor(schema)
    .hour("event_time")
    .build();

Multi-Level Partitioning

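// Year and month together are redundant: month() already encodes the year
// (months since epoch), and the builder rejects two time transforms on the
// same source column, so partition by month alone
PartitionSpec yearMonthSpec = PartitionSpec.builderFor(schema)
    .month("event_time")
    .build();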

// Partition by date and category
PartitionSpec dateCategorySpec = PartitionSpec.builderFor(schema)
    .day("event_time")
    .identity("category")
    .build();

// Partition by date and user bucket
PartitionSpec dateUserSpec = PartitionSpec.builderFor(schema)
    .day("event_time")
    .bucket("user_id", 16)
    .build();

String Truncation

Schema schema = new Schema(
    required(1, "id", Types.LongType.get()),
    required(2, "email", Types.StringType.get()),
    required(3, "name", Types.StringType.get())
);

// Partition by email prefix
PartitionSpec emailSpec = PartitionSpec.builderFor(schema)
    .truncate("email", 10)  // First 10 characters
    .build();

// Partition by name prefix
PartitionSpec nameSpec = PartitionSpec.builderFor(schema)
    .truncate("name", 5)    // First 5 characters
    .build();

Numeric Truncation

Schema schema = new Schema(
    required(1, "price", Types.DecimalType.of(10, 2)),
    required(2, "quantity", Types.IntegerType.get())
);

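// Partition by price in $100 increments
// For decimals the width applies to the unscaled value: with scale 2,
// $100.00 is 10000 unscaled units (a width of 100 would give $1.00 increments)
PartitionSpec priceSpec = PartitionSpec.builderFor(schema)
    .truncate("price", 10000)
    .build();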

// Partition by quantity in groups of 1000
PartitionSpec quantitySpec = PartitionSpec.builderFor(schema)
    .truncate("quantity", 1000)
    .build();

Hash-Based Distribution

// Distribute users evenly
PartitionSpec userDistribution = PartitionSpec.builderFor(schema)
    .bucket("user_id", 32)  // 32 buckets
    .build();

// Combine with time partitioning
PartitionSpec timeUserSpec = PartitionSpec.builderFor(schema)
    .day("event_time")
    .bucket("user_id", 16)
    .build();

Evolving Partition Specs

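import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;

// Initial spec - daily partitions
PartitionSpec initialSpec = PartitionSpec.builderFor(schema)
    .day("event_time")
    .build();

// createTable is a placeholder for however your catalog creates the table
Table table = createTable(schema, initialSpec);

// Later - add category partitioning (identity on the source column)
table.updateSpec()
    .addField("category")
    .commit();

// Later - change to monthly partitions
// updateSpec() takes expression terms rather than Transform instances
table.updateSpec()
    .removeField("event_time_day")
    .addField(Expressions.month("event_time"))
    .commit();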

Custom Partition Values

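import java.util.function.Function;
import org.apache.iceberg.transforms.Transform;
import org.apache.iceberg.transforms.Transforms;
import org.apache.iceberg.types.Types;

// Get transform and bind it to the source type; in recent Iceberg releases,
// calling apply() on an unbound transform throws UnsupportedOperationException
Transform<Long, Integer> bucketTransform = Transforms.bucket(16);
Function<Long, Integer> bucketFn = bucketTransform.bind(Types.LongType.get());

// Apply transform
Long userId = 12345L;
Integer bucket = bucketFn.apply(userId);
System.out.println("User " + userId + " -> bucket " + bucket);

// Year transform (the result is years since 1970, not the calendar year)
Transform<Long, Integer> yearTransform = Transforms.year();
Function<Long, Integer> yearFn = yearTransform.bind(Types.TimestampType.withZone());
Long timestamp = System.currentTimeMillis() * 1000; // microseconds
Integer yearsSinceEpoch = yearFn.apply(timestamp);
System.out.println("Timestamp " + timestamp + " -> years since 1970: " + yearsSinceEpoch);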

Transform String Representation

import org.apache.iceberg.transforms.Transform;

Transform<?, ?> bucket16 = Transforms.bucket(16);
System.out.println(bucket16.toString()); // "bucket[16]"

Transform<?, ?> year = Transforms.year();
System.out.println(year.toString()); // "year"

Transform<?, ?> trunc10 = Transforms.truncate(10);
System.out.println(trunc10.toString()); // "truncate[10]"

Best Practices

Choosing Partition Transforms

Time-based data: Use year(), month(), day(), or hour() based on query patterns
High cardinality columns: Use bucket() to limit number of partitions
String prefixes: Use truncate() for prefix-based partitioning
Low cardinality: Use identity() for direct partitioning

Partition Granularity

// Too fine - creates too many small files
PartitionSpec tooFine = PartitionSpec.builderFor(schema)
    .hour("event_time")
    .bucket("user_id", 1000)
    .build();

// Better - balanced partition size
PartitionSpec balanced = PartitionSpec.builderFor(schema)
    .day("event_time")
    .bucket("user_id", 16)
    .build();
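A quick back-of-the-envelope count shows why the first spec is too fine. The sketch below estimates an upper bound on distinct partitions as time slices multiplied by bucket count. It assumes one year of data and that every bucket receives rows in every time slice. PartitionCountSketch is a hypothetical helper, not an Iceberg API.

```java
// Illustrative only: PartitionCountSketch is a hypothetical name, not an Iceberg class.
final class PartitionCountSketch {
    // Upper bound on distinct partitions: every time slice can contain every bucket.
    static long maxPartitions(long timeSlices, int numBuckets) {
        return timeSlices * numBuckets;
    }

    public static void main(String[] args) {
        long hoursPerYear = 24L * 365;
        long daysPerYear = 365L;

        // hour("event_time") + bucket("user_id", 1000)
        System.out.println("too fine: " + maxPartitions(hoursPerYear, 1000)); // 8760000
        // day("event_time") + bucket("user_id", 16)
        System.out.println("balanced: " + maxPartitions(daysPerYear, 16));    // 5840
    }
}
```

Under these assumptions the hourly spec allows roughly 1,500 times as many partitions per year as the daily one, and each partition holds at least one data file.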

Bucket Count Selection

// Small table (< 1M rows)
PartitionSpec small = PartitionSpec.builderFor(schema)
    .bucket("id", 4)
    .build();

// Medium table (1M - 100M rows)
PartitionSpec medium = PartitionSpec.builderFor(schema)
    .bucket("id", 16)
    .build();

// Large table (> 100M rows)
PartitionSpec large = PartitionSpec.builderFor(schema)
    .bucket("id", 64)
    .build();
