# Catalogs
Spark uses pluggable table catalogs configured via properties under `spark.sql.catalog.<name>`.
## Catalog Types
Iceberg provides two catalog implementations:

| Implementation | Description | Use Case |
|---|---|---|
| `SparkCatalog` | Dedicated Iceberg catalog | Hive Metastore or Hadoop warehouse |
| `SparkSessionCatalog` | Adds Iceberg support to built-in catalog | Mixed Iceberg and non-Iceberg tables |
## Hive Metastore Catalog
Configure a Hive-based catalog:
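A minimal `spark-defaults.conf` sketch; the catalog name `hive_prod` and the metastore URI are illustrative:

```properties
# Hive Metastore-backed Iceberg catalog
spark.sql.catalog.hive_prod      = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_prod.type = hive
spark.sql.catalog.hive_prod.uri  = thrift://metastore-host:9083
```

Tables in this catalog are then addressed as `hive_prod.db.table`.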
## REST Catalog

Configure a REST catalog:
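A sketch along the same lines; the catalog name `rest_prod` and the endpoint are illustrative:

```properties
# REST-backed Iceberg catalog
spark.sql.catalog.rest_prod      = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest_prod.type = rest
spark.sql.catalog.rest_prod.uri  = http://localhost:8181
```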
## Hadoop Catalog

Configure a directory-based catalog:
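A sketch; the catalog name `hadoop_prod` and the warehouse path are illustrative:

```properties
# Directory-based Iceberg catalog rooted at a warehouse path
spark.sql.catalog.hadoop_prod           = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_prod.type      = hadoop
spark.sql.catalog.hadoop_prod.warehouse = hdfs://nn:8020/warehouse/path
```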
## Catalog Configuration

Common configuration properties:

| Property | Values | Description |
|---|---|---|
| `spark.sql.catalog.<name>.type` | `hive`, `hadoop`, `rest`, `glue`, `jdbc`, `nessie` | Catalog implementation type |
| `spark.sql.catalog.<name>.catalog-impl` | Class name | Custom catalog implementation |
| `spark.sql.catalog.<name>.io-impl` | Class name | Custom `FileIO` implementation |
| `spark.sql.catalog.<name>.warehouse` | Path | Warehouse directory base path |
| `spark.sql.catalog.<name>.uri` | URI | Metastore URI (Hive) or REST URL |
| `spark.sql.catalog.<name>.default-namespace` | Namespace | Default current namespace |
| `spark.sql.catalog.<name>.cache-enabled` | `true`/`false` | Enable catalog cache (default: `true`) |
| `spark.sql.catalog.<name>.cache.expiration-interval-ms` | Milliseconds | Cache expiration time (default: 30000) |
## Table Defaults and Overrides
Set default or enforced table properties:
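A sketch, assuming a catalog named `prod` and illustrative property choices: `table-default.` properties can be overridden at table creation, while `table-override.` properties are enforced for every table in the catalog:

```properties
# Default (overridable at CREATE TABLE) and enforced table properties
spark.sql.catalog.prod.table-default.write.format.default = parquet
spark.sql.catalog.prod.table-override.format-version      = 2
```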
## View Defaults and Overrides

Similar configuration for views:
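Assuming the analogous `view-default.` / `view-override.` prefixes; `prop-key` is a placeholder, not a real property name:

```properties
spark.sql.catalog.prod.view-default.prop-key  = default-value
spark.sql.catalog.prod.view-override.prop-key = enforced-value
```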
## Using Catalogs

Reference tables with catalog names:
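For example, with a catalog named `prod` (namespace and table names illustrative):

```sql
-- Fully qualified: <catalog>.<namespace>.<table>
SELECT * FROM prod.db.events;

-- Or switch the current catalog and namespace first
USE prod.db;
SELECT * FROM events;
```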
## Replacing the Session Catalog

Add Iceberg support to Spark's built-in catalog. This allows using the same Hive Metastore for both Iceberg and non-Iceberg tables; non-Iceberg tables are handled by the built-in catalog.
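A sketch: Spark's built-in catalog is always named `spark_catalog`, and `SparkSessionCatalog` wraps it:

```properties
spark.sql.catalog.spark_catalog      = org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type = hive
```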
## Catalog-Specific Hadoop Configuration
Set per-catalog Hadoop properties. Catalog-specific properties take precedence over global `spark.hadoop.*` properties.
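A sketch, assuming a catalog named `prod`; any Hadoop property can follow the `hadoop.` prefix (the endpoint value is illustrative):

```properties
# Overrides a global spark.hadoop.fs.s3a.endpoint for this catalog only
spark.sql.catalog.prod.hadoop.fs.s3a.endpoint = http://localhost:9000
```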
## Loading Custom Catalogs

Use a custom catalog implementation:
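A sketch; `com.example.CustomCatalogImpl` is a placeholder for your own catalog class:

```properties
# catalog-impl takes precedence over the type property
spark.sql.catalog.custom_prod              = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.custom_prod.catalog-impl = com.example.CustomCatalogImpl
```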
## SQL Extensions

Enable Iceberg SQL extensions for advanced features:

- `CALL` stored procedures
- `ALTER TABLE ... ADD/DROP PARTITION FIELD`
- `ALTER TABLE ... WRITE ORDERED BY`
- `ALTER TABLE ... SET IDENTIFIER FIELDS`
- Branching and tagging DDL
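The features above require registering Iceberg's extension class, e.g. in `spark-defaults.conf`:

```properties
spark.sql.extensions = org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```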
# Runtime Configuration
## Configuration Precedence
Settings are applied in the following order (highest to lowest priority):

1. DataSource read/write options - `.option(...)` in code
2. Spark session configuration - `spark.conf.set(...)` or `spark-defaults.conf`
3. Table properties - `ALTER TABLE ... SET TBLPROPERTIES`
4. Default value
## Spark SQL Options
Global Iceberg behaviors via Spark configuration:

### Common SQL Options
| Option | Default | Description |
|---|---|---|
| `spark.sql.iceberg.vectorization.enabled` | Table default | Enable vectorized reads |
| `spark.sql.iceberg.parquet.reader-type` | `ICEBERG` | Parquet reader (`ICEBERG`, `COMET`) |
| `spark.sql.iceberg.check-nullability` | `true` | Validate write schema nullability |
| `spark.sql.iceberg.check-ordering` | `true` | Validate write schema column order |
| `spark.sql.iceberg.aggregate-push-down.enabled` | `true` | Push down aggregates (`MAX`, `MIN`, `COUNT`) |
| `spark.sql.iceberg.distribution-mode` | See Writes | Write distribution strategy |
| `spark.wap.id` | `null` | Write-Audit-Publish snapshot ID |
| `spark.wap.branch` | `null` | WAP branch name |
| `spark.sql.iceberg.compression-codec` | Table default | Write compression codec |
| `spark.sql.iceberg.compression-level` | Table default | Compression level |
| `spark.sql.iceberg.data-planning-mode` | `AUTO` | Data file scan planning (`AUTO`, `LOCAL`, `DISTRIBUTED`) |
| `spark.sql.iceberg.delete-planning-mode` | `AUTO` | Delete file scan planning |
| `spark.sql.iceberg.locality.enabled` | `false` | Report locality for task placement |
| `spark.sql.iceberg.executor-cache.enabled` | `true` | Enable executor-side cache |
| `spark.sql.iceberg.merge-schema` | `false` | Enable schema evolution on write |
| `spark.sql.iceberg.report-column-stats` | `true` | Report Puffin statistics to Spark CBO |
## Read Options
Options for DataFrame reads:

| Option | Default | Description |
|---|---|---|
| `snapshot-id` | Latest | Snapshot ID to read |
| `as-of-timestamp` | Latest | Timestamp in milliseconds |
| `branch` | - | Branch name to read |
| `tag` | - | Tag name to read |
| `split-size` | Table property | Override split target size |
| `lookback` | Table property | Override planning lookback |
| `file-open-cost` | Table property | Override file open cost |
| `vectorization-enabled` | Table property | Enable vectorized reads |
| `batch-size` | Table property | Vectorization batch size |
| `stream-from-timestamp` | - | Streaming start timestamp |
| `streaming-max-files-per-micro-batch` | INT_MAX | Max files per streaming micro-batch |
| `streaming-max-rows-per-micro-batch` | INT_MAX | Soft max rows per micro-batch |
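A Scala sketch applying a read option; the catalog, table, and snapshot ID are illustrative:

```scala
// Time-travel read pinned to a single snapshot
val df = spark.read
  .format("iceberg")
  .option("snapshot-id", "10963874102873")
  .load("prod.db.events")
```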
## Write Options
Options for DataFrame writes:

| Option | Default | Description |
|---|---|---|
| `write-format` | Table default | File format (`parquet`, `avro`, `orc`) |
| `target-file-size-bytes` | Table property | Target file size |
| `compression-codec` | Table default | Compression codec |
| `compression-level` | Table default | Compression level |
| `compression-strategy` | Table default | ORC compression strategy |
| `distribution-mode` | See Writes | Distribution mode |
| `fanout-enabled` | `false` | Enable fanout writer |
| `check-nullability` | `true` | Validate field nullability |
| `check-ordering` | `true` | Validate column order |
| `isolation-level` | `null` | Isolation level (`serializable`, `snapshot`) |
| `validate-from-snapshot-id` | `null` | Base snapshot for conflict detection |
| `snapshot-property.<key>` | - | Custom snapshot metadata |
| `delete-granularity` | `file` | Delete granularity |
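A Scala sketch applying write options through the DataFrameWriterV2 API; the table name and size are illustrative:

```scala
// Append with per-write overrides for target file size and the fanout writer
df.writeTo("prod.db.events")
  .option("target-file-size-bytes", (512L * 1024 * 1024).toString)
  .option("fanout-enabled", "true")
  .append()
```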
## Commit Metadata
Add custom metadata to snapshots:
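For example, via the `snapshot-property.<key>` write option from the table above; the key and value here are illustrative:

```scala
// The key/value pair is recorded in the new snapshot's summary metadata
df.writeTo("prod.db.events")
  .option("snapshot-property.ingest-origin", "nightly-etl")
  .append()
```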
## Next Steps

- Getting Started - Set up your first Iceberg table
- Write Data - Configure write performance and distribution
- Procedures - Use stored procedures for maintenance
- Structured Streaming - Configure streaming reads and writes