Spark Integration Overview - Apache Iceberg Documentation

Overview

Spark is currently the most feature-rich compute engine for Iceberg operations. Apache Iceberg uses Spark’s DataSourceV2 API for data source and catalog implementations, providing comprehensive support for table management, queries, and writes.

Key Features

Full DDL Support

Create, alter, and manage Iceberg tables with complete SQL DDL operations

Advanced Queries

Time travel, metadata tables, and efficient scan planning

Row-Level Operations

MERGE INTO, UPDATE, and DELETE operations for data modification

Streaming Support

Structured Streaming reads and writes with incremental processing
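
As a rough illustration of the query features listed above, the sketch below builds Spark SQL strings for two time-travel reads and one metadata-table read. The table name my_catalog.db.events, the snapshot ID, and the timestamp are hypothetical, and the VERSION AS OF / TIMESTAMP AS OF syntax assumes Spark 3.3 or later. Executing the statements needs a Spark session with Iceberg configured, so the code only constructs and prints them.

```python
# Sketch only: the table name, snapshot ID, and timestamp are hypothetical.
TABLE = "my_catalog.db.events"


def time_travel_by_snapshot(table: str, snapshot_id: int) -> str:
    """Read the table as of a specific snapshot ID."""
    return f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"


def time_travel_by_timestamp(table: str, timestamp: str) -> str:
    """Read the table as it existed at a point in time."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def snapshots_metadata(table: str) -> str:
    """List the table's snapshots through the `snapshots` metadata table."""
    return f"SELECT snapshot_id, committed_at, operation FROM {table}.snapshots"


queries = [
    time_travel_by_snapshot(TABLE, 1234567890),
    time_travel_by_timestamp(TABLE, "2024-01-01 00:00:00"),
    snapshots_metadata(TABLE),
]
for q in queries:
    print(q)  # with a configured session: spark.sql(q).show()
```

With a configured session, each string can be passed to spark.sql(...) unchanged.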

Compatibility

The table below lists which Spark features work with Iceberg tables and what each one requires:

Feature	Availability	Notes
SQL INSERT INTO	✔️ All versions	Requires spark.sql.storeAssignmentPolicy=ANSI (default since Spark 3.0)
SQL MERGE INTO	✔️ All versions	Requires Iceberg Spark extensions
SQL DELETE FROM	✔️ All versions	Row-level deletes require extensions
SQL UPDATE	✔️ All versions	Requires Iceberg Spark extensions
DataFrame writes	✔️ All versions	DataFrameWriterV2 API recommended
Structured Streaming	✔️ All versions	Append and complete modes
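
To make the row-level entries in the table more concrete, here is a minimal sketch of MERGE INTO, UPDATE, and DELETE FROM statements. The table names (my_catalog.db.accounts, my_catalog.db.account_updates) and columns (id, balance, status) are hypothetical. The helper run_row_level_ops is also illustrative: it assumes it is handed an existing SparkSession that already has the Iceberg SQL extensions enabled.

```python
# Sketch only: table and column names are hypothetical.
TARGET = "my_catalog.db.accounts"
SOURCE = "my_catalog.db.account_updates"

# Upsert: update rows that match on id, insert the rest.
MERGE_SQL = f"""
MERGE INTO {TARGET} t
USING {SOURCE} s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.balance = s.balance
WHEN NOT MATCHED THEN INSERT *
"""

UPDATE_SQL = f"UPDATE {TARGET} SET status = 'closed' WHERE balance = 0"
DELETE_SQL = f"DELETE FROM {TARGET} WHERE status = 'closed'"


def run_row_level_ops(spark) -> None:
    """Run the three statements in order on an existing SparkSession."""
    for statement in (MERGE_SQL, UPDATE_SQL, DELETE_SQL):
        spark.sql(statement)
```

Keeping the statements as plain strings means they can also be pasted into spark-sql directly.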

Type Compatibility

Iceberg automatically converts between Spark and Iceberg types:

Spark to Iceberg Type Mapping

Spark Type	Iceberg Type	Notes
boolean	boolean
byte, short, integer	integer	byte and short are stored as integer; numeric promotion is supported on write
long	long
float	float
double	double
decimal	decimal
date	date
timestamp	timestamp with timezone
timestamp_ntz	timestamp without timezone
string, char, varchar	string
binary	binary	Can also be written to an Iceberg fixed column; the value length is checked on write
struct	struct
array	list
map	map

Iceberg to Spark Type Mapping

Iceberg Type	Spark Type	Supported
boolean	boolean	✔️
integer	integer	✔️
long	long	✔️
float	float	✔️
double	double	✔️
decimal	decimal	✔️
date	date	✔️
time	-	❌ Not supported
timestamp with timezone	timestamp	✔️
timestamp without timezone	timestamp_ntz	✔️
string	string	✔️
uuid	string	✔️
fixed	binary	✔️
binary	binary	✔️
struct	struct	✔️
list	array	✔️
map	map	✔️
variant	variant	✔️ (Spark 4.0+)
unknown	null	✔️ (Spark 4.0+)
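
One consequence of the two tables is that the mapping is not symmetric: some Spark types do not come back unchanged after a round trip through Iceberg. The sketch below transcribes both tables into dictionaries and lists the Spark types affected. The names SPARK_TO_ICEBERG, ICEBERG_TO_SPARK, and round_trip are illustrative helpers, not part of any Iceberg or Spark API.

```python
# Mappings transcribed from the two tables above; helper names are illustrative.
SPARK_TO_ICEBERG = {
    "boolean": "boolean",
    "byte": "integer",
    "short": "integer",
    "integer": "integer",
    "long": "long",
    "float": "float",
    "double": "double",
    "decimal": "decimal",
    "date": "date",
    "timestamp": "timestamp with timezone",
    "timestamp_ntz": "timestamp without timezone",
    "string": "string",
    "char": "string",
    "varchar": "string",
    "binary": "binary",
    "struct": "struct",
    "array": "list",
    "map": "map",
}

ICEBERG_TO_SPARK = {
    "boolean": "boolean",
    "integer": "integer",
    "long": "long",
    "float": "float",
    "double": "double",
    "decimal": "decimal",
    "date": "date",
    "time": None,  # not supported in Spark
    "timestamp with timezone": "timestamp",
    "timestamp without timezone": "timestamp_ntz",
    "string": "string",
    "uuid": "string",
    "fixed": "binary",
    "binary": "binary",
    "struct": "struct",
    "list": "array",
    "map": "map",
    "variant": "variant",  # Spark 4.0+
    "unknown": "null",     # Spark 4.0+
}


def round_trip(spark_type: str):
    """Spark type seen when reading back a column created from `spark_type`."""
    return ICEBERG_TO_SPARK[SPARK_TO_ICEBERG[spark_type]]


# Spark types that read back as a different Spark type.
changed = sorted(t for t in SPARK_TO_ICEBERG if round_trip(t) != t)
print(changed)  # ['byte', 'char', 'short', 'varchar']
```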

Getting Started

Add Iceberg Runtime

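Include the Iceberg Spark runtime in your Spark environment. The artifact name carries both the Spark version and the Scala version (3.5 and 2.12 here); replace {{ icebergVersion }} with the Iceberg release you use:

spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{ icebergVersion }}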

Configure Catalogs

Define an Iceberg catalog in your Spark configuration. The example below registers a catalog named my_catalog backed by a Hive Metastore:

spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=hive

Enable SQL Extensions

Enable the Iceberg SQL extensions, which MERGE INTO, UPDATE, and row-level DELETE rely on:

spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
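
The three settings above can also be applied programmatically. The following is a minimal PySpark-oriented sketch: iceberg_spark_conf and build_session are hypothetical helper names, the _2.12 Scala suffix in the artifact name is an assumption about your Spark build, and the Iceberg version is left as a required argument rather than guessed. spark.jars.packages is the configuration-key equivalent of the --packages flag.

```python
def iceberg_spark_conf(iceberg_version: str,
                       catalog_name: str = "my_catalog",
                       catalog_type: str = "hive",
                       spark_version: str = "3.5",
                       scala_version: str = "2.12") -> dict:
    """Collect the runtime, catalog, and extension settings in one dict."""
    runtime = ("org.apache.iceberg:iceberg-spark-runtime-"
               f"{spark_version}_{scala_version}:{iceberg_version}")
    return {
        "spark.jars.packages": runtime,
        f"spark.sql.catalog.{catalog_name}":
            "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog_name}.type": catalog_type,
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    }


def build_session(conf: dict):
    """Create a SparkSession from the settings (requires pyspark)."""
    # Imported lazily so iceberg_spark_conf has no third-party dependencies.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("iceberg-example")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A typical call would be build_session(iceberg_spark_conf(iceberg_version=...)), with the version supplied by you. A Hive-backed catalog usually needs further settings, such as the metastore location, which are outside the scope of this sketch.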

Next Steps

Getting Started

Set up your first Iceberg table with Spark

DDL Operations

Learn about CREATE, ALTER, and DROP commands

Query Data

Execute queries and explore metadata tables

Write Data

Insert, update, and merge data into tables
