
Installation

The latest version of Iceberg is {{ icebergVersion }}.

Using Spark Shell

To use Iceberg in a Spark shell, add the runtime JAR using the --packages option:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5:{{ icebergVersion }}
If you want to include Iceberg in your Spark installation permanently, add the iceberg-spark-runtime JAR to Spark’s jars folder.
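As a sketch, the permanent install might look like the following; the exact filename is an assumption, since the artifact name depends on your Spark, Scala, and Iceberg versions:

```shell
# Copy the Iceberg Spark runtime JAR into Spark's jars directory so it is
# available to every application without --packages.
# (Filename is illustrative; match it to your Spark/Scala/Iceberg versions.)
cp iceberg-spark-runtime-3.5_2.12-<version>.jar "$SPARK_HOME/jars/"
```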

Using Spark SQL

For Spark SQL with catalog configuration:
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5:{{ icebergVersion }} \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse

Configuring Catalogs

Iceberg catalogs enable SQL commands to manage tables and load them by name. Configure catalogs using properties under spark.sql.catalog.(catalog_name).

Hadoop Catalog Example

Create a path-based catalog named local for tables under a warehouse directory:
spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type=hadoop
spark.sql.catalog.local.warehouse=/path/to/warehouse

Hive Metastore Catalog

Configure a Hive-based catalog with session catalog support:
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive

Creating Your First Table

Step 1: Create a Simple Table

Use the CREATE TABLE command to create your first Iceberg table:
CREATE TABLE local.db.table (
    id bigint,
    data string
) USING iceberg;

Step 2: Insert Data

Add data to your table using INSERT INTO:
INSERT INTO local.db.table 
VALUES (1, 'a'), (2, 'b'), (3, 'c');

Step 3: Query Data

Read data from your table:
SELECT count(1) as count, data
FROM local.db.table
GROUP BY data;

Row-Level Updates

Iceberg adds row-level SQL commands, such as MERGE INTO and DELETE FROM, to Spark:

MERGE INTO

Update existing rows and insert new ones in a single operation:
MERGE INTO local.db.table t 
USING (SELECT * FROM updates) u 
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.data = u.data
WHEN NOT MATCHED THEN INSERT *;

DELETE FROM

Remove rows matching a condition (using the sample table's id column):
DELETE FROM local.db.table
WHERE id < 2;

Writing with DataFrames

Iceberg supports the v2 DataFrame write API for programmatic writes:
val data: DataFrame = ...
data.writeTo("local.db.table").append()
The old v1 write API is supported but not recommended. Use the v2 writeTo API instead.
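For contrast, here is a sketch of the same append through both APIs, assuming `data` is an existing DataFrame and the target table exists:

```scala
// v1 DataFrame write API: supported, but not recommended.
data.write.format("iceberg").mode("append").save("local.db.table")

// v2 DataFrameWriterV2 API: preferred. The target is addressed by name
// and the operation (append vs. overwrite) is an explicit method call.
data.writeTo("local.db.table").append()
```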

Reading with DataFrames

Load tables by name using spark.table:
val df = spark.table("local.db.table")
df.count()
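A sketch of the equivalent format-based read, followed by a small aggregation (column names assume the sample table above):

```scala
// DataFrameReader form of the same load.
val df = spark.read.format("iceberg").load("local.db.table")

// Standard DataFrame operations apply to the result.
df.groupBy("data").count().show()
```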

Inspecting Tables

Use metadata tables to inspect table history and snapshots:

View Snapshots

SELECT * FROM local.db.table.snapshots;
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+
| committed_at            | snapshot_id    | parent_id | operation | manifest_list                                      |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+
| 2019-02-08 03:29:51.215 | 57897183625154 | null      | append    | s3://.../table/metadata/snap-57897183625154-1.avro |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+

View History

SELECT * FROM local.db.table.history;

View Files

SELECT * FROM local.db.table.files;

Type Conversion

Spark to Iceberg

When creating tables or writing data, Spark types are automatically converted:
Spark Type            | Iceberg Type               | Notes
----------------------|----------------------------|-----------------------------------
boolean               | boolean                    |
byte, short, integer  | integer                    | Promoted to integer
long                  | long                       |
float                 | float                      |
double                | double                     |
decimal               | decimal                    |
timestamp             | timestamp with timezone    |
timestamp_ntz         | timestamp without timezone |
string, char, varchar | string                     |
binary                | binary                     | Assertion on length for fixed type
Numeric types support promotion during writes. For example, you can write Spark integer to Iceberg long.
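A sketch of that promotion in practice, assuming the sample table above (whose id column is bigint):

```scala
import spark.implicits._

// id is inferred as a Spark integer here, but the target column is bigint;
// the write succeeds because integer promotes safely to long.
val rows = Seq((4, "d"), (5, "e")).toDF("id", "data")
rows.writeTo("local.db.table").append()
```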

Iceberg to Spark

When reading from Iceberg tables:
Iceberg Type               | Spark Type    | Supported
---------------------------|---------------|-----------------
timestamp with timezone    | timestamp     | ✔️
timestamp without timezone | timestamp_ntz | ✔️
uuid                       | string        | ✔️
time                       | -             | ❌ Not supported
variant                    | variant       | ✔️ Spark 4.0+
unknown                    | null          | ✔️ Spark 4.0+

Next Steps

DDL Commands: learn about CREATE, ALTER, and DROP operations
Query Data: explore SELECT queries and metadata tables
Write Data: master INSERT INTO and MERGE INTO operations
Procedures: maintain tables with stored procedures