Changelog entries in Apache Iceberg do not represent logical changes. They are a projection derived from the system's single authoritative source of truth: the snapshot.
Iceberg defines table correctness at the snapshot level. A snapshot represents the complete and valid state of a table at a point in time: which data files are visible, under which schema, and respecting all table invariants.
A snapshot fixes one concrete realization of a table's logical state. Multiple snapshots may satisfy the same table semantics.
Snapshots are immutable. Once a snapshot is committed, its manifests and data files are never modified. Any change to the table is expressed by producing new files and publishing a new snapshot, while previous snapshots remain unchanged and accessible.
When write operations affect existing rows, immutability is preserved either by rewriting data files (copy-on-write) or by layering deltas on top of immutable files (merge-on-read). Under copy-on-write, updates and deletes are materialized by rewriting every data file that contains an affected row. Rows that are logically unchanged but reside in rewritten files, known as carry-over rows, are physically re-materialized.
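The carry-over effect can be sketched in a few lines of Python. This is a toy model of copy-on-write file rewriting, not Iceberg code; every name here is illustrative.

```python
# Toy model: a "data file" is an immutable tuple of (id, value) rows.
# copy_on_write_update is an illustrative name, not an Iceberg API.

def copy_on_write_update(files, updates):
    """Rewrite every file containing an updated id; leave other files untouched.

    Returns (kept_files, new_files, carry_over_rows).
    """
    kept, rewritten_rows, carry_over = [], [], []
    for data_file in files:
        if any(row_id in updates for row_id, _ in data_file):
            for row_id, value in data_file:
                if row_id in updates:
                    rewritten_rows.append((row_id, updates[row_id]))
                else:
                    # Logically unchanged, but physically re-materialized.
                    carry_over.append((row_id, value))
                    rewritten_rows.append((row_id, value))
        else:
            kept.append(data_file)
    new_files = [tuple(rewritten_rows)] if rewritten_rows else []
    return kept, new_files, carry_over

# One file holding three rows: updating id=1 forces ids 2 and 3 to be
# rewritten into the new file as carry-over rows.
kept, new, carry = copy_on_write_update([((1, "a"), (2, "b"), (3, "c"))], {1: "d"})
# carry → [(2, 'b'), (3, 'c')]
```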
Changelog tables surface this boundary. Each row is associated with the snapshot that owns it.
The example below illustrates this boundary directly. Two logically equivalent update sequences produce different physical rewrites and, as a consequence, different changelog entries. This is not a Spark quirk, nor a changelog artifact. It is a direct consequence of Iceberg's snapshot-level contract under copy-on-write semantics.
Comparing logically equivalent sequences under copy-on-write
We start from an empty Iceberg table, queried through Apache Spark and configured with copy-on-write semantics.
```sql
CREATE TABLE example(id INT, value STRING) USING ICEBERG;
```
At this point, the table's logical state is trivial: no rows are visible and no data files exist.
We now express the same logical intent through two different operation sequences.
Sequence A: INSERT INTO + MERGE INTO
```sql
INSERT INTO example VALUES (1, 'a'), (2, 'b'), (3, 'c');
```
Spark executes multiple parallel write tasks. As a result, three independent immutable Parquet data files are produced, one per row.
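Why a single INSERT can yield several files is easy to model: each parallel write task flushes its own rows as a separate immutable file. The sketch below assumes a round-robin split across tasks for simplicity; real Spark partitioning differs.

```python
# Toy sketch of parallel write tasks, each emitting one immutable file.
# write_as_tasks is an illustrative name, not a Spark or Iceberg API.

def write_as_tasks(rows, num_tasks):
    """Split rows round-robin across tasks; each non-empty task emits one file."""
    tasks = [[] for _ in range(num_tasks)]
    for i, row in enumerate(rows):
        tasks[i % num_tasks].append(row)
    return [tuple(task) for task in tasks if task]

# Three rows across three tasks: three single-row files, as in the example.
files = write_as_tasks([(1, "a"), (2, "b"), (3, "c")], num_tasks=3)
# → [((1, 'a'),), ((2, 'b'),), ((3, 'c'),)]
```

With `num_tasks=1` the same rows would land in a single file, which is exactly the difference that separates Sequence A from Sequence B below.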
A subsequent MERGE INTO operation updates one row and inserts two new ones:
```sql
MERGE INTO example t
USING (
  SELECT * FROM VALUES
    (1, 'd'),
    (5, 'e'),
    (6, 'f')
  AS v(id, value)
)
ON t.id = v.id
WHEN MATCHED THEN UPDATE SET value = v.value
WHEN NOT MATCHED THEN INSERT *;
```
Logically, the resulting table contains five rows:
| id | value |
|---|---|
| 1 | d |
| 2 | b |
| 3 | c |
| 5 | e |
| 6 | f |
Physically, only the data file containing id=1 is rewritten. The other two files remain unchanged from the previous snapshot.
In the changelog, rows originating from unchanged files retain their original `_change_ordinal`, while rows introduced or rewritten in the new file appear with a new one.
| id | value | _change_type | _change_ordinal | _commit_snapshot_id |
|---|---|---|---|---|
| 1 | d | INSERT | 1 | 6403164038722032081 |
| 2 | b | INSERT | 0 | 8146078716036017366 |
| 3 | c | INSERT | 0 | 8146078716036017366 |
| 5 | e | INSERT | 1 | 6403164038722032081 |
| 6 | f | INSERT | 1 | 6403164038722032081 |
Sequence B: DOUBLE MERGE INTO
We now apply the same logical changes using two successive MERGE INTO operations.
```sql
MERGE INTO example t
USING (
  SELECT * FROM VALUES
    (1, 'a'),
    (2, 'b'),
    (3, 'c')
  AS v(id, value)
)
ON t.id = v.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
In this case, Spark materializes all three rows into a single immutable Parquet file.
The second MERGE INTO is identical to the one used in Sequence A:
```sql
MERGE INTO example t
USING (
  SELECT * FROM VALUES
    (1, 'd'),
    (5, 'e'),
    (6, 'f')
  AS v(id, value)
)
ON t.id = v.id
WHEN MATCHED THEN UPDATE SET value = v.value
WHEN NOT MATCHED THEN INSERT *;
```
The final logical table state is identical to Sequence A.
Physically, the single data file produced by the first merge is identified as affected and therefore must be fully rewritten under copy-on-write semantics.
As a result, every row appears in the changelog with a new `_change_ordinal`.
| id | value | _change_type | _change_ordinal | _commit_snapshot_id |
|---|---|---|---|---|
| 1 | d | INSERT | 1 | 7023772432058531414 |
| 2 | b | INSERT | 1 | 7023772432058531414 |
| 3 | c | INSERT | 1 | 7023772432058531414 |
| 5 | e | INSERT | 1 | 7023772432058531414 |
| 6 | f | INSERT | 1 | 7023772432058531414 |
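The divergence between the two sequences can be reproduced with a toy model, assuming only that a changelog row carries the ordinal of the snapshot that committed the file it lives in. Nothing here calls Iceberg; the file layouts simply mirror the example above.

```python
# Toy model: snapshots[i] is the list of data files *added* by snapshot i,
# each file an immutable tuple of (id, value) rows. A later file containing
# an id supersedes the earlier row (a copy-on-write rewrite).

def final_ordinals(snapshots):
    """Map each id to the ordinal of the snapshot owning its current row."""
    owner = {}
    for ordinal, files in enumerate(snapshots):
        for data_file in files:
            for row_id, _value in data_file:
                owner[row_id] = ordinal
    return owner

# Sequence A: three single-row files, then one new file rewriting only id=1.
seq_a = [
    [((1, "a"),), ((2, "b"),), ((3, "c"),)],
    [((1, "d"), (5, "e"), (6, "f"))],
]
# Sequence B: one three-row file, fully rewritten by the second merge.
seq_b = [
    [((1, "a"), (2, "b"), (3, "c"))],
    [((1, "d"), (2, "b"), (3, "c"), (5, "e"), (6, "f"))],
]

# Same logical end state, different snapshot ownership:
# final_ordinals(seq_a) → {1: 1, 2: 0, 3: 0, 5: 1, 6: 1}
# final_ordinals(seq_b) → {1: 1, 2: 1, 3: 1, 5: 1, 6: 1}
```

Ids 2 and 3 keep ordinal 0 in Sequence A because their files were never touched; in Sequence B they become carry-over rows and are re-owned by the second snapshot.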
Conclusion
This behavior is expected and necessary.
Changelog tables do not encode semantic diffs at the row level. They encode snapshot ownership.
Under copy-on-write, any row that crosses a file rewrite boundary must be re-materialized and re-associated with the new snapshot in the changelog projection, regardless of whether its logical value changed. Failing to do so would introduce an additional source of truth outside the snapshot, violating Iceberg's correctness guarantees.
This is not an implementation detail, but a direct consequence of defining table correctness at the snapshot level under immutability and copy-on-write semantics.