Add jobs to ingest deletes and moves.

Xcollazo requested to merge ingest-deletes-and-moves into main

So far, we have ignored events that relate to page changes, namely page deletes and moves.

To keep the downstream reconciliation process lightweight, we decided to also consume these events on an hourly basis.

In this MR, we implement page delete and move ingestion as follows:

  • For deletes, we calculate all touched page_ids and inject them into the Iceberg DELETE statement.
  • For moves, we do something similar, but we also JOIN against the target table to be able to resolve all touched revision_ids. This is necessary because moves cannot be applied with an UPDATE statement; they require a MERGE INTO, and that mechanism requires uniqueness on the join clause.
    • Further, we introduce a new control column, row_move_last_update, to be able to apply moves independently of other updates.
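
As an illustration only, the two statements described above could be rendered roughly as follows. This is a minimal sketch, not the code in this MR: the helper names, the `moves_view` temporary view, and the `page_title` and `event_ts` columns are hypothetical, and the view is assumed to hold at most one row per page_id so that the join stays unique per revision_id.

```python
from typing import Iterable


def build_delete_sql(table: str, page_ids: Iterable[int]) -> str:
    """Render a DELETE covering every page touched by a delete event."""
    # Cast to int so only numeric literals are ever injected into the SQL text.
    ids = sorted({int(p) for p in page_ids})
    if not ids:
        raise ValueError("no page_ids to delete")
    id_list = ", ".join(str(i) for i in ids)
    return f"DELETE FROM {table} WHERE page_id IN ({id_list})"


def build_move_merge_sql(table: str, moves_view: str) -> str:
    """Render a MERGE INTO that applies page moves to every affected revision.

    The source subquery joins the move events against the target table to
    resolve page_id into revision_ids, so the MERGE join key is unique.
    Column names page_title and event_ts are hypothetical.
    """
    return (
        f"MERGE INTO {table} AS t\n"
        f"USING (\n"
        f"    SELECT tgt.revision_id, m.page_title, m.event_ts\n"
        f"    FROM {table} AS tgt\n"
        f"    JOIN {moves_view} AS m ON tgt.page_id = m.page_id\n"
        f") AS s\n"
        f"ON t.revision_id = s.revision_id\n"
        # Only apply a move that is newer than the last move already applied.
        f"WHEN MATCHED AND (t.row_move_last_update IS NULL\n"
        f"                  OR t.row_move_last_update < s.event_ts)\n"
        f"THEN UPDATE SET\n"
        f"    t.page_title = s.page_title,\n"
        f"    t.row_move_last_update = s.event_ts"
    )
```

Each rendered string would then be handed to `spark.sql(...)`. Guarding the update on row_move_last_update rather than on a shared timestamp is what lets moves be applied independently of other updates in this sketch.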

These changes introduce two new writes per hour. I benchmarked them to add ~20 minutes of runtime, with most of that cost going to file rewrites. However, with Iceberg's merge-on-read enabled, the whole pipeline (revision-level events, plus deletes and moves) takes ~7 minutes to run for a typical hour of enwiki. Thus, if we incorporate these changes, we should also set 'write.merge.mode' = 'merge-on-read' on the target table, and figure out a rewrite job in a separate MR.
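
For reference, a sketch of how that table property could be set through Spark SQL. The helper name and the table name in the usage note are hypothetical. Note that Iceberg also exposes separate 'write.delete.mode' and 'write.update.mode' properties; whether those should change as well is not decided here, so the default below touches only 'write.merge.mode'.

```python
from typing import Dict, Optional


def build_write_mode_sql(table: str, modes: Optional[Dict[str, str]] = None) -> str:
    """Render an ALTER TABLE that sets Iceberg write-mode table properties."""
    # Default to the single property named in this MR description.
    props = modes or {"write.merge.mode": "merge-on-read"}
    rendered = ",\n".join(f"    '{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n{rendered}\n)"
```

For example, `spark.sql(build_write_mode_sql("some_db.some_table"))` would switch MERGE INTO writes on that table to merge-on-read, leaving compaction of the resulting delete files to the separate rewrite job.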

Bug: T369868