Add jobs to ingest deletes and moves.
So far, we have ignored events that relate to page changes, namely page `delete`s and `move`s.
To keep the downstream reconciliation process lightweight, we decided to also consume these events on an hourly basis.
In this MR, we implement page `delete` and `move` ingestion as follows:
- For `delete`s, we calculate all touched `page_id`s and inject those in the Iceberg DELETE statement.
- For `move`s we do similarly, but we also JOIN against the target table to be able to resolve all touched `revision_id`s. This is necessary because moves cannot be applied with an UPDATE statement and require a MERGE INTO, and this mechanism requires uniqueness on the join clause.
  - Further, we introduce a new control column, `row_move_last_update`, to be able to apply `move`s independently of other updates.
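
For illustration only, here is a minimal sketch of the shape of the two statements described above. It is not the code in this MR. `page_id`, `revision_id` and `row_move_last_update` come from the description; the table and view names, and the `wiki_db`, `page_title` and `event_ts` columns, are hypothetical placeholders. The sketch also assumes the moves view holds at most one (latest) move per page, so that the joined source has a single row per `revision_id`.

```python
from __future__ import annotations

import re


def build_delete_sql(target_table: str, wiki_db: str, page_ids: list[int]) -> str:
    """Build a DELETE that removes every row of the touched pages.

    `wiki_db` is a hypothetical partition-like column used to scope page ids.
    """
    if not page_ids:
        raise ValueError("no page_ids to delete")
    if not re.fullmatch(r"[a-z0-9_]+", wiki_db):
        raise ValueError(f"unexpected wiki_db value: {wiki_db!r}")
    # De-duplicate and sort so the generated statement is deterministic.
    id_list = ", ".join(str(int(p)) for p in sorted(set(page_ids)))
    return (
        f"DELETE FROM {target_table}\n"
        f"WHERE wiki_db = '{wiki_db}' AND page_id IN ({id_list})"
    )


def build_move_merge_sql(target_table: str, moves_view: str) -> str:
    """Build a MERGE INTO that applies page moves per revision.

    The source side joins the moves against the target table to resolve all
    touched revision_ids, so the ON clause matches on a unique key.
    Assumes `moves_view` has at most one row per (wiki_db, page_id).
    """
    return (
        f"MERGE INTO {target_table} AS t\n"
        f"USING (\n"
        f"    SELECT r.wiki_db, r.revision_id, m.page_title, m.event_ts\n"
        f"    FROM {moves_view} AS m\n"
        f"    JOIN {target_table} AS r\n"
        f"      ON r.wiki_db = m.wiki_db AND r.page_id = m.page_id\n"
        f") AS s\n"
        f"ON t.wiki_db = s.wiki_db AND t.revision_id = s.revision_id\n"
        # The control column lets moves be applied independently of other updates.
        f"WHEN MATCHED AND (t.row_move_last_update IS NULL\n"
        f"                  OR t.row_move_last_update < s.event_ts)\n"
        f"THEN UPDATE SET\n"
        f"    t.page_title = s.page_title,\n"
        f"    t.row_move_last_update = s.event_ts"
    )


if __name__ == "__main__":
    print(build_delete_sql("db.target_table", "enwiki", [42, 7, 42]))
    print(build_move_merge_sql("db.target_table", "page_moves_for_hour"))
```

The resulting strings could be passed to something like `spark.sql(...)`. If a page can move more than once within the hour, the moves view would need to be reduced to the latest move per page first, otherwise the MERGE source is no longer unique on the join key.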
These changes introduce two new writes per hour, and I benchmarked them to add ~20 minutes more, with most of that cost going to file rewrites. However, when enabling Iceberg's `merge-on-read`, the whole pipeline (revision-level events, plus `delete`s and `move`s) takes ~7 minutes to run for a typical hour of `enwiki`. Thus, if we incorporate these changes, we should also set the target table to `'write.merge.mode' = 'merge-on-read'`, and figure out a rewrite job in a separate MR.
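
As a sketch of how the table property could be set (the table name is a hypothetical placeholder, and the helper below is not part of this MR), the statement is a plain `ALTER TABLE ... SET TBLPROPERTIES`:

```python
from __future__ import annotations


def build_set_properties_sql(target_table: str, properties: dict[str, str]) -> str:
    """Build an ALTER TABLE statement that sets the given table properties."""
    if not properties:
        raise ValueError("no properties to set")
    # Sort keys so the generated statement is deterministic.
    props = ", ".join(f"'{key}' = '{value}'" for key, value in sorted(properties.items()))
    return f"ALTER TABLE {target_table} SET TBLPROPERTIES ({props})"


if __name__ == "__main__":
    # Prints:
    # ALTER TABLE db.target_table SET TBLPROPERTIES ('write.merge.mode' = 'merge-on-read')
    print(build_set_properties_sql("db.target_table", {"write.merge.mode": "merge-on-read"}))
```

With merge-on-read, writes produce delete files instead of rewriting data files, which is why a separate rewrite (compaction) job is still needed.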
Bug: T369868