
Use new partitioning and bloom filter strategy for ingestion.

Xcollazo requested to merge tune-ingestion into main

Draft: Still need to test this in Airflow

In this MR we:

  • Refactor the event ingestion code so that process_revisions(), process_page_deletes() and process_page_moves() can each run independently. This lets us tune each Spark job separately, while still being able to run them all in one Spark job when ingesting smaller event tables, such as the reconciliation table. A sketch of the split entrypoint is included below the table.
  • Start using a new partitioning, sorting, and bloom filter strategy to ingest faster; the table below summarizes the changes, and a DDL sketch follows it.
| Before | After | Rationale |
| --- | --- | --- |
| `PARTITIONED BY (wiki_db, months(revision_timestamp))` | `PARTITIONED BY (wiki_db)` | The revision_timestamp-based partitioning was no longer helping: the vast majority of the time was spent on the page deletes DELETE and the page moves MERGE INTO, which were doing a full table scan to find the revisions matching a particular page_id. We are better served by sorting on page_id instead. |
| `WRITE ORDERED BY wiki_db, revision_timestamp` | `WRITE ORDERED BY wiki_db, page_id, revision_timestamp` | This sort order aligns the page deletes DELETE and the page moves MERGE INTO with the data layout; they now typically touch very few files, finishing in ~1 minute each. |
| No bloom filter. | Max 5MB bloom filter on revision_id. | With the two changes above, the revision-level MERGE INTO becomes the step that consumes the most time, since it still requires full table scans. While a bloom filter was not effective before when applied to page_id, it is very effective on revision_id with the new partitioning strategy: we now have far fewer files per wiki_db, so the time spent going over all the files no longer dwarfs the gains from the bloom filter. The revision-level MERGE INTO now finishes in ~10 minutes. |
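
To make the new layout concrete, here is a minimal PySpark sketch of the corresponding DDL, assuming Iceberg's Spark SQL extensions are enabled. The table name is a placeholder (not the real one from this MR), and the 5MB cap is expressed through the standard Iceberg Parquet bloom filter write properties:

```python
# Minimal sketch (not the MR's actual code): apply the new partitioning, sort
# order, and bloom filter via Iceberg's Spark SQL extensions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
TABLE = "catalog.db.revisions"  # placeholder table name

# Drop the months(revision_timestamp) partition field, leaving wiki_db only.
spark.sql(f"ALTER TABLE {TABLE} DROP PARTITION FIELD months(revision_timestamp)")

# Align the file sort order with the page-level DELETE / MERGE INTO predicates.
spark.sql(f"ALTER TABLE {TABLE} WRITE ORDERED BY wiki_db, page_id, revision_timestamp")

# Enable a Parquet bloom filter on revision_id, capped at 5MB per file, to
# speed up the revision-level MERGE INTO.
spark.sql(f"""
  ALTER TABLE {TABLE} SET TBLPROPERTIES (
    'write.parquet.bloom-filter-enabled.column.revision_id' = 'true',
    'write.parquet.bloom-filter-max-bytes' = '5242880'
  )
""")
```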
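And a rough sketch of what the refactor enables on the job side, assuming a PySpark entrypoint. The module name `event_ingestion` and the CLI shape are assumptions; only the three process_*() functions come from this MR:

```python
# Sketch only: each ingestion step can be run as its own Spark job, or all
# steps together for small event tables such as the reconciliation table.
from argparse import ArgumentParser

# Hypothetical module path; the real functions are introduced in this MR.
from event_ingestion import (
    process_page_deletes,
    process_page_moves,
    process_revisions,
)

STEPS = {
    "revisions": process_revisions,
    "page_deletes": process_page_deletes,
    "page_moves": process_page_moves,
}

def main() -> None:
    parser = ArgumentParser(description="Run one or more ingestion steps.")
    parser.add_argument(
        "--steps",
        nargs="+",
        choices=sorted(STEPS),
        default=sorted(STEPS),
        help="Which steps to run; defaults to all of them.",
    )
    for name in parser.parse_args().steps:
        STEPS[name]()

if __name__ == "__main__":
    main()
```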

Bug: T375402
