Use new partitioning and bloom filter strategy for ingestion.
Draft: still needs testing in Airflow.
In this MR we:
- Refactor the event ingestion code so that `process_revisions()`, `process_page_deletes()`, and `process_page_moves()` can each run independently. This lets us tune each Spark job on its own, while still being able to run all three in a single Spark job when we ingest smaller event tables, such as the reconciliation table (see the sketch after this list).
- Start utilizing a new partitioning, sorting, and bloom filter strategy to ingest faster, summarized in the table below.
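A rough sketch of the split, assuming a simple step-dispatch entry point (`run()`, `STEPS`, and the stub bodies are hypothetical illustrations, not the MR's actual wiring):

```python
from pyspark.sql import SparkSession

def process_revisions(spark: SparkSession) -> None:
    """Revision-level MERGE INTO (stub; real implementation in this MR)."""

def process_page_deletes(spark: SparkSession) -> None:
    """Page-delete DELETE (stub; real implementation in this MR)."""

def process_page_moves(spark: SparkSession) -> None:
    """Page-move MERGE INTO (stub; real implementation in this MR)."""

STEPS = {
    "revisions": process_revisions,
    "page_deletes": process_page_deletes,
    "page_moves": process_page_moves,
}

def run(spark: SparkSession, steps: list[str]) -> None:
    """Run the selected steps one at a time so each Spark job can be tuned
    independently; pass all three names to keep the single-job behavior
    used for small event tables such as the reconciliation table."""
    for name in steps:
        STEPS[name](spark)
```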
Before | After | Rationale |
---|---|---|
`PARTITIONED BY (wiki_db, months(revision_timestamp))` | `PARTITIONED BY (wiki_db)` | The `revision_timestamp`-based partitioning was no longer helping: the vast majority of the time was spent on the page-delete `DELETE` and page-move `MERGE INTO`, which did a full table scan to find the revisions matching a particular `page_id`. We are better served by sorting on `page_id` instead. |
`WRITE ORDERED BY wiki_db, revision_timestamp` | `WRITE ORDERED BY wiki_db, page_id, revision_timestamp` | This sort order aligns the page-delete `DELETE` and page-move `MERGE INTO` with the data layout; they now typically touch very few files and finish in ~1 min each. |
No bloom filter. | Max 5 MB bloom filter on `revision_id`. | With the two changes above, the revision-level `MERGE INTO` becomes the most time-consuming part, since it now requires full table scans. A bloom filter on `page_id` was not effective before, but one on `revision_id` is very effective under the new partitioning strategy: with far fewer files per `wiki_db`, the time spent probing every file no longer dwarfs the gains from the filter. The revision-level `MERGE INTO` now finishes in ~10 min. |
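Putting the three changes together, here is a hedged sketch of the new layout as Iceberg DDL on Spark. The `catalog.db.revisions` name, the column list, and the `new_events` source view are illustrative assumptions, not the MR's actual DDL; the 5 MB cap is expressed through Iceberg's Parquet bloom filter write properties, which is one way to get a per-file filter of that size:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-layout-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS catalog.db.revisions (
        wiki_db            STRING,
        page_id            BIGINT,
        revision_id        BIGINT,
        revision_timestamp TIMESTAMP
        -- ... remaining revision columns
    )
    USING iceberg
    PARTITIONED BY (wiki_db)
    TBLPROPERTIES (
        -- Parquet bloom filter on revision_id, capped at 5 MB per file
        'write.parquet.bloom-filter-enabled.column.revision_id' = 'true',
        'write.parquet.bloom-filter-max-bytes' = '5242880'
    )
""")

# Cluster rows so the page-level DELETE / MERGE INTO prune to few files.
spark.sql("""
    ALTER TABLE catalog.db.revisions
    WRITE ORDERED BY wiki_db, page_id, revision_timestamp
""")

# The revision-level upsert the bloom filter accelerates: matching on
# revision_id lets the scan skip files whose filters rule the key out.
spark.sql("""
    MERGE INTO catalog.db.revisions t
    USING new_events s
    ON t.wiki_db = s.wiki_db AND t.revision_id = s.revision_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

Sorting by `page_id` within each `wiki_db` partition is what lets the page-level operations prune to a handful of files, while the `revision_id` bloom filter handles the key lookups that the sort order alone cannot.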
Bug: T375402