
Use new partitioning and bloom filter strategy for ingestion.

Xcollazo requested to merge tune-ingestion into main

Draft: Still need to test this in Airflow

In this MR we:

  • Refactor the event ingestion code so that process_revisions(), process_page_deletes() and process_page_moves() can each run independently. This lets us tune each Spark job separately, while still being able to run them all in one Spark job when ingesting smaller event tables, such as the reconciliation table. A sketch of the split entrypoint is included below the table.
  • Start using a new partitioning, sorting, and bloom filter strategy to ingest faster; the table below summarizes the changes, and a DDL sketch follows it.
| Before | After | Rationale |
| --- | --- | --- |
| `PARTITIONED BY (wiki_db, months(revision_timestamp))` | `PARTITIONED BY (wiki_db)` | The revision_timestamp-based partitioning was no longer helping: the vast majority of the time was spent on the page deletes DELETE and the page moves MERGE INTO, which were doing a full table scan to find the revisions matching a particular page_id. We are better served by sorting on page_id instead. |
| `WRITE ORDERED BY wiki_db, revision_timestamp` | `WRITE ORDERED BY wiki_db, page_id, revision_timestamp` | This sort order aligns the page deletes DELETE and the page moves MERGE INTO with the data layout; they now typically touch very few files, finishing in ~1 minute each. |
| No bloom filter. | Max 5MB bloom filter on revision_id. | With the two changes above, the revision-level MERGE INTO becomes the step that consumes the most time, since it still requires full table scans. While a bloom filter was not effective before when applied to page_id, it is very effective on revision_id with the new partitioning strategy: we now have far fewer files per wiki_db, so the time spent going over all the files no longer dwarfs the gains from the bloom filter. The revision-level MERGE INTO now finishes in ~10 minutes. |
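
To make the new layout concrete, here is a minimal PySpark sketch of the corresponding DDL, assuming Iceberg's Spark SQL extensions are enabled. The table name is a placeholder (not the real one from this MR), and the 5MB cap is expressed through the standard Iceberg Parquet bloom filter write properties:

```python
# Minimal sketch (not the MR's actual code): apply the new partitioning, sort
# order, and bloom filter via Iceberg's Spark SQL extensions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
TABLE = "catalog.db.revisions"  # placeholder table name

# Drop the months(revision_timestamp) partition field, leaving wiki_db only.
spark.sql(f"ALTER TABLE {TABLE} DROP PARTITION FIELD months(revision_timestamp)")

# Align the file sort order with the page-level DELETE / MERGE INTO predicates.
spark.sql(f"ALTER TABLE {TABLE} WRITE ORDERED BY wiki_db, page_id, revision_timestamp")

# Enable a Parquet bloom filter on revision_id, capped at 5MB per file, to
# speed up the revision-level MERGE INTO.
spark.sql(f"""
  ALTER TABLE {TABLE} SET TBLPROPERTIES (
    'write.parquet.bloom-filter-enabled.column.revision_id' = 'true',
    'write.parquet.bloom-filter-max-bytes' = '5242880'
  )
""")
```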
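And a rough sketch of what the refactor enables on the job side, assuming a PySpark entrypoint. The module name `event_ingestion` and the CLI shape are assumptions; only the three process_*() functions come from this MR:

```python
# Sketch only: each ingestion step can be run as its own Spark job, or all
# steps together for small event tables such as the reconciliation table.
from argparse import ArgumentParser

# Hypothetical module path; the real functions are introduced in this MR.
from event_ingestion import (
    process_page_deletes,
    process_page_moves,
    process_revisions,
)

STEPS = {
    "revisions": process_revisions,
    "page_deletes": process_page_deletes,
    "page_moves": process_page_moves,
}

def main() -> None:
    parser = ArgumentParser(description="Run one or more ingestion steps.")
    parser.add_argument(
        "--steps",
        nargs="+",
        choices=sorted(STEPS),
        default=sorted(STEPS),
        help="Which steps to run; defaults to all of them.",
    )
    for name in parser.parse_args().steps:
        STEPS[name]()

if __name__ == "__main__":
    main()
```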

Bug: T375402
