Use an intermediate table when backfilling wmf_dumps.wikitext_raw_rc1.
(Depends on repos/data-engineering/dumps/mediawiki-content-dump!13 (merged))
This MR replaces the 4 DAGs dumps_merge_backfill_to_wikitext_raw_*
with a single one that is much more efficient in time and resource usage.
From T346281#9170772:
The current backfill takes ~7.5h per group per year on recent years. As stated before, most of that time is wasted re-reading wmf.mediawiki_wikitext_history over and over. This comes to ~7.5h * 22 years * 4 groups = 660 hours = 27.5 days if run sequentially. Because we run each group in parallel, it's actually ~6.8 days using 75% of cluster resources. But if we have an intermediate table (with schema as in T346281#9170438), we get the following: ~19h (create intermediate table) + 2.4h * 22 years = ~72 hours = ~3 days!
🎉 All of this was done using ~18.4% of the cluster resources (100 executors with 24GB RAM, 2 cores each, and in this case spark.sql.adaptive.coalescePartitions.enabled=true)
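For reference, a minimal sketch of the two-phase approach under the settings quoted above. This is not the DAG code: the intermediate table name (wikitext_backfill_intermediate), the filter column (revision_year), the plain SELECT */INSERT INTO statements, and the year range are hypothetical placeholders (the actual schema is the one described in T346281#9170438, and the production job merges rather than plain-inserts).

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dumps_merge_backfill_to_wikitext_raw")
    # Resource shape quoted above: 100 executors, 24GB RAM, 2 cores each (~18.4% of the cluster)
    .config("spark.executor.instances", "100")
    .config("spark.executor.memory", "24g")
    .config("spark.executor.cores", "2")
    # Let AQE merge small shuffle partitions, as in the quoted run
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .enableHiveSupport()
    .getOrCreate()
)

# Phase 1 (~19h in the quoted run): read wmf.mediawiki_wikitext_history once
# and materialize it into an intermediate table laid out for per-year reads.
# Table name and SELECT are placeholders, not the real schema.
spark.sql("""
    CREATE TABLE IF NOT EXISTS wmf_dumps.wikitext_backfill_intermediate AS
    SELECT * FROM wmf.mediawiki_wikitext_history
""")

# Phase 2 (~2.4h per year in the quoted run): each yearly backfill scans only
# the intermediate table instead of re-reading the full history per group.
for year in range(2001, 2023):  # ~22 yearly passes
    spark.sql(f"""
        INSERT INTO wmf_dumps.wikitext_raw_rc1
        SELECT * FROM wmf_dumps.wikitext_backfill_intermediate
        WHERE revision_year = {year}  -- hypothetical partition/filter column
    """)
```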
Bug: T346281