Programmatic declaration of Airflow Pools. Single slot pool for MERGEs into wmf_dumps.wikitext_raw
In this MR we do two things to introduce the use of Airflow pools.
We define a new mechanism, `PoolRegistry`, that borrows heavily from the existing `DagProperties` and `DatasetRegistry` mechanisms. This new mechanism:

- Reads pool definitions from a list of YAML files.
- Keeps these definitions in memory.
- When a pool is needed, creates it via the Airflow `Pool` API if it does not already exist (a rough sketch follows this list).
- Does not delete or mutate any existing pools. That is, if there is an inconsistency between the YAML files and the Airflow DB, the DB wins.
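
For illustration only, here is a minimal sketch of how such a registry could read YAML pool definitions and lazily create missing pools. The `PoolRegistry` shape, the YAML layout, and the file names are assumptions for the example, not necessarily how this MR implements it; the sketch also assumes an Airflow version where `Pool.create_or_update_pool(name, slots, description)` is the signature (newer releases additionally require an `include_deferred` argument).

```python
# Hypothetical sketch: read pool definitions from YAML files and create any
# pools that do not yet exist. Existing pools are left untouched (the DB wins).
from pathlib import Path
from typing import List

import yaml
from airflow.models import Pool


class PoolRegistry:
    """Keeps YAML-declared pool definitions in memory and creates them lazily."""

    def __init__(self, yaml_files: List[Path]):
        self._pools = {}
        for path in yaml_files:
            # Assumed YAML layout:
            # pools:
            #   my_pool_name:
            #     slots: 1
            #     description: "why this pool exists"
            config = yaml.safe_load(path.read_text()) or {}
            self._pools.update(config.get("pools", {}))

    def ensure_pool(self, name: str) -> None:
        """Create the pool only if it is missing; never mutate an existing one."""
        definition = self._pools[name]
        if Pool.get_pool(name) is None:
            Pool.create_or_update_pool(
                name=name,
                slots=definition["slots"],
                description=definition.get("description", ""),
            )
```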
Additionally, in the YAML file for the analytics instance, we add a new pool called `merge_into_mutex_for_wmf_dumps_wikitext_raw`. This single-slot pool lets multiple DAGs that issue MERGE INTOs run their Spark tasks serially against the table `wmf_dumps.wikitext_raw`, effectively mimicking a mutex. This solves an issue where multiple MERGEs on recent data would fail sporadically on conflicts. We also bump the `priority_weight` of the MERGEs from DAG `dumps_merge_backfill_to_wikitext_raw` so that backfills are prioritized.
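
As an illustration of how a task would opt into the pool and the priority bump, here is a hedged sketch. The DAG id, application path, and the `priority_weight` value are placeholders invented for the example; only the `pool` and `priority_weight` parameters (standard `BaseOperator` arguments) reflect the mechanism described above.

```python
import pendulum
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="dumps_merge_example",  # hypothetical DAG id for illustration
    start_date=pendulum.datetime(2023, 10, 1, tz="UTC"),
    schedule=None,
):
    merge_into = SparkSubmitOperator(
        task_id="merge_into_wikitext_raw",
        application="merge_into_wikitext_raw.py",  # placeholder application
        # Only one task across all DAGs using this pool runs at a time,
        # so concurrent MERGEs cannot conflict on the target table.
        pool="merge_into_mutex_for_wmf_dumps_wikitext_raw",
        # Backfill MERGEs get a higher weight so they win when several tasks
        # are queued against the single pool slot.
        priority_weight=10,  # illustrative value, not the one used in the MR
    )
```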
Bug: T347718