Avoid deadlock between DAG runs & Give more chances for auto-correction
Avoid deadlock between DAG runs
We currently have these two configurations (sketched below):
- a limit on the number of active DAG runs (currently 16);
- a dependency on the past (depends_on_past) between tasks, set in the default args.
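A minimal sketch of this combination, assuming an Airflow 2 DAG. The dag_id, schedule, and task are illustrative assumptions, not the actual aqs/hourly code; the 16-run limit corresponds to a per-DAG max_active_runs (or Airflow's default max_active_runs_per_dag).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    # Each task instance waits for its own previous run to have succeeded.
    "depends_on_past": True,
}

with DAG(
    dag_id="aqs_hourly_example",        # hypothetical id, not the real aqs/hourly DAG
    start_date=datetime(2022, 2, 1),
    schedule_interval="@hourly",
    max_active_runs=16,                 # at most 16 DAG runs may be "running" at once
    default_args=default_args,
) as dag:
    compute = BashOperator(task_id="compute", bash_command="echo compute")
```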
This combination is a recipe for deadlock. We may end up in this kind of scenario:
- One task fails. Eventually, its DAG run is marked as "failed".
- The scheduler creates the next DAG run, and one of its tasks waits on the previous run because of depends_on_past.
- The scheduler launches 15 more DAG runs (all of them "running"). None of them advance because their tasks depend on the past.
- The maintainer arrives and fixes the task code or the environment, so the task should work on its next execution.
- But the failed task is not re-run, because its DAG run can't go from "failed" back to "running" while 16 DAG runs are already "running"...
The current workaround is to mark one of the 16 DAG runs as "failed" and then turn it back to "success" again. This lets the first DAG run, the one that failed, execute and unlocks the situation.
This kind of scenario most likely happens only for hourly DAGs.
One solution is to avoid unnecessary "depends_on_past", which is the case here for aqs/hourly. Another solution would be to set the maximum number of active DAG runs to a higher value. Both options are sketched below.
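A hedged sketch of the two options; the parameter names are real Airflow settings, but the dag_id and values are illustrative assumptions rather than the actual fix.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    # Option 1: drop the dependency on the past where it is not actually needed.
    "depends_on_past": False,
}

with DAG(
    dag_id="aqs_hourly_example",        # hypothetical id
    start_date=datetime(2022, 2, 1),
    schedule_interval="@hourly",
    # Option 2: allow more simultaneous "running" DAG runs than the default 16,
    # so one failed run cannot exhaust all the slots (the value is illustrative).
    max_active_runs=48,
    default_args=default_args,
) as dag:
    compute = BashOperator(task_id="compute", bash_command="echo compute")
```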
Give more chances for auto-correction
A server restart may have happened during the night from Saturday to Sunday (2022-02-27 00:00:00).
Two errors stopped a Hive partition sensor:
- MySQLdb._exceptions.OperationalError: (2026, 'SSL connection error: Error in the pull function.')
- airflow.exceptions.AirflowException: Task received SIGTERM signal
The next run of the task, triggered manually, went fine.
I suggest increasing the number of retries in the default args.
Consequently, large, long-running, or resource-intensive tasks would need to decrease this value in their task definition, as sketched below.
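A minimal sketch of the retry change, assuming Airflow default_args; the retry counts, delay, and task names are assumptions, not the actual job configuration.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 5,                           # more chances to recover from transient errors
    "retry_delay": timedelta(minutes=10),   # give the environment time to come back
}

with DAG(
    dag_id="aqs_hourly_example",            # hypothetical id
    start_date=datetime(2022, 2, 1),
    schedule_interval="@hourly",
    default_args=default_args,
) as dag:
    # Cheap sensor-like task: inherits the 5 retries from default_args.
    wait = BashOperator(task_id="wait_for_partition", bash_command="echo wait")

    # Resource-intensive task: overrides retries downward in its own definition.
    heavy = BashOperator(
        task_id="large_aggregation",
        bash_command="echo run-heavy-job",
        retries=1,
    )

    wait >> heavy
```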