Mitigate refine Webrequest concurrency problem

Aqu requested to merge T376882_webrequest_concurrency_on_write_failure into main

We are seeing missing files in webrequest data for certain hours, and we suspect the cause is concurrent access to the _temporary folder that Spark uses when running with algorithm.version=1 (the Hadoop setting mapreduce.fileoutputcommitter.algorithm.version). In this version, Spark stages task outputs in the _temporary directory and moves them to the final output location (in our case, the Hive partition folder) only when the job commits.

The problem arises because both the upload web request refine process and the text web request refine process are using the same _temporary directory. One process may be clearing the _temporary folder while the other is still writing to it, potentially causing the missing files.
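
The suspected failure mode can be reproduced in miniature with plain filesystem operations. The sketch below is a simplified simulation, not Spark's committer code: the directory layout, the `write_task_output` and `job_commit` helpers, and the partition names are hypothetical, and the two jobs are interleaved by hand instead of running concurrently.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


def write_task_output(table_dir: Path, job: str, name: str) -> Path:
    """Stage a task's output file under the shared _temporary directory."""
    staged = table_dir / "_temporary" / job / name
    staged.parent.mkdir(parents=True, exist_ok=True)
    staged.write_text(f"rows from {job}\n")
    return staged


def job_commit(table_dir: Path, job: str, partition: str) -> List[str]:
    """Move this job's staged files to its partition, then clear _temporary."""
    final_dir = table_dir / partition
    final_dir.mkdir(parents=True, exist_ok=True)
    staged_dir = table_dir / "_temporary" / job
    moved = []
    if staged_dir.exists():
        for staged_file in sorted(staged_dir.iterdir()):
            shutil.move(str(staged_file), str(final_dir / staged_file.name))
            moved.append(staged_file.name)
    # Cleanup removes the whole shared directory, not just this job's part.
    shutil.rmtree(table_dir / "_temporary", ignore_errors=True)
    return moved


def simulate() -> Dict[str, List[str]]:
    """Interleave two jobs that share one _temporary directory."""
    with tempfile.TemporaryDirectory() as root:
        table_dir = Path(root) / "webrequest"
        write_task_output(table_dir, "text", "part-00000")
        write_task_output(table_dir, "upload", "part-00000")
        # "upload" commits first and wipes _temporary, including the file
        # that "text" has staged but not yet committed.
        upload_files = job_commit(table_dir, "upload", "webrequest_source=upload")
        text_files = job_commit(table_dir, "text", "webrequest_source=text")
        return {"upload": upload_files, "text": text_files}


result = simulate()
```

In this simulation the `upload` job ends up with its file in place, while the `text` job commits nothing because its staged file was deleted by the other job's cleanup.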

With algorithm.version=2, each task moves its output straight to the final output location when the task commits, instead of leaving it staged in the _temporary directory until the job commits. This improves performance and reduces exposure to concurrent cleanup of the temporary folder, but it carries the risk of leaving partial output files behind if a task or the job fails partway through.

In our situation, switching to algorithm.version=2 might mitigate the concurrency issue by minimizing how long each process's output stays in the _temporary directory shared by the two processes.
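
For reference, a minimal sketch of how the setting could be passed to a Spark job. `algorithm.version` is shorthand for the Hadoop property `mapreduce.fileoutputcommitter.algorithm.version`, which Spark accepts with a `spark.hadoop.` prefix. The helper name and the way the arguments are assembled are illustrative assumptions, not the code changed in this merge request.

```python
from typing import List

# Hadoop FileOutputCommitter setting, exposed to Spark via the spark.hadoop. prefix.
COMMITTER_VERSION_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"


def committer_conf_args(version: int) -> List[str]:
    """Build the spark-submit arguments that select the commit algorithm."""
    if version not in (1, 2):
        raise ValueError(f"unsupported commit algorithm version: {version}")
    return ["--conf", f"{COMMITTER_VERSION_KEY}={version}"]


# Hypothetical usage: append these to an existing spark-submit command line.
args = committer_conf_args(2)
```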

Bug: T376882
