Make pod and pvc watcher threads resilient to errors
Sometimes the Kubernetes server will drop a connection associated with
an open watch (probably after some timeout period). This was causing
pvc-cleaner to terminate regularly, preventing it from properly
tracking idle PVCs.
This commit makes the following changes in an attempt to deal with this:
- Set _request_timeout=60 in the Watch.stream() call. If no event
  is received within that timeout period, the watch will raise a
  timeout exception. We catch this exception and restart the watch
  when it happens. This results in a fresh query to the k8s api
  server on a regular basis, reducing the likelihood of server-side
  idle timeouts.
- Also catch urllib3.exceptions.ProtocolError and treat it the same
  way as the timeout (restart the watch). This is the exception that
  was seen in production.
Also, factor out code in common between watch_pvcs and watch_pods.
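
For illustration, the restart-on-error pattern described above can be
sketched roughly as follows. This is a minimal sketch and not the code
in this commit: run_watch, open_stream, handle_event, retryable and
should_stop are hypothetical names, and the exception types are passed
in so that the example runs without the kubernetes or urllib3 packages.

```python
from typing import Callable, Iterable, Tuple, Type


def run_watch(
    open_stream: Callable[[], Iterable[object]],
    handle_event: Callable[[object], None],
    retryable: Tuple[Type[BaseException], ...],
    should_stop: Callable[[], bool],
) -> int:
    """Consume events from a watch, reopening it on retryable errors.

    Hypothetical helper: a single loop like this could back both the
    pod watcher and the PVC watcher. Returns the number of restarts
    that were triggered by a retryable exception.
    """
    restarts = 0
    while not should_stop():
        try:
            # Each call to open_stream() issues a fresh query.
            for event in open_stream():
                handle_event(event)
        except retryable:
            # Timeout or dropped connection: loop around and reopen.
            restarts += 1
        # A stream that ends without error is also simply reopened.
    return restarts


# Assumed usage with the kubernetes client (not executed here):
#
#   run_watch(
#       open_stream=lambda: watch.Watch().stream(
#           v1.list_namespaced_pod, namespace, _request_timeout=60),
#       handle_event=process_pod_event,
#       retryable=(urllib3.exceptions.ReadTimeoutError,
#                  urllib3.exceptions.ProtocolError),
#       should_stop=lambda: False,
#   )
#
# v1, namespace and process_pod_event are placeholders, and
# ReadTimeoutError is an assumption about which timeout exception
# the watch raises.
```

Any exception not listed in retryable still propagates, so unexpected
failures stay visible instead of being retried silently.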
Bug: T351478