Move logstash checker code into scap (!358) · Merge requests · repos / releng / Scap

Ahmon Dancy requested to merge master-I8ec33a8cdad453e35e3c67840f3e0ee843d28dad into master Jun 17, 2024

Move the logstash checker code (formerly in operations/puppet) into
scap, which is the only program that uses it as far as I can tell.
Moving the code into scap makes it more efficient for the Release
Engineering team to make changes to it when needed.

To avoid a moving baseline, the logstash checker no longer queries the
past hour of samples. Instead, the "canary_threshold" scap configuration
value (changed from a float to an int) is used as the limit on the
number of errors seen in the last "canary_wait_time" seconds. I set
canary_threshold to 10 based on analysis of the last 90 days of canary
logs (using 20-second windows and discarding outliers). 10 is about
3 standard deviations above the mean of non-zero samples.
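
As a rough illustration of the fixed-threshold check described above (not the code in this merge request): the function names below are hypothetical, error timestamps are assumed to be plain epoch seconds, and "exceeded" is assumed to mean strictly greater than canary_threshold.

```python
import time


def count_recent_errors(error_timestamps, wait_time, now=None):
    """Count errors whose timestamp falls within the last wait_time seconds."""
    if now is None:
        now = time.time()
    cutoff = now - wait_time
    return sum(1 for ts in error_timestamps if cutoff <= ts <= now)


def canary_passes(error_timestamps, canary_threshold, canary_wait_time, now=None):
    """Return True if the recent error count does not exceed the threshold.

    Assumption: the check fails only when the count is strictly greater
    than canary_threshold.
    """
    recent = count_recent_errors(error_timestamps, canary_wait_time, now)
    return recent <= canary_threshold
```

With canary_threshold set to 10, ten errors inside the window would still pass under this assumption and eleven would fail.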

Also, when retrying the canary check, the canary_wait_time wait
happens again (unless the previous check resulted in an unexpected
error). This allows time for new data to arrive; otherwise an
immediate recheck is likely to have the same outcome.
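
The retry behavior above could look something like the following sketch. Everything here is hypothetical naming for illustration: `UnexpectedCheckError`, `run_canary_check_with_retries`, the `attempts` parameter, and the injectable `sleep` argument are not taken from scap.

```python
import time


class UnexpectedCheckError(Exception):
    """Hypothetical: the check itself failed (for example, the query errored)."""


def run_canary_check_with_retries(check, canary_wait_time, attempts, sleep=time.sleep):
    """Run check() up to `attempts` times, waiting before each real recheck.

    The wait is skipped only when the previous attempt ended in an
    unexpected error, since no new data is needed to try again.
    """
    wait_first = True
    for _ in range(attempts):
        if wait_first:
            sleep(canary_wait_time)
        try:
            if check():
                return True
            # Threshold exceeded: wait for fresh data before rechecking.
            wait_first = True
        except UnexpectedCheckError:
            # The check did not produce a result: retry without waiting.
            wait_first = False
    return False
```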

When the canary check error rate threshold has been exceeded, print
the top 5 errors seen.
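
One simple way to pick the most frequent errors is `collections.Counter`. This is a minimal sketch; `top_errors` is a hypothetical helper, and it assumes the errors are available as a list of message strings.

```python
from collections import Counter


def top_errors(messages, limit=5):
    """Return (message, count) pairs for the most frequent messages."""
    return Counter(messages).most_common(limit)
```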

Create utils.log_large_message() from a method previously in
kubernetes.py, and make sure it uses proper format strings when logging.
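
For illustration, a helper of this kind might look like the sketch below. The signature and the line-by-line splitting are assumptions, not the actual utils.log_large_message() from this change. The point it shows is passing the text as an argument to a constant "%s" format string, which keeps the message content out of the format string itself.

```python
import logging


def log_large_message(message, logger, level=logging.INFO):
    """Log a large multi-line message one line at a time (hypothetical signature)."""
    for line in message.splitlines():
        # The line is passed as an argument, never used as the format string.
        logger.log(level, "%s", line)
```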

Bug: T367131
Bug: T183999
Bug: T159991

Edited Jun 20, 2024 by Ahmon Dancy
