Move logstash checker code into scap

Move the logstash checker code (formerly in operations/puppet) into
scap, which is the only program that uses it as far as I can tell.
Moving the code into scap makes it more efficient for the Release
Engineering team to make changes to it when needed.
To avoid a moving baseline, the logstash checker no longer queries the
past hour of samples. Instead the "canary_threshold" (changed from a
float to an int) scap configuration value is used as the limit on the
number of errors seen in the last "canary_wait_time" seconds. I set
canary_threshold to 10 based on analysis of the last 90 days of canary
logs (using 20-second windows and discarding outliers). 10 is about
3 standard deviations above the mean of non-zero samples.
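
For illustration, the fixed-threshold check described above can be
sketched roughly as follows. This is a minimal sketch, not the code in
this change: the function name, the list-of-timestamps input, and the
20 second default for canary_wait_time are assumptions made for the
example. Only canary_threshold = 10 comes from the text above.

```python
def canary_check_passes(error_timestamps, now,
                        canary_wait_time=20, canary_threshold=10):
    """Return True if the error count in the last canary_wait_time
    seconds does not exceed canary_threshold.

    Hypothetical helper: error_timestamps is a list of epoch seconds,
    one entry per error seen in logstash.
    """
    window_start = now - canary_wait_time
    # Only errors inside the fixed window count; there is no comparison
    # against an earlier baseline period.
    recent = [t for t in error_timestamps if window_start <= t <= now]
    return len(recent) <= canary_threshold
```

Because the limit is an absolute count, the result depends only on the
current window and not on how noisy the preceding hour was.
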
Also, when retrying the canary check, the canary_wait_time wait
happens again (unless the previous check resulted in an unexpected
error). This allows time for new data to arrive; otherwise an
immediate recheck is likely to have the same outcome.
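
The retry behaviour described above could look something like the
sketch below. All names here (run_canary_checks, UnexpectedCheckError,
the retries parameter, the injectable sleep) are hypothetical and
chosen for the example; they are not taken from the scap code.

```python
import time


class UnexpectedCheckError(Exception):
    """Hypothetical: the check itself failed to run (not a threshold failure)."""


def run_canary_checks(check, canary_wait_time, retries=2, sleep=time.sleep):
    """Run check() up to retries + 1 times, waiting before each attempt
    unless the previous attempt ended in an unexpected error."""
    wait = True
    for _attempt in range(retries + 1):
        if wait:
            # Give logstash time to collect fresh samples.
            sleep(canary_wait_time)
        try:
            if check():
                return True
            # Threshold exceeded: wait again before rechecking.
            wait = True
        except UnexpectedCheckError:
            # No verdict was produced, so retry without waiting again.
            wait = False
    return False
```

Passing sleep as a parameter is only a convenience so the sketch can be
exercised without real delays.
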
When the canary check error rate threshold has been exceeded, print
the top 5 errors seen.
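
One simple way to pick the most frequent errors is
collections.Counter. The sketch below assumes the errors are available
as a list of message strings; the top_errors name and that input shape
are assumptions for the example, not the actual implementation.

```python
from collections import Counter


def top_errors(messages, n=5):
    """Return the n most frequent error messages as (message, count) pairs,
    most frequent first."""
    return Counter(messages).most_common(n)
```
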
Create utils.log_large_message() from a method previously in
kubernetes.py, and make sure it uses proper format strings to log.
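
As a rough illustration of logging with a proper format string, the
sketch below passes each line as an argument to a constant "%s" format
rather than using the text itself as the format. The signature and the
line-by-line splitting are assumptions for the example; the real
utils.log_large_message() may differ.

```python
import logging


def log_large_message(message, logger, level=logging.INFO):
    """Log a multi-line message one line at a time.

    Hypothetical signature. Each line is supplied as an argument to a
    constant "%s" format string, so any "%" characters in the text are
    logged literally.
    """
    for line in message.splitlines():
        logger.log(level, "%s", line)
```
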