What are some good patterns for cleaning up noisy logging alerts?

In addition to traditional application logging that goes into e.g. Elasticsearch, an organisation may have an alerting system such as Sentry that receives log messages/exception events sent by applications over HTTP and notifies developers of potential problems.
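As a concrete picture of the reporting side, here is a minimal sketch using Sentry's Python SDK. The DSN is the placeholder from Sentry's documentation, and `connect_to_database` is a hypothetical stand-in for real application code:

```python
import sentry_sdk

# Placeholder DSN; a real project would use the one from its Sentry settings.
sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

def connect_to_database():
    # Stand-in for real connection logic, failing here for illustration.
    raise ConnectionError("database unreachable")

try:
    connect_to_database()
except ConnectionError as exc:
    # The SDK serialises the exception and POSTs it to Sentry over HTTPS.
    sentry_sdk.capture_exception(exc)
```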

Suppose that Sentry now contains not only “actionable” events (e.g. an error connecting to the database, which devops should investigate), but has also been polluted with many “non-actionable” events (e.g. user input could not be processed; the user is expected to try again, and there is nothing for devops to do).
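One way to keep such pollution out at the source is the Sentry SDK's `before_send` hook, which can drop an event before it leaves the application. A minimal sketch, assuming a hypothetical `UserInputError` that the application raises for bad input:

```python
import sentry_sdk

class UserInputError(Exception):
    """Hypothetical application exception: invalid user input, user retries."""

def before_send(event, hint):
    # Drop non-actionable events so only devops-relevant ones reach Sentry.
    if "exc_info" in hint:
        _exc_type, exc_value, _tb = hint["exc_info"]
        if isinstance(exc_value, UserInputError):
            return None  # returning None discards the event
    return event

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    before_send=before_send,
)
```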

What are some options for going from a system full of mixed good and bad event data, to a clean system with only good data so that the alerts become meaningful again and don’t get ignored?

Examples:
1) Gradually work through each event, starting with the low-hanging fruit/most common events, and decide whether or not it’s actionable (see the triage sketch after this list).
2) Create a new system and gradually transfer actionable events to it.
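For option 1, Sentry's web API can list a project's issues sorted by event frequency, which gives a natural triage order. A sketch using `requests`; the auth token and the org/project slugs are placeholders, while `sort=freq` and `statsPeriod` are parameters of Sentry's issues endpoint:

```python
import requests

# Placeholders: substitute a real API token and your org/project slugs.
TOKEN = "your-sentry-api-token"
URL = "https://sentry.io/api/0/projects/example-org/example-project/issues/"

# Sort by event frequency over the last two weeks, so triage starts with
# the most common issues (the "low-hanging fruit" from option 1 above).
resp = requests.get(
    URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"sort": "freq", "statsPeriod": "14d"},
)
resp.raise_for_status()

for issue in resp.json():
    print(issue["count"], issue["title"])
```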

Answer

Every alert must require intelligent action. Alerts that require no action guarantee alert fatigue, and eventually real problems get missed. Real problems result in status reports about degraded services, or in issues opened with software developers.

Creating sane alerting from a noisy system is toil; most likely the backlog will not be worked through fast enough.

Consider declaring alert bankruptcy and removing all alerts. Add back only the most basic essentials, such as the error ratio on your API servers and median user response time. For inspiration, see the four golden signals from the Google SRE book.
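As a sketch of what one of those basic essentials can look like, here is an error-ratio check over a window of request counters. The 5% threshold and the plain-integer counters are illustrative assumptions, not something from the answer itself:

```python
def error_ratio(error_count: int, total_count: int) -> float:
    """Fraction of requests that failed over the measurement window."""
    return error_count / total_count if total_count else 0.0

# Hypothetical threshold: alert if more than 5% of requests fail.
ERROR_RATIO_THRESHOLD = 0.05

def should_alert(error_count: int, total_count: int) -> bool:
    return error_ratio(error_count, total_count) > ERROR_RATIO_THRESHOLD

if __name__ == "__main__":
    assert should_alert(error_count=120, total_count=1000)     # 12% -> alert
    assert not should_alert(error_count=10, total_count=1000)  # 1% -> quiet
```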

Going forward, do a root cause analysis on unplanned events and near misses. Where you have data that predicts the problem, add an alert. Schedule the alert for removal when the root cause is resolved and the alert has not fired in a long time.
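One lightweight way to make that removal schedulable is to record, alongside each alert, the root-cause ticket that justified it and a review date. A hypothetical sketch; the `Alert` fields, the 90-day quiet period, and the sample data are illustrative, not any real system's schema:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Alert:
    name: str
    root_cause_ticket: str   # link back to the RCA that motivated the alert
    last_fired: date | None  # None if the alert has never fired
    review_after: date       # reconsider the alert once this date passes

def due_for_removal(alert: Alert, today: date, quiet_period: timedelta) -> bool:
    """True when the review date has passed and the alert has been quiet."""
    if today < alert.review_after:
        return False
    return alert.last_fired is None or today - alert.last_fired >= quiet_period

alerts = [
    Alert("db-connect-errors", "OPS-123", last_fired=None,
          review_after=date(2024, 1, 1)),
]
for a in alerts:
    if due_for_removal(a, date.today(), quiet_period=timedelta(days=90)):
        print(f"Consider removing {a.name} (see {a.root_cause_ticket})")
```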

Attribution
Source: Link, Question Author: Will Sheppard, Answer Author: John Mahowald
