Keep is an open-source alert management platform — think PagerDuty meets Zapier for on-call engineers. It ingests alerts from 130+ providers, deduplicates and correlates them into incidents, and runs YAML-defined workflows that can query data, evaluate conditions, and fire actions across all the same integrations.
Why I starred it
The problem Keep targets is real: you're running Grafana, Datadog, CloudWatch, and some custom webhook — each screaming at you in its own format, with its own severity vocabulary, through its own notification channel. By the time an incident is declared, three people have already acked three different alerts for the same thing.
What caught my eye wasn't the feature list — it was the provider model. Keep didn't build 130 integrations by hand. They built one solid abstraction and made it composable.
How it works
The core abstraction lives in keep/providers/base/base_provider.py. Every integration inherits from BaseProvider, which declares class-level metadata:
class BaseProvider(metaclass=abc.ABCMeta):
    FINGERPRINT_FIELDS: list[str] = []
    PROVIDER_CATEGORY: list[Literal["Monitoring", "Ticketing", "Collaboration", ...]] = ["Others"]
    PROVIDER_TAGS: list[Literal["alert", "ticketing", "messaging", "data", ...]] = []
    PROVIDER_SCOPES: list[ProviderScope] = []
    PROVIDER_METHODS: list[ProviderMethod] = []
FINGERPRINT_FIELDS is the key one. Each provider declares which fields on an incoming alert uniquely identify it — so the deduplication layer doesn't need to know anything about Datadog vs Grafana semantics. It just hashes those fields and compares.
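The mechanics are easy to picture. Here's a minimal sketch of that idea, with made-up field names and alert shapes (Keep's real implementation lives in base_provider.py and handles more cases):

```python
import hashlib

def fingerprint(alert: dict, fields: list[str]) -> str:
    """Hash only the declared identity fields, so alerts that differ in
    timestamps or payload noise still collapse onto one identity."""
    joined = "|".join(str(alert.get(f, "")) for f in fields)
    return hashlib.sha256(joined.encode()).hexdigest()

# Hypothetical FINGERPRINT_FIELDS for one provider:
fields = ["monitor_id", "host"]
first = {"monitor_id": 42, "host": "web-1", "status": "Alert", "ts": 1}
update = {"monitor_id": 42, "host": "web-1", "status": "Warn", "ts": 2}

# Same identity fields, same fingerprint -- despite the other fields changing.
assert fingerprint(first, fields) == fingerprint(update, fields)
```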
The deduplication logic is in keep/api/alert_deduplicator/alert_deduplicator.py. It's a two-tier system: full duplicates (same fingerprint, same hash — alert is suppressed) and partial duplicates (same fingerprint, different hash — alert is an update to an existing one). The hash is SHA-256 over the alert dict after stripping the fields configured in the dedup rule:
alert_hash = hashlib.sha256(
    json.dumps(alert_copy.dict(), default=str, sort_keys=True).encode()
).hexdigest()
Dedup rules can be per-provider or custom, pulled from the DB at evaluation time.
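The two tiers are easy to sketch. This is an illustrative version of the decision, not Keep's actual code — the ignore list and field names are invented for the example:

```python
import hashlib
import json

def alert_hash(alert, ignore_fields=("lastReceived",)):
    """SHA-256 over the alert after stripping fields the dedup rule ignores
    (the ignore list here is illustrative, not Keep's default)."""
    stripped = {k: v for k, v in alert.items() if k not in ignore_fields}
    return hashlib.sha256(
        json.dumps(stripped, default=str, sort_keys=True).encode()
    ).hexdigest()

def classify(incoming, last_seen, fingerprint_fields=("name", "host")):
    """Full duplicate -> suppress; partial -> update; otherwise a new alert."""
    fp = lambda a: tuple(a.get(f) for f in fingerprint_fields)
    if last_seen is None or fp(incoming) != fp(last_seen):
        return "new"
    if alert_hash(incoming) == alert_hash(last_seen):
        return "full-duplicate"   # suppressed
    return "partial-duplicate"    # treated as an update to the existing alert

base = {"name": "cpu-high", "host": "web-1", "severity": "high", "lastReceived": "10:00"}
retransmit = dict(base, lastReceived="10:05")   # same payload, new timestamp
escalated = dict(base, severity="critical")     # same identity, new content

assert classify(retransmit, base) == "full-duplicate"
assert classify(escalated, base) == "partial-duplicate"
```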
The rules engine in keep/rulesengine/rulesengine.py uses cel-python, a Python implementation of Google's Common Expression Language, to evaluate alert correlation rules. There's a performance workaround I noticed: they monkeypatch __repr__ on half a dozen CEL types to suppress logging overhead:
celpy.evaluation.Referent.__repr__ = lambda self: ""
celpy.celtypes.MapType.__repr__ = lambda self: ""
# ... six more of these
Not pretty, but it's an honest fix for a real profiling problem. The upstream issue is linked in the comment.
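The pattern itself is plain Python: rebind a method on the class object and every instance picks it up. A stand-alone illustration — the Referent here is a stand-in class, not the real celpy type:

```python
class Referent:
    """Stand-in for a library type whose __repr__ is expensive,
    e.g. one that walks a large nested evaluation tree."""
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"Referent({self.value!r})"  # imagine this being costly

# Keep-style suppression: rebind __repr__ on the class itself, so hot
# paths that stringify these objects for logging become near-free.
Referent.__repr__ = lambda self: ""

assert repr(Referent(42)) == ""
```

The trade-off is exactly what the patches imply: once applied, any debugging output that stringifies those types goes dark.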
The workflow engine in keep/workflowmanager/ runs on a ThreadPoolExecutor (default 20 workers, configurable via KEEP_MAX_WORKFLOW_WORKERS). Workflows have three execution strategies defined in workflow.py: PARALLEL, NONPARALLEL, and NONPARALLEL_WITH_RETRY — the last being the default. If a workflow for a given fingerprint is already running and you get another alert for the same fingerprint, Keep re-queues it rather than dropping it or spawning a parallel run. That's the right default for stateful ops like ticket creation.
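The re-queue behavior can be sketched with stdlib pieces. This is an illustration of the strategy, not Keep's actual workflowmanager code — the class and method names are invented, and the retry loop that drains the queue is omitted:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

class NonParallelScheduler:
    """Illustrative NONPARALLEL_WITH_RETRY-style scheduling: at most one
    run per fingerprint at a time; overlapping triggers are re-queued,
    never dropped and never run in parallel."""

    def __init__(self, max_workers=20):  # mirrors KEEP_MAX_WORKFLOW_WORKERS
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._running = set()
        self._requeued = queue.Queue()   # drained by a retry loop, not shown
        self._lock = threading.Lock()

    def trigger(self, fingerprint, run):
        with self._lock:
            if fingerprint in self._running:
                self._requeued.put((fingerprint, run))  # re-queue, don't drop
                return "requeued"
            self._running.add(fingerprint)
        self._pool.submit(self._execute, fingerprint, run)
        return "started"

    def _execute(self, fingerprint, run):
        try:
            run()
        finally:
            with self._lock:
                self._running.discard(fingerprint)
```

A second trigger for a fingerprint whose workflow is still running returns "requeued", while an unrelated fingerprint starts immediately — which is why stateful actions like ticket creation don't double-fire.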
Using it
Docker Compose gets you running in a few minutes:
curl -s https://raw.githubusercontent.com/keephq/keep/main/docker-compose.yml \
  -o docker-compose.yml
docker compose up -d
Workflows are YAML files you push to Keep or load from your repo. Here's the basic shape:
workflow:
  id: notify-on-critical
  triggers:
    - type: alert
      filters:
        - key: severity
          value: critical
  steps:
    - name: get-runbook
      provider:
        type: bash
        config: "{{ providers.default-bash }}"
        with:
          command: curl -s https://internal.docs/runbook/{{ alert.name }}
  actions:
    - condition:
        - type: assert
          assert: "{{ steps.get-runbook.results.return_code }} == 0"
      name: post-to-slack
      provider:
        type: slack
        config: "{{ providers.slack-oncall }}"
        with:
          message: "Critical: {{ alert.name }} — runbook: {{ steps.get-runbook.results.stdout }}"
Triggers can be alert events, manual, interval-based, or incident-based. The {{ }} templating resolves against a context manager that carries alert fields, step results, provider configs, and secrets — all lazily evaluated via an IOHandler.
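The resolution step is conceptually simple: walk the dotted path through a nested context. A toy version of just that lookup — Keep's IOHandler also does lazy evaluation, secrets, and function calls, none of which is shown here:

```python
import re

def render(template, context):
    """Resolve {{ dotted.path }} placeholders against a nested dict."""
    def resolve(match):
        value = context
        for part in match.group(1).strip().split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{(.*?)\}\}", resolve, template)

ctx = {
    "alert": {"name": "cpu-high"},
    "steps": {"get-runbook": {"results": {"return_code": 0}}},
}
assert render("Critical: {{ alert.name }}", ctx) == "Critical: cpu-high"
assert render("rc={{ steps.get-runbook.results.return_code }}", ctx) == "rc=0"
```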
Rough edges
The provider count is impressive, but quality is uneven. Some providers have detailed scope declarations and webhook support; others are thin wrappers with minimal field mapping. If you're relying on a provider outside the top tier (Datadog, PagerDuty, Grafana), read the source before depending on it.
The CEL __repr__ patches work but signal that cel-python has not been battle-tested at Keep's scale. If they ever need to debug a malformed CEL expression, those patches will make tracing harder.
Multi-tenancy is baked in throughout — every DB query is scoped to a tenant_id. That's good for SaaS. If you're self-hosting for a single team, it adds ceremony without benefit.
The UI is Next.js (keep-ui/), reasonably complete, but the API-first design means you can drive everything via REST or workflow YAML without ever touching it.
Bottom line
Keep is the right tool if you're drowning in alerts from heterogeneous sources and want something self-hostable with real workflow logic — not just routing rules. The provider abstraction is solid enough that adding a new integration is a day's work, not a week's.
