At 02:13, a token was written to our logs during a 500 error across our systems. By 03:00, we had a SIEM redaction in place, but the token had already spread, even reaching a contractor’s laptop. We wanted to know who else might have seen it, but we had no way to find out.
By then, downstream masking was too late.
Issues that contributed:
- Log Proliferation: Too many log destinations increase exposure.
- Unsafe Defaults: Frameworks often log sensitive data by default.
- Fragmented Ownership: Fixes land at the SIEM layer, with responsibility split across different teams.
- Detection Challenges: Secrets are varied and hard to identify.
- Pressure to Act: “Log everything” settings tend to stick around after crises.
Why conventional solutions fail:
- Regex Failures: Patterns often miss context and unfamiliar formats.
- Delayed Masking: Raw data is exposed before masking runs.
- Blind Drops: Dropping fields wholesale creates observability gaps.
- Encryption Limitations: Encryption doesn’t prevent the initial exposure.
- Quick Fix Illusions: Data spreads quickly, which complicates cleanup.
Preventive Measures:
- Keep sensitive data minimal from the start.
- Enforce strict logging policies.
For effective logs:
- Use allowlists for specific event types.
- Implement derived identifiers.
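As a minimal sketch of these two ideas, the Python below keeps only the fields allowlisted for an event type and replaces a raw identifier with a keyed hash. The event names, field names, and `PSEUDONYM_KEY` are hypothetical, and the key would normally come from a secret manager rather than source code.

```python
import hashlib
import hmac

# Hypothetical allowlist: event type -> fields that may be logged.
ALLOWED_FIELDS = {
    "login_failed": {"user_ref", "reason", "status"},
    "payment_declined": {"user_ref", "status"},
}

# Assumption: illustration only; load the real key from a secret manager.
PSEUDONYM_KEY = b"example-key-for-illustration-only"


def derive_identifier(raw_value: str) -> str:
    """Return a stable keyed hash that stands in for a sensitive value."""
    digest = hmac.new(PSEUDONYM_KEY, raw_value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def build_event(event_type: str, fields: dict) -> dict:
    """Keep only allowlisted fields; unknown event types keep no fields."""
    allowed = ALLOWED_FIELDS.get(event_type, set())
    kept = {k: v for k, v in fields.items() if k in allowed}
    return {"event": event_type, **kept}


event = build_event(
    "login_failed",
    {
        "user_ref": derive_identifier("alice@example.com"),
        "reason": "bad_password",
        "authorization": "Bearer abc123",  # not allowlisted, so it is dropped
    },
)
```

The derived identifier still lets you correlate events for one user across log lines, without the raw value ever reaching a sink.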
Policy enforcement:
- Use logging SDKs that accept only approved fields.
- Require explicit opt-in for production logging settings.
- Exclude sensitive headers.
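One way to picture a logging SDK that accepts only approved fields is a thin wrapper over the standard `logging` module. This is a sketch, not a reference implementation: `StructuredLogger`, `APPROVED_FIELDS`, and `SENSITIVE_HEADERS` are hypothetical names, and the lists are examples rather than a complete policy.

```python
import json
import logging

# Hypothetical policy: the only fields callers may log.
APPROVED_FIELDS = frozenset({"request_id", "route", "status", "duration_ms"})
# Hypothetical list of headers that must never be logged.
SENSITIVE_HEADERS = frozenset({"authorization", "cookie", "set-cookie", "x-api-key"})


def safe_headers(headers: dict) -> dict:
    """Drop sensitive headers, comparing names case-insensitively."""
    return {k: v for k, v in headers.items() if k.lower() not in SENSITIVE_HEADERS}


class StructuredLogger:
    """Callers pass keyword fields, never free-form strings."""

    def __init__(self, name: str):
        self._logger = logging.getLogger(name)

    def event(self, event_name: str, **fields) -> str:
        rejected = sorted(set(fields) - APPROVED_FIELDS)
        if rejected:
            # Fail loudly so an unapproved field is caught in development and CI.
            raise ValueError(f"unapproved log fields: {rejected}")
        line = json.dumps({"event": event_name, **fields}, sort_keys=True)
        self._logger.info(line)
        return line
```

Raising on an unapproved field is a design choice. A production build might drop the field and emit a metric instead, so that a logging mistake cannot take down a request.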
Checks to have in place:
- Integrate CI testing.
- Use runtime guards.
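The two checks can share code. The sketch below uses a `logging.Filter` as a runtime guard that blanks any message containing a secret-like value, and a canary check of the kind a CI test might run. The patterns and the `SecretGuard` and `capture_logs` names are assumptions for illustration. Since pattern matching misses formats, treat the guard as a backstop, not the primary control.

```python
import io
import logging
import re

# Assumption: illustrative patterns only, not a complete secret detector.
SECRET_PATTERNS = [
    re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE),
    re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWT-shaped
]


class SecretGuard(logging.Filter):
    """Replace the whole message if it contains a secret-like value."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        if any(p.search(message) for p in SECRET_PATTERNS):
            record.msg = "[blocked: secret-like value in log message]"
            record.args = None
        return True  # keep the record so the event itself stays visible


def capture_logs(action) -> str:
    """Run action(logger) against a guarded in-memory handler; return the output."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.addFilter(SecretGuard())
    logger = logging.getLogger("guard_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        action(logger)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()


# CI-style canary check: a planted token must never reach the sink.
CANARY = "Bearer canary-token-0000"
output = capture_logs(lambda log: log.info("upstream call failed, auth=%s", CANARY))
assert "canary-token-0000" not in output
```

In CI, the same canary can be sent through real request handlers, and the build fails if it appears in any captured output.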
To reduce impact:
- Use separate sinks for raw and sanitized data, and keep the raw sink short-lived.
- Limit access.
Verification strategies:
- Map out routes and track exposure times.
- Use versioned policy-as-code.
Effective logging tips:
- Exclude request bodies and auth headers.
- Set up alerts for unknown fields.
- Use safe exception details.
- Default to sanitized logs.
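Exception messages are a common way for a secret to reach an error log, since the message often embeds the value that failed. A minimal sketch of safe exception details, using hypothetical function names, records only the exception type and the location of the failure:

```python
import traceback


def safe_exception_details(exc: BaseException) -> dict:
    """Describe an exception without its message or any local variable values."""
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1] if frames else None
    return {
        "error_type": type(exc).__name__,
        "function": last.name if last else None,
        "line": last.lineno if last else None,
    }


def call_upstream(token: str) -> None:
    # Hypothetical failure whose message embeds the secret.
    raise RuntimeError(f"upstream rejected token {token}")


try:
    call_upstream("tok_live_example")
except RuntimeError as exc:
    details = safe_exception_details(exc)
```

The result names the error type and the function where it was raised, which is usually enough to start debugging without writing the message text to any sink.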
Consider downstream masking only for third-party or legacy systems.
To improve:
- Map critical paths and reduce logging.
- Use minimal schemas with allowlists.
- Run CI tests for confidentiality.
- Configure log agents to discard fields that are not allowlisted.
- Rotate access credentials regularly.
Tradeoffs:
- Allowlists can slow development; good schema design reduces the friction.
- On-path guards add some latency but boost security.
- Prioritize the legacy paths that carry the highest risk.
- Test downstream masking after deployment.
In conclusion, once data is logged, control is lost. Be proactive, validate with testing, and only use downstream masking as a last resort.