Data Engineering
March 3, 2025

Building a Real-Time Threat Detection Platform from Zero

The engineering decisions behind Cork's AI-powered cybersecurity intelligence platform, and what I'd do differently.

Ryan Watts

Principal AI & Data Engineer

When I joined Cork as Staff Data Engineer in 2023, the mission was clear: build a real-time threat detection platform for MSPs (managed service providers) from scratch. No existing data infrastructure. No data team. Just a greenfield problem with high stakes.

Eighteen months later, we had what I'd call the world's first real-time AI-powered threat detection platform with automated risk signal generation and compliance event convergence. Here's what I learned.

The architectural constraints

Cybersecurity data is uniquely challenging:

1. Volume: MSPs manage thousands of endpoints across dozens of clients. For us, that added up to tens of millions of security events per day.

2. Latency requirements: A threat detected in 5 minutes is worth 100x a threat detected in an hour.

3. Accuracy demands: False positives destroy trust faster than missed detections. The bar is high.

4. Multi-tenancy: Every client's data must be strictly isolated while analytics run across the full fleet.

These constraints ruled out most standard data warehouse patterns and pushed us toward a streaming-first architecture.

The stack decision

After evaluating several options, we landed on:

· GCP Pub/Sub for event ingestion (low latency, fully managed, excellent at MSP-scale volumes)
· Dataflow for streaming transformations
· BigQuery as the analytical layer (surprisingly capable for near-real-time with streaming inserts)
· Cloud Functions for risk signal generation
· Microsoft Graph API for identity and device event enrichment

The Microsoft Graph API integration was particularly important. MSPs live in Microsoft's ecosystem, so events from Entra ID, Defender, and Intune were primary signal sources.

The risk classification system

The core IP was the risk signal classification system. We needed to:

1. Ingest raw security events

2. Enrich them with device/user context from Graph API

3. Apply ML classification to score risk

4. Generate actionable risk signals for MSP analysts

5. Route high-confidence signals to automated response workflows
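
The five steps above can be sketched as a small pipeline. This is a minimal illustration, not Cork's implementation: `SecurityEvent`, `RiskSignal`, `enrich`, `score_risk`, the `DEVICE_CONTEXT` lookup table, and the 0.9 routing threshold are all hypothetical stand-ins. In particular, the scoring function is a hand-written placeholder where a trained model would sit, and the dictionary lookup stands in for a Graph API call.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityEvent:
    tenant_id: str
    device_id: str
    event_type: str
    context: dict = field(default_factory=dict)


@dataclass
class RiskSignal:
    tenant_id: str
    device_id: str
    score: float
    route: str  # "automated_response" or "analyst_queue"


# Hypothetical stand-in for device/user context fetched from Graph API.
DEVICE_CONTEXT = {
    "dev-1": {"compliant": False, "owner_is_admin": True},
    "dev-2": {"compliant": True, "owner_is_admin": False},
}

AUTO_RESPONSE_THRESHOLD = 0.9  # assumed cutoff for "high-confidence"


def enrich(event: SecurityEvent) -> SecurityEvent:
    """Step 2: attach device/user context (empty if the device is unknown)."""
    event.context = dict(DEVICE_CONTEXT.get(event.device_id, {}))
    return event


def score_risk(event: SecurityEvent) -> float:
    """Step 3: placeholder scorer; a trained classifier would go here."""
    score = 0.2
    if event.event_type == "impossible_travel_signin":
        score += 0.4
    if not event.context.get("compliant", True):
        score += 0.2
    if event.context.get("owner_is_admin", False):
        score += 0.2
    return min(score, 1.0)


def to_signal(event: SecurityEvent) -> RiskSignal:
    """Steps 4 and 5: build the signal and pick its route."""
    score = score_risk(event)
    if score >= AUTO_RESPONSE_THRESHOLD:
        route = "automated_response"
    else:
        route = "analyst_queue"
    return RiskSignal(event.tenant_id, event.device_id, score, route)


def process(raw_events: list[dict]) -> list[RiskSignal]:
    """Step 1 onward: parse raw events, then enrich, score, and route."""
    return [to_signal(enrich(SecurityEvent(**raw))) for raw in raw_events]
```

Keeping each step a separate function means the placeholder scorer can be swapped for a real model without touching enrichment or routing.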

The classification model was a gradient-boosted ensemble trained on labeled threat data. But the more interesting engineering problem was the feature engineering pipeline: extracting meaningful features from heterogeneous event streams in real time.
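
As one illustration of what real-time feature extraction can look like, the sketch below keeps a sliding time window of events per user and derives a few counts from it. The class name `SlidingWindowFeatures`, the feature names, and the five-minute default window are assumptions made for the example; this post doesn't list the actual features, and in production this kind of logic would sit in the streaming layer rather than in an in-memory dict.

```python
from collections import defaultdict, deque


class SlidingWindowFeatures:
    """Per-user counts over the last `window_seconds` of events."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        # user_id -> deque of (timestamp, event_type, source_ip), oldest first
        self._events = defaultdict(deque)

    def update(self, user_id: str, timestamp: float,
               event_type: str, source_ip: str) -> dict:
        """Record one event and return the current features for that user.

        Assumes events arrive in timestamp order per user.
        """
        window = self._events[user_id]
        window.append((timestamp, event_type, source_ip))
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while window and window[0][0] <= cutoff:
            window.popleft()
        return {
            "event_count": len(window),
            "failed_signin_count": sum(
                1 for _, kind, _ in window if kind == "signin_failed"
            ),
            "distinct_ip_count": len({ip for _, _, ip in window}),
        }
```

Each call to `update` returns a flat dict, which is a convenient shape to hand to a classifier as a feature row.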

What I'd do differently

Three things I'd change with hindsight:

1. Start with a more opinionated schema. We evolved our event schema too organically, which created technical debt that cost weeks to unwind. Define a strict canonical event format early, even if it feels premature.
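
A strict canonical format can be as simple as a frozen dataclass plus a parser that rejects anything that doesn't fit. The field names, the `ALLOWED_SOURCES` set, and the `parse_event` helper below are hypothetical, chosen for illustration rather than taken from Cork's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

ALLOWED_SOURCES = {"entra_id", "defender", "intune"}  # assumed source names
REQUIRED_FIELDS = ("tenant_id", "source", "event_type", "occurred_at")


@dataclass(frozen=True)
class CanonicalEvent:
    tenant_id: str
    source: str
    event_type: str
    occurred_at: datetime  # always timezone-aware, normalized to UTC
    schema_version: int = 1


def parse_event(raw: dict) -> CanonicalEvent:
    """Validate a raw dict and convert it; raise ValueError on any mismatch."""
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    unknown = set(raw) - set(REQUIRED_FIELDS) - {"schema_version"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    if raw["source"] not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source: {raw['source']!r}")
    occurred_at = datetime.fromisoformat(raw["occurred_at"])
    if occurred_at.tzinfo is None:
        raise ValueError("occurred_at must include a UTC offset")
    return CanonicalEvent(
        tenant_id=str(raw["tenant_id"]),
        source=raw["source"],
        event_type=str(raw["event_type"]),
        occurred_at=occurred_at.astimezone(timezone.utc),
        schema_version=int(raw.get("schema_version", 1)),
    )
```

Rejecting unknown fields is the opinionated part: a producer that wants a new field has to change the schema first, which is exactly the friction that stops organic drift.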

2. Invest earlier in observability. We were debugging live production pipelines with inadequate tooling for the first three months. Datadog for metrics, structured JSON logs from day one, and distributed tracing would have saved enormous time.
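
Structured JSON logging needs very little to get started. The sketch below uses only Python's standard `logging` module; the `JsonFormatter` class, the `build_logger` helper, and the `tenant_id`, `pipeline_stage`, and `event_id` fields are illustrative assumptions, and shipping these lines to Datadog or a tracing backend is out of scope here.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields, supplied by callers via `extra=`.
    CONTEXT_FIELDS = ("tenant_id", "pipeline_stage", "event_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy over only the context fields that were actually provided.
        for name in self.CONTEXT_FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "pipeline") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger().info("signal generated", extra={"tenant_id": "t1", "pipeline_stage": "scoring"})` then emits a single JSON line that can be filtered by tenant or stage.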

3. Use DuckDB for local development from the start. BigQuery is great in production but slow and expensive for local iteration. DuckDB now handles most of our local development and testing, and the SQL compatibility has been excellent.

The platform ultimately processed tens of millions of security events per day with sub-second threat signal generation. That's the kind of impact that makes complex engineering challenges worth it.

Ryan Watts

Principal AI & Data Engineer with 15+ years building enterprise systems. Head of AI at DVx Ventures, Staff Data Engineer at Cork, and independent consultant.