When I joined Cork as Staff Data Engineer in 2023, the mission was clear: build a real-time threat detection platform for MSPs (managed service providers) from scratch. No existing data infrastructure. No data team. Just a greenfield problem with high stakes.
Eighteen months later, we had what I'd call the world's first real-time AI-powered threat detection platform with automated risk signal generation and compliance event convergence. Here's what I learned.
The architectural constraints
Cybersecurity data is uniquely challenging:
1. Volume: MSPs manage thousands of endpoints across dozens of clients. At our scale that meant tens of millions of security events per day.
2. Latency requirements: A threat detected in 5 minutes is worth 100x a threat detected in an hour.
3. Accuracy demands: False positives destroy trust faster than missed detections. The bar is high.
4. Multi-tenancy: Every client's data must be strictly isolated while analytics run across the full fleet.
These constraints ruled out most standard data warehouse patterns and pushed us toward a streaming-first architecture.
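To make the multi-tenancy constraint concrete, here is a minimal sketch, not the production design. It assumes raw events are only ever read per tenant, while fleet-wide analytics see aggregates only. The `Event` fields and the `TenantScopedStore` class are hypothetical names chosen for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    tenant_id: str   # hypothetical field: which MSP client owns the event
    event_type: str  # hypothetical field: e.g. "failed_sign_in"


class TenantScopedStore:
    """Keeps raw events partitioned by tenant; only aggregates cross tenants."""

    def __init__(self) -> None:
        self._by_tenant: dict[str, list[Event]] = defaultdict(list)

    def append(self, event: Event) -> None:
        self._by_tenant[event.tenant_id].append(event)

    def events_for(self, tenant_id: str) -> list[Event]:
        # Raw reads are always scoped to exactly one tenant.
        return list(self._by_tenant.get(tenant_id, []))

    def fleet_counts(self) -> Counter:
        # Fleet-wide analytics expose counts per event type, never raw rows.
        counts: Counter = Counter()
        for events in self._by_tenant.values():
            counts.update(e.event_type for e in events)
        return counts


store = TenantScopedStore()
store.append(Event("client-a", "failed_sign_in"))
store.append(Event("client-b", "failed_sign_in"))
store.append(Event("client-b", "malware_alert"))
```

The point of the design is that no code path returns another tenant's raw rows. Cross-tenant questions have to go through `fleet_counts`, which returns totals only.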
The stack decision
After evaluating several options, we landed on:
The Microsoft Graph API integration was particularly important. MSPs live in Microsoft's ecosystem, so events from Entra ID, Defender, and Intune were primary signal sources.
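As an illustration of what pulling events from Graph involves, here is a hedged sketch of paging through a Graph collection. It relies on two documented Graph conventions: results arrive in a `value` array, and further pages are linked through `@odata.nextLink`. The sign-in log endpoint is used as an example source. The helper names (`http_get_json`, `iter_graph_items`) and the fake pages are assumptions for illustration, and token acquisition is out of scope.

```python
import json
import urllib.request
from typing import Callable, Iterator

GRAPH_SIGN_INS_URL = "https://graph.microsoft.com/v1.0/auditLogs/signIns"


def http_get_json(url: str, token: str) -> dict:
    """Fetch one page from Graph using a bearer token obtained elsewhere."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_graph_items(url: str, fetch_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every item in a paged Graph collection, following @odata.nextLink."""
    next_url = url
    while next_url:
        page = fetch_page(next_url)
        yield from page.get("value", [])
        next_url = page.get("@odata.nextLink")


# Offline demo with fake pages (hypothetical data), so no network call is made.
fake_pages = {
    GRAPH_SIGN_INS_URL: {"value": [{"id": "1"}, {"id": "2"}], "@odata.nextLink": "page-2"},
    "page-2": {"value": [{"id": "3"}]},
}
sign_in_ids = [item["id"] for item in iter_graph_items(GRAPH_SIGN_INS_URL, fake_pages.__getitem__)]
```

Against the real API, `fetch_page` would be something like `lambda url: http_get_json(url, token)`. Injecting the fetcher keeps the paging logic testable without credentials.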
The risk classification system
The core IP was the risk signal classification system. We needed to:
1. Ingest raw security events
2. Enrich them with device/user context from Graph API
3. Apply ML classification to score risk
4. Generate actionable risk signals for MSP analysts
5. Route high-confidence signals to automated response workflows
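The five steps above can be sketched as a single function. This is a toy outline under stated assumptions, not the actual system: the `directory` dict stands in for a Graph lookup, `toy_score` stands in for the ML model, and the 0.9 routing threshold is an invented value.

```python
from dataclasses import dataclass
from typing import Callable

AUTO_RESPONSE_THRESHOLD = 0.9  # assumed cutoff for "high-confidence" signals


@dataclass
class RiskSignal:
    event_id: str
    user: str
    score: float
    route: str  # "automated_response" or "analyst_queue"


def enrich(event: dict, directory: dict) -> dict:
    """Step 2: attach device/user context (stand-in for a Graph API lookup)."""
    context = directory.get(event["user"], {"device_compliant": False})
    return {**event, **context}


def toy_score(enriched: dict) -> float:
    """Step 3 placeholder: a hand-written rule instead of a trained model."""
    score = 0.5 if enriched["event_type"] == "impossible_travel" else 0.1
    if not enriched["device_compliant"]:
        score += 0.45
    return min(score, 1.0)


def process(event: dict, directory: dict, score_fn: Callable[[dict], float]) -> RiskSignal:
    """Steps 1-5 for one raw event: ingest, enrich, score, signal, route."""
    enriched = enrich(event, directory)
    score = score_fn(enriched)
    route = "automated_response" if score >= AUTO_RESPONSE_THRESHOLD else "analyst_queue"
    return RiskSignal(event["id"], event["user"], score, route)


directory = {"alice": {"device_compliant": False}, "bob": {"device_compliant": True}}
risky = process({"id": "evt-1", "user": "alice", "event_type": "impossible_travel"}, directory, toy_score)
routine = process({"id": "evt-2", "user": "bob", "event_type": "sign_in"}, directory, toy_score)
```

Passing the scorer in as `score_fn` keeps the pipeline shape independent of the model, so a trained classifier can replace the toy rule without touching enrichment or routing.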
The classification model was a gradient-boosted ensemble trained on labeled threat data. But the more interesting engineering problem was the feature engineering pipeline: extracting meaningful features from heterogeneous event streams in real time.
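One common shape for a real-time feature is a count over a trailing time window. The sketch below is an assumption about what such a feature could look like, not one of the platform's actual features. It counts failed sign-ins per user in a sliding window and assumes events arrive in timestamp order. The class name and the window size are illustrative.

```python
from collections import defaultdict, deque


class FailedSignInCounter:
    """Counts failed sign-ins per user inside a trailing time window."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        self._timestamps: dict[str, deque] = defaultdict(deque)

    def update(self, user: str, timestamp: float) -> int:
        """Record one failure and return the user's count within the window."""
        window = self._timestamps[user]
        window.append(timestamp)
        # Evict entries that fell out of the window (assumes in-order arrival).
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window)


counter = FailedSignInCounter(window_seconds=60)
counts = [counter.update("alice", t) for t in (0, 10, 20, 90)]
```

In this run the count rises to 3 over the first three failures, then drops back to 1 at `t=90` because the earlier timestamps have aged out. Out-of-order events would need watermarking or a buffer, which this sketch deliberately leaves out.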
What I'd do differently
Three things I'd change with hindsight:
1. Start with a more opinionated schema. We evolved our event schema too organically, which created technical debt that cost weeks to unwind. Define a strict canonical event format early, even if it feels premature.
2. Invest earlier in observability. We were debugging live production pipelines with inadequate tooling for the first three months. Datadog for metrics, structured JSON logs from day one, and distributed tracing would have saved enormous time.
3. Use DuckDB for local development. BigQuery is great in production but slow and expensive for local iteration. DuckDB now handles most of our local development and testing, and the SQL compatibility is excellent.
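To illustrate the first two points, here is a minimal sketch of a strict canonical event that rejects malformed input at construction time and serializes to a structured JSON log line. The field names, the `ALLOWED_SOURCES` set, and the `to_log_line` helper are assumptions for illustration, not the schema we actually shipped.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

ALLOWED_SOURCES = {"entra_id", "defender", "intune"}  # assumed source labels


@dataclass(frozen=True)
class CanonicalEvent:
    event_id: str
    tenant_id: str
    source: str
    event_type: str
    occurred_at: datetime

    def __post_init__(self) -> None:
        # Fail fast so malformed events never enter the pipeline.
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")
        if self.occurred_at.tzinfo is None:
            raise ValueError("occurred_at must be timezone-aware")


def to_log_line(event: CanonicalEvent, message: str) -> str:
    """Render one structured JSON log line for the given event."""
    record = asdict(event)
    record["occurred_at"] = event.occurred_at.isoformat()
    record["message"] = message
    return json.dumps(record, sort_keys=True)


event = CanonicalEvent(
    "evt-1", "client-a", "defender", "malware_alert",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
)
line = to_log_line(event, "event normalized")

try:
    CanonicalEvent("evt-2", "client-a", "unknown_vendor", "alert", datetime.now(timezone.utc))
    rejected = False
except ValueError:
    rejected = True
```

Validating in `__post_init__` puts the schema rules in one place, and because every log line carries the same keys, logs can be filtered by `tenant_id` or `event_id` without regex parsing.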
The platform ultimately processed tens of millions of security events per day with sub-second threat signal generation. That's the kind of impact that makes complex engineering challenges worth it.