
Bud Sentinel: A CPU-Native Safety Guardrail for Large Language Models

May 6, 2026

Bud Sentinel is a CPU-native safety guardrail for LLMs covering jailbreak detection, prompt-injection detection, and content moderation across categories like toxicity, hate, harassment, self-harm, violence, illegal content, and regulated advice. It is built on a new attention mechanism called Resource Aware Attention (RAA).

The problem. Every production LLM call passes through safety classifiers on input and output, almost always on CPU fleets shared with the application itself. Existing transformer guard models were designed for GPUs and compressed down to fit. They land in the hundreds-of-milliseconds-per-classification regime, cap inputs at 512 tokens, and force operators to choose between guards that miss most attacks and guards that block most benign traffic. Compression has plateaued. Another round will not close a two-orders-of-magnitude gap.

The approach. Rather than compress the existing transformer template further, RAA treats the deployment envelope (cache hierarchy, precision tier, latency SLO) as a first-class input to the attention mechanism itself. Cost does not scale quadratically with sequence length. Multiple task heads compose over a single attention pass, so a guardrail, router, and PII extractor can share one cost. RAA is a family of models; RAA-Safety powers Bud Sentinel, with RAA-Span, RAA-Retrieve, and RAA-Route on the roadmap.
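The head-composition idea above can be sketched in a few lines. This is a hypothetical illustration, not the RAA implementation: the encoder, head names, and toy scoring are all assumptions standing in for the real shared attention pass and linear read-outs.

```python
# Illustrative sketch: several task heads sharing one encoder pass.
# All names (TaskHead, encode, classify_all) are hypothetical.
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskHead:
    name: str
    weights: List[List[float]]  # num_labels x hidden_size linear read-out


def encode(tokens: List[int], hidden_size: int = 8) -> List[float]:
    # Stand-in for the single shared attention pass: one pooled
    # embedding per input, computed once regardless of head count.
    vec = [0.0] * hidden_size
    for t in tokens:
        for i in range(hidden_size):
            vec[i] += math.sin((t + 1) * (i + 1))
    n = max(len(tokens), 1)
    return [v / n for v in vec]


def classify_all(tokens: List[int], heads: List[TaskHead]) -> Dict[str, int]:
    # The expensive pass runs once; each extra head is only a cheap
    # linear projection over the shared embedding.
    h = encode(tokens)
    out: Dict[str, int] = {}
    for head in heads:
        scores = [sum(w * x for w, x in zip(row, h)) for row in head.weights]
        out[head.name] = max(range(len(scores)), key=scores.__getitem__)
    return out
```

Under this framing, adding a router or PII head to an existing guardrail deployment costs one more matrix-vector product per request, not another full model pass.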

Headline results.

  • Latency: 5.67 ms per classification on EPYC, 5.99 ms on Xeon, 8.39 ms on a laptop i7. Transformer baselines on the same CPUs sit between 334 ms and 3,855 ms. Bud Sentinel on a laptop CPU is faster than every transformer baseline running on an A100.
  • Throughput: Over 4,400 req/s on a single Xeon 6972P at 512 tokens with p99 under 12 ms. Nearly 1,500 req/s on a fanless Lunar Lake laptop.
  • Long context: Native 65,536-token inputs in a single call (560 ms p50). Transformer baselines do not operate at this length at all.
  • Accuracy: 15.97% attack success rate (ASR) and 14.92% false refusal rate (FRR) in aggregate. Bud Sentinel is the only model evaluated that lands both ASR and FRR under 20%. Models with lower ASR achieve it by refusing 82–89% of benign traffic.

Trade-offs the paper is direct about. RAA is not generative and is not chasing frontier transformer accuracy at unbounded compute. It targets the most accurate deployable model at the target envelope. PIGuard remains the hardest slice for every model evaluated, BudSentinel included. The fast RAA layer is the first line of defense in a layered product, with deeper scans behind it for tail risk.

The broader bet. RAA generalizes beyond safety to any CPU-deployed small-model workload in the LLM stack: routing, retrieval re-ranking, PII and compliance extraction, query understanding, long-context scoring, and on-device deployments. The thesis is that designing attention against the envelope the application actually runs in, rather than retrofitting GPU architectures down to CPU, changes what a request can afford to do.