What is P3AK
Platform Overview
BayouClaw: Linux Agent Runtime
What Impressed You
Alpine Docker, per-agent Linux users, Unix domain sockets, seccomp syscall filtering, and a policy engine enforcing the Iron Rule: no agent acts without HITL sign-off. Seven agents. Nine test suites. Fourteen board meetings run in production.
Agent Roster
| Agent | Linux User | Role | Transport | Seccomp |
|---|---|---|---|---|
| amber | p3ak-amber | Kernel — routes all requests, voice interface | Unix sock | Restricted |
| ledger | p3ak-ledger | Financial analysis, AR/AP | Unix sock | Strict |
| counselor | p3ak-counselor | Legal review, contract analysis | Unix sock | Strict |
| signal | p3ak-signal | Email, calendar, communications | Unix sock | Restricted |
| scout | p3ak-scout | Web research, external data | Unix sock | Strict |
| architect | p3ak-architect | Codebase analysis, technical decisions | Unix sock | Strict |
| board | p3ak-board | Multi-agent synthesis, board meeting engine | Unix sock | Restricted |
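Every agent in the roster runs as its own Linux user and is reached over a Unix socket, so the receiving side can ask the kernel who is on the other end. The following is a minimal sketch of that check using Linux's SO_PEERCRED socket option. The function names and the allow-list are hypothetical, and `socketpair()` stands in for a real listener; none of this is taken from the BayouClaw codebase.

```python
import os
import socket
import struct


def peer_credentials(conn):
    """Return (pid, uid, gid) of the peer on a connected Unix socket (Linux only)."""
    # The kernel returns a struct ucred: three native ints (pid, uid, gid).
    size = struct.calcsize("3i")
    raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, size)
    return struct.unpack("3i", raw)


def is_allowed_peer(conn, allowed_uids):
    """Accept the connection only if the peer's UID is on the allow-list."""
    _pid, uid, _gid = peer_credentials(conn)
    return uid in allowed_uids


if __name__ == "__main__":
    # Hypothetical stand-in for an accepted connection on an agent socket.
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        print(peer_credentials(left))
        print(is_allowed_peer(left, {os.geteuid()}))
    finally:
        left.close()
        right.close()
```

Because the kernel fills in the credentials, a peer cannot claim a UID it does not hold, which is what makes the per-agent Linux users meaningful at the socket layer.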
Request Routing Architecture
/run/bayouclaw/*.sock. No network stack exposed. No open ports between processes. Not interceptable from outside the container, which is the architectural advantage over Claude Code's file-based mailbox approach.
Dockerfile Security Model
HITL — Human in the Loop
OS-Level Control
The agent cannot bypass this. The kernel enforces it. Every external action goes through a staging directory. Agents have write access to staging and no access to outbox. The hitl-daemon is the only process that can move files forward. This is enforced by file permissions, not by an if-statement.
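The staging and outbox split can be sketched as two directories with different modes plus a single promotion step. This is an illustration only: the mode values, the function names, and the temporary directory are assumptions, and the real hitl-daemon is not shown in the source.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path
from typing import Optional

STAGING_MODE = 0o770  # assumed: daemon (owner) and agent group may write
OUTBOX_MODE = 0o700   # assumed: daemon only, no access for agents


def make_spool(root: Path):
    """Create staging and outbox directories with explicit modes."""
    staging, outbox = root / "staging", root / "outbox"
    for path, mode in ((staging, STAGING_MODE), (outbox, OUTBOX_MODE)):
        path.mkdir()
        os.chmod(path, mode)  # set explicitly so the umask cannot alter it
    return staging, outbox


def promote(staged: Path, outbox: Path, approved: bool) -> Optional[Path]:
    """Move a staged request to the outbox, but only after approval."""
    if not approved:
        return None  # the request stays in staging
    dest = outbox / staged.name
    shutil.move(str(staged), str(dest))
    return dest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        staging, outbox = make_spool(Path(tmp))
        print(oct(stat.S_IMODE(outbox.stat().st_mode)))
        request = staging / "send-email.json"
        request.write_text('{"action": "send_email"}')
        print(promote(request, outbox, approved=False))
        print(promote(request, outbox, approved=True))
```

In a deployment the two directories would be owned by the daemon's user, so an agent process that tried to write to the outbox directly would get a permission error from the kernel rather than a refusal from application code.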
6 Approval Levels
HITL Policy (abbreviated)
Approval Flow
Encryption & Key Management
Passcode Cycling
Your question about passcode cycling landed on a real gap. The encryption stack is solid. The rotation pipeline is manual. That's Sprint 1 of hardening.
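Moving from manual to scheduled rotation needs, at minimum, a rule that decides when a key is too old. A minimal sketch of such a rule follows. The 90-day maximum age and the function names are hypothetical and are not a stated P3AK policy.

```python
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)  # hypothetical policy value


def next_rotation(created_at, max_age=MAX_KEY_AGE):
    """Return the moment at which the key must be rotated."""
    return created_at + max_age


def rotation_due(created_at, now, max_age=MAX_KEY_AGE):
    """True once the key has reached its maximum allowed age."""
    return now >= next_rotation(created_at, max_age)


if __name__ == "__main__":
    created = datetime(2025, 1, 1, tzinfo=timezone.utc)
    print(next_rotation(created))
    print(rotation_due(created, datetime.now(timezone.utc)))
```

A scheduler could run this check daily and stage a rekey request for human approval when it returns true.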
Current State
Target State
Known Attack Surface & Hardening Plan
Honest Assessment
Every gap we know about, stated plainly. This is what a security specialist needs to see before touching the stack. No gap here is a design flaw; each is implementation debt from building fast. Each one has a fix path, a risk level, and a sprint target.
p3ak-vault rekey exists but is manual-only. No scheduled rotation, no policy enforcement. A compromised passphrase stays valid indefinitely until a human intervenes.
[…] /var/spool/p3ak/staging/system/. Requires biometric + PIN. Old key archived in WAL with expiry timestamp. Sprint 1.

.vault binary format (header, segment table, encrypted blocks) has never been fuzzed. A malformed vault file could trigger panics, buffer misreads, or undefined behavior in the Rust parser. Ingest path accepts 38 file formats, each a potential vector.
cargo fuzz targets for: vault header parsing, segment table deserialization, encrypted block decryption with corrupted ciphertext, WAL replay. CI-integrated. Each ingest format converter gets its own fuzz target. Sprint 2.

[…] git filter-repo and rotated. But anyone who cloned before the purge has the old keys in their local reflog.
gitleaks pre-commit hook installed across all 4 repos. Repos are private (siliconbayou org). No public clones existed. Residual risk: low. Ongoing: secrets scanner in CI pipeline. Done + Sprint 1.

p3ak-vault create without --passphrase creates an unencrypted vault. No warning, no prompt. A user who forgets the flag gets zero encryption and may not realize it. The vault file looks identical from the outside.
create requires passphrase or interactive prompt. Unencrypted mode requires explicit --no-encrypt flag. Vault header includes encryption-status byte readable by any tooling. Sprint 1.

p3ak-vault serve binds 127.0.0.1:8080, localhost only. But no Web Application Firewall, no input sanitization layer above what Rust's type system provides. A malicious local process could send crafted HTTP payloads.
[…] SO_PEERCRED on Unix socket) or API token. Default: 100 req/s search, 50 req/s write. Configurable in config.toml. Burst buffer: 2x sustained. Sprint 2.

[…] chown + chmod 700) provide access control, but traffic is plaintext. A root-level compromise reads all IPC.
SO_PEERCRED verifies peer UID. For multi-host deployments: mTLS over Tailscale WireGuard. Current single-host model: accepted risk. Document threat model.

[…] adduser). Use rootless Podman or Docker --userns=auto. Init script drops to unprivileged user immediately after socket bind via gosu. seccomp profile already blocks setuid/mount/ptrace for agent processes. Sprint 3.
Risk Summary
Hardening Timeline
Performance
Real Numbers
These come from dev-environment profiling, not a formal load test. Vault ops are sub-second. Voice latency is network-bound (Deepgram + Cartesia). Agent routing is LLM-bound, not compute-bound. Formal JMeter benchmarks are the gap, and that is Sean's expertise.
| Operation | Median | P99 | Bottleneck | Status |
|---|---|---|---|---|
| Vault ingest (markdown) | 18ms | 45ms | Tantivy indexing | FAST |
| BM25 search | 12ms | 28ms | Index scan | FAST |
| Hybrid search (BM25 + ZVec + PageIndex) | 31ms | 62ms | Three-way merge | FAST |
| Vault create | 8ms | 20ms | Argon2id KDF | FAST |
| Vault ingest (PDF, Tier 2) | ~400ms | ~1.2s | PDF extraction | OK |
| Voice STT (Deepgram) | ~500ms TTFB | ~900ms | Network (cloud) | OK |
| Voice TTS (Cartesia) | ~500ms TTFB | ~800ms | Network (cloud) | OK |
| BayouClaw agent routing | <2s | <5s | LLM inference | LLM-BOUND |
| Board meeting (5 agents) | 30–60s | ~90s | Sequential LLM calls | EXPECTED |
| Room → vault push (full doc) | ~180ms | ~400ms | HTTP + ingest | FAST |
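For readers reproducing the Median and P99 columns from raw timings, here is a minimal sketch. It assumes the nearest-rank definition of a percentile, which may differ from whatever the profiling tool behind the table used, and the sample latencies are made up.

```python
import statistics


def percentile(samples_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples_ms)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(samples_ms):
    """Return the two statistics reported in the table."""
    return {
        "median": statistics.median(samples_ms),
        "p99": percentile(samples_ms, 99),
    }


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds, not measured P3AK data.
    timings = [11, 12, 12, 13, 12, 14, 11, 28, 12, 13]
    print(summarize(timings))
```

With small sample counts the P99 is simply the slowest observation, which is one reason dev-environment numbers should be read as indicative until a formal load test produces larger samples.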
What Needs Measuring
Target SLAs (Proposed)
Amber — Voice AI with OS Access
LiveKit · Deepgram · Cartesia
Real-time voice agent with full OS access, connected to BayouClaw's agent network. LiveKit Cloud handles WebRTC. Deepgram for STT. Cartesia TTS (Caroline voice). Nine tools, with dangerous ones staged for HITL approval before any execution. The Force Field eliminates AI fingerprint patterns from her speech.
Test Suite
Coverage State
362 Rust unit tests across 14 modules, 54 CLI integration tests, 12 accuracy benchmarks, 44 PyO3 native binding tests, 35 Python SDK tests. Plus 9 BayouClaw shell suites and 51 session regression tests. Score: 88% (45/51 regression). Two failures last session, both security, both fixed in the same session. No CI/CD pipeline yet. Tests run manually. That's the gap.
Vault unit tests (Rust, 362 tests across 14 modules): Each module has an inline #[cfg(test)] mod tests block. Tests cover: crypto roundtrip (encrypt → decrypt → verify), search accuracy (BM25, ZVec TF-IDF, hybrid), format conversion (38 file types → plaintext), WAL integrity (write-ahead log append + hash chain verification), classification confidence scoring, entity obligation matching, consent token HMAC-SHA256 signing + verification + revocation, and memory rot detection (4 rot types: stale, contradicted, orphaned, decayed).
Module test breakdown:
classify ............ 60 tests (8-store classification, confidence thresholds, signal weighting)
rerank .............. 55 tests (BM25+ZVec+PageIndex fusion, edge cases, empty results)
search .............. 41 tests (hybrid search accuracy, multi-vault, mode switching)
convert ............. 28 tests (38 formats: PDF, DOCX, XLSX, CSV, HTML, .mdr, .lhr, audio)
ingest .............. 23 tests (file → vault pipeline, upsert, dedup, batch)
rot ................. 21 tests (4 rot types, decay curves, remediation triggers)
zvec ................ 21 tests (TF-IDF vector math, cosine similarity, sparse ops)
types ............... 21 tests (Doc struct, metadata, serialization roundtrip)
binary_store ........ 20 tests (vault binary format read/write/corruption recovery)
wal ................. 19 tests (append-only log, hash chain integrity, tamper detection)
crypto .............. 18 tests (AES-256-GCM encrypt/decrypt, Argon2id KDF, key derivation)
consent ............. 14 tests (HMAC-SHA256 tokens, scope: sector/doc/full, TTL, revocation)
entity .............. 12 tests (CompanyMetadata schema, obligation catalog, completeness scoring)
store ................ 9 tests (vault CRUD, section management, room isolation)
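The wal module's tests cover an append-only log with hash chain integrity and tamper detection. The idea can be sketched in a few lines. This sketch assumes SHA-256 and JSON-encoded payloads purely for illustration; the real WAL is written in Rust and its on-disk format is not described here.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append(log, payload):
    """Append an entry that is chained to the current tail of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(log):
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


if __name__ == "__main__":
    wal = []
    append(wal, {"op": "write", "doc": "a"})
    append(wal, {"op": "delete", "doc": "a"})
    print(verify(wal))
    wal[0]["payload"]["doc"] = "b"  # simulate tampering
    print(verify(wal))
```

Because each hash depends on the one before it, changing an early entry invalidates every later link, which is what a tamper-detection test exercises.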
CLI integration tests (54 tests): Run the compiled p3ak-vault binary as a subprocess. Test every CLI command end-to-end: create, ingest, search, read, write, classify, delete, export, sync, watch, canary-add, canary-check, serve. Each test creates a temp vault, runs the command, validates stdout JSON, checks exit codes, verifies vault state after mutation.
Accuracy benchmarks (12 tests): Seed a vault with known documents, run 12 queries with known expected top-1 results. Measure Top-1, Top-3, and MRR (Mean Reciprocal Rank). Current: 98% Top-1 accuracy on the standard benchmark set. Tests fail if accuracy drops below 95%.
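The three accuracy metrics are standard and can be computed as in the sketch below. The function names and the toy query results are illustrative, and the actual benchmark harness is not shown in the source.

```python
def reciprocal_rank(ranked_ids, expected_id):
    """1/position of the expected document, or 0.0 if it was not returned."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == expected_id:
            return 1.0 / position
    return 0.0


def score(results):
    """results is a list of (ranked_ids, expected_id) pairs, one per query."""
    total = len(results)
    top1 = sum(1 for ranked, expected in results if ranked[:1] == [expected])
    top3 = sum(1 for ranked, expected in results if expected in ranked[:3])
    mrr = sum(reciprocal_rank(ranked, expected) for ranked, expected in results)
    return {"top1": top1 / total, "top3": top3 / total, "mrr": mrr / total}


if __name__ == "__main__":
    # Toy data: four queries that all expect document "a".
    toy = [
        (["a", "b"], "a"),
        (["b", "a", "c"], "a"),
        (["x", "y", "z", "a"], "a"),
        (["x"], "a"),
    ]
    print(score(toy))
```

A threshold check such as the 95% floor then reduces to comparing the `top1` value against 0.95 and failing the run when it falls below.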
BayouClaw shell suites (9 suites, ~55 checks): Run inside Docker container. Cover: end-to-end pipeline (HTTP → Amber → agent → vault → response), agent intelligence (response quality scoring via rubric), security audit (auth bypass, injection, privilege escalation, seccomp enforcement, socket permissions), policy engine (rate limiting, Iron Rule blocking, token guard), self-test (Amber evaluates her own output quality). The security suite has 34 individual checks across 6 attack categories.
Regression tests (51 tests, 7 categories): Automated bash script that tests live infrastructure: service health (HTTP endpoints), vault search across 6 vaults, tool definitions in TypeScript, HITL policy enforcement, security posture (git history, env files, port exposure, container user), voice pipeline (STT/TTS/Force Field), and file existence. Results exported as JSON for the ops dashboard.
What's missing (the gap): No CI/CD pipeline. No automated runs on push. No concurrency testing. No load testing. No degraded-state testing. No mutation testing. No fuzz testing on vault binary format. These are the production hardening deliverables.
Where Sean Fits
Production Hardening Roadmap
P3AK's architecture is correct. The gaps are production readiness, not design. No CI/CD. No formal pen test. No automated key rotation. No load benchmarks. That's a QA and security engineering engagement. Here's the exact work.