Self-optimizing AI infrastructure for agentic applications. An RL + GAN framework where AI evaluators score the product, AI engineers iterate the code, and the architecture learns from its own failures.
Getting from prototype to production is brutally manual. Engineers spend months hand-tuning prompts, evaluation frameworks, and orchestration logic, and it breaks every time the underlying models change.
Prompt engineering, eval tuning, and orchestration changes require human engineers at every step. The cycle time is weeks, not minutes.
Current tools help you BUILD agents but not IMPROVE them. There's no automated way to evaluate output, identify failures, and iterate.
Every model update, every new capability, every edge case requires human intervention. Engineering teams become the bottleneck.
Software that evaluates itself, writes its own code, tests with simulated users, and iterates, 300 times over. No human engineering required.
7 AI evaluators (modeled after world-class VCs and product leaders) independently test the product via browser, research competitors, and score output against multi-dimensional objectives.
7 AI project managers translate the board's critique into sprint tickets. Every board concern maps to an engineering task; gradient propagation is measured (M/N ratio).
Engineering squads write real code in parallel git worktrees. Real tests, real commits. The architecture optimizes itself through adversarial learning.
10 simulated beta testers interact with the live product and provide NPS scores. User feedback is the ground truth that calibrates the discriminator.
┌──── Environment (Ground Truth) ────┐
│ Beta tester NPS                    │
│ Competitor landscape               │
│ Technical health (tests, build)    │
└────────────┬───────────────────────┘
             │
             ▼
    ┌──── Reward R(t) ────┐
    │ 0.25 × PMF          │
    │ 0.20 × Board        │
    │ 0.15 × Moat         │
    │ 0.15 × Design       │
    │ 0.10 × Technical    │
    │ 0.10 × Competitive  │
    │ 0.05 × Founder      │
    └─────────┬───────────┘
              │
              ▼
    ┌──── Policy π(s) ────┐
    │ EXPLOIT / EXPLORE   │
    │ PIVOT / RESEARCH    │
    │ CONSOLIDATE         │
    └─────────┬───────────┘
              │
              ▼
    ┌─── Generator ───┐
    │ 7 PMs           │
    │ 2+ Eng squads   │
    │ 10 Users        │
    └─────────────────┘

Real results from a continuous autonomous run on a single Mac Studio (128GB Apple Silicon).
The non-monotonic curve is by design: a discriminator that only goes up has collapsed.
The R15-R17 drop: the system detected hallucinated metrics (engineering claimed 10B stores; the real number was 4). The board independently verified via the live product and crashed the score, exactly what adversarial learning should do.
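The reward R(t) in the loop above is a weighted sum over seven evaluation dimensions. A minimal sketch, assuming each dimension is scored on a 0-10 scale (the dimension names follow the diagram; the scale is an assumption):

```python
# Weights from the reward diagram; they sum to 1.0.
WEIGHTS = {
    "pmf": 0.25,
    "board": 0.20,
    "moat": 0.15,
    "design": 0.15,
    "technical": 0.10,
    "competitive": 0.10,
    "founder": 0.05,
}

def reward(scores: dict[str, float]) -> float:
    """Weighted sum over the seven evaluation dimensions."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

r = reward({
    "pmf": 7.0, "board": 6.5, "moat": 5.0, "design": 8.0,
    "technical": 9.0, "competitive": 6.0, "founder": 7.5,
})
# r ≈ 6.875
```

Because the weights sum to 1, R(t) stays on the same scale as the individual dimension scores, which keeps round-over-round curves comparable.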
Built on reinforcement learning with a GAN-style adversarial discriminator component.
7 board evaluators + 7 PMs + engineering squads + 10 users run as parallel AI sessions. Each agent has persistent memory across rounds.
Qwen 3.5 122B runs locally on Apple Silicon via MLX at 42 tok/s. The orchestrator never sends sensitive data to external APIs.
944 files indexed with bge-m3 embeddings. Hybrid BM25 + vector search enables experience replay across 300 rounds.
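One common way to combine the two retrievers is to min-max-normalize each score list per query and blend them. A minimal sketch under that assumption (function names and the blend weight `alpha` are illustrative, not the actual implementation):

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize one retriever's scores into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_rank(bm25: dict[str, float],
                vector: dict[str, float],
                alpha: float = 0.5) -> list[str]:
    """Blend normalized BM25 and vector scores; return docs best-first."""
    b, v = normalize(bm25), normalize(vector)
    docs = set(b) | set(v)
    blended = {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0)
               for d in docs}
    return sorted(blended, key=blended.get, reverse=True)

ranked = hybrid_rank({"a": 10.0, "b": 5.0, "c": 0.0},
                     {"a": 0.2, "b": 0.9, "c": 0.5})
# ranked == ["b", "a", "c"]
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities live in a narrow range; blending them unnormalized would let one retriever dominate.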
3-minute heartbeat, watchdog cron for stall detection, auto-restart via LaunchAgents. Stalls longer than 60 minutes are automatically resolved.
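The stall-detection logic reduces to comparing the last heartbeat timestamp against a threshold. A sketch using the thresholds above, where `restart` stands in for the actual LaunchAgents restart hook:

```python
import time

HEARTBEAT_SEC = 3 * 60   # loop writes a heartbeat every 3 minutes
STALL_SEC = 60 * 60      # no heartbeat for 60 min counts as a stall

def is_stalled(last_heartbeat: float, now: float) -> bool:
    """True if the run has gone quiet past the stall threshold."""
    return (now - last_heartbeat) > STALL_SEC

def watchdog_tick(last_heartbeat: float, restart) -> None:
    """Called periodically by the watchdog cron job."""
    if is_stalled(last_heartbeat, time.time()):
        restart()  # e.g. kick the LaunchAgent to relaunch the run
```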
Score-adaptive difficulty: the discriminator gets HARDER as the product improves. Anti-convergence rules prevent groupthink at high scores.
Alternative ideas tracked as bandit arms with UCB scoring. Exploration rate decays from 30% to 1% over 300 rounds.
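A sketch of the idea bandit, assuming UCB1 scoring plus an epsilon-greedy exploration rate that decays linearly from 30% to 1% over 300 rounds (the class name and the exact decay schedule are assumptions):

```python
import math
import random

class IdeaBandit:
    """Alternative product ideas as bandit arms, scored with UCB1."""

    def __init__(self, arms: list[str]):
        self.counts = {a: 0 for a in arms}
        self.rewards = {a: 0.0 for a in arms}

    def epsilon(self, round_num: int, total: int = 300) -> float:
        """Exploration rate decaying linearly from 0.30 to 0.01."""
        frac = min(round_num / total, 1.0)
        return 0.30 + (0.01 - 0.30) * frac

    def select(self, round_num: int) -> str:
        # Try every arm once before scoring.
        untried = [a for a, n in self.counts.items() if n == 0]
        if untried:
            return untried[0]
        # Decayed epsilon-greedy exploration.
        if random.random() < self.epsilon(round_num):
            return random.choice(list(self.counts))
        # Otherwise pick the highest UCB1 score.
        t = sum(self.counts.values())
        def ucb(a: str) -> float:
            mean = self.rewards[a] / self.counts[a]
            return mean + math.sqrt(2 * math.log(t) / self.counts[a])
        return max(self.counts, key=ucb)

    def update(self, arm: str, reward: float) -> None:
        self.counts[arm] += 1
        self.rewards[arm] += reward
```

The UCB bonus term shrinks as an arm accumulates pulls, so under-explored ideas keep getting reconsidered even late in the run, while the decaying epsilon keeps pure random exploration from disrupting a converged product.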
Solo founder & CEO
We're looking for design partners building agentic AI applications who want to accelerate from prototype to production.