Head of AI Evaluation & Reliability Engineering

Location: Flexible / Hybrid
Reports To: Head of Engineering

Role Mission
Build and scale Codvo’s AI Evaluation & Reliability Engineering capability as a core engineering function supporting the design, validation, and continuous improvement of enterprise AI systems in production. You will architect the frameworks, tooling, benchmark assets, and operational processes required to ensure AI systems deployed by Codvo and its customers meet enterprise standards for reliability, safety, performance, and governance. This role is deeply embedded within engineering and serves as the quality and reliability backbone for Codvo’s AI platform and delivery organization.

Why This Role Matters
As AI systems move from pilots to business-critical workflows, reliability and evaluation become core engineering disciplines, not optional afterthoughts. Codvo is building the infrastructure and operational rigor required to ensure every AI deployment is measurable, governed, and production-ready.

Core Responsibilities

Engineering Ownership
- Build Codvo’s AI Evaluation & Reliability Engineering function as a core platform/engineering capability.
- Define engineering standards for AI evaluation, testing, release gating, and runtime monitoring.
- Integrate evaluation/reliability frameworks into Codvo’s engineering and delivery lifecycle.

Evaluation Architecture
- Design reusable evaluation frameworks for:
  - LLM / multimodal quality
  - RAG grounding / evidence fidelity
  - Agent reasoning / decision quality
  - Tool / workflow execution success
  - Safety / policy / compliance adherence
  - Cost / latency / production economics

Benchmark Infrastructure
- Build benchmark packs, golden datasets, and regression suites for priority enterprise workflows.
- Define benchmark coverage and versioning standards.
- Establish processes for edge-case capture and benchmark expansion.
Runtime Reliability Systems
- Design systems and processes for:
  - Runtime drift / degradation monitoring
  - Failure mode analysis / incident diagnostics
  - Human review / escalation pathways
  - Continuous evaluation and improvement loops

Technical Leadership
- Partner closely with platform, product, and solution engineering teams.
- Serve as internal SME on AI reliability, benchmark design, and evaluation methodology.
- Help shape architecture standards for AI-native product and workflow delivery.

Team Leadership
- Build and lead a team of:
  - Evaluation Engineers
  - Benchmark / QA Engineers
  - Reliability / Observability Engineers
  - Domain Review / Feedback Ops Specialists

Required Qualifications
- 10+ years in engineering / AI / ML leadership roles.
- 5+ years building or operating production AI / ML systems.
- Proven experience designing or operating:
  - AI/LLM evaluation frameworks
  - Benchmark / regression systems
  - AI QA / testing / validation infrastructure
  - Production ML / observability / monitoring systems
  - Reliability engineering / quality engineering organizations

Technical Expertise
- LLM / multimodal evaluation methodologies
- Benchmark / golden dataset design
- Agent / tool-use / workflow evaluation
- RAG evaluation / grounding analysis
- AI observability / telemetry / tracing
- Human-in-the-loop feedback systems
- AI safety / governance / policy testing
- Release gating / CI/CD / engineering quality systems

Preferred Backgrounds
- AI infrastructure / evaluation platforms
- AI observability / MLOps companies
- Enterprise AI platform teams
- Applied AI product / platform organizations
- Reliability / QA engineering leadership in complex systems

Success Metrics
- Establish Codvo-wide AI evaluation/reliability standards.
- Integrate evaluation frameworks into the engineering lifecycle.
- Launch reusable benchmark packs for target workflows.
- Reduce AI production failure / exception rates across deployments.
- Improve release confidence and deployment velocity for AI systems.
- Increase benchmark/evaluation asset reuse across customers.

Ideal Candidate Profile
- Systems/reliability engineer mindset with strong AI depth
- Product-minded builder who can create reusable engineering frameworks
- Obsessed with operational excellence and measurable quality
- Comfortable driving standards across engineering organizations

Note: Please apply via our official careers portal only, as applications sent directly to executives may not be considered.
Requirements
10+ years in engineering / AI / ML leadership roles
5+ years building or operating production AI / ML systems
Proven experience designing or operating:
- AI/LLM evaluation frameworks
- Benchmark / regression systems
- AI QA / testing / validation infrastructure
Benefits
401k Matching
Health Insurance
Paid Time Off
Remote Work
Stock Options
Skills
Python
Machine Learning
AI Engineering
Evaluation Frameworks
Reliability Engineering
Benchmarking