abra

Open Positions at abra

AI Evaluation & Reliability Engineer (Agents & LLM Systems)

abra

Full Time

$120,000 - $180,000*

2 months ago

Worldwide

AI Governance & Programs

Senior

Python

LLMs

Evaluation Frameworks

Job Description

abra R&D is looking for an AI Evaluation & Reliability Engineer (Agents & LLM Systems) who will take part in building the next-generation agentic analytics platform, the first real-time database optimized for AI agents at scale.

We’re looking for a Senior AI Evaluation & Reliability Engineer to define and build how AI agents are measured, validated, monitored, and improved in production. This role sits at the intersection of LLM systems, evaluation research, and production-grade engineering. You will design evaluation methodologies, build LLM-as-a-judge systems, and develop agent-based testing frameworks to ensure correctness, robustness, and reliability of complex multi-agent workflows operating on real-time data.

What You’ll Do:

Design and implement evaluation frameworks for AI agents and multi-agent systems
Build LLM-as-a-judge pipelines to assess correctness, reasoning quality, and output quality
Develop agent-based evaluation systems (agents evaluating agents) for scalable testing
Define metrics, benchmarks, scorecards, and methodologies for agent reliability and performance
Build data-driven evaluation pipelines using synthetic and real-world datasets
Identify and analyze failure modes, edge cases, and non-deterministic behaviors
Improve agent robustness, consistency, and reliability in production environments
Work with tools such as Google ADK, Opik, and related evaluation frameworks
Collaborate closely with AI, platform, and database teams to shape agent–data interaction quality

Must have:

4–8+ years of experience in software engineering, AI systems, or evaluation/QA engineering
Strong programming skills in Python
Hands-on experience working with LLMs in production environments
Experience building evaluation systems, automation frameworks, or testing infrastructure
Strong understanding of prompt engineering, tool use, and agent behavior
Ability to think in terms of metrics, correctness, and system reliability

Nice to have:

Experience with LLM evaluation frameworks (Opik, LangSmith, etc.)
Experience with Google ADK / agent frameworks
Experience implementing LLM-as-a-judge or ranking systems
Background in data systems, analytics, or real-time pipelines
Experience with multi-agent systems
Familiarity with statistical evaluation methods or experimentation (A/B testing, scoring systems)

Requirements

4–8+ years of experience in software engineering, AI systems, or evaluation/QA engineering
Strong programming skills in Python
Hands-on experience working with LLMs in production environments
Experience building evaluation systems, automation frameworks, or testing infrastructure
Strong understanding of prompt engineering, tool use, and agent behavior
Ability to think in terms of metrics, correctness, and system reliability

Benefits

401k Matching
Certification Support
Flexible Hours
Health Insurance
Home Office Budget
Learning Budget
Paid Time Off
Remote Work

Skills

Python LLMs Evaluation Frameworks Google ADK Opik

AI Evaluation & Reliability Engineer

abra

Full Time

$120,000 - $180,000*

2 months ago

Worldwide

AI Governance & Programs

Senior

Python

LLMs

Evaluation Frameworks

Job Description

abra R&D is looking for a Reliability Engineer who will take part in building the next-generation agentic analytics platform, the first real-time database optimized for AI agents at scale.

We’re looking for a Senior AI Evaluation & Reliability Engineer to define and build how AI agents are measured, validated, monitored, and improved in production. This role sits at the intersection of LLM systems, evaluation research, and production-grade engineering. You will design evaluation methodologies, build LLM-as-a-judge systems, and develop agent-based testing frameworks to ensure correctness, robustness, and reliability of complex multi-agent workflows operating on real-time data.

What You’ll Do:

Design and implement evaluation frameworks for AI agents and multi-agent systems
Build LLM-as-a-judge pipelines to assess correctness, reasoning quality, and output quality
Develop agent-based evaluation systems (agents evaluating agents) for scalable testing
Define metrics, benchmarks, scorecards, and methodologies for agent reliability and performance
Build data-driven evaluation pipelines using synthetic and real-world datasets
Identify and analyze failure modes, edge cases, and non-deterministic behaviors
Improve agent robustness, consistency, and reliability in production environments
Work with tools such as Google ADK, Opik, and related evaluation frameworks
Collaborate closely with AI, platform, and database teams to shape agent–data interaction quality

Must have:

4–8+ years of experience in software engineering, AI systems, or evaluation/QA engineering
Strong programming skills in Python
Hands-on experience working with LLMs in production environments
Experience building evaluation systems, automation frameworks, or testing infrastructure
Strong understanding of prompt engineering, tool use, and agent behavior
Ability to think in terms of metrics, correctness, and system reliability

Nice to have:

Experience with LLM evaluation frameworks (Opik, LangSmith, etc.)
Experience with Google ADK / agent frameworks
Experience implementing LLM-as-a-judge or ranking systems
Background in data systems, analytics, or real-time pipelines
Experience with multi-agent systems
Familiarity with statistical evaluation methods or experimentation (A/B testing, scoring systems)

Requirements

4–8+ years of experience in software engineering, AI systems, or evaluation/QA engineering
Strong programming skills in Python
Hands-on experience working with LLMs in production environments
Experience building evaluation systems, automation frameworks, or testing infrastructure
Strong understanding of prompt engineering, tool use, and agent behavior
Ability to think in terms of metrics, correctness, and system reliability

Benefits

401k Matching
Certification Support
Flexible Hours
Health Insurance
Home Office Budget
Learning Budget
Paid Time Off
Remote Work

Skills

Python LLMs Evaluation Frameworks Google ADK Opik

Senior AI Evaluation & Reliability Engineer

abra

Full Time

$120,000 - $180,000*

2 months ago

Worldwide

AI Governance & Programs

Senior

Python

PyTorch

TensorFlow

Job Description

abra R&D is looking for a Staff AI Engineer who will take part in building a next-generation agentic analytics platform powered by a real-time, AI-optimized data infrastructure.

We are looking for an experienced AI Engineer to design, build, and deploy intelligent systems that operate at scale and in real time. This role is hands-on and product-oriented, focusing on developing, integrating, and productionizing AI and machine learning models as part of a complex, high-performance platform.

What You Will Do:

Design, develop, and deploy AI and machine learning models into production systems
Build scalable AI services that operate on large-scale and real-time data
Implement deep learning and machine learning solutions using modern frameworks
Integrate AI models into end-to-end product flows and backend systems
Collaborate closely with software engineers and AI teams to deliver production-ready solutions
Optimize model performance, reliability, and scalability in real-world environments
Develop and maintain data pipelines and model-serving infrastructure
Contribute to the evolution of AI-powered, agent-based systems and analytics capabilities

Must have:

3+ years of experience in AI engineering, machine learning engineering, or applied ML in production
Strong programming skills in Python
Hands-on experience with PyTorch or TensorFlow
Experience implementing ML models using frameworks such as scikit-learn, XGBoost, or LightGBM
Solid experience with data processing tools (Pandas, NumPy, Spark)
Experience working with large-scale or real-time data systems
Strong software engineering mindset with a focus on reliability and maintainability

Strong Advantages:

Experience deploying AI models in production environments
Familiarity with LLM-based systems, AI agents, or agentic workflows
Experience with event-driven or real-time analytics systems
Background in AI-powered platforms or data-driven products

Requirements

3+ years of experience in AI engineering, machine learning engineering, or applied ML in production
Strong programming skills in Python
Hands-on experience with PyTorch or TensorFlow
Experience implementing ML models using frameworks such as scikit-learn, XGBoost, or LightGBM
Solid experience with data processing tools (Pandas, NumPy, Spark)
Experience working with large-scale or real-time data systems
Strong software engineering mindset with a focus on reliability and maintainability

Benefits

401k Matching
Certification Support
Flexible Hours
Health Insurance
Home Office Budget
Learning Budget
Performance Bonus
Remote Work

Skills

Python PyTorch TensorFlow Scikit-learn XGBoost Pandas LightGBM NumPy
