Remote LLM & Agent Evaluation jobs
Every LLM & Agent Evaluation listing below is remote and traceable to a real, verified employer. 108 open right now, updated daily; when a role goes stale it comes down within 24 hours.
LLM & Agent Evaluation is part of our AI & Data Training domain. Filter by pay, experience level, or country to narrow the list.
Prompt Writer (AI Training) About the Role What if your curiosity, creativity, and way with words could directly shape how the world's most advanced AI systems think, reason, and respond?
AI Chatbot Tester About the Role What if your opinions could make AI smarter, safer, and more useful for millions of people? As an AI Chatbot Tester with Alignerr, that's exactly what you'll do.
Seeking users with hands-on experience using enterprise AI agent systems to provide structured feedback on real-world workflows and tool behavior. This fully remote contractor role offers flexible scheduling and pays $30 per hour.
Role Overview Mercor is partnering with a leading AI research laboratory to improve the quality of AI-generated document evaluation systems through structured human review. This role is designed for highly detail-oriented professionals who can critically assess AI-generated…
Mercor is seeking Internet-Native Bilingual Evaluator Experts (Spanish–English) to contribute to cutting-edge AI research focused on multilingual communication, internet culture, and real-world language use. In this role, you will evaluate, annotate, and improve AI model…
Mercor is seeking Internet-Native Bilingual Evaluator Experts (French–English) to contribute to advanced AI research focused on multilingual communication, internet culture, and real-world language use. This role is ideal for individuals who actively participate in both French-…
Mercor is seeking Internet-Native Bilingual Evaluator Experts (Japanese–English) to contribute to advanced AI research focused on multilingual communication, internet culture, and contemporary digital language use. This role is ideal for individuals who actively participate in…
About Turing Based in San Francisco, California, Turing is a leading research accelerator supporting frontier AI laboratories and enterprises deploying advanced AI systems. Turing helps organizations improve AI performance through high-quality datasets, advanced training…
As an AI Data Trainer with Linux/Ubuntu, you will work remotely on hourly-paid AI data training projects involving structured, detail-oriented tasks. Responsibilities include installing applications, following detailed instructions, and recording actions while completing…
Job Summary As a Kurdish Language Expert, you will leverage your linguistic expertise to help train next-generation AI systems through translation, editing, linguistic analysis, data annotation, and language quality evaluation tasks. Your contributions will help AI systems…
Job Summary As a Wolof Language Expert, you will use your linguistic expertise to help train next-generation AI systems through high-quality translation, localization, writing, editing, and language evaluation tasks. Your work will contribute directly to improving how AI…
About the Role A leading AI research organization is seeking experienced AI power users to evaluate how effectively advanced language models perform real-world, highly personalized life-assistance tasks. Contributors will assess AI outputs across domains such as productivity,…
Please submit your CV in English and indicate your level of English proficiency. Mindrift connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems.
About the Role Turing is seeking Korean-speaking AI Quality Analysts to evaluate a new personalization feature for Gemini. In this role, you will assess how effectively the AI uses information from a user's Gemini conversations, Gmail, Google Search history, and YouTube activity…
Mercor is seeking bilingual AI Safety Experts fluent in both English and Marathi to help evaluate and strengthen the safety of frontier AI systems. This role focuses on red teaming AI models by identifying vulnerabilities, testing misuse scenarios, and generating high-quality…
Mercor is seeking Dutch Audio Generalist Evaluator Experts to contribute to a high-impact AI research project focused on audio understanding, transcription, annotation, and model evaluation. Contributors will help train and benchmark advanced language models by converting audio…
Mercor is seeking an Argentinian Spanish Audio Generalist Evaluator Expert to contribute to a high-impact AI research project focused on audio understanding, transcription, annotation, and model evaluation. This role supports the training and benchmarking of advanced language…
Senior Software Engineer — Agentic Coding (AI Training) About the Role What if your software engineering expertise could define how the next generation of AI writes, debugs, and ships code on its own? We're looking for Senior Software Engineers in Seattle to work at the…
About the Role In this remote contractor role, you'll apply your Nepali language expertise to help train next-generation AI systems. Your work will shape how models learn, reason, and perform through high-quality, real-world input. Key Responsibilities…
Mercor is seeking bilingual AI Safety Experts fluent in both English and Odia to help evaluate and strengthen the safety of frontier AI systems. This role focuses on red teaming AI models by identifying vulnerabilities, testing misuse scenarios, and generating high-quality…
Mercor is seeking bilingual AI Safety Experts fluent in both English and Gujarati to help evaluate and strengthen the safety of frontier AI systems. This role focuses on red teaming AI models by identifying vulnerabilities, testing misuse scenarios, and generating high-quality…
Mercor is seeking bilingual AI Safety Experts fluent in both English and Assamese to help evaluate and strengthen the safety of frontier AI systems. This role focuses on red teaming AI models by identifying vulnerabilities, testing misuse scenarios, and generating high-quality…
Mercor is seeking bilingual AI Safety Experts fluent in both English and Punjabi to help evaluate and strengthen the safety of frontier AI systems. This role focuses on red teaming AI models by identifying vulnerabilities, testing misuse scenarios, and generating high-quality…
Mercor is seeking bilingual AI Safety Experts fluent in both English and Malayalam to help evaluate and strengthen the safety of frontier AI systems. This role focuses on red teaming AI models by identifying vulnerabilities, testing misuse scenarios, and generating high-quality…
Mercor is seeking bilingual AI Safety Experts fluent in both English and Telugu to help evaluate and strengthen the safety of frontier AI systems. This role focuses on red teaming AI models by identifying vulnerabilities, testing misuse scenarios, and generating high-quality…
Mercor is seeking bilingual AI Safety Experts fluent in both English and Tamil to help evaluate and strengthen the safety of frontier AI systems. This role focuses on red teaming AI models by identifying vulnerabilities, testing misuse scenarios, and generating high-quality…
Mercor is seeking bilingual AI Safety Experts fluent in both English and Kannada to help evaluate and strengthen the safety of frontier AI systems. This role focuses on red teaming AI models by identifying vulnerabilities, testing misuse scenarios, and generating high-quality…
Mercor is seeking bilingual AI Safety Experts fluent in both English and Urdu to help evaluate and strengthen the safety of frontier AI systems. This role focuses on red teaming AI models by identifying vulnerabilities, testing misuse scenarios, and generating high-quality…
Mercor is seeking bilingual AI Safety Experts fluent in both English and Bengali to help evaluate and strengthen the safety of frontier AI systems. This role focuses on red teaming AI models by identifying vulnerabilities, testing misuse scenarios, and generating high-quality…
DataAnnotation is seeking Secondary Education Teachers to help train and evaluate AI models. In this role, contributors will assess AI-generated responses, evaluate reasoning quality, and complete writing and editing tasks that help improve AI performance and accuracy.…
As an AI Evaluation Specialist, you will help train and evaluate next-generation AI systems by designing evaluation tasks, creating grading rubrics, and assessing AI performance across real-world computer-based workflows. This role focuses on structured thinking, precise…
In this role, you'll apply your expertise to help train next-generation AI systems. Your work will shape how models learn, reason, and perform through high-quality, real-world input. No prior experience in AI is required; your domain knowledge is what matters. Key…
As a Generalist, you will contribute to the training and improvement of next-generation AI systems by providing high-quality real-world input, feedback, and operational support. This role is designed for adaptable professionals who can work efficiently across multiple tasks and…
Mercor is seeking experienced enterprise professionals to contribute to an AI evaluation and benchmarking project in partnership with a leading AI lab. Contributors will evaluate the quality of AI-generated reasoning, decision-making, and analytical outputs across complex…
Vetto is hiring contributors for the Safety Project — Emotional Distress (Role-Play Track), an AI safety initiative focused on evaluating how AI systems respond during emotionally sensitive and high-risk conversations. The project centers on testing whether AI systems can…
Centific is hiring a Research Intern focused on Multimodal LLM Benchmarking to contribute to advanced AI evaluation research involving multimodal foundation models. This internship focuses on designing, executing, and analyzing benchmark systems for AI models operating across:…
Innodata (Nasdaq: INOD) is a global data engineering company. We believe that data and Artificial Intelligence (AI) are inextricably linked. Our mission is to enable the responsible advancement of artificial intelligence by providing the data, evaluation frameworks, and human…
Mercor is seeking analytically minded generalists to help train AI systems on real-world reasoning and visual understanding tasks. This role focuses on evaluating AI performance across multi-modal reasoning, real-world interpretation, spatial reasoning, common-sense problem…
Mercor is hiring remote generalists to contribute to AI training and evaluation projects for leading AI labs and enterprises. This opportunity involves working on projects focused on improving frontier AI systems using human expertise across a variety of tasks and domains.…
About Volga Partners Volga Partners is a U.S.-based company specializing in Artificial Intelligence, machine learning, business process outsourcing, and professional services for leading technology companies and multinational organizations. Role Overview Volga Partners…
Role Overview We are seeking detail-oriented AI Evaluation & Annotation Specialists to help train and improve Large Language Models (LLMs). In this role, you will review AI-generated responses, evaluate output quality, annotate content, and follow structured guidelines to…
About the Role DataAnnotation is seeking Customer Support Specialists to help train and improve next-generation AI systems. In this role, you will evaluate AI chatbot outputs, assess response quality, and help improve language accuracy, reasoning, and customer-facing…
About the Role DataAnnotation is hiring Freelance Writers to help train and improve advanced AI language models. In this role, you will create writing and editing tasks for AI systems, evaluate chatbot-generated responses, and help improve the overall quality, logic, and…
About the Role DataAnnotation is hiring AI Content Writing Specialists to help train and improve advanced AI language models. In this role, you will evaluate AI-generated writing, create editing and writing tasks, assess response quality, and help improve chatbot performance…
About the Role Surge AI is hiring Generative AI Generalists to help train and improve advanced AI chatbot systems. In this role, you will evaluate AI-generated responses, assess logical reasoning, provide writing and editing tasks, and help improve overall AI model quality and…
About the Role Query AI is hiring Germany-based AI Generalist Trainers to support the evaluation and improvement of advanced AI systems. In this role, you will assess, compare, and rank AI-generated outputs across a variety of domains while helping improve the quality,…
About Turing Turing is one of the world’s fastest-growing AI companies helping advance frontier AI systems and build real-world AI solutions for businesses worldwide. The company collaborates with leading AI labs to improve reasoning, coding, multimodal understanding,…
Tips for finding remote jobs
- Set up job alerts on multiple platforms to never miss an opportunity.
- Highlight your remote work experience and self-management skills.
- Prepare for video interviews and remote work assessments.
- Customize your resume and cover letter for each remote position.
- Build a strong online presence on LinkedIn and professional networks.