Why This Role Exists
Mercor partners with leading AI teams to improve the quality, usefulness, and reliability of general-purpose conversational AI systems.
This role focuses on evaluating and improving general chat behavior in large language models (LLMs). You will assess model-generated responses across diverse topics, provide structured human feedback, and help ensure AI systems communicate clearly, accurately, and effectively.
What You’ll Do
- Evaluate LLM-generated responses for accuracy and effectiveness.
- Conduct fact-checking using trusted public sources and external tools.
- Annotate strengths, weaknesses, and factual inaccuracies in responses.
- Assess reasoning quality, clarity, tone, and completeness.
- Ensure responses align with expected conversational guidelines.
- Apply consistent annotations using detailed taxonomies and benchmarks.
Who You Are
- Bachelor’s degree holder.
- Native Spanish speaker, or Spanish proficiency at ILR 5 / CEFR C2.
- Fluent in English.
- Experienced user of large language models (LLMs).
- Strong writing and structured analytical skills.
- Detail-oriented and capable of identifying subtle issues.
- Comfortable working across multiple domains.
- Strong college-level mathematics skills.
Nice-to-Have
- Experience with RLHF, model evaluation, or annotation workflows.
- Background in research, policy, analytics, linguistics, or engineering.
- Familiarity with evaluation rubrics and benchmarking systems.
- Experience comparing multiple AI outputs and making qualitative judgments.
What Success Looks Like
- You consistently identify factual and reasoning errors.
- Your evaluation artifacts are clear and reproducible.
- Your feedback measurably improves model response quality.
- Your reviews lead to improvements in AI systems before public release.
Contract & Payment
- Independent contractor engagement.
- Fully remote, flexible schedule.
- Weekly payment via Stripe or Wise.
- Open only to contributors in the USA and Mexico.
- Pay rate: $18.14 per hour.
About Mercor
Mercor partners with leading AI labs and enterprises to train frontier models using human expertise. Contributors collaborate with researchers to improve advanced AI systems used globally.