Research Intern, Multimodal LLM Benchmarking

Centific

LLM & Agent Evaluation Intern · Full-time
United States · $40/hr · May 16, 2026

Job description

Centific is hiring a Research Intern focused on Multimodal LLM Benchmarking to contribute to advanced AI evaluation research involving multimodal foundation models.

This internship focuses on designing, executing, and analyzing benchmark systems for AI models operating across:

  • Text
  • Images
  • Audio
  • Video
  • Cross-modal retrieval systems

Contributors will work on cutting-edge multimodal evaluation problems involving benchmark design, scoring methodologies, dataset curation, and AI model analysis.

About Centific

Centific is an AI data and infrastructure company specializing in:

  • AI evaluation
  • Data curation
  • Fine-tuned LLMs
  • RAG pipelines
  • Enterprise AI deployment

The company works with enterprise clients and frontier AI organizations to support safe and scalable AI systems.

Centific’s ecosystem includes:

  • 150+ PhDs and data scientists
  • 4,000+ AI practitioners and engineers
  • 1.8 million domain experts across 230+ markets

Responsibilities

Multimodal Benchmark Design & Development

  • Design benchmark suites for multimodal foundation models involving:
    • Text-image tasks
    • Text-audio tasks
    • Text-video tasks
    • Cross-modal retrieval systems
  • Define:
    • Evaluation formats
    • Annotation guidelines
    • Scoring criteria
    • Benchmark coverage dimensions

Benchmark Execution & Analysis

  • Run multimodal models against benchmark suites (a minimal evaluation loop is sketched after this list)
  • Analyze:
    • Performance trends
    • Failure modes
    • Evaluation outcomes
  • Produce actionable research summaries and recommendations
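
For a concrete sense of this work, here is a minimal sketch of a benchmark-execution loop using the Hugging Face Transformers visual-question-answering pipeline. The checkpoint and the JSONL benchmark file are illustrative assumptions, not part of the role description:

```python
import json

from transformers import pipeline

# Hypothetical benchmark file: one JSON object per line, e.g.
# {"image": "imgs/001.png", "question": "What color is the bus?", "answer": "red"}
BENCH_PATH = "vqa_benchmark.jsonl"

# Any Hugging Face VQA checkpoint works here; ViLT is a small public example.
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

correct, failures = 0, []
examples = [json.loads(line) for line in open(BENCH_PATH)]

for ex in examples:
    pred = vqa(image=ex["image"], question=ex["question"], top_k=1)[0]["answer"]
    if pred.strip().lower() == ex["answer"].strip().lower():
        correct += 1
    else:
        failures.append((ex["question"], ex["answer"], pred))

print(f"Exact-match accuracy: {correct}/{len(examples)}")
for question, gold, pred in failures[:10]:  # inspect failure modes by hand
    print(f"Q: {question}\n  gold: {gold}\n  pred: {pred}")
```

Exact-match scoring is the simplest case; the analysis work described above is largely about deciding when such a metric is adequate and what to use when it is not.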

Metric & Scoring Research

  • Investigate automated evaluation approaches, including:
    • Model-as-judge systems (a hedged sketch follows this list)
    • Reference-free metrics
    • Human-alignment evaluation
  • Evaluate trade-offs among:
    • Reliability
    • Validity
    • Scalability
    • Evaluation cost
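
As one concrete example of a model-as-judge setup, the sketch below asks an LLM to grade a candidate image caption against a reference on a 1–5 scale. The OpenAI client, the judge model name, and the rubric wording are all illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical grading rubric; production benchmarks need more careful prompts.
JUDGE_PROMPT = """You are grading an image caption against a reference.
Reference: {reference}
Candidate: {candidate}
Score the candidate from 1 (unrelated) to 5 (fully faithful).
Reply with the number only."""

def judge_caption(reference: str, candidate: str, model: str = "gpt-4o") -> int:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # near-deterministic scoring improves judge reliability
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(reference=reference, candidate=candidate),
        }],
    )
    return int(response.choices[0].message.content.strip())

print(judge_caption("A dog catching a red frisbee in a park.",
                    "A dog plays with a frisbee outdoors."))
```

The trade-offs listed above show up directly in a loop like this: judge calls cost money and time (scalability, cost), and the scores must be checked against human ratings (reliability, validity).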

Dataset Curation & QA

  • Support:
    • Data collection
    • Annotation workflows
    • Dataset filtering
    • Inter-rater reliability analysis (a minimal agreement check is sketched below)
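
Inter-rater reliability is typically summarized with a chance-corrected agreement statistic. A minimal sketch using scikit-learn's Cohen's kappa, with made-up annotator labels:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary quality labels from two annotators on the same ten items.
rater_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```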

Literature Review & Methodology

  • Survey multimodal evaluation literature
  • Identify gaps in existing benchmark systems
  • Propose novel evaluation approaches grounded in current research

Documentation & Communication

  • Produce, for both technical and non-technical audiences:
    • Research write-ups
    • Benchmark documentation
    • Internal reports
    • Presentation-ready summaries

Focus Areas

Depending on project alignment, contributors may work on:

Vision-Language Evaluation

  • Image captioning
  • Visual question answering
  • Chart reasoning
  • Image-text alignment

Audio & Speech-Language Benchmarking

  • Spoken language understanding
  • Audio captioning
  • Speech-text evaluation

Video Understanding

  • Temporal reasoning
  • Video QA
  • Video-text retrieval

Cross-Modal Robustness

  • Distribution-shift testing (a simple perturbation sweep is sketched after this list)
  • Adversarial multimodal inputs
  • Robustness analysis
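
One simple form of distribution-shift testing is a perturbation sweep: degrade the inputs by a controlled amount and watch how predictions move. A sketch under the same assumptions as the VQA loop above (the image path, question, and gold answer are placeholders):

```python
from PIL import Image, ImageFilter
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

# Hypothetical single example; a real robustness study sweeps the full benchmark.
image_path, question, gold = "imgs/001.png", "What color is the bus?", "red"

for radius in (0.0, 1.0, 2.0, 4.0):  # increasing blur = increasing shift
    blurred = Image.open(image_path).filter(ImageFilter.GaussianBlur(radius))
    pred = vqa(image=blurred, question=question, top_k=1)[0]["answer"]
    print(f"blur={radius}: pred={pred!r} correct={pred.lower() == gold}")
```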

Automated Multimodal Scoring

  • Judge-model evaluation systems
  • Open-ended multimodal generation evaluation

Required Qualifications

  • Currently enrolled in an MS or PhD program in:
    • Computer Science
    • Machine Learning
    • Statistics
    • AI
    • Linguistics
    • A related quantitative field

  • Experience with:
    • Multimodal models
    • NLP systems
    • Vision-language systems
    • Audio or video ML tasks
  • Exposure to:
    • Benchmark design
    • Model evaluation
    • Experimental analysis
  • Strong Python skills
  • Experience with:
    • PyTorch
    • Hugging Face Transformers
  • Basic statistical analysis knowledge
  • Strong written and verbal communication skills

Preferred Qualifications

  • Experience with multimodal models such as:
    • LLaVA
    • GPT-4o
    • Gemini
    • Flamingo
  • Familiarity with benchmarks including:
    • MMBench
    • MMMU
    • SEED-Bench
    • VQAv2
    • AudioCaps
    • ActivityNet-QA
  • Experience with:
    • Annotation tools
    • Human evaluation workflows
    • Model-as-judge systems
  • Research publications or open-source contributions in:
    • Multimodal ML
    • NLP
    • AI evaluation

What You'll Gain

  • Mentorship from senior AI researchers and ML engineers
  • Ownership of publishable multimodal research projects
  • Exposure to enterprise AI workflows and applied research teams
  • Potential co-authorship opportunities
  • Flexible remote work arrangement
  • Competitive internship compensation

Apply now

You will be redirected to the company's website to complete your application.
