Research Intern, Multimodal LLM Benchmarking

Centific

LLM & Agent Evaluation Intern · Full-time
United States · $40/hr · May 16, 2026

Job description

Centific is hiring a Research Intern focused on Multimodal LLM Benchmarking to contribute to advanced AI evaluation research involving multimodal foundation models.

This internship focuses on designing, executing, and analyzing benchmark systems for AI models operating across:

  • Text
  • Images
  • Audio
  • Video
  • Cross-modal retrieval systems

Contributors will work on cutting-edge multimodal evaluation problems involving benchmark design, scoring methodologies, dataset curation, and AI model analysis.

About Centific

Centific is an AI data and infrastructure company specializing in:

  • AI evaluation
  • Data curation
  • Fine-tuned LLMs
  • RAG pipelines
  • Enterprise AI deployment

The company works with enterprise clients and frontier AI organizations to support safe and scalable AI systems.

Centific’s ecosystem includes:

  • 150+ PhDs and data scientists
  • 4,000+ AI practitioners and engineers
  • 1.8 million domain experts across 230+ markets

Responsibilities

Multimodal Benchmark Design & Development

  • Design benchmark suites for multimodal foundation models involving:
    • Text-image tasks
    • Text-audio tasks
    • Text-video tasks
    • Cross-modal retrieval systems
  • Define:
    • Evaluation formats
    • Annotation guidelines
    • Scoring criteria
    • Benchmark coverage dimensions

Benchmark Execution & Analysis

  • Run multimodal models against benchmark suites (a minimal evaluation loop is sketched after this list)
  • Analyze:
    • Performance trends
    • Failure modes
    • Evaluation outcomes
  • Produce actionable research summaries and recommendations
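
For a concrete sense of this work, here is a minimal sketch of a benchmark-execution loop using the Hugging Face Transformers visual-question-answering pipeline. The checkpoint and the JSONL benchmark file are illustrative assumptions, not part of the role description:

```python
import json

from transformers import pipeline

# Hypothetical benchmark file: one JSON object per line, e.g.
# {"image": "imgs/001.png", "question": "What color is the bus?", "answer": "red"}
BENCH_PATH = "vqa_benchmark.jsonl"

# Any Hugging Face VQA checkpoint works here; ViLT is a small public example.
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

correct, failures = 0, []
examples = [json.loads(line) for line in open(BENCH_PATH)]

for ex in examples:
    pred = vqa(image=ex["image"], question=ex["question"], top_k=1)[0]["answer"]
    if pred.strip().lower() == ex["answer"].strip().lower():
        correct += 1
    else:
        failures.append((ex["question"], ex["answer"], pred))

print(f"Exact-match accuracy: {correct}/{len(examples)}")
for question, gold, pred in failures[:10]:  # inspect failure modes by hand
    print(f"Q: {question}\n  gold: {gold}\n  pred: {pred}")
```

Exact-match scoring is the simplest case; the analysis work described above is largely about deciding when such a metric is adequate and what to use when it is not.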

Metric & Scoring Research

  • Investigate automated evaluation approaches, including:
    • Model-as-judge systems (a hedged sketch follows this list)
    • Reference-free metrics
    • Human-alignment evaluation
  • Evaluate trade-offs among:
    • Reliability
    • Validity
    • Scalability
    • Evaluation cost
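
As one concrete example of a model-as-judge setup, the sketch below asks an LLM to grade a candidate image caption against a reference on a 1–5 scale. The OpenAI client, the judge model name, and the rubric wording are all illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical grading rubric; production benchmarks need more careful prompts.
JUDGE_PROMPT = """You are grading an image caption against a reference.
Reference: {reference}
Candidate: {candidate}
Score the candidate from 1 (unrelated) to 5 (fully faithful).
Reply with the number only."""

def judge_caption(reference: str, candidate: str, model: str = "gpt-4o") -> int:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # near-deterministic scoring improves judge reliability
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(reference=reference, candidate=candidate),
        }],
    )
    return int(response.choices[0].message.content.strip())

print(judge_caption("A dog catching a red frisbee in a park.",
                    "A dog plays with a frisbee outdoors."))
```

The trade-offs listed above show up directly in a loop like this: judge calls cost money and time (scalability, cost), and the scores must be checked against human ratings (reliability, validity).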

Dataset Curation & QA

  • Support:
    • Data collection
    • Annotation workflows
    • Dataset filtering
    • Inter-rater reliability analysis (a minimal agreement check is sketched below)
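
Inter-rater reliability is typically summarized with a chance-corrected agreement statistic. A minimal sketch using scikit-learn's Cohen's kappa, with made-up annotator labels:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary quality labels from two annotators on the same ten items.
rater_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```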

Literature Review & Methodology

  • Survey multimodal evaluation literature
  • Identify gaps in existing benchmark systems
  • Propose novel evaluation approaches grounded in current research

Documentation & Communication

  • Produce, for both technical and non-technical audiences:
    • Research write-ups
    • Benchmark documentation
    • Internal reports
    • Presentation-ready summaries

Focus Areas

Depending on project alignment, contributors may work on:

Vision-Language Evaluation

  • Image captioning
  • Visual question answering
  • Chart reasoning
  • Image-text alignment

Audio & Speech-Language Benchmarking

  • Spoken language understanding
  • Audio captioning
  • Speech-text evaluation

Video Understanding

  • Temporal reasoning
  • Video QA
  • Video-text retrieval

Cross-Modal Robustness

  • Distribution-shift testing (a simple perturbation sweep is sketched after this list)
  • Adversarial multimodal inputs
  • Robustness analysis
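
One simple form of distribution-shift testing is a perturbation sweep: degrade the inputs by a controlled amount and watch how predictions move. A sketch under the same assumptions as the VQA loop above (the image path, question, and gold answer are placeholders):

```python
from PIL import Image, ImageFilter
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

# Hypothetical single example; a real robustness study sweeps the full benchmark.
image_path, question, gold = "imgs/001.png", "What color is the bus?", "red"

for radius in (0.0, 1.0, 2.0, 4.0):  # increasing blur = increasing shift
    blurred = Image.open(image_path).filter(ImageFilter.GaussianBlur(radius))
    pred = vqa(image=blurred, question=question, top_k=1)[0]["answer"]
    print(f"blur={radius}: pred={pred!r} correct={pred.lower() == gold}")
```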

Automated Multimodal Scoring

  • Judge-model evaluation systems
  • Open-ended multimodal generation evaluation

Required Qualifications

  • Currently enrolled in an MS or PhD program in:
    • Computer Science
    • Machine Learning
    • Statistics
    • AI
    • Linguistics
    • A related quantitative field

  • Experience with:
    • Multimodal models
    • NLP systems
    • Vision-language systems
    • Audio or video ML tasks
  • Exposure to:
    • Benchmark design
    • Model evaluation
    • Experimental analysis
  • Strong Python skills
  • Experience with:
    • PyTorch
    • Hugging Face Transformers
  • Basic statistical analysis knowledge
  • Strong written and verbal communication skills

Preferred Qualifications

  • Experience with multimodal models such as:
    • LLaVA
    • GPT-4o
    • Gemini
    • Flamingo
  • Familiarity with benchmarks including:
    • MMBench
    • MMMU
    • SEED-Bench
    • VQAv2
    • AudioCaps
    • ActivityNet-QA
  • Experience with:
    • Annotation tools
    • Human evaluation workflows
    • Model-as-judge systems
  • Research publications or open-source contributions in:
    • Multimodal ML
    • NLP
    • AI evaluation

What You'll Gain

  • Mentorship from senior AI researchers and ML engineers
  • Ownership of publishable multimodal research projects
  • Exposure to enterprise AI workflows and applied research teams
  • Potential co-authorship opportunities
  • Flexible remote work arrangement
  • Competitive internship compensation

Apply now

You will be redirected to the company's website to complete your application.
