About the Role
Build reproducible, standardized test environments using Docker images to accurately replicate known issues or generate expected outputs according to specified procedures.
What You'll Do
- Review and enhance the coverage and effectiveness of existing unit tests to assess the correctness and stability of target code.
- Validate the completeness and soundness of test sets, ensuring that task workflows align precisely with SWE-Bench and Terminal-Bench conventions.
- Write high-quality task documentation (task.yaml/README), emphasizing reproducibility and standardized processes.
Who We're Looking For
Required Skills:
- Proficient in Linux command line and Shell scripting, skilled with tools like grep, sed, awk, curl, jq.
- Expert in Python programming for writing task harnesses, test cases, and automation tools.
- Proficient with Docker, including writing Dockerfiles and building reproducible environments.
- Familiar with testing frameworks like pytest, capable of writing structured unit tests and using techniques for mocking data and controlling randomness.
- Familiar with Git/GitHub workflows, able to submit high-quality, reproducible pull requests.
Professional Background:
- Background in Computer Science, Software Engineering, Artificial Intelligence, or related fields; or experience in roles such as Software Development, Test Engineering, DevOps, or Data Engineering.
- Contributors to open-source projects (especially in automated testing, CI/CD, containerization) are preferred.
Bonus Points:
- Proficiency in high-performance languages like Go or Rust.
- Familiarity with related container and sandboxing tooling such as Docker Compose or Podman.
- Ability to design datasets and tasks that are resistant to "task cheating" (shortcut solutions that pass tests without solving the task).
- Understanding of scientific benchmark design principles (fairness, repeatability, scalability).
- Experience with automated testing systems or CI/CD, and a cross-disciplinary perspective.
Compensation
USD $80–120 per day, depending on actual skills and experience.