Member of Technical Staff - Research
Vals AI
About the Role
We are looking for exceptional researchers and research engineers to design and build the next generation of AI benchmarks. You will create high-impact, challenging evaluations that push the boundaries of what we can measure in foundation models. This role is perfect for someone with deep research expertise who wants to see their work directly influence how the world evaluates AI systems.
You will lead the design and development of novel benchmarks that assess real-world capabilities of LLMs. Our benchmarks shape how foundation models are developed and how generative AI applications are built. We work with all the major foundation model labs, as well as some of the largest financial institutions and hospital systems in the world. Our work has been featured by the Wall Street Journal, Washington Post, and Bloomberg.
We are building the standard for evaluating the ability of LLMs to perform real-world tasks. You will be at the forefront of defining what that standard looks like.
What You'll Do
Design and develop novel, high-impact benchmarks that assess challenging real-world capabilities
Conduct research to ensure our benchmarks are valid, reliable, and meaningful
Collaborate with foundation model labs and enterprises to understand evaluation needs
Analyze model performance across benchmarks and communicate findings
Publish research findings and contribute to the broader evaluation research community
Work closely with the infrastructure team to implement your benchmark designs at scale
Stay current with the latest developments in LLM capabilities and evaluation methodologies
Requirements
Advanced research experience: Master's degree or PhD in Computer Science, NLP, Machine Learning, or related field. Undergrads with very strong research backgrounds may also be considered.
Publication track record: Published papers in reputable venues (NeurIPS, ICML, ACL, EMNLP, etc.) with focus on NLP, ML evaluation, or benchmarking
Research methodology: Strong understanding of experimental design, statistical analysis, and evaluation frameworks
Technical skills: Proficiency in Python for research and experimentation
Communication: Ability to clearly communicate complex research ideas to both technical and non-technical audiences
Collaboration: Experience working in research teams and integrating feedback
Portfolio: Demonstrated track record of impactful research work
Location: We are an in-person team based in San Francisco. We will support your relocation or transportation as needed.
Nice to Haves
Experience specifically in LLM evaluation or benchmarking research
Familiarity with foundation model architectures and capabilities
Experience working with industry partners or in applied research settings
Background in areas like human-computer interaction, psychology, or domain-specific evaluation
Experience at early-stage startups or research labs
Contributions to open-source evaluation tools or datasets
What We Offer
Highly competitive salary and meaningful ownership. Excellence is well rewarded.
Relocation and transportation support
Health/dental insurance coverage
Lunch and dinner provided, free snacks/coffee/drinks
Unlimited PTO
Opportunity to publish and present your work
About Us
Founding team: The core methodology behind this platform comes from NLP evaluation research we did at Stanford. We raised a $5M seed from some of the top institutional and angel investors in the valley. Our team has prior work experience at NVIDIA, Meta, Microsoft, Palantir, and HRT. Collectively, we have over 300 citations across our published work.
Tech stack: Our frontend is built in React with TypeScript (TSX). We use Django as our backend framework. All of our infrastructure runs on AWS.
What We're Looking For
Intelligence over credentials: Raw talent and research ability are more important than where you got your degree. Academic pedigree is valuable only insofar as it is a proxy for research excellence.
Ownership: We don't have the scale or time to actively "manage" every project. Working in a small, talent-dense team, we expect everyone to show initiative to build where it's needed, not where it's asked. We strive for autonomy over consensus.
Intensity: The LLM landscape is constantly changing. Foundation model labs are continuously pushing the frontier, creating new capabilities that demand new evaluation approaches. The companies that will emerge as leaders are being built now. Those that win will have an incredibly high speed of execution.
Solution-oriented mindset: We're looking for researchers who see each evaluation challenge as an opportunity to design an innovative solution, not as an insurmountable problem.
Referral Bonus
Know someone who would be a good fit? Connect them with rayan@vals.ai. If we hire them and they stay on for 90 days, you'll get a $10,000 referral bonus and Vals AI merch!