Job Description

About the Company

We are a seed-stage AI company building the industry standard for evaluating and benchmarking large language models on real enterprise tasks.

About the Role

As a Research Scientist, you will develop new benchmarks, methodologies, and evaluation pipelines that shape how cutting-edge models are assessed, compared, and deployed in production environments. Your work will directly influence model selection and safety decisions across foundation model labs, high-growth AI product companies, and Fortune-scale enterprises.

Responsibilities

Benchmarking & Model Analysis

Evaluate newly released models as they launch (e.g., Gemini, DeepSeek)
Run large-scale assessment workflows using internal evaluation infrastructure
Compare model performance across enterprise-grade task categories

Design New Benchmarks from Scratch

Identify high-value model application domains through research exploration
Construct datasets, including labeling strategy and workforce coordination
Write short “white-paper-style” summaries explaining benchmark purpose, method, and findings

Advance Automated Evaluation Methodologies

Improve systems for scoring generated text beyond standard metrics
Explore research in reference-free, rubric-based, and human-aligned evaluation
Develop new techniques for reliability, consistency, and repeatability

Cross-functional Collaboration

Work closely with engineering to scale evaluation infrastructure
Partner with customers to refine evaluation relevance and task fit
Influence product direction through research insights

Qualifications

0–3 years post-grad experience (Master’s, PhD, or equivalent applied research)
Publications, preprints, or demos showing cutting-edge work
Experience with diffusion models, NLP, multimodal, or benchmarking work
Ability to operate independently amid ambiguity
Track record of writing clean, maintainable research code
Candidates typically come from top engineering/research universities, applied AI startups, or well-regarded research internships

Required Skills

Built or contributed to a benchmark or evaluation methodology
Experience in enterprise task model evaluation
Stanford or top-lab adjacency (based on the company's historical hiring success)

Less Likely to Be a Fit

Publishing as the primary motivation
Purely academic background with slow timelines
Big-tech culture fit concerns (Meta, Google, and Salesforce specifically noted, but assessed case by case)

Pay range and compensation package

Base Salary: up to $250K depending on background
Equity: typically 0.3% – 0.5%, flexibility for exceptional candidates
Rapid equity refresh possible based on impact

Benefits & Perks

Visa sponsorship available
Relocation support
Health and dental coverage
Lunch and dinner provided, plus snacks and coffee
Unlimited PTO
Weekly happy hours with community guests
Team events (bowling, hiking, rock climbing, etc.)
Swag program (hats, etc.)

Work Environment & Culture

In-person at San Francisco HQ (required)
Core hours are 9–5; some teammates extend voluntarily
Most team members work one weekend day per week (flexible)
High-ownership, low-ego, collaborative
Live demos on Mondays, team lunch on Thursdays, community Fridays
Early-stage pace with an applied focus, not academic publishing

Tech Environment

The role is research-focused, but exposure to the following is beneficial:
Backend: Python / Django
Frontend: React + TypeScript
Infra: AWS
Evaluation frameworks and internal tooling

Why This Role Is Unique

The company already collaborates with foundation model labs, high-growth AI vertical product companies, and Fortune 500 enterprises (not public-facing)
$5M seed raised, with a runway of 2+ years at current burn
Only one research scientist is being hired, so the role carries true founding impact
Opportunity to define industry standards for model trust, reliability, and certification
Positioned to become the rating agency for generative AI

Job Tags

Internship, Visa sponsorship, Relocation package, Flexible hours, Weekend work, 1 day per week

Research Scientist Job at kadence, Santa Rosa, CA
