AI evaluation

2024 - Present

Handshake AI Fellowship

AI Model Training

Frontier model evaluation work through prompt review, output ranking, hallucination checks, and structured feedback loops.

Frontier: AI labs

Beet 2.0: project track

RLHF: training signal

2024+: active work

Context

What it is.

The Handshake AI Fellowship connects domain experts with frontier AI labs to generate high-quality training data for large language models. Project Beet 2.0 is a specialized engagement within that program, one that calls for careful evaluation and domain expertise.

The work falls into a category broadly called RLHF: reinforcement learning from human feedback. Human evaluators assess model responses, rank alternatives, label issues, and write examples that shape future behavior.

Working on the training side of AI systems gives me a practical view of what models get right, where they fail, and why evaluation quality matters as much as prompt quality in production workflows.

Work involved

Evaluation, ranking, and structured feedback.

Response quality evaluation

Assess AI-generated responses across accuracy, completeness, tone, formatting, and instruction-following. The work requires careful reading and domain judgment, not just preference selection.
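As an illustration only, a rubric-style evaluation along these dimensions could be recorded as below. This is a minimal sketch, not the fellowship's actual tooling: the `RubricScore` class, the 1-5 scale, and the unweighted average are all assumptions made for the example.

```python
from dataclasses import dataclass

# Dimension names follow the criteria listed above; the identifiers are hypothetical.
DIMENSIONS = ("accuracy", "completeness", "tone", "formatting", "instruction_following")


@dataclass
class RubricScore:
    scores: dict      # dimension name -> integer rating on an assumed 1-5 scale
    rationale: str    # written justification, so the rating is more than a preference click

    def validate(self) -> None:
        """Reject incomplete or out-of-range evaluations."""
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")
        for name in DIMENSIONS:
            if not 1 <= self.scores[name] <= 5:
                raise ValueError(f"{name} must be between 1 and 5")
        if not self.rationale.strip():
            raise ValueError("a written rationale is required")

    def overall(self) -> float:
        """Unweighted mean across dimensions (an assumed aggregation)."""
        return sum(self.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

Requiring a rationale alongside the numbers reflects the point that the work depends on reading and judgment rather than a bare preference.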

Comparative ranking

Compare multiple model responses to the same prompt and identify which answer is more helpful, accurate, and appropriate in context.
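Preference data is commonly stored as pairwise comparisons. The sketch below shows one way a best-first ranking could be expanded into chosen/rejected pairs; the function name and the record fields are hypothetical and not taken from any specific lab's format.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-first ranking into pairwise preference records.

    `ranked_responses` is assumed to be ordered from most to least preferred,
    so every earlier item is "chosen" over every later one.
    """
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


pairs = ranking_to_pairs("Explain RLHF briefly.", ["answer A", "answer B", "answer C"])
# Three responses produce three pairs: (A, B), (A, C), (B, C).
```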

Data annotation and labeling

Label text, structured outputs, and multi-turn conversations with intent classifications, safety flags, hallucination markers, and response categories.
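A per-turn annotation record might look roughly like the following. This is a sketch under stated assumptions: the field names, the label values, and the JSON Lines output are illustrative choices, not the schema used on the project.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HallucinationSpan:
    start: int   # character offset where the unsupported claim begins
    end: int     # character offset where it ends (exclusive)
    note: str    # why the annotator considers the claim unsupported


@dataclass
class TurnAnnotation:
    turn_index: int          # position of the turn in a multi-turn conversation
    intent: str              # hypothetical label, e.g. "how_to" or "factual_lookup"
    response_category: str   # hypothetical label, e.g. "direct_answer" or "refusal"
    safety_flags: list = field(default_factory=list)
    hallucinations: list = field(default_factory=list)


def to_jsonl_line(annotation: TurnAnnotation) -> str:
    """Serialize one annotation as a single JSON line."""
    return json.dumps(asdict(annotation), sort_keys=True)
```

Keeping hallucination markers as character spans with a note, rather than a single yes/no flag, lets a reviewer see exactly which claim was questioned.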

Prompt and response writing

Write prompts and ideal responses that model the desired behavior, including edge cases, caveats, and domain-specific output structure.

Relevance

Why it matters for automation work.

Working on both sides of AI, deploying Claude in production systems and evaluating models during training, gives me a more complete view of how to design AI-assisted workflows responsibly.

It informs better prompt engineering, clearer fallback paths, stronger output validation, and more realistic expectations for where human review still belongs.
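To make output validation and a fallback path concrete, here is a minimal sketch of a check that routes malformed model output to human review. The required keys, the status strings, and the function names are assumptions for illustration, not part of any production system described here.

```python
import json

# Hypothetical contract for a model's structured output.
REQUIRED_KEYS = {"summary", "confidence"}


def validate_output(raw: str):
    """Return (parsed, None) on success or (None, reason) on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return None, f"missing keys: {sorted(missing)}"
    return data, None


def route(raw: str) -> dict:
    """Accept valid output; otherwise fall back to a human review queue."""
    data, reason = validate_output(raw)
    if data is None:
        return {"status": "needs_human_review", "reason": reason}
    return {"status": "accepted", "data": data}
```

The fallback is explicit: anything that fails validation is handed to a person with a stated reason rather than being silently retried or passed downstream.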

Skills

Tools and concepts demonstrated.

Prompt evaluation, Output ranking, Data annotation, LLM quality assessment, RLHF, Instruction-following analysis, Hallucination detection