The Snake Eyes Project Tasking Handbook
Prompts and RLHF
Daily Webinars/Office Hours: 8a PT/11a ET AND 11a PT/2p ET
Link Posted Daily in Outlier Community Threads
⚠️Knowledge Cutoff Date: Jan 31, 2025
Update Log
As projects move along, we will add updates here. Make sure you keep coming back to these instructions, which are the ground truth for tasking well on this project.
Date: Announcement/Updates
Mar 31, 2025: Introduced “Part 0: Plan for Your Conversation” section to emphasize having user intent in mind when initiating a conversation
Mar 31, 2025: Additional instructions on taking a “real user perspective” to plan for multi-turn conversations and create effective prompts
Mar 31, 2025: Guidance on regenerating responses if neither one is justifiably better than the other
Mar 28, 2025: Model cutoff date is now JANUARY 31, 2025
Project Overview
The purpose of this project is to help a customer hone its model to take on increasingly challenging everyday prompts. Tasks are a mix of single- and multi-turn, with two main components: creating prompts and rating model responses. AI models will learn from your ratings. 🤖
Turn 1:
Write an initial prompt in your assigned category and topic, verifying that you can envision
a realistic user, goal, and ways to progress the conversation for this prompt
Rate both model responses using the RLHF dimensions
Select the best response out of the two, or regenerate the responses if neither one is truly
better than the other
Turn 2 through Final Turn:
Create a follow-up prompt that continues the conversation naturally
You are encouraged to use diverse prompt categories across turns
Rate both model responses for that turn
Select the best response out of the two, or regenerate the responses if neither one is truly
better than the other
Special Notes:
Each complete exchange (your prompt + model responses) counts as one "turn"
Tasks have specific conversation lengths (between 3-15 turns). Complete each task using exactly
the number of turns specified, staying on the same topic throughout the entire conversation.
Keep the conversation flowing naturally on the same topic, as if talking to a real assistant
Complete all rating dimensions for both model responses in each turn
🎯 Please read this thoroughly before beginning tasking and/or reviewing on this project.
🧐Part 0: Plan for your Conversation
💡 Key Mindset Shift: You're starting a multi-turn conversation with real user needs in mind, not
just asking a single question or trying to trip up the model.
Who is your user and why are they using the model?
A rough mental outline of who could be entering this prompt, and why, allows us to sense check that our
first prompt is realistic, set up natural constraints that could be helpful to this user, and chart
a meaningful progression from one turn to the next.
If your first prompt is truly realistic and can sustain a conversation, you should be able to answer all of
the following questions:
1. What kind of person might use this prompt?
2. Why would this person use this prompt?
3. How might this person continue the conversation?
Quick Self-Assessment
As a basic guideline, if you cannot envision answers to any of these questions for your first prompt, it is
not realistic or sustainable enough for this project. Rethink your initial prompt before proceeding.
✍️ Part 1: Write a Great Prompt
✍️ Task Workflow
Step 1: Acknowledge the Suggested Topic and Prompt Category (1st turn)
Step 2: Write an initial Prompt with your “Who” and “Why” in mind
⚠️ You MUST adhere to the assigned category for the first turn.
For subsequent turns, you are strongly encouraged to choose different categories for the follow-up prompts you create, to ensure your prompts are diverse.
Prompt Category
For examples of good prompts for each category, look at this table in the Appendix!
Note: Be careful that your Brainstorming and Chitchat prompts are not actually Open QA! As a
rule, if the prompt asks for structured and non-creative advice on a decision (or possibilities for
that decision), it is Open QA.
For Classification prompts, it is crucial that you include all classification groups in your prompt
and define them if they do not have universally-accepted definitions.
Suggested Topic
It’s critical that we deliver diverse data to the customer. You can occasionally skip or ignore the suggested topic if it’s something you’re not an expert in, but consistently using the same topic in all of your attempts will be flagged when your quality is reviewed.
Key Do & Don’ts for Great Prompts
The best prompts read like authentic human communication - straightforward, purposeful, and focused on
the task at hand.
📎 How to Use Reference Texts
Reference texts are required for Closed QA, Extraction, Rewriting, and Summarization tasks. To attach a reference text to your prompt, click the “+” button (shown below) and paste the text into the New Reference Text box. If the reference text was accessed via an online source, also include the corresponding link in Reference URL, then click Add Reference Text to save.
Note: Reference texts do not need to be included in the prompt for well-known texts or famous speeches
(e.g. a presidential address).
It is also okay to attach a reference text more recent than Jan 2025, as long as the model is not expected to independently recall information past its knowledge cutoff date.
Common Errors to avoid when writing prompts
1) Do NOT ask unrealistic, contrived questions that a typical person would never ask.
❌ Bad examples:
Extract the number of words starting with "t" and add the number of words starting with "Q", then divide by 5
List every single important person in history with the first name “Matthew”
What real user would ever be interested in these questions? What real-life tasks or learning opportunities could these possibly contribute to?
2) Do NOT add unnecessary or unhelpful constraints!
The constraints we use to add complexity and direction to prompts should not be random or designed specifically to trigger model failure - instead, they should align with our image of the prompt’s user and their goal.
Suppose we are writing a prompt from the perspective of an American college student who wants to find books on the Vietnam War from a mix of perspectives. Let’s look at two different ways to use constraints:
❌ What are some good books on the Vietnam War? They should all include some reference to an American military operation name, and should be organized in reverse alphabetical order of title.
The constraints (include references to an American military operation name and reverse alphabetical order) serve no purpose other than to add complexity to the prompt and do not help our user learn more about the Vietnam War.
✅ What are some good books for a mix of perspectives on the Vietnam War? I’m looking for one from a Vietnamese writer from the northern side, one from the southern, and an American. I’d prefer only ones with an English translation.
The constraints (one book from a writer from each side of the conflict, English translations only) all align with our user and their goal. The diversity of authors furthers knowledge of the conflict through multilateral perspectives, while the English-translation requirement makes sense for an American student.
3) Do NOT stack questions - focus prompts on one question/ask.
❌ Bad example: Tell me who the first president is, and then tell me their last name, then give me four more examples of leaders with the same name, but different birth years
Even if these questions are related, no one would ever overload a single prompt in this fashion!
4) Do NOT ask simple “trivia” questions.
❌ Bad examples:
What year was “The Godfather” released?
Who was the CEO of Apple when the iPhone was invented?
Your prompt should not be resolvable through a couple of words or a simple Internet search - remember, the model should have to reason.
Prompt Quality Checklist
✅ Realistic: a genuine question someone would ask an AI assistant
✅ Strategic: opens paths for follow-up and further exploration
✅ Challenging: makes the model reason instead of directly retrieving facts
✅ Natural: avoids artificial constraints that feel contrived
✅ Sensibly constrained: any constraints make sense for the prompt’s target user and goal
✅ On assignment: follows the assigned prompt category & suggested topic in the first turn, and diversifies in the following turns
🔢Part 2: RLHF Grading
✍️ Task Workflow
Step 1: Rate Model Responses on 6 dimensions
Step 2: Select the Better Response
Six Dimensions to Evaluate the 2 Model Responses
You will be asked to compare 2 model responses side by side. First, grade each response on the following dimensions:
Harmfulness: Evaluates whether the response contains any harmful, offensive, or inappropriate content that could negatively affect the user.
Instruction Following: Evaluates how well the response addresses all explicit and reasonably implied elements of the prompt. It assesses the model's ability to fulfill both direct requests and constraints intended to guide those requests.
Truthfulness: Checks whether all claims are accurate and supported by reputable sources, such as trusted news outlets or scientific publications.
Writing Style and Tone: Assesses the clarity, organization, and readability of the response. The tone should be natural and conversational, encouraging next steps, without being preachy or overly formal.
Content Completeness: Evaluates whether the response provides enough relevant information to fully address the prompt, without omitting key details or important content.
Content Conciseness & Relevance: Measures whether the response is concise, containing only necessary content. Each sentence should add value, with additional suggestions or conversational elements being relevant and non-repetitive.
Each dimension is evaluated from 1-3:
Major issues → 1: The model fails to meet the dimension or has significant flaws.
Minor issues → 2: The model partially meets the dimension with some small gaps or imperfections.
No issues → 3: The model fully meets the dimension.
You will be asked to provide justification for issues in Instruction Following, Harmfulness and
Truthfulness:
After evaluating the preferred response using the RLHF dimensions, you’ll provide an Overall Score to
the preferred Model Response from 1-5.
EXCELLENT → 5: Response doesn’t have ANY flaw and cannot be meaningfully improved. There are NO major or minor issues in any dimension of the rubric. In other words, the response addresses the main user intent and instructions exceptionally well, in a way that is extremely clear, fluent, natural in its use of language and organization, and does not have any repetitive or unnecessary information.
GOOD → 4: The response is good overall, with NO major issues and just a few minor issues. The response successfully fulfills the user’s intent.
ADEQUATE → 3: Response addresses the main user intent and instructions with NO major issues, but has several minor issues (e.g. includes unnecessary details, misses certain elements in following the instructions, etc.).
BAD → 2: The response has a major issue (whether in one of the above dimensions, or along some other dimension you observed) and/or does not really satisfy the user’s intent, with the exception of avoiding safety issues.
POOR → 1: Response has multiple major issues and is really unhelpful and frustrating.
Select Your Preferred Response
After evaluating each dimension, select your overall preference between the two model responses using
the 6-point scale shown below.
⚠️Important:
You must choose one response over the other - no ties permitted
Neither response needs a failure in order for this to be a good turn, but one response does need to
be justifiably better!
Your overall preference must be consistent with your dimensional ratings
Example of inconsistency: Rating all dimensions favorably toward Response 1 but giving
an overall preference that favors Response 2
Rating inconsistencies will reduce the quality of your evaluation and the model's learning.
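To make the consistency rule concrete, it can be sketched as a small check in Python. This is purely a hypothetical illustration for reasoning about your own ratings: the tasking interface does not run any such function, and the dimension names, function name, and the 'A'/'B' labels here are invented for the example.

```python
# Hypothetical sketch: each dimension is scored 1-3 (higher is better).
# If every dimension favors (or ties for) one response, the overall
# preference must not favor the other response.

DIMENSIONS = [
    "harmfulness", "instruction_following", "truthfulness",
    "style_tone", "completeness", "conciseness",
]

def preference_is_consistent(scores_a, scores_b, preferred):
    """scores_a/scores_b map each dimension to 1..3; preferred is 'A' or 'B'."""
    a_never_worse = all(scores_a[d] >= scores_b[d] for d in DIMENSIONS)
    b_never_worse = all(scores_b[d] >= scores_a[d] for d in DIMENSIONS)
    if a_never_worse and not b_never_worse and preferred == "B":
        return False  # every dimension favors A, but B was preferred
    if b_never_worse and not a_never_worse and preferred == "A":
        return False  # every dimension favors B, but A was preferred
    return True  # mixed ratings can justify either preference
```

In the "example of inconsistency" above, all dimensions favor Response 1 but Response 2 is preferred, so a check like this would flag it; when the dimensions are split between the two responses, either preference can be justified.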
Justify Your Preference
After ranking, write down specific reasons why you preferred one response over the other.
⚠️ If you are not confident that either response is even slightly better, or you feel like you are “forcing” this justification, you should REGENERATE the responses until you can justify that one response is better.
Examples of “forced” justifications:
Both responses present the same information without issues, but one is marginally more
concise than the other
Both responses have a major factual error, and you are “splitting hairs” trying to explain why
one is more impactful than the other
Now, Take More Turns!
✍️ Task Workflow (for each turn after the first)
Step 1: Create a follow-up prompt
Step 2: Rate both responses on 6 dimensions
Step 3: Select the Better Response
Each conversation has a minimum of 3 and a maximum of 15 turns; the number of turns is specified for each task.
Write Follow Up Prompts
Once you have confirmed there are no catastrophic errors, continue the chat by writing follow-up prompts. Keep the conversation flowing naturally on the same topic, as if talking to a real assistant.
How to Create High-quality Follow-up Prompts
🏖️ DOs: Keep It Natural and On Point
Dig deeper into interesting points
Ask about real-world applications
Respectfully challenge ideas and assumptions when relevant
Connect related topics within the same domain
🙅 DON’Ts: Avoid Derails, Detours, or Duplicates
Change topics completely
Reply with minimal responses that don't advance the conversation (e.g., "Thanks!" or "Sounds good")
Repeatedly ask the model to rewrite or rephrase previous responses, since the model only sees the winning response. Limit rewrite requests to 1 in shorter conversations (<7 turns) and 2 in longer conversations
Send identical or very similar prompts over and over again
GOOD EXAMPLES
"You mentioned meditation helps with stress - what specific techniques work best for beginners?"
"That's interesting about renewable energy - how would it work for someone living in an apartment?"
"I see your point about team leadership, but what happens when team members strongly disagree?"
BAD EXAMPLES
"Thanks for explaining gardening. Can you tell me about black holes?"
"Cool, thanks!"
"I don't like how you explained that recipe. Do it again."
Good Examples: A Natural Progression of Prompts
In the following multi-turn prompt progressions, a hypothetical user goal is pursued through a series of
questions that seek follow-up details, clarify responses to previous prompts, explore new facets of the
main topics, and utilize diverse prompt categories.
Example 1: the user’s goal is specific and targeted, with every prompt building on previous information
to help inform the destination and build an itinerary and budget for the weekend.
Example 2: Here, the user begins with a more general-interest goal of understanding baseball history
and expands to questions specifically about David Ortiz based on previous prompts’ answers - though
more flexible in terms of topics, this is still a natural progression!
Initial Prompt Category: Rewriting
User Perspective (Knowledge Seeking): A user who is not too knowledgeable about sports
wants to learn more about the factors underlying the 2004 Red Sox comeback, beginning with
getting a better understanding of an article on the topic.
Prompt 1: Rewrite the section on the key players with notes explaining any baseball jargon (e.g.
RBI, walk-offs).
Prompt 2: Give me a summary, using that same article, of how the Red Sox played differently in
2004 from previous seasons.
Prompt 3: Across all the listed metrics, which Red Sox player was most impactful in terms of
scoring runs in the 2004 season? I just want one player, but break down into runs batted in versus
home runs.
Prompt 4: I remember hearing about David Ortiz a lot as a kid… what was special about his
playing style?
Prompt 5: What were some of the other factors that made David Ortiz so culturally significant?
Prompt 6: What were the key points David Ortiz made in his famous 2013 speech after the
Boston Marathon?
Catastrophic Errors Check
✅ Before you proceed to the next turn, you need to confirm there are no catastrophic errors.
🛑 You may terminate the task if you spot any catastrophic errors.
⚠️ Make sure that you complete all rating dimensions for both model responses & select the preferred response in each turn.
Appendix: RLHF Grading Rubric
See more details of each dimension. You will also see the rubrics again in the Task Instructions!
(1) Instruction Following
No Issues
Response completely follows all instructions from the prompt, fully respecting the user’s
prompt.
Minor Issue(s)
Response follows most of the instructions from the prompt, satisfying the user’s primary intent,
but misses certain elements.
Major Issue(s)
The response ignores, circumvents, or violates key components of the prompt, rendering the
response unhelpful to the user. The response punts unnecessarily on a non-safety related issue.
(2) Truthfulness
TAKE THE TIME YOU NEED to fact-check responses that contain factual claims. Spending 10ish minutes
across responses is not unreasonable.
No issues
All claims are accurate based on reputable web evidence.
Minor issue(s)
Primary claims (necessary for addressing the prompt) are accurate, but at least one secondary
claim is inaccurate, unsupported, or disputed according to reputable web evidence.
Major issue(s)
At least one primary claim is inaccurate, unsupported, or disputed according to reputable web
evidence.
(3) Harmfulness
No Issues
Response has no unsafe or toxic language.
Minor Issue(s)
Response contains minor/questionable aspects related to unsafe or toxic language, but they are
not highly concerning.
Major Issue(s)
Response contains significant safety or toxic language issue(s).
(4) Content Conciseness & Relevance
No Issues
Response contains only necessary content. Each sentence is relevant to the prompt and rich in
value. Any additional summaries, suggestions, considerations, and conversational questions are
clearly helpful and relevant and not repetitive.
Minor Issue(s)
Response is generally relevant to the prompt but contains a small portion of unnecessary
content that is repetitive, unhelpful, or irrelevant.
Major Issue(s)
Response contains a significant amount of unnecessary content that is repetitive, unhelpful, or
irrelevant.
(5) Content Completeness
No Issues
The response gives enough information and sufficient detail to helpfully fulfill the prompt;
there is no important and relevant content missing.
Minor Issue(s)
There is some relevant information missing from the response, reducing its helpfulness. For example, the response might be technically correct but far too terse, leaving the user dissatisfied.
Major Issue(s)
Relevant content is missing to such an extent that the response does not at all fulfill the user’s
intent.
(6) Writing Style & Tone
No Issues
Response is written and organized such that it’s easy to understand and take next steps.
Response is communicated in a natural-sounding, conversational tone that makes it engaging.
Response does not preach at or lecture the user.
Minor Issue(s)
Response has minor issues of writing quality, such as being stilted or unnatural. Phrasing could be more concise or appropriate for the conversational context. Response may contain some stylistic issues that reduce how engaging it is, or be overly formatted in a distracting way (e.g. unnecessarily nested bullet points or over-bolding).
Major Issue(s)
Response is stylistically unnatural, unengaging, or formatted poorly enough that it is difficult to
read and understand. Or, the response preaches to or lectures the user.
Overall Quality
Cannot be improved
Response doesn’t have ANY flaw and cannot be meaningfully improved. There are NO major
or minor issues in any dimensions of the rubric. In other words, the response addresses the
main user intent and instructions exceptionally well, in a way that is extremely clear, fluent,
natural in its use of language and organization, and does not have any repetitive or unnecessary
information.
Minor room for improvement
The response is good overall, with NO major issues and just a few minor issues. Response
successfully fulfills the user’s intent.
Okay
Response addresses the main user intent and instructions with NO major issues, but has several
minor issues (e.g. includes unnecessary details, misses certain elements in following the
instructions, etc).
Pretty bad
The response has a major issue (whether in one of the above dimensions, or along some other
dimension you observed) and/or does not really satisfy the user’s intent, with the exception of
avoiding safety issues.
Horrible
Response has multiple major issues and is really unhelpful and frustrating.
Appendix: Good Prompt Examples
Note: reference texts have a MAX of 2000 words.

Open QA (reference text: NOT ALLOWED)
Advice and guidance: the model is asked to reason through a problem; a good response will provide a framework or set of considerations that could solve the problem.
Good examples:
"What are some good alliterative titles for this film essay I wrote? I’d like it to include a good caption."
"What's a good creatine brand, and how should I take it to optimize strength while avoiding any risks? (budget is not an issue)"

Closed QA (reference text: REQUIRED)
The model is asked a question that can only be answered by THINKING THROUGH information contained entirely within a reference text.
⚠️ ALL information necessary to respond to the prompt must be contained within the pasted-in reference text.
Good example: "Based on this article, how can I get access to weight-loss medications like Ozempic? I live in California if that matters." [Text from a website that includes information about lots of weight-loss options, including Ozempic]

Extraction (reference text: REQUIRED)
The model is asked to retrieve information from within a reference text.
⚠️ ALL information to be extracted must be contained within the pasted-in reference text.
🚫 Prompts that simply require the model to find a list of names/events/dates are not allowed.
Good example: "According to this document, which women were involved in the events of the revolution? I'd like to know a bit about their involvement and, if applicable, their legacy." [Reference text about the French wars; mentions 4 women related to the Revolution, 1 of whom had a long-lasting legacy]

Rewriting (reference text: REQUIRED)
The model is asked to re-write, adjust, annotate, summarize, stylize, re-organize, or otherwise modify an existing reference text.
⚠️ CORE information necessary for the re-write must be contained within the pasted-in reference text.
Good example: "Modify the product spec below into a sales pitch, targeting a 20-30 year old, male, American demographic. ref text: [Product Description]"

Classification (reference text: OPTIONAL)
Categorizes data or content into defined groups or labels. You MUST provide definitions for categories included in the prompt.
Good example: "Of the countries that participated in World War II, which aligned with the Allied Powers, Axis Powers, or remained neutral? If any switched sides, include them in the group they aligned with the longest."

Chitchat (reference text: OPTIONAL)
Casual, open-ended conversations on general topics, often light and informal.
Good example: "I've been thinking about getting a pet but I never know if I’m actually ready for one. I’m in New York."

Brainstorming (reference text: OPTIONAL)
Offers creative ideas or solutions for a given problem or topic.
Good example: "I need to create a unique team-building activity for our remote work group of 15 people."

Roleplay (reference text: OPTIONAL)
Simulates interactions with the model adopting a specific persona or expertise area.
Good example: "You're a historical tour guide at the Roman Colosseum. I'm a tourist who knows very little about ancient Rome. Give me a 5-minute introduction to what I'm seeing."

Summarization (reference text: REQUIRED)
Extracts and presents main points and essential information from longer texts.
⚠️ ALL information necessary for the summary must be contained within the pasted-in reference text.
Good example: "I need a concise summary of this article on climate change adaptation strategies in coastal cities. Focus on the key findings and recommendations for urban planners. [Add article text]"

Other (reference text: OPTIONAL)
Anything you want.