Valkyrie: Project Specifications - V2

The Valkyrie project focuses on training AI models to address real-world problems through the creation of complex prompts and detailed rubrics for evaluation. Participants are required to craft professional prompts that challenge AI systems, ensuring they reflect realistic scenarios and require deep analysis. The project emphasizes collaboration, adherence to guidelines, and the importance of first principles thinking in both prompt creation and rubric development.
Project Specifications

Welcome to Valkyrie! This project is dedicated to training AI models to solve real-world problems across expert fields, disciplines, and domains. At the heart of this effort are two pillars: prompts and rubrics. You’ll design prompts that mirror the depth and nuance of human problems in a professional setting. You’ll build rubrics that define what excellence looks like, and your ratings will guide AI systems to learn, adapt, and improve. You’ll assess each advanced model against your rubrics to identify the best response. Every prompt you craft and every judgment you make brings us closer to building AI that can more deeply understand the world and support us in our everyday work.

Important Information
Project Update Note - 5/13

Collaborative Workspace: Valkyrie War Room (24h / 365d)


🚨Daily Stand Up: Webinar link in Outlier
⚠️ Model Knowledge Cutoff Date: January 31, 2023

Note that unauthorized use of AI tools to complete tasks will result in a permanent ban from Outlier.

This is an English Language Project - tasks created in other languages will be rejected.
Change Log + Updates
May 21, 2025: REMINDER: Models are not capable of web search; please do not require the model to retrieve specific data. Example: “Provide 3 quotes from War & Peace related to life in Russia during Napoleon’s invasion”

May 21, 2025: NEW TOOL: Valkyrie Valhalla Rubrics Generation. This tool will help you out with your rubric creation. While it does not replace your own process of doing that, it will help you to improve it.

May 19, 2025: REMINDER: Rubric Criteria must be complete sentences! Please add examples in your rubric criteria.

Index
Tasking Overview
Craft a Prompt
🔧 Guidelines for Creating High-Complexity Prompts
🎯 Objective
🪜 How to Build One Step-by-Step
Definition of a professional prompt:
Prompt Domains
First Principles Thinking
Image Specifications
How to add an image
How to upload your image
Image Guidelines
Image Tips
Define + Weight Rubric Criteria
Weight
Mental Model:
Requirements
1. Outline Core Criteria
Example Prompt:
2. Add Additional Criteria
TOOL: Valkyrie Valhalla Rubrics Generation
Evaluate each Response
⚠️ CONFIRM YOUR EVALUATION ⚠️
Finalize Ranking
Overview
Writing Good Rubrics
General Principle
Atomic / Non-stacked
Specific
Self-contained
Classifying Rubrics
Objective vs Subjective
Explicit vs Implicit
Rubric Dimensional Ratings
Common Pitfalls and How to Avoid Them
🚫 Pitfall: Overly Complex or Ambiguous Criteria
🚫 Pitfall: Subjectivity Confusion
🚫 Pitfall: Too few or too many Criteria
🚫 Pitfall: Double Counting Criteria
✅ Good Examples
Prompt: Scientific Research
Rubric
Prompt: Medicine
Rubric
Prompt: Administration & Logistics
Rubric
🚫 Bad Examples
Linters
What is a Linter?
Example linters:
How to dismiss a linter
Weighting Criteria
Scoring

Task Workflow

Tasking Overview
1. Read the tasking instructions carefully.
2. Create a Prompt that fits your Domain of expertise: a task that you or a colleague may come across in your day-to-day work, but that might require research and would take quite some time to complete.
3. Establish the Core Criteria needed to respond to your Prompt. These criteria should represent a bare-minimum, but not necessarily stellar, Response.
4. Create a Rubric that incorporates your Core Requirements with any Additional Criteria that take a Response from minimally acceptable to as close to perfect as possible.
5. Before you rate each criterion, Rank each Response based on an overall sense of how it handled your Criteria.
   a. If any model is able to produce a minimally acceptable Response, it’s likely your task is too simple. Consider challenging the models further or expecting more out of them.
6. Evaluate each Response against the Rubric Criteria (Present or Not Present). If you find yourself wanting to give partial credit, consider refactoring or splitting up your criteria.
   a. You must not submit any task in which ANY model scores equal to or greater than 70% on your evaluation.
7. Check whether the final ranking produced by your Evaluation matches your initial pass. If they differ, double-check your criteria and make sure everything is evaluated correctly.
8. Submit your work!

In this section, you'll learn how to create high-quality tasks and rubrics from scratch. Each task consists of two essential components: a prompt and a rubric used to evaluate responses.

Craft a Prompt

🔧 Guidelines for Creating High-Complexity Prompts

🎯 Objective

Create prompts that:

● Require multi-step thinking
● Involve nuanced explanation, evaluation, or synthesis
● Would take a skilled human practitioner (in the relevant field) 1–3 hours to complete
● Push the boundaries of basic Q&A by demanding creative, technical, or strategic depth
● Incorporate first principles thinking (at least 1 out of every 5 tasks you create should incorporate this principle)

🪜 How to Build One Step-by-Step

1. Start with a Real-World Task
   ○ Think of a task a professional might have to do at work that takes time and thinking.
   ○ Example: Auditing a company's hiring process, drafting a policy proposal, analyzing a legal case, etc.
2. Add Constraints or Context
   ○ Include background: data, goals, client requests, resource limits.
   ○ Avoid being too general or vague; force the models to work under realistic conditions.
3. Include an Action Verb
   ○ Use verbs like analyze, evaluate, develop, design, recommend, interpret.
4. Incorporate Multiple Dimensions
   ○ Include factors the responder needs to balance.
   ○ For example, ask for a solution that is effective and cost-efficient and scalable.
5. Set the Deliverable
   ○ Be clear on what kind of output you expect: an executive summary, a report, a plan, a decision-making framework, a policy draft, etc.

All prompts must be in fully professional work domains, not general questions.

Definition of a professional prompt:

● Is it something that a person without your expertise would ask? If the answer is yes, this isn’t good.
● Is it something that would take you or a small team a while to think through and do? Would it be difficult or time-consuming for you or a colleague?
● Is it something that would require pre-research to understand before executing?


Your prompt should reflect realistic and challenging requests you might make to a generative AI or a skilled human assistant. Imagine you're assigning these tasks to a highly capable new team member or personal assistant who has limited background knowledge.

🧠 Knowledge Cutoff Date: The model has no knowledge of the world after January 31, 2023. Requesting information after this date is not allowed.

📚 Reference Texts: Any documents you'd like the model to use must be copied and pasted into the Attachment tool below the prompt box. 2,000-word maximum.

🚫 No General Internet Search: The model has knowledge of the world, but does not have internet access. If you’d like to reference specific information, please provide that information as a reference text.

✅ Your Goal: Craft a prompt that is difficult enough that ALL MODELS score less than 60% on the final Evaluation based on the Rubric criteria you will create later.

Each prompt must be:

Realistic
1. Reflects genuine scenarios professionals might encounter in their field.
2. Focuses on practical and realistic situations rather than hypothetical or overly academic questions.

Challenging
1. Requires deep analysis, reasoning, and/or data synthesis, taking a few hours for a knowledgeable human to answer thoroughly.
2. State-of-the-art models should score <60% on rubric items, both weighted and unweighted.

Feasible
1. Can realistically be completed by a knowledgeable professional with the given data + publicly available non-LLM resources.
2. Generally concise (under ~4K words).
3. Avoids analysis of excessively large amounts of information such as entire books or data sets. Subsets of large data and chapters of books are acceptable.

Clearly Defined
1. Clearly stated and specific enough that other experts would agree on what a good answer looks like.
2. Avoids overly subjective questions or unresolved debates in research.
3. Amenable to developing clear, objective evaluation criteria.

Prompt Domains

The following are all the allowed domains for tasking in this project:

● Accounting
● Administration & Logistics
● Agriculture
● Architecture
● Art
● Biochemistry
● Bioinformatics
● Biology
● Biomedical Science
● Business & Finance (Other)
● Chemical Engineering
● Chemistry
● Civil Engineering
● Computer Science / AI
● Design
● Ecology
● Economics
● Electrical Engineering
● Engineering (Other)
● Entrepreneurship
● Genetics
● HR
● Humanities Research (Other)
● Immunology
● Law
● Law & Public Policy
● Literature
● Marketing & Sales
● Materials Science
● Mathematics
● Mechanical Engineering
● Media & Journalism
● Medicine
● Microbiology
● Molecular Biology
● Music
● Nanotechnology
● Neuroscience
● Operations Management & Research
● Pharmacy
● Physics
● Psychology
● Public Health
● Quantitative Business & Finance
● Scientific Research (Other)
● Social Sciences Research (Other)
● Strategic & Managerial Business
● World History

🚫 These prompts are out of the scope of this project and are not allowed:

● Any code editing/generation. This also includes providing code to the model for further analysis in any way (no rubric item should be a code test).
● Questions that have a verifiable GTFA (Ground Truth Final Answer)
   ○ Questions that have an exact right answer
● Games or puzzles outside a professional context
● Sports, entertainment, or trivia outside a professional context

Prompts with a single ground truth final answer (GTFA) don't benefit much from a rubric since the answer is unique. There are many other projects that focus on GTFAs, but in Valkyrie, we're looking for prompts that could be answered in a variety of different ways and therefore benefit from a rubric that isolates key aspects that ideal responses share.

First Principles Thinking

First principles thinking is the habit of breaking a problem down to its fundamental facts and constraints before proposing a solution, instead of leaning on “everybody knows…” shortcuts.

What it is:
● Starting from the specialist (physics, biology, policy, etc.) requirements of the task.
● Listing the real constraints (temperature limits, patient safety, legal rules, etc.).
● Testing candidate options against those constraints, even if the option is unusual.
● Explaining the trade-offs you considered as an expert in your field.

What it isn’t:
● Repeating standard practice because it’s common.
● Copy-pasting rules of thumb without checking if they apply here.
● Skipping the “why” and jumping straight to a favourite material / intervention.

Why it’s important:
● Prevents shallow “common-sense” answers that miss hidden pitfalls.
● Captures expert “taste” - when a best practice doesn’t fit special conditions.

Example of first principles thinking:
● If a child with severe behavioral issues is experiencing an aggressive behavioral episode, many clinicians would use X intervention(s) to stop the behavior - BUT - under specific conditions, those interventions would actually be counterproductive. Instead, there are a lot of factors you'd need to assess before choosing an intervention, even if that intervention is "best practice".

How to incorporate it into your tasks:
● When writing a prompt:
   ○ Ask yourself “Is the prompt best solved by reasoning from first principles?”
● When writing a rubric criterion:
   ○ Ask yourself “Does fulfilling this criterion require first principles thinking?”

Not every prompt and every rubric criterion needs to incorporate first principles thinking, but as experts in your fields, you are better equipped than anyone else to challenge conventional thinking and come up with complex work that will help advance LLMs like never before.

Image Specifications - NOT CURRENTLY AVAILABLE

How to add an image

Below the prompt box, you will find two different buttons:
● To add a reference text
● To attach an image

To correctly add your image, you will need to upload it and provide the URL as seen in the following example.

How to upload your image

To attach an image, you can right-click on your chosen image, select “Copy Image Address,” and paste the URL into the appropriate tab in Outlier.

If you are using your own image, go to your preferred image-hosting website (like Imgur) and upload it there. After that, follow the previous step.

Please make sure that whatever website you use doesn't automatically delete the image after a certain amount of time. For example, using Imgbb, you have to select the "Don't autodelete" option.

Image Guidelines

✅ If you upload an image, it should be directly related to the prompt.
✅ If you upload an image, it must be used.
✅ Save your file in one of the following formats: .jpg, .jpeg, or .png.
✅ Ensure clarity and proper sizing (minimum 800px on the smallest side).

Selected images SHOULD NOT:
❌ Include any Personal Identifiable Information
❌ Have NSFW (Not Safe For Work) content
❌ Be blurry/unreadable

When writing a prompt, don’t describe the image in so much detail that the image itself becomes unnecessary. The image must be needed to solve the prompt.

Image Tips

Before adding your entire prompt and rubrics, please test that your image is readable by all 3 models. You can test it by adding a simple prompt like “What’s in the image?”

Also, make sure that, even when all 3 models can see the image, they still acknowledge it after you add the whole prompt. It is not acceptable to fail a model just because it punts or says that there is no image.

Any task submitted with an image punt will result in removal from the project.

Define + Weight Rubric Criteria

Rubrics define success criteria and clearly illustrate what constitutes a high-quality response. A rubric is structured as a set of clear, specific criteria that are objectively met or not met, each associated with a weighted value.

Weight

The Weight of each Criterion is based on how critical it is to creating a perfect response to the Prompt. You will need to take some time adjusting these Weights in order to achieve a ratio that adequately incentivises the Model to produce the best response (more on this later).

You will weigh each criterion on a -10 to +10 scale. Note: Criteria cannot have a weight of “0”.
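The weight rules above can be sketched as a small validation helper. This is a hypothetical illustration, not part of the project tooling, and it assumes weights are integers, as in every example in this document:

```python
def is_valid_weight(w):
    # A criterion weight must lie on the -10 to +10 scale
    # and can never be exactly 0.
    return isinstance(w, int) and -10 <= w <= 10 and w != 0

# e.g. is_valid_weight(10) and is_valid_weight(-5) are True,
# while is_valid_weight(0) and is_valid_weight(12) are False.
```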

Mental Model:

[+ Points]: Doing something we’ve asked - OR - something we didn’t ask for but that makes the response better
[0 Points]: Not doing something you’ve asked - OR - not doing something another model did that makes the response better
[- Points]: Doing something actively wrong - OR - something that makes the response worse
Requirements

Criteria must, first and foremost, be Mutually Exclusive, Collectively Exhaustive (MECE) and Complete Sentences.

This means that each criterion should not be repeated. Avoid redundant criteria so a model isn't penalized twice for the same mistake. At the same time, the sum of all criteria should be thorough enough to cover all aspects of a perfect response.

If the prompt asks to calculate the average of 2 prices, avoid a rubric like the following:

● Price 1 is correctly determined to be $15
● Price 2 is correctly determined to be $13
● Computes and reports the correct average of the two prices as $14
● Identifies all prices correctly and gets to the right answer

In the example above, the last criterion does not test for anything that isn’t already covered by the previous three, so it is redundant.

We want to avoid double penalization, which we can achieve through MECE structuring. For example, instead of:

● "The response should include that the bill of $45.27 with a pending charge of $54.32 represents a 20% charge difference."
● "The response should include that the payment pending for the customer's credit card would mean a bank ghost charge of 20%."

- where the model could be penalized twice for the same 20% error - we should instead write:

● The response should include that the bill of $45.27 with a pending charge of $54.32 represents a 20% charge difference.
● The response should explain that the pending charge on the user’s account includes a bank ghost charge.
In addition to MECE, the criteria you create should follow these principles:

Diverse
The rubric items should include variable types of information. If a majority of rubric items follow the same format, then either the prompt is too shallow or the rubric does not consider enough dimensions of a good response.
❌ “the response mentions A”, “the response mentions B”

Atomic
Each rubric criterion should evaluate exactly one distinct aspect. Avoid bundling multiple criteria into a single rubric item. Most stacked criteria with the word “and” can be broken up into multiple pieces.
❌ Response identifies George Washington as the first U.S. president and mentions he served two terms.
✅ Response identifies George Washington as the first U.S. president.
✅ Response mentions that George Washington served two terms.

Specific
Criteria should be binary (true or false) and objective (a majority of readers should agree on whether a given model response satisfies the criteria). Even if the criterion must rely on some level of personal interpretation, it should be something that >75% of people would rate the same way for all three responses.
❌ The response should not be too verbose.
❌ The response should provide a few examples.
✅ The response should not exceed 500 words.
✅ The response should list exactly three examples.

Self-contained
Each criterion should contain all the information needed to evaluate a response, and it should be verifiable without requiring an external search.
❌ Mentions the capital city of Canada.
Why is it wrong? It doesn’t contain the answer, which is Ottawa.
✅ Mentions the capital city of Canada is Ottawa.

⚠️ This self-contained condition means that you must provide examples or the final answer of what the criterion is evaluating. Ideally, most, if not all, of the criteria should be supported with examples or final answers.

❌ Response names any of the Nobel Prize winners in Physics in 2023.
Why is it wrong? It doesn’t contain the examples of possible Nobel Laureates.
✅ Response names any of the following Nobel Prize winners in Physics in 2023: Pierre Agostini, Ferenc Krausz, or Anne L’Huillier.

Exclusion criteria like “The response must not mention Michael Jackson songs” don’t need examples, since adding those doesn’t provide any new insight.

1. Outline Core Criteria

Your initial Rubric draft will define exactly what is critical for a
response to be considered minimally viable - meaning it meets
the essential requirements to answer your prompt sufficiently.
Avoid criteria representing enhancements ("nice-to-haves") or
negative criteria pointing out flaws or mistakes at this stage.
Instead, focus exclusively on essential content or components
without which the response would be incomplete or unacceptable.

These Core Criteria will likely have the highest Weight values, as
they are the most critical to creating a perfect Response.

Example Prompt:

Provide an analysis of the recent (~2021) Supreme Court case between Google and Oracle, focusing on intellectual property implications.

For example, if your prompt asks for an analysis of a recent Supreme Court ruling on intellectual property rights, a minimally viable rubric might include criteria like:

● [10] Response accurately identifies the Supreme Court ruling by name (GOOGLE LLC v. ORACLE AMERICA, INC) and year (April 5, 2021).
● [10] Response clearly summarizes the central decision or holding of the ruling, which was that Google's copying of the Java SE API was a fair use of that material as a matter of law.
● [10] Response references at least one publicly available source verifying ruling details, such as (https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf).

These criteria represent only those essential elements without which the response would not sufficiently address the prompt. They intentionally exclude deeper insights, additional background information, or negative points for inaccuracies or irrelevant details.

2. Add Additional Criteria

The next step is to create additional criteria. These new criteria represent either enhancements ("nice-to-haves") that elevate the quality of a response beyond the minimal standards, or negative criteria that highlight actual mistakes, common human errors, inaccuracies, or harmful or irrelevant content detracting from a response's quality.

These Additional Criteria will likely have lower Weight values, as they refine and enhance a Response rather than define it.

For example, using the previous prompt on analyzing a recent Supreme Court ruling on intellectual property rights, additional rubric criteria might include:

Enhancements
● [10] Response discusses broader legal implications beyond the immediate case, such as expanding the boundaries of software patentability.
● [5] Response compares the ruling to at least one other relevant Supreme Court case, such as Alice Corp. v. CLS Bank.

Negatives
● [-5] Response incorrectly states that Oracle lost its case due to patent invalidation.
● [-2] Response discusses patents granted during the year of the case, which is irrelevant to the discussion.
When crafting Negative criteria, think about common errors a human might make when doing this task. Draw on your experience and think through the issues that may arise, on top of any blatant issues in any of the responses.

🚨 Negative Criteria should NOT simply be the direct opposite of a criterion you have already included.

🚫 Bad Example:
[+5] Notes that vasopressin is no longer recommended as a substitute for epinephrine, though it may be considered in combination with epinephrine (Class 2b)
[-5] Does not note that vasopressin is no longer recommended as a substitute for epinephrine, though it may be considered in combination with epinephrine (Class 2b)

✅ Instead:
[+5] Notes that vasopressin is no longer recommended as a substitute for epinephrine, though it may be considered in combination with epinephrine (Class 2b).
[-5] Claims vasopressin alone should be used instead of epinephrine.

In the example above, we don’t want to penalize the model for not doing something; instead, we want to penalize it for actively responding in a way that makes the response worse.

These additional criteria allow your rubric to recognize responses that demonstrate deeper research, thoughtful analysis, and careful attention to detail, as well as to penalize those with mistakes or irrelevant content.

TOOL: Valkyrie Valhalla Rubrics Generation

The team developed a new application to help you with your rubric creation. You can find it at:

https://valkyrie-valhalla.vercel.app/

Note that:

● Some of the rubrics could be overfitting
● The tool might also provide criteria phrased as "The response should not do X," which are best phrased positively and given negative weight

⚠️ IMPORTANT: This tool does not replace your own process of creating the criteria, but it will help you to improve them. When in doubt, ALWAYS use this instruction as the SSOT (single source of truth).


⚠️ RANK THE MODEL RESPONSES ⚠️

🚨 Remember Your Goal: If all responses handle your Core Criteria well, you will need to adjust your prompt to make it more complicated before moving forward.

Evaluate each Response

Once you've established your rubric, evaluate each of the model responses against your Rubric Criteria. Each criterion is binary (Yes or No): either the response meets the condition outlined in the Criterion or it does not. If you find yourself wanting to choose a third option like “partially”, consider splitting that criterion into more atomic pieces.

Here's how two different responses might be evaluated using our example Supreme Court ruling prompt:

Response 1:

“The 2021 Supreme Court case addressed the copyrightability of Java APIs. The court ruled that Google's use of Oracle's Java API was protected under fair use, emphasizing transformative use. This decision significantly influenced the boundaries of software copyrights. In fact, there were over 10,000 patents granted the year the case was decided. (source)”

Evaluation:
❌ [10] The response failed to identify the ruling by name.
✅ [10] It clearly summarizes the central decision (fair use, transformative use).
✅ [10] It provides a verifiable source.
✅ [10] It discusses broader legal implications (software copyright boundaries).
❌ [5] It does not compare the ruling explicitly to other cases.
❌ [-5] Does not state that Oracle lost due to patent invalidation.
✅ [-2] The model discusses the number of patent grants in 2021, which is not relevant to this copyright case.

Score: 28 / 50 => 56%

Response 2:

“In the recent Supreme Court case between Google and Oracle, the decision said Google's copying of Oracle's Java API code was fair. Similar to the Alice Corp. v. CLS Bank ruling, the decision impacts software patentability. Additionally, Oracle lost due to patent invalidation, affecting their control over Java programming.”

Evaluation:
❌ [10] The response fails to identify the ruling explicitly by name and year (“recent” is insufficient).
✅ [10] It correctly summarizes the central decision (fair use).
❌ [10] It does not provide a publicly available source.
✅ [10] It discusses broader legal implications (impact on software patents).
✅ [10] It compares the ruling to Alice Corp. v. CLS Bank.
✅ [-5] Incorrectly states that Oracle lost due to patent invalidation.
❌ [-2] It includes no irrelevant or off-topic information.

Score: 25 / 50 => 50%
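A rubric score like the ones above can be computed with a short sketch. This is a hypothetical helper, not part of Outlier's tooling; it assumes the percentage is the earned score divided by the sum of the positive weights, and the weights in the usage example are made up for illustration:

```python
def rubric_score(criteria):
    """criteria: list of (weight, present) pairs, one per rubric item.
    A criterion marked Present contributes its weight, positive or
    negative; the maximum is the sum of the positive weights only."""
    earned = sum(w for w, present in criteria if present)
    maximum = sum(w for w, _ in criteria if w > 0)
    return earned, earned / maximum

# Hypothetical evaluation: two core criteria (one met, one missed),
# one enhancement met, and one triggered negative criterion.
earned, pct = rubric_score([(10, True), (10, False), (5, True), (-5, True)])
# earned = 10, maximum = 25, pct = 0.4
```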

⚠️ CONFIRM YOUR EVALUATION ⚠️

Review your evaluation and make corrections if necessary, keeping the following thresholds in mind:

a. Warning Threshold: For any Response, if > 60% of rubric criteria are passing, or the rubric score is > 60%, your prompt may not be difficult enough or the rubric may not fully capture what makes the prompt difficult. Consider revising.
b. Rejection Threshold: For any Response, if ≥ 70% of rubric criteria are passing, or the rubric score is > 70%, you will not be allowed to proceed until revisions are made.
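The two thresholds can be summarized in a small sketch (hypothetical, for illustration only; percentages are expressed as fractions between 0 and 1):

```python
def evaluation_status(frac_criteria_passing, frac_score):
    # Rejection threshold: >= 70% of criteria passing, or score > 70%.
    if frac_criteria_passing >= 0.70 or frac_score > 0.70:
        return "rejected"   # revise before proceeding
    # Warning threshold: > 60% of criteria passing, or score > 60%.
    if frac_criteria_passing > 0.60 or frac_score > 0.60:
        return "warning"    # prompt may not be difficult enough
    return "ok"
```

For example, a response passing half the criteria with a 56% score returns "ok", while one passing 72% of criteria returns "rejected".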

Finalize Ranking

Finally, revisit your initial Rankings and confirm your rubrics align. The final ranking produced from your evaluation does not necessarily need to match your initial impressions; however, it’s best to double-check and make sure your rubric is exhaustive.

Key Questions:

1. Was there something that influenced your ranking but isn’t captured in the rubric?
2. Are all scoring decisions grounded in your rubric criteria and not intuition/preferences outside of it?

This final step is about ensuring internal consistency: your rubric should fully account for the qualities that separate strong responses from weaker ones. If it does, your rankings will naturally follow. If it doesn't, revise it so that the final rankings are more aligned and comprehensive.

Rubric Best Practices

Overview
Rubrics provide an objective structure to evaluate what makes a good response to a given Prompt.

The rubric is a list of specific, simple, and declarative criteria that can be evaluated as true (present) or false (not present) for any given response to the prompt. Each criterion has a weight associated with how critical it is to fulfilling the prompt. The criteria, taken as a whole, should be able to evaluate any possible response and rank different responses.

⚠️ If every response in a task has the same criteria marked as "True" and the same ones marked as "False" (or nearly identical), it is likely that the task is not sufficiently good.

Writing Good Rubrics

General Principle

● MECE: Mutually Exclusive, Collectively Exhaustive
○ Completeness: Consider all the elements you would want to include to create a perfect response and put them into the rubric. This means including not only the facts and statements directly requested by the prompt but also the supporting details that provide justification, reasoning, and logic for your response. Each of these elements should have a criterion because each criterion helps to develop the answer to the question from a slightly different angle.
○ No overlapping: the same error from a model shouldn't be punished multiple times.

● Diversity
○ The rubric items should cover varied types of information. If all criteria are like "the response mentions A", "the response mentions B", then this is not a good rubric.
● How many rubric items for each prompt
○ As many as needed. There is no golden standard, and the desired number of rubric items varies by account and task type. 10-30 is a good range, but there is no strict limit. The principle here is to write rubrics that cover all aspects of an ideal response.
■ In general, tasks that can be fully evaluated with fewer than 10 rubric items are not complicated enough. In such cases, we should consider whether the prompt is difficult enough rather than blindly adding more criteria.

Atomic / Non-stacked

● Each rubric criterion should evaluate exactly one distinct aspect. Avoid bundling multiple criteria into a single rubric item. Most stacked criteria containing the word "and" can be broken up into multiple pieces.

❎ Response identifies George Washington as the first U.S. president and mentions he served two terms.
✅ Response identifies George Washington as the first U.S. president.
✅ Response mentions that George Washington served two terms.

Specific

● Criteria should be binary (true or false) and objective.
● Avoid vague descriptions (e.g., "the response must be accurate" is vague).
● Define precisely what is expected.
○ Example: "The response should list exactly three examples."
Self-contained

● Each criterion should contain all the information needed to evaluate a response, e.g.
❎ Mentions the capital city of Canada.
✅ Mentions the capital city of Canada is Ottawa.
● Criteria should be verifiable without requiring an external search.
❎ Response names any of the Nobel Prize winners in Physics in 2023.
✅ Response names any of the following Nobel Prize winners in Physics in 2023: Pierre Agostini, Ferenc Krausz, or Anne L'Huillier.

Classifying Rubrics

Objective vs Subjective

● Objective Rubric: Evaluation criteria directly link to explicit instructions or measurable conditions. Evaluators consistently arrive at the same conclusion.
○ Example: "The response must contain exactly five items."
● Subjective Rubric: Evaluation criteria require evaluator judgment, inference, or interpretation. Evaluations might vary slightly between evaluators.
○ Example: "The response tone should feel professional and polite."

Explicit vs Implicit

● Explicit Rubric: Criteria explicitly stated within the prompt or instructions.
○ Example: The prompt explicitly asks for a table; the rubric verifies table formatting.
● Implicit Rubric: Criteria inferred from cultural norms, language conventions, or general expectations, not directly stated in the prompt.
○ Example: Correct grammar and punctuation are typically implicit unless explicitly required otherwise.

Rubric Dimensional Ratings

What dimension does each criterion evaluate the model on?

● Accuracy: Assesses whether responses include only factually correct information, aligned with current expert consensus. This dimension also covers recognizing uncertainty when the evidence on a topic is weak or evolving.
● Completeness: Examines whether a response includes all important information needed to be safe and helpful to the user. Even if accurate, a response that is incomplete (e.g., omitting key steps or red flags) can still result in low-quality advice or harm.
● Communication Quality: Captures whether the response is well-structured and concise, and whether it uses a level of technical depth and vocabulary that is well-matched to the user.
● Context Awareness: Captures whether the model appropriately responds to contextual cues that are present (e.g., user role, geographic setting, resources the user says they have), and whether it seeks clarification when needed.
● Instruction Following: Evaluates whether the model adheres to instructions while still prioritizing safety.

Common Pitfalls and How to Avoid Them

🚫 Pitfall: Overly Complex or Ambiguous Criteria

● Keep each criterion simple, atomic, and verifiable.
● Provide examples to minimize subjective interpretation.

🚫 Pitfall: Subjectivity Confusion

● Provide clear examples and clarifications for subjective rubrics.
● Consider breaking subjective criteria into explicit, objective sub-components when possible.

🚫 Pitfall: Too Few or Too Many Criteria

● Rubrics should be designed in such a way that a model response satisfying all of the criteria is close to a perfect response.

🚫 Pitfall: Double Counting Criteria

● Negative criteria should not penalize a model for NOT doing something already prescribed in positive criteria. Bad example:
○ [+5] The response mentions the Statue of Liberty is a gift to the U.S.
○ [-5] The response does not mention the Statue of Liberty is a gift to the U.S.
● Instead, penalize a model for doing something wrong:
○ [+5] The response mentions the Statue of Liberty is a gift to the U.S.
○ [-5] The response mentions the Statue of Liberty was stolen by the U.S.
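The correct pattern can be sketched as a small scoring function. Everything here (the helper name, the string-based representation of criteria and responses) is an illustrative assumption; the point is that a negative criterion only fires when the wrong claim is actually present, so a response is never double-punished for one omission.

```python
# Sketch: positive criteria add points when satisfied; negative
# criteria subtract points only when the undesirable claim appears.

def score(criteria, response_facts):
    """criteria: list of (weight, fact) pairs; response_facts: the set
    of claims the response actually makes (hypothetical encoding)."""
    earned = sum(w for w, fact in criteria if fact in response_facts)
    max_score = sum(w for w, _ in criteria if w > 0)
    return max(earned, 0) / max_score

criteria = [
    (5, "statue was a gift to the U.S."),   # positive criterion
    (-5, "statue was stolen by the U.S."),  # negative criterion
]

good = {"statue was a gift to the U.S."}
bad = {"statue was stolen by the U.S."}
print(score(criteria, good))  # 1.0: earns the positive criterion
print(score(criteria, bad))   # 0.0: penalized for the wrong claim
```

A response that simply omits the gift fact loses the +5 once; only a response that asserts the incorrect claim also takes the -5 penalty.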
Prompt + Rubric Examples
✅ Good Examples

Prompt: Scientific Research

I have been tasked with implementing a clinical research study and need help creating a project launch plan for my PI that includes the following:

1. Recruitment plan to enroll 20 NYC-based MSM between the ages of 18 and 35 who are sexually active and engage in recreational methamphetamine use. Include 3-4 locations for in-person recruitment (we do not have funding for a multimedia campaign). I'll also need to know specific local regulations that impact this type of recruiting before we implement this plan.
2. Clinical appointment schedule. M-F 9a to 6p. 2 clinical psychologists and 2 undergraduate research assistants. Each appointment lasts 4 hours with 45 minutes dedicated to professional therapy (the rest is handled by the project team), 1h for laboratory work and study drug administration, 15min for check-in, and the remaining time for async evaluation forms and observation. Appointments occur weekly for 6 weeks. We want an even distribution of clients each week and our current staffing is sufficient to run 2 clients in tandem.
3. Exclusion criteria for initial screening. This study will involve weekly IV injections of naltrexone (XR) in conjunction with therapy, clinical laboratory collection, and daily self-reporting of suicidality, sexual activity, and substance use via mobile application.

Finally, I'll need to provide each client with a warning card indicating participation in the study and major risks; let's add that to the proposal as a client-facing addendum.

Rubric

# | Criteria | Category | Objectivity | Explicity | Weight | Weight (Normalized)
1 | The response provides at least 3 real locations or venue types likely to draw MSM individuals from 18 to 35, such as nightlife venues, health centers, or community groups. | Instructive | Subjective | Explicit | 10 | 10.53%
2 | The response omits multimedia recruitment strategies such as social media or television ads. | Instructive | Objective | Explicit | 5 | 5.26%
3 | The response provides specific New York regulations that govern research recruitment, such as a prohibition on penalties for withdrawing from a research study. | Instructive | Subjective | Explicit | 10 | 10.53%
4 | The response identifies 4 slots in which research participants can be seen each day. | Instructive | Objective | Implicit | 10 | 10.53%
5 | The response suggests a schedule allowing for all 20 participants to be seen each week. | Instructive | Objective | Explicit | 10 | 10.53%
6 | The response mentions that non-MSM individuals should be excluded from the study. | Instructive | Objective | Implicit | 5 | 5.26%
7 | The response mentions that individuals currently on opiates must be excluded from the study. | Instructive | Objective | Implicit | 5 | 5.26%
8 | The response mentions that anyone allergic to naltrexone must be excluded from the study. | Instructive | Objective | Implicit | 5 | 5.26%
9 | The response mentions individuals unable to provide informed consent must be excluded from the study. | Instructive | Objective | Implicit | 5 | 5.26%
10 | The response must mention that anyone unable to use a mobile app for self-reporting must be excluded from the study. | Instructive | Objective | Implicit | 5 | 5.26%
11 | The response provides an addendum directed at clients, mentioning risks of study participation such as the side effects of naltrexone or the risks involved with opiates when combined with the study drug. | Instructive | Objective | Explicit | 10 | 10.53%
12 | The response mentions excluding individuals outside of the 18-35 age requirement. | Instructive | Objective | Implicit | 5 | 5.26%
13 | The response mentions that non-active methamphetamine users must be excluded. | Instructive | Objective | Implicit | 5 | 5.26%
14 | The response mentions those unable to attend weekly clinical appointments for 6 weeks must be excluded. | Instructive | Objective | Implicit | 5 | 5.26%
15 | The response mentions that opiate users should refrain from using for at least 12 hours. | — | Objective | Explicit | -10 | —
16 | The response suggests carrying narcan as a precaution while on the study. | — | — | — | -5 | —
17 | The response suggests hiring additional staff in order to meet weekly appointment goals. | — | — | — | -3 | —

Why is it good?

The core criteria (C1, C3, C4, C5, C11) carry the most weight, whilst the rest of the "nice to have" criteria have less weight. A good rubric assigns logical weights based on the importance of the criteria.
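The "Weight (Normalized)" column above can be reproduced with one line of arithmetic. We assume normalization divides each positive weight by the sum of the positive weights (the negative criteria are penalties and are left out of the denominator); under that assumption the figures match the 10.53% and 5.26% values in the table.

```python
# Sketch: deriving the normalized weights for the rubric above.
# Assumption: only positive weights count toward the denominator.

weights = [10, 5, 10, 10, 10, 5, 5, 5, 5, 5, 10, 5, 5, 5, -10, -5, -3]
positive_total = sum(w for w in weights if w > 0)  # 95

normalized = [round(100 * w / positive_total, 2) if w > 0 else None
              for w in weights]
print(positive_total)  # 95
print(normalized[0])   # 10.53  (criterion 1)
print(normalized[1])   # 5.26   (criterion 2)
```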

Prompt: Medicine

As part of an ethics board review, you are asked to evaluate a hospital's plan to deploy a generative AI tool for drafting clinical notes, starting with a pilot in the emergency department. Identify the top three ethical risks associated with AI-generated documentation in patient care. Explain how these risks might affect patient experiences, the clinician's perspective, and the administrator's perspective. Recommend safeguards or oversight processes to address these risks while still enabling responsible innovation. Discuss how your proposed safeguards would function in a fast-paced clinical environment and any trade-offs they might involve. Support your recommendations with references to healthcare AI governance frameworks or academic literature.
Rubric

# | Criteria | Category | Objectivity | Explicity | Weight | Weight (Normalized)
1 | The response must identify 3 ethical risks associated with AI-generated documentation in patient care, such as data bias, patient privacy, or error liability. | Instructive | Objective | Explicit | 10 | 10%
2 | The response addresses how AI-generated documentation might affect patient experiences, such as trust, autonomy, safety, or data privacy. | Reasoning | Subjective | Explicit | 10 | 10%
3 | The response addresses how AI-generated documentation might affect the clinician's perspective, such as clinical judgment, liability, workflow burden, or documentation accuracy. | Reasoning | Subjective | Explicit | 10 | 10%
4 | The response addresses how AI-generated documentation might affect the hospital administrator's perspective, such as institutional accountability or risk exposure. | Reasoning | Subjective | Explicit | 10 | 10%
5 | The response must provide at least one safeguard or oversight process to address the risk, such as human oversight, bias audits, or transparency measures. | Instructive | Subjective | Explicit | 10 | 10%
6 | The response clearly describes how each safeguard would realistically function in a high-pressure setting like an emergency department or hospital ward, such as integrating it into existing EHR workflows (e.g., highlighting only AI-suggested edits or flagging uncertain content). | Reasoning | Subjective | Explicit | 15 | 15%
7 | The response must identify and discuss trade-offs or unintended consequences of implementing safeguards, such as slowing documentation, increasing clinician workload, and reducing flexibility. | Reasoning | Objective | Explicit | 10 | 10%
8 | The response must reference at least one healthcare AI governance framework or academic literature source, such as WHO AI guidelines, AMA policy, or the EU AI Act. | Instructive | Subjective | Explicit | 10 | 10%
9 | The response must consider the long-term impact of implementing AI for clinical documentation, such as the potential for systemic changes in healthcare delivery, physician-patient relationships, and hospital infrastructure. | Reasoning | Subjective | Implicit | 5 | 5%
10 | The response must discuss mechanisms for ensuring transparency and accountability of the AI tool, especially in the event of errors, system failures, or negative outcomes. | Reasoning | Objective | Implicit | 10 | 10%

Prompt: Administration & Logistics

Walmart has reported updates regarding the recent litigation issues with the Department of Justice in its Form 10-K. Pull the dates from the most recent form specifically for the DOJ Civil Litigation only. Compare how this case has affected the financials from 2023-2024 relating to operating expenses, net cash, and accrued liabilities.

Rubric

# | Criteria | Category | Objectivity | Explicity | Weight | Weight (Normalized)
1 | Response defines December 22, 2020, as the date that the DOJ filed a civil complaint in the U.S. District Court for the District of Delaware against Walmart. | Instructive | Objective | Explicit | 10 | 8%
2 | Response defines February 22, 2021, as the date that Walmart initially moved to dismiss the DOJ complaint. | Instructive | Objective | Explicit | 5 | 4%
3 | Response defines October 7, 2022, as the date that the DOJ filed an amended complaint against Walmart. | Instructive | Objective | Explicit | 10 | 8%
4 | Response defines November 7, 2022, as the date that Walmart filed a partial motion to dismiss the amended complaint filed by the DOJ. | Instructive | Objective | Explicit | 5 | 4%
5 | Response must define January 18, 2024, as the date that the Court held a hearing on the partial motion to dismiss filed November 7, 2022. | Instructive | Objective | Explicit | 5 | 4%
6 | Response must define January 18, 2024, as the date that the Court ordered the DOJ to file an amended complaint against Walmart. | Instructive | Objective | Explicit | 5 | 4%
7 | Response must define February 1, 2024, as the date the DOJ filed an amended complaint against Walmart. | Instructive | Objective | Explicit | 5 | 4%
8 | Response must define February 6, 2024, as the date Walmart filed a partial motion to dismiss the amended complaint filed by the DOJ on February 1, 2024. | Instructive | Objective | Explicit | 5 | 4%
9 | Response must define March 11, 2024, as the date that the Court granted in-part Walmart's motion by dismissing the entirety of the DOJ's claims related to distribution and dismissing the DOJ's claims arising under one of the DOJ's two dispensing liability theories. | Instructive | Objective | Explicit | 5 | 4%
10 | Response must define November 15, 2022, as the date that Walmart announced it had agreed to a Settlement Framework to resolve substantially all opioids-related lawsuits filed against the Company by states, political subdivisions, and Native American tribes. | Instructive | Objective | Explicit | 10 | 8%
11 | Response must mention that the operating expense as a percentage of net sales decreased significantly from 2023 to 2024, indicating that the litigation matter with the DOJ affected the 2023 operating expenses more severely. | Reasoning | Objective | Explicit | 10 | 8%
12 | The response must mention that in 2024 Walmart's net cash provided by operating activities increased from 2023 partially due to finalizing accrued opioid legal charges. | Instructive | Objective | Explicit | 10 | 8%
13 | The response should mention that the effects on net cash relating to any legal cases were not explicitly stated by Walmart in 2023. | Reasoning | Objective | Explicit | 5 | 4%
14 | The response should mention that net cash in 2023 was likely reduced due to charges relating to the DOJ legal implications and settlement. | Reasoning | Objective | Explicit | 5 | 4%
15 | The response must mention that cash outflows increased in 2024 from 2023, likely from litigation and settlement costs. | Instructive | Objective | Explicit | 10 | 8%
16 | The response must mention that accrued liability increased in 2023 from 2022 due to costs relating to the Settlement Framework, which became effective in that year. | Instructive | Objective | Explicit | 10 | 8%
17 | The response must mention that accrued liabilities decreased from 2023 to 2024. | Instructive | Objective | Explicit | 10 | 8%
18 | The response must mention that Walmart confirms the change in liabilities from 2023 to 2024 is partially due to the payment of the remaining accrued opioid legal charges. | Instructive | Objective | Explicit | 5 | 4%

🚫 Bad Examples

Prompt: I'm a big fan of Harry Potter. I wanted to make a video for my channel and I need some information. Which spells are most often used and mentioned by wizards (not witches) by name? Do a check and show the order of the 10 most mentioned in the saga. Next to each one, describe what it is for and what it does to the opponent. Which character uses each spell the most? Give me a table of the spells and the wizards that use them.

Why? Too contrived, too many questions, unrealistic ask.

Prompt: I want to start a diet to gain muscle mass, I currently weigh 70kg. I'm celiac, but I like to explore foods that contain 'a little' gluten. Prioritize carbohydrates, but don't overdo the calories, staying close to 3000 or less, but without restricting protein (consider 2g of protein per kg of body weight). I need something easy to prepare, but also gourmet, and includes fresh, frozen and ready-made foods. I don't want to repeat meals throughout the day, but they should be small and work-appropriate. Please organize everything and detail the calories, macronutrients and complete recipes.

Why? Bad because there is too much backstory, context, justification of constraints/asks, and question-stacking.

Prompt: Give me 3 vegan cookie recipes. They don't contain nuts, I'm allergic to nuts.

Why? The prompt is too simplistic.

Technical Issues

Linters

What is a Linter?

Linters are automatic checks that run when a step is saved, like a prompt or rubric, to help catch issues and keep things consistent.

Example linters:

● Warning Linter
● Information Linter
● Severe Warning Linter

How to dismiss a linter

If a linter pops up and you're sure there's no real issue, then follow these steps:

● Step 1: Dismiss (yellow and blue ones) / reject feedback (red ones).
● Step 2: Immediately save the same step without making any other changes.

This workflow will allow you to save the changes you have made and proceed with the next step in the task.

Weighting Criteria

Adding weights to the rubric criteria allows for a fairer assessment of model responses. Imagine you ask three LLMs to give you the highest-rated movie according to IMDb:

● 🤖 R1: The highest-rated movie on IMDb's top 250 movies list is The Shawshank Redemption.
● 🤖 R2: The highest-rated movie on IMDb's top 250 movies list is The Shawshank Redemption with a rating of 9.3.
● 🤖 R3: The highest-rated movie on IMDb's top 250 movies list is The Godfather with a rating of 9.2.

Note that R1 and R2 correctly answer the "core ask" of your request (you can validate it here: Top 250 movies), but as a user, we know that a good response might also give you the rating of the highest-rated movie according to IMDb (this is an implicit ask).

Two rubric criteria for this simple prompt could be:

● C1: The response must state that the highest-rated movie according to IMDb is The Shawshank Redemption.
● C2: The response should provide the IMDb rating for the movie it gives as the highest rated; for example, The Shawshank Redemption has a 9.3 rating.

(Other possible criteria we omit for brevity.)

Imagine if we didn't assign weights; then:

● R1 would fail the second criterion and score 50%
● R2 would get both criteria correct and score 100%
● R3 would fail the first criterion, but pass the second criterion and score 50%

Notice that R1 and R3 both score the same, even though R1 is clearly better.

💡 This is why we introduce weights: we want to give important criteria more weight than less important criteria (in our example, we want to give criterion 1 more importance than criterion 2).

Assigning weights well is all about knowing which criteria are the most important.

For this trivial example, and considering only these two criteria, a good choice for the weights is:

● C1 - 70%
● C2 - 30%

In which case:

● R1 would fail the second criterion and score 70%
● R2 would get both criteria correct and score 100%
● R3 would fail the first criterion, but pass the second criterion and score 30%
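The weighted tally above can be sketched in a few lines. The 70/30 split comes from the example; the scoring function itself is an illustrative assumption of how a weighted rubric is totaled (sum the weights of the passing criteria).

```python
# Sketch: weighted rubric scoring for the IMDb example.
# Weights are integer percentages so the totals are exact.

def weighted_score(weights, passes):
    """weights: per-criterion weights summing to 100;
    passes: per-criterion True/False results."""
    return sum(w for w, p in zip(weights, passes) if p)

weights = [70, 30]  # C1 (core ask), C2 (nice-to-have)
r1 = weighted_score(weights, [True, False])   # names the right movie
r2 = weighted_score(weights, [True, True])    # movie + rating
r3 = weighted_score(weights, [False, True])   # wrong movie, has rating
print(r1, r2, r3)  # 70 100 30
```

With weights, R1 now clearly outscores R3, matching our intuition that getting the core ask right matters more than the implicit extra.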

Features of an Ideal Response

Each prompt has an ideal response, which you must have a clear picture of. Your goal is to pick out all the features that make up an ideal response.

The single most important rule for weighting criteria is:

Core criteria (those that evaluate core features) must be weighted MORE than nice-to-have criteria (those that evaluate nice-to-have features).

When it comes to real tasks, you'll rarely get the rubric and the weights all right on the first try. What ends up happening is that once you read the model responses and compare them, you'll find features that make for a better response (most of them nice-to-have features) and that further help differentiate the responses based on score. All you have to do is add those criteria and recalibrate the weights.

In general, nice-to-have criteria should have very little weight, just a few percentage points. On that note, if you find anything that one response did better than the others, don't be afraid to create a nice-to-have criterion that addresses it; just make sure not to put a lot of weight on it.

Avoid:

❌ Giving less important criteria more weight than more important criteria.
❌ Changing the weights just to get the scores you want; only adjust them based on how important each criterion actually is. If we detect you're doing this on purpose, you may be removed from the project.
❌ Giving an excessive amount of weight to a single criterion.

Scoring

[Prompt] Clarity and Specificity
● 1-2 (Fail): [Major Clarity Issues] The prompt contains significant spelling/grammar errors that render it unintelligible. The prompt has major ambiguities which render the problem ill-defined.
● 3 (Okay): [Minor Clarity / Specificity Issues] It's mostly clear what is being asked, but the request could reasonably be interpreted multiple ways. Most experts, when presented with the prompt's ambiguity, would make the same assumption to render it well-defined. The prompt has one or two spelling/grammar errors which have a minor impact on clarity.
● 4-5 (Good/Perfect): There is little to no room for misinterpretation of the specific request. The prompt has a specific request that doesn't require more than one minor assumption to answer it.

[Prompt] Feasibility
● 1-2 (Fail): [Major Feasibility Issues] The prompt gives conflicting/contradicting instructions that can't be fulfilled simultaneously (unless specifically instructed to do so). [Temporal Prompt] The prompt contains a temporal question: it uses time-relative terms (e.g., "latest") where the response depends on when the question was asked, or asks the LLM for information. [Harmful Content] The prompt contains any harmful content.
Note: it is okay for a prompt to be "tricky" or misleading as long as it is not explicitly self-contradictory (e.g. asking the model "what was the name of the 53rd state to join the US?").
● 3 (Okay): [Minor Feasibility Issues] The prompt's request is verging on being impractical and the LLM won't be able to completely fulfill everything asked in the prompt, but the prompt is still answerable with concessions.
● 4-5 (Good/Perfect): The prompt is completely actionable by an LLM or chatbot. The prompt contains no conflicting instructions/statements. The prompt contains no harmful content.
● Additional Notes: "Harmful Content" for this project is simply described as "unsafe content." This could include: 1. Content harms - unsafe text (bigotry, conspiracy theories). 2. Facilitation harms - text that enables unsafe behavior (how to make a bomb).
Temporal example: "Hey! I have been pretty disconnected from the entire presidential election. I'm going to do some research on it. First, tell me who the candidates this year are. Then, tell me what some of the latest polls have said on who will win?"

[Prompt] Contrivedness
● 1-2 (Fail): [Contrivedness/Unnatural] The prompt and its constraints are overly restrictive, contrived, unrealistic, or do not reflect something a real user might plausibly ask of a model.
● 3 (Okay): [Minor Contrivance] The prompt contains one or two constraints that are a little contrived, but the prompt as a whole is reasonable and clear. It may be a little confusing, but the model is able to create reasonable and consistent responses.
● 4-5 (Good/Perfect): The prompt is reasonable and natural. It may contain constraints or multiple questions, and is clear in its ask to the user. It's not overly generic.

[Prompt] Professionalism
● 1-2 (Fail): [Not Professional] The prompt does not sound like a question asked for work/business purposes. The prompt could be reasonably answered by the general population without the need for specific domain knowledge OR background research (research needed to understand the prompt before answering).
● 4-5 (Good/Perfect): The prompt is clearly designed by a professional in the chosen domain. The prompt would need some level of professional experience/domain knowledge to answer successfully.

[Rubric] Rubric Criteria
● 1-2 (Fail):
[Major Contradiction] At least one criterion contradicts what is required in the prompt.
[Counterproductive Criteria] At least one criterion is objectively incorrect, or it makes the response worse when true. At least one rubric question is framed in such a way that your answer to it while evaluating an ideal response would be "fail".
[Missing Criteria] Criteria that are objectively 100% essential to meet the fundamental requirements of the prompt are missing, preventing the response from successfully fulfilling the core task if not included.
[Closed Ended Prompt] The prompt is close-ended (has a short GTFA) but does not provide the answer. Example - Bad rubric criterion: "The response should identify the first U.S. secretary of defense". Fixed rubric criterion: "The response should state that the first U.S. secretary of defense was James V. Forrestal".
[Very Vague Criteria] The criteria are overly vague, such that it is unclear whether one could determine if they were fulfilled.
[Atomicity] There are multiple unrelated prompt requests combined into one rubric criterion.
● 3 (Okay):
[Irrelevant Criteria] The criteria being defined are only partially connected to the prompt, and/or having them defined does not make the response objectively better.
[Broad but Ratable Criteria] The criteria are broad, but most people would be able to determine if they were met (e.g., "the response should be humorous").
● 4-5 (Good/Perfect): The rubric covers the core instruction. The rubric does not contain objective inaccuracies. The rubric covers all prompt constraints. The rubric is specific and relevant. If needed, the rubric provides direct answers. The rubric provides binary criteria. The rubric covers all the necessary criteria to create a perfect response.

[Rubric] Rubric Criteria Weights
● 1-2 (Fail): [Major Imbalance] There are obvious imbalances in the rubric criteria weights. A criterion that clearly does not pertain to a core request of the prompt is weighted the same as or higher than those pertaining to the core requests.
● 3 (Okay): [Minor Imbalance] The rubric criteria weights are roughly correlated with their importance, with some room for debate but no glaring errors.
● 4-5 (Good/Perfect): The rubric criteria weights accurately reflect their importance to a good response.

[Rating] All Objective Criteria
● 1-2 (Fail): [Major Objective Criteria Rating Disagreement] The contributor marks "Pass" when it should definitely be "Fail" or vice versa. Only applicable to Objective Criteria. Note: the discrepancy has to be accepted by the majority of experts in the field; if not, go with the CB's choice.
● 4-5 (Good/Perfect): Your rating matches the contributor's.

[Rating] All Criteria
● 1-2 (Fail): [Trivial Task] One or more responses score >= 70% on Rubric Evaluation.
● 3 (Okay): 2+ responses have a Rubric Evaluation Score of 60-69%.
● 4-5 (Good/Perfect): All responses score less than 70%. Only 1 response may score 60-69% on Rubric Evaluation.
