Valkyrie - Project Specifications - V2
Project Specifications
Important Information
Project Update Note - 5/13
Example: “Provide 3 quotes from War & Peace related to life in Russia during Napoleon’s invasion”
Index
Tasking Overview
Craft a Prompt
🔧 Guidelines for Creating High-Complexity Prompts
🎯 Objective
🪜 How to Build One Step-by-Step
Definition of a professional prompt:
Prompt Domains
First Principles Thinking
Image Specifications
How to add an image
How to upload your image
Image Guidelines
Image Tips
Define + Weight Rubric Criteria
Weight
Mental Model:
Requirements
1. Outline Core Criteria
Example Prompt:
2. Add Additional Criteria
TOOL: Valkyrie Valhalla Rubrics Generation
Evaluate each Response
⚠️ CONFIRM YOUR EVALUATION ⚠️
Finalize Ranking
Overview
Writing Good Rubrics
General Principle
Atomic / Non-stacked
Specific
Self-contained
Classifying Rubrics
Objective vs Subjective
Explicit vs Implicit
Rubric Dimensional Ratings
Common Pitfalls and How to Avoid Them
🚫 Pitfall: Overly Complex or Ambiguous Criteria
🚫 Pitfall: Subjectivity Confusion
🚫 Pitfall: Too few or too many Criteria
🚫 Pitfall: Double Counting Criteria
✅ Good Examples
Prompt: Scientific Research
Rubric
Prompt: Medicine
Rubric
Prompt: Administration & Logistics
Rubric
🚫 Bad Examples
Linters
What is a Linter?
Example linters:
How to dismiss a linter
Weighting Criteria.
Scoring
Task Workflow
Tasking Overview
1. Read the tasking instructions carefully.
2. Create a Prompt that fits your Domain of expertise: a task
that you or a colleague may come across in your day-to-day
work, but that might require research and would take quite
some time to complete.
3. Establish the Core Criteria needed to respond to your
Prompt. These criteria should represent a bare minimum,
but not necessarily stellar, Response.
Craft a Prompt
🎯 Objective
Definition of a professional
prompt:
Realistic
1. Reflects genuine scenarios
professionals might
encounter in their field.
Challenging
1. Requires deep analysis,
reasoning, and/or data
synthesis - taking a few
hours for a knowledgeable
human to answer
thoroughly
3. Avoids analysis of
excessively large amounts
of information such as
entire books or data sets.
Subsets of large data and
chapters of books are
acceptable.
Clearly Defined
1. Clearly stated and specific
enough that other experts
would agree on what a
good answer looks like.
3. Amenable to developing
clear, objective evaluation
criteria.
Prompt Domains
The following are all the allowed domains for tasking in this
project:
● Accounting
● Agriculture
● Art
● Biochemistry
● Bioinformatics
● Biology
● Biomedical Science
● Chemistry
● Design
● Ecology
● Economics
● Electrical Engineering
● Engineering (Other)
● Entrepreneurship
● Genetics
● HR
● Humanities Research (Other)
● Immunology
● Law
● Literature
● Materials Science
● Mathematics
● Mechanical Engineering
● Media & Journalism
● Medicine
● Music
● Operations Management & Research
● Pharmacy
● Physics
● Psychology
● Public Health
● Quantitative Business & Finance
● Scientific Research (Other)
● Social Sciences Research (Other)
● Strategic & Managerial Business
● World History
🚫These prompts are out of the scope of this project and are
not allowed:
● Any code editing/generation. This also includes providing
code to the model for further analysis in any way (No rubric
item should be a code test)
● Questions that have a verifiable GTFA (Ground Truth Final
Answer)
Below the prompt box, you will find two different buttons.
To add a reference text
To attach an image
To correctly add your image, you will need to upload it and provide
the URL as seen in the following example.
If you are using your own image, go to your preferred image-hosting
website (like Imgur) and upload it there. After that, follow
the previous step.
Image Guidelines
When writing a prompt, don’t describe the image in so much detail that
the image itself becomes unnecessary. The image must be needed to
solve the prompt.
Image Tips
Also, after you add the whole prompt, make sure that all 3 models can
see the image and acknowledge it. It is not acceptable to fail a model
just because it punts or says that there is no image.
Weight
You will weigh each criterion on a -10 to +10 scale. Note: Criteria
cannot have a weight of “0”.
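As a rough illustration of the weight rule above, the check below accepts any number from -10 to +10 except 0. The function name `is_valid_weight` is hypothetical and is not part of any project tooling; treat it as a minimal sketch rather than the platform's actual validation.

```python
def is_valid_weight(weight: float) -> bool:
    """Return True if the weight lies in [-10, 10] and is not 0.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    return -10 <= weight <= 10 and weight != 0


# Example: weights at the boundaries are allowed, zero is rejected.
for candidate in (10, -10, 5, 0, 11):
    print(candidate, is_valid_weight(candidate))
```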
Mental Model:
In the example above, the last constraint does not test for
anything that isn’t already covered by the previous three
constraints, so it is redundant.
Atomic
Each rubric criterion should evaluate exactly one distinct aspect. Avoid
bundling multiple criteria into a single rubric. Most stacked criteria with
the word “and” can be broken up into multiple pieces.
❌ Response identifies George Washington as the first U.S. president and
mentions he served two terms.
✅ Response identifies George Washington as the first U.S. president.
✅ Response mentions that George Washington served two terms.
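The hint about the word “and” can be turned into a simple self-check before submitting a rubric. The sketch below is only a heuristic under that assumption: `looks_stacked` is a hypothetical helper, not one of the project's linters, and a flagged criterion still needs human judgment, since “and” can also appear inside a single idea.

```python
import re


def looks_stacked(criterion: str) -> bool:
    """Heuristic: flag a criterion that contains the standalone word 'and'.

    Hypothetical helper for illustration; a match only suggests that the
    criterion may bundle more than one aspect.
    """
    return re.search(r"\band\b", criterion, flags=re.IGNORECASE) is not None


criteria = [
    "Response identifies George Washington as the first U.S. president "
    "and mentions he served two terms.",
    "Response identifies George Washington as the first U.S. president.",
    "Response mentions that George Washington served two terms.",
]
for text in criteria:
    print(looks_stacked(text), "-", text)
```

Only the first criterion is flagged, which matches the ❌ example; the two split criteria pass.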
Self-contained
Each criterion should contain all the information needed to evaluate a
response and it should be verifiable without requiring an external search.
❌ Mentions the capital city of Canada.
Why is it wrong? It doesn’t contain the answer which is Ottawa.
✅ Mentions the capital city of Canada is Ottawa.
⚠️This self-contained condition means that you must provide examples or
final answer of what the criterion is evaluating. Ideally, most, if not all,
of the criteria should be supported with examples or final answers.
❌ Response names any of the Nobel Prize winners in Physics in 2023
Why is it wrong? It doesn’t contain the examples of possible Nobel
Laureates.
✅ Response names any of the following Nobel Prize winners in Physics in
2023: Pierre Agostini, Ferenc Krausz, or Anne L’Huillier.
Exclusion criteria like “The response must not mention Michael Jackson
songs” don’t need examples since adding those doesn’t provide any new
insight.
Your initial Rubric draft will define exactly what is critical for a
response to be considered minimally viable - meaning it meets
the essential requirements to answer your prompt sufficiently.
Avoid criteria representing enhancements ("nice-to-haves") or
negative criteria pointing out flaws or mistakes at this stage.
Instead, focus exclusively on essential content or components
without which the response would be incomplete or unacceptable.
These Core Criteria will likely have the highest Weight values, as
they are the most critical to creating a perfect Response.
Example Prompt:
Enhancements
Negatives
🚫Bad Example:
[+5] Notes that vasopressin is no longer recommended
as a substitute for epinephrine, though it may be
considered in combination with epinephrine (Class 2b)
[-5] Does not note that vasopressin is no longer
recommended as a substitute for epinephrine, though it
may be considered in combination with epinephrine
(Class 2b)
✅Instead:
[+5] Notes that vasopressin is no longer recommended
as a substitute for epinephrine, though it may be
considered in combination with epinephrine (Class 2b).
[-5] Claims vasopressin alone should be used instead of
epinephrine.
The team developed a new application to help you with your rubric
creation. You can find it at:
https://valkyrie-valhalla.vercel.app/
Note that:
Finalize Ranking
Finally, revisit your initial Rankings and confirm your rubrics align.
The final ranking produced from your evaluation does not
necessarily need to match your initial impressions; however, it’s
best to double-check and make sure your rubric is exhaustive:
Key Questions:
Overview
Rubrics provide an objective structure to evaluate what makes a
good response to a given Prompt.
The rubric is a list of specific, simple, and declarative criteria that
can be evaluated as true (present) or false (not present) for
any given response to the prompt. Each criterion has a weight
that reflects how critical it is to fulfilling the prompt. The
criteria, when taken as a whole, should be able to evaluate any
possible response and rank different responses.
⚠️ If all responses in a task have the same criteria marked as "True" and
the same ones marked as "False" (or nearly identical), it is likely that the
task is not sufficiently good.
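To make the true/false evaluation and the warning above concrete, here is a minimal sketch. It assumes a response's score is the sum of the weights of the criteria marked true; this document does not state the exact scoring formula, so that rule, the function names, and the sample weights are all hypothetical.

```python
from typing import Dict, List


def rubric_score(weights: Dict[str, float], marks: Dict[str, bool]) -> float:
    """Sum the weights of all criteria marked True (assumed scoring rule)."""
    return sum(w for name, w in weights.items() if marks.get(name, False))


def marks_are_identical(all_marks: List[Dict[str, bool]]) -> bool:
    """Return True when every response has the same True/False pattern."""
    return all(m == all_marks[0] for m in all_marks[1:])


# Hypothetical rubric: two positive criteria and one negative criterion.
weights = {"C1": 7, "C2": 3, "C3": -5}
response_a = {"C1": True, "C2": True, "C3": False}
response_b = {"C1": True, "C2": False, "C3": True}

print(rubric_score(weights, response_a))              # 10
print(rubric_score(weights, response_b))              # 2
print(marks_are_identical([response_a, response_b]))  # False
```

If `marks_are_identical` returns True for all responses in a task, the rubric is not separating them, which is the situation the warning describes.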
General Principle
● Diversity
○ The rubric items should include varied types of
information. If all criteria are like “the response
mentions A”, “the response mentions B”, then this
is not a good rubric.
● How many rubric items for each prompt
Atomic / Non-stacked
Specific
Classifying Rubrics
Objective vs Subjective
Explicit vs Implicit
Rubric
# Criteria Category Objectivity Explicity Weight Weight (Normalized)
16 The response suggests carrying Narcan as a precaution while on the study. -5
17 The response suggests hiring additional staff in order to meet weekly appointment goals. -3
Why is it good?
The core criteria (C1, C3, C4, C5, C11) have the most weight, while
the rest of the “nice to have” criteria have less weight. A good
rubric assigns logical weights based on the importance of the
criteria.
Prompt: Medicine
Rubric
# Criteria Category Objectivity Explicity Weight Weight (Normalized)
🚫 Bad Examples
Technical Issues
Linters
What is a Linter?
Linters are automatic checks that run when a step is saved, like a
prompt or rubric, to help catch issues and keep things consistent.
Example linters:
Warning Linter
Information Linter
This workflow will allow you to save changes you have made
and proceed with the next step in the task.
Weighting Criteria.
rating of 9.2.
request (you can validate it here: Top 250 movies), but as a user,
we know that a good response might also give you the rating of
● C2: The response should provide the IMDb rating for the
For this trivial example, and considering only these two criteria, a
C1 - 70%
C2 - 30%
In which case,
picture of. Your goal is to pick out all the features that make up an
ideal response:
weights all in the first try. What ends up happening is that once
you read the model responses and compare them, you'll find
a few percentage points. On that note, if you find anything that one
Avoid:
❌ Changing the weights just to get the scores you want; only
adjust them based on how important each criterion actually is. If
the project.
Scoring
Criteria | 1-2 (Fail) | 3 (Okay) | 4-5 (Good/Perfect) | Additional Notes

[Prompt] Clarity and Specificity
1-2 (Fail):
- [Major Clarity Issues] The prompt contains significant spelling/grammar errors that render it unintelligible.
- The prompt has major ambiguities which render the problem ill-defined.
3 (Okay):
- [Minor Clarity / Specificity Issues] It's mostly clear what is being asked but the request could reasonably be interpreted multiple ways
- Most experts when presented with the prompt’s ambiguity would make the same assumption to render it well-defined
- Prompt has one or two spelling/grammar errors which have a minor impact on clarity
4-5 (Good/Perfect):
- There is little to no room for misinterpretation of the specific request
- Prompt has a specific request that doesn't require more than one minor assumption to answer it.

[Prompt] Feasibility
1-2 (Fail):
- [Major Feasibility Issues] Prompt gives conflicting/contradicting instructions that can't be fulfilled simultaneously (unless specifically instructed to do so)
- [Temporal Prompt] The prompt contains a temporal question. It uses time-relative terms (e.g., "latest") where the response depends on when the question was asked or asks for information
- [Harmful Content] The prompt contains any harmful content
Note: it is okay for a prompt to be “tricky” or misleading as long as it is not explicitly self-contradictory (e.g. asking the model “what was the name of the 53rd state to join the US?”)
3 (Okay):
- [Minor Feasibility Issues] Prompt's request is verging on being impractical and the LLM won't be able to completely fulfill everything asked in the prompt, but the prompt is still answerable with concessions
4-5 (Good/Perfect):
- The prompt is completely actionable by an LLM or chatbot
- The prompt contains no conflicting instructions/statements
- The prompt contains no harmful content
Additional Notes:
Temporal example: Hey! I have been pretty disconnected from the entire presidential election. I'm going to do some research on it. First, tell me who the candidates this year are. Then, tell me what some of the latest polls have said on who will win?
“Harmful Content” for this project is simply described as “unsafe content.” This could include:
1. Content harms - unsafe text (bigotry, conspiracy theories)
2. Facilitations harms - text that enables unsafe behavior (how to make a bomb)

[Prompt] Contrivedness
1-2 (Fail):
- [Contrivedness/Unnatural] The prompt and its constraints are overly restrictive, contrived, unrealistic, or does not reflect something a real user might plausibly ask of a model
3 (Okay):
- [Minor Contrivance] Prompt contains one or two constraints that are a little contrived, but prompt as a whole is reasonable and clear. It may be a little confusing, but the model is able to create reasonable and consistent responses.
4-5 (Good/Perfect):
- Prompt is reasonable and natural. It may contain constraints or multiple questions, and is clear in its ask to the user. It’s not overly generic

[Rubric] Rubric Criteria
1-2 (Fail):
- [Major Contradiction] If at least one criterion contradicts what is required in prompt
- [Counterproductive Criteria] At least one criterion is objectively incorrect, or it makes the response worse when true.
- At least one rubric question is framed in such a way that your answer to it while evaluating an ideal response would be “fail”
- [Missing Criteria] Criteria that are objectively 100% essential to meet the fundamental requirements of the prompt are missing, preventing the response from successfully fulfilling the core task if not included
- [Closed Ended Prompt] If the prompt is close ended (has a short GTFA) but does not provide the answer.
Example
Bad Rubric Criteria: "The response should identify the first U.S. secretary of defense"
Fixed Rubric Criteria: "The response should state that the first U.S. secretary of defense was James V. Forrestal"
- [Very Vague Criteria] The criteria is overly vague such that it is unclear if one could determine whether it was fulfilled
- [Atomicity] There are multiple unrelated prompt requests combined into one rubric criteria
3 (Okay):
- [Irrelevant Criteria] If criteria being defined is only partially connected to the prompt, and/or having it defined does not make the response objectively better.
- [Broad but Ratable Criteria] The criteria is broad but most people would be able to determine if it was met (e.g., “the response should be humorous”)
4-5 (Good/Perfect):
- The rubric covers the core instruction
- The rubric does not contain objective inaccuracies
- The rubric covers all prompt constraints
- The rubric is specific and relevant
- If needed the rubric provides direct answers
- The rubric provides binary criteria
- The rubric covers all the necessary criteria to create a perfect response.