Workspace Projects
📌 Here is a list of all the Workspace Projects you might encounter:
🤖 Side-by-Side API and Model Evaluation 🤖
The goal of this project is to evaluate a prompt, two API calls generated by the model to retrieve emails or files from the user’s workspace, and two model responses that search through the user’s emails or files to answer the prompt.
📄 Link to the specific project instructions HERE
🤖 Workspace SxS Eval Instructions 🤖
You will see a user prompt and two AI-generated responses, sometimes with additional context. Your task is to judge each response on several quality dimensions. Finally, you will decide which response is better based on your individual ratings of each response.
📄 Link to the specific project instructions HERE
🤖 Bardkick File Q&A 🤖
Your job in this project is to evaluate the performance of an AI chatbot across two distinct tasks: Corpus
Q&A and Chosen Files Q&A.
In Corpus Q&A, your role is to assess how well the AI synthesizes information from multiple documents to generate accurate, contextually relevant, and well-structured responses.
In Chosen Files Q&A, your job is to determine how effectively the AI retrieves and summarizes
relevant information from specific golden documents, focusing on the precision, relevance, and
quality of the output.
📄 Link to the specific project instructions HERE
🤖 Email Geese Instructions 🤖
In these projects, contributors analyze and evaluate prompts and responses related to emails. They will assess the relevance, diversity, and quality of replies to certain email threads, or judge the effectiveness of chatbot features like summarization, AMA (Ask Me Anything), and action items. Contributors will often compare different responses and summaries to determine which one is most accurate, informative, or helpful.
📄 Link to the specific project instructions HERE
🤖 Sheets Workspace Instructions 🤖
In these projects, contributors evaluate the output of AI models that generate formulas and calculations in response to user queries. The tasks may involve reviewing the generated formula, result, or summary provided by the side panel and determining its accuracy, relevance, and usefulness for a specific spreadsheet.
📄 Link to the specific project instructions HERE
🤖 Slides Geese Instructions 🤖
In these projects, you'll analyze and compare different types of content, including images, text, and
presentations. Some tasks involve generating new content based on a prompt or context, while others
require evaluating existing content against specific criteria. The goal is to assess the quality, relevance,
and effectiveness of the generated or provided content, using critical thinking and contextual
understanding to make informed judgments.
📄 Link to the specific project instructions HERE
📔 Overview of the Project
Your task is to evaluate each response based on various quality dimensions.
This project has different workstreams, each with quality dimensions tailored to its specific nature.
You will be presented with:
A user prompt
Possible context (e.g., active files or context sources)
Two model responses
📣 Important Note:
Before starting any work on a workstream, make sure to go through the "Review the Instructions" section at the beginning. This step is crucial for understanding the guidelines and requirements!
📌 Link to the instructions for all projects HERE
How to Get Started
You must access the reference files/documents for each task.
Ensure you are signed into your browser and Google account with the email used for
Remotasks/Outlier.
If you cannot access the reference files, stop tasking and request access from your QM
immediately.
NOTE: Attachments will usually appear in the response section of the task, but they may also appear in the prompt section.
Ground Rules:
All comments/justifications that you write must be in English, so that the project team can
understand them.
Evaluate responses based on specified task dimensions and descriptions ONLY.
Consider the user's perspective and what they are looking for before reviewing the answers.
Ensure final Side-by-Side ratings align with your individual response ratings. E.g., if Response A does better on groundedness, completeness, understanding, etc., it should also be given higher preference in the SxS rating.
How to Assess the Prompt
Is the prompt ratable❓
Yes, a prompt is ratable when:
1. The prompt is in the target language (your native language or English). 📣 Remember! English prompts are also OK! 📣
2. The prompt is understandable: you understand what the prompt is asking you to do. If the prompt is nonsense or incomplete, mark the prompt as not ratable.
3. Mid-conversation prompts are ratable.
4. For tasks that include email responses ONLY: a prompt can simply contain the content of the email that the user wants to send. For example: "I would like a refund within 2 weeks." Even though no explicit action is specified, it is clear that the user wants to generate an email with this content. A prompt like this is considered ratable.
How to Evaluate Individual Responses
Rate each Model Response individually based on the following:
Step 1: Rate based on the specified dimensions (these can vary across projects)
E.g.: Completeness, Groundedness, Understanding
Next to each option in a dimension, there is an information icon that you can hover over
to see its description.
Step 2: Provide an overall rating with justification
Make sure the overall rating is aligned with your prior ratings and justification. Explain why you chose specific dimensional ratings.
Important to note:
When a response makes a factual statement, confirm that the statement is true via outside
research.
Example: if the response claims that the capital of Colombia is Bogotá, Google this to confirm it
is true.
Example: if the response is summarizing claims about a document, verify the claims inside the
actual document.
First, rate each response on its own. Do not compare the two right away.
Dimensions to Evaluate the Response
IMPORTANT NOTE: SPECIFIC PROJECT INSTRUCTIONS TAKE PRECEDENCE OVER THIS COURSE WHEREVER THE TWO ARE NOT ALIGNED.
📍 Ratability
Responses must be in the target language: Even though the prompts can be in English, the
responses cannot be in English.
The responses and attachments must be understandable enough for you to assess whether they successfully answer the user's prompt.
No Personally Identifiable Information (PII): names and surnames are not considered PII.
Examples of PII are: Social Security number, physical address, email address, driver's license number, bank account number, passport number, date of birth.
Non-Harmful: The response should not be harmful.
Completeness Dimension
Completeness asks you to evaluate whether the response addresses all parts of the user’s request.
A response can be COMPLETE, PARTIALLY COMPLETE, or INCOMPLETE.
For example, if the prompt asks for 3 things, we must ensure that the response returns the 3 things that
were asked for.
Example Prompt:
Good Completeness Response:
This response answers all parts of the user’s question and adds no unnecessary info.
Understanding Dimension
This dimension assesses how well YOU understand the response.
For example, if the prompt asks for a summary of an attachment, the response must be well presented so that YOU can understand the information it provides about what the prompt asked for.
An understandable response should be easy to understand, well-written, coherently organized, and not
repetitive. It should use bullet points correctly and when necessary.
Rating options would be UNDERSTANDABLE, PARTIALLY UNDERSTANDABLE, or NOT UNDERSTANDABLE.
Example prompt:
Good Understanding Response:
Regardless of the truthfulness of the response, the information is perfectly understandable.
Truthfulness Dimension
What are Factual Claims?
Factual claims are information expected to be correct and verifiable through research. Focus on objective
claims that can be confirmed or refuted with specific information. Consider subjective claims accurate
unless strong evidence suggests otherwise.
Not all responses will have factual claims. If none are present or if the prompt asks for creative ideas,
mark this question as “N/A.”
⚠️ Verify claims by researching them with tools like Google!
Rating options would be NO ISSUES, MINOR ISSUES, or MAJOR ISSUES.
Prompt example:
Bad Truthfulness example:
Explanation: The United States never had a feudal system, so by looking this up on the internet we can verify that the claim is false.
Groundedness Dimension
Groundedness rates the accuracy of the response based on information WITHIN the context of the files/emails.
The difference between the Groundedness and Truthfulness Dimensions is:
Truthfulness is based on the information we can find on the internet.
Groundedness is based on the information we can find within the attachments.
Rating Options for Groundedness would be: COMPLETELY GROUNDED, REASONABLY GROUNDED, or NOT
GROUNDED
Prompt example:
Bad Groundedness example:
Document:
In this case, the model is wrong, as the document clearly states that Argentina's independence was in
1816.
GROUNDEDNESS vs COMPLETENESS
🤔 What is the difference between Groundedness and Completeness? When you do tasks, you may find them similar, but they have notable differences.
Groundedness
Groundedness asks you to evaluate, based on the context files, how accurately the response represents information from the sources that are provided. Look at the context files to determine how grounded the response is.
Completeness
Completeness asks you to evaluate whether the response addresses all parts of the user’s request, regardless of whether the information is true. Look at the response and assess whether it addressed ALL PARTS of the user's prompt.
Final LIKERT + JUSTIFICATION: How to Compare and Rank Responses Side-by-Side
In the final section of the task, you will compare the two responses on a scale of 1 to 7 to decide which response is better, and provide a justification.
🔴 The side-by-side justification should be 2-3 sentences and highlight the most important factors
influencing your preference.
🔴 A few guidelines for what a ‘better’ response looks like:
"Better" responses are grounded and truthful, meaning that the presented facts are based on the context provided and verifiable using external sources.
If multiple responses are similarly (in)correct, consider which response is most likely to be
helpful, meaning the response matches the prompt and provides a useful starting point for the
user.
⚠️ Please make sure your final rating is consistent with the ratings you gave for each response individually.
Example: You rated Response 1 'Okay' and Response 2 'Amazing'.
❌ You should not then rate Response 1 as better than Response 2.
If your response comparison rating does not match your individual ratings for each response, you will see a red 'Linter' error.
How to write good justifications
🥸 To write a good justification, it is very important to address the dimensions you rated for each response.
✍️ It is also important to explain what in the response is right or wrong, and why.
🦾 Finally, if necessary, note what would need to change to make the response perfect.
Good Justification Example: