Workspace Projects
📌 Here is a list of all the Workspace Projects you might encounter:
🤖 Side-by-Side API and Model Evaluation 🤖
The goal of this project is to evaluate a prompt, two API calls generated by the model to retrieve emails or files from the user’s workspace, and two model responses that search through the user’s emails or files to answer the prompt.
📄 Link to the specific project instructions HERE
🤖 Workspace SxS Eval Instructions 🤖
You will see a user prompt and two AI-generated responses, sometimes with additional context. Your task is to judge each response on several quality dimensions. Finally, you will decide which response is better based on your individual ratings of each response.
📄 Link to the specific project instructions HERE
🤖 Bardkick File Q&A 🤖
Your job in this project is to evaluate the performance of an AI chatbot across two distinct tasks: Corpus
Q&A and Chosen Files Q&A.
In Corpus Q&A, your role is to assess how well the AI synthesizes information from multiple documents to generate accurate, contextually relevant, and well-structured responses.
In Chosen Files Q&A, your job is to determine how effectively the AI retrieves and summarizes
relevant information from specific golden documents, focusing on the precision, relevance, and
quality of the output.
📄 Link to the specific project instructions HERE
🤖 Email Geese Instructions 🤖
In these projects, contributors analyze and evaluate prompts and responses related to emails. They will assess the relevance, diversity, and quality of replies to certain email threads, or judge the effectiveness of chatbot features like summarization, AMA (Ask Me Anything), and action items. Contributors will often compare different responses and summaries to determine which one is most accurate, informative, or helpful.
📄 Link to the specific project instructions HERE
🤖 Sheets Workspace Instructions 🤖
In these projects, contributors evaluate the output of AI models that generate formulas and calculations in response to user queries. The tasks may involve reviewing the generated formula, result, or summary provided by the side panel and determining its accuracy, relevance, and usefulness for a specific spreadsheet.
📄 Link to the specific project instructions HERE
🤖 Slides Geese Instructions 🤖
In these projects, you'll analyze and compare different types of content, including images, text, and
presentations. Some tasks involve generating new content based on a prompt or context, while others
require evaluating existing content against specific criteria. The goal is to assess the quality, relevance,
and effectiveness of the generated or provided content, using critical thinking and contextual
understanding to make informed judgments.
📄 Link to the specific project instructions HERE
📔 Overview of the Project
Your task is to evaluate each response based on various quality dimensions.
This project has different workstreams, each with quality dimensions tailored to its specific nature.
You will be presented with:
A user prompt
Possible context (e.g., active files or context sources)
Two model responses
📣 Important Note:
Before starting any work on a workstream, make sure to go through the "Review the Instructions" section at the beginning. This step is crucial for understanding the guidelines and requirements!
📌 Link to the instructions for all projects HERE
How to Get Started
You must access the reference files/documents for each task.
Ensure you are signed into your browser and Google account with the email used for
Remotasks/Outlier.
If you cannot access the reference files, stop tasking and request access from your QM
immediately.
NOTE: Attachments will usually appear in the response section of the task, but they may also appear in the prompt section.
Ground Rules:
All comments/justifications that you write must be in English, so that the project team can
understand them.
Evaluate responses based on specified task dimensions and descriptions ONLY.
Consider the user's perspective and what they are looking for before reviewing the answers.
Ensure final Side-by-Side ratings align with your individual response ratings. E.g., if Response A does better on groundedness, completeness, understanding, etc., it should also be given higher preference in the SxS rating.
How to Assess the Prompt
Is the prompt ratable❓
Yes, a prompt is ratable when:
1. The prompt is in the target language (your native language or English). 📣 Remember! English prompts are also OK! 📣
2. The prompt is understandable: you understand what the prompt is asking you to do. If the prompt is nonsense or incomplete, mark the prompt as not ratable.
3. Mid-conversation prompts are ratable.
4. For tasks that include email responses ONLY: a prompt can simply contain the content of the email that the user wants to send. For example: "I would like a refund within 2 weeks." Even though no explicit action is specified, it is clear that the user wants to generate an email with this content. A prompt like this is considered ratable.
How to Evaluate Individual Responses
Rate each Model Response individually based on the following:
Step 1: Rate based on the specified dimensions (these can vary across projects)
E.g.: Completeness, Groundedness, Understanding
Next to each option in a dimension, there is an information icon that you can hover over
to see its description.
Step 2: Provide an overall rating with justification
Make sure the overall rating is aligned with your prior ratings and justification. Explain why you chose specific dimensional ratings.
Important to note:
When a response makes a factual statement, confirm that the statement is true via outside
research.
Example: if the response claims that the capital of Colombia is Bogotá, Google this to confirm it
is true.
Example: if the response is summarizing claims about a document, verify the claims inside the
actual document.
First, rate each response on its own. Do not compare the two right away.
Dimensions to Evaluate the Response
IMPORTANT NOTE: SPECIFIC PROJECT INSTRUCTIONS TAKE PRECEDENCE OVER THIS COURSE WHEREVER THE TWO ARE NOT ALIGNED.
📍 Ratability
Responses must be in the target language: Even though the prompts can be in English, the
responses cannot be in English.
The responses and attachments must be understandable enough for you to assess whether they successfully answer the user's prompt.
No Personally Identifiable Information (PII): names and surnames are not considered PII.
Examples of PII are: Social Security number, physical address, email address, driver's license number, bank account number, passport number, date of birth.
Non-Harmful: The response should not be harmful.
Completeness Dimension
Completeness asks you to evaluate whether the response addresses all parts of the user’s request.
A response can be COMPLETE, PARTIALLY COMPLETE, or INCOMPLETE.
For example, if the prompt asks for 3 things, we must ensure that the response returns the 3 things that
were asked for.
Example Prompt:
Good Completeness Response:
This response answers all parts of the user’s question and adds no unnecessary info.
Understanding Dimension
This dimension assesses how well YOU understand the response.
For example, if the prompt asks for a summary of an attachment, the response must be well presented so that YOU can understand the information it provides about what the prompt asked for.
An understandable response should be easy to understand, well-written, coherently organized, and not
repetitive. It should use bullet points correctly and when necessary.
Rating options would be UNDERSTANDABLE, PARTIALLY UNDERSTANDABLE, or NOT UNDERSTANDABLE.
Example prompt:
Good Understanding Response:
Regardless of the truthfulness of the response, the information is perfectly understandable.
Truthfulness Dimension
What are Factual Claims?
Factual claims are information expected to be correct and verifiable through research. Focus on objective
claims that can be confirmed or refuted with specific information. Consider subjective claims accurate
unless strong evidence suggests otherwise.
Not all responses will have factual claims. If none are present or if the prompt asks for creative ideas,
mark this question as “N/A.”
⚠️ Verify claims by researching them with tools like Google!
Rating options would be NO ISSUES, MINOR ISSUES, or MAJOR ISSUES.
Prompt example:
Bad Truthfulness example:
Explanation: The United States never had a feudal system, so by looking this up on the internet we can verify that the claim is false.
Groundedness Dimension
Groundedness rates the accuracy of the response based on information WITHIN the context of the files/emails.
The difference between the Groundedness and Truthfulness Dimensions is:
Truthfulness is based on the information we can find on the internet.
Groundedness is based on the information we can find within the attachments.
Rating Options for Groundedness would be: COMPLETELY GROUNDED, REASONABLY GROUNDED, or NOT
GROUNDED
Prompt example:
Bad Groundedness example:
Document:
In this case, the model is wrong, as the document clearly states that Argentina's independence was in
1816.
GROUNDEDNESS vs COMPLETENESS
🤔 What is the difference between Groundedness and Completeness? When you do tasks, you may find them similar, but they have notable differences.
Groundedness
Groundedness asks you to evaluate, based on the context files, how accurately the response represents information from the sources that are provided. Look at the context files to determine how grounded the response is.
Completeness
Completeness asks you to evaluate whether the response addresses all parts of the user’s request, regardless of whether the information is true. Look at the response and assess whether it addressed ALL PARTS of the user's prompt.
Final LIKERT + JUSTIFICATION: How to Compare and Rank Responses Side-by-Side
In the final section of the task, you will compare the two responses on a scale of 1 to 7 to decide which response is better, and provide a justification.
🔴 The side-by-side justification should be 2-3 sentences and highlight the most important factors
influencing your preference.
🔴 A few guidelines for what a ‘better’ response looks like:
"Better" responses are grounded and truthful, meaning that the presented facts are based on the context provided and verifiable using external sources.
If multiple responses are similarly (in)correct, consider which response is most likely to be
helpful, meaning the response matches the prompt and provides a useful starting point for the
user.
⚠️ Please make sure your final rating is consistent with the ratings you gave for each response individually.
Example: You rated Response 1 'Okay' and Response 2 'Amazing'.
❌ You should not then rate Response 1 as better than Response 2.
If your response comparison rating does not match your individual ratings for each response, you will see a red 'Linter' error.
How to write good justifications
🥸 To write a good justification, it is very important to address the dimensions you rated for each response.
✍️ It is also important to explain what in the response is right or wrong, and why.
🦾 Finally, if necessary, note what would need to change to make the response perfect.
Good Justification Example: