
Accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. DOI 10.1109/TPAMI.2019.2927476.

Recipe1M+: A Dataset for Learning


Cross-Modal Embeddings for Cooking Recipes
and Food Images
Javier Marín1*, Aritro Biswas1*, Ferda Ofli2, Nicholas Hynes1, Amaia Salvador3, Yusuf Aytar1, Ingmar Weber2, Antonio Torralba1
1 Massachusetts Institute of Technology 2 Qatar Computing Research Institute, HBKU
3 Universitat Politècnica de Catalunya
{abiswas,nhynes}@mit.edu, {jmarin,yusuf,torralba}@csail.mit.edu, amaia.salvador@upc.edu, {fofli,iweber}@hbku.edu.qa

Abstract—In this paper, we introduce Recipe1M+, a new large-scale, structured corpus of over one million cooking recipes and 13
million food images. As the largest publicly available collection of recipe data, Recipe1M+ affords the ability to train high-capacity models
on aligned, multimodal data. Using these data, we train a neural network to learn a joint embedding of recipes and images that yields
impressive results on an image-recipe retrieval task. Moreover, we demonstrate that regularization via the addition of a high-level
classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic. We postulate
that these embeddings will provide a basis for further exploration of the Recipe1M+ dataset and food and cooking in general. Code, data
and models are publicly available.

Index Terms—Cross-modal, deep learning, cooking recipes, food images

1 INTRODUCTION

There are few things so fundamental to the human experience as food. Its consumption is intricately linked to our health,
our feelings and our culture. Even migrants starting a new life in a
foreign country often hold on to their ethnic food longer than to
their native language. Vital as it is to our lives, food also offers new
perspectives on topical challenges in computer vision like finding
representations that are robust to occlusion and deformation (as
occur during ingredient processing).
The profusion of online recipe collections with user-submitted
photos presents the possibility of training machines to automatically
understand food preparation by jointly analyzing ingredient lists,
cooking instructions and food images. Far beyond applications
solely in the realm of culinary arts, such a tool may also be applied
to the plethora of food images shared on social media to achieve insight into the significance of food and its preparation on public health [1] and cultural heritage [2]. Developing a tool for automated analysis requires large and well-curated datasets.
The emergence of massive labeled datasets [3], [4] and deeply-learned representations [5], [6], [7] have redefined the state-of-the-art in object recognition and scene classification. Moreover, the same techniques have enabled progress in new domains like dense labeling and image segmentation. Perhaps the introduction of a new large-scale food dataset, complete with its own intrinsic challenges, will yield a similar advancement of the field. For instance, categorizing an ingredient's state (e.g., sliced, diced, raw, baked, grilled, or boiled) provides a unique challenge in attribute recognition, one that is not well posed by existing datasets. Furthermore, the free-form nature of food suggests a departure from the concrete task of classification in favor of a more nuanced objective that integrates variation in a recipe's structure. Hence, we argue that food images must be analyzed together with accompanying recipe ingredients and instructions in order to acquire a comprehensive understanding of the "behind-the-scenes" cooking process, as illustrated in Fig. 1.
Existing work, however, has focused largely on the use of medium-scale image datasets for performing food categorization. For instance, Bossard et al. [8] introduced the Food-101 visual classification dataset and set a baseline of 50.8% accuracy. Even with the impetus for food image categorization, subsequent work by [9], [10] and [11] could only improve this result to 77.4%, 79% and 80.9%, respectively, which indicates that the size of the dataset may be the limiting factor. Although Myers et al. [10] built upon Food-101 to tackle the novel challenge of estimating a meal's energy content, the segmentation and depth information used in their work are not made available for further exploration.

Fig. 1. Learning cross-modal embeddings from recipe-image pairs collected from online resources. These embeddings enable us to achieve in-depth understanding of food from its ingredients to its preparation.

*Contributed equally.


In this work, we address data limitations by introducing the large-scale Recipe1M+ dataset, which contains one million structured cooking recipes and their images. Additionally, to demonstrate its utility, we present the im2recipe retrieval task, which leverages the full dataset (images and text) to solve the practical and socially relevant problem of demystifying the creation of a dish that can be seen but not necessarily described. To this end, we have developed a multimodal neural model which jointly learns to embed images and recipes in a common space which is semantically regularized by the addition of a high-level classification task. The performance of the resulting embeddings is thoroughly evaluated against baselines and humans, showing remarkable improvement over the former while faring comparably to the latter. With the release of Recipe1M+, we hope to spur advancement on not only the im2recipe task but also heretofore unimagined objectives which require a deep understanding of the domain and its modalities.

1.1 Related Work
Since we presented our initial work on the topic back in 2017 [12], several related studies have been published and we feel obliged to provide a brief discussion of them.
Herranz et al. [13], besides providing a detailed description of recent work focusing on food applications, propose an extended multimodal framework that relies on food imagery, recipe and nutritional information, geolocation and time, restaurant menus and food styles. In another study, Min et al. [14] present a multi-attribute theme modeling (MATM) approach that incorporates food attributes such as cuisine style, course type, flavors or ingredient types. Then, similar to our work, they train a multimodal embedding which learns a common space between the different food attributes and the corresponding food image. The most interesting applications of their model include flavor analysis, region-oriented food summaries, and recipe recommendation. In order to build their model, they collect all their data from a single data source, Yummly, which is an online recipe recommendation system.
In another interesting study, Chang et al. [15] focus on analyzing several possible preparations of a single dish, like "chocolate chip cookie." The authors design an interface that allows users to explore the similarities and differences between such recipes by visualizing the structural similarity between recipes as points in a space, in which clusters are formed according to how similar the recipes are. Furthermore, they examine how cooking instructions overlap between two recipes to measure recipe similarity. Our work is of a different flavor, as the features they use to measure similarity are manually picked by humans, while ours are automatically learned by a multimodal network.
Getting closer to the information retrieval domain, Engilberge et al. [16] examine the problem of retrieving the best matching caption for an image. In order to do so, they use neural networks to create embeddings for each caption, and retrieve the one whose embedding most closely matches the embedding of the original image. In our work, we also aim to use embeddings to retrieve the recipe matching an image, or vice versa. However, since our domain involves cooking recipes while theirs only involves captions, we account for two separate types of text, ingredients and cooking instructions, and combine them in a different way in our model.
Alternatively, Chen et al. [17] study the task of retrieving a recipe matching a corresponding food image in a slightly different way. The authors find that, although ingredient composition is important to the appearance of food, other attributes such as the manner of cutting and the manner of cooking ingredients also play a role in forming the food's appearance. Given a food image, they attempt to predict ingredient, cutting and cooking attributes, and use these predictions to help retrieve the correct corresponding recipe. With our model, we attempt to retrieve the recipe directly, without first predicting attributes such as ingredients, cutting and cooking separately. Furthermore, along with retrieving the recipe matching an image, our model also allows retrieving the image matching a corresponding recipe.
The two most relevant studies to the current one are presented in [18] and [19]. Different from our work, Chen et al. [18] approach the image-to-recipe retrieval problem from the perspective of attention modeling, where they incorporate word-level and sentence-level attentions into their recipe representation and align them with the corresponding image representation such that both text and visual features have high similarity in a multi-dimensional space. Another difference is that they employ a rank loss instead of a pairwise similarity loss as we do. These improvements effectively lead to slight performance increases in both image-to-recipe and recipe-to-image retrieval tasks.
On the other hand, building upon the same network architecture as in our original work [12] to represent the image and text (recipe) modalities, Carvalho et al. [19] improve our initial results further by proposing a new objective function that combines retrieval and classification tasks in a double-triplet learning scheme. This new scheme captures both instance-based (i.e., fine-grained) and semantic-based (i.e., high-level) structure simultaneously in the latent space, since the semantic information is directly injected into the cross-modal metric learning problem, as opposed to our use of a classification task as semantic regularization. Additionally, they follow an adaptive training strategy to account for the vanishing gradient problem of the triplet losses and use the MedR score instead of the original loss in the validation phase for early stopping. We also find that using the MedR score as the performance measure in the validation phase is more stable. However, our work is orthogonal to both of these studies, i.e., their performance can be further improved with the use of our expanded dataset, and the quality of their embeddings can be further explored with the various arithmetics presented in this submission.
The rest of the paper is organized as follows. In Section 2, we introduce our large-scale, multimodal cooking recipe dataset and provide details about its collection process. We describe our recipe and image representations in Section 3 and present our neural joint embedding model in Section 4. Then, in Section 5, we discuss our semantic regularization approach to enhance our joint embedding model. In Section 6, we present results from our various experiments and conclude the paper in Section 7.

1. https://www.yummly.com/

2 DATASET
Due to their complexity, both textual and visual (e.g., ingredient-based variants of the same dish, different presentations, or multiple ways of cooking a recipe), understanding food recipes demands a large, general collection of recipe data. Hence, it should not be surprising that the lack of a larger body of work on the topic could be the result of missing such a collection. To our knowledge, practically all the datasets publicly available in the research field either contain only categorized images [8], [10], [20], [21] or simply recipe text [22]. Only recently have a few datasets been released that include both recipes and images. For instance, Wang


et al. [23] released a multimodal food dataset which has 101k images divided equally among 101 food categories; the recipes for each are, however, raw HTML. In a later work, Chen and Ngo [24] presented a dataset containing 110,241 images annotated with 353 ingredient labels and 65,284 recipes, each with a brief introduction, ingredient list, and preparation instructions. Of note is that the dataset only contains recipes for Chinese cuisine.
Although the aforementioned datasets constitute a large step towards learning richer recipe representations, they are still limited in either generality or size. As the ability to learn effective representations is largely a function of the quantity (especially when learning features using deep architectures) and quality of the available data, we create and publicly release a new, large-scale corpus of structured recipe data that includes over 1M recipes and 13M images. In comparison to the current largest datasets in this domain, Recipe1M+ includes twice as many recipes as [22] and 130 times as many images as [24].
We created the Recipe1M+ dataset in two phases. In the first phase, we collected a large dataset of cooking recipes paired with food images, all scraped from a number of popular cooking websites, which resulted in more than 1M cooking recipes and 800K food images (i.e., Recipe1M [12]). Then, in the second phase, we augmented each recipe in this initial collection with food images downloaded from the Web using a popular image search engine, which amounted to over 13M food images after cleaning and removing exact and near duplicates. In the following subsections, we elaborate further on these data collection phases, outline how the dataset is organized, and provide an analysis of its contents.

Fig. 2. Google image search results. The query used is chicken wings.

TABLE 1
Dataset sizes. Number of recipes and images in the training, validation and test sets of each dataset.

Partition  | Recipe1M: # Recipes  # Images | Intersection: # Images | Recipe1M+: # Images
Training   | 720,639    619,508            | 493,339                | 9,727,961
Validation | 155,036    133,860            | 107,708                | 1,918,890
Test       | 154,045    134,338            | 115,373                | 2,088,828
Total      | 1,029,720  887,706            | 716,480                | 13,735,679

2.1 Data Collection from Recipe Websites
The recipes were scraped from over two dozen popular cooking websites and processed through a pipeline that extracted relevant text from the raw HTML, downloaded linked images, and assembled the data into a compact JSON schema in which each datum was uniquely identified. As part of the extraction process, excessive whitespace, HTML entities, and non-ASCII characters were removed from the recipe text. Finally, after removing duplicates and near-matches (constituting roughly 2% of the original data), the retained dataset contained over 1M cooking recipes and 800K food images (i.e., Recipe1M [12]). Although the resulting dataset is already larger than any other dataset in this particular domain (i.e., it includes twice as many recipes as [22] and eight times as many images as [24]), the total number of images is not yet at the same scale as the largest publicly available datasets in the computer vision community, such as ImageNet [3] and Places [25], which contain tens of millions of images. Therefore, in the next phase, we aimed to extend the initial collection of images by querying for food images through an image search engine.

2.2 Data Extension using an Image Search Engine
Thanks to the latest technological infrastructure advances, half the population of the entire world has become Internet users. Online services ranging from social networks to simple websites have grown into data containers where users share images, videos, or documents. Companies like Google, Yahoo, and Microsoft, among others, offer public search engines that go through the entire Internet looking for websites, videos, images and any other type of content that matches a text query (some of them also support image queries). Looking at the search results for a given recipe title (e.g., "chicken wings") in Fig. 2, one can say that the retrieved images are generally of very good quality. We also observed during the first phase of data collection from recipe websites that users often reused images from other recipes of the same dish (sometimes with slight differences) to visually describe theirs. Motivated by these insights, we downloaded a large number of images using as queries the recipe titles collected from the recipe websites in the first phase.
Data Download. We targeted collecting 50M images, i.e., 50 images per recipe in the initial collection. In order to amass such a quantity of images, we chose the Google search engine. As mentioned before, we used the title of each recipe as a query. Out of the Google search results, we selected the top 50 retrieved images and stored their image URLs locally. For this task, we used publicly available Python libraries on ten servers in parallel for several days. Then, to download images simultaneously, we made use of Aria2, a publicly available download utility. In the end, we managed to download over 47M images, as some of the image URLs were either corrupted or no longer existed.
Data Consolidation. One of the first tasks, besides removing corrupted or wrong-format images, was eliminating duplicate images. For this task, we simply used a pre-trained ResNet-18 [7] as a feature extractor (by removing its last classification layer) and computed pairwise Euclidean distances between the collected images. During this cleansing process, we combined the initial set of images collected from recipe websites and the new ones collected via Google image search.

2. https://www.internetworldstats.com/stats.htm
3. https://aria2.github.io/


After this first stage, we removed over 32M duplicate images (those with a Euclidean distance of 0). We only kept one representative of each duplicate cluster. Later, we visually inspected the remaining images and realized that a significant number of them were still either duplicates or near-duplicates. The main reason we could not detect some of these duplicates in the first stage was compression or rescaling operations applied to the images, which cause slight modifications to their feature representation. By using the distances between them, and removing those that were close enough, we managed to eliminate these duplicates. Near-duplicates, instead, were due to distortions (i.e., aspect-ratio changes), crops, text added onto the image, and other alterations. To remove near-duplicates, after trying different strategies, we chose a harsh distance threshold between images, which meant we had to eliminate a certain number of good examples as well. This strategy was used between different partitions (i.e., training, test and validation). That is, we allowed near-duplicates within a partition to a certain extent (using a relaxed threshold). Additionally, we ran a face detector over the images and removed those that contained a face with high confidence. Thanks to computing distances, we also found non-recipe images, such as images with nutritional facts; images containing only text were close to each other within the feature space. In order to compute the distances between images, we used C++ rather than Python for efficiency purposes.
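The feature-based duplicate removal described above can be sketched as follows. This is a minimal illustration, assuming torchvision's pre-trained ResNet-18 with its classification layer removed; the distance thresholds are placeholders rather than the values used for Recipe1M+.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# ResNet-18 as a feature extractor: drop the final classification layer.
resnet = models.resnet18(pretrained=True)
resnet.fc = nn.Identity()
resnet.eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(image_paths, batch_size=64):
    """Return an (N, 512) tensor of ResNet-18 features for the given images."""
    feats = []
    for i in range(0, len(image_paths), batch_size):
        batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                             for p in image_paths[i:i + batch_size]]).to(device)
        feats.append(resnet(batch).cpu())
    return torch.cat(feats)

def find_duplicates(features, exact_thresh=1e-6, near_thresh=5.0):
    """Pairwise Euclidean distances: ~0 marks exact duplicates, a harsher
    (small) threshold marks near-duplicates. Thresholds are illustrative."""
    dists = torch.cdist(features, features)
    dup_pairs, near_pairs = [], []
    n = features.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            d = dists[i, j].item()
            if d <= exact_thresh:
                dup_pairs.append((i, j))
            elif d <= near_thresh:
                near_pairs.append((i, j))
    return dup_pairs, near_pairs
```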
Regarding the recipes sharing the same title, we uniformly distributed the queried images for a particular non-unique title among the recipes sharing it. This helped us to avoid having different recipes with the exact same food images. In the last two paragraphs of Section 2.5, we describe an experiment performed by humans that supports the validity of spreading them uniformly.
In order to re-balance the dataset in terms of partitions, we slightly modified the images belonging to each partition. For a fair comparison between Recipe1M and Recipe1M+ in our experiments, we created an intersection version of the initial dataset, which simply contains the images that are common to both of them. One would expect the Recipe1M images to be a subset of the Recipe1M+ images, but due to the re-balancing and the cleansing of near-duplicates, which were not done in the original Recipe1M dataset, this is no longer true. Table 1 shows the small differences in numbers.

2.3 Nutritional Information
The ingredient lists in the recipes scraped from the recipe websites often include the ingredient, quantity and unit information together in a single sentence. In order to simplify the task of automatically computing the nutritional information of a recipe, we decided to encapsulate these three different fields, i.e., (i) the ingredient, (ii) the unit, and (iii) the quantity, separately in the dataset structure. After identifying different types of sentences that followed the 'quantity-unit-ingredient' sequence pattern in the recipe ingredient lists, we used a natural language processing toolkit (NLTK) to tag every single word within each of these sentences (e.g., [('2', 'CD'), ('cups', 'NNS'), ('of', 'IN'), ('milk', 'NN')]). Every ingredient in the dataset that followed the sentence structure of one of those we identified (e.g., '4 teaspoons of honey') was selected for further processing. We then went through the unit candidates of these sentences and chose only the measurable ones (non-measurable units are, for instance, a bunch, a slice or a loaf). Table 2 shows the 20 different units we found. 103,152 unique recipes had measurable units and numerical quantities defined for all their ingredients. Regarding numerical quantities, these recipes contained 1,002 different ones.

TABLE 2
Recipe1M+ units. The 20 measurable units isolated in the dataset:
bushel, cup, dash, drop, fl. oz, g, gallon, glass, kg, liter, ml, ounce, pinch, pint, pound, quart, scoop, shot, tablespoon, teaspoon

Once we finished the previous stage, we matched thousands of ingredient names with a publicly available nutrient database [26] assembled by the United States Department of Agriculture (USDA). This database provides the nutritional content of over 8,000 generic and proprietary-branded foods. In order to facilitate the matching process, we first reduced the ingredient list to contain only the first word within the sentence (after removing quantities and units), obtaining a total of 6,856 unique words. Then, for each unique ingredient we picked, when available, the second word of the sentence. Because multiple different sentences can have the same first word, we only took one example out of the possible ones. We went through each single bigram and only selected those that were food ingredients, e.g., apple juice or cayenne pepper. If the second word was nonexistent, e.g., 1/2 spoon of sugar, or was not part of a standard ingredient name, e.g., 1 cup of water at 40 °C, we only selected the first word, i.e., sugar and water, respectively. We created a corpus of 2,057 unique ingredients with their singular and plural versions and, in some cases, synonyms or translations, e.g., cassava can also be called yuca, manioc or mandioca. We found ingredient names from different nationalities and cultures, such as Spanish, Turkish, German, French, Polish, American, Mexican, Jewish, Indian, Arab, Chinese or Japanese, among others. Using the ingredient corpus, we assigned to each ingredient sentence the closest ingredient name by simply verifying that all the words describing the ingredient name were within the original ingredient sentence. We found 68,450 recipes with all their ingredients within the corpus. The matching between the USDA database and the newly assigned ingredient names, similarly as before, was done by confirming that all the words describing the ingredient name were within one of the USDA database food instances. We inspected the matching results to ensure their correctness. In the end, we obtained 50,637 recipes with nutritional information (mapping example: American cheese ⇒ cheese, pasteurized process, American, without added vitamin D). In Fig. 3, we show a 2D visualization, using t-SNE [27], of the embeddings of these recipes that also include images. Recipes are shown in different colors based on their semantic category (see Section 5). In Fig. 4, we show the same embedding, but this time the recipes are colored depending on how healthy they are in terms of sugar, fat, saturates, and salt. We used the traffic light definition established by the Food Standards Agency (FSA).

4. http://www.nltk.org/
5. https://www.resourcesorg.co.uk/assets/pdfs/foodtrafficlight1107.pdf
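The 'quantity-unit-ingredient' parsing described in Section 2.3 can be approximated with a short NLTK sketch. The unit list below is only a subset of Table 2, and the selection rule is a simplification of the actual pipeline; NLTK resource names may vary slightly across library versions.

```python
import nltk

# One-time downloads for the tokenizer and POS tagger models
# (resource names can differ in newer NLTK releases).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

# Subset of the measurable units listed in Table 2.
MEASURABLE_UNITS = {"cup", "cups", "teaspoon", "teaspoons", "tablespoon",
                    "tablespoons", "ounce", "ounces", "g", "kg", "ml", "liter",
                    "pound", "pounds", "pint", "quart", "gallon"}

def parse_ingredient(sentence):
    """Rough 'quantity-unit-ingredient' parser for sentences such as
    '2 cups of milk'. Returns (quantity, unit, ingredient) or None."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # e.g. [('2','CD'), ('cups','NNS'), ('of','IN'), ('milk','NN')]
    if len(tagged) < 3:
        return None
    (qty, qty_tag), (unit, _) = tagged[0], tagged[1]
    if qty_tag != "CD" or unit.lower() not in MEASURABLE_UNITS:
        return None
    # Keep the remaining nouns as the ingredient name.
    ingredient = " ".join(w for w, tag in tagged[2:] if tag.startswith("NN"))
    return qty, unit.lower(), ingredient

print(parse_ingredient("2 cups of milk"))        # ('2', 'cups', 'milk')
print(parse_ingredient("4 teaspoons of honey"))  # ('4', 'teaspoons', 'honey')
```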

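A 2D projection in the spirit of Fig. 3 can be produced with scikit-learn's t-SNE. The file names below are hypothetical placeholders for precomputed recipe embeddings and integer semantic-category labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical precomputed inputs: (N, D) embeddings and (N,) category labels.
embeddings = np.load("recipe_embeddings.npy")
categories = np.load("semantic_categories.npy")

# Project to 2D and colour points by semantic category.
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(embeddings)

plt.figure(figsize=(8, 8))
plt.scatter(coords[:, 0], coords[:, 1], c=categories, s=3, cmap="tab20")
plt.title("2D t-SNE projection of recipe embeddings")
plt.savefig("embedding_tsne.png", dpi=200)
```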

2.4 Data Structure
The contents of the Recipe1M dataset can logically be grouped into two layers. The first layer (i.e., Layer 1) contains basic information, including a title, a list of ingredients, and a sequence of instructions for preparing a dish; all of these data are provided as free text. Additional fields such as unit and quantity are also available in this layer. In cases where we were unable to extract unit and quantity from the ingredient description, these two fields were simply left empty for the corresponding ingredient. Nutritional information (i.e., total energy, protein, sugar, fat, saturates, and salt content) is only added for those recipes that contained both units and quantities, as described in Section 2.3. FSA traffic lights are also available for such recipes. The second layer (i.e., Layer 2) builds upon the first layer and includes all images with which the recipe is associated; these images are provided as RGB in JPEG format. Additionally, a subset of recipes are annotated with course labels (e.g., appetizer, side dish, or dessert), the prevalence of which is summarized in Fig. 5. For Recipe1M+, we provide the same Layer 1 as described above with different partition assignments, and Layer 2 including the 13M images.
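To make the two-layer organization concrete, the following is an illustrative sketch of what a Layer 1 entry with the additional unit/quantity, nutritional and FSA fields could look like. The field names and all values are assumptions for illustration, not the exact schema of the released files.

```python
import json

# Illustrative Layer 1 entry; field names and values are placeholders,
# not the exact schema or real nutritional data of the released dataset.
example_recipe = {
    "id": "000018c8a5",                      # hypothetical unique identifier
    "title": "Worlds Best Mac and Cheese",
    "partition": "train",
    "ingredients": [
        {"text": "6 ounces penne", "quantity": "6", "unit": "ounce", "name": "penne"},
        {"text": "2 cups cheese sauce", "quantity": "2", "unit": "cup", "name": "cheese sauce"},
    ],
    "instructions": [
        {"text": "Preheat the oven to 350 F."},
        {"text": "Cook the penne 2 minutes less than package directions."},
    ],
    "nutr_per_recipe": {"energy": 1800.0, "fat": 90.0, "protein": 70.0,
                        "salt": 4.2, "saturates": 45.0, "sugars": 12.0},
    "fsa_lights": {"fat": "red", "salt": "orange", "saturates": "red", "sugars": "green"},
}

print(json.dumps(example_recipe, indent=2))
```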

Fig. 3. Embedding visualization using t-SNE. The legend depicts the recipes that belong to the top 12 semantic categories used in our semantic regularization (see Section 5 for more details).

Fig. 4. Healthiness within the embedding. Recipe health is represented within the embedding visualization in terms of sugar, salt, saturates, and fat. We follow the FSA traffic light system to determine how healthy a recipe is.

2.5 Analysis
Recipe1M (and hence Recipe1M+) includes approximately 0.4% duplicate recipes and, excluding those duplicate recipes, 20% of recipes have non-unique titles but symmetrically differ by a median of 16 ingredients. 0.2% of recipes share the same ingredients but are relatively simple (e.g., spaghetti, or granola), having a median of six ingredients. Approximately half of the recipes did not have any images in the initial data collection from recipe websites. However, after the data extension phase, only around 2% of the recipes are left without any associated images. Regarding the experiments, we carefully removed any exact duplicates or recipes sharing the same image in order to avoid overlap between the training and test sets. As detailed earlier in Table 1, around 70% of the data is labeled as training, and the remainder is split equally between the validation and test sets. During the dataset extension, as mentioned earlier, we also created an intersection dataset in order to have a fair comparison of the experimental results on both the initial and the extended versions of the dataset.
According to Fig. 5, the average recipe in the dataset consists of nine ingredients which are transformed over the course of ten instructions. One can also observe that the distributions of data are heavy tailed. For instance, of the 16k ingredients identified as unique (in terms of phrasing), only 4,000 account for 95% of occurrences. At the low end of the instruction count, particularly recipes with one step, one will find the dreaded "Combine all ingredients." At the other end are lengthy recipes and ingredient lists associated with recipes that include sub-recipes.
A similar issue of outliers also exists for images: as several of the included recipe collections curate user-submitted images, popular recipes like chocolate chip cookies have orders of magnitude more images than the average. Notably, the number of unique recipes that came with associated food images in the initial data collection phase was 333K, whilst after the data extension phase this number reached more than 1M recipes. On average, the Recipe1M+ dataset contains 13 images per recipe whereas Recipe1M has less than one image per recipe, 0.86 to be exact. Fig. 5 also depicts the images vs. recipes histogram for Recipe1M+, where over half a million recipes contain more than 12 images each.
To further evaluate the quality of the match between the queried images and the recipes, we performed an experiment on the Amazon Mechanical Turk (AMT) platform. We randomly picked 3,455 recipes, containing at most ten ingredients and ten instructions, from the pool of recipes with non-unique titles. Then, for each one of these recipes, we showed AMT workers a pair of images and asked them to choose which image, A or B, was the best match for the corresponding recipe. The workers also had the options of selecting 'both images' or 'none of them'. Image A and image B were randomly chosen: one from the original recipe (i.e., Recipe1M) images and the other from the images queried during the dataset expansion for the corresponding recipe title. We also changed the order of image A and image B randomly. We explicitly asked the workers to check all the ingredients and instructions. Only master workers were selected for this experiment. Out of 3,455 recipes, the workers chose the original recipe image 971 times (28.1%), the queried one 821 times (23.8%), both of them 1,581 times (45.8%), and none of them 82 times (2.4%). Given that the difference between the original recipe image and the queried image is less than 5 percentage points, these results show that the extended dataset is not much noisier than the original Recipe1M.

6. http://mturk.com

3 LEARNING EMBEDDINGS
In this section, we describe our neural joint embedding model. Here, we utilize the paired (recipe and image) data in order to learn a common embedding space, as illustrated in Fig. 1. Next, we discuss recipe and image representations, and then we describe our neural joint embedding model that builds upon these representations.


Fig. 5. Dataset statistics. Prevalence of course categories and number of instructions, ingredients and images per recipe in Recipe1M+.

3.1 Representation of Recipes
There are two major components of a recipe: its ingredients and cooking instructions. We develop a suitable representation for each of these components.
Ingredients. Each recipe contains a set of ingredient texts as shown in Fig. 1. For each ingredient we learn an ingredient-level word2vec [28] representation. In order to do so, the actual ingredient names are extracted from each ingredient text. For instance, in "2 tbsp of olive oil," olive oil is extracted as the ingredient name and treated as a single word for the word2vec computation. The initial ingredient name extraction task is solved by a bi-directional LSTM that performs logistic regression on each word in the ingredient text. Training is performed on a subset of our training set for which we have annotations for the actual ingredient names. The ingredient name extraction module achieves 99.5% accuracy on a held-out set.
Cooking Instructions. Each recipe also has a list of cooking instructions. As the instructions are quite lengthy (averaging ~208 words), a single LSTM is not well suited to their representation, as gradients are diminished over the many time steps. Instead, we propose a two-stage LSTM model which is designed to encode a sequence of sequences. First, each instruction/sentence is represented as a skip-instructions vector, and then an LSTM is trained over the sequence of these vectors to obtain the representation of all instructions. The resulting fixed-length representation is fed into our joint embedding model (see the instructions encoder in Fig. 6).
Skip-instructions. Our cooking instruction representation, referred to as skip-instructions, is the product of a sequence-to-sequence model [29]. Specifically, we build upon the technique of skip-thoughts [30], which encodes a sentence and uses that encoding as context when decoding/predicting the previous and next sentences (see Fig. 7). Our modifications to this method include adding start- and end-of-recipe "instructions" and using an LSTM instead of a GRU. In either case, the representation of a single instruction is the final output of the encoder. As before, this is used as the instructions input to our embedding model.

3.2 Representation of Food Images
For the image representation we adopt two major state-of-the-art deep convolutional networks, namely the VGG-16 [6] and ResNet-50 [7] models. In particular, deep residual networks have a proven record of success on a variety of benchmarks [7]. Although [6] suggests training very deep networks with small convolutional filters, deep residual networks take it to another level using ubiquitous identity mappings that enable training of much deeper architectures (e.g., with 50, 101, or 152 layers) with better performance. We incorporate these models by removing the last softmax classification layer and connecting the rest to our joint embedding model, as shown on the right side of Fig. 6.

4 JOINT NEURAL EMBEDDING
Building upon the previously described recipe and image representations, we now introduce our joint embedding method. The recipe model, displayed in Fig. 6, includes two encoders, one for ingredients and one for instructions, whose combination is designed to learn a recipe-level representation. The ingredients encoder combines the sequence of ingredient word vectors. Since the ingredient list is an unordered set, we choose to utilize a bidirectional LSTM model, which considers both forward and backward orderings. The instructions encoder is implemented as a forward LSTM model over skip-instructions vectors. The outputs of both encoders are concatenated and embedded into a recipe-image joint space. The image representation is simply projected into this space through a linear transformation. The goal is to learn transformations that make the embeddings for a given recipe-image pair "close."
Formally, assume that we are given a set of recipe-image pairs $(r_k, v_k)$, in which $r_k$ is the $k$-th recipe and $v_k$ is the associated image. Further, let $r_k = (\{s_k^t\}_{t=1}^{n_k}, \{g_k^t\}_{t=1}^{m_k})$, where $\{s_k^t\}_{t=1}^{n_k}$ is the sequence of $n_k$ cooking instructions and $\{g_k^t\}_{t=1}^{m_k}$ is the sequence of $m_k$ ingredient tokens. The objective is to maximize the cosine similarity between positive recipe-image pairs, and minimize it between all non-matching recipe-image pairs, up to a specified margin.
The ingredients encoder is implemented using a bi-directional LSTM: at each time step it takes the two ingredient-word2vec representations $g_k^t$ and $g_k^{m_k-t+1}$, and eventually it produces the fixed-length representation $h_k^g$ for the ingredients. The instructions encoder is implemented through a regular LSTM. At each time step it receives an instruction representation from the skip-instructions encoder, and finally it produces the fixed-length representation $h_k^s$. $h_k^g$ and $h_k^s$ are concatenated in order to obtain the recipe representation $h_k^r$. On the image side, the image encoder simply produces the fixed-length representation $h_k^v$. Then, the recipe and image representations are mapped into the joint embedding space as $\phi^r = W^r h_k^r + b^r$ and $\phi^v = W^v h_k^v + b^v$, respectively. Note that $W^r$ and $W^v$ are embedding matrices which are also learned. Finally, the complete model is trained end-to-end with positive and negative recipe-image pairs $(\phi^r, \phi^v)$ using the cosine similarity loss with margin, defined as follows:

$$ L_{\cos}(\phi^r, \phi^v, y) = \begin{cases} 1 - \cos(\phi^r, \phi^v), & \text{if } y = 1 \\ \max(0, \cos(\phi^r, \phi^v) - \alpha), & \text{if } y = -1 \end{cases} $$

where $\cos(\cdot)$ is the normalized cosine similarity and $\alpha$ is the margin.
embedding model as shown in the right side of Fig. 6. solving the same high-level classification problem in multiple
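A compact PyTorch sketch of the encoders and projections described in this section is given below. Hidden sizes are illustrative, and the ingredient word2vec and skip-instruction vectors are assumed to be precomputed; this is a sketch of the architecture, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class JointEmbedding(nn.Module):
    """Sketch of the joint embedding model: a bidirectional LSTM over ingredient
    word2vec vectors, an LSTM over precomputed skip-instruction vectors, and a
    ResNet-50 image encoder, all projected into a shared space."""

    def __init__(self, w2v_dim=300, skip_dim=1024, hidden=512, emb_dim=1024):
        super().__init__()
        self.ingr_rnn = nn.LSTM(w2v_dim, hidden, batch_first=True, bidirectional=True)
        self.instr_rnn = nn.LSTM(skip_dim, hidden, batch_first=True)
        self.recipe_proj = nn.Linear(2 * hidden + hidden, emb_dim)   # W^r, b^r

        resnet = models.resnet50(pretrained=True)
        resnet.fc = nn.Identity()                                    # keep the 2048-d features
        self.image_encoder = resnet
        self.image_proj = nn.Linear(2048, emb_dim)                   # W^v, b^v

    def forward(self, ingr_vecs, instr_vecs, images):
        # ingr_vecs:  (B, n_ingredients, w2v_dim)  ingredient word2vec sequence
        # instr_vecs: (B, n_instructions, skip_dim) skip-instruction sequence
        _, (h_ingr, _) = self.ingr_rnn(ingr_vecs)        # h_ingr: (2, B, hidden)
        h_g = torch.cat([h_ingr[0], h_ingr[1]], dim=1)   # forward + backward final states
        _, (h_instr, _) = self.instr_rnn(instr_vecs)
        h_s = h_instr[-1]                                # (B, hidden)
        phi_r = self.recipe_proj(torch.cat([h_g, h_s], dim=1))
        phi_v = self.image_proj(self.image_encoder(images))
        # L2-normalise so that a dot product equals the cosine similarity.
        return F.normalize(phi_r, dim=1), F.normalize(phi_v, dim=1)
```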

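The cosine loss with margin can be written directly in PyTorch. This sketch assumes L2-normalized embeddings such as those returned by the model sketch above, and uses the margin value reported later in the implementation details.

```python
import torch

def cosine_margin_loss(phi_r, phi_v, y, alpha=0.1):
    """Cosine loss with margin from Section 4.

    phi_r, phi_v: L2-normalised recipe and image embeddings of shape (B, D).
    y: tensor of +1 for matching pairs and -1 for non-matching pairs.
    alpha: margin (0.1 in the implementation details)."""
    cos = (phi_r * phi_v).sum(dim=1)                 # cosine similarity per pair
    pos_loss = 1.0 - cos                             # pull matching pairs together
    neg_loss = torch.clamp(cos - alpha, min=0.0)     # push non-matching pairs below the margin
    return torch.where(y == 1, pos_loss, neg_loss).mean()

# Example: 4 pairs, the first two positive and the last two negative.
phi_r = torch.nn.functional.normalize(torch.randn(4, 1024), dim=1)
phi_v = torch.nn.functional.normalize(torch.randn(4, 1024), dim=1)
y = torch.tensor([1, 1, -1, -1])
print(cosine_margin_loss(phi_r, phi_v, y))
```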

5 SEMANTIC REGULARIZATION
We incorporate additional regularization on our embedding by solving the same high-level classification problem in multiple modalities with shared high-level weights. We refer to this method as semantic regularization. The key idea is that if high-level discriminative weights are shared, then both of the modalities (recipe and image embeddings) should utilize these weights in a similar way, which brings another level of alignment based on discrimination. We optimize this objective together with our joint embedding loss. Essentially, the model also learns to classify any image or recipe embedding into one of the food-related semantic categories. We limit the effect of semantic regularization, as it is not the main problem that we aim to solve.

Fig. 6. Joint neural embedding model with semantic regularization. Our model learns a joint embedding space for food images and cooking recipes.

Fig. 7. Skip-instructions model. During training the encoder learns to predict the next instruction.

Semantic Categories. We start by assigning Food-101 categories to those recipes that contain them in their title. However, after this procedure we are only able to annotate 13% of our dataset, which we argue is not enough labeled data for a good regularization. Hence, we compose a larger set of semantic categories purely extracted from recipe titles. We first obtain the top 2,000 most frequent bigrams in recipe titles from our training set. We manually remove those that contain unwanted characters (e.g., n', !, ? or &) and those that do not have discriminative food properties (e.g., best pizza, super easy or 5 minutes). We then assign each of the remaining bigrams as the semantic category to all recipes that include it in their title. By using bigrams and Food-101 categories together we obtain a total of 1,047 categories, which cover 50% of the dataset. Chicken salad, grilled vegetable, chocolate cake and fried fish are some examples of the categories we collect using this procedure. All those recipes without a semantic category are assigned to an additional background class. Although there is some overlap in the generated categories, 73% of the recipes in our dataset (excluding those in the background class) belong to a single category (i.e., only one of the generated classes appears in their title). For recipes where two or more categories appear in the title, the category with the highest frequency rate in the dataset is chosen.

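The bigram-based category construction can be sketched as follows. The helper names, the toy titles and the stop-bigram handling are illustrative simplifications of the manual filtering described above.

```python
from collections import Counter

def build_semantic_categories(titles, food101_classes, top_k=2000, stop_bigrams=()):
    """Sketch of the bigram-based category construction: count title bigrams,
    keep the most frequent ones, and prepend the Food-101 class names."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    bigrams = [b for b, _ in counts.most_common(top_k) if b not in stop_bigrams]
    return list(food101_classes) + bigrams

def assign_category(title, categories, frequency):
    """Assign the matching category with the highest frequency, else 'background'."""
    title = title.lower()
    matches = [c for c in categories if c in title]
    if not matches:
        return "background"
    return max(matches, key=lambda c: frequency.get(c, 0))

# Toy usage with made-up titles.
titles = ["Grandma's Chicken Salad", "Easy Chicken Salad Wraps", "Chocolate Cake"]
categories = build_semantic_categories(titles, food101_classes=["chocolate cake"])
freq = Counter(c for t in titles for c in categories if c in t.lower())
print([assign_category(t, categories, freq) for t in titles])
```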

Classification. To incorporate semantic regularization into the joint embedding, we use a single fully connected layer. Given the embeddings $\phi^v$ and $\phi^r$, class probabilities are obtained with $p^r = W^c \phi^r$ and $p^v = W^c \phi^v$, followed by a softmax activation. $W^c$ is the matrix of learned weights, which is shared between the image and recipe embeddings to promote semantic alignment between them. Formally, we express the semantic regularization loss as $L_{reg}(\phi^r, \phi^v, c^r, c^v)$, where $c^r$ and $c^v$ are the semantic category labels for the recipe and image, respectively. Note that $c^r$ and $c^v$ are the same if $(\phi^r, \phi^v)$ is a positive pair. Then, we can write the final objective as:

$$ L(\phi^r, \phi^v, c^r, c^v, y) = L_{\cos}(\phi^r, \phi^v, y) + \lambda L_{reg}(\phi^r, \phi^v, c^r, c^v) $$

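Below is a minimal sketch of the shared classifier and the combined objective, assuming the `cosine_margin_loss` function sketched after Section 4; the class count reflects the 1,047 categories plus the background class.

```python
import torch
import torch.nn as nn

class SemanticRegularizer(nn.Module):
    """Shared high-level classifier: one weight matrix W^c applied to both the
    recipe and the image embedding, trained with cross-entropy (softmax folded in)."""

    def __init__(self, emb_dim=1024, num_classes=1048):  # 1,047 categories + background
        super().__init__()
        self.classifier = nn.Linear(emb_dim, num_classes, bias=False)  # shared W^c
        self.ce = nn.CrossEntropyLoss()

    def forward(self, phi_r, phi_v, c_r, c_v):
        # The same weights score both modalities, encouraging aligned embeddings.
        return self.ce(self.classifier(phi_r), c_r) + self.ce(self.classifier(phi_v), c_v)

def total_loss(phi_r, phi_v, y, c_r, c_v, regularizer, lam=0.02, alpha=0.1):
    """L = L_cos + lambda * L_reg, with lambda = 0.02 as in the implementation details.
    `cosine_margin_loss` is the function from the earlier sketch."""
    return cosine_margin_loss(phi_r, phi_v, y, alpha) + lam * regularizer(phi_r, phi_v, c_r, c_v)
```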
Optimization. We follow a two-stage optimization procedure while learning the model. If we update both the recipe encoding and the image network at the same time, optimization becomes oscillatory and even divergent. Previous work on cross-modality training [31], [32] suggests training the models for different modalities separately and fine-tuning them jointly afterwards to allow alignment. Following this insight, we adopt a similar procedure when training our model. We first fix the weights of the image network, which are found from pre-training on the ImageNet object classification task, and learn the recipe encodings. This way the recipe network learns to align itself to the image representations and also learns the semantic regularization parameters ($W^c$). Then we freeze the recipe encoding and semantic regularization weights, and learn the image network. This two-stage process is crucial for successful optimization of the objective function. After this initial alignment stage, we release all the weights to be learned. However, the results do not change much in this final, joint optimization. We take a step further from [12] in our extended study and change the validation procedure to use the median rank (MedR) score as our performance measure, like in [19], while reimplementing our source code in PyTorch. This strategy appears to be more stable than using the validation loss.
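The two-stage schedule can be sketched as a small helper that freezes one side of the model while optimizing the other. The module names in the usage comments refer to the earlier sketches and are assumptions, not the released code.

```python
import itertools
import torch

def make_stage_optimizer(trainable, frozen, lr=1e-4):
    """Freeze the `frozen` modules and return an optimizer over the `trainable` ones."""
    for p in itertools.chain(*(m.parameters() for m in frozen)):
        p.requires_grad = False
    params = []
    for m in trainable:
        for p in m.parameters():
            p.requires_grad = True
            params.append(p)
    return torch.optim.Adam(params, lr=lr)

# Usage with the modules from the sketches above (names are assumptions):
#   model, regularizer = JointEmbedding(), SemanticRegularizer()
# Stage 1: fixed vision network, learn the recipe encoders and the shared classifier W^c.
#   opt = make_stage_optimizer([model.ingr_rnn, model.instr_rnn, model.recipe_proj, regularizer],
#                              [model.image_encoder, model.image_proj])
# Stage 2: freeze the recipe side, learn the vision network.
#   opt = make_stage_optimizer([model.image_encoder, model.image_proj],
#                              [model.ingr_rnn, model.instr_rnn, model.recipe_proj, regularizer])
```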
Implementation Details. All the neural network models are implemented using the Torch7 and PyTorch frameworks. The margin α is set to 0.1 in the joint neural embedding models. The regularization hyper-parameter is set to λ = 0.02 in all our experiments. While optimizing the cosine loss, we pick a positive recipe-image pair with 20% probability and a random negative recipe-image pair with 80% probability from the training set.
The models in Torch7 are trained on 4 NVIDIA Titan X GPUs with 12GB of memory for three days. The models in PyTorch are trained on 4 NVIDIA GTX 1080 GPUs with 8GB of memory for two and a half days (using a bigger batch size, i.e., 256 pairs instead of 150). When using Recipe1M+, the training in PyTorch tends to take over a week, using a batch size of 256. For efficiency purposes, we store the recipe text part of the dataset in LMDB format and load the images on the fly using the DataLoader class of the PyTorch library. This way our PyTorch code does not require as much RAM as our Torch7 code does. As a side note, between the two reference libraries, we did experience that PyTorch in general uses less GPU memory.

7. http://torch.ch/
8. https://pytorch.org/
9. https://lmdb.readthedocs.io/en/release/

Fig. 8. Im2recipe retrieval examples. From left to right: (1) the query image, (2) its associated ingredient list, (3) the retrieved ingredients, and (4) the image associated with the retrieved recipe.

6 EXPERIMENTS
We begin with the evaluation of our learned embeddings for the im2recipe retrieval task on the initial (i.e., recipe-website-only) version of our dataset (i.e., Recipe1M). Specifically, we study the effect of each component of our model and compare our final system against human performance on the im2recipe retrieval task. Then, using the best model architecture trained on the recipe-website-only version of the dataset, we compare its retrieval performance with that of the same architecture trained on the extended version of the dataset (i.e., Recipe1M+) to evaluate the benefit of data extension through an image search engine. We further evaluate the two models on the Food-101 dataset to assess their generalization ability. Finally, we analyze the properties of our learned embeddings through unit visualizations and explore different vector arithmetics in the embedding space on both the initial (Recipe1M) and the extended (Recipe1M+) datasets.

6.1 Im2recipe Retrieval
We evaluate all the recipe representations for im2recipe retrieval. Given a food image, the task is to retrieve its recipe from a collection of test recipes. We also perform recipe2im retrieval using the same setting. All results are reported for the test set.
Comparison with the Baselines. Canonical Correlation Analysis (CCA) is one of the strongest statistical models for learning joint embeddings for different feature spaces when paired data are provided. We use CCA over several high-level recipe and image representations as our baseline. These CCA embeddings are learned using recipe-image pairs from the training data. In each recipe, the ingredients are represented with the mean word2vec across all its ingredients, in the manner of [33]. The cooking instructions are represented with mean skip-thoughts vectors [30] across the cooking instructions. A recipe is then represented as the concatenation of these two features. We also evaluate CCA over mean ingredient word2vec and skip-instructions features as another baseline.


TABLE 3
Im2recipe retrieval comparisons on Recipe1M. Median ranks and recall rates at top K are reported for the baselines and our method. Note that the joint neural embedding models consistently outperform all the baseline methods.

Method                                                          | im2recipe: medR  R@1   R@5   R@10 | recipe2im: medR  R@1   R@5   R@10
random ranking                                                  | 500   0.001  0.005  0.01          | 500   0.001  0.005  0.01
CCA w/ skip-thoughts + word2vec (GoogleNews) + image features   | 25.2  0.11   0.26   0.35          | 37.0  0.07   0.20   0.29
CCA w/ skip-instructions + ingredient word2vec + image features | 15.7  0.14   0.32   0.43          | 24.8  0.09   0.24   0.35
joint emb. only                                                 | 7.2   0.20   0.45   0.58          | 6.9   0.20   0.46   0.58
joint emb. + semantic                                           | 5.2   0.24   0.51   0.65          | 5.1   0.25   0.52   0.65
attention + SR [18]                                             | 4.6   0.26   0.54   0.67          | 4.6   0.26   0.54   0.67
AdaMine [19]                                                    | 1.0   0.40   0.69   0.77          | 1.0   0.40   0.68   0.79

The image features utilized in the CCA baselines are the ResNet-50 features before the softmax layer. Although they are learned for visual object categorization tasks on the ImageNet dataset, these features are widely adopted by the computer vision community, and they have been shown to generalize well to different visual recognition tasks [34].
For evaluation, given a test query image, we use cosine similarity in the common space for ranking the relevant recipes and perform im2recipe retrieval. The recipe2im retrieval setting is evaluated likewise. We adopt the test procedure from the image2caption retrieval task [35], [36]. We report results on a subset of 1,000 randomly selected recipe-image pairs from the test set. We repeat the experiments 10 times and report the mean results. We report the median rank (MedR) and the recall rate at top K (R@K) for all the retrieval experiments. To clarify, R@5 in the im2recipe task represents the percentage of all the image queries for which the corresponding recipe is retrieved in the top 5, hence higher is better. The quantitative results for im2recipe retrieval are shown in Table 3.
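The MedR and R@K metrics used throughout the experiments can be computed as follows; this sketch assumes L2-normalized embedding matrices for the sampled 1,000 test pairs, with row i of each matrix forming a true pair.

```python
import numpy as np

def retrieval_metrics(img_emb, rec_emb, ks=(1, 5, 10)):
    """Compute MedR and R@K for im2recipe retrieval over a sampled subset.
    img_emb, rec_emb: L2-normalised arrays of shape (N, D)."""
    sims = img_emb @ rec_emb.T                       # cosine similarities (N, N)
    order = np.argsort(-sims, axis=1)                # recipes ranked per image query
    # Rank of the true recipe for each image query (1 = retrieved first).
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(sims))])
    medr = float(np.median(ranks))
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    return medr, recall

# Sanity check with random embeddings: MedR should be near N/2 and R@K near K/N,
# consistent with the "random ranking" row of Table 3.
rng = np.random.default_rng(0)
img = rng.standard_normal((1000, 1024)); img /= np.linalg.norm(img, axis=1, keepdims=True)
rec = rng.standard_normal((1000, 1024)); rec /= np.linalg.norm(rec, axis=1, keepdims=True)
print(retrieval_metrics(img, rec))
```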
Our model outperforms the CCA baselines in all measures. As expected, CCA over ingredient word2vec and skip-instructions performs better than CCA over word2vec trained on GoogleNews [28] and skip-thoughts vectors learned over a large-scale book corpus [30]. In 65% of all evaluated queries, our method can retrieve the correct recipe given a food image. The semantic regularization notably improves the quality of our embedding for the im2recipe task, which is quantified by the medR drop from 7.2 to 5.2 in Table 3. The results for the recipe2im task are also similar to those in the im2recipe retrieval setting.
Table 3 also presents results originally reported in [18] and [19] on Recipe1M. The attention-based modeling of [18] achieves slight performance increases, whereas the double-triplet learning scheme of [19] leads to larger performance gains in both retrieval settings. Since neither [18] nor [19] made their code publicly available, we could not evaluate their algorithms on our datasets for further comparative analyses.
Fig. 8 compares the ingredients from the original recipes (true recipes) with the retrieved recipes (coupled with their corresponding image) for different image queries. As can be observed in Fig. 8, our embeddings generalize well and allow overall satisfactory recipe retrieval results. However, at the ingredient level, one can find that in some cases our model retrieves recipes with missing ingredients. This usually occurs due to the lack of fine-grained features (e.g., confusion between shrimps and salmon) or simply because the ingredients are not visible in the query image (e.g., blueberries in a smoothie or beef in a lasagna).
Ablation Studies. We also analyze the effect of each component of our model in several optimization stages. The results are reported in Table 4. Note that here we also report medR with 1K, 5K and 10K random selections to show how the results scale in larger retrieval problems. As expected, visual features from the ResNet-50 model show a substantial improvement in retrieval performance when compared to VGG-16 features. Even with "fixed vision" networks, the joint embedding achieved 7.9 medR using the ResNet-50 architecture. Further "fine-tuning" of the vision networks slightly improves the results. Although it becomes a lot harder to decrease the medR once it is already small, additional "semantic regularization" improves the medR in both cases.
Comparison with Human Performance. In order to better assess the quality of our embeddings we also evaluate the performance of humans on the im2recipe task. The experiments are performed through AMT. For quality purposes, we require each AMT worker to have at least a 97% approval rate and to have performed at least 500 tasks before our experiment. In a single evaluation batch, we first randomly choose 10 recipes and their corresponding images. We then ask an AMT worker to choose the correct recipe, out of the 10 provided recipes, for the given food image. This multiple choice selection task is performed 10 times for each food image in the batch. The accuracy of an evaluation batch is defined as the percentage of image queries correctly assigned to their corresponding recipe.
The evaluations are performed for three levels of difficulty. The batches (of 10 recipes) are randomly chosen from either all the test recipes (easy), recipes sharing the same course (e.g., soup, salad, or beverage; medium), or recipes sharing the name of the dish (e.g., salmon, pizza, or ravioli; hard). As expected, for our model as well as for the AMT workers, the accuracies decrease as tasks become more specific. In both coarse and fine-grained tests, our method performs comparably to or better than the AMT workers. As hypothesized, semantic regularization further improves the results (see Table 5).
In the "all recipes" condition, 25 random evaluation batches (25 × 10 individual tasks in total) are selected from the entire test set. Joint embedding with semantic regularization performs the best, with a 3.2 percentage point improvement over average human accuracy. For the course-specific tests, 5 batches are randomly selected within each given meal course. Although, on average, our joint embedding's performance is slightly lower than the humans', with semantic regularization our joint embedding surpasses humans' performance by 6.8 percentage points. In dish-specific tests, five random batches are selected if they have the dish name (e.g., pizza) in their title. With slightly lower accuracies in general, dish-specific results also show similar behavior.


TABLE 4
Ablation studies on Recipe1M. Effect of the different model components on the median rank, medR (lower is better).

                                      im2recipe                          recipe2im
                             medR-1K  medR-5K  medR-10K        medR-1K  medR-5K  medR-10K
VGG-16     fixed vision         15.3     71.8     143.6           16.4     76.8     152.8
VGG-16     finetuning (ft)      12.1     56.1     111.4           10.5     51.0     101.4
VGG-16     ft + semantic reg.    8.2     36.4      72.4            7.3     33.4      64.9
ResNet-50  fixed vision          7.9     35.7      71.2            9.3     41.9      83.1
ResNet-50  finetuning (ft)       7.2     31.5      62.8            6.9     29.8      58.8
ResNet-50  ft + semantic reg.    5.2     21.2      41.9            5.1     20.2      39.2

TABLE 5
Comparison with human performance on the im2recipe task on Recipe1M. Note that, on average, our method with semantic regularization performs better than the average AMT worker.

all recipes
human                 81.6 ± 8.9
joint-emb. only       83.6 ± 3.0
joint-emb.+semantic   84.8 ± 2.7

course-specific recipes
                      dessert  salad  bread  beverage  soup-stew  course-mean
human                    52.0   70.0   34.0      58.0       56.0  54.0 ± 13.0
joint-emb. only          76.0   68.0   38.0      24.0       62.0  53.6 ± 21.8
joint-emb.+semantic      74.0   82.0   56.0      30.0       62.0  60.8 ± 20.0

dish-specific recipes
                      pasta  pizza  steak  salmon  smoothie  hamburger  ravioli  sushi  dish-mean
human                  54.0   48.0   58.0    52.0      48.0       46.0     54.0   58.0  52.2 ± 4.6
joint-emb. only        58.0   58.0   58.0    64.0      38.0       58.0     62.0   42.0  54.8 ± 9.4
joint-emb.+semantic    52.0   60.0   62.0    68.0      42.0       68.0     62.0   44.0  57.2 ± 10.1

TABLE 6
Comparison between models trained on Recipe1M vs. Recipe1M+. Median ranks and recall rate at top K are reported for both models. They have similar performance on the Recipe1M test set in terms of medR and R@K. However, when testing on the Recipe1M+ test set, the model trained on Recipe1M+ yields significantly better medR and better R@5 and R@10 scores. In this table, Recipe1M refers to the intersection dataset.

im2recipe                    Recipe1M test set              Recipe1M+ test set
                          medR  R@1   R@5   R@10         medR  R@1   R@5   R@10
Recipe1M training set      5.1  0.24  0.52  0.64         13.6  0.15  0.35  0.46
Recipe1M+ training set     5.7  0.21  0.49  0.62          8.6  0.17  0.42  0.54

recipe2im                    Recipe1M test set              Recipe1M+ test set
                          medR  R@1   R@5   R@10         medR  R@1   R@5   R@10
Recipe1M training set      4.8  0.27  0.54  0.65         11.9  0.17  0.38  0.48
Recipe1M+ training set     4.6  0.26  0.54  0.66          6.8  0.21  0.46  0.58

Particularly for the “beverage” and “smoothie” results, human performance is better than our method, possibly because detailed analysis is needed to elicit the homogenized ingredients in drinks. Similar behavior is also observed for the “sushi” results, where fine-grained features of the sushi roll’s center are crucial to identify the correct sushi recipe.
Recipe1M vs. Recipe1M+ Comparison. One of the main questions of the current study is how beneficial it is to incorporate images coming from a Web search engine into the initial collection of images obtained from recipe websites. One way to assess this is to compare the im2recipe retrieval performance of a network architecture trained on Recipe1M with that of the same architecture trained on Recipe1M+. In Table 6, we present im2recipe retrieval results achieved on both test sets. As can be seen, there is a clear benefit when we evaluate both models on the Recipe1M+ test set. The model trained on Recipe1M+ obtains a significantly better medR, 5 points lower in both retrieval tasks, and higher R@5 and R@10, in some cases up to a 10 percentage point increase. When looking at the Recipe1M test set, both models perform similarly. These results clearly demonstrate the benefit of using external search engines to extend the imagery content of Recipe1M. Note that the retrieval results in Tables 3 and 6 differ slightly because we use a modified version of the dataset (see intersection dataset in Table 1) in the latter experiment. As we explained earlier in Section 2, this is done mainly to have a fair comparison of im2recipe retrieval results on both versions of the dataset.
Model Generalization Ability Comparison. We experiment further to evaluate whether the Recipe1M+ dataset improves the performance of our model on other food image datasets. For this purpose, we evaluate both of our trained models on the popular Food-101 dataset [8].

TABLE 7
Im2recipe retrieval comparisons on the Food-101 dataset. Median ranks and recall rate at top K are reported for both models. Note that the model trained on Recipe1M+ performs better than the model trained on Recipe1M. In this table, Recipe1M refers to the intersection dataset.

im2recipe                  medR   R@1    R@5    R@10
Recipe1M training set     17.35  16.13  33.68  42.53
Recipe1M+ training set    10.15  21.89  42.31  51.14

recipe2im                  medR   R@1    R@5    R@10
Recipe1M training set      4.75  26.19  54.52  67.50
Recipe1M+ training set     2.60  37.38  65.00  76.31


Fig. 9. Localized unit activations. We find that ingredient detectors emerge in different units in our embeddings, which are aligned across modalities
(e.g., unit 352: “cream”, unit 22: “sponge cake” or unit 571: “steak”).

The Food-101 dataset is a classification dataset containing 101 food categories with 1,000 images each, totaling 101,000 images.
Our evaluation involves randomly sampling an image and a recipe corresponding to each of the Food-101 categories. The images are taken from the Food-101 dataset, while the recipes are taken from the test partition of the intersection dataset. A recipe is considered to belong to a category only if the recipe title string matches the Food-101 category name, and we only sample images and recipes from those categories that correspond to at least N recipes among the test recipes that we sample from.
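As a minimal sketch of this category-matched sampling, the snippet below builds one (image, recipe) evaluation pair per sufficiently common category. The variable names (food101_images, test_recipes), the substring-based title matching and the min_recipes threshold are illustrative assumptions, since these details are implementation choices.

```python
import random

def build_eval_pairs(categories, food101_images, test_recipes, min_recipes=1):
    """Sample one (image, recipe) pair per Food-101 category.

    categories:     list of Food-101 category names, e.g. "apple_pie"
    food101_images: dict mapping category -> list of image paths
    test_recipes:   list of (title, recipe_id) tuples from the test partition
    """
    pairs = []
    for cat in categories:
        name = cat.replace("_", " ").lower()
        # A recipe belongs to the category when the category name appears in its title.
        matches = [rid for title, rid in test_recipes if name in title.lower()]
        if len(matches) < min_recipes:      # skip categories without enough matching recipes
            continue
        pairs.append((random.choice(food101_images[cat]), random.choice(matches)))
    return pairs
```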
After sampling an image and a corresponding recipe for each category that is common enough, we evaluate our models on the retrieval task. In the im2recipe direction, we provide our model with the image and expect it to retrieve the corresponding recipe. In the recipe2im direction, we provide our model with the recipe and expect it to retrieve the corresponding image. We show the retrieval results of both models in Table 7. Note that the model trained on Recipe1M+ consistently outperforms the model trained on Recipe1M.
One possible explanation for the Recipe1M+ dataset giving an advantage on the Food-101 task is that there might be an overlap between the images used to train the model on Recipe1M+ and the Food-101 images. In particular, there might be images in the Recipe1M+ training set that overlap with the Food-101 dataset but are not in the initial training set, which would give the model trained on Recipe1M+ an unfair advantage. We perform the following procedure to test whether this is true. First, we feed all of the images in the Recipe1M+ training set and all of the Food-101 images into an 18-layer residual network that was pre-trained on ImageNet. The network outputs a prediction vector for each of these images. If an image in the extended training set has an exact copy in the Food-101 dataset, then both images must have the same prediction vector. When checking the prediction vectors of the images in Food-101 and the Recipe1M+ training set, we did not find any overlapping prediction vectors, meaning that the images in Food-101 and the Recipe1M+ training set do not overlap.
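This overlap check can be sketched as follows, assuming the image paths of both collections are available locally. The helper names and the rounding used to hash the prediction vectors are illustrative; only the underlying idea, that exact image copies yield identical ImageNet prediction vectors, follows the procedure described above.

```python
import numpy as np
import torch
from torchvision import models, transforms
from PIL import Image

def prediction_vectors(paths, batch_size=64):
    """Return an (N, 1000) array of ImageNet prediction vectors from ResNet-18."""
    net = models.resnet18(pretrained=True).eval()
    prep = transforms.Compose([
        transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    outs = []
    with torch.no_grad():
        for i in range(0, len(paths), batch_size):
            batch = torch.stack([prep(Image.open(p).convert("RGB"))
                                 for p in paths[i:i + batch_size]])
            outs.append(net(batch).numpy())
    return np.concatenate(outs)

def overlapping_vectors(recipe_image_paths, food101_image_paths):
    """Exact-duplicate check: identical images produce identical prediction vectors."""
    a = prediction_vectors(recipe_image_paths)
    b = prediction_vectors(food101_image_paths)
    # Hash rounded vectors so the comparison is a set intersection rather than O(N*M).
    keys = lambda m: {tuple(np.round(row, 5)) for row in m}
    return keys(a) & keys(b)      # an empty set means no overlapping images were detected
```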
6.2 Analysis of the Learned Embedding
To gain further insight into our neural embedding, we perform a series of qualitative analysis experiments. We explore whether any semantic concepts emerge in the neuron activations and whether the embedding space has certain arithmetic properties.
Neuron Visualizations. Through neural activation visualization, we investigate whether any semantic concepts emerge in the neurons of our embedding vector despite not being explicitly trained for that purpose. We pick the top activating images, ingredient lists, and cooking instructions for a given neuron. Then we use the methodology introduced by Zhou et al. [37] to visualize the image regions that contribute the most to the activation of specific units in our learned visual embeddings. We apply the same procedure on the recipe side to obtain the ingredients and recipe instructions to which certain units react the most. Fig. 9 shows the results for the same unit in both the image and recipe embeddings. We find that certain units display localized semantic alignment between the embeddings of the two modalities.
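The selection step that precedes the visualization of [37] amounts to ranking samples by the value of a single embedding dimension. A rough sketch is shown below; the array and variable names are assumptions for illustration.

```python
import numpy as np

def top_activations(embeddings, items, unit, k=5):
    """Return the k items whose embedding value for one unit is largest.

    embeddings: (N, D) array of recipe or image embeddings
    items:      list of N identifiers (recipe ids, image paths, ...)
    unit:       index of the embedding dimension under inspection
    """
    idx = np.argsort(-embeddings[:, unit])[:k]
    return [(items[i], float(embeddings[i, unit])) for i in idx]

# Applying this to the same unit on both modalities (e.g., unit 352) and comparing the
# retrieved recipes and images reveals whether that unit encodes a shared concept.
```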
Semantic Vector Arithmetic. Different works in the literature [28], [38] have used simple arithmetic operations to demonstrate the capabilities of their learned representations. In the context of food recipes, one would expect that v(“chicken pizza”) − v(“pizza”) + v(“salad”) = v(“chicken salad”), where v represents the map into the embedding space. We demonstrate that our learned embeddings have such properties by applying the previous equation template to the averaged vectors of recipes that contain the queried words in their title. We apply this procedure in the recipe and image embedding spaces and show results in Fig. 10 and Fig. 11, respectively. Our findings suggest that the learned embeddings have semantic properties that translate to simple geometric transformations in the learned space. Furthermore, the model trained on Recipe1M+ is better able to capture these semantic properties in the embedding space; the improvement is most clearly observable in the recipe arithmetic.
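The analogy procedure can be sketched on precomputed, L2-normalized recipe embeddings as follows. The helper names and the substring-based title matching are illustrative assumptions rather than the exact query construction we used.

```python
import numpy as np

def concept_vector(embeddings, titles, word):
    """Average, then L2-normalize, the embeddings of recipes whose title mentions `word`."""
    mask = np.array([word in t.lower() for t in titles])   # assumes at least one title matches
    v = embeddings[mask].mean(axis=0)
    return v / np.linalg.norm(v)

def analogy(embeddings, titles, a, b, c, k=4):
    """Nearest neighbors of v(a) - v(b) + v(c) in the embedding space."""
    q = (concept_vector(embeddings, titles, a)
         - concept_vector(embeddings, titles, b)
         + concept_vector(embeddings, titles, c))
    q /= np.linalg.norm(q)
    sims = embeddings @ q                  # embeddings assumed L2-normalized
    return [titles[i] for i in np.argsort(-sims)[:k]]

# e.g. analogy(recipe_embeddings, recipe_titles, "chicken pizza", "pizza", "salad")
# is expected to rank "chicken salad"-like recipes near the top.
```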


Fig. 10. Analogy arithmetic results using recipe embeddings on the Recipe1M test set. On the left hand side are arithmetic results using the
model trained on Recipe1M. On the right hand side are the arithmetic results for the model trained on Recipe1M+. We represent the average vector
of a query with the images from its 4 nearest neighbors. In the case of the arithmetic result, we show the nearest neighbor only.

Among the recipe analogy examples, notice that the result for the Recipe1M+ dataset for “chicken quesadilla” - “wrap” + “rice” returns a casserole dish, while for the Recipe1M dataset we obtain a quesadilla dish. The casserole dish is much closer to the “chicken rice” result that we expect in this instance. Additionally, note how “taco” - “tortilla” + “lettuce” returns a salad for the Recipe1M model and a lettuce wrap for the Recipe1M+ model. Here, the former model is likely doing arithmetic over the ingredients in the dish: a taco without the tortilla is essentially a salad, to which lettuce is added to give a salad-like dish. The Recipe1M+ model, on the other hand, does arithmetic over higher-level semantic concepts: it returns a lettuce wrap, which is the closest analogue to a taco in which the tortilla has been substituted with lettuce. We can thus see that the Recipe1M+ model has a greater ability to capture semantic concepts in the recipe embedding space, and it also performs somewhat better in general. If we examine the results of both models for the analogy task with image embeddings, the Recipe1M+ model shows less of an improvement overall. However, we can still see differences between the two models. For instance, in the “taco” - “tortilla” + “lettuce” analogy, the Recipe1M model returns a result in which the lettuce is mixed in with other ingredients to form a salad, whereas the Recipe1M+ model returns a result in which a salad is placed on top of a large piece of lettuce. This result is similar in a way to the lettuce wrap result, as the piece of lettuce is not just mixed in with the other ingredients but acts more as an object into which other ingredients are placed. All in all, the Recipe1M+ training set allows our model to better capture high-level semantic concepts.
Fractional Arithmetic. Another type of arithmetic we examine is fractional arithmetic, in which our model interpolates between the vector representations of two concepts in the embedding space. Specifically, we examine the results for x × v(“concept 1”) + (1 − x) × v(“concept 2”), where x varies from 0 to 1. We expect this to have interesting applications in spanning the space between two food concepts, such as pasta and salad, by adjusting the value of x to make the dish more “pasta-like” or more “salad-like”, for example. We apply this procedure in the recipe and image embedding spaces and show results in Fig. 12 and Fig. 13, respectively. With both fractional image arithmetic and fractional recipe arithmetic, we hope that adjusting the fractional coefficient will allow us to explore fine-grained combinations of two concepts. However, the results are often not so fine-grained.


Fig. 11. Analogy arithmetic results using image embeddings on the Recipe1M test set. On the left hand side are arithmetic results using the
model trained on Recipe1M. On the right hand side are the arithmetic results for the model trained on Recipe1M+. We represent the average vector
of a query with the images from its four nearest neighbors. In the case of the arithmetic result, we show the nearest neighbor only.

For instance, in the “burrito” and “sandwich” example for the model trained on the Recipe1M dataset with recipe fractional arithmetic, choosing a burrito coefficient of 0 does not yield different results from choosing a coefficient of 0.5. The model trained on the Recipe1M+ dataset, on the other hand, is able to provide distinct results for each fractional coefficient value in this example. In general, though, both models are able to effectively explore the gradient of recipes or food images between two different food concepts. For instance, note the models’ results for the “curry” and “soup” examples, in both the image and recipe modalities. The most “curry-like” image tends to have some broth but is much chunkier than the most “soup-like” images. As we increase the coefficient of “soup”, we see the food becoming less chunky and more broth-like. Such examples reflect the ability of our model to explore the space between food concepts in general.
The results of our fractional arithmetic experiments suggest that the recipe and image embeddings learned in our model are semantically aligned, which opens up the possibility of applications in recipe modification (e.g., ingredient replacement, calorie adjustment) or even cross-modal generation.
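The interpolation itself reduces to a few lines over precomputed, L2-normalized embeddings; the title-averaging helper and variable names below are illustrative assumptions.

```python
import numpy as np

def concept_vector(embeddings, titles, word):
    """Average, then L2-normalize, the embeddings of recipes whose title mentions `word`."""
    mask = np.array([word in t.lower() for t in titles])
    v = embeddings[mask].mean(axis=0)
    return v / np.linalg.norm(v)

def interpolate(embeddings, titles, concept1, concept2, steps=5, k=1):
    """Retrieve nearest recipes along x * v(concept1) + (1 - x) * v(concept2), x in [0, 1]."""
    v1 = concept_vector(embeddings, titles, concept1)
    v2 = concept_vector(embeddings, titles, concept2)
    results = []
    for x in np.linspace(0.0, 1.0, steps):
        q = x * v1 + (1.0 - x) * v2
        q /= np.linalg.norm(q)
        nearest = np.argsort(-(embeddings @ q))[:k]   # embeddings assumed L2-normalized
        results.append((round(float(x), 2), [titles[i] for i in nearest]))
    return results

# interpolate(recipe_embeddings, recipe_titles, "pasta", "salad") sweeps from the most
# "salad-like" retrieval (x = 0) to the most "pasta-like" retrieval (x = 1).
```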
7 CONCLUSION
In this paper, we present Recipe1M+, the largest structured recipe dataset to date, the im2recipe problem, and neural embedding models with semantic regularization which achieve impressive results for the im2recipe task. The experiments conducted using AMT, together with the fact that we obtain comparable performance on the Recipe1M test set when training on Recipe1M+, show that the extended dataset is not much noisier. Moreover, the fact that this expansion strategy greatly helps on the Food-101 dataset demonstrates its value for generalization. Additionally, we explored the properties of the resulting recipe and food representations by evaluating different vector arithmetic operations on the learned embeddings, which hinted at the possibility of applications such as recipe modification or even cross-modal recipe generation.
More generally, the methods presented here could be gainfully applied to other “recipes” such as assembly instructions, tutorials, and industrial processes. Further, we hope that our contributions will support the creation of automated tools for food and recipe understanding and open doors to many less explored aspects of learning, such as compositional creativity and predicting visual outcomes of action sequences.


Fig. 12. Fractional arithmetic results using recipe embeddings on the Recipe1M test set. On the left hand side are arithmetic results using the
model trained on Recipe1M. On the right hand side are the arithmetic results for the model trained on Recipe1M+. For each model, we fractionally
interpolate across two example concepts (for instance, “salad” and “pasta”). We find the retrieved results for x × v(“concept 1”) + (1 − x) × v(“concept
2”), where x varies from 0 to 1.

ACKNOWLEDGMENTS
This work was supported by the CSAIL-QCRI collaboration project.

REFERENCES
[1] V. R. K. Garimella, A. Alfayad, and I. Weber, “Social media image analysis for public health,” in CHI, 2016, pp. 5543–5547.
[2] Y. Mejova, S. Abbar, and H. Haddadi, “Fetishizing food in digital age: #foodporn around the world,” in ICWSM, 2016, pp. 250–258.
[3] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.


Fig. 13. Fractional arithmetic results using image embeddings on the Recipe1M test set. On the left hand side are arithmetic results using the
model trained on Recipe1M. On the right hand side are the arithmetic results for the model trained on Recipe1M+. For each model, we fractionally
interpolate across two example concepts (for instance, “salad” and “pasta”). We find the retrieved results for x × v(“concept 1”) + (1 − x) × v(“concept
2”), where x varies from 0 to 1.

[4] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Advances in Neural Information Processing Systems, 2014, pp. 487–495.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
[6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
[8] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101 – mining discriminative components with random forests,” in European Conference on Computer Vision. Springer, 2014, pp. 446–461.
[9] C. Liu, Y. Cao, Y. Luo, G. Chen, V. Vokkarane, and Y. Ma, “Deepfood: Deep learning-based food image recognition for computer-aided dietary assessment,” in International Conference on Smart Homes and Health Telematics. Springer, 2016, pp. 37–48.
[10] A. Myers, N. Johnston, V. Rathod, A. Korattikara, A. Gorban, N. Silberman, S. Guadarrama, G. Papandreou, J. Huang, and K. Murphy, “Im2calories: Towards an automated mobile vision food diary,” in ICCV, 2015, pp. 1233–1241.


[11] F. Ofli, Y. Aytar, I. Weber, R. Hammouri, and A. Torralba, “Is saki #delicious? the food perception gap on instagram and its relation to health,” in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
[12] A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli, I. Weber, and A. Torralba, “Learning cross-modal embeddings for cooking recipes and food images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, July 2017.
[13] L. Herranz, W. Min, and S. Jiang, “Food recognition and recipe analysis: integrating visual content, context and external knowledge,” CoRR, vol. abs/1801.07239, 2018. [Online]. Available: http://arxiv.org/abs/1801.07239
[14] W. Min, S. Jiang, S. Wang, J. Sang, and S. Mei, “A delicious recipe analysis framework for exploring multi-modal recipes with various attributes,” in Proceedings of the 2017 ACM on Multimedia Conference, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 402–410. [Online]. Available: http://doi.acm.org/10.1145/3123266.3123272
[15] M. Chang, L. V. Guillain, H. Jung, V. M. Hare, J. Kim, and M. Agrawala, “Recipescape: An interactive tool for analyzing cooking instructions at scale,” in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, ser. CHI ’18. New York, NY, USA: ACM, 2018, pp. 451:1–451:12. [Online]. Available: http://doi.acm.org/10.1145/3173574.3174025
[16] M. Engilberge, L. Chevallier, P. Pérez, and M. Cord, “Finding beans in burgers: Deep semantic-visual embedding with localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018.
[17] J.-J. Chen, C.-W. Ngo, and T.-S. Chua, “Cross-modal recipe retrieval with rich food attributes,” in Proceedings of the 2017 ACM on Multimedia Conference, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 1771–1779. [Online]. Available: http://doi.acm.org/10.1145/3123266.3123428
[18] J.-J. Chen, C.-W. Ngo, F.-L. Feng, and T.-S. Chua, “Deep understanding of cooking procedure for cross-modal recipe retrieval,” in Proceedings of the 26th ACM International Conference on Multimedia, ser. MM ’18. New York, NY, USA: ACM, 2018, pp. 1020–1028. [Online]. Available: http://doi.acm.org/10.1145/3240508.3240627
[19] M. Carvalho, R. Cadène, D. Picard, L. Soulier, N. Thome, and M. Cord, “Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings,” in Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’18. New York, NY, USA: ACM, 2018.
[20] Y. Kawano and K. Yanai, “Foodcam: A real-time food recognition system on a smartphone,” Multimedia Tools and Applications, vol. 74, no. 14, pp. 5263–5287, 2015.
[21] R. Xu, L. Herranz, S. Jiang, S. Wang, X. Song, and R. Jain, “Geolocalized modeling for dish recognition,” IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1187–1199, 2015.
[22] T. Kusmierczyk, C. Trattner, and K. Norvag, “Understanding and predicting online food recipe production patterns,” in HyperText, 2016.
[23] X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso, “Recipe recognition with large multimodal food dataset,” in ICME Workshops, 2015, pp. 1–6.
[24] J.-J. Chen and C.-W. Ngo, “Deep-based ingredient recognition for cooking recipe retrieval,” in ACM Multimedia, 2016.
[25] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, Apr. 2018.
[26] US Department of Agriculture, Agricultural Research Service, Nutrient Data Laboratory, “USDA national nutrient database for standard reference, release 27,” May 2015. [Online]. Available: http://www.ars.usda.gov/ba/bhnrc/ndl
[27] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[28] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” CoRR, vol. abs/1301.3781, 2013.
[29] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NIPS, 2014, pp. 3104–3112.
[30] R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler, “Skip-thought vectors,” in NIPS, 2015, pp. 3294–3302.
[31] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba, “Learning aligned cross-modal representations from weakly aligned data,” in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2016.
[32] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba, “Cross-modal scene networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 10, pp. 2303–2314, 2018. [Online]. Available: https://doi.org/10.1109/TPAMI.2017.2753232
[33] Q. V. Le and T. Mikolov, “Distributed representations of sentences and documents,” arXiv preprint arXiv:1405.4053, 2014.
[34] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” arXiv preprint arXiv:1310.1531, 2013.
[35] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[36] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[37] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene CNNs,” International Conference on Learning Representations, 2015.
[38] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.

Javier Marín received the B.Sc. degree in Mathematics at the Universitat de les Illes Balears in 2007. In June 2013 he received his Ph.D. in computer vision at the Universitat Autònoma de Barcelona. In 2017 he was a postdoctoral research associate at the Massachusetts Institute of Technology (MIT). Before that, he worked as an algorithm development engineer in the automotive sector, and as a researcher and project manager in both the neuroscience and space fields. He currently combines working in the private sector as a senior data scientist at Satellogic Solutions with being a research affiliate at MIT. His research interests lie mainly in the area of computer vision and machine learning, focusing recently on cross-modal learning, object recognition and semantic segmentation.

Aritro Biswas received a Bachelor’s degree in Computer Science at the Massachusetts Institute of Technology (MIT). He received his Master’s degree in Computer Science at MIT. Recently, his research has focused on using computer vision for two applications: (i) understanding the content of food images and (ii) disaster recognition for images of humanitarian disasters.

Ferda Ofli (S’07–M’11–SM’18) received the B.Sc. degrees in both electrical and electronics engineering and computer engineering, and the Ph.D. degree in electrical engineering from Koç University, Istanbul, Turkey, in 2005 and 2010, respectively. From 2010 to 2014, he was a Postdoctoral Researcher at the University of California, Berkeley, CA, USA. He is currently a Scientist at the Qatar Computing Research Institute (QCRI), part of Hamad Bin Khalifa University (HBKU). His research interests cover computer vision, machine learning, and multimedia signal processing. He is an IEEE and ACM senior member with over 45 publications in refereed conferences and journals including CVPR, WACV, TMM, JBHI, and JVCI. He won the Elsevier JVCI best paper award in 2015, and the IEEE SIU best student paper award in 2011. He also received the Graduate Studies Excellence Award in 2010 for outstanding academic achievement at Koç University.


Nicholas Hynes is a graduate student at the


University of California Berkeley and research
scientist at Oasis Labs. His research interests
are generally in the domain of efficient machine
learning on shared private data.

Amaia Salvador is a PhD candidate at Universitat Politècnica de Catalunya under the advisement of Professor Xavier Giró and Professor Ferran Marqués. She obtained her B.S. in Audiovisual Systems Engineering from Universitat Politècnica de Catalunya in 2013, after completing her thesis on interactive object segmentation at the ENSEEIHT Engineering School in Toulouse. She holds a M.S. in Computer Vision from Universitat Autònoma de Barcelona. She spent the summer
of 2014 at the Insight Centre for Data Analytics
in the Dublin City University, where she worked on her master thesis on
visual instance retrieval. In 2015 and 2016 she was a visiting student at
the National Institute of Informatics and the Massachusetts Institute of
Technology, respectively. During the summer of 2018, she interned at
Facebook AI Research in Montreal.

Yusuf Aytar is a Research Scientist at DeepMind


since July 2017. He was a post-doctoral research
associate at Massachusetts Institute of Technol-
ogy (MIT) between 2014-2017. He received his
D.Phil. degree from University of Oxford. As a
Fulbright scholar, he obtained his M.Sc. degree
from University of Central Florida (UCF), and
his B.E. degree in Computer Engineering from Ege
University. His research is mainly concentrated on
computer vision, machine learning, and transfer
learning.

Ingmar Weber is the Research Director for So-


cial Computing at the Qatar Computing Research
Institute (QCRI). As an undergraduate Ingmar
studied mathematics at Cambridge University,
before pursuing a PhD at the Max-Planck In-
stitute for Computer Science. He subsequently
held positions at the École Polytechnique Fédérale de Lausanne (EPFL) and Yahoo Research
Barcelona, as well as a visiting researcher posi-
tion at Microsoft Research Cambridge. He is an
ACM, IEEE and AAAI senior member.

Antonio Torralba received the degree in


telecommunications engineering from Telecom
BCN, Barcelona, Spain, in 1994 and the Ph.D.
degree in signal, image, and speech process-
ing from the Institut National Polytechnique de
Grenoble, France, in 2000. From 2000 to 2005,
he spent postdoctoral training at the Brain and
Cognitive Science Department and the Computer
Science and Artificial Intelligence Laboratory, MIT.
He is now a Professor of Electrical Engineering
and Computer Science at the Massachusetts
Institute of Technology (MIT). Prof. Torralba is an Associate Editor of the
International Journal of Computer Vision, and has served as program
chair for the Computer Vision and Pattern Recognition conference in 2015.
He received the 2008 National Science Foundation (NSF) Career award,
the best student paper award at the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) in 2009, and the 2010 J. K. Aggarwal
Prize from the International Association for Pattern Recognition (IAPR).

