Abstract—In this paper, we introduce Recipe1M+, a new large-scale, structured corpus of over one million cooking recipes and 13
million food images. As the largest publicly available collection of recipe data, Recipe1M+ affords the ability to train high-capacity models
on aligned, multimodal data. Using these data, we train a neural network to learn a joint embedding of recipes and images that yields
impressive results on an image-recipe retrieval task. Moreover, we demonstrate that regularization via the addition of a high-level
classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic. We postulate
that these embeddings will provide a basis for further exploration of the Recipe1M+ dataset and food and cooking in general. Code, data
and models are publicly available.
1 INTRODUCTION
In this work, we address data limitations by introducing the large-scale Recipe1M+ dataset, which contains one million structured cooking recipes and their images. Additionally, to demonstrate its utility, we present the im2recipe retrieval task, which leverages the full dataset (images and text) to solve the practical and socially relevant problem of demystifying the creation of a dish that can be seen but not necessarily described. To this end, we have developed a multimodal neural model which jointly learns to embed images and recipes in a common space which is semantically regularized by the addition of a high-level classification task. The performance of the resulting embeddings is thoroughly evaluated against baselines and humans, showing remarkable improvement over the former while faring comparably to the latter. With the release of Recipe1M+, we hope to spur advancement on not only the im2recipe task but also heretofore unimagined objectives which require a deep understanding of the domain and its modalities.

1.1 Related Work

Since we presented our initial work on the topic back in 2017 [12], several related studies have been published and we feel obliged to provide a brief discussion about them.

Herranz et al. [13], besides providing a detailed description of recent work focusing on food applications, propose an extended multimodal framework that relies on food imagery, recipe and nutritional information, geolocation and time, restaurant menus and food styles. In another study, Min et al. [14] present a multi-attribute theme modeling (MATM) approach that incorporates food attributes such as cuisine style, course type, flavors or ingredient types. Then, similar to our work, they train a multimodal embedding which learns a common space between the different food attributes and the corresponding food image. The most interesting applications of their model include flavor analysis, region-oriented food summaries, and recipe recommendation. In order to build their model, they collect all their data from a single data source, Yummly (https://www.yummly.com/), which is an online recipe recommendation system.

In another interesting study, Chang et al. [15] focus on analyzing several possible preparations of a single dish, like "chocolate chip cookie." The authors design an interface that allows users to explore the similarities and differences between such recipes by visualizing the structural similarity between recipes as points in a space, in which clusters are formed according to how similar recipes are. Furthermore, they examine how cooking instructions overlap between two recipes to measure recipe similarity. Our work is of a different flavor, as the features they use to measure similarity are manually picked by humans, while ours are automatically learned by a multimodal network.

Getting closer to the information retrieval domain, Engilberge et al. [16] examine the problem of retrieving the best matching caption for an image. In order to do so, they use neural networks to create embeddings for each caption, and retrieve the one whose embedding most closely matches the embedding of the original image. In our work, we also aim to use embeddings to retrieve the recipe matching an image, or vice versa. However, since our domain involves cooking recipes while theirs only involves captions, we account for two separate types of text (ingredients and cooking instructions) and combine them in a different way in our model.

Alternatively, Chen et al. [17] study the task of retrieving a recipe matching a corresponding food image in a slightly different way. The authors find that, although ingredient composition is important to the appearance of food, other attributes such as the manner of cutting and manner of cooking ingredients also play a role in forming the food's appearance. Given a food image, they attempt to predict ingredient, cutting and cooking attributes, and use these predictions to help retrieve the correct corresponding recipe. With our model, we attempt to retrieve the recipe directly, without separately predicting attributes such as ingredients or cutting and cooking methods. Furthermore, along with retrieving the recipe matching an image, our model also allows retrieving the image matching a corresponding recipe.

The two most relevant studies to the current one are presented in [18] and [19]. Different from our work, Chen et al. [18] approach the image-to-recipe retrieval problem from the perspective of attention modeling, where they incorporate word-level and sentence-level attentions into their recipe representation and align them with the corresponding image representation such that both text and visual features have high similarity in a multi-dimensional space. Another difference is that they employ a rank loss instead of a pairwise similarity loss as we do. These improvements effectively lead to slight performance increases in both image-to-recipe and recipe-to-image retrieval tasks.

On the other hand, building upon the same network architecture as in our original work [12] to represent the image and text (recipe) modalities, Carvalho et al. [19] improve our initial results further by proposing a new objective function that combines retrieval and classification tasks in a double-triplet learning scheme. This new scheme captures both instance-based (i.e., fine-grained) and semantic-based (i.e., high-level) structure simultaneously in the latent space, since the semantic information is directly injected into the cross-modal metric learning problem, as opposed to our use of a classification task as semantic regularization. Additionally, they follow an adaptive training strategy to account for the vanishing gradient problem of the triplet losses, and they use the MedR score instead of the original loss in the validation phase for early stopping. We also find that using the MedR score as the performance measure in the validation phase is more stable. However, our work is orthogonal to both of these studies, i.e., their performance can be further improved with the use of our expanded dataset, and the quality of their embeddings can be further explored with the various arithmetics presented in this paper.

The rest of the paper is organized as follows. In Section 2, we introduce our large-scale, multimodal cooking recipe dataset and provide details about its collection process. We describe our recipe and image representations in Section 3 and present our neural joint embedding model in Section 4. Then, in Section 5, we discuss our semantic regularization approach to enhance our joint embedding model. In Section 6, we present results from our various experiments, and we conclude the paper in Section 7.

2 DATASET

Due to their complexity, textually and visually (e.g., ingredient-based variants of the same dish, different presentations, or multiple ways of cooking a recipe), understanding food recipes demands a large, general collection of recipe data. Hence, it should not be surprising that the lack of a larger body of work on the topic could be the result of missing such a collection. To our knowledge, practically all the datasets publicly available in the research field either contain only categorized images [8], [10], [20], [21] or simply recipe text [22]. Only recently have a few datasets been released that include both recipes and images. For instance, Wang
for preparing a dish; all of these data are provided as free text. Additional fields such as unit and quantity are also available in this layer. In cases where we were unable to extract the unit and quantity from the ingredient description, these two fields were simply left empty for the corresponding ingredient. Nutritional information (i.e., total energy, protein, sugar, fat, saturates, and salt content) is only added for those recipes that contained both units and quantities, as described in Section 2.3. FSA traffic lights are also available for such recipes. The second layer (i.e., Layer 2) builds upon the first layer and includes all images with which the recipe is associated; these images are provided as RGB in JPEG format. Additionally, a subset of recipes are annotated with course labels (e.g., appetizer, side dish, or dessert), the prevalence of which is summarized in Fig. 5. For Recipe1M+, we provide the same Layer 1 as described above with different partition assignments, and Layer 2 including the 13M images.
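To make the two-layer organization above concrete, the following is a minimal illustrative sketch (as a Python dictionary) of how a single recipe entry could look; the field names and values are ours for illustration, not the exact Recipe1M+ schema.

```python
# Hypothetical Layer 1 entry (free-text recipe data plus optional nutrition);
# field names are illustrative, not the released schema.
layer1_entry = {
    "id": "recipe_000001",
    "title": "Chocolate chip cookies",
    "ingredients": [{"text": "2 cups flour", "quantity": "2", "unit": "cup"},
                    {"text": "butter, softened", "quantity": "", "unit": ""}],
    "instructions": [{"text": "Preheat oven to 350 F."},
                     {"text": "Combine all ingredients."}],
    "nutrition": {"energy": 210, "protein": 2.5, "sugar": 12.0,
                  "fat": 9.0, "saturates": 5.0, "salt": 0.2},
    "fsa_lights": {"sugar": "red", "salt": "green", "saturates": "amber", "fat": "amber"},
    "partition": "train",
}

# Hypothetical Layer 2 entry: the images associated with the same recipe id.
layer2_entry = {"id": "recipe_000001",
                "images": ["recipe_000001_0.jpg", "recipe_000001_1.jpg"]}
```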
Fig. 3. Embedding visualization using t-SNE. The legend depicts the recipes that belong to the top 12 semantic categories used in our semantic regularization (see Section 5 for more details).

Fig. 4. Healthiness within the embedding. Recipe health is represented within the embedding visualization in terms of sugar, salt, saturates, and fat. We follow the FSA traffic light system to determine how healthy a recipe is.

2.5 Analysis

Recipe1M (and hence Recipe1M+) includes approximately 0.4% duplicate recipes and, excluding those duplicate recipes, 20% of recipes have non-unique titles but symmetrically differ by a median of 16 ingredients. 0.2% of recipes share the same ingredients but are relatively simple (e.g., spaghetti, or granola), having a median of six ingredients. Approximately half of the recipes did not have any images in the initial data collection from recipe websites. However, after the data extension phase, only around 2% of the recipes are left without any associated images. Regarding the experiments, we carefully removed any exact duplicates or recipes sharing the same image in order to avoid overlap between the training and test sets. As detailed earlier in Table 1, around 70% of the data is labeled as training, and the remainder is split equally between the validation and test sets. During the dataset extension, as we mentioned earlier, we also created an intersection dataset in order to have a fair comparison of the experimental results on both the initial and the extended versions of the dataset.

According to Fig. 5, the average recipe in the dataset consists of nine ingredients which are transformed over the course of ten instructions. One can also observe that the distributions of data are heavy tailed. For instance, of the 16k ingredients identified as unique (in terms of phrasing), only 4,000 account for 95% of occurrences. At the low end of the instruction count, particularly recipes with one step, one will find the dreaded "Combine all ingredients." At the other end are lengthy recipes and ingredient lists associated with recipes that include sub-recipes.

A similar issue of outliers also exists for images: as several of the included recipe collections curate user-submitted images, popular recipes like chocolate chip cookies have orders of magnitude more images than the average. Notably, the number of unique recipes that came with associated food images in the initial data collection phase was 333K, whilst after the data extension phase this number reached more than 1M recipes. On average, the Recipe1M+ dataset contains 13 images per recipe whereas Recipe1M has less than one image per recipe, 0.86 to be exact. Fig. 5 also depicts the images vs. recipes histogram for Recipe1M+, where over half a million recipes contain more than 12 images each.

To further evaluate the quality of the match between the queried images and the recipes, we performed an experiment on the Amazon Mechanical Turk (AMT) platform (http://mturk.com). We randomly picked 3,455 recipes, containing at most ten ingredients and ten instructions, from the pool of recipes with non-unique titles. Then, for each one of these recipes, we showed AMT workers a pair of images and asked them to choose which image, A or B, was the best match for the corresponding recipe. The workers also had the options of selecting 'both images' or 'none of them'. Image A and image B were randomly chosen: one from the original recipe (i.e., Recipe1M) images and the other one from the queried images collected during the dataset expansion for the corresponding recipe title. We also changed the order of image A and image B randomly. We explicitly asked the workers to check all the ingredients and instructions. Only master workers were selected for this experiment. Out of 3,455 recipes, the workers chose the original recipe image 971 times (28.1%), the queried one 821 times (23.8%), both of them 1,581 times (45.8%), and none of them 82 times (2.4%). Given that the difference between the original recipe image and the queried image is less than 5%, these results show that the extended dataset is not much noisier than the original Recipe1M.
3 LEARNING EMBEDDINGS

In this section, we describe our neural joint embedding model. Here, we utilize the paired (recipe and image) data in order to learn a common embedding space, as illustrated in Fig. 1. Next, we discuss the recipe and image representations, and then we describe our neural joint embedding model that builds upon these representations.
Fig. 5. Dataset statistics. Prevalence of course categories and number of instructions, ingredients and images per recipe in Recipe1M+.
as gradients are diminished over the many time steps. Instead, we propose a two-stage LSTM model which is designed to encode a sequence of sequences. First, each instruction/sentence is represented as a skip-instructions vector, and then an LSTM is trained over the sequence of these vectors to obtain the representation of all instructions. The resulting fixed-length representation is fed into our joint embedding model (see the instructions encoder in Fig. 6).

Skip-instructions. Our cooking instruction representation, referred to as skip-instructions, is the product of a sequence-to-sequence model [29]. Specifically, we build upon the technique of skip-thoughts [30], which encodes a sentence and uses that encoding as context when decoding/predicting the previous and next sentences (see Fig. 7). Our modifications to this method include adding start- and end-of-recipe "instructions" and using an LSTM instead of a GRU. In either case, the representation of a single instruction is the final output of the encoder. As before, this is used as the instructions input to our embedding model.
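As a rough illustration of the skip-instructions idea, the sketch below pairs an LSTM encoder with an LSTM decoder that predicts the tokens of a neighboring instruction; the vocabulary size, dimensions, and single-direction decoding are simplifying assumptions of ours, not the exact training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipInstructions(nn.Module):
    """Sketch of a skip-thoughts-style seq2seq over instructions: encode one
    instruction and use that code to predict the tokens of the next one."""

    def __init__(self, vocab=30000, emb=300, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, cur_tokens, next_tokens):
        # Encode the current instruction; the final hidden state is the
        # fixed-length instruction representation used downstream.
        _, (h, c) = self.encoder(self.embed(cur_tokens))
        # Condition the decoder on that state to predict the next instruction
        # (teacher forcing with shifted targets).
        dec_out, _ = self.decoder(self.embed(next_tokens[:, :-1]), (h, c))
        logits = self.out(dec_out)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               next_tokens[:, 1:].reshape(-1))
        return h.squeeze(0), loss  # (instruction vector, training loss)
```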
3.2 Representation of Food Images

For the image representation we adopt two major state-of-the-art deep convolutional networks, namely the VGG-16 [6] and ResNet-50 [7] models. In particular, deep residual networks have a proven record of success on a variety of benchmarks [7]. Although [6] suggests training very deep networks with small convolutional filters, deep residual networks take it to another level using ubiquitous identity mappings that enable training of much deeper architectures (e.g., with 50, 101, or 152 layers) with better performance. We incorporate these models by removing the last softmax classification layer and connecting the rest to our joint embedding model, as shown on the right side of Fig. 6.
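For instance, with an off-the-shelf ResNet-50 this truncation can be sketched as follows; this is a simplified illustration using torchvision, not the exact integration into the joint model.

```python
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet50(pretrained=True)   # ImageNet-pretrained backbone
resnet.fc = nn.Identity()                   # drop the final softmax classification layer

images = torch.randn(4, 3, 224, 224)        # dummy batch of food images
with torch.no_grad():
    image_features = resnet(images)         # (4, 2048) pooled visual features
print(image_features.shape)
```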
4 JOINT NEURAL EMBEDDING

the sequence of $n_k$ cooking instructions, and $\{g_k^t\}_{t=1}^{m_k}$ is the sequence of $m_k$ ingredient tokens. The objective is to maximize the cosine similarity between positive recipe-image pairs and minimize it between all non-matching recipe-image pairs, up to a specified margin.

The ingredients encoder is implemented using a bi-directional LSTM: at each time step it takes two ingredient word2vec representations, $g_k^t$ and $g_k^{m_k - t + 1}$, and eventually it produces the fixed-length representation $h_k^g$ for the ingredients. The instructions encoder is implemented through a regular LSTM. At each time step it receives an instruction representation from the skip-instructions encoder, and finally it produces the fixed-length representation $h_k^s$. $h_k^g$ and $h_k^s$ are concatenated in order to obtain the recipe representation $h_k^r$. On the image side, the image encoder simply produces the fixed-length representation $h_k^v$. Then, the recipe and image representations are mapped into the joint embedding space as $\phi^r = W^r h_k^r + b^r$ and $\phi^v = W^v h_k^v + b^v$, respectively. Note that $W^r$ and $W^v$ are embedding matrices which are also learned. Finally, the complete model is trained end-to-end with positive and negative recipe-image pairs $(\phi^r, \phi^v)$ using the cosine similarity loss with margin, defined as follows:

$$L_{\cos}(\phi^r, \phi^v, y) = \begin{cases} 1 - \cos(\phi^r, \phi^v), & \text{if } y = 1 \\ \max(0, \cos(\phi^r, \phi^v) - \alpha), & \text{if } y = -1 \end{cases}$$

where $\cos(\cdot)$ is the normalized cosine similarity and $\alpha$ is the margin.
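A minimal PyTorch-style sketch of the encoders and loss just described is given below; all dimensions, the margin value, and variable names are placeholders rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Sketch of the joint recipe-image embedding: a bi-directional LSTM over
    ingredient word2vec vectors, a regular LSTM over skip-instruction vectors,
    and learned projections of both modalities into a shared space."""

    def __init__(self, w2v_dim=300, instr_dim=1024, hid=512, emb_dim=1024, img_dim=2048):
        super().__init__()
        self.ingr_lstm = nn.LSTM(w2v_dim, hid, batch_first=True, bidirectional=True)
        self.instr_lstm = nn.LSTM(instr_dim, hid, batch_first=True)
        self.recipe_proj = nn.Linear(3 * hid, emb_dim)   # W_r applied to [h_g; h_s]
        self.image_proj = nn.Linear(img_dim, emb_dim)    # W_v applied to CNN features

    def forward(self, ingr_vecs, instr_vecs, img_feats):
        _, (h_g, _) = self.ingr_lstm(ingr_vecs)          # h_g: (2, B, hid)
        h_g = torch.cat([h_g[0], h_g[1]], dim=1)         # (B, 2*hid)
        _, (h_s, _) = self.instr_lstm(instr_vecs)        # h_s: (1, B, hid)
        phi_r = self.recipe_proj(torch.cat([h_g, h_s[0]], dim=1))
        phi_v = self.image_proj(img_feats)
        return phi_r, phi_v

def cosine_margin_loss(phi_r, phi_v, y, alpha=0.1):
    """y = +1 for matching recipe-image pairs, -1 otherwise; the margin value
    alpha here is a placeholder."""
    cos = F.cosine_similarity(phi_r, phi_v, dim=1)
    return torch.where(y == 1, 1.0 - cos, torch.clamp(cos - alpha, min=0.0)).mean()
```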
Fig. 6. Joint neural embedding model with semantic regularization. Our model learns a joint embedding space for food images and cooking
recipes.
Fig. 7. Skip-instructions model. During training the encoder learns to predict the next instruction.
5 SEMANTIC REGULARIZATION

We incorporate additional regularization on our embedding through solving the same high-level classification problem in multiple modalities with shared high-level weights. We refer to this method as semantic regularization. The key idea is that if the high-level discriminative weights are shared, then both modalities (recipe and image embeddings) should utilize these weights in a similar way, which brings another level of alignment based on discrimination. We optimize this objective together with our joint embedding loss. Essentially, the model also learns to classify any image or recipe embedding into one of the food-related semantic categories. We limit the effect of semantic regularization as it is not the main problem that we aim to solve.

Semantic Categories. We start by assigning Food-101 categories to those recipes that contain them in their title. However, after this procedure we are only able to annotate 13% of our dataset, which we argue is not enough labeled data for a good regularization. Hence, we compose a larger set of semantic categories purely extracted from recipe titles. We first obtain the top 2,000 most frequent bigrams in recipe titles from our training set. We manually remove those that contain unwanted characters (e.g., n', !, ? or &) and those that do not have discriminative food properties (e.g., best pizza, super easy or 5 minutes). We then assign each of the remaining bigrams as the semantic category to all recipes that include it in their title. By using bigrams and Food-101 categories together we obtain a total of 1,047 categories, which cover 50% of the dataset. Chicken salad, grilled vegetable, chocolate cake and fried fish are some examples of the categories we collect using this procedure. All those recipes without a semantic category are assigned to an additional background class. Although there is some overlap in the generated categories, 73% of the recipes in our dataset (excluding those in the background class) belong to a single category (i.e., only one of the generated classes appears in their title). For recipes where two or more categories appear in the title, the category with the highest frequency rate in the dataset is
chosen.
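A simplified sketch of this bigram-based category construction might look as follows; the thresholds, filter lists, and helper names are illustrative assumptions, not the exact pipeline.

```python
from collections import Counter

def build_semantic_categories(titles, food101_categories, top_k=2000):
    """Collect the most frequent title bigrams and merge them with the
    Food-101 category names (filtering here is only sketched)."""
    bigram_counts = Counter()
    for title in titles:
        words = title.lower().split()
        bigram_counts.update(zip(words, words[1:]))
    candidates = [" ".join(bg) for bg, _ in bigram_counts.most_common(top_k)]
    # Manually curated filters (unwanted characters, non-food bigrams) would be
    # applied here; we only show a simple character filter.
    candidates = [c for c in candidates if not any(ch in c for ch in "!?&'")]
    return set(food101_categories) | set(candidates)

def assign_category(title, categories, frequency):
    """A recipe gets the matching category with the highest dataset frequency,
    or a background class if no category appears in its title."""
    matches = [c for c in categories if c in title.lower()]
    return max(matches, key=lambda c: frequency.get(c, 0)) if matches else "background"
```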
Classification. To incorporate semantic regularization into the joint embedding, we use a single fully connected layer. Given the embeddings $\phi^v$ and $\phi^r$, class probabilities are obtained with $p^r = W^c \phi^r$ and $p^v = W^c \phi^v$, followed by a softmax activation. $W^c$ is the matrix of learned weights, which are shared between the image and recipe embeddings to promote semantic alignment between them. Formally, we express the semantic regularization loss as $L_{reg}(\phi^r, \phi^v, c^r, c^v)$, where $c^r$ and $c^v$ are the semantic category labels for the recipe and image, respectively. Note that $c^r$ and $c^v$ are the same if $(\phi^r, \phi^v)$ is a positive pair. Then, we can write the final objective as:
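The equation itself is not included in this excerpt. Assuming the semantic term is simply added to the cosine loss with a trade-off weight $\lambda$, the combined objective presumably takes a form such as

$$L(\phi^r, \phi^v, c^r, c^v, y) = L_{\cos}(\phi^r, \phi^v, y) + \lambda \, L_{reg}(\phi^r, \phi^v, c^r, c^v),$$

where $L_{reg}$ would be the cross-entropy of the shared classifier evaluated on both modalities; the exact weighting used by the authors is not given here.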
TABLE 3
Im2recipe retrieval comparisons on Recipe1M. Median ranks (medR) and recall rates at top K (R@K) are reported for the baselines and our method. Note that the joint neural embedding models consistently outperform all the baseline methods.

Method                                                             im2recipe                        recipe2im
                                                                   medR   R@1    R@5    R@10       medR   R@1    R@5    R@10
random ranking                                                     500    0.001  0.005  0.01       500    0.001  0.005  0.01
CCA w/ skip-thoughts + word2vec (GoogleNews) + image features      25.2   0.11   0.26   0.35       37.0   0.07   0.20   0.29
CCA w/ skip-instructions + ingredient word2vec + image features    15.7   0.14   0.32   0.43       24.8   0.09   0.24   0.35
joint emb. only                                                    7.2    0.20   0.45   0.58       6.9    0.20   0.46   0.58
joint emb. + semantic                                              5.2    0.24   0.51   0.65       5.1    0.25   0.52   0.65
attention + SR [18]                                                4.6    0.26   0.54   0.67       4.6    0.26   0.54   0.67
AdaMine [19]                                                       1.0    0.40   0.69   0.77       1.0    0.40   0.68   0.79
image features utilized in the CCA baselines are the ResNet-50 features before the softmax layer. Although they are learned for visual object categorization tasks on the ImageNet dataset, these features are widely adopted by the computer vision community, and they have been shown to generalize well to different visual recognition tasks [34].

For evaluation, given a test query image, we use cosine similarity in the common space for ranking the relevant recipes and perform im2recipe retrieval. The recipe2im retrieval setting is evaluated likewise. We adopt the test procedure from the image2caption retrieval task [35], [36]. We report results on a subset of randomly selected 1,000 recipe-image pairs from the test set. We repeat the experiments 10 times and report the mean results. We report the median rank (medR) and the recall rate at top K (R@K) for all the retrieval experiments. To clarify, R@5 in the im2recipe task represents the percentage of all the image queries where the corresponding recipe is retrieved in the top 5; hence, higher is better. The quantitative results for im2recipe retrieval are shown in Table 3.
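As a concrete illustration of this protocol, a NumPy sketch of medR and R@K over one 1K sample could look as follows; the sampling and repetition loop are omitted, and array names are ours.

```python
import numpy as np

def retrieval_metrics(img_emb, rec_emb, ks=(1, 5, 10)):
    """Rank the candidate recipes by cosine similarity to each query image and
    report the median rank and recall@K. img_emb and rec_emb are (N, D) arrays
    whose i-th rows form a matching pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    rec = rec_emb / np.linalg.norm(rec_emb, axis=1, keepdims=True)
    sims = img @ rec.T                          # (N, N) cosine similarities
    order = np.argsort(-sims, axis=1)           # best match first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(order))])
    medR = float(np.median(ranks))
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    return medR, recall
```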
Our model outperforms the CCA baselines in all measures. As expected, CCA over ingredient word2vec and skip-instructions performs better than CCA over word2vec trained on GoogleNews [28] and skip-thoughts vectors that are learned over a large-scale book corpus [30]. In 65% of all evaluated queries, our method can retrieve the correct recipe given a food image. The semantic regularization notably improves the quality of our embedding for the im2recipe task, which is quantified by the medR drop from 7.2 to 5.2 in Table 3. The results for the recipe2im task are also similar to those in the im2recipe retrieval setting.

Table 3 also presents results originally reported in [18] and [19] on Recipe1M. The attention-based modeling of [18] achieves slight performance increases, whereas the double-triplet learning scheme of [19] leads to larger performance gains in both retrieval settings. Since neither [18] nor [19] made their code publicly available, we could not evaluate their algorithms on our datasets for further comparative analyses.

Fig. 8 compares the ingredients from the original recipes (true recipes) with the retrieved recipes (coupled with their corresponding image) for different image queries. As can be observed in Fig. 8, our embeddings generalize well and allow overall satisfactory recipe retrieval results. However, at the ingredient level, one can find that in some cases our model retrieves recipes with missing ingredients. This usually occurs due to the lack of fine-grained features (e.g., confusion between shrimp and salmon) or simply because the ingredients are not visible in the query image (e.g., blueberries in a smoothie or beef in a lasagna).

Ablation Studies. We also analyze the effect of each component of our model in several optimization stages. The results are reported in Table 4. Note that here we also report medR with 1K, 5K and 10K random selections to show how the results scale in larger retrieval problems. As expected, visual features from the ResNet-50 model show a substantial improvement in retrieval performance when compared to VGG-16 features. Even with "fixed vision" networks, the joint embedding achieved 7.9 medR using the ResNet-50 architecture. Further "fine-tuning" of the vision networks slightly improves the results. Although it becomes much harder to decrease the medR at small values, additional "semantic regularization" improves the medR in both cases.

Comparison with Human Performance. In order to better assess the quality of our embeddings, we also evaluate the performance of humans on the im2recipe task. The experiments are performed through AMT. For quality purposes, we require each AMT worker to have at least a 97% approval rate and to have performed at least 500 tasks before our experiment. In a single evaluation batch, we first randomly choose 10 recipes and their corresponding images. We then ask an AMT worker to choose the correct recipe, out of the 10 provided recipes, for the given food image. This multiple choice selection task is performed 10 times for each food image in the batch. The accuracy of an evaluation batch is defined as the percentage of image queries correctly assigned to their corresponding recipe.

The evaluations are performed for three levels of difficulty. The batches (of 10 recipes) are randomly chosen from either all the test recipes (easy), recipes sharing the same course (e.g., soup, salad, or beverage; medium), or recipes sharing the name of the dish (e.g., salmon, pizza, or ravioli; hard). As expected, for our model as well as the AMT workers, the accuracies decrease as tasks become more specific. In both coarse and fine-grained tests, our method performs comparably to or better than the AMT workers. As hypothesized, semantic regularization further improves the results (see Table 5).

In the "all recipes" condition, 25 random evaluation batches (25 × 10 individual tasks in total) are selected from the entire test set. Joint embedding with semantic regularization performs the best, with a 3.2 percentage point improvement over average human accuracy. For the course-specific tests, 5 batches are randomly selected within each given meal course. Although, on average, our joint embedding's performance is slightly lower than the humans', with semantic regularization our joint embedding surpasses human performance by 6.8 percentage points. In dish-specific tests, five random batches are selected if they have the dish name (e.g., pizza) in their title. With slightly lower accuracies in general, dish-specific results also show similar behavior. Particularly for the "beverage" and "smoothie" results, human performance is better
TABLE 4
Ablation studies on Recipe1M. Effect of the different model components on the median rank, medR (lower is better).

                                          im2recipe                             recipe2im
            Joint emb. method             medR-1K   medR-5K   medR-10K          medR-1K   medR-5K   medR-10K
VGG-16      fixed vision                  15.3      71.8      143.6             16.4      76.8      152.8
            finetuning (ft)               12.1      56.1      111.4             10.5      51.0      101.4
            ft + semantic reg.            8.2       36.4      72.4              7.3       33.4      64.9
ResNet-50   fixed vision                  7.9       35.7      71.2              9.3       41.9      83.1
            finetuning (ft)               7.2       31.5      62.8              6.9       29.8      58.8
            ft + semantic reg.            5.2       21.2      41.9              5.1       20.2      39.2
TABLE 5
Comparison with human performance on the im2recipe task on Recipe1M. The mean results are highlighted in bold for better visualization. Note that on average our method with semantic regularization performs better than the average AMT worker.
TABLE 6
Comparison between models trained on Recipe1M vs. Recipe1M+. Median ranks and recall rate at top K are reported for both models. They
have similar performance on the Recipe1M test set in terms of medR and R@K. However, when testing on the Recipe1M+ test set, the model trained
on Recipe1M+ yields significantly better medR and better R@5 and R@10 scores. In this table, Recipe1M refers to the intersection dataset.
Fig. 9. Localized unit activations. We find that ingredient detectors emerge in different units in our embeddings, which are aligned across modalities
(e.g., unit 352: “cream”, unit 22: “sponge cake” or unit 571: “steak”).
containing 101 food categories and 1,000 images for each one of these 101 food categories, totaling 101,000 images.

Our method of evaluation involves randomly sampling an image and a recipe corresponding to each of the Food-101 categories. The images are taken from the Food-101 dataset, while the recipes are taken from the test partition of the intersection dataset. Here, a recipe is considered to belong to a category only if the recipe title string matches the Food-101 category. We only sample images and recipes from those categories that correspond to at least N recipes among the test recipes that we sample from.

After sampling an image and a corresponding recipe for each category that is common enough, we evaluate our models on the retrieval task. In the im2recipe direction, we provide our model with the image and expect it to retrieve the corresponding recipe. In the recipe2im direction, we provide our model with the recipe and expect it to retrieve the corresponding image. We show the retrieval results of both models in Table 7. Note that the model trained on Recipe1M+ consistently outperforms the model trained on Recipe1M.

One possible explanation for the Recipe1M+ dataset giving an advantage on the Food-101 task is that there might be an overlap between the images used to train the model on Recipe1M+ and the Food-101 images. Further, it is possible that there might be images in the Recipe1M+ training set that overlap with the Food-101 dataset but are not in the initial training set. This would give the model trained on Recipe1M+ an unfair advantage. We perform the following procedure to test whether this is true. First, we feed all of the images in the Recipe1M+ training set and the Food-101 images into an 18-layer residual network that was pre-trained on ImageNet. The network outputs a prediction vector for each of these images. We next note that if an image in the extended training set has an exact copy in the Food-101 dataset, then both images must have the same prediction vector. When checking the prediction vectors of the images in Food-101 and the Recipe1M+ training set, we did not find any overlapping prediction vectors, meaning that the images between Food-101 and the Recipe1M+ training set do not overlap.
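A rough sketch of this duplicate check, under the assumption that exact image copies produce identical prediction vectors, is shown below; the signature construction and variable names are illustrative.

```python
import torch
from torchvision import models

resnet18 = models.resnet18(pretrained=True).eval()   # fixed, ImageNet-pretrained

def prediction_signatures(image_batches):
    """Map every image to a hashable signature of its prediction vector."""
    signatures = set()
    with torch.no_grad():
        for batch in image_batches:          # each batch: (B, 3, 224, 224)
            preds = resnet18(batch)          # (B, 1000) prediction vectors
            for p in preds:
                signatures.add(tuple(p.tolist()))
    return signatures

# An empty intersection indicates no exact image copies between the two sets:
# overlap = prediction_signatures(recipe1m_plus_batches) & prediction_signatures(food101_batches)
```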
6.2 Analysis of the Learned Embedding

To gain further insight into our neural embedding, we perform a series of qualitative analysis experiments. We explore whether any semantic concepts emerge in the neuron activations and whether the embedding space has certain arithmetic properties.

Neuron Visualizations. Through neural activation visualization, we investigate whether any semantic concepts emerge in the neurons of our embedding vector despite not being explicitly trained for that purpose. We pick the top activating images, ingredient lists, and cooking instructions for a given neuron. Then we use the methodology introduced by Zhou et al. [37] to visualize the image regions that contribute the most to the activation of specific units in our learned visual embeddings. We apply the same procedure on the recipe side to also obtain those ingredients and recipe instructions to which certain units react the most. Fig. 9 shows the results for the same unit in both the image and recipe embedding. We find that certain units display localized semantic alignment between the embeddings of the two modalities.

Semantic Vector Arithmetic. Different works in the literature [28], [38] have used simple arithmetic operations to demonstrate the capabilities of their learned representations. In the context of food recipes, one would expect that v("chicken pizza") − v("pizza") + v("salad") = v("chicken salad"), where v represents the map into the embedding space. We demonstrate that our learned embeddings have such properties by applying the previous equation template to the averaged vectors of recipes that contain the queried words in their title. We apply this procedure in the recipe and image embedding spaces and show results in Fig. 10 and Fig. 11, respectively. Our findings suggest that the learned embeddings have semantic properties that translate to simple geometric transformations in the learned space. Furthermore, the model trained on Recipe1M+ is better able to capture these semantic properties in the embedding space. The improvement is most clearly observable in the recipe arithmetic. Among the recipe analogy examples, notice that the result for the Recipe1M+ dataset for "chicken quesadilla" - "wrap" + "rice" returns a casserole dish, while for the Recipe1M dataset we have a quesadilla
Fig. 10. Analogy arithmetic results using recipe embeddings on the Recipe1M test set. On the left hand side are arithmetic results using the
model trained on Recipe1M. On the right hand side are the arithmetic results for the model trained on Recipe1M+. We represent the average vector
of a query with the images from its 4 nearest neighbors. In the case of the arithmetic result, we show the nearest neighbor only.
dish. The casserole dish is much closer to matching the "chicken rice" result that we expect in this instance. Additionally, note how "taco" - "tortilla" + "lettuce" returns a salad for the Recipe1M model and a lettuce wrap for the Recipe1M+ model. Here, the former model is likely doing arithmetic over the ingredients in the dish: a taco without a tortilla likely comprises a salad, into which lettuce is added to give a salad-like dish. On the other hand, the Recipe1M+ model does arithmetic over higher-level semantic concepts: it returns a lettuce wrap, which is the closest analogue to a taco that has the tortilla substituted with lettuce. We can thus see how the Recipe1M+ model has a greater ability to capture semantic concepts in the recipe embedding space, and it also performs somewhat better in general. If we examine the results of both models for the analogy task with image embeddings, then the Recipe1M+ model shows less of an improvement in general. However, we can still see differences between the two models. For instance, if we examine the "taco" - "tortilla" + "lettuce" analogy, then the Recipe1M model returns a result in which the lettuce is mixed in with other ingredients to form a salad. However, the Recipe1M+ model returns a result in which a salad is placed on top of a large piece of lettuce. This result is similar in a way to the lettuce wrap result, as the piece of lettuce is not just mixed in with the other ingredients, but acts as more of an object into which other ingredients are placed. All in all, the Recipe1M+ training set allows our model to better capture high-level semantic concepts.
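A small sketch of how such analogy queries can be evaluated over precomputed recipe embeddings is shown below; the averaging over matching titles follows the description above, while everything else (names, normalization, top-k) is illustrative.

```python
import numpy as np

def analogy(embeddings, titles, a, b, c, topk=4):
    """Evaluate v(a) - v(b) + v(c) and return the nearest recipes, where each
    query phrase is represented by the average embedding of all recipes whose
    title contains it."""
    def avg(phrase):
        idx = [i for i, t in enumerate(titles) if phrase in t.lower()]
        return embeddings[idx].mean(axis=0)
    query = avg(a) - avg(b) + avg(c)
    query /= np.linalg.norm(query)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return np.argsort(-normed @ query)[:topk]   # indices of nearest recipes

# e.g. analogy(recipe_embs, recipe_titles, "chicken pizza", "pizza", "salad")
```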
Fractional Arithmetic. Another type of arithmetic we examine is fractional arithmetic, in which our model interpolates across the vector representations of two concepts in the embedding space. Specifically, we examine the results for x × v("concept 1") + (1 − x) × v("concept 2"), where x varies from 0 to 1. We expect this to have interesting applications in spanning the space across two food concepts, such as pasta and salad, by adjusting the value of x to make the dish more "pasta-like" or "salad-like", for example. We apply this procedure in the recipe and image embedding spaces and show results in Fig. 12 and Fig. 13, respectively. With both fractional image arithmetic and fractional recipe arithmetic, we hope that adjusting the fractional coefficient will allow us to explore more fine-grained combinations of two concepts. However, the results are often not so fine-grained. For
Fig. 11. Analogy arithmetic results using image embeddings on the Recipe1M test set. On the left hand side are arithmetic results using the
model trained on Recipe1M. On the right hand side are the arithmetic results for the model trained on Recipe1M+. We represent the average vector
of a query with the images from its four nearest neighbors. In the case of the arithmetic result, we show the nearest neighbor only.
instance, in the "burrito" and "sandwich" example for the model trained on the Recipe1M dataset for recipe fractional arithmetic, choosing a burrito coefficient of 0 does not yield different results from choosing the coefficient to be 0.5. Note that, on the other hand, the model trained on the Recipe1M+ dataset is able to provide distinct results for each fractional coefficient value for this example. In general, though, both models are able to effectively explore the gradient of recipes or food images between two different food concepts. For instance, note the models' results for the "curry" and "soup" examples, in both the image and recipe modalities. The most "curry-like" image tends to have some broth, but is much chunkier than the other images. As we increase the coefficient of "soup", we see the food becoming less chunky and more broth-like. Such examples reflect the ability of our model to explore the space between food concepts in general.

The results of our fractional arithmetic experiments suggest that the recipe and image embeddings learned in our model are semantically aligned, which broaches the possibility of applications in recipe modification (e.g., ingredient replacement, calorie adjustment) or even cross-modal generation.
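For reference, the fractional query itself reduces to a simple interpolation followed by nearest-neighbor retrieval; the sketch below uses the same illustrative assumptions as the analogy example above.

```python
import numpy as np

def fractional_query(embeddings, concept1_vec, concept2_vec, x):
    """Retrieve the nearest item to x * v(concept 1) + (1 - x) * v(concept 2),
    where the concept vectors are averaged recipe embeddings as before."""
    q = x * concept1_vec + (1 - x) * concept2_vec
    q /= np.linalg.norm(q)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return int(np.argmax(normed @ q))

# for x in (0.0, 0.25, 0.5, 0.75, 1.0):
#     print(x, fractional_query(recipe_embs, v_curry, v_soup, x))
```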
7 CONCLUSION

In this paper, we present Recipe1M+, the largest structured recipe dataset to date, the im2recipe problem, and neural embedding models with semantic regularization which achieve impressive results for the im2recipe task. The experiments conducted using AMT, together with the fact that on the Recipe1M test set we obtain the same test performance using Recipe1M+, show that the extended dataset is not much noisier. Moreover, the fact that this expansion strategy greatly helps on the Food-101 dataset demonstrates its value for generalizability. Additionally, we explored the properties of the resulting recipe and food representations by evaluating different vector arithmetics on the learned embeddings, which hinted at the possibility of applications such as recipe modification or even cross-modal recipe generation.

More generally, the methods presented here could be gainfully applied to other "recipes" like assembly instructions, tutorials, and industrial processes. Further, we hope that our contributions will support the creation of automated tools for food and recipe understanding and open doors for many less explored aspects of
Fig. 12. Fractional arithmetic results using recipe embeddings on the Recipe1M test set. On the left hand side are arithmetic results using the
model trained on Recipe1M. On the right hand side are the arithmetic results for the model trained on Recipe1M+. For each model, we fractionally
interpolate across two example concepts (for instance, “salad” and “pasta”). We find the retrieved results for x × v(“concept 1”) + (1 − x) × v(“concept
2”), where x varies from 0 to 1.
Fig. 13. Fractional arithmetic results using image embeddings on the Recipe1M test set. On the left hand side are arithmetic results using the
model trained on Recipe1M. On the right hand side are the arithmetic results for the model trained on Recipe1M+. For each model, we fractionally
interpolate across two example concepts (for instance, “salad” and “pasta”). We find the retrieved results for x × v(“concept 1”) + (1 − x) × v(“concept
2”), where x varies from 0 to 1.
REFERENCES

[3] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[4] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Advances in Neural Information Processing Systems, 2014, pp. 487–495.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.
[6] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[8] L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 – mining discriminative components with random forests," in European Conference on Computer Vision. Springer, 2014, pp. 446–461.
[9] C. Liu, Y. Cao, Y. Luo, G. Chen, V. Vokkarane, and Y. Ma, "Deepfood: Deep learning-based food image recognition for computer-aided dietary assessment," in International Conference on Smart Homes and Health Telematics. Springer, 2016, pp. 37–48.
[10] A. Myers, N. Johnston, V. Rathod, A. Korattikara, A. Gorban, N. Silberman, S. Guadarrama, G. Papandreou, J. Huang, and K. Murphy, "Im2calories: Towards an automated mobile vision food diary," in ICCV, 2015, pp. 1233–1241.
[11] F. Ofli, Y. Aytar, I. Weber, R. Hammouri, and A. Torralba, "Is saki #delicious? The food perception gap on Instagram and its relation to health," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
[12] A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli, I. Weber, and A. Torralba, "Learning cross-modal embeddings for cooking recipes and food images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, July 2017.
[13] L. Herranz, W. Min, and S. Jiang, "Food recognition and recipe analysis: integrating visual content, context and external knowledge," CoRR, vol. abs/1801.07239, 2018. [Online]. Available: http://arxiv.org/abs/1801.07239
[14] W. Min, S. Jiang, S. Wang, J. Sang, and S. Mei, "A delicious recipe analysis framework for exploring multi-modal recipes with various attributes," in Proceedings of the 2017 ACM on Multimedia Conference, ser. MM '17. New York, NY, USA: ACM, 2017, pp. 402–410. [Online]. Available: http://doi.acm.org/10.1145/3123266.3123272
[15] M. Chang, L. V. Guillain, H. Jung, V. M. Hare, J. Kim, and M. Agrawala, "Recipescape: An interactive tool for analyzing cooking instructions at scale," in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, ser. CHI '18. New York, NY, USA: ACM, 2018, pp. 451:1–451:12. [Online]. Available: http://doi.acm.org/10.1145/3173574.3174025
[16] M. Engilberge, L. Chevallier, P. Pérez, and M. Cord, "Finding beans in burgers: Deep semantic-visual embedding with localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018.
[17] J.-J. Chen, C.-W. Ngo, and T.-S. Chua, "Cross-modal recipe retrieval with rich food attributes," in Proceedings of the 2017 ACM on Multimedia Conference, ser. MM '17. New York, NY, USA: ACM, 2017, pp. 1771–1779. [Online]. Available: http://doi.acm.org/10.1145/3123266.3123428
[18] J.-J. Chen, C.-W. Ngo, F.-L. Feng, and T.-S. Chua, "Deep understanding of cooking procedure for cross-modal recipe retrieval," in Proceedings of the 26th ACM International Conference on Multimedia, ser. MM '18. New York, NY, USA: ACM, 2018, pp. 1020–1028. [Online]. Available: http://doi.acm.org/10.1145/3240508.3240627
[19] M. Carvalho, R. Cadène, D. Picard, L. Soulier, N. Thome, and M. Cord, "Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings," in Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '18. New York, NY, USA: ACM, 2018.
[20] Y. Kawano and K. Yanai, "Foodcam: A real-time food recognition system on a smartphone," Multimedia Tools and Applications, vol. 74, no. 14, pp. 5263–5287, 2015.
[21] R. Xu, L. Herranz, S. Jiang, S. Wang, X. Song, and R. Jain, "Geolocalized modeling for dish recognition," IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1187–1199, 2015.
[22] T. Kusmierczyk, C. Trattner, and K. Norvag, "Understanding and predicting online food recipe production patterns," in HyperText, 2016.
[23] X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso, "Recipe recognition with large multimodal food dataset," in ICME Workshops, 2015, pp. 1–6.
[24] J.-J. Chen and C.-W. Ngo, "Deep-based ingredient recognition for cooking recipe retrieval," ACM Multimedia, 2016.
[25] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, Apr. 2018.
[26] US Department of Agriculture, Agricultural Research Service, Nutrient Data Laboratory, "USDA national nutrient database for standard reference, release 27," May 2015. [Online]. Available: http://www.ars.usda.gov/ba/bhnrc/ndl
[27] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[28] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013.
[29] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in NIPS, 2014, pp. 3104–3112.
[30] R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler, "Skip-thought vectors," in NIPS, 2015, pp. 3294–3302.
[31] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba, "Learning aligned cross-modal representations from weakly aligned data," in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2016.
[32] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba, "Cross-modal scene networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 10, pp. 2303–2314, 2018. [Online]. Available: https://doi.org/10.1109/TPAMI.2017.2753232
[33] Q. V. Le and T. Mikolov, "Distributed representations of sentences and documents," arXiv preprint arXiv:1405.4053, 2014.
[34] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "Decaf: A deep convolutional activation feature for generic visual recognition," arXiv preprint arXiv:1310.1531, 2013.
[35] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[36] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[37] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Object detectors emerge in deep scene CNNs," International Conference on Learning Representations, 2015.
[38] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.

Javier Marín received the B.Sc. degree in Mathematics at the Universitat de les Illes Balears in 2007. In June 2013 he received his Ph.D. in computer vision at the Universitat Autónoma de Barcelona. In 2017 he was a postdoctoral research associate at the Massachusetts Institute of Technology (MIT). Before that, he worked as an algorithm development engineer in the automotive sector, and as a researcher and project manager in both the neuroscience and space fields. He currently combines working in the private sector as a senior data scientist at Satellogic Solutions with being a research affiliate at MIT. His research interests lie mainly in the area of computer vision and machine learning, focusing recently on cross-modal learning, object recognition and semantic segmentation.

Aritro Biswas received a Bachelor's degree in Computer Science at the Massachusetts Institute of Technology (MIT). He received his Master's degree in Computer Science at MIT. Recently, his research has focused on using computer vision for two applications: (i) understanding the content of food images and (ii) disaster recognition for images of humanitarian disasters.

Ferda Ofli (S'07–M'11–SM'18) received the B.Sc. degrees both in electrical and electronics engineering and computer engineering, and the Ph.D. degree in electrical engineering from Koc University, Istanbul, Turkey, in 2005 and 2010, respectively. From 2010 to 2014, he was a Postdoctoral Researcher at the University of California, Berkeley, CA, USA. He is currently a Scientist at the Qatar Computing Research Institute (QCRI), part of Hamad Bin Khalifa University (HBKU). His research interests cover computer vision, machine learning, and multimedia signal processing. He is an IEEE and ACM senior member with over 45 publications in refereed conferences and journals including CVPR, WACV, TMM, JBHI, and JVCI. He won the Elsevier JVCI best paper award in 2015, and the IEEE SIU best student paper award in 2011. He also received the Graduate Studies Excellence Award in 2010 for outstanding academic achievement at Koc University.