
A Study of Multi-Task and Region-Wise Deep Learning for Food Ingredient Recognition

Jingjing Chen, Member, IEEE, Bin Zhu, Graduate Student Member, IEEE, Chong-Wah Ngo, Tat-Seng Chua, and Yu-Gang Jiang, Member, IEEE

Abstract— Food recognition has attracted substantial research attention for its importance to health-related applications. Existing approaches mostly focus on categorizing food by dish name, while ignoring the underlying ingredient composition. In reality, two dishes with the same name do not necessarily share the exact list of ingredients; therefore, dishes under the same food category are not necessarily equal in nutrition content. Nevertheless, due to the limited datasets available with ingredient labels, the problem of ingredient recognition is often overlooked. Furthermore, as the number of ingredients is expected to be much smaller than the number of food categories, ingredient recognition is more tractable in real-world scenarios. This paper provides an insightful analysis of three compelling issues in ingredient recognition: recognition at either the image or region level, pooling at either a single or multiple image scales, and learning in either a single-task or multi-task manner. The analysis is conducted on a large food dataset, Vireo Food-251, contributed by this paper. The dataset is composed of 169,673 images with 251 popular Chinese foods and 406 ingredients. The dataset includes adequate challenges in scale and complexity to reveal the limits of current approaches to ingredient recognition.

Index Terms— Food images, Chinese food, ingredient recognition, deep learning.

Fig. 1. Variations in visual appearance and composition of ingredients highlight the challenges of food recognition. The first row shows three examples of dishes for the category “scrambled egg & cucumber”, followed by “sour & spicy diced lotus root” and “shredded oyster mushrooms salad” in the second and third rows respectively.

Manuscript received January 9, 2020; revised July 30, 2020 and November 1, 2020; accepted December 3, 2020. Date of publication December 23, 2020; date of current version December 31, 2020. This work was supported in part by the Project from the National Science Foundation (NSF) of China under Grant 62072116 and in part by the Research Grants Council of the Hong Kong Special Administrative Region, China under Grant CityU 11203517. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Zhu Li. (Corresponding author: Jingjing Chen.)
Jingjing Chen and Yu-Gang Jiang are with the School of Computer Sciences, Fudan University, Shanghai 200433, China (e-mail: chenjingjing@fudan.edu.cn).
Bin Zhu and Chong-Wah Ngo are with the Department of Computer Science, City University of Hong Kong, Hong Kong.
Tat-Seng Chua is with the School of Computing, National University of Singapore, Singapore 119077.
Digital Object Identifier 10.1109/TIP.2020.3045639

I. INTRODUCTION

FOOD log management aims to quantify food consumption and provides services such as advice on weight-loss strategies. The current practice of logging still relies on manual recording of food intake, which is cumbersome. For example, manually inputting the ingredients of a home-cooked dish is required for nutrition estimation. Furthermore, as reported in [1], self-reported data obtained from unfriendly logging processes often tend to underestimate the actual food intake. With the prevalent use of mobile devices, a more convenient way is to take a picture of a meal for food recognition and logging.

Automatic dietary recognition and assessment have been an active area of research [2]–[7]. These works basically perform dish recognition and then search for the calories and nutrition information of a dish from the food composition table (FCT). For dishes with standardized cooking methods, such as fast food, this workflow is simple and effective. Nevertheless, there remain many categories of dishes without standard cooking methods, food presentation, and composition of ingredients. Figure 1 shows some examples of dishes, where the composition of ingredients within a food category can be diverse. Take the category “shredded oyster mushrooms salad” (last row of Figure 1) for example: there is very little overlap in ingredients among these dishes except shredded oyster mushrooms. This intuition motivates the study of ingredient recognition in this paper - a problem deserving more research attention, particularly for the large-scale recognition of ingredients from images in the wild.

As observed in Figure 1, the challenges of food recognition come from the large visual variations within the same food category. The variations introduced by different cooking and cutting methods are hard to tackle with hand-crafted features such as SIFT [8], HOG [9] and color [10].

Thanks to deep learning [11]–[15], there have been several recent studies [16], [17] that report high food recognition accuracy of up to 80% on medium-scale benchmark datasets such as Vireo Food-172 [16] and Food-101 [18]. The success of food classification with deep learning techniques has inspired researchers to explore a more challenging problem, i.e., understanding the ingredient composition of a dish [16], [19]–[21].

Ingredient recognition is generally a harder problem than food categorization. The size, shape, and color of an ingredient can exhibit large visual differences due to diverse ways of cooking and cutting, in addition to changes in viewpoints and lighting conditions. This paper studies the recognition of ingredients in the domain of Chinese dishes. This domain is particularly challenging because dishes are often composed of a variety of ingredients fuzzily mixed together, rather than separated into different food containers or served as non-overlapping food items, as frequently seen in Japanese and Western dishes.

This paper describes two methods for ingredient recognition. The methods are not completely new in the literature of food recognition [16], [20]; this paper presents a thorough analysis of both methods, including their strengths and limitations in ingredient recognition. The first method is based on multi-task learning that relies on global image features for simultaneous food and ingredient classification. The motivation is to exploit the mutual relationship between the food category and ingredients for better performance. The key ingredients of a category remain similar despite being composed with different auxiliary ingredients, so knowing the food category basically eases the recognition of ingredients. For example, the ingredient “cherry tomatoes” has a higher chance than “pork” to appear in the food “shredded oyster mushroom salad”. Hence, learning ingredients with the food category in mind should in principle lead to better performance. The second method does not leverage food category information. Ingredient recognition is performed at the image region level: instead of globally pooling features for recognition, ingredients are first predicted for each local image patch and then pooled across regions as the final recognition result.

This paper also contributes a large dataset, Vireo Food-251, composed of 169,673 images with 251 Chinese food categories and 406 ingredient labels. In terms of the number of food categories, this new dataset is on par with UEC Food-256 [22] and ChineseFoodNet [23] with 208 categories. Note that ingredient labels are not available on either dataset. In the literature, Food-101 [18] also includes ingredient labels. Nevertheless, it assumes that all dishes under a food category share the same list of ingredients, which makes the dataset inappropriate for ingredient recognition.

We extend the paper by comparing the originally proposed multi-task learning framework in [16] to region-wise ingredient recognition. More in-depth studies, covering image-level versus region-level recognition, single versus multi-scale feature pooling, and single versus multi-task learning, are presented with new empirical insights. Furthermore, we extend the original Vireo Food-172 dataset from 172 to 251 food categories and from 353 to 406 ingredient labels. The main contributions are the sharing of a large food dataset, and the comparative studies of various compelling issues in ingredient recognition through the methods of multi-task learning and region-wise recognition. The rest of the paper is organized as follows. Section II reviews related works while Section III introduces the extended dataset. Section IV presents two baselines, i.e., multi-task learning and region-wise multi-label classification, for ingredient recognition. Section V details the performances of the two baselines on Vireo Food-251. Finally, Section VI concludes this paper.

II. RELATED WORK

Food recognition has become a popular research topic in recent years and variants of recognition-centric approaches have been investigated for different food-related applications. These efforts include food quantity estimation based on depth images [3], image segmentation for volume estimation [24], [25], context-based recognition by GPS and restaurant menus [2], [26], [27], taste estimation [28], multi-food recognition [29]–[32], personalized recognition [33], multi-modal fusion [34] and real-time recognition [5]–[7], [35], [36]. This section mainly reviews previous works on food and ingredient recognition.

A. Food Recognition

The challenge of food recognition comes from visual variations in shape, color and texture layout. These variations are hard to tackle with hand-crafted features such as SIFT [8], HOG [9] and color [10]. Instead, features extracted from a deep convolutional neural network (DCNN) [11], trained on ImageNet [37] and fine-tuned on food images, often exhibit impressive recognition performance [22], [25], [38]–[42]. Combining multi-modal features sometimes also leads to better recognition performance, as reported in [38], [43]. Recent works mostly focus on new architectures for food recognition, such as wide-slice residual networks [44] and bilinear CNN models [45]. As reported in [44], the best performances on both UEC Food-100 and Food-101 are achieved by wide-slice residual networks that contain two branches: a residual network branch and a slice branch with slice convolutional layers. Apart from deep architectures, different learning strategies have also been investigated [16], [19], [46]. For example, in [19] and [16], food recognition is formulated as a multi-task learning problem by leveraging ingredient labels or taste labels as supplementary supervised information. By treating ingredients as privileged information, Meng et al. [46] propose a cross-modal alignment and transfer network for food recognition. In addition to using static datasets for model training, zero- and few-shot learning has also started capturing research attention [47], [48].

B. Ingredient Recognition

Compared to food categories, ingredients exhibit larger visual appearance variations due to different cooking and cutting methods. Labeling of ingredients also poses a higher challenge and only very few datasets [16] have been constructed for ingredient recognition.


Fig. 2. Examples of food categories in VIREO Food-251.

An early work is PFD [49], which leverages the result of ingredient recognition for food categorization. In PFD, based upon the appearance of image patches, pixels are softly labeled with ingredient categories. The spatial relationship between pixels is then modeled as a multi-dimensional histogram, characterized by label co-occurrence and geometric properties such as distance and orientation. With this histogram representation, PFD shows impressive food recognition performance. PFD, nevertheless, is hardly scalable to the number of ingredients: using only eight categories of ingredients as demonstrated in [49], the histogram already grows to tens of thousands of dimensions. Other earlier works explore spatial layout [50], feature mining [18] and image segmentation [25] for ingredient or food item recognition. In [50], ingredient regions are detected by shape and texture models, where the shape is based on DPM (deformable part-based model) while the texture is based on STF (semantic texton forest). Similar to PFD [49], the regions are encoded into a histogram modeling the spatial relationship between them for food recognition. The spatial relationship is not statistically encoded as in [49]; rather, explicit relationships such as “above”, “below”, and “overlapping” are modeled. Such relationships are helpful for recognizing food such as dessert and fast food, but difficult to generalize, for example to Chinese dishes. In [18], an interesting work that mines the composition of ingredients as discriminative patterns is proposed for food classification. A drawback of this approach is the requirement of image segmentation, which is sensitive to parameter settings and can impact recognition performance. As reported in [18], the performance is not better than that of a DCNN without image segmentation on the Food-101 dataset. Similar to [24], image segmentation is employed in [25], but using a more advanced technique based on a conditional random field (CRF) with unary potentials provided by a DCNN [51]. The promising segmentation performance for Western food, nevertheless, comes at the price of requiring training labels that need manual segmentation of food items for model learning. For Chinese food, collecting such training labels is extremely difficult, given the fuzzy composition and placement of ingredients as shown in Figure 1.

Ingredient recognition is posed as a multi-label learning problem [16]. More recent works exploit neural networks, including the DCNN and the deep Boltzmann machine (DBM), for this problem [16], [52]. To increase the robustness of recognition, multi-task learning, which leverages food category labels as supplementary supervised information, is often employed for simultaneous classification of food and ingredient labels [16], [19]. As the appearance of an ingredient changes depending on food preparation, cooking and cutting methods are also explored as supervised information in [20] for ingredient recognition. However, as food preparation is a process, labeling ingredients with cooking and cutting attributes is complicated and not intuitive. Other supervised information explored in the literature includes restaurant menus using a bipartite graph representation [53], and cuisine and course using a DBM [52]. Another branch of approaches poses ingredient recognition as a cross-modal learning problem [54]–[56]. Specifically, both images and recipes are projected into a joint embedding space for similarity measurement. Ingredients are either extracted from the matched recipe of an image [54], [56] or directly predicted from the joint space [57]. However, as the performance is not scalable to large recipe datasets as studied in [58] and cross-modal learning is inherently a “black box” model, the robustness of these approaches has not yet been seriously studied.

III. DATASET

We construct a large food dataset specifically for Chinese dishes, namely Vireo Food-251. Different from other publicly available datasets [18], [29], [59], both food categories and ingredient labels are included. To the best of our knowledge, this is the largest dataset that provides both food categories and ingredient labels.

A. Dataset Collection

VIREO Food-251 is extended from the original Vireo Food-172 [16]. With the newly added 79 food categories, the dataset covers most of the popular Chinese dishes. The new dataset is compiled from the top-200 popular foods listed on the website “Go Cooking”¹, where popularity is sorted by the number of user-uploaded food images.

¹https://www.xiachufang.com/category/

Fig. 3. The distribution of food categories under eight major food groups in Vireo Food-251.

Fig. 4. The ingredient “egg” shows large differences in visual appearance across different kinds of dishes.

The popular food list is further combined with the original 172 categories in the old version, resulting in 251 food categories and 406 ingredient labels. For each newly added food category, a total of 2,000 user-uploaded images were crawled. We manually checked each image, excluding images that were incorrectly labeled, contained multiple dishes, suffered from blurring, or had a resolution lower than 256 × 256. The 251 categories cover eight major groups of food, as shown in Figure 3. The group meat contains the largest number of categories, with examples including “braised pork” and “sautéed shredded pork in sweet bean sauce”. On the other hand, there are fewer than ten categories under the group soup, with examples including “lotus root & spare ribs soup” and “crab & tofu soup”. Figure 2 shows some examples of food categories in VIREO Food-251.

B. Ingredient Labeling

We compiled a list of more than 400 ingredients based on the recipes of the 251 food categories. The ingredients range from popular items such as “shredded pork” and “shredded pepper” to rare items such as “codonopsis pilosula” and “radix astragali”. Labeling hundreds of ingredients for over a hundred thousand images could be extremely tedious. First, some ingredients are difficult to recognize, for example, ingredients under soup or sauce. Second, some ingredients are invisible in flour-made food categories such as dumplings and noodles. Third, certain ingredients such as egg exhibit large visual variations (see Figure 4) due to different ways of cutting and cooking. To address these problems, we label only those ingredients that are visible. In addition, we create additional labels for ingredients with large visual appearance variations; for example, we have 13 different labels for “egg”, such as “preserved egg slices” and “boiled egg”.

We recruited 10 homemakers with cooking experience for ingredient labeling. The homemakers were instructed to label only visible and recognizable ingredients. They were also allowed to introduce and annotate new ingredients not in the list, which would be explicitly checked by us. To guarantee the accuracy of labeling, we purposely awarded homemakers cash bonuses as incentives to provide quality annotation, in addition to the regular payment. For this purpose, we checked a small subset of labels and provided immediate feedback to homemakers so that they were aware of their performance. We spent two months in total labeling the whole dataset. Excluding images with no ingredient labels, VIREO Food-251 contains a total of 406 ingredient labels and 169,673 images, with an average of 3 ingredients per image. Figure 5 shows the distribution of positive examples in food and ingredient categories. As observed, the number of training samples is unbalanced. On average, there are 676 positive samples per food category and 1,196 per ingredient.

Fig. 5. The distribution of training examples for (a) food categories and (b) ingredient labels.

IV. INGREDIENT RECOGNITION

We present two methods for ingredient recognition. The first is a multi-task model with two tasks, food and ingredient recognition [16]. The second is a single-task model that predicts ingredient labels at local image regions. Both models are based on deep convolutional neural networks (DCNNs).

A. Multi-Task Learning

The conventional DCNN is an end-to-end system whose input is a picture and whose outputs are the prediction scores of class labels. DCNN models, such as AlexNet [11], VGG [12], and ResNet [13], are trained under the single-label scenario; specifically, there is an assumption of exactly one label for each input picture. As ingredient recognition is a multi-label problem, i.e., more than one label per image, a different loss function needs to be used for training a DCNN. On the other hand, directly revising a DCNN with an appropriate loss function for ingredient recognition may not yield satisfactory performance, given the varying appearances of an ingredient in different dishes. To this end, we propose to couple the food categorization problem, which is a single-label problem, together with ingredient recognition, which is a multi-label problem, for simultaneous learning.


Fig. 6. Four different deep architectures for multi-task learning of food category and ingredient recognition.

We formulate food categorization and ingredient recognition as a multi-task deep learning problem and modify the architecture of the DCNN for our purpose. The modification is not straightforward as it involves two design issues. The first is whether the prediction scores of both tasks should directly or indirectly influence each other. Direct influence means that the input of one task is connected as the output of the other task. Indirect influence decouples the connection such that each task is on a different path of the network, and both tasks influence each other by updating the shared intermediate layers. The second issue is the degree to which the intermediate layers should be shared. Ideally, each task should have its own private layer(s), given that the nature of the two tasks, single versus multi-labeling, is different. In such a way, the updating of parameters can be done more freely to optimize individual performance.

Based on the two design issues, we derive four different deep architectures as depicted in Figure 6, respectively named Arch-A to Arch-D. The first design (Arch-A) considers a stacked architecture by placing food categorization on top of ingredient recognition, and vice versa. As the composition of ingredients for different dishes under the same food category can be different, this architecture carries the risk that model learning converges slowly, as observed in the experiment. The second design (Arch-B) is similar except that indirect influence is adopted and the two tasks are on different pathways. Both designs are relatively straightforward to implement by adding additional layers to the DCNN. The next two architectures consider the decoupling of some intermediate layers. The third design (Arch-C) allows each task to privately own two intermediate layers on top of the convolutional layers for parameter learning. The last design (Arch-D) is a compromise between the second and third architectures, having one shared and one private layer. Arch-D has the peculiarity that the shared layer can correspond to the high- or mid-level features common to the two tasks at the early stage of learning, while the private layer preserves the learning of specialized features useful for optimizing the performance of each task.

The architectures are modified from existing deep models, including VGG-16 [12], ResNet-50 [13], ResNet-101 [13], and SENet-154 [60]. In terms of design, the major modification is made on the fully connected layers. As VGG contains two fully connected layers, we modify them to implement all the architectures presented in Figure 6. For the private layers in Arch-D, there are 4,096 neurons for both the food categorization and ingredient recognition layers. ResNet and SENet, on the other hand, have only one fully connected layer; in this case, only Arch-B can be implemented. Due to the different natures of the tasks, we adopt the multinomial logistic loss L_1 for single-label food categorization and cross-entropy as the loss function L_2 for multi-label ingredient recognition. Denoting N as the total number of training images, the overall loss function L is as follows:

    L = -\frac{1}{N} \sum_{n=1}^{N} (L_1 + \lambda L_2)    (1)

where λ is a trade-off parameter. This loss function is also widely used in other works such as [61]. During training, the errors propagated from the two branches are linearly combined, and the weights of the convolutional layers shared between the two tasks are updated accordingly. The updating subsequently affects the last two layers simultaneously, adjusting the features separately owned by food and ingredient recognition. Let q̂_{n,y} be the predicted score of an image x_n for its ground-truth food label y; L_1 is defined as follows:

    L_1 = \log(\hat{q}_{n,y})    (2)

where q̂_{n,y} is obtained from the softmax activation function. Furthermore, denote p_n ∈ {0, 1}^t, represented as a vector in t dimensions, as the ground-truth ingredients for an image x_n. Basically, p_n is a binary vector with entries of value 1 or 0 indicating the presence or absence of an ingredient. The loss function L_2 is defined as

    L_2 = \sum_{c=1}^{t} p_{n,c} \log(\hat{p}_{n,c}) + (1 - p_{n,c}) \log(1 - \hat{p}_{n,c})    (3)

where p̂_{n,c} denotes the probability of having ingredient category c for x_n, obtained through the sigmoid activation function.
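To make the formulation concrete, the following is a minimal PyTorch sketch of an Arch-D style head together with the combined loss of Equation (1). The 4,096-neuron private layers, the λ = 0.3 default, and the 251/406 output sizes follow the paper; the backbone feature dimension, the ReLU placement, and all names are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    """Sketch of an Arch-D style head: one fully connected layer shared by
    both tasks, followed by a private 4,096-neuron layer per task."""
    def __init__(self, feat_dim=4096, num_foods=251, num_ingredients=406):
        super().__init__()
        self.shared = nn.Linear(feat_dim, 4096)            # shared layer
        self.food_private = nn.Linear(4096, 4096)          # private: food task
        self.ingr_private = nn.Linear(4096, 4096)          # private: ingredient task
        self.food_cls = nn.Linear(4096, num_foods)         # single-label output
        self.ingr_cls = nn.Linear(4096, num_ingredients)   # multi-label output

    def forward(self, feat):
        h = F.relu(self.shared(feat))
        food_logits = self.food_cls(F.relu(self.food_private(h)))
        ingr_logits = self.ingr_cls(F.relu(self.ingr_private(h)))
        return food_logits, ingr_logits

def multitask_loss(food_logits, ingr_logits, food_labels, ingr_targets, lam=0.3):
    """Equation (1): L1 is softmax cross-entropy (the negative log of Eq. (2)
    is folded in) and L2 is per-ingredient binary cross-entropy (Eq. (3))."""
    l1 = F.cross_entropy(food_logits, food_labels)
    l2 = F.binary_cross_entropy_with_logits(ingr_logits, ingr_targets.float())
    return l1 + lam * l2
```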
B. Region-Wise Ingredient Recognition

The previous section considers the global image feature for multi-label learning, while ignoring regional information. This section introduces region-wise ingredient recognition, as illustrated by the pipeline in Figure 7. Given a food image I, the feature map (denoted as F_I ∈ R^{m×m×d}), which corresponds to the last convolutional layer of the DCNN and retains the spatial information of the original image, is extracted from the DCNN. The feature map is divided into m × m grids, where each grid is represented by a vector of d dimensions.

Fig. 7. The pipeline of region-wise ingredient recognition. Given a food image, the feature maps from the last convolutional layer of deep models are
extracted for region-wise ingredient classification. Max pooling is performed across different regions to obtain the final predictions.

The value of m varies depending on the image size. Using VGG as an example, m = 14 if the size of an image is 448 × 448; in this case, each grid corresponds to a receptive field of 32 × 32 resolution. We denote F_i as the feature vector for the i-th grid or region, where i ∈ [1, m × m].

As each grid depicts a small region of the original image, a reasonable assumption is that there is one dominant ingredient per region. Hence, ingredient recognition is performed in a region-wise manner by single-label classification on each grid. The activation function applied is softmax, giving the probability distribution over ingredients, denoted as p̂_i ∈ R^t for the i-th region, as follows:

    \hat{p}_i = \mathrm{softmax}(W F_i + b)    (4)

where W ∈ R^{t×d} is the learnt transformation matrix, b ∈ R^t is the bias term, and t is the number of ingredients.

Since each region is associated with a probability distribution over the ingredients, a straightforward way to obtain the image-level labels is by max-pooling over the distributions across regions. Let p̂_I be the probability distribution of ingredients for image I. The response of the ingredient indexed by the j-th element is obtained as follows:

    \hat{p}_I(j) = \max\{\hat{p}_i(j)\}_{i=1}^{m^2}    (5)

where m² is the total number of image grids. The loss function is cross-entropy, since ingredient recognition is a multi-label classification problem. Denote p_{I_n} ∈ {0, 1}^t as the ground-truth ingredients for a food picture I_n, represented by a binary vector whose elements are either 1 or 0, indicating the presence or absence of a particular ingredient. The loss function L is defined as

    L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{t} \left( p_{I_n,j} \log(\hat{p}_{I_n,j}) + (1 - p_{I_n,j}) \log(1 - \hat{p}_{I_n,j}) \right)    (6)
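For reference, here is a minimal PyTorch sketch of Equations (4)–(6). The 512-dimensional region feature matches a VGG conv5 feature map (d = 512, m = 14 for a 448 × 448 input), while the class and function names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionWiseHead(nn.Module):
    """Sketch of region-wise ingredient recognition (Eqs. 4-5): one shared
    linear classifier scores every grid cell of the final feature map, and
    max pooling across cells yields the image-level scores."""
    def __init__(self, feat_dim=512, num_ingredients=406):
        super().__init__()
        # W in R^{t x d} and bias b in R^t, shared by all regions (Eq. 4).
        self.classifier = nn.Linear(feat_dim, num_ingredients)

    def forward(self, feat_map):                            # (B, d, m, m)
        regions = feat_map.flatten(2).transpose(1, 2)       # (B, m*m, d)
        # Softmax per region: one dominant ingredient per grid cell.
        region_probs = F.softmax(self.classifier(regions), dim=-1)
        # Eq. (5): max pooling over regions gives image-level probabilities.
        image_probs, _ = region_probs.max(dim=1)            # (B, t)
        return image_probs

def region_wise_loss(image_probs, targets, eps=1e-7):
    # Eq. (6): binary cross-entropy on the pooled probabilities.
    image_probs = image_probs.clamp(eps, 1 - eps)
    return F.binary_cross_entropy(image_probs, targets.float())
```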
V. EXPERIMENTS

The experiments are conducted on the VIREO Food-251 dataset. In each food category, 60% of the images are randomly picked for training, 10% for validation, and the remaining 30% for testing. Note that only the 385 ingredient categories that have at least 10 training examples are evaluated. As ingredient recognition is a multi-label problem, micro-F1 and macro-F1, which take into account both the precision and recall of ingredient recognition, are employed as evaluation metrics. We split the experiments into two parts to verify the performance of multi-task learning (Section V-A) and region-wise recognition (Section V-B) respectively. The first part evaluates different deep architectures for multi-task learning in comparison to single-task DCNN. The second part demonstrates the merits of region-wise learning for ingredient recognition.
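Micro-F1 pools true/false positive counts over all labels and is therefore dominated by frequent ingredients, whereas macro-F1 averages per-label F1 and is sensitive to rare ones. A small NumPy sketch of the two metrics (illustrative; not the evaluation script used in the paper):

```python
import numpy as np

def f1_scores(pred, gt):
    """Micro- and macro-F1 for multi-label predictions.
    pred, gt: binary arrays of shape (num_images, num_ingredients)."""
    tp = (pred * gt).sum(axis=0).astype(float)        # per-label true positives
    fp = (pred * (1 - gt)).sum(axis=0).astype(float)  # per-label false positives
    fn = ((1 - pred) * gt).sum(axis=0).astype(float)  # per-label false negatives
    # Micro-F1: pool the counts over all labels before computing F1.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Macro-F1: compute F1 per label, then average over labels.
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)
    macro = per_label.mean()
    return micro, macro
```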
A. Multi-Task Learning

For multi-task model training, we fix the value of λ = 0.3 in Equation 1 for the VGG model; with λ = 0.3, ingredient recognition achieves the best performance on the validation set. As ingredient recognition involves multiple labels, a threshold is required to gate the selection of labels. The threshold is set to 0.5, following the standard setting when sigmoid is used as the activation function. The learning rate is set to 0.001 and the batch size to 64. The learning rate decays when the model reaches a plateau.
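A hypothetical training loop wiring these reported hyperparameters together, reusing MultiTaskHead and multitask_loss from the earlier sketch; the synthetic stand-in data, the SGD momentum of 0.9, and the plateau criterion details are assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data; real training would use Vireo Food-251 features.
feats = torch.randn(256, 4096)
food_labels = torch.randint(0, 251, (256,))
ingr_targets = (torch.rand(256, 406) < 0.01).float()
loader = DataLoader(TensorDataset(feats, food_labels, ingr_targets),
                    batch_size=64, shuffle=True)   # batch size 64 as reported

model = MultiTaskHead()   # from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")

for epoch in range(3):
    for f, y, p in loader:
        food_logits, ingr_logits = model(f)
        loss = multitask_loss(food_logits, ingr_logits, y, p, lam=0.3)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    val_macro_f1 = 0.0   # placeholder; compute macro-F1 on the validation split
    scheduler.step(val_macro_f1)   # decay the learning rate on a plateau
```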


We first evaluate the performance of different multi-task learning architectures using VGG as the backbone network. The multi-task learning includes the four deep architectures illustrated in Figure 6. Note that we experiment with two variants of Arch-A, with the layer of food categorization on top of ingredient recognition (Arch-A1) and vice versa (Arch-A2). For comparison, the single-task VGG trained with ingredient labels only is used as the baseline.

TABLE I: Performance comparison among different multi-task learning architectures for ingredient recognition on the Vireo Food-251 dataset. VGG is utilized as the backbone network. Micro-F1 and macro-F1 are reported.

Table I lists the performances of the different multi-task architectures for ingredient recognition. Except for Arch-A, all multi-task models exhibit better performance than single-task VGG. As the recognition results for both food and ingredients are imperfect, layer stacking as in Arch-A can actually hurt each task's performance: the inaccurate prediction in one task will directly affect the other task. On the other hand, while having separate paths as in Arch-B leads to better performance, the improvement is smaller compared with Arch-C and Arch-D, which do not share the same lower layer before the classification layer. Arch-D, which shares one layer while also learning separate layers tailor-made for both tasks, attains the best performance in terms of micro-F1 and macro-F1.

TABLE II: Performance comparison among different multi-task learning architectures for food categorization on the Vireo Food-251 dataset. VGG is utilized as the backbone network, and average top-1 and top-5 accuracies are reported.

Table II lists the performance of food categorization. For multi-task learning, similar trends are observed as for ingredient recognition. For top-1 accuracy, the best result is attained by Arch-D, while for top-5 the best results are attained by Arch-C. This basically verifies the importance of private layers for both tasks. It is worth noting that, different from ingredient recognition, Arch-A1 performs much better than Arch-A2 for food categorization. The result indicates that recognizing the food category based on the composition of ingredients is more feasible than inferring ingredients based on the food category. To verify that the improvement is not by chance, we conduct a significance test comparing multi-task (Arch-D) and single-task (VGG) using the source code provided by TRECVID². The test is performed by partial randomization with 100,000 iterations, under the null hypothesis that the improvement is due to chance. At a significance level of 0.05, Arch-D is significantly different from VGG in both food categorization and ingredient recognition by top-1 accuracy and macro-F1, respectively. The p-values are close to 0, which rejects the null hypothesis.

²http://www-nlpir.nist.gov/projects/t01v/trecvid.tools/randomization.testing
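The idea behind such a test can be sketched as follows: if two systems are equivalent, randomly swapping their paired per-image scores should leave the observed difference unchanged in distribution. This is a simplified illustration of a paired randomization test; the actual TRECVID tool differs in its details:

```python
import numpy as np

def randomization_test(scores_a, scores_b, iters=100_000, seed=0):
    """Paired randomization test: under the null hypothesis that the two
    systems are equivalent, their per-image scores are exchangeable.
    scores_a, scores_b: arrays of per-image scores (e.g., 0/1 correctness)."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    observed = abs(scores_a.mean() - scores_b.mean())
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(iters):
        flip = rng.random(scores_a.shape[0]) < 0.5   # randomly swap each pair
        a = np.where(flip, scores_b, scores_a)
        b = np.where(flip, scores_a, scores_b)
        hits += abs(a.mean() - b.mean()) >= observed
    return hits / iters                              # two-sided p-value
```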
TABLE III: Ten ingredients showing large performance differences in F1 between single-task VGG and multi-task VGG.

To further contrast the performance of single-task and multi-task learning models, Table III lists the ingredients showing large deviations in performance. Basically, for ingredients that are unique to a few food categories, the multi-task learning model performs much better than the single-task learning model. For example, “cordyceps sinensis” only appears in “black chicken soup”, and hence multi-task VGG is able to outperform single-task VGG by a large margin. Another example is “red bean paste”, which is unique to the food category “traditional Chinese rice-pudding”; here multi-task VGG outperforms single-task VGG by 33.1%. On the contrary, multi-task learning suffers from lower performance when confused by frequently appearing ingredients. As shown in Table III, ingredients such as “corn block”, “cabbage”, and “sliced tomato” are always seen in different food categories. Introducing the food category information for multi-task learning increases the confusion and hence results in lower recognition performance. Overall, with multi-task learning, the macro-F1 is boosted from 56.79% to 61.74%, with 285 ingredients showing improvements.

TABLE IV: Performance of ingredient recognition in each group. The number of ingredient categories and the average and median numbers of training images are shown in the 2nd, 3rd and 4th columns respectively.

To provide insights on which types of ingredients are difficult to recognize, we divide the ingredients into ten major food groups and report the macro-F1 of each group in Table IV. As shown, the macro-F1 for “fish” and “meat” is fairly high due to a sufficient number of training samples. The average numbers of training samples for the ingredients in “meat” and “fish” are 1,226 and 1,301, respectively, which results in high recognition accuracies. On the contrary, the group “fruits” has only 240 training samples on average, the fewest among all the groups. The median number is even smaller, only 40, as most of the training samples are from the ingredient “pineapple”. As a result, the macro-F1 of “fruits” is rather low, only 24.03%. Despite having 903 training samples on average, the group “seasonings” has the second-lowest macro-F1, which suggests that recognizing ingredients in the “seasonings” group is relatively challenging compared with the other ingredients.

Figure 8 shows three failure examples of seasoning ingredient recognition. In these examples, the seasoning ingredients are “dried chili”, “minced garlic”, “broad bean paste”, “minced ginger” and “minced green onion”. Basically, there are three major reasons for the low performance of seasoning ingredient recognition. First, some seasoning ingredients tend to be confused with each other due to similar appearances. For example, in Figure 8(a), “dried chili sections” is incorrectly predicted as “dried chili”.


Fig. 8. Failure examples of seasoning ingredient recognition (i.e., “dried chili”, “minced garlic”, “broad bean paste”, “minced ginger”, “minced green onion”). False positives are marked in red while false negatives are marked in yellow.

As “dried chili sections” differs from “dried chili” only in shape, the model tends to confuse them because of occlusions among different ingredients in the dish. Second, seasoning ingredients tend to be small in size and portion, which makes the recognition of “minced garlic” in Figure 8(b)-(c) and “minced ginger” in Figure 8(c) difficult; such examples are not easy to recognize even for humans. Third, as the training samples for some seasoning ingredients are not sufficient, the recognition performance is not satisfactory. For example, there are only 18 samples for the seasoning ingredient “broad bean paste”, resulting in the incorrect prediction shown in Figure 8(b).

TABLE V: Performance of multi-label ingredient recognition on the VIREO Food-251 dataset. Note that Arch-B is implemented for multi-task learning.

To validate the effectiveness of the multi-task learning strategy for other backbone networks (e.g., ResNet [13], Inception-V3 [62], SENet-154 [60]), we report the performances of different backbone networks in Table V. As most convolutional networks contain only one fully connected layer, we implement Arch-B for multi-task learning on the different backbone networks. Although Arch-B is not the optimal design for multi-task learning, it still performs better than the single-task learning models, which verifies the effectiveness of leveraging food category information for ingredient recognition. The general trend is that the deeper the network, the higher the recognition rate. In the case of ingredient recognition, providing food categories as an extra global cue can reduce false positives. This advantage comes with the trade-off of introducing more false negatives. As a result, in terms of numerical scores, single and multi-task learning do not seem to differ much. We further performed result analysis and noticed the following: the false positives introduced by single-task learning include major ingredients, while the false negatives introduced by multi-task learning are mostly auxiliary ingredients. From the application point of view, auxiliary ingredients have much less impact than major ingredients on the estimation of nutrition facts. Furthermore, false positives can adversely frustrate the user experience in food logging. Hence, multi-task learning still has an advantage over single-task learning despite the marginal improvement in numerical score.

B. Region-Wise Recognition

We then evaluate the performance of region-wise ingredient recognition.

TABLE VI: Performance of ingredient recognition.

Table VI compares the ingredient recognition performance of image-level and region-wise recognition models. Basically, region-wise recognition helps to improve ingredient recognition performance on all backbone networks except SENet-154. Since the key idea of SENet is to re-calibrate channel-wise (i.e., region-wise) feature responses by explicitly modeling inter-dependencies between channels (regions), performing region-wise recognition on SENet-154 harms the dependencies among region features and leads to worse recognition performance. On the contrary, with region-wise recognition, VGG and ResNet-50 further improve macro-F1 by around 5% and 4% respectively. The improvement in micro-F1 is less obvious, around 1%. This is due to the fact that region-wise recognition mostly benefits ingredient categories with a smaller number of training examples. For example, the rare ingredient “cordyceps sinensis”, with only 10 training examples, improves in F1 from 0% to 100%. This is because region-wise recognition, similar to data augmentation, inherently increases the number of training examples. Furthermore, ingredients that are small in size are less likely to be dominated by other ingredients during feature learning. As a consequence, the contribution of region-wise recognition is more significant for ingredients of small size and with fewer training examples. Figure 9 shows examples contrasting the performance of region-wise and image-level recognition. Region-wise recognition is robust to size (e.g., “parsley” in Figure 9(a)), cutting method (e.g., “minced ginger” in Figure 9(b)) and training size (e.g., “kiwi fruit” in Figure 9(c)). On the other hand, limited by the region size, context information is not fully leveraged. Ingredients such as “crisp fritter” in Figure 9(e) and “yellow peach” are predicted correctly by image-level but not region-wise recognition. Table VII shows the top-10 ingredients that gain the largest improvement from region-wise recognition.

Fig. 9. Examples of ingredient recognition. False positives are marked in red while false negatives are marked in yellow. The backbone network is ResNet-50.

TABLE VII: Ten ingredients showing large performance improvement in F1 with the region-wise recognition model. The backbone network is ResNet-50.

A by-product of the region-wise recognition model is the capability of locating ingredients. We visualize the result in a response map, which is formed by converting the prediction score of an ingredient on an image grid into a pixel intensity value. Figure 10 shows the response maps of ingredients. Generally speaking, the better the localization result, the higher the prediction accuracy.

Fig. 10. Ingredient localization: original image (left) and the response maps of three ingredients. The backbone network is ResNet-50.
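A minimal sketch of how such a response map can be produced from the region-wise predictions; the bilinear upsampling and the function name are assumptions for illustration:

```python
import torch.nn.functional as F

def response_map(region_probs, ingredient_idx, image_size=448, m=14):
    """Sketch of the response-map visualization: the per-grid scores of one
    ingredient are upsampled to image resolution so that high-scoring
    regions appear as bright areas.
    region_probs: tensor of shape (m*m, t) with region-wise probabilities."""
    grid = region_probs[:, ingredient_idx].reshape(1, 1, m, m)
    heat = F.interpolate(grid, size=(image_size, image_size),
                         mode="bilinear", align_corners=False)
    return heat.squeeze()   # (image_size, image_size) pixel intensities
```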
distribution of ingredients at scale l, max pooling is conducted
C. Discussion

Why not multi-scale recognition? As the scales of ingredients change depending on the camera-to-dish distance and cutting methods, region-wise recognition should intuitively benefit from multi-scale processing, as reported in [20]. We input a pyramid of images at multiple resolutions for region-wise recognition. In this way, the receptive field of a grid can spatially extend to a larger scope depending on the resolution of the input image. For example, for an image of size 448 × 448, each grid in the feature map obtained from VGG corresponds to a receptive field of a 32 × 32 image region. By reducing the image to a resolution of 224 × 224, the receptive field extends to a spatial size of 64 × 64 in the original image before resizing.

The consideration of multiple scales introduces only minor changes to the original region-wise deep network architecture. Except for the region-level pooling, which involves ingredient recognition probabilities from multiple scales, the updating of parameters remains the same throughout the learning procedure. Denoting p̂^l_I as the probability distribution of ingredients at scale l, max pooling is conducted across different regions and scales as follows:

    \hat{p}_I(j) = \max\{\max\{\hat{p}^l_i(j)\}_{i=1}^{m^2}\}_{l=1}^{L}    (7)

Basically, the multi-scale design ensures that an ingredient can be adaptively pooled from a region at the particular scale that exhibits the highest prediction confidence. Multi-scale ingredient recognition is performed at two different scales: 224 × 224 and 448 × 448. Table VIII contrasts the performance of single- and multi-scale recognition. Different from the results reported in [20], multi-scale recognition does not show an apparent advantage; in contrast, both micro-F1 and macro-F1 drop for most of the backbones.
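Equation (7) amounts to a nested max pooling, as in the following sketch (names and tensor layout are illustrative):

```python
import torch

def multiscale_pool(probs_per_scale):
    """Eq. (7): max pooling over regions within each scale, then over scales.
    probs_per_scale: list of tensors, one per scale l, each shaped
    (num_regions_l, t) with region-wise ingredient probabilities."""
    per_scale = [p.max(dim=0).values for p in probs_per_scale]   # (t,) each
    return torch.stack(per_scale).max(dim=0).values              # (t,)
```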


TABLE VIII: Performance difference between single and multi-scale region-wise ingredient recognition.

Our analysis shows that multi-scale recognition boosts the confidence of prediction for both true positive and false negative ingredients. As a consequence, simply using a confidence threshold of 0.5 for selecting ingredients results in slightly lower precision. On the other hand, although multi-scale recognition can inherently generate more training samples, its contribution to categories with few training examples is not significant. We argue that a better alternative is the adaptive fusion of results from multiple scales, rather than simple thresholding for multi-labeling, which is beyond the scope of this paper.

Why not multi-task region-wise recognition? Region-wise ingredient recognition can be carried out in a multi-task learning fashion together with food recognition. Figure 11 depicts an end-to-end learning architecture, where the two tasks branch out from the last convolutional layer. Intuitively, such an architecture might learn to strike a balance between image-level and region-level learning, reaching optimal performance by contextualizing ingredient recognition while attending to regional features.

Fig. 11. The pipeline of multi-task region-wise recognition based on a ResNet backbone. The model performs region-wise ingredient categorization and image-level food recognition.
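A minimal sketch of the two-branch design in Figure 11, assuming a ResNet-50 feature map and global average pooling for the food branch (the pooling choice and all names are assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskRegionWise(nn.Module):
    """Sketch of the Figure 11 architecture: both branches share the last
    convolutional feature map. One branch classifies ingredients region-wise
    with max pooling; the other predicts the food category at image level."""
    def __init__(self, feat_dim=2048, num_foods=251, num_ingredients=406):
        super().__init__()
        self.ingr_cls = nn.Linear(feat_dim, num_ingredients)  # shared across regions
        self.food_cls = nn.Linear(feat_dim, num_foods)        # image-level head

    def forward(self, feat_map):                    # (B, d, m, m), e.g. ResNet-50
        regions = feat_map.flatten(2).transpose(1, 2)           # (B, m*m, d)
        region_probs = F.softmax(self.ingr_cls(regions), dim=-1)
        ingr_probs = region_probs.max(dim=1).values             # Eq. (5) pooling
        food_logits = self.food_cls(feat_map.mean(dim=(2, 3)))  # global average pool
        return food_logits, ingr_probs
```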
TABLE IX: Performance difference between single and multi-task region-wise ingredient recognition.

Table IX compares the performance when region-wise recognition is implemented in single- and multi-task fashions. As noted, the multi-task implementation degrades the recognition rate significantly across the different CNN backbones. We attribute the failure to the fact that the two tasks perform recognition at different levels of granularity, i.e., image versus region information. Optimizing both tasks based on the architecture in Figure 11 might lead to conflicting learning objectives, resulting in fluctuating performance. For example, the performance on ingredients such as “raisin” and “ginseng”, which are small in size, decreases when adopting multi-task region-wise recognition. This might be because introducing the image categorization task forces the model to pay more attention to global features, which somehow overlooks the small ingredients and harms the regional features optimized for ingredient recognition.

When visualizing the response maps (Figure 12), we also notice that the ingredients cannot be localized as precisely as in the single-task model. In some cases, regions with multiple ingredients are attended to, while small-size ingredients are overlooked. Furthermore, the inherent data augmentation in region-wise recognition cannot be effectively leveraged by multi-task learning. As a consequence, the recognition rate for ingredients with a small number of training samples does not improve compared to the single-task model. The result is worse than that of single-task image-level ingredient recognition, implying that the multi-task model fails to take advantage of context and region-level information for recognition. More advanced architectures, such as the attention branch network that leverages spatial attention [63], are worth further exploration.

Performance on the UEC Food-100 [29] dataset. We further conduct evaluations on the UEC Food-100 dataset. UEC Food-100 is a Japanese food dataset including 14,361 images from 100 categories of food. [16] labeled this dataset with 190 ingredient classes; by merging duplicate ingredient labels, we finally obtain 176 ingredients. Basically, similar observations can be made on UEC Food-100.

TABLE X: Performance of multi-label ingredient recognition on UEC Food-100.

As shown in Table X, multi-task learning generally improves the performance of ingredient recognition on all backbone models. It is worth noting that, due to the lower resolution of the images, the ingredient recognition performance on UEC Food-100 is much lower than that on Vireo Food-251.

Table XI further compares the performance of image-level recognition, region-wise recognition and multi-scale region-wise recognition. Similar to the results on Vireo Food-251, region-wise recognition improves the performance in terms of macro-F1. Since region-wise recognition is equivalent to data augmentation, it benefits the recognition of ingredient categories with only a few training samples, hence leading to higher macro-F1.

TABLE XI granularities, are conflicting in learning objectives. Optimiz-


P ERFORMANCE OF I NGREDIENT R ECOGNITION ON UEC F OOD -100 ing both tasks in a multi-task learning fashion needs more
sophisticated network architecture, or otherwise will result in
significant performance degradation as shown in our analysis.
Future work should pay more attention to adaptive fusion
of recognition results from multiple image scales as well
as effective leveraging food categorization to contextualize
ingredient recognition.
Several research problems can be explored on Vireo Food-
251 dataset. First, the dataset is highly unbalanced in the num-
ber of training examples for different ingredient labels, ranging
from 1 to 32,859 examples. The distribution is long-tail as in
real-world scenarios. Solutions such as few-shot learning could
be promising for pushing the recognition rate at the tail of the
distribution. Second, the co-occurrence probability of ingre-
dients are not random, but follows certain inherent rules in
cooking practice. Mining and applying such rules are expected
to boost ingredient recognition. Finally, Vireo food-251 can
be studied jointly with other datasets for domain adaptation
based ingredient recognition. Examples include transferring
the model trained by Chinese food to recognize ingredients
in Western cuisines with different cooking methods.

Fig. 12. Ingredient localization with multi-task region-wise recognition R EFERENCES


model: original image (left) and the response maps of three ingredients. The
backbone network is ResNet-50. [1] A. H. Goris, M. S. Westerterp-Plantenga, and K. R. Westerterp, “Under-
eating and underrecording of habitual food intake in obese men: Selec-
tive underreporting of fat intake,” Amer. J. Clin. Nutrition, vol. 71, no. 1,
of multi-scale recognition are much lower than single-scale pp. 130–134, Jan. 2000.
recognition. This is due to the reason that the images [2] V. Bettadapura, E. Thomaz, A. Parnami, G. D. Abowd, and I. Essa,
“Leveraging context to support automated food recognition in restau-
in UEC Food-100 are in low resolution, resizing the rants,” in Proc. IEEE Winter Conf. Appl. Comput. Vis., Jan. 2015,
lower resolution images to a higher resolution to perform pp. 580–587.
multi-scale recognition will introduce noise hence leading to [3] M.-Y. Chen et al., “Automatic chinese food identification and quantity
estimation,” in Proc. SIGGRAPH Asia Tech. Briefs (SA), 2012, p. 29.
worse recognition results. [4] K. Kitamura, T. Yamasaki, and K. Aizawa, “Food log by analyzing food
images,” in Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 999–1000.
VI. C ONCLUSION [5] Z.-Y. Ming, J. Chen, Y. Cao, C. Forde, C.-W. Ngo, and T. S. Chua,
“Food photo recognition for dietary tracking: System and experiment,”
We have presented a Chinese food dataset, along with two in Proc. Int. Conf. Multimedia Modeling. Cham, Switzerland: Springer,
proposed methods for ingredient recognition. The common 2018, pp. 129–141.
[6] Y. Kawano and K. Yanai, “FoodCam-256: A large-scale real-time mobile
challenges, regardless of multi-task learning or region-wise food RecognitionSystem employing high-dimensional features and com-
recognition, are an unbalanced number of training examples, pression of classifier weights,” in Proc. ACM Int. Conf. Multimedia,
varying sizes and scales of ingredients under different image 2014, pp. 761–762.
capturing conditions. On the other hand, similar to most [7] K. Aizawa and M. Ogawa, “FoodLog: Multimedia tool for healthcare
applications,” IEEE MultimediaMag., vol. 22, no. 2, pp. 4–8, Apr. 2015.
recognition tasks, the experiments also show a large margin of [8] D. G. Lowe, “Object recognition from local scale-invariant features,” in
improvement when deeper networks are employed. Leveraging Proc. 7th IEEE Int. Conf. Comput. Vis., Sep. 1999, pp. 1150–1157.
the food category as a prior, such as in multi-task learning, has [9] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
advantages for recognizing ingredients that are unique only for Recognit. (CVPR), Jun. 2005, pp. 886–893.
a few numbers of food categories. For ingredients frequently [10] M. A. Stricker and M. Orengo, “Similarity of color images,” in Proc.
appear in different dishes, the performances are either not Storage Retr. Image Video Databases III, Mar. 1995, pp. 381–392.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
improved or degraded. Comparing image and region-wise with deep convolutional neural networks,” in Proc. Neural Inf. Process.
recognitions, the latter improves recognition performance for Syst., 2012, pp. 1097–1105.
ingredients in small size and labels with less number of [12] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” 2014, arXiv:1409.1556. [Online].
training examples. Region-wise recognition is effective in seg- Available: http://arxiv.org/abs/1409.1556
regating irrelevant parts of an image from recognition, while [13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
augmenting image patches which results in more examples image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2016, pp. 770–778.
for model training. Nevertheless, as indicated in our result, [14] K. Lyu, Y. Li, and Z. Zhang, “Attention-aware multi-task convolutional
Nevertheless, as indicated in our results, multi-scale image processing to compensate for the loss of image context is not helpful for ingredient recognition. Furthermore, image-level food categorization and region-level ingredient recognition, which leverage different levels of feature



Jingjing Chen (Member, IEEE) received the Ph.D. degree in computer science from the City University of Hong Kong in 2018. She is currently a pre-tenured Associate Professor with the School of Computer Science, Fudan University. Before joining Fudan University, she was a Postdoctoral Research Fellow with the School of Computing, National University of Singapore. Her research interests include diet tracking and nutrition estimation based on multi-modal processing of food images, including food recognition and cross-modal recipe retrieval.

Bin Zhu (Graduate Student Member, IEEE) received the B.Sc. degree from Southeast University, Nanjing, China, in 2015, and the M.Sc. degree from Zhejiang University, Hangzhou, China, in 2018. He is currently pursuing the Ph.D. degree with the VIREO Group, Department of Computer Science, City University of Hong Kong. His research interests include diet tracking, generative models, and multimedia analysis, including food recognition, cross-modal recipe retrieval, nutrition estimation, and image generation.

Chong-Wah Ngo received the B.Sc. and M.Sc. degrees in computer engineering from Nanyang Technological University, Singapore, and the Ph.D. degree in computer science from The Hong Kong University of Science and Technology (HKUST), Hong Kong. He is currently a Professor with the Department of Computer Science, City University of Hong Kong, Hong Kong. Before joining the City University of Hong Kong, he was a Postdoctoral Scholar with the Beckman Institute, University of Illinois at Urbana-Champaign (UIUC), Urbana, IL, USA. He was also a Visiting Researcher with Microsoft Research Asia, Beijing, China. His research interests include large-scale multimedia information retrieval, video computing, multimedia mining, and visualization. He was the Conference Co-Chair of the ACM International Conference on Multimedia Retrieval 2015 and the Pacific Rim Conference on Multimedia 2014. He also served as the Program Co-Chair for ACM Multimedia Modeling 2012 and ICMR 2012. He was the Chairman of ACM (Hong Kong Chapter) from 2008 to 2009. He was an Associate Editor of the IEEE Transactions on Multimedia (2011–2014).

Tat-Seng Chua received the Ph.D. degree from the University of Leeds, U.K. He is the KITHCT Chair Professor with the School of Computing, National University of Singapore, where he was the Acting and Founding Dean of the School from 1998 to 2000. His main research interests include multimedia information retrieval and social media analytics. In particular, his research focuses on the extraction, retrieval, and question-answering (QA) of text and rich media arising from the Web and multiple social networks. He is the Co-Director of NExT, a joint center between NUS and Tsinghua University, to develop technologies for live social media search. He is the 2015 winner of the prestigious ACM SIGMM Award for Outstanding Technical Contributions to Multimedia Computing, Communications, and Applications. He is the Chair of the Steering Committee of the ACM International Conference on Multimedia Retrieval (ICMR) and Multimedia Modeling (MMM) conference series. He is also the General Co-Chair of ACM Multimedia 2005, ACM CIVR (now ACM ICMR) 2005, ACM SIGIR 2008, and ACM Web Science 2015. He serves on the editorial boards of four international journals. He is the Co-Founder of two technology startup companies in Singapore.

Yu-Gang Jiang (Member, IEEE) received the Ph.D. degree in computer science from the City University of Hong Kong. He is currently a Professor with the School of Computer Science and the Vice Director of the Shanghai Engineering Research Center for Video Technology and System, Fudan University, China. His Laboratory for Big Video Data Analytics conducts research on all aspects of extracting high-level information from big video data, such as video event recognition, object/scene recognition, and large-scale visual search. Before joining Fudan University in 2011, he spent three years at Columbia University. His work has led to many awards, including the "Emerging Leader in Multimedia" Award from IBM T. J. Watson Research in 2009, the Early-Career Faculty Award from Intel and China Computer Federation, the 2014 ACM China Rising Star Award, the 2015 ACM SIGMM Rising Star Award, and the Research Award for outstanding young researchers from NSF China.
