A Study of Multi-Task and Region-Wise Deep Learning for Food Ingredient Recognition

IEEE Transactions on Image Processing, Vol. 30, 2021
features such as SIFT [8], HOG [9] and color [10]. Thanks to deep learning [11]–[15], there have been several recent studies [16], [17] that report high food recognition accuracy of up to 80% on medium-scale benchmark datasets, such as Vireo Food-172 [16] and Food-101 [18]. The success of food classification with deep learning techniques has inspired researchers to explore a more challenging problem, i.e., understanding the ingredient composition of a dish [16], [19]–[21].

Ingredient recognition is generally a harder problem than food categorization. The size, shape, and color of an ingredient can exhibit large visual differences due to diverse ways of cooking and cutting, in addition to changes in viewpoint and lighting conditions. This paper studies the recognition of ingredients in the domain of Chinese dishes. This domain is particularly challenging because dishes are often composed of a variety of ingredients fuzzily mixed together, rather than separated into different food containers or presented as non-overlapping food items, as frequently seen in Japanese and Western dishes.

This paper describes two methods for ingredient recognition. The methods are not completely new in the literature of food recognition [16], [20]. This paper presents a thorough analysis of both methods, including their strengths and limitations in ingredient recognition. The first method is based on multi-task learning that relies on global image features for simultaneous food and ingredient classification. The motivation is to exploit the mutual relationship between the food category and its ingredients for better performance. The key ingredients of a category remain similar despite being combined with different auxiliary ingredients. Knowing the food category thus eases the recognition of ingredients. For example, the ingredient "cherry tomatoes" has a higher chance than "pork" of appearing in the dish "shredded oyster mushroom salad". Hence, learning ingredients with the food category in mind should in principle lead to better performance. The second method does not leverage food category information. Instead, ingredient recognition is performed at the image region level: rather than globally pooling features for recognition, ingredients are first predicted for each local image patch and then pooled across regions to form the final recognition result.

This paper also contributes a large dataset, Vireo Food-251, composed of 169,673 images with 251 Chinese food categories and 406 ingredient labels. In terms of the number of food categories, this new dataset is on par with UEC Food-256 [22] and ChineseFoodNet [23] with 208 categories. Note that ingredient labels are not available in either of these datasets. In the literature, Food-101 [18] also includes ingredient labels. Nevertheless, it assumes that all dishes under a food category share the same list of ingredients, which makes the dataset inappropriate for ingredient recognition.

This work extends [16] by comparing the originally proposed multi-task learning framework to region-wise ingredient recognition. More in-depth studies, including the issues of image-level versus region-level recognition, single versus multi-scale feature pooling, and single versus multi-task learning, are presented with new empirical insights. Furthermore, we extend the original Vireo Food-172 dataset from 172 to 251 food categories and from 353 to 406 ingredient labels. The main contributions are the sharing of a large food dataset, and the comparative studies of various compelling issues in ingredient recognition through the methods of multi-task learning and region-wise recognition. The rest of the paper is organized as follows. Section II reviews related works while Section III introduces the extended dataset. Section IV presents two baselines, i.e., multi-task learning and region-wise multi-label classification, for ingredient recognition. Section V details the performance of the two baselines on Vireo Food-251. Finally, Section VI concludes this paper.

II. RELATED WORK

Food recognition has become a popular research topic in recent years and variants of recognition-centric approaches have been investigated for different food-related applications. These efforts include food quantity estimation based on depth images [3], image segmentation for volume estimation [24], [25], context-based recognition by GPS and restaurant menus [2], [26], [27], taste estimation [28], multi-food recognition [29]–[32], personalized recognition [33], multi-modal fusion [34] and real-time recognition [5]–[7], [35], [36]. This section mainly reviews previous works on food and ingredient recognition.

A. Food Recognition

The challenge of food recognition comes from visual variations in shape, color and texture layout. These variations are hard to tackle with hand-crafted features such as SIFT [8], HOG [9] and color [10]. Instead, features extracted from a deep convolutional neural network (DCNN) [11], trained on ImageNet [37] and fine-tuned on food images, often exhibit impressive recognition performance [22], [25], [38]–[42]. Combining multi-modal features sometimes also leads to better recognition performance, as reported in [38], [43]. Recent works mostly focus on new architectures for food recognition, such as wide-slice residual networks [44] and bilinear CNN models [45]. As reported in [44], the best performances on both UEC Food-100 and Food-101 are achieved by wide-slice residual networks that contain two branches: a residual network branch and a slice branch with slice convolutional layers. Apart from deep architectures, different learning strategies are also investigated [16], [19], [46]. For example, in [19] and [16], food recognition is formulated as a multi-task learning problem by leveraging ingredient labels or taste labels as supplementary supervised information. By treating ingredients as privileged information, Meng et al. [46] propose a cross-modal alignment and transfer network for food recognition. In addition to using static datasets for model training, zero- and few-shot learning has also started capturing research attention [47], [48].

B. Ingredient Recognition

Compared to food categories, ingredients exhibit larger visual appearance variations due to different cooking and cutting methods. Labeling of ingredients also poses a higher challenge and only very few datasets [16] are constructed for ingredient recognition. An early work is PFD [49],
which leverages the result of ingredient recognition for food categorization. In PFD, based upon the appearance of image patches, pixels are softly labeled with ingredient categories. The spatial relationship between pixels is then modeled as a multi-dimensional histogram, characterized by label co-occurrence and geometric properties such as distance and orientation. With this histogram representation, PFD shows impressive food recognition performance. PFD, nevertheless, is hardly scalable in the number of ingredients. With only eight categories of ingredients, as demonstrated in [49], the histogram already grows to tens of thousands of dimensions. Other earlier works explore spatial layout [50], feature mining [18] and image segmentation [25] for ingredient or food item recognition. In [50], ingredient regions are detected by shape and texture models, where the shape is based on DPM (deformable part-based model) while the texture is based on STF (semantic texton forest). Similar to PFD [49], the regions are encoded into a histogram modeling the spatial relationship between them for food recognition. The spatial relationship is not statistically encoded as in [49]; rather, explicit relationships such as "above", "below", and "overlapping" are modeled. Such relationships are helpful for recognizing foods such as desserts and fast food, but difficult to generalize, for example to Chinese dishes. In [18], an interesting work that mines the composition of ingredients as discriminative patterns is proposed for food classification. A drawback of this approach is the requirement of image segmentation, which is sensitive to parameter settings and can impact recognition performance. As reported in [18], the performance is not better than that of a DCNN without image segmentation on the Food-101 dataset. Similar to [24], image segmentation is employed in [25], but using a more advanced technique based on a conditional random field (CRF) with unary potentials provided by a DCNN [51]. The promising segmentation performance for Western food, nevertheless, comes at the price of requiring training labels with manual segmentation of food items for model learning. For Chinese food, collecting such training labels is extremely difficult, given the fuzzy composition and placement of ingredients as shown in Figure 1.

Ingredient recognition is posed as a multi-label learning problem [16]. More recent works exploit neural networks, including DCNN and the deep Boltzmann machine (DBM), for this problem [16], [52]. To increase the robustness of recognition, multi-task learning, which leverages food category labels as supplementary supervised information, is often employed for simultaneous classification of food and ingredient labels [16], [19]. As the appearance of an ingredient changes depending on food preparation, cooking and cutting methods are also explored as supervised information in [20] for ingredient recognition. However, as food preparation is a process, labeling of ingredients with cooking and cutting attributes is complicated and not intuitive. Other supervised information explored in the literature includes restaurant menus using a bipartite graph representation [53], and cuisine and course using DBM [52]. Another branch of approaches poses ingredient recognition as a cross-modal learning problem [54]–[56]. Specifically, both images and recipes are projected into a joint embedding space for similarity measurement. Ingredients are either extracted from the matched recipe of an image [54], [56] or directly predicted from the joint space [57]. However, as the performance is not scalable to large recipe datasets, as studied in [58], and cross-modal learning is inherently a "black box" model, the robustness of these approaches has not yet been seriously studied.

III. DATASET

We construct a large food dataset specifically for Chinese dishes, namely Vireo Food-251. Different from other publicly available datasets [18], [29], [59], both food categories and ingredient labels are included. To the best of our knowledge, this is the largest dataset that provides both food categories and ingredient labels.

A. Dataset Collection

VIREO Food-251 is extended from the original Vireo Food-172 [16]. With the newly added 79 food categories, the dataset covers most of the popular Chinese dishes. The new dataset is compiled from the top-200 popular foods listed on the website "Go Cooking"¹. The popularity is sorted based on the number of user-uploaded food images. The popular food list is further combined with the original

¹ https://www.xiachufang.com/category/
Fig. 3. The distribution of food categories under eight major food groups in Vireo Food-251.
Fig. 4. The ingredient "egg" shows large differences in visual appearance across different kinds of dishes.
Fig. 6. Four different deep architectures for multi-task learning of food category and ingredient recognition.
We formulate food categorization and ingredient recognition as a multi-task deep learning problem and modify the architecture of a DCNN for this purpose. The modification is not straightforward as it involves two design issues. The first is whether the prediction scores of the two tasks should directly or indirectly influence each other. Direct influence means that the output of one task is connected as the input of the other task. Indirect influence decouples this connection such that each task is on a different path of the network, and the two tasks influence each other by updating the shared intermediate layers. The second issue is the degree to which the intermediate layers should be shared. Ideally, each task should have its own private layer(s), given that the nature of the two tasks, single versus multi-label, is different. In such a way, the parameters can be updated more freely to optimize individual performance.

Based on the two design issues, we derive four different deep architectures, as depicted in Figure 6, respectively named Arch-A to Arch-D. The first design (Arch-A) considers a stacked architecture by placing food categorization on top of ingredient recognition, and vice versa. As the composition of ingredients for different dishes under the same food category can be different, this architecture carries the risk that model learning converges slowly, as observed in the experiments. The second design (Arch-B) is similar except that indirect influence is adopted and the two tasks are on different pathways. Both designs are relatively straightforward to implement by adding additional layers to the DCNN. The next two architectures consider the decoupling of some intermediate layers. The third design (Arch-C) allows each task to privately own two intermediate layers on top of the convolutional layers for parameter learning. The last design (Arch-D) is a compromise between the second and third architectures, having one shared and one private layer. Arch-D has the peculiarity that the shared layer can capture the high or mid-level features common to the two tasks at the early stage of learning, while the private layer preserves the learning of specialized features useful for optimizing the performance of each task.

The architectures are modified from existing deep models, including VGG-16 [12], ResNet-50 [13], ResNet-101 [13], and SENet-154 [60]. In terms of design, the major modification is made on the fully connected layers. As VGG contains two fully connected layers, we modify them to implement all the architectures presented in Figure 6. For the private layers in Arch-D, there are 4,096 neurons for both the food categorization and ingredient recognition layers. ResNet and SENet, on the other hand, have only one fully connected layer; in this case, only Arch-B can be implemented. Due to the different natures of the two tasks, we adopt the multinomial logistic loss $L_1$ for single-label food categorization and the cross-entropy loss $L_2$ for multi-label ingredient recognition. Denoting $N$ as the total number of training images, the overall loss function $L$ is as follows:

$$L = -\frac{1}{N}\sum_{n=1}^{N}\left(L_1 + \lambda L_2\right) \tag{1}$$

where $\lambda$ is a trade-off parameter. This loss function is also widely used in other works such as [61]. During training, the errors propagated from the two branches are linearly combined, and the weights of the convolutional layers shared between the two tasks are updated accordingly. The update subsequently affects the last two layers simultaneously, adjusting the features separately owned by food and ingredient recognition. Let $\hat{q}_{n,y}$ be the predicted score of an image $x_n$ for its ground-truth food label $y$; $L_1$ is defined as follows:

$$L_1 = \log(\hat{q}_{n,y}) \tag{2}$$

where $\hat{q}_{n,y}$ is obtained from the softmax activation function. Furthermore, denote $p_n \in \{0, 1\}^t$, a vector in $t$ dimensions, as the ground-truth ingredients of an image $x_n$. Basically, $p_n$ is a binary vector whose entries of value 1 or 0 indicate the presence or absence of an ingredient. The loss function $L_2$ is defined as

$$L_2 = \sum_{c=1}^{t}\left[p_{n,c}\log(\hat{p}_{n,c}) + (1 - p_{n,c})\log(1 - \hat{p}_{n,c})\right] \tag{3}$$

where $\hat{p}_{n,c}$ denotes the probability of ingredient category $c$ being present in $x_n$, obtained through the sigmoid activation function.
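To make the formulation concrete, below is a minimal PyTorch-style sketch of an Arch-D head together with the joint loss of Eqs. (1)–(3). The layer sizes, the names (feat_dim, hidden) and the use of PyTorch's built-in loss functions are illustrative assumptions, not the authors' released implementation.

```python
# A sketch of an Arch-D multi-task head: one shared fully connected layer
# plus one private layer per task, as described above. Sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArchDHead(nn.Module):
    def __init__(self, feat_dim=4096, hidden=4096, num_foods=251, num_ingredients=406):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True))
        # Private layers preserve task-specific features for each branch.
        self.food_private = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(inplace=True))
        self.ingr_private = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(inplace=True))
        self.food_out = nn.Linear(hidden, num_foods)         # single-label logits
        self.ingr_out = nn.Linear(hidden, num_ingredients)   # multi-label logits

    def forward(self, feats):
        h = self.shared(feats)
        return self.food_out(self.food_private(h)), self.ingr_out(self.ingr_private(h))

def multitask_loss(food_logits, ingr_logits, food_labels, ingr_labels, lam=1.0):
    """Joint loss in the spirit of Eq. (1): softmax cross-entropy (Eq. (2))
    plus lambda-weighted binary cross-entropy over ingredients (Eq. (3))."""
    l1 = F.cross_entropy(food_logits, food_labels)   # -(1/N) sum log q_hat
    l2 = F.binary_cross_entropy_with_logits(ingr_logits, ingr_labels.float())
    return l1 + lam * l2
```

Both gradients flow through the shared layer, so the two branches influence each other indirectly, as in the description of Arch-B and Arch-D above.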
B. Region-Wise Ingredient Recognition

The previous section considers the global image feature for multi-label learning, while ignoring regional information. This section introduces region-wise ingredient recognition, as illustrated by the pipeline in Figure 7. Given a food image $I$, the feature map (denoted as $F_I \in \mathbb{R}^{m \times m \times d}$), which corresponds to the last convolutional layer of the DCNN and retains the spatial information of the original image, is extracted. The feature map is divided into $m \times m$ grids, where each grid is represented by a vector of $d$ dimensions.

Fig. 7. The pipeline of region-wise ingredient recognition. Given a food image, the feature maps from the last convolutional layer of deep models are extracted for region-wise ingredient classification. Max pooling is performed across different regions to obtain the final predictions.
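As a concrete illustration of this pipeline, the sketch below applies a shared 1×1 convolutional classifier to every grid cell of the feature map and max-pools the per-ingredient scores across regions; the shapes (e.g., a 7×7×2048 map from ResNet-50) are assumptions for illustration, not the authors' code.

```python
# A minimal sketch of region-wise ingredient recognition (Figure 7):
# classify each of the m x m grid cells, then max-pool across regions.
import torch
import torch.nn as nn

class RegionWiseHead(nn.Module):
    def __init__(self, d=2048, num_ingredients=406):
        super().__init__()
        # A 1x1 convolution is a linear classifier shared by all grid cells.
        self.region_classifier = nn.Conv2d(d, num_ingredients, kernel_size=1)

    def forward(self, feature_map):                  # (B, d, m, m)
        region_logits = self.region_classifier(feature_map)        # (B, t, m, m)
        # An ingredient is predicted if any region responds strongly to it.
        image_logits = region_logits.flatten(2).max(dim=2).values  # (B, t)
        return torch.sigmoid(image_logits)           # per-ingredient probabilities

# Example with an assumed ResNet-50 style 7x7x2048 feature map:
probs = RegionWiseHead()(torch.randn(2, 2048, 7, 7))   # shape (2, 406)
```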
prediction in one task will directly affect the other task. On the other hand, while having separate paths as in Arch-B leads to better performance, the improvement is smaller compared with Arch-C and Arch-D, which do not share the same lower layer before the classification layer. Arch-D, which shares one layer while also learning separate layers tailor-made for each task, attains the best performance in terms of micro-F1 and macro-F1.

Table II lists the performance of food categorization. For multi-task learning, similar trends are observed as for ingredient recognition. For top-1 accuracy, the best result is attained by Arch-D, while for top-5, the best results are attained by Arch-C. This basically verifies the importance of private layers for both tasks. It is worth noting that, different from ingredient recognition, Arch-A1 performs much better than Arch-A2 for food categorization. The result indicates that recognizing the food category based on the composition of ingredients is more feasible than inferring ingredients based on the food category. To verify that the improvement is not by chance, we conduct a significance test comparing multi-task (Arch-D) and single-task (VGG) learning using the source code provided by TRECVID². The test is performed by partial randomization with 100,000 iterations, under the null hypothesis that the improvement is due to chance. At a significance level of 0.05, Arch-D is significantly different from VGG in both food categorization and ingredient recognition, measured by top-1 accuracy and macro-F1, respectively. The p-values are close to 0, which rejects the null hypothesis.

To further contrast the performance between single-task and multi-task learning models, Table III lists the ingredients showing large deviations in performance. Basically, for ingredients that are unique to a few food categories, the multi-task learning model performs much better than the single-task learning model. For example, "cordyceps sinensis" only appears in "black chicken soup", and hence multi-task VGG is able to outperform single-task VGG by a large margin. Another example is "red bean paste": as it is unique to the food category "traditional Chinese rice-pudding", multi-task VGG outperforms single-task VGG by 33.1%. On the contrary, multi-task learning suffers from lower performance when confused by frequently appearing ingredients. As shown in Table III, ingredients such as "corn block", "cabbage" and "sliced tomato" are always seen in different food categories. Introducing food category information for multi-task learning increases the confusion and hence results in lower recognition performance. Overall, with multi-task learning, the macro-F1 is boosted from 56.79% to 61.74%, with 285 ingredients showing improvements.

TABLE IV: Performances of ingredient recognition in each group. The number of ingredient categories and the average and median numbers of training images are shown in the 2nd, 3rd and 4th columns, respectively.

To provide insight into which types of ingredients are difficult to recognize, we divide the ingredients into ten major food groups and report the macro-F1 of each group in Table IV. As shown, the macro-F1 for "fish" and "meat" is fairly high due to a sufficient number of training samples. The average numbers of training samples for the ingredients in "meat" and "fish" are 1,226 and 1,301, respectively, which results in high recognition accuracies. On the contrary, the group "fruits" has only 240 training samples on average, the fewest among all the groups. The median number is even lower, only 40, as most of the training samples are from the ingredient "pineapple". As a result, the macro-F1 of "fruits" is rather low, only 24.03%. Despite having 903 training samples on average, the group "seasonings" has the second-lowest macro-F1, which suggests that recognizing ingredients in the "seasonings" group is relatively challenging compared with the other ingredients.

² http://www-nlpir.nist.gov/projects/t01v/trecvid.tools/randomization.testing
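For reference, the partial-randomization significance test described above can be sketched as follows; this is a generic permutation-style illustration over assumed per-image correctness scores, not the TRECVID tool itself.

```python
# A sketch of a partial randomization test: under the null hypothesis, the
# outputs of the two systems are exchangeable on each test image.
import numpy as np

def randomization_test(scores_a, scores_b, iters=100_000, seed=0):
    """Returns a p-value for the observed difference in mean score."""
    rng = np.random.default_rng(seed)
    a0 = np.asarray(scores_a, dtype=float)
    b0 = np.asarray(scores_b, dtype=float)
    observed = abs(a0.mean() - b0.mean())
    hits = 0
    for _ in range(iters):
        swap = rng.random(a0.size) < 0.5       # randomly swap per image
        a = np.where(swap, b0, a0)
        b = np.where(swap, a0, b0)
        if abs(a.mean() - b.mean()) >= observed:
            hits += 1
    return (hits + 1) / (iters + 1)            # significant if below 0.05
```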
TABLE V: Performance of multi-label ingredient recognition on the Vireo Food-251 dataset. Note that Arch-B is implemented for multi-task learning.

Figure 8 shows three failure examples of seasoning ingredient recognition. In these examples, the seasoning ingredients are "dried chili", "minced garlic", "broad bean paste", "minced ginger" and "minced green onion". Basically, there are three major reasons for the low performance of seasoning ingredient recognition. First, some seasoning ingredients tend to be confused with each other due to similar appearances. For example, in Figure 9(a), "dried chili sections" is incorrectly predicted as "dried chili". As "dried chili sections" differs from "dried chili" only in shape, the model tends to confuse them because of occlusions among different ingredients in the dish. Second, seasoning ingredients tend to be small in size and portion, which makes the recognition of "minced garlic" in Figure 9(b)-(c) and "minced ginger" in Figure 9(c) difficult. Such examples are not easy to recognize even for humans. Third, as the training samples for some seasoning ingredients are not sufficient, the recognition performance is not satisfactory. For example, there are only 18 samples for the seasoning ingredient "broad bean paste", resulting in the incorrect prediction shown in Figure 9(b).

To validate the effectiveness of the multi-task learning strategy for other backbone networks (e.g., ResNet [13], Inception-V3 [62], SENet-154 [60]), we report the performances of different backbone networks in Table V. As most of these convolutional networks contain only one fully connected layer, we implement Arch-B for multi-task learning on the different backbones. Although Arch-B is not the optimal design for multi-task learning, it still performs better than single-task learning models, which verifies the effectiveness of leveraging food category information for ingredient recognition. The general trend is that the deeper a network is, the higher the recognition rate. In the case of ingredient recognition, providing food categories as an extra global cue can reduce false positives. This advantage does come with the trade-off of introducing more false negatives. As a result, in terms of numerical scores, single and multi-task learning do not seem to differ much. We further performed result analysis and noticed the following. The false positives introduced by single-task learning include major ingredients, while the false negatives introduced by multi-task learning are mostly auxiliary ingredients. From the application point of view, auxiliary ingredients have much less impact than major ingredients on the estimation of nutrition facts. Furthermore, false positives can adversely frustrate user experience in food logging. Hence, multi-task learning still has its advantage over single-task learning despite the marginal improvement in numerical scores.

B. Region-Wise Recognition

We then evaluate the performance of region-wise ingredient recognition. Table VI compares the ingredient recognition performance between image-level and region-wise recognition models.

TABLE VI: Performance of ingredient recognition.

Basically, region-wise recognition helps to improve the ingredient recognition performance on all backbone networks except SENet-154. Since the key idea of SENet is to re-calibrate channel-wise (i.e., region-wise) feature responses by explicitly modeling inter-dependencies between channels (regions), performing region-wise recognition for SENet-154 harms the dependencies among region features and leads to worse recognition performance. On the contrary, with region-wise recognition, VGG and ResNet-50 further improve macro-F1 by around 5% and 4%, respectively. The improvement in terms of micro-F1 is less obvious, at around 1%. This is due to the fact that region-wise recognition mostly benefits ingredient categories with a smaller number of training examples. For example, the rare ingredient "cordyceps sinensis", having only 10 training examples, improves F1 from 0% to 100%. This is because region-wise recognition, similar to data augmentation, inherently increases the number of training examples. Furthermore, ingredients that are small in size are less likely to be dominated by other ingredients during feature learning. As a consequence, the contribution of region-wise recognition is more significant for ingredients small in size and with fewer training examples. Figure 9 shows examples contrasting the performance between region-wise and image-level recognition. Region-wise recognition is robust to size (e.g., "parsley" in Figure 9(a)), cutting method (e.g., "minced ginger" in Figure 9(b)) and training size (e.g., "kiwi fruit" in Figure 9(c)). On the other hand, limited by region size, context information is not fully leveraged. Ingredients such as "crisp fritter" in Figure 9(e) and "yellow peach" are predicted correctly by image-level but not region-wise recognition. Table VII shows the top-10 ingredients that gain the largest improvement due to region-wise recognition.
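As a side note on the two metrics reported here, micro-F1 pools every image-ingredient decision and is therefore dominated by frequent ingredients, whereas macro-F1 averages the per-ingredient F1 scores and is sensitive to rare ingredients such as "cordyceps sinensis". A small sketch with scikit-learn on toy multi-label data:

```python
# Micro- vs. macro-F1 on a toy multi-label indicator matrix
# (rows: images, columns: ingredients). Data here is illustrative only.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

print(f1_score(y_true, y_pred, average='micro'))  # pooled over all decisions
print(f1_score(y_true, y_pred, average='macro'))  # mean of per-ingredient F1
```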
Fig. 9. Examples of ingredient recognition. False positives are marked in red while false negatives are marked in yellow. The backbone network is ResNet-50.
TABLE VII: Ten ingredients showing large performance improvement in F1 with the region-wise recognition model. The backbone network is ResNet-50.
Fig. 10. Ingredient localization: original image (left) and the response maps
of three ingredients. The backbone network is ResNet-50.
TABLE VIII: Performance difference between single and multi-scale region-wise ingredient recognition.
TABLE IX: Performance difference between single and multi-task region-wise ingredient recognition.
[16] J. Chen and C.-W. Ngo, "Deep-based ingredient recognition for cooking recipe retrieval," in Proc. ACM Multimedia Conf., 2016, pp. 32–41.
[17] N. Martinel, G. L. Foresti, and C. Micheloni, "Wide-slice residual networks for food recognition," 2016, arXiv:1612.06543. [Online]. Available: http://arxiv.org/abs/1612.06543
[18] L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 – Mining discriminative components with random forests," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 446–461.
[19] X.-J. Zhang, Y.-F. Lu, and S.-H. Zhang, "Multi-task learning for food identification and analysis with deep convolutional neural networks," J. Comput. Sci. Technol., vol. 31, no. 3, pp. 489–500, May 2016.
[20] J.-J. Chen, C.-W. Ngo, and T.-S. Chua, "Cross-modal recipe retrieval with rich food attributes," in Proc. 25th ACM Int. Conf. Multimedia, Oct. 2017, pp. 1771–1779.
[21] M. Bolaños, A. Ferrà, and P. Radeva, "Food ingredients recognition through multi-label learning," in Proc. Int. Conf. Image Anal. Process. Cham, Switzerland: Springer, 2017, pp. 394–402.
[22] K. Yanai and Y. Kawano, "Food image recognition using deep convolutional network with pre-training and fine-tuning," in Proc. IEEE Int. Conf. Multimedia Expo Workshops (ICMEW), Jun. 2015, pp. 1–6.
[23] X. Chen, Y. Zhu, H. Zhou, L. Diao, and D. Wang, "ChineseFoodNet: A large-scale image dataset for Chinese food recognition," 2017, arXiv:1705.02743. [Online]. Available: http://arxiv.org/abs/1705.02743
[24] M. Puri, Z. Zhu, Q. Yu, A. Divakaran, and H. Sawhney, "Recognition and volume estimation of food intake using a mobile device," in Proc. Workshop Appl. Comput. Vis. (WACV), Dec. 2009, pp. 1–8.
[25] A. Myers et al., "Im2Calories: Towards an automated mobile vision food diary," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1233–1241.
[26] R. Xu, L. Herranz, S. Jiang, S. Wang, X. Song, and R. Jain, "Geolocalized modeling for dish recognition," IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1187–1199, Aug. 2015.
[27] Z. Wei, J. Chen, Z. Ming, C.-W. Ngo, T.-S. Chua, and F. Zhou, "DietLens-Eout: Large scale restaurant food photo recognition," in Proc. Int. Conf. Multimedia Retr., Jun. 2019, pp. 399–403.
[28] H. Matsunaga, K. Doman, T. Hirayama, I. Ide, D. Deguchi, and H. Murase, "Tastes and textures estimation of foods based on the analysis of its ingredients list and image," in Proc. New Trends Image Anal. Process. Workshop. Cham, Switzerland: Springer, 2015, pp. 326–333.
[29] Y. Matsuda and K. Yanai, "Multiple-food recognition considering co-occurrence employing manifold ranking," in Proc. Int. Conf. Pattern Recognit., 2012, pp. 2017–2020.
[30] T. Ege and K. Yanai, "Estimating food calories for multiple-dish food photos," in Proc. 4th IAPR Asian Conf. Pattern Recognit. (ACPR), Nov. 2017, pp. 646–651.
[31] E. Aguilar, B. Remeseiro, M. Bolaños, and P. Radeva, "Grab, pay, and eat: Semantic food detection for smart restaurants," IEEE Trans. Multimedia, vol. 20, no. 12, pp. 3266–3275, Dec. 2018.
[32] Y. Wang, J.-J. Chen, C.-W. Ngo, T.-S. Chua, W. Zuo, and Z. Ming, "Mixed dish recognition through multi-label learning," in Proc. 11th Workshop Multimedia Cooking Eating Activities (CEA), 2019, pp. 1–8.
[33] S. Horiguchi, S. Amano, M. Ogawa, and K. Aizawa, "Personalized classifier for food image recognition," IEEE Trans. Multimedia, vol. 20, no. 10, pp. 2836–2848, Oct. 2018.
[34] H. Hoashi, T. Joutou, and K. Yanai, "Image recognition of 85 food categories by feature fusion," in Proc. IEEE Int. Symp. Multimedia, Dec. 2010, pp. 296–301.
[35] Y. Kawano and K. Yanai, "Real-time mobile food recognition system," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2013, pp. 1–7.
[36] B. V. Resende e Silva, M. G. Rad, J. Cui, M. McCabe, and K. Pan, "A mobile-based diet monitoring system for obesity management," J. Health Med. Informat., vol. 9, no. 2, pp. 1–20, 2018.
[37] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[38] X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso, "Recipe recognition with large multimodal food dataset," in Proc. IEEE Int. Conf. Multimedia Expo Workshops (ICMEW), Jun. 2015, pp. 1–6.
[39] H. Hassannejad, G. Matrella, P. Ciampolini, I. De Munari, M. Mordonini, and S. Cagnoni, "Food image recognition using very deep convolutional networks," in Proc. 2nd Int. Workshop Multimedia Assist. Dietary Manage. (MADiMa), 2016, pp. 41–49.
[40] M. Merler, H. Wu, R. Uceda-Sosa, Q.-B. Nguyen, and J. R. Smith, "Snap, eat, repeat: A food recognition engine for dietary logging," in Proc. 2nd Int. Workshop Multimedia Assist. Dietary Manage. (MADiMa), 2016, pp. 31–40.
[41] G. Ciocca, P. Napoletano, and R. Schettini, "CNN-based features for retrieval and classification of food images," Comput. Vis. Image Understand., vols. 176–177, pp. 70–77, Nov. 2018.
[42] S. Jiang, W. Min, L. Liu, and Z. Luo, "Multi-scale multi-view deep feature aggregation for food recognition," IEEE Trans. Image Process., vol. 29, pp. 265–276, Jul. 2020.
[43] Y. Kawano and K. Yanai, "Food image recognition with deep convolutional features," in Proc. ACM Int. Joint Conf. Pervas. Ubiquitous Comput. Adjunct Publication (UbiComp Adjunct), 2014, pp. 589–593.
[44] N. Martinel, G. L. Foresti, and C. Micheloni, "Wide-slice residual networks for food recognition," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2018, pp. 567–576.
[45] H. Chen, J. Wang, Q. Qi, Y. Li, and H. Sun, "Bilinear CNN models for food recognition," in Proc. Int. Conf. Digit. Image Comput., Techn. Appl. (DICTA), Nov. 2017, pp. 1–6.
[46] L. Meng et al., "Learning using privileged information for food recognition," in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 1–9.
[47] J. Bootkrajang, J. Chawachat, and E. Trakulsanguan, "Deep-based openset classification technique and its application in novel food categories recognition," in Proc. Int. Conf. Comput. Recognit. Syst. Cham, Switzerland: Springer, 2019, pp. 235–245.
[48] J.-J. Chen, L. Pan, Z. Wei, X. Wang, C.-W. Ngo, and T.-S. Chua, "Zero-shot ingredient recognition by multi-relational graph convolutional network," in Proc. AAAI, 2020, pp. 10542–10550.
[49] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, "Food recognition using statistics of pairwise local features," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2249–2256.
[50] H. He, F. Kong, and J. Tan, "DietCam: Multiview food recognition using a multikernel SVM," IEEE J. Biomed. Health Informat., vol. 20, no. 3, pp. 848–855, May 2016.
[51] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," 2014, arXiv:1412.7062. [Online]. Available: http://arxiv.org/abs/1412.7062
[52] W. Min, S. Jiang, J. Sang, H. Wang, X. Liu, and L. Herranz, "Being a supercook: Joint food attributes and multimodal content modeling for recipe retrieval and exploration," IEEE Trans. Multimedia, vol. 19, no. 5, pp. 1100–1113, May 2017.
[53] F. Zhou and Y. Lin, "Fine-grained image classification by exploring bipartite-graph labels," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1124–1133.
[54] J. Chen, L. Pang, and C.-W. Ngo, "Cross-modal recipe retrieval: How to cook this dish?" in Proc. Int. Conf. Multimedia Modeling. Cham, Switzerland: Springer, 2017, pp. 588–600.
[55] J.-J. Chen, C.-W. Ngo, F.-L. Feng, and T.-S. Chua, "Deep understanding of cooking procedure for cross-modal recipe retrieval," in Proc. 26th ACM Int. Conf. Multimedia, Oct. 2018, pp. 1020–1028.
[56] J.-J. Chen, L. Pang, and C.-W. Ngo, "Cross-modal recipe retrieval with stacked attention model," Multimedia Tools Appl., vol. 77, no. 22, pp. 29457–29473, Nov. 2018.
[57] A. Salvador et al., "Learning cross-modal embeddings for cooking recipes and food images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3020–3028.
[58] B. Zhu, C.-W. Ngo, J. Chen, and Y. Hao, "R2GAN: Cross-modal recipe retrieval with generative adversarial network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 11477–11486.
[59] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, "PFID: Pittsburgh fast-food image dataset," in Proc. 16th IEEE Int. Conf. Image Process. (ICIP), Nov. 2009, pp. 289–292.
[60] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[61] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1988–1996.
[62] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[63] H. Fukui, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, "Attention branch network: Learning of attention mechanism for visual explanation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 10705–10714.
Jingjing Chen (Member, IEEE) received the Ph.D. degree in computer science from the City University of Hong Kong in 2018. She is currently a pre-tenured Associate Professor with the School of Computer Science, Fudan University. Before joining Fudan University, she was a Postdoctoral Research Fellow with the School of Computing, National University of Singapore. Her research interests include diet tracking and nutrition estimation based on multi-modal processing of food images, including food recognition and cross-modal recipe retrieval.

Bin Zhu (Graduate Student Member, IEEE) received the B.Sc. degree from Southeast University, Nanjing, China, in 2015, and the M.Sc. degree from Zhejiang University, Hangzhou, China, in 2018. He is currently pursuing the Ph.D. degree with the VIREO Group, Department of Computer Science, City University of Hong Kong. His research interests include diet tracking, generative models and multimedia analysis, including food recognition, cross-modal recipe retrieval, nutrition estimation, and image generation.

Tat-Seng Chua received the Ph.D. degree from the University of Leeds, U.K. He is the KITHCT Chair Professor with the School of Computing, National University of Singapore, where he was the Acting and Founding Dean of the School from 1998 to 2000. His main research interests include multimedia information retrieval and social media analytics. In particular, his research focuses on the extraction, retrieval, and question-answering (QA) of text and rich media arising from the Web and multiple social networks. He is the Co-Director of NExT, a joint center between NUS and Tsinghua University, to develop technologies for live social media search. He is the 2015 winner of the prestigious ACM SIGMM Award for Outstanding Technical Contributions to Multimedia Computing, Communications, and Applications. He is the Chair of the Steering Committee of the ACM International Conference on Multimedia Retrieval (ICMR) and Multimedia Modeling (MMM) conference series. He is also the General Co-Chair of ACM Multimedia 2005, ACM CIVR (now ACM ICMR) 2005, ACM SIGIR 2008, and ACM Web Science 2015. He serves on the editorial boards of four international journals. He is the Co-Founder of two technology startup companies in Singapore.