Utilization of Text Mining As A Big Data Analysis Tool For Food Science and Nutrition
DOI: 10.1111/1541-4337.12540
KEYWORDS
big data, information technology, semantic web, text mining
Compr Rev Food Sci Food Saf. 2020;1–20. wileyonlinelibrary.com/journal/crf3 © 2020 Institute of Food Technologists® 1
2 TEXT DATA ANALYTICS IN FOOD SCIENCE AND NUTRITION…
TABLE 1 Popular software and packages used for data mining analysis

Software/programming language    Library/package
Python                           Pandas (Python)
R Software                       Scikit-learn (Python)
SAS Enterprise Miner             NLTK (Python)
IBM SPSS Modeler                 NetworkX (Python)
Oracle Data Mining               NumPy (Python)
Orange Data Mining               tm (R)
RapidMiner                       SymPy (Python)
Weka Data Mining                 SciPy (Python)
Anaconda                         nlp (R)
GNU Octave                       wordcloud (R)
Gephi                            apriori (R)
STATISTICA                       topicmodels (R)
NVivo                            textir (R)
BigML                            network (R)

Note. Many tools are not listed here; this list only focuses on data mining tools commonly used in text data analysis.

facing today is how to feed a continually growing global population with limited resources of land, water, and energy in the near future (Godfray et al., 2010). Ideas of “precision agriculture” and “smart farming” are proposed for improving agricultural production through monitoring, modeling, and optimized operations. The availability of huge amounts of data from multiple sources and sensors, as well as the advancement of data storage and analysis technologies, is driving the rapid adoption of big data in the agriculture sector (Kamilaris, Fonts, & Prenafeta-Boldú, 2019). For example, in food production, big data analysis has been used to generate predictive insights about farm operations through models such as a predictive yield model built with Geographic Information System techniques (Al-Gaadi et al., 2016). Specifically, several vegetation indices (VIs) were extracted from massive satellite image data by software (data preprocessing stage); then the VIs were used to build a regression model to predict potato yield (data pattern discovery stage); the results were evaluated against actual yields (pattern evaluation stage). In the food processing industry, quality is critical in influencing consumer acceptance of the final product (Du & Sun, 2006). To assure food quality through the supply chain, diverse sensors (e.g., hyperspectral imaging, spectroscopy, and biometric receptors) coupled with multivariate analysis methods have been used for classification and prediction purposes to evaluate food quality and authenticity (Jiménez-Carvelo, González-Casado, Bagur-González, & Cuadros-Rodríguez, 2019; Ropodi, Panagou, & Nychas, 2016).

Food is an essential component of our lives, cultures, and well-being (Abbar, Mejova, & Weber, 2015). The ways in which food is produced, prepared, and eaten are becoming more interactive and creative as more digital and network techniques are adopted by the public. The emergence of information technologies has allowed enormous
quantities of data to be collected and analyzed. As food is one of the most prevalent subjects in our lives, immeasurable amounts of food-related information are generated globally on a daily basis. Recently, the semantic web (e.g., Web sites, social media, online databases) has produced an increasing amount of digital data related to food production, food processing, and consumption. Digital platforms such as social media provide new possibilities for people to share their food consumption habits with others. Nowadays, it is not uncommon to see people in restaurants taking photographs of attractive dishes and immediately posting them on social media to share with others (Masson, Bubendorff, & Fraïssé, 2018). Scientists in many areas have become interested in identifying human food choices and consumption behaviors (Mouritsen, Edwards-Stuart, Ahn, & Ahnert, 2017).

Digital data are presented in a variety of forms, such as texts, images, and videos. Among them, text data play an essential role in our life, and have been studied most extensively (Paul & Dredze, 2017). Large amounts of text data are produced and consumed for communication purposes. Also, text data are generally rich in semantic content, containing people’s knowledge, opinions, and preferences. They have, therefore, been used in a wide range of studies, in areas such as business intelligence, biomedical literature mining, public health surveillance, agricultural management, and so on (Drury & Roche, 2019; Zhai & Massung, 2016). As more and more text information becomes accessible online, it has the potential to provide new insights, assist in decision-making, and improve product and service quality (Marvin et al., 2017).

However, text data are usually extremely unstructured and difficult to analyze. New techniques have been developed to handle such data from diverse sources. These techniques are often referred to as text mining, a branch of data mining that focuses on discovering knowledge from text information (Zhai & Massung, 2016). From word-frequency analysis to advanced natural-language processing, text mining has created alternative methods of study and new insights in food-related topics. Previous work in data mining has mainly concentrated on instrument-generated data in the food industry (Chiang et al., 2017; Marvin et al., 2017; Ropodi et al., 2016; Waldner, 2017). To our knowledge, no effort has been made to summarize the research activities on text data or their utilization in the food science and nutrition domains.

In this compilation, an effort will be made to analyze how and to what extent text data play a role in food systems. First, the method used for data collection and the basic findings from the data will be described. Then the primary sources of text data in the digital era and the standard methods used in analyzing them will be discussed. The applications of text mining are divided into seven categories to highlight the usage of the data and their relations to food-related topics. Finally, some opportunities and challenges for future studies will be proposed.

2 TEXT DATA SOURCES

Text information, that is, information in the form of natural-language text (e.g., English text), can be found in all web pages, social media (e.g., tweets), news, scientific literature, public records, and many other types of documents (Zhai & Massung, 2016). Text information relating to food and nutrition research primarily belongs to three classifications: database data, Internet data, and social media data (Figure 2). Database data include information from government and scientific databases. Internet data consist of news articles and Web sites created by public media or professional organizations. Social media data are user-generated content that can be published by anyone in social networks (e.g., Twitter), forums (e.g., Reddit), or review Web sites (e.g., Yelp). In this section, these three kinds of information sources will be discussed regarding how they can be used to study food science and nutrition-related subjects.

2.1 Database data

Researchers have shown considerable interest in mining hidden knowledge from databases since the end of the last century (Chen, Han, & Yu, 1996; Han et al., 2011). Recently, a number of databases have been used to enhance our understanding of food-related topics, such as identifying food safety and fraud events, and the interplay between food and disease in complicated food systems (Jensen, Panagiotou, & Kouskoumvekaki, 2014; Karaa, Mannai, Dey, Ashour, & Olariu, 2016; Marvin et al., 2017; Yang, Swaminathan, Sharma, Ketkar, & Jason, 2011). Bouzembrak and Marvin (2016), for instance, used the Rapid Alert System for Food and Feed (RASFF) database to identify and monitor the hazards of food fraud and food safety. Another source of text information is scientific databases, from which food safety hazards can be identified (Lucas Luijckx, van de Brug, Leeman, van der Vossen, & Cnossen, 2016; Van de Brug, Luijckx, Cnossen, & Houben, 2014). Studies were conducted to analyze relationships among food, genes, and illness using information from scientific abstracts (Jensen et al., 2014; Karaa et al., 2016; Yang et al., 2011). In the food industry, companies own business-related databases including information on food production, processing, and consumer feedback. However, most of the industry-sourced databases are private, and thus are difficult to retrieve and analyze. Recently, partial information, such as the ingredient lists and nutrition tables of food products, has been shared and can be accessed through government databases, for example, that of the U.S. Department of Agriculture (USDA, 2019). Databases from government agencies, international
authorities, and scientific fields are usually high in data credibility.

2.2 Internet data

The Internet is a rich source of food-related data (Marvin et al., 2017). Due to the wide implementation of digital techniques, news articles and professional reports related to food and nutrition are now easily accessible online. For example, food safety-related information scattered on the Internet is used for building database systems that can rank information based on its “relativity” to a food safety topic (Maeda, Kurita, & Ikeda, 2005; Kate, Chaudhari, Prapanca, & Kalagnanam, 2014; Chen, Huang, Nong, & Kwan, 2016). These surveillance systems are based on multisourced Internet data, such as mainstream news media, government Web sites, specialty blogs, and so on, and allow risk managers to get up-to-date information on food safety events (Steinberger, Pouliquen, & Van der Goot, 2013). In addition, text data on food composition, such as labels and recipes, are accessible on the Internet. They can be used to develop smart systems that can make predictive and adaptive decisions/suggestions (e.g., recipe completion and ingredient selection) based on data analytic algorithms (Ahnert, 2013; De Clercq, Stock, De Baets, & Waegeman, 2016). Published by official agencies or experts, these Internet-based data are usually high in relevancy and credibility.

2.3 Social media data

In recent years, social media platforms like Twitter, Facebook, and Instagram have changed the way people interact and communicate with each other. These web-based microblogging platforms are generating real-time text data for the analysis of behavior, sentiments, and trends, and the surveillance of health matters (Ghosh & Guha, 2013). Consequently, there is a growing body of social media research focused on identifying the linguistic characteristics of the contents of food and human interactions. Scientists are using the text information to detect consumers’ dietary patterns, adverse reactions, perceptions, preferences, and discussions on specific foods. Traditionally, dietary information has been collected through diaries and regular surveys, but these are restricted in scope. Social media, however, allow their users to update the world on the details of their daily lives, including their eating habits (Abbar et al., 2015). Web mining and social media analysis approaches are widely used for analyzing social media data. However, as social media data are user-generated, the analysis of their text information still presents many challenges.

3 TEXT DATA ANALYSIS METHODS

The fact that we are producing and consuming a lot of text data in communications indicates the importance of text data to our lives. In the days when we only needed to deal with small volumes of text data, manual processing was crucial and viable for improving productivity. With the fast increase in digital text information, manual processing, particularly for time-critical applications, is no longer feasible. Internet information comes in highly complex multivariate datasets and is
difficult to inspect and present. A number of text analysis techniques, collectively referred to as text mining, have therefore appeared to enable computers to transform large quantities of text information into useful insights. Text mining is a division of data mining that focuses on discovering knowledge from text information (Zhai & Massung, 2016). From data cleaning (basic text preprocessing methods), data reduction (word-level analysis), data association analysis (word association analysis) to advanced data mining (e.g., text classification, text clustering), the fundamental techniques used in text data analysis will be discussed in the following sections. In addition, nontext metadata including demographics (times, places, incomes, etc.) can also be used to provide contextual information that is helpful in text analysis. To this end, standard methods used for the integration of text and nontext data will also be discussed. A schematic framework of these processes is shown in Figure 3.

3.1 Basic text preprocessing methods

For many text mining tasks, data preprocessing is required. Normalization, noise removal, and tokenization are three general steps in preprocessing. Normalization refers to a series of tasks such as converting all letters to lower or upper case, converting numbers into words or removing them, removing punctuation, and so on. In particular, stop-word removal, stemming, and lemmatization are critical processes in text normalization. Words of high frequency, such as I, the, of, my, it, to, and from, which do not contain topical information, are called stop words (Zhai & Massung, 2016). They are usually removed in text analysis to improve the performance of the algorithm by reducing useless words in vector spaces. Words with different forms or derivationally related words with similar meanings are often reduced to a common base in text preprocessing (Manning, Raghavan, & Schütze, 2010). Stemming and lemmatization are two popular ways of removing such inflections. Stemming truncates the ends of words based on language-specific rules that often involve removing suffixes. In contrast, lemmatization utilizes vocabulary and morphological analyses to return a word’s base or dictionary form, known as its lemma (Manning et al., 2010). Both stemming and lemmatization, like stop-word removal, attempt to maintain the essential meaning behind the original text. Noise removal is another preprocessing step with more task-specific purposes, including the removal of HTML, XML, metadata, and headers from text files, the extraction of information from other formats, and so on. Tokenization splits text into terms (tokens) and generates an index that transforms each text file into a sparse vector of term (or token) counts (Manning et al., 2010; Zhai & Massung, 2016). A successful tokenization system depends on a predefined dictionary of critical terms, which in turn depends on the particular issue (Zhai & Massung, 2016). A token may also be a sequence of words, known as an n-gram (a series of n words), that captures the sequential relationship in the text information (Manning et al., 2010). A 2-gram (or bigram) is a two-word sequence of words like “I like,” “like apple,” or “apple juice,” and a 3-gram (or trigram) is a three-word sequence of words like “I like apple,” or “like apple juice” (Jurafsky &
Martin, 2008). Many promising applications have been discovered using bigrams and trigrams. Using longer grams offers more decision-making information but can also trigger data sparsity (Zhai & Massung, 2016). Although some task-specific processes may require manual coding to set the guidelines, most of the above functions can be achieved using current packages like “nlp” and “tm” in R programming (Hornik & Hornik, 2018), “nltk” in Python programming (Loper & Bird, 2002), and many other instruments like Weka (Hall et al., 2009), Stanford NLP (Manning et al., 2014), and MeTA (Massung, Geigle, & Zhai, 2016).

3.2 Word-level analysis

Word-level analysis or frequency analysis is the most popular technique for handling text information (Zhai & Massung, 2016). This type of analysis is based on the frequency of occurrence of the tokens. In this approach, each document (sentence) has a concept to convey, and each different concept impacts the probability that tokens are used in the text (Hofmann, 2017). Word counting is the simplest method for analyzing frequency. However, frequency is often not based on simple counting, as different words need not necessarily provide the same amount of information. Weighted counting schemes have therefore been derived, such as Term Frequency (TF) and Inverse Document Frequency (IDF). TF measures how frequently a term occurs in a document, and IDF measures how rare, and therefore how informative, a term is across all documents (Zhai & Massung, 2016). A word’s significance increases proportionally to the number of times it appears in a document, but this result is offset by its frequency in the corpus. The main weakness of word-level analysis is that the appearance of each token is assumed to depend only on the concept, and thus the word-sequence information is discarded. Nevertheless, owing to its simplicity, this strategy is very commonly used and has proven successful in tasks such as detecting influenza epidemics (Ginsberg et al., 2009). Also, critical word sequences, as in cases like negation, can easily be accommodated by incorporating the n-gram technique (Zhai & Massung, 2016).

items in specific systems, and what the connections do. Networks of food-related words can be constructed from the association of word pairs. The last decade has seen applications of the network analysis of digital text data for improving our understanding of food science and nutrition. Examples of applying network analysis in the food domain can be found in Section 4. For example, in Subsection 4.4, a flavor network was constructed based on online recipes and analyzed for providing ideas of creative cooking (Ahn, Ahnert, Bagrow, & Barabási, 2011; Simas, Ficek, Diaz-Guilera, Obrador, & Rodriguez, 2017). In Subsection 4.5, food disease and food nutrient networks were built based on literature text mining (Jensen et al., 2014; Karaa et al., 2016; Kim, Sung, Foo, Jin, & Kim, 2015; Yang et al., 2011).

3.4 Advanced text analysis

Word-based analysis and word association mining can figure out the basic meaningful units in a language and how they are related to each other. However, advanced text analysis is required to determine the meaning of a sentence or a larger unit of a document. Advanced text analysis covers text classification, topic modeling, sentiment analysis, information retrieval (IR), and so on, including a variety of techniques under each of the topics (Miller, 1995; Zhai & Massung, 2016). In advanced text analysis tasks, machine learning (ML) techniques have been commonly used. ML is a paradigm in which a computer program learns how to make predictions or decisions from data based on two major procedures: training and testing (Ropodi et al., 2016). There are two kinds of ML, depending on whether the information is labeled or not: supervised learning and unsupervised learning. For example, text classification is a supervised ML task, whereas topic modeling is an unsupervised ML task. Nevertheless, the interpretation of the results requires input from food scientists who are equipped with knowledge in statistics and data science. The commonly used text mining techniques in the food industry as well as ML algorithms used in the tasks are described below.
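As a rough illustration of the preprocessing steps in Section 3.1 and the TF and IDF weighting in Section 3.2, the following Python sketch tokenizes a few sentences, removes stop words, builds bigrams, and computes TF-IDF weights. It is a minimal sketch that uses only the standard library rather than the packages cited above; the stop-word list, the function names, the sample sentences, and the particular TF-IDF variant (relative term frequency multiplied by log(N/df)) are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"i", "the", "of", "my", "it", "to", "from", "a", "and", "is"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every sequence of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF is the term count divided by the document length; IDF is
    log(N / document frequency). Other variants are in common use.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample sentences, invented for this example.
docs = [
    "I like apple juice",
    "Apple juice is sweet",
    "The soup is salty",
]
bigrams = ngrams(tokenize(docs[0]), 2)
weights = tf_idf(docs)
```

With these toy sentences, “like” receives a higher weight than “apple” in the first document, because “apple” also occurs in the second document and its weight is therefore offset by its frequency in the corpus.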
media (Effland et al., 2018; Harris et al., 2014; Harrison et al., 2014; Sadilek et al., 2016).

• Text clustering: Text clustering is used for grouping objects (e.g., documents, paragraphs, sentences or terms) based on similarity between the objects (Aggarwal & Zhai, 2012). There are a variety of algorithms used for text clustering, depending on how the similarity is calculated. Common methods include distance-based clustering methods, such as hierarchical clustering algorithms, partitioning algorithms, and the hybrid method using both hierarchical and partitional clustering algorithms (Aggarwal & Zhai, 2012). In the food industry, researchers have used text clustering to group food products based on consumers’ reviews of specific attributes (Kim, Ha, Choi, & Moon, 2018; Lee, Ghimire, & Rho, 2013).

• Topic modeling: Topic modeling, as a text clustering division, detects potential topics in a document and often deals with information that does not have predefined labels (Zhai & Massung, 2016). A topic contains a cluster of words that frequently occur together. The main idea of topic modeling is to discover patterns of word use and to connect documents that share similar patterns. Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Probabilistic Latent Semantic Analysis (PLSA) are common ML algorithms used in topic modeling (Zhai & Massung, 2016). Topic modeling has been applied to the discovery of hidden semantic structures in a text body. In the food industry, topic modeling has been used to identify relevant public health topics such as obesity on Twitter (Ghosh & Guha, 2013).

• Sentiment analysis: Sentiment analysis (or opinion mining) is the computational study of people’s affective states (e.g., opinions, emotions, attitudes) toward entities, issues, events, topics, and their attributes (Aggarwal & Zhai, 2012). For example, businesses always want to find consumer opinions about their products and services. Sentiment classification can usually be formulated as a supervised learning problem with three classes: positive, negative, and neutral. ML algorithms such as naive Bayes (NB) and support vector machines (SVM) are commonly used in sentiment analysis tasks. As opinion words and phrases are indicative for sentiment classification, unsupervised learning can also be used in sentiment analysis (Aggarwal & Zhai, 2012). Apart from classification of positive or negative sentiments, research has also been done on predicting the rating scores (e.g., 1 to 5 stars). In this case, the problem is formulated as a regression problem. In the food industry, sentiment analysis has been used to assess the performance of food products and services from customer reviews (Gan, Ferns, Yu, & Jin, 2017; Hayashi, Hsieh, & Setiono, 2009).

• Information retrieval: IR is a task to find material (usually a text document) that satisfies an information need (usually a keyword query) within large collections (usually a database; Manning et al., 2010). The core of an IR model is to assess the relevancy of a text document to a keyword query. Relevancy is determined by a similarity measure such as the cosine similarity, which assumes that the queries and documents are represented as vectors of words (Drury & Roche, 2019). Effective IR models generally capture three heuristics, that is, TF weighting, IDF weighting, and document length normalization (Aggarwal & Zhai, 2012). Scientists have constructed a number of databases including text documents from various sources and built IR models for returning information most relevant to specific food safety topics (Maeda et al., 2005; Kate et al., 2014; Chen et al., 2016). Differences among algorithms in defining relevance have created a number of new ranking methods, such as ranking information based on published date or a score function (Chen et al., 2016), ranking with a formula adapted from Google ranking (Maeda et al., 2005), or ranking with a text classification method (Kate et al., 2014).

3.5 Joint analysis of text and nontext data

In addition to text data analysis, it is beneficial to leverage nontext data in knowledge discovery from text data. Nontext data can be used for providing background/context information (e.g., time, location, and people) for predictive analysis (Zhai & Massung, 2016). For example, the time and location information are useful “metadata” values of a text document. Given the context of time and location, it is possible to generate temporal or spatial trends of a particular topic discovered from text data (Zhai & Massung, 2016). In food-related studies, temporal analysis integrates frequency data with time and has been used in public health monitoring (Ginsberg et al., 2009). Spatial analysis shows the distribution of intensity across distinct places and has been used to characterize nutritional patterns (Abbar et al., 2015). On the other hand, text data can also help interpret patterns discovered from nontext data (Zhai & Massung, 2016).

4 APPLICATION OF TEXT DATA IN FOOD-RELATED STUDIES

With food being one of the most common topics in our life, digital text analysis has been applied to a variety of food-related topics. In the following, the applications of digital text analysis related to food and nutrition will be discussed in seven categories of study: food safety and fraud surveillance, dietary pattern characterization, consumer opinion mining, new product development (NPD), food knowledge discovery, food supply chain management, and online food services. The publications discussed in this section are listed in Table S1 and summarized at the end of this section.
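The cosine-similarity ranking described for information retrieval in Section 3.4 can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited system: it uses raw term counts instead of TF-IDF weights, and the function names, the query, and the sample documents are invented for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs sorted from most to least similar."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical documents and query, invented for this example.
documents = [
    "Salmonella outbreak linked to cantaloupe",
    "New recipe ideas for apple juice",
    "Recall issued after salmonella found in peanut butter",
]
results = rank("salmonella outbreak", documents)
```

Dividing by the vector norms acts as a simple form of document length normalization; a fuller model would also apply TF and IDF weighting before computing the similarity.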
4.1 Food safety and food fraud surveillance

Food safety plays a critical role in ensuring food security and sustainable food systems (Godfray et al., 2010; King et al., 2017). It is also a significant public health component. In the United States alone, an estimated 800 foodborne outbreaks are reported annually, accounting for about 15,000 illnesses, 800 hospitalizations, and 20 deaths according to the Centers for Disease Control and Prevention (CDC, 2018). Despite the development of several advanced surveillance and monitoring technologies, foodborne disease outbreaks remain a major threat to public health and the food industry. Food fraud is another problem that raises government concerns about food consumption under the scope of food safety (Spink & Moyer, 2011). Food supply chains and manufacturing methods are becoming increasingly dynamic and complicated. Nowadays, it is not always appropriate to manage early detection and rapid response of food safety based on the results of science-based risk assessment (Maeda et al., 2005). Recently, the food industry has looked into data science and “big data” for insight into monitoring and responding in (near) real time to contamination threats as they occur (Greis & Nogueira, 2017). In particular, a variety of text information sources have been investigated to support risk detection and communication for food safety and fraud surveillance, including online databases, the Internet, and social media (Marvin et al., 2017; Nsoesie, Kluberg, & Brownstein, 2014; Ordun et al., 2013; Tiozzo et al., 2019; Waldner, 2017).

For food risk and fraud surveillance, databases such as government databases and science databases, which are rich in text information, were leveraged. Thakur, Olafsson, Lee, and Hurburgh (2010) implemented a decision tree (DT) text classification algorithm to detect hidden trends in disease outbreaks and identify relationships between food kinds and outbreak places. They used text information from the CDC Outbreak Surveillance Data to identify vehicles and locations that were associated with specific etiologies. The resulting knowledge can help policymakers design successful interventions in food handling, preparation, and consumption practices. Moore, Spink, and Lipp (2012) explored the scope, scale, and threat of food fraud problems from what was publicly reported in academic and media databases using keyword search. RASFF, created by the European Commission, has enabled rapid exchange of food fraud-related information, so that governments can react quickly to protect consumers (RASFF, 2017). This established RASFF system has assured food integrity through monitoring of both intentional and unintentional adulteration and fraud events (Esteki, Regueiro, & Simal-Gándara, 2019). Researchers have shown strong interest in identifying patterns of food fraud and predicting future fraud events by using data from the RASFF system. It has been used for predicting food fraud types, optimizing sample size for monitoring food safety risks, and identifying driving factors of food fraud events related to fruit and vegetables (Bouzembrak & Marvin, 2016; Bouzembrak & Marvin, 2019; Bouzembrak, Camenzuli, Janssen, & Van der Fels-Klerx, 2018). In addition, scientific abstracts from MEDLINE/PubMed and FSTA databases were useful for designing an Emerging Risk Identification Support System (ERIS), which can identify unexpected hazards in the food chain (Lucas Luijckx et al., 2016; Van de Brug et al., 2014).

The Internet is another source of data for disease surveillance (Waldner, 2017). A variety of information systems around the globe have been developed to promote early warning of food safety and food fraud hazards through Internet data retrieval and text mining. For instance, a Japanese group built a database of documents on food safety hazards through keyword searching of Google web pages (Maeda et al., 2005). By visualizing the interactions of the documents, they also designed a Risk Path Finder system for people to recognize the emergence of a risk event that is hidden in a pile of documents. Singapore’s National Environment Agency, in collaboration with IBM Research, created a Food Safety Information System (FoodSIS) to proactively monitor emerging food safety problems in Singapore using relevant food safety content from the Internet (Kate et al., 2014). A database of food safety information was developed in 2016 based on food safety news from media and government Web sites to help efficiently assess food safety issues in China (Chen et al., 2016). The role of news media in food safety monitoring in China was further highlighted by focusing on the Chinese dairy sector (Zhu, Huang, & Manning, 2019). In Europe, Bouzembrak et al. (2018) developed a food fraud reporting system, MeDISys-FF, based on MeDISys, an infrastructure provided by the European Media Monitor that collects reports published worldwide in the media.

More recently, scientists have become interested in employing new sources of digital text information to detect food safety and food fraud incidents. Open source media outlets such as Twitter and Rich Site Summary (RSS) feeds were used to characterize the 2012 Salmonella event related to cantaloupes for predicting the number of sick, dead, and hospitalized (Ordun et al., 2013). Twitter and Yelp were employed to identify unreported foodborne illnesses in several local public health departments of the United States, and tested in cities such as Chicago, New York, and Las Vegas (Harris et al., 2014; Harrison et al., 2014; Effland et al., 2018; Sadilek et al., 2016). Amazon reviews were analyzed with text classification methods for detecting issues of unsafe food products, with results validated by FDA food recalls (Maharana et al., 2019). Increasingly, the potential of employing social media data has gained the attention of governments for supporting efforts in public health surveillance. Besides detecting unreported foodborne diseases, social media has become a new channel for communicating food safety risks (Kuttschreuter et al., 2014; Meyer, Hamer, Terlau, Raithel, & Pongratz, 2015; Rutsaert
et al., 2013). Social media’s popularity has made it a useful channel for customers to seek information on food safety, and for policymakers to communicate food safety crises to customers. Effective food safety management systems are critical to deal with the growing complexity and globalization of food supply chains. Digital data and solutions may provide new possibilities for predicting and controlling problems of food safety and food fraud to minimize economic losses (Fritsche, 2018).

4.2 Dietary pattern characterization

In recent years, chronic diet-related illnesses triggered by unhealthy eating or imbalanced diet patterns have received growing attention worldwide. Studies have been conducted to correlate the nutritional profiles of individuals with their health in order to provide evidence and recommendations for efficient government health interventions. Previous studies on dietary patterns gathered data only from nutritional diaries or regular surveys, which are generally restricted in range and reach (Aiello, Schifanella, Quercia, & Del Prete, 2019). New information sources, such as social media, now allow users to share their daily life with others, including diet and eating habits. Researchers are interested in employing these new sources of data for studying patterns of food consumption (Fried, Surdeanu, Kobourov, Hingle, & Bell, 2014). Among them, Twitter, Instagram, and Internet logs such as Google history are common sources of text information for characterizing people’s nutritional patterns.

Real-time or archived text information from Twitter has been used to study trends in health-related behaviors, consciousness, and monitoring. Ghosh and Guha (2013) identified relevant public health topics such as obesity on Twitter using topic modeling techniques and discovered a correlation between obesity-related tweets and the prevalence of obesity among U.S. adults (with BMI ≥ 30; CDC, 2012). Using extra attribute information from tweets, such as when and where they are published, could further extend the use of Twitter information in health studies. For instance, a technique for deriving dietary information from Twitter posts was suggested by Abbar et al. (2015). They reported a range of correlations between dietary information derived from Twitter and the incidence of obesity and diabetes in various U.S. counties. Based on tweets linked to distinct dining circumstances, Vidal, Ares, Machín, and Jaeger (2015) evaluated the shift in people’s eating habits during different times of day. Huang, Huang, and Nguyen (2019) used geotagged Twitter information to study the impacts of neighborhood features on dietary habits and the respective health results. García-León (2019) showed how Twitter hashtags reflected the food well-being of consumers in local food consumption.

Instagram is another platform used to study health-related subjects, including nutritional habits, owing to its growing popularity. Phan, Muralidhar, and Gatica-Perez (2019) used alcohol-related posts from Instagram to study the temporal, spatial, and contextual patterns of alcohol consumption on weekday nights. Sharma and De Choudhury (2015) focused on using Instagram to study ingestion practices and nutrition trends and identified how the wider Instagram community responds to low-calorie and high-calorie food. Specifically, many studies have looked into the issue of food deserts, defined as regions with significant proportions of families with insufficient access to good food, using information from Instagram (Abbar et al., 2015; De Choudhury, Sharma, & Kiciman, 2016). These studies illustrated food decisions and dietary features in U.S. food deserts and highlighted the critical role of social media in defining the connection between eating patterns and their effect on people’s health.

People’s food-related search queries on the Internet could also reflect their dietary habits. Zhao et al. (2019) conducted a correlation analysis between Chinese dietary preferences from Baidu search data and diabetes risk from government statistics and found a geographical distribution pattern. In addition, online food recipes have been recognized as a good source of Internet data for studying recipe production, consumption, and innovation (Rong, Liu, Huo, & Sun, 2019). West, White, and Horvitz (2013) and Wagner, Singer, and Strohmaier (2014) used text information from online recipes as a proxy to derive consumption and dietary patterns of individuals, indicating unbalanced food distribution and consumption, which might be useful for avoiding food-related health issues such as malnutrition and overnutrition. By tracking users’ behavior of uploading recipes online, Trattner, Kusmierczyk, and Nørvåg (2019) revealed that a user’s social connections, in the form of friendship with other users, were predictive of what type of recipe the user would upload in the future. Asano and Biermann (2019) investigated dietary transition by analyzing a dataset of millions of recipes from the most popular German recipe Web site and found that a great number of people are shifting to plant-based diets, with a 3.5% increase in vegan recipes submitted. This food transition pattern was confirmed by interviewing users showing extreme dietary change.

In addition to these web-based data, a database with billions of food purchase records has also been analyzed for identifying consumers’ dietary patterns. Aiello et al. (2019) conducted a correlation analysis between food consumption indicated by digital records of grocery purchases and the prevalence of obesity-related syndromes in the London area. They found that obesity rate was positively associated with calorie intake and negatively associated with nutrient diversity.
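The correlation analyses described in this subsection, such as relating purchase-derived diet indicators to obesity prevalence across areas, can be illustrated with a minimal sketch. It is not the method of any cited study: the variable names and all values below are hypothetical, invented only to show how a Pearson correlation coefficient is computed over per-area aggregates.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-area aggregates (one value per area), for illustration only.
calorie_intake = [150.0, 170.0, 190.0, 210.0]   # assumed: mean calories per purchased item
nutrient_diversity = [2.4, 2.1, 1.9, 1.6]       # assumed: a diversity score per area
obesity_prevalence = [0.18, 0.20, 0.23, 0.26]   # assumed: share of adults with obesity

r_calories = pearson_r(calorie_intake, obesity_prevalence)
r_diversity = pearson_r(nutrient_diversity, obesity_prevalence)
print(f"calories vs. obesity:  r = {r_calories:+.2f}")
print(f"diversity vs. obesity: r = {r_diversity:+.2f}")
```

With real data, the sign and size of each coefficient would only indicate association at the area level; they would not by themselves establish that a dietary feature causes a health outcome.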
4.3 Consumer opinion mining

The role of consumers in the process of NPD is becoming increasingly important in the food industry (Moskowitz & Saguy, 2013). Therefore, another line of research has explored consumers’ opinions, primarily in text format, toward food and dining services for business intelligence. The traditional way to collect consumers’ responses is through questionnaires or surveys, which are limited in scope and reach. Thus, consumer-generated Internet data are used as an alternative for extracting useful information on food quality and preferences. Recently, social media has proven to be a powerful tool for acquiring business insights and enabling intelligent product-development decisions. It has been employed to assist marketing in many food companies for years, and its most common application is brand marketing. By comparing three pizza chains (Pizza Hut, Domino’s Pizza, and Papa John’s Pizza), He, Zha, and Li (2013) provided a competitive analysis of social media in the U.S. pizza sector. Alamsyah and Peranginangin (2015) analyzed brand communities of two giant companies (McDonald’s and Burger King) in Indonesia to understand dynamic market behaviors of the regional fast food industry. In addition to brand management, consumer reactions to product quality are essential for the food industry. Using content analysis, Ariyasriwatana and Quiroga (2016) classified “deliciousness” phrases from restaurant reviews on Yelp, a popular social network that helps individuals make food decisions. By analyzing text information from online forums, Blackburn, Yilmaz, and Boyd (2018) and Masson et al. (2018) researched how individuals communicate about food on the Internet. More advanced text analysis, such as sentiment analysis, is often needed for a deeper understanding of customer preferences regarding which food to eat or which restaurant to dine at. For instance, Hayashi et al. (2009) used text mining and sentiment analysis to predict consumer preferences for fast food brands. Mostafa (2018) investigated people’s sentiments toward halal through the study of 100,000 tweets from Twitter.com. Sentiment analysis of restaurant reviews identified food, service, and context as the top three characteristics influencing restaurant customer preferences, expressed as star scores (Gan et al., 2017).

Similar to open-ended questions in sensory research, Internet text information is an alternative source for evaluating the sensory quality of food (Piqueras-Fiszman, 2015). The advantage of digital text data is that they are often free, spontaneous, and easy to get. In traditional sensory research, adjectives are commonly used to describe the sensory characteristics of food products. Lee et al. (2013) grouped tastes of 51 types of Korean cuisine and refreshments using 87 adjectives often used to describe Korean food tastes. They later extended the research by developing an automated text analysis method using ML for evaluation of food taste, smell, and characteristics from consumers’ online reviews (Kim et al., 2018). McAuley and Leskovec (2013) modeled implicit taste preferences of customers and product characteristics from online food reviews with ML algorithms. The result indicated the possibility of constructing a sensory-based food recommendation system that targets the correct products to the right individuals. Many recommender systems have been created to provide food recommendations tailored to the individual taste of the user. Ge, Elahi, Fernández-Tobías, Ricci, and Massimo (2015), for instance, built a tablet-based application that includes user-selected tags for rated food items to predict users’ liking of new products. With an ever-increasing amount of food-related data being digitally recorded, it is expected that our perception, consumption, and choices of food will be significantly influenced (Mouritsen et al., 2017).

4.4 New product development

Digital food composition and preparation data are used to assist research & development (R&D) procedures in the food sector for creating better formulations and speeding up the product development process (Chiang et al., 2017). One of the most significant examples of text mining applications in the culinary sector is the discovery of flavor pairing patterns from millions of recipes. Ahn et al. (2011) carried out a network study on regional recipes and built a “flavor network.” They discovered from the network that Western cuisines prefer to combine ingredients that share flavor compounds, while East Asian cuisines, particularly India, tend to avoid doing so. Subsequently, a food bridging hypothesis was suggested to extend the study. The hypothesis holds that even when two components do not share potent flavor compounds, they may become affine through a chain of pair affinities (Simas et al., 2017). In addition, the emergence and growth of computational gastronomy can contribute novel combinations of ingredients and new product formulations (Ahnert, 2013; De Clercq et al., 2016). Accordingly, online recipe mining has inspired the development of recommender systems with algorithms designed to give users healthier suggestions (Trattner & Elsweiler, 2019). For example, Pinel, Varshney, and Bhattacharjya (2015) created a smart system to produce new recipes computationally, which later became known as IBM Chef Watson (Varshney et al., 2019). Chen et al. (2019) proposed a recommender system, NutRec, to provide suggestions on healthy recipes based on quantities of ingredients and their interactions. Nyati, Rawat, Gupta, Aggrawal, and Arora (2019) presented a system for recipe recommendation based on the formation of an ingredient–ingredient network and a recipe–ingredient network. Without the implementation of computational approaches on these digital text data, regional patterns of culinary formulations could not be easily discovered, nor would it be possible to have intelligent systems advising us on how to prepare tasty meals. However, researchers have highlighted the importance
of the skill of the chef to ensure the palatability of the final dishes whose recipes were creatively generated by computers (Mouritsen et al., 2017; Spence, Wang, & Youssef, 2017).

R&D has always played a vital role in the food industry. The significance of integrating customer reaction into the NPD phase has been emphasized (Moskowitz & Saguy, 2013). Instead of concentrating solely on sensory evaluations, it is suggested that future food businesses incorporate more feedback from the outsider, the consumer, into their NPD process. A growing amount of research now leverages online digital data for marketing and consumer management. Christensen, Nørskov, Frederiksen, and Scholderer (2017) demonstrated the possibility of identifying new product ideas in online communities through text mining and ML. However, studies utilizing consumer data in the NPD process are rarely reported in the food domain. Carr et al. (2015) reported a study employing social media data for identifying consumers’ discussions around aroma, “coffee freshness.” They found that social media not only generated product and category-related insights, but that those insights were reliable and in line with those derived from traditional research methods. The authors proposed that social media data should not be used only for marketing purposes but could also be included in the NPD process. Bashir, Papamichail, and Malik (2017), however, pointed out that adopting external ideas from social media for product development is not common in multinational companies. The value of using social media data for future innovation has been discussed recently (Bhimani, Mention, & Barlatier, 2019; Muninger, Hammedi, & Mahr, 2019). Muninger et al. (2019) highlighted challenges in applying social media in innovation, such as the complexity of social media use and the involvement of people from multiple groups to acquire and diffuse knowledge from social media.

4.5 Food knowledge discovery

Understanding the chemical and biological effects of food composition on human health is critical to the nexus of food, nutrition, and health. Multiple sources of data are used for investigating food composition, including information from nutrition literature, product labels, food composition databases, and food regulations (Greenfield & Southgate, 2003). Recently, new computational methods such as network analysis and text mining have been implemented to manage knowledge of the chemical components of foods that are important to human health. Product labels include valuable information about food compositions and health-related claims, which are primarily presented in text format. do Nascimento, Fiates, dos Anjos, and Teixeira (2013) compared gluten-free and gluten-containing foods by mining the ingredient lists of commercial products. They found that the diversity of ingredients in gluten-free products is significantly lower. Roman, Belorio, and Gomez (2019) also conducted text analysis of the ingredient lists and nutrition facts of gluten-free products, revealing that commercial breads tend to use a combination of several starchy sources instead of a single ingredient for quality optimization. Fardet, Lakhssassi, and Briffaz (2018) categorized commercial food products based on the ingredient list, number of additives, texture, water activity, and shelf-life data.

Scientific literature is a significant source of information linked to food and nutrition sciences. Researchers have worked extensively to discover bioactive compounds in food and their relationships to a variety of diseases. Knowledge of how food components affect human health, however, is limited by the complexity of the network of relationships involved. Recently, relationships are being built among food, genes, and diet-induced illnesses using advanced computational methods such as literature mining and network analysis (Jensen et al., 2014; Karaa et al., 2016; Yang et al., 2011). Also, Kim et al. (2015) constructed a food–food network in which a node represents a food and an edge between two nodes represents the similarity of nutritional content between the two foods. Jensen et al. (2014) created the NutriChem database, which connects plant-based foods with their small-molecule components and human disease phenotypes. Anbarkhan, Stanier, and Sharp (2018) used text mining to obtain connections between obesity and herbal plants in the Arabic region. Rakhi, Tuwani, Mukherjee, and Bagler (2018) recognized the health benefits of culinary herbs and spices through literature mining. Chaix, Deléger, Bossy, and Nédellec (2019) applied text mining to a large collection of PubMed scientific paper abstracts to identify ecological diversity and the origin of microbial presence in food.

Ontology, defined as formal naming of a set of concepts within a domain, has been widely used for knowledge discovery in the age of the Semantic Web (Eftimov, Ispirova, Potočnik, Ogrinc, & Seljak, 2019). As we move quickly toward the Internet of Things (IoT) paradigm, advancing food ontology would provide effective communication about food, ingredients, and health outcomes from a semantic view (Boulos, Yassine, Shirmohammadi, Namahoot, & Brückner, 2015). A few examples of food ontologies designed for various purposes are FoodWiki, AGROVOC, Open Food Facts, Food Product Ontology, and FoodOn (Boulos et al., 2015; Dooley et al., 2018). For instance, FoodWiki is a system designed for customers to quickly examine the free text written on packaged food products for inferring their side effects (Çelik, 2015; Ertuğrul, 2016). With the rise of the clean-label movement, individuals are now more cautious about packaged foods containing additives with potential health hazards. Applications like FoodWiki could help customers make wiser and healthier purchases. FoodOn, another example of a food ontology, was created to increase global food traceability, quality control, and risk management (Dooley et al., 2018). The ISO-FOOD ontology was created for sharing and organizing stable isotope
data across food science (Eftimov et al., 2019). The ONE ontology was developed for enhancing reporting and communication of nutritional epidemiologic studies and data (Yang et al., 2019). An ontology-based recipe repository that describes cooking terms and activities was developed for facilitating the sharing and searching of recipe data (Öztürk & Özacar, 2019).

4.6 Food supply chain management

Advances in the IoT and big data technologies have contributed to transforming today’s manufacturing paradigm into smart manufacturing (Tao, Qi, Liu, & Kusiak, 2018). With its ability to collect data from multiple stages in food supply chains, the IoT has made it possible to create more transparent, sustainable, and efficient food manufacturing (Astill et al., 2019). Blockchain, a technology that enables the sharing of encrypted records or digital events among collaborating parties, has recently been applied to agriculture and food retail supply chains for increasing food traceability (Astill et al., 2019; Kamilaris, Fonts, & Prenafeta-Boldú, 2019). The combined usage of the IoT and Blockchain technology would have the potential to improve supply chain management (Rejeb, Keogh, & Treiblmaier, 2019). For instance, a significant issue in the food supply chain is food waste (Pearson & Perera, 2018). It was estimated that one third of the food for human consumption is either lost or wasted throughout the food supply chain, from farming to processing, transportation, retailing, and consumption (Ishangulyyev, Kim, & Lee, 2019). The IoT and Blockchain technologies have enabled more efficient information sharing and communication through the supply chain, which would facilitate the detection and prevention of food safety issues and the reduction of food waste (Minnens, Lucas Luijckx, & Verbeke, 2019). However, the majority of data leveraged in supply chain optimization are from quality sensing technologies such as RFID, bar code identifiers, quick response code scanners, infrared sensors, video cameras, and so on (Kodan, Parmar, & Pathania, 2019). The utilization of text data analytics for assisting food supply chain management is rarely reported. The application of text information from the semantic web in assisting agricultural decision-making was emphasized in a recent survey (Drury & Roche, 2019). Also, the rapid development of business intelligence from consumer opinions might be useful for optimizing food production. For example, a case study of text mining on Twitter posts discovered the main concerns related to beef products, which could be used for developing a consumer-centric beef supply chain that evolves with consumer needs (Singh, Shukla, & Mishra, 2018). In addition, digital technologies hold the potential to increase efficiencies within food retail supply chains and to transform food systems to become more sustainable (El Bilali & Allahyari, 2018).

4.7 Online food services

With the rise of digital technology, online food delivery services have been booming worldwide in recent years (Correa et al., 2019; Xu & Huang, 2019). In particular, the development of food ordering/delivery mobile applications (e.g., UberEats, GrubHub, Dianping, Meituan, Ele.me) has changed the way in which customers interact with food services (Kapoor & Vij, 2018; Xu & Huang, 2019). The data generated from food delivery apps can be used for optimizing delivery routes to save consumers’ time, identifying individuals’ online ordering behavior, and improving food services. For example, Jia (2018) examined restaurant customers’ ratings and reviews, and identified high-frequency words, major topics, and subtopics using text mining. The study showed that digital text data analytics is a cost-effective approach for restaurants to gain quality improvement ideas from customers. In particular, topic modeling, sentiment analysis, and network analysis are popular text analysis methods used in mining customer comments (Ibrahim & Wang, 2019).

4.8 Summary

Based on the discussion in this section, a core collection of 57 publications was further summarized by publication time, publication type, data source used, and data analysis methods used (as shown in Table S1). These papers were all technical and research articles published from 2010 to 2019, including 45 journal papers and 12 conference papers. No review article or short communication article was included. The text data sources and text analysis methods used in each paper were identified in the table.

We summarized the usage of different data sources in various applications in Figure 4. The total number of papers using a specific source of data (e.g., social media) was divided by the purposes/topics of study in each paper. Among all sources of data included in this work, social media has been the primary source of data analyzed for identifying food integrity issues, consumers’ opinions, and the dietary pattern of a population. The large volume of social media data would provide unprecedented opportunities in future food innovation and production (Kosior, 2019). Internet data such as online recipes are useful for investigating users’ food preferences, designing food formulations, and refining food recommendation systems. On the other hand, news media and government Web sites can be used for constructing systems for early detection of food safety and fraud hazards. Although database data have found their main application in discovering food and nutrition knowledge, they have also shown potential in preventing food safety or fraud issues. In addition, digital text data analytics can be used for acquiring new insights into optimizing food operations and minimizing food loss in the supply chain.
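The kind of tally behind a source-by-application summary, where the papers using each data source are divided by study topic, can be sketched as below. This is only an illustration under stated assumptions: the `papers` records and the helper name `tally_by_source` are hypothetical placeholders and do not reproduce the entries of Table S1 or the counts in Figure 4.

```python
from collections import Counter, defaultdict

# Hypothetical (data source, application area) labels, one pair per paper.
# These rows are placeholders, not the reviewed publications.
papers = [
    ("social media", "dietary pattern characterization"),
    ("social media", "dietary pattern characterization"),
    ("social media", "consumer opinion mining"),
    ("online recipes", "new product development"),
    ("news media", "food safety and fraud"),
    ("database", "food knowledge discovery"),
]


def tally_by_source(records):
    """Count papers per application area within each data source."""
    table = defaultdict(Counter)
    for source, application in records:
        table[source][application] += 1
    # Convert to plain dicts so the result is easy to print or serialize.
    return {source: dict(counts) for source, counts in table.items()}


counts = tally_by_source(papers)
for source, by_application in sorted(counts.items()):
    total = sum(by_application.values())
    print(f"{source} (n = {total})")
    for application, n in sorted(by_application.items()):
        print(f"  {application}: {n}")
```

The same function could be reused for a method-by-application summary by passing (analysis method, application area) pairs instead.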
The methods used in the fields of study are shown in Figure 5. The total number of papers using a specific data analysis method (e.g., text classification) was also divided by the purposes/topics of study in each paper. A variety of text analysis methods have been applied to food-related studies. Word-level analysis is the most popular method, followed by joint analysis of text and nontext data, with the largest proportion focusing on dietary pattern characterization. Word association analysis (e.g., network analysis) has been mostly applied to discover knowledge from massive document collections. We have also seen the adoption of advanced text analysis methods such as text classification, IR, topic modeling, sentiment analysis, and text clustering. They are mainly used for detecting and preventing food safety risks. Sentiment analysis has been applied to help understand consumer preferences toward food products or food brands.

5 OPPORTUNITIES

Big data has found extensive use in solving complicated real-world issues, with strong applications in fields outside of computer science, such as chemistry, biomedicine, pharmacy, and agriculture (Chiang et al., 2017; Drury & Roche, 2019). With an increasing amount of food-related data produced on the semantic web, it is natural to think that leveraging these data can benefit the food industry widely. This review focuses specifically on the use of text information in food-related research subjects. We have identified a number of applications of big data in the food industry. For example, food companies are innovating their NPD process (e.g., new flavor combinations) through analysis of multisourced data from ingredient lists, sensory results, and consumer preferences. Public health departments are developing new approaches for identifying people’s consumption patterns to improve their surveillance of chronic diseases and foodborne outbreaks. Consumers are making food purchase or dining decisions based on recommendation systems that can provide suggestions based on their eating habits. We can foresee that text data analysis will find wider applications in the near future. For example, we have seen the fusion of different sources of data helping to identify food safety and fraud hazards and characterize the consumption patterns of people in connection
with health such as obesity rate. As food has always been of great importance to public health, the potential of text data analytics in the public health domain, such as detection of foodborne illnesses and discovery of healthy and unhealthy dietary patterns, has been shown in a variety of studies (Abbar et al., 2015; Harris et al., 2014; Harrison et al., 2014; Huang et al., 2019; Sadilek et al., 2016; Vidal et al., 2015). As a result, public health departments may in the future be able to identify safety and health hazards earlier to improve the performance of food administration (Duggirala et al., 2015).

Business intelligence is another area for utilization of digital text data. Web-based text information is used to help companies make more informed choices on brand management, customer preference, and NPD. Furthermore, target marketing and recommendation systems would allow distributors to provide individuals with the right products based on their prior buying habits or preferential profiles. Digital text analytics can be viewed as a driver of competitive advantage for food distributors through gathering, storing, and analyzing enormous amounts of customer information (Galletti & Papadimitriou, 2013). However, maintaining the balance between business intelligence and health outcomes is a new challenge in the food industry that should not be ignored (Montgomery, Chester, Nixon, Levy, & Dorfman, 2019). In the meantime, these studies may also benefit customers. For example, a variety of information systems and smartphone applications have been created to help customers buy, prepare, and manage their diets in a smarter and more personalized way (Ahn et al., 2011; De Clercq et al., 2016; Ertuğrul, 2016).

6 CHALLENGES

Unlike structured information gathered from instrumental detectors, text information collected from web pages, blogs, and social media is unstructured and hard to analyze. Although social media is the most promising source of data with extensive applications, the ambiguities in social media data make it difficult to parse and interpret meaning, even with state-of-the-art techniques. The conservative attitudes of researchers toward applying social media text analysis in areas such as sensory research are partly due to this limitation (Piqueras-Fiszman, 2015). Furthermore, because of missing, mislabeled, inaccurate, or potentially spurious information, raw unstructured data are often dirty (Wang & Jones, 2017). The resulting low data quality is one of the biggest issues affecting data credibility. The lack of large volumes of labeled data also restricts the application of advanced ML and deep learning methods. Data representativeness is another issue associated with web mining. For instance, researchers have recently questioned the use of social media data for inferring health-related outcomes because of sampling bias (Cesare, Grant, & Nsoesie, 2019; Mooney & Garber, 2019) and its effects on validating complex models compared with expert reports (Sandhu, Giabbanelli, & Mago, 2019). Data integrity, that is, the accuracy of database elements, is likewise a critical issue for database data. For example, authorized users might make mistakes when collecting data, computing results, and entering values. Hence, database management systems sometimes have to take action
to catch and correct errors after they are inserted. On the other hand, maintenance of a large number of transactions in a large database is a rather expensive task (Doorn, 2001). As in the example of the blockchain environment, it is necessary to make sure that after data are recorded, they will not be altered (Yli-Huumo, Ko, Choi, Park, & Smolander, 2019).

Privacy remains a significant concern across cultures when it comes to digitalized information, particularly social media data (Sapienza & Palmirani, 2018). Limited data access and anonymization of sensitive information are two primary methods of protecting privacy (Wu, Zhu, Wu, & Ding, 2014). In addition, only population statistics are revealed, without accessing individual identification information. There are often restrictions on the information that can be extracted because of API constraints. For instance, less than 1% of the data on Twitter, the social media platform most commonly used in research, could be gathered using its API streaming technique (Vidal et al., 2015). One issue raised by this is that the result may be biased if the sample size is not sufficient. Although paid services such as the Twitter Enterprise API can provide scientists access to all historical data, they are costly to use. Besides social media, the use of other information sources for public health surveillance, such as digital records (e.g., EMRs and Fitbit), is also restricted because of privacy issues. As Paul and Dredze (2017) pointed out, consumers may not want their information to be used without being asked. Privacy issues can occur when user information is collected, for example on social media, while the user is unaware. In addition to consumer data, there can be privacy issues with industry sources of data. Kamilaris et al. (2019) discussed the problem in the blockchain environment, pointing out that the technology may increase transparency on the one hand while creating privacy issues on the other. In food supply systems, privacy is particularly important for keeping one’s competitive advantages in the market. Privacy will remain an ongoing subject of discussion; however, the advantages of using these information sources with the objective of information sharing should not be abandoned entirely. More effort should be made to improve user privacy while allowing data sharing and utilization.

Furthermore, the lack of computational skills for big data analysis is one of the biggest barriers to the full realization of its potential in other fields (Chiang et al., 2017). Piqueras-Fiszman (2015) presented a similar concern when considering its applications in rapid sensory evaluations. Today, many user-friendly software tools are being created to assist

7 CONCLUSIONS

This paper summarizes the use of text data in seven areas of research linked to food, nutrition, and health. It is evident that social media has received considerable attention from academia and industry and has been applied widely to almost every aspect of the research mentioned above. The use of social media information has allowed us to “see” hidden patterns via a “big data scope.” Nevertheless, broad application of social media is restricted by inferior data quality and privacy issues. Other sources of text data, though smaller, have been used to create useful information systems for helping consumers make decisions on purchasing, cooking, and eating. Text is a vital medium of human communication in any natural language. Our communication is made more efficient either by discovering hidden knowledge from text or by developing text-based information systems. Although several pitfalls remain to be investigated, we have seen that advanced text mining techniques could help address critical issues in food and nutrition sciences.

ACKNOWLEDGMENT

This study was partially supported by a USDA Specialty Crop Block Grant award through Illinois Department of Agriculture (698 IDOA SC-19-06) and the Illinois Agricultural Experiment Station.

AUTHOR CONTRIBUTIONS

Dandan Tao searched the literature and drafted the manuscript. Dr. Pengkun Yang edited Section 3 and provided suggestions on discussion of technical parts. Dr. Hao Feng reviewed and edited the manuscript.

ORCID

Hao Feng https://orcid.org/0000-0002-1703-2194

REFERENCES

Abbar, S., Mejova, Y., & Weber, I. (2015). You tweet what you eat: Studying food consumption through twitter. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (pp. 3197–3206), Seoul, Korea.
Aggarwal, C. C., & Zhai, C. (Eds.). (2012). Mining text data. Berlin: Springer Science & Business Media.
individuals to collect and analyze information more readily Ahn, Y. Y., Ahnert, S. E., Bagrow, J. P., & Barabási, A. L. (2011). Flavor
without a computer science background. It is expected that network and the principles of food pairing. Scientific Reports, 1, 196.
https://doi.org/10.1038/srep00196.
the fast advances in the area of data science would reduce
Ahnert, S. E. (2013). Network analysis and data mining in food sci-
the cost of skill learning and implementation. However, ence: The emergence of computational gastronomy. Flavour, 2(1),
for anyone interested in the technology, fundamental under- 4. https://doi.org/10.1186/2044-7248-2-4.
standing of digital data analysis along with popular tools is Aiello, L. M., Schifanella, R., Quercia, D., & Del Prete, L. (2019).
crucial. Large-scale and high-resolution analysis of food purchases and
health outcomes. EPJ Data Science, 8(1), 14. https://doi.org/10.1140/epjds/s13688-019-0191-y.
Al-Gaadi, K. A., Hassaballa, A. A., Tola, E., Kayad, A. G., Madugundu, R., Alblewi, B., & Assiri, F. (2016). Prediction of potato crop yield using precision agriculture techniques. PloS One, 11(9), e0162219.
Alamsyah, A., & Peranginangin, Y. (2015). Network market analysis using large scale social network conversation of Indonesia’s fast food industry. Proceedings of the 2015 3rd International Conference on Information and Communication Technology (ICoICT) (pp. 327–331). Piscataway, NJ: IEEE.
Anbarkhan, S., Stanier, C., & Sharp, B. (2018). Text mining approach to extract associations between obesity and Arabic herbal plants. Proceedings of the International Conference on Advanced Machine Learning Technologies and Applications (pp. 211–220). Cham: Springer.
Ariyasriwatana, W., & Quiroga, L. M. (2016). A thousand ways to say ‘delicious!’ categorizing expressions of deliciousness from restaurant reviews on the social network site Yelp. Appetite, 104, 18–32.
Asano, Y. M., & Biermann, G. (2019). Rising adoption and retention of meat-free diets in online recipe data. Nature Sustainability, 2(7), 621–627.
Astill, J., Dara, R. A., Campbell, M., Farber, J. M., Fraser, E. D., Sharif, S., & Yada, R. Y. (2019). Transparency in food supply chains: A review of enabling technology solutions. Trends in Food Science & Technology, 91, 240–247.
Bashir, N., Papamichail, K. N., & Malik, K. (2017). Use of social media applications for supporting new product development processes in multinational corporations. Technological Forecasting and Social Change, 120, 176–183.
Bhimani, H., Mention, A. L., & Barlatier, P. J. (2019). Social media and innovation: A systematic literature review and future research directions. Technological Forecasting and Social Change, 144, 251–269.
Blackburn, K. G., Yilmaz, G., & Boyd, R. L. (2018). Food for thought: Exploring how people think and talk about food online. Appetite, 123, 390–401.
Boulos, M. N. K., Yassine, A., Shirmohammadi, S., Namahoot, C. S., & Brückner, M. (2015). Towards an “Internet of Food”: Food ontologies for the Internet of Things. Future Internet, 7(4), 372–392.
Bouzembrak, Y., & Marvin, H. J. (2016). Prediction of food fraud type using data from Rapid Alert System for Food and Feed (RASFF) and Bayesian network modelling. Food Control, 61, 180–187.
Bouzembrak, Y., & Marvin, H. J. (2019). Impact of drivers of change, including climatic factors, on the occurrence of chemical food safety hazards in fruits and vegetables: A Bayesian Network approach. Food Control, 97, 67–76.
Bouzembrak, Y., Camenzuli, L., Janssen, E., & Van der Fels-Klerx, H. J. (2018). Application of Bayesian Networks in the development of herbs and spices sampling monitoring system. Food Control, 83, 38–44.
Carr, J., Decreton, L., Qin, W., Rojas, B., Rossochacki, T., & wen Yang, Y. (2015). Social media in product development. Food Quality and Preference, 40, 354–364.
Centers for Disease Control and Prevention (CDC). (2012). Overweight & obesity. Retrieved from http://www.cdc.gov/obesity
Centers for Disease Control and Prevention (CDC). (2018). Annual summaries of foodborne outbreaks. Atlanta, GA: US Department of Health and Human Services, CDC. Retrieved from https://www.cdc.gov/fdoss/annual-reports/index.html
Çelik, D. (2015). FoodWiki: Ontology-driven mobile safe food consumption system. Scientific World Journal, 2015. https://doi.org/10.1155/2015/475410.
Cesare, N., Grant, C., & Nsoesie, E. O. (2019). Understanding demographic bias and representation in social media health data. Proceedings of the Companion Publication of the 10th ACM Conference on Web Science (pp. 7–9). New York, NY: ACM.
Chaix, E., Deléger, L., Bossy, R., & Nédellec, C. (2019). Text mining tools for extracting information about microbial biodiversity in food. Food Microbiology, 81, 63–75.
Chen, M., Jia, X., Gorbonos, E., Hong, C. T., Yu, X., & Liu, Y. (2019). Eating healthier recipe recommendation. Information Processing & Management, 10251.
Chen, M. S., Han, J., & Yu, P. S. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866–883.
Chen, S., Huang, D., Nong, W., & Kwan, H. S. (2016). Development of a food safety information database for Greater China. Food Control, 65, 54–62.
Chiang, L., Lu, B., & Castillo, I. (2017). Big data analytics in chemical engineering. Annual Review of Chemical and Biomolecular Engineering, 8, 63–85.
Christensen, K., Nørskov, S., Frederiksen, L., & Scholderer, J. (2017). In search of new product ideas: Identifying ideas in online communities by machine learning and text mining. Creativity and Innovation Management, 26(1), 17–30.
Correa, J. C., Garzón, W., Brooker, P., Sakarkar, G., Carranza, S. A., Yunado, L., & Rincón, A. (2019). Evaluation of collaborative consumption of food delivery services through web mining techniques. Journal of Retailing and Consumer Services, 46, 45–50.
De Choudhury, M., Sharma, S., & Kiciman, E. (2016). Characterizing dietary choices, nutrition, and language in food deserts via social media. Proceedings of the 19th ACM Conference on Computer-supported Cooperative Work & Social Computing (pp. 1157–1170). New York, NY: ACM.
De Clercq, M., Stock, M., De Baets, B., & Waegeman, W. (2016). Data-driven recipe completion using machine learning methods. Trends in Food Science & Technology, 49, 1–13.
do Nascimento, A. B., Fiates, G. M. R., dos Anjos, A., & Teixeira, E. (2013). Analysis of ingredient lists of commercially available gluten-free and gluten-containing food products using the text mining technique. International Journal of Food Sciences and Nutrition, 64(2), 217–222.
Dooley, D. M., Griffiths, E. J., Gosal, G. S., Buttigieg, P. L., Hoehndorf, R., Lange, M. C., … Hsiao, W. W. (2018). FoodOn: A harmonized food ontology to increase global food traceability, quality control and data integration. NPJ Science of Food, 2(1), 23.
Doorn, J. H. (Ed.). (2001). Database integrity: Challenges and solutions. Hershey, PA: IGI Global.
Drury, B., & Roche, M. (2019). A survey of the applications of text mining for agriculture. Computers and Electronics in Agriculture, 163, 104864.
Du, C. J., & Sun, D. W. (2006). Learning techniques used in computer vision for food quality evaluation: A review. Journal of Food Engineering, 72(1), 39–55.
Duggirala, H. J., Tonning, J. M., Smith, E., Bright, R. A., Baker, J. D., Ball, R., … Boyer, M. (2015). Use of data mining at the Food and Drug Administration. Journal of the American Medical Informatics Association, 23(2), 428–434.
Effland, T., Lawson, A., Balter, S., Devinney, K., Reddy, V., Waechter, H., … Hsu, D. (2018). Discovering foodborne illness in online restaurant reviews. Journal of the American Medical Informatics Association, 25(12), 1586–1592.
Eftimov, T., Ispirova, G., Potočnik, D., Ogrinc, N., & Seljak, B. K. (2019). ISO-FOOD ontology: A formal representation of the knowledge within the domain of isotopes for food science. Food Chemistry, 277, 382–390.
El Bilali, H., & Allahyari, M. S. (2018). Transition toward sustainability in agriculture and food systems: Role of information and communication technologies. Information Processing in Agriculture, 5(4), 456–464.
Ertuğrul, D. Ç. (2016). Foodwiki: A mobile app examines side effects of food additives via semantic web. Journal of Medical Systems, 40(2), 41.
Esteki, M., Regueiro, J., & Simal-Gándara, J. (2019). Tackling fraudsters with global strategies to expose fraud in the food chain. Comprehensive Reviews in Food Science and Food Safety, 18(2), 425–440.
Fardet, A., Lakhssassi, S., & Briffaz, A. (2018). Beyond nutrient-based food indices: A data mining approach to search for a quantitative holistic index reflecting the degree of food processing and including physicochemical properties. Food & Function, 9(1), 561–572.
Fried, D., Surdeanu, M., Kobourov, S., Hingle, M., & Bell, D. (2014). Analyzing the language of food on social media. Proceedings of the 2014 IEEE International Conference on Big Data (Big Data) (pp. 778–783), Washington DC.
Fritsche, J. (2018). Recent developments and digital perspectives in food safety and authenticity. Journal of Agricultural and Food Chemistry, 66(29), 7562–7567.
Galletti, A., & Papadimitriou, D. C. (2013). How big data analytics are perceived as a driver for competitive advantage: A qualitative study on food retailers, pp. 1–59 (Master’s thesis, Uppsala University, Uppsala, Sweden).
Gan, Q., Ferns, B. H., Yu, Y., & Jin, L. (2017). A text mining and multidimensional sentiment analysis of online restaurant reviews. Journal of Quality Assurance in Hospitality & Tourism, 18(4), 465–492.
García-León, R. A. (2019). Twitter and Food Well-being: Analysis of #Slowfood Postings Reflecting the Food Well-being of Consumers. Global Media Journal México, 16(30), 91–112.
Ge, M., Elahi, M., Fernández-Tobías, I., Ricci, F., & Massimo, D. (2015). Using tags and latent factors in a food recommender system. Proceedings of the 5th International Conference on Digital Health (pp. 105–112). New York, NY: ACM.
Ghosh, D., & Guha, R. (2013). What are we tweeting about obesity? Mapping tweets with topic modeling and Geographic Information System. Cartography and Geographic Information Science, 40(2), 90–102.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012–1014.
Godfray, H. C. J., Beddington, J. R., Crute, I. R., Haddad, L., Lawrence, D., Muir, J. F., … Toulmin, C. (2010). Food security: The challenge of feeding 9 billion people. Science, 327(5967), 812–818.
Greenfield, H., & Southgate, D. A. (2003). Food composition data: Production, management, and use. Rome: FAO.
Greis, N. P., & Nogueira, M. L. (2017). A data-driven approach to food safety surveillance and response. In S. Kennedy (Ed.), Food protection and security (pp. 75–99). Amsterdam: Elsevier.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18.
Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques. Amsterdam: Elsevier.
Harris, J. K., Mansour, R., Choucair, B., Olson, J., Nissen, C., & Bhatt, J. (2014). Health department use of social media to identify foodborne illness-Chicago, Illinois, 2013–2014. Morbidity and Mortality Weekly Report, 63(32), 681–685.
Harrison, C., Jorder, M., Stern, H., Stavinsky, F., Reddy, V., Hanson, H., … Balter, S. (2014). Using online reviews by restaurant patrons to identify unreported cases of foodborne illness-New York City, 2012–2013. Morbidity and Mortality Weekly Report, 63(20), 441–445.
Hayashi, Y., Hsieh, M.-H., & Setiono, R. (2009). Predicting consumer preference for fast-food franchises: A data mining approach. Journal of the Operational Research Society, 60(9), 1221–1229.
He, W., Zha, S., & Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33(3), 464–472.
Hofmann, T. (2017). Probabilistic latent semantic indexing. ACM SIGIR Forum, 51(2), 211–218.
Hornik, K., & Hornik, M. K. (2018). Package ‘NLP’.
Huang, Y., Huang, D., & Nguyen, Q. C. (2019). Census tract food tweets and chronic disease outcomes in the US, 2015–2018. International Journal of Environmental Research and Public Health, 16(6), 975.
Ibrahim, N. F., & Wang, X. (2019). A text analytics approach for online retailing service improvement: Evidence from Twitter. Decision Support Systems, 121, 37–50.
Ishangulyyev, R., Kim, S., & Lee, S. H. (2019). Understanding food loss and waste—Why are we losing and wasting food? Foods, 8(8), 297.
Jensen, K., Panagiotou, G., & Kouskoumvekaki, I. (2014). Nutrichem: A systems chemical biology resource to explore the medicinal value of plant-based foods. Nucleic Acids Research, 43(D1), D940–D945.
Jia, S. (2018). Behind the ratings: Text mining of restaurant customers’ online reviews. International Journal of Market Research, 60(6), 561–572.
Jiménez-Carvelo, A. M., González-Casado, A., Bagur-González, M. G., & Cuadros-Rodríguez, L. (2019). Alternative data mining/machine learning methods for the analytical evaluation of food quality and authenticity—A review. Food Research International, 122, 25–39.
Jurafsky, D., & Martin, J. H. (2008). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, NJ: Prentice Hall.
Kamilaris, A., Fonts, A., & Prenafeta-Boldú, F. X. (2019). The rise of blockchain technology in agriculture and food supply chains. Trends in Food Science & Technology, 91, 640–652.
Kapoor, A. P., & Vij, M. (2018). Technology at the dinner table: Ordering food online through mobile apps. Journal of Retailing and Consumer Services, 43, 342–351.
Karaa, W. B. A., Mannai, M., Dey, N., Ashour, A. S., & Olariu, I. (2016). Gene-disease-food relation extraction from biomedical database. Proceedings of the International Workshop Soft Computing Applications (pp. 394–407). Berlin: Springer.
Kate, K., Chaudhari, S., Prapanca, A., & Kalagnanam, J. (2014). FoodSIS: A text mining system to improve the state of food safety in Singapore. Proceedings of the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (pp. 1709–1718). New York, NY: ACM.
Kim, A. Y., Ha, J. G., Choi, H., & Moon, H. (2018). Automated text analysis based on skip-gram model for food evaluation in predicting consumer acceptance. Computational Intelligence and Neuroscience, 2018. https://doi.org/10.1155/2018/9293437.
Kim, S., Sung, J., Foo, M., Jin, Y.-S., & Kim, P.-J. (2015). Uncovering the nutritional landscape of food. PloS One, 10(3), e0118697.
King, T., Cole, M., Farber, J. M., Eisenbrand, G., Zabaras, D., Fox, E. M., & Hill, J. P. (2017). Food safety for food security: Relationship between global megatrends and developments in food safety. Trends in Food Science & Technology, 68, 160–175.
Kodan, R., Parmar, P., & Pathania, S. (2019). Internet of things for food sector: Status quo and projected potential. Food Reviews International, 1–17. https://doi.org/10.1080/87559129.2019.1657442.
Kosior, K. (2019). Social media analytics in food innovation and production: A review. Proceedings in Food System Dynamics, 205–219. https://doi.org/10.18461/pfsd.2019.1921.
Kuttschreuter, M., Rutsaert, P., Hilverda, F., Regan, Á., Barnett, J., & Verbeke, W. (2014). Seeking information about food-related risks: The contribution of social media. Food Quality and Preference, 37, 10–18.
Lee, J., Ghimire, D., & Rho, J. O. (2013). Rough clustering of Korean foods based on adjectives for taste evaluation. Proceedings of the 2013 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD) (pp. 472–475). Piscataway, NJ: IEEE.
Loper, E., & Bird, S. (2002). NLTK: The natural language toolkit. arXiv preprint cs/0205028.
Lucas Luijckx, N. B., van de Brug, F. J., Leeman, W. R., van der Vossen, J. M., & Cnossen, H. J. (2016). Testing a text mining tool for emerging risk identification. EFSA Supporting Publications, 13(12), 1154E.
Maeda, Y., Kurita, N., & Ikeda, S. (2005). An early warning support system for food safety risks. Proceedings of the Annual Conference of the Japanese Society for Artificial Intelligence (pp. 446–457). Berlin: Springer.
Maharana, A., Cai, K., Hellerstein, J., Hswen, Y., Munsell, M., Staneva, V., … Nsoesie, E. O. (2019). Detecting reports of unsafe foods in consumer product reviews. JAMIA Open, 2(3), 330–338.
Manning, C., Raghavan, P., & Schütze, H. (2010). Introduction to information retrieval. Natural Language Engineering, 16(1), 100–103.
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55–60), Baltimore, MD.
Marvin, H. J., Janssen, E. M., Bouzembrak, Y., Hendriksen, P. J., & Staats, M. (2017). Big data in food safety: An overview. Critical Reviews in Food Science and Nutrition, 57(11), 2286–2295.
Masson, E., Bubendorff, S., & Fraïssé, C. (2018). Toward new forms of meal sharing? Collective habits and personal diets. Appetite, 123, 108–113.
Massung, S., Geigle, C., & Zhai, C. (2016). Meta: A unified toolkit for text retrieval and analysis. Proceedings of ACL-2016 System Demonstrations (pp. 91–96), Berlin.
McAuley, J., & Leskovec, J. (2013). Hidden factors and hidden topics: Understanding rating dimensions with review text. Proceedings of the 7th ACM Conference on Recommender Systems (pp. 165–172). New York, NY: ACM.
Meyer, C. H., Hamer, M., Terlau, W., Raithel, J., & Pongratz, P. (2015). Web data mining and social media analysis for better communication in food safety crises. International Journal on Food System Dynamics, 6(3), 129–138.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.
Minnens, F., Lucas Luijckx, N., & Verbeke, W. (2019). Food supply chain stakeholders’ perspectives on sharing information to detect and prevent food integrity issues. Foods, 8(6), 225. https://doi.org/10.3390/foods8060225.
Montgomery, K., Chester, J., Nixon, L., Levy, L., & Dorfman, L. (2019). Big data and the transformation of food and beverage marketing: Undermining efforts to reduce obesity? Critical Public Health, 29(1), 110–117.
Mooney, S. J., & Garber, M. D. (2019). Sampling and sampling frames in big data epidemiology. Current Epidemiology Reports, 6(1), 14–22.
Moore, J. C., Spink, J., & Lipp, M. (2012). Development and application of a database of food ingredient fraud and economically motivated adulteration from 1980 to 2010. Journal of Food Science, 77(4), R118–R126.
Moskowitz, H. R., & Saguy, I. S. (2013). Reinventing the role of consumer research in today’s open innovation ecosystem. Critical Reviews in Food Science and Nutrition, 53(7), 682–693.
Mostafa, M. M. (2018). Mining and mapping halal food consumers: A geo-located Twitter opinion polarity analysis. Journal of Food Products Marketing, 24(7), 858–879.
Mouritsen, O. G., Edwards-Stuart, R., Ahn, Y.-Y., & Ahnert, S. E. (2017). Data-driven methods for the study of food perception, preparation, consumption, and culture. Frontiers in ICT, 4, 15. https://doi.org/10.3389/fict.2017.00015.
Muninger, M. I., Hammedi, W., & Mahr, D. (2019). The value of social media for innovation: A capability perspective. Journal of Business Research, 95, 116–127.
Nsoesie, E. O., Kluberg, S. A., & Brownstein, J. S. (2014). Online reports of foodborne illness capture foods implicated in official foodborne outbreak reports. Preventive Medicine, 67, 264–269.
Nyati, U., Rawat, S., Gupta, D., Aggrawal, N., & Arora, A. (2019). Characterize ingredient network for recipe suggestion. International Journal of Information Technology, 1–8. https://doi.org/10.1007/s41870-019-00277-y
Ordun, C., Blake, J. W., Rosidi, N., Grigoryan, V., Reffett, C., Aslam, S., … Klenk, J. (2013). Open source health intelligence (OSHINT) for foodborne illness event characterization. Online Journal of Public Health Informatics, 5(1), e128.
Öztürk, Ö., & Özacar, T. (2019). A case study for block-based linked data generation: Recipes as jigsaw puzzles. Journal of Information Science. https://doi.org/10.1177/0165551519849518.
Paul, M. J., & Dredze, M. (2017). Social monitoring for public health. Synthesis Lectures on Information Concepts, Retrieval, and Services, 9(5), 1–183.
Pearson, D., & Perera, A. (2018). Reducing food waste: A practitioner guide identifying requirements for an integrated social marketing communication campaign. Social Marketing Quarterly, 24(1), 45–57.
Phan, T. T., Muralidhar, S., & Gatica-Perez, D. (2019). Drinks & crowds: Characterizing alcohol consumption through crowdsensing and social media. Proceedings of the ACM on Interactive,
Mobile, Wearable and Ubiquitous Technologies, 3(2), 59. https://doi.org/10.1145/3328930.
Pinel, F., Varshney, L. R., & Bhattacharjya, D. (2015). A culinary computational creativity system. In T. Besold, M. Schorlemmer, & A. Smaill (Eds.), Computational creativity research: Towards creative machines (pp. 327–346). Berlin: Springer.
Piqueras-Fiszman, B. (2015). Open-ended questions in sensory testing practice. In J. Delarue, J. B. Lawlor, & M. Rogeaux (Eds.), Rapid Sensory Profiling Techniques (pp. 247–267). Amsterdam: Elsevier.
Rakhi, N. K., Tuwani, R., Mukherjee, J., & Bagler, G. (2018). Data-driven analysis of biomedical literature suggests broad-spectrum benefits of culinary herbs and spices. PloS One, 13(5), e0198030.
Rapid Alert System for Food and Feed (RASFF). (2017). Directorate general for health and consumer protection. Brussels: European Commission.
Rejeb, A., Keogh, J. G., & Treiblmaier, H. (2019). Leveraging the Internet of things and blockchain technology in supply chain management. Future Internet, 11(7), 161. https://doi.org/10.3390/fi11070161.
Roman, L., Belorio, M., & Gomez, M. (2019). Gluten-free breads: The gap between research and commercial reality. Comprehensive Reviews in Food Science and Food Safety, 18(3), 690–702.
Rong, C., Liu, Z., Huo, N., & Sun, H. (2019). Exploring Chinese dietary habits using recipes extracted from websites. IEEE Access, 7, 24354–24361.
Ropodi, A., Panagou, E., & Nychas, G.-J. (2016). Data mining derived from food analyses using non-invasive/non-destructive analytical techniques; determination of food authenticity, quality & safety in tandem with computer science disciplines. Trends in Food Science & Technology, 50, 11–25.
Rutsaert, P., Regan, Á., Pieniak, Z., McConnon, Á., Moss, A., Wall, P., & Verbeke, W. (2013). The use of social media in food risk and benefit communication. Trends in Food Science & Technology, 30(1), 84–91.
Sadilek, A., Kautz, H. A., DiPrete, L., Labus, B., Portman, E., Teitel, J., & Silenzio, V. (2016). Deploying nEmesis: Preventing foodborne illness by data mining social media. Proceedings of the 28th IAAI Conference (pp. 3982–3990), Phoenix, AZ.
Sandhu, M., Giabbanelli, P. J., & Mago, V. K. (2019). From social media to expert reports: The impact of source selection on automatically validating complex conceptual models of obesity. Proceedings of the International Conference on Human-Computer Interaction (pp. 434–452). Cham: Springer.
Sapienza, S., & Palmirani, M. (2018). Emerging data governance issues in big data applications for food safety. Proceedings of the International Conference on Electronic Government and the Information Systems Perspective (pp. 221–230). Cham: Springer.
Sharma, S. S., & De Choudhury, M. (2015). Measuring and characterizing nutritional information of food and ingestion content in Instagram. Proceedings of the 24th International Conference on World Wide Web (pp. 115–116). New York, NY: ACM.
Simas, T., Ficek, M., Diaz-Guilera, A., Obrador, P., & Rodriguez, P. R. (2017). Food-bridging: A new network construction to unveil the principles of cooking. Frontiers in ICT, 4, 14. https://doi.org/10.3389/fict.2017.00014.
Singh, A., Shukla, N., & Mishra, N. (2018). Social media data analytics to improve supply chain management in food industries. Transportation Research Part E: Logistics and Transportation Review, 114, 398–415.
Spence, C., Wang, Q. J., & Youssef, J. (2017). Pairing flavours and the temporal order of tasting. Flavour, 6(1), 4.
Spink, J., & Moyer, D. C. (2011). Defining the public health threat of food fraud. Journal of Food Science, 76(9), R157–R163.
Steinberger, R., Pouliquen, B., & Van der Goot, E. (2013). An introduction to the Europe media monitor family of applications. arXiv:1309.5290.
Tao, F., Qi, Q., Liu, A., & Kusiak, A. (2018). Data-driven smart manufacturing. Journal of Manufacturing Systems, 48, 157–169.
Thakur, M., Olafsson, S., Lee, J. S., & Hurburgh, C. R. (2010). Data mining for recognizing patterns in foodborne disease outbreaks. Journal of Food Engineering, 97(2), 213–227.
Tiozzo, B., Pinto, A., Neresini, F., Sbalchiero, S., Parise, N., Ruzza, M., & Ravarotto, L. (2019). Food risk communication: Analysis of the media coverage of food risk on Italian online daily newspapers. Quality & Quantity, 53(6), 2843–2866.
Trattner, C., & Elsweiler, D. (2019). What online data say about eating habits. Nature Sustainability, 2(7), 545–546.
Trattner, C., Kusmierczyk, T., & Nørvåg, K. (2019). Investigating and predicting online food recipe upload behavior. Information Processing & Management, 56(3), 654–673.
U.S. Department of Agriculture (USDA). (2019). Agricultural Research Service. FoodData Central. Retrieved from https://fdc.nal.usda.gov
Van de Brug, F. J., Luijckx, N. L., Cnossen, H. J., & Houben, G. F. (2014). Early signals for emerging food safety risks: From past cases to future identification. Food Control, 39, 75–86.
Varshney, L. R., Pinel, F., Varshney, K. R., Bhattacharjya, D., Schörgendorfer, A., & Chee, Y. M. (2019). A big data approach to computational creativity: The curious case of Chef Watson. IBM Journal of Research and Development, 63(1), 7–1.
Vidal, L., Ares, G., Machín, L., & Jaeger, S. R. (2015). Using Twitter data for food-related consumer research: A case study on “what people say when tweeting about different eating situations.” Food Quality and Preference, 45, 58–69.
Wagner, C., Singer, P., & Strohmaier, M. (2014). The nature and evolution of online food preferences. EPJ Data Science, 3(1), 38.
Waldner, C. (2017). Big data for infectious diseases surveillance and the potential contribution to the investigation of foodborne disease in Canada. Winnipeg, Canada: National Collaborating Centre for Infectious Diseases.
Wang, L., & Jones, R. (2017). Big data analytics for disparate data. American Journal of Intelligent Systems, 7(2), 39–46.
West, R., White, R. W., & Horvitz, E. (2013). From cookies to cooks: Insights on dietary patterns via analysis of web usage logs. Proceedings of the 22nd International Conference on World Wide Web (pp. 1399–1410), Brazil.
Wu, X., Zhu, X., Wu, G.-Q., & Ding, W. (2014). Data mining with big data. IEEE Transactions on Knowledge and Data Engineering, 26(1), 97–107.
Xu, X., & Huang, Y. (2019). Restaurant information cues, Diners’ expectations, and need for cognition: Experimental studies of online-to-offline mobile food ordering. Journal of Retailing and Consumer Services, 51, 231–241.
Yang, C., Ambayo, H., Baets, B. D., Kolsteren, P., Thanintorn, N., Hawwash, D., … Lachat, C. (2019). An ontology to standardize research output of nutritional epidemiology: From paper-based standards to linked content. Nutrients, 11(6), 1300.
Yang, H., Swaminathan, R., Sharma, A., Ketkar, V., & Jason, D. (2011). Mining biomedical text toward building a quantitative food-disease-gene network. In M. Biba, & F. Xhafa (Eds.), Learning structure and schemas from documents (pp. 205–225). Berlin: Springer.
Yli-Huumo, J., Ko, D., Choi, S., Park, S., & Smolander, K. (2016). Where is current research on blockchain technology? A systematic review. PloS One, 11(10), e0163477.
Zhai, C., & Massung, S. (2016). Text data management and analysis: A practical introduction to information retrieval and text mining. San Rafael, CA: Morgan & Claypool.
Zhao, Z., Li, M., Li, C., Wang, T., Xu, Y., Zhan, Z., … Chen, Y. (2019). Dietary preferences and diabetic risk in China: A large-scale nationwide Internet data based study. Journal of Diabetes. https://doi.org/10.1111/1753-0407.12967.
Zhu, X., Huang, I. Y., & Manning, L. (2019). The role of media reporting in food safety governance in China: A dairy case study. Food Control, 96, 165–179.

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of the article.

How to cite this article: Tao D, Yang P, Feng H. Utilization of text mining as a big data analysis tool for food science and nutrition. Compr Rev Food Sci Food Saf. 2020;1–20. https://doi.org/10.1111/1541-4337.12540