Introduction
While healthcare costs have been rising steadily, the quality of care
provided to patients in the United States has not seen comparable
improvement.
Several recent studies have shown that incorporating current healthcare
technologies can reduce mortality rates, healthcare costs, and medical
complications at various hospitals.
Recent advances in information technology have made it increasingly easy
to collect various forms of healthcare data. In this digital world, data has
become an integral part of healthcare.
A recent report on Big Data estimates the overall potential value of
healthcare data at around $300 billion.
Due to rapid advances in data sensing and acquisition technologies,
hospitals and healthcare institutions have started collecting vast amounts
of healthcare data about their patients.
Effectively understanding and building knowledge from healthcare data
requires advanced analytical techniques that can transform data into
meaningful and actionable information.
General computing technologies have started revolutionizing the manner
in which medical care is delivered to patients. Data analytics, in
particular, forms a critical component of these technologies.
Applied to healthcare data, analytical solutions have immense potential
to transform healthcare delivery from reactive to proactive.
The impact of analytics in the healthcare domain will only grow over the
next several years. Analyzing health data allows us to uncover patterns
hidden in the data.
It also helps clinicians build individualized patient profiles and
accurately estimate the likelihood that an individual patient will suffer
a medical complication in the near future.
Healthcare data is particularly rich and is derived from a wide variety of
sources such as sensors, images, text in the form of biomedical
literature and clinical notes, and traditional electronic records.
This heterogeneity in the data collection and representation process leads
to numerous challenges in both the processing and analysis of the
underlying data.
There is wide diversity in the techniques required to analyze these
different forms of data. In addition, the heterogeneity of the data
naturally creates various data integration and data analysis challenges.
In many cases, insights can be obtained by combining diverse data types
that would not be possible from any single source. The vast potential of
such integrated data analysis methods is only now being realized.
From a researcher and practitioner perspective, a major challenge in
healthcare is its interdisciplinary nature.
Advances in the field have often come from diverse disciplines such as
databases, data mining, and information retrieval, as well as from medical
researchers and healthcare practitioners.
While this interdisciplinary nature adds to the richness of the field, it also
adds to the challenge of making significant advances. Computer
scientists are usually not trained in domain-specific medical concepts,
whereas medical practitioners and researchers have limited exposure
to the mathematical and statistical background required in data
analytics.
This has added to the difficulty in creating a coherent body of work in this
field even though it is evident that much of the available data can benefit
from such advanced analysis techniques.
Such diversity has often led to independent lines of work conducted
from completely different perspectives.
Researchers in data analytics are particularly susceptible to
becoming isolated from real domain-specific problems, and may propose
problem formulations that are technically excellent but have no
practical use.
Healthcare Data Sources
Electronic Health Records
Electronic health records (EHRs) contain a digitized version of a
patient’s medical history.
They encompass a full range of data relevant to a patient’s care, such as
demographics, problems, medications, physician’s observations, vital
signs, medical history, laboratory data, radiology reports, progress notes,
and billing data.
Many EHRs go beyond a patient’s medical or treatment history and may
contain additional, broader perspectives on a patient’s care.
An important property of EHRs is that they provide an effective and
efficient way for healthcare providers and organizations to share patient
information with one another.
In this context, EHRs are inherently designed as real-time records that
can be instantly accessed and edited by authorized users. This can be
very useful in practical settings.
For example, a hospital or specialist may wish to access the medical
records of the primary provider. An electronic health record streamlines
the workflow by allowing direct access to up-to-date records in real
time. It can generate a complete record of a patient’s clinical encounter,
and support other care-related activities such as evidence-based decision
support, quality management, and outcomes reporting. The storage and
retrieval of health-related data is more efficient using EHRs. They help
improve the quality and convenience of patient care, increase patient
participation in the healthcare process, improve the accuracy of diagnoses
and health outcomes, and improve care coordination.
Biomedical Image
Medical imaging plays an important role in modern-day healthcare due to
its immense capability in providing high-quality images of anatomical
structures in human beings.
Effectively analyzing such images is useful for clinicians and medical
researchers, since it can aid disease monitoring, treatment planning, and
prognosis. The most popular imaging modalities used to acquire a
biomedical image are magnetic resonance imaging (MRI), computed
tomography (CT), positron emission tomography (PET), and ultrasound
(U/S).
Being able to look inside the body and view the human organs without
harming the patient has tremendous implications for human health.
Such capabilities allow physicians to better understand the cause of
an illness or other adverse conditions without cutting the patient open.
However, viewing such organs through images is only the first step of
the process.
The final goal of biomedical image analysis is to be able to generate
quantitative information and make inferences from the images that can
provide far more insights into a medical condition.
Such analysis has major societal significance since it is the key to
understanding biological systems and solving health problems. However, it
includes many challenges since the images are varied, complex, and can
contain irregular shapes with noisy values.
General categories of research problems that arise in analyzing images
include object detection, image segmentation, image registration, and
feature extraction.
Resolving these challenges will enable the generation of meaningful
analytic measurements that can serve as inputs to other areas of
healthcare data analytics.
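As a minimal illustration of one of these tasks, the sketch below performs image segmentation by simple intensity thresholding on a synthetic 2D "image" represented as nested lists. This is only a didactic toy; real biomedical segmentation relies on far more robust methods, and the image and threshold here are invented for illustration.

```python
# Simple intensity-threshold segmentation on a synthetic 2D image.
# Pixels brighter than the threshold are labeled 1 (object), else 0.

def threshold_segment(image, threshold):
    """Return a binary mask: 1 where intensity exceeds the threshold."""
    return [[1 if px > threshold else 0 for px in row] for row in image]

# Synthetic 4x4 image: a bright 2x2 "structure" on a dark background.
image = [
    [10, 12, 11, 10],
    [11, 90, 95, 12],
    [10, 88, 92, 11],
    [12, 10, 11, 10],
]

mask = threshold_segment(image, 50)
object_pixels = sum(sum(row) for row in mask)  # quantitative measurement
```

The resulting pixel count is exactly the kind of quantitative measurement that can feed downstream analytics, as described above.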
Sensor Data
Sensor data is ubiquitous in the medical domain both for real time and for
retrospective analysis.
Several medical data collection instruments, such as the
electrocardiogram (ECG) and the electroencephalogram (EEG), are
essentially sensors that collect signals from various parts of the human
body.
The data collected by these instruments are sometimes used for
retrospective analysis, but more often for real-time analysis.
Perhaps the most important use case of real-time analysis is in the
context of intensive care units (ICUs) and real-time remote monitoring of
patients with specific medical conditions.
In all these cases, the volume of data to be processed can be rather
large. For example, in an ICU it is not uncommon for the monitoring
system to receive input from hundreds of data sources, and alarms need
to be triggered in real time.
Such applications necessitate the use of big-data frameworks and
specialized hardware platforms.
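The real-time alarming described above can be sketched as a simple threshold check over a stream of sensor readings. The vital-sign names and limits below are illustrative assumptions, not clinical reference ranges; a production ICU system would use validated thresholds and a streaming framework.

```python
# Minimal sketch of real-time threshold alarming over sensor readings.
# Limits are illustrative placeholders, not clinical reference ranges.

NORMAL_LIMITS = {
    "heart_rate": (40, 140),  # beats per minute (assumed limits)
    "spo2": (90, 100),        # percent oxygen saturation (assumed limits)
}

def check_reading(signal_name, value):
    """Return an alarm string if the reading is out of range, else None."""
    low, high = NORMAL_LIMITS[signal_name]
    if value < low or value > high:
        return f"ALARM: {signal_name}={value} outside [{low}, {high}]"
    return None

# Simulated stream of (signal, value) readings from multiple sources.
stream = [("heart_rate", 72), ("spo2", 85), ("heart_rate", 150)]
alarms = [a for a in (check_reading(s, v) for s, v in stream) if a]
```

In a real deployment the stream would arrive continuously from bedside devices, and the check would run inside a low-latency event-processing pipeline rather than a Python list comprehension.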
In remote-monitoring applications, both real-time events and long-term
analysis of various trends and treatment alternatives are of great
interest.
While rapid growth in sensor data offers significant promise to impact
healthcare, it also introduces a data overload challenge.
Hence, it becomes extremely important to develop novel data analytical
tools that can transform such large volumes of collected data into
meaningful and interpretable knowledge.
Such analytical methods will not only allow for better observation of
patients’ physiological signals and help provide situational awareness at
the bedside, but also provide better insight into the inefficiencies in the
healthcare system that may be the root cause of surging costs.
The research challenges associated with mining sensor data in healthcare
settings span sensor mining applications and systems in both clinical and
non-clinical settings.
Biomedical Signal
Biomedical Signal Analysis consists of measuring signals from biological
sources, the origin of which lies in various physiological processes.
Examples of such signals include the electroneurogram (ENG),
electromyogram (EMG), electrocardiogram (ECG), electroencephalogram
(EEG), electrogastrogram (EGG), phonocardiogram (PCG), and so on.
The analysis of these signals is vital in diagnosing the pathological
conditions and in deciding an appropriate care pathway.
The measurement of physiological signals gives some form of quantitative
or relative assessment of the state of the human body.
These signals are acquired from various kinds of sensors and transducers,
either invasively or non-invasively.
These signals can be either discrete or continuous depending on the kind
of care or severity of a particular pathological condition. The processing
and interpretation of physiological signals is challenging due to the low
signal-to-noise ratio (SNR) and the interdependency of the physiological
systems.
The signal data obtained from the corresponding medical instruments can
be copiously noisy, and may sometimes require a significant amount of
preprocessing.
Several signal processing algorithms have been developed that have
significantly enhanced the understanding of the physiological processes.
A wide variety of methods are used for filtering, noise removal, and
compact signal representation.
More sophisticated analysis methods including dimensionality reduction
techniques such as Principal Component Analysis (PCA), Singular Value
Decomposition (SVD), and wavelet transformation have also been widely
investigated in the literature.
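As a small sketch of one of these techniques, the example below performs PCA via SVD on a synthetic multichannel signal matrix (samples x channels). The signal is invented for illustration: four channels sharing one sinusoidal component plus low-level noise, so the first principal component should capture nearly all the variance.

```python
import numpy as np

# PCA via SVD on a synthetic multichannel signal (samples x channels).
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
base = np.sin(2 * np.pi * 5 * t)  # shared component across channels

# Four channels: the shared component plus small independent noise.
signals = np.stack(
    [base + 0.05 * rng.standard_normal(200) for _ in range(4)], axis=1
)

X = signals - signals.mean(axis=0)                # center each channel
U, S, Vt = np.linalg.svd(X, full_matrices=False)  # SVD of centered data
explained = (S ** 2) / (S ** 2).sum()             # variance explained per PC
reduced = X @ Vt[0]                               # projection onto first PC
```

Because the channels are highly correlated, the four-channel recording compresses to a single component with little loss, which is exactly the dimensionality reduction these methods aim for.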
Genomic Data
A significant number of diseases are genetic in nature, but the causal
relationship between genetic markers and these diseases has not been
fully established.
For example, diabetes is well known to be a genetic disease; however, the
full set of genetic markers that make an individual prone to diabetes is
unknown.
In some other cases, such as the blindness caused by Stargardt disease,
the relevant genes are known but all the possible mutations have not
been exhaustively isolated.
Clearly, a broader understanding of the relationships between various
genetic markers, mutations, and disease conditions has significant
potential in assisting the development of various gene therapies to cure
these conditions.
Researchers are mostly interested in understanding what kinds of
health-related questions can be addressed through in silico analysis of
genomic data in typical data-driven studies.
Moreover, translating genetic discoveries into personalized medicine
practice is a highly non-trivial task with many unresolved challenges. For
example, the genomic landscapes of complex diseases such as cancers
are overwhelmingly complicated, revealing a high degree of heterogeneity
among individuals.
Solving these issues will fit a major piece of the puzzle and bring the
concept of personalized medicine much closer to reality.
Recent advances in biotechnology have led to the rapid generation of
large volumes of biological and medical information and to advanced
genomic research.
This has also led to unprecedented opportunities and hopes for
genome-scale study of challenging problems in life science. For example,
advances in genomic technology have made it possible to study the
complete genomic landscape of individuals in both health and complex
disease.
Many of these research directions have already shown promising results,
generating new insights into the biology of human disease and
predicting the personalized response of individuals to particular
treatments.
Also, genetic data are often modeled either as sequences or as networks.
Therefore, the work in this field requires a good understanding of
sequence and network mining techniques.
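A basic building block of such sequence mining is counting k-mers (overlapping length-k substrings) in a genomic sequence. The sketch below uses a short synthetic DNA string purely for illustration.

```python
from collections import Counter

# Count every overlapping length-k substring (k-mer) in a DNA sequence.
def kmer_counts(sequence, k):
    """Return a Counter mapping each k-mer to its occurrence count."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

seq = "ATGATGCGATG"  # synthetic example sequence
counts = kmer_counts(seq, 3)
most_common_kmer, freq = counts.most_common(1)[0]
```

Frequent k-mers like these serve as features in many downstream analyses, from motif discovery to sequence classification.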
Various data analytics-based solutions are being developed for tackling
key research problems in medicine such as identification of disease
biomarkers and therapeutic targets and prediction of clinical outcome.
Clinical Text Mining
Much of the information about patients is encoded in the form of clinical
notes.
These notes are typically stored in an unstructured format and form the
backbone of much of healthcare data.
They contain clinical information captured from transcribed dictations,
direct entry by providers, or speech recognition applications.
They are perhaps the richest source of unexploited information.
Needless to say, manually encoding the broad range of clinical
information contained in this free text is too costly and time-consuming,
which is why manual coding is typically limited to primary and secondary
diagnoses and to procedures for billing purposes.
Such notes are notoriously challenging to analyze automatically due to the
complexity involved in converting free-text clinical narrative into a
structured format.
The task is hard mainly because of the notes’ unstructured nature,
heterogeneity, diverse formats, and varying context across different
patients and practitioners.
Natural language processing (NLP) and entity extraction play an
important part in inferring useful knowledge from large volumes of clinical
text and in automatically encoding clinical information in a timely manner.
In general, data preprocessing methods are more important in these
contexts as compared to the actual mining techniques.
The processing of clinical text using NLP methods is more challenging
than the processing of other texts due to the ungrammatical nature of
short, telegraphic phrases, dictations, shorthand lexicons such as
abbreviations and acronyms, and frequently misspelled clinical terms.
All these problems have a direct impact on standard NLP tasks such as
shallow or full parsing, sentence segmentation, and text categorization,
making clinical text processing highly challenging. A wide range of NLP
methods and data mining techniques have been developed for extracting
information from clinical text.
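One simple preprocessing step for the shorthand lexicons mentioned above is dictionary-based abbreviation expansion. The sketch below uses a tiny illustrative lookup table and a toy note; real systems rely on curated clinical lexicons and must disambiguate abbreviations with multiple senses.

```python
import re

# Expand whole-word clinical abbreviations via a lookup table.
# The table below is an illustrative assumption, not a clinical lexicon.
ABBREVIATIONS = {
    "pt": "patient",
    "hx": "history",
    "htn": "hypertension",
    "sob": "shortness of breath",
}

def expand_abbreviations(note):
    """Replace whole-word abbreviations (case-insensitive) with expansions."""
    def repl(match):
        return ABBREVIATIONS[match.group(0).lower()]
    pattern = r"\b(" + "|".join(ABBREVIATIONS) + r")\b"
    return re.sub(pattern, repl, note, flags=re.IGNORECASE)

note = "Pt with hx of HTN presents with SOB."
expanded = expand_abbreviations(note)
```

Even this naive step illustrates why preprocessing dominates clinical NLP pipelines: downstream parsers and classifiers perform far better on normalized text.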
Mining Biomedical Literature
A significant number of applications rely on evidence from the biomedical
literature.
The latter is copious and has grown significantly over time.
The use of text mining methods for the long-term preservation,
accessibility, and usability of digitally available resources is important in
biomedical applications relying on evidence from scientific literature.
Text mining methods and tools offer novel ways of applying new
knowledge discovery methods in the biomedical field.
Such tools offer efficient ways to search, extract, combine, analyze and
summarize textual data, thus supporting researchers in knowledge
discovery and generation.
One of the major challenges in biomedical text mining is the
multidisciplinary nature of the field. For example, biologists describe
chemical compounds using brand names, while chemists often use less
ambiguous IUPAC-compliant names or unambiguous descriptors such as
International Chemical Identifiers.
While the latter can be handled with cheminformatics tools, text mining
techniques are required to extract less precisely defined entities and their
relations from the literature.
In this context, entity and event extraction methods play a key role in
discovering useful knowledge from unstructured databases. Because the
cost of curating such databases is too high, text mining methods offer
new opportunities for their effective population, update, and integration.
Text mining brings about other benefits to biomedical research by linking
textual evidence to biomedical pathways, reducing the cost of expert
knowledge validation, and generating hypotheses.
The approach provides a general methodology to discover previously
unknown links and enhance the way in which biomedical knowledge is
organized.
EHR
An Electronic Health Record (EHR) is a digital version of a patient’s
medical history.
It is a longitudinal record of patient health information generated by one
or several encounters in any healthcare providing setting. The term is
often used interchangeably with EMR (Electronic Medical Record) and CPR
(Computer-based Patient Record).
It encompasses a full range of data relevant to a patient’s care such as
demographics, problems, medications, physician’s observations, vital
signs, medical history, immunizations, laboratory data, radiology reports,
personal statistics, progress notes, and billing data.
The EHR system automates the data management process of complex
clinical environments and has the potential to streamline the clinician’s
workflow.
It can generate a complete record of a patient’s clinical encounter, and
support other care-related activities such as evidence-based decision
support, quality management, and outcomes reporting.
An EHR system integrates data for different purposes. It enables the
administrator to utilize the data for billing purposes, the physician to
analyze patient diagnostics information and treatment effectiveness, the
nurse to report adverse conditions, and the researcher to discover new
knowledge.
EHRs have several advantages over paper-based systems. Storage and
retrieval of data is obviously more efficient using EHRs. They help improve
the quality and convenience of patient care, increase patient participation
in the healthcare process, improve the accuracy of diagnoses and health
outcomes, and improve care coordination.
They also reduce costs by eliminating the need for paper and other
storage media, and they create opportunities for research in different
disciplines. In 2011, 54% of physicians had adopted an EHR system, and
about three-quarters of adopters reported that using an EHR system
resulted in enhanced patient care.
Usually, an EHR is maintained within an institution, such as a hospital,
clinic, or physician’s office. The institution holds the longitudinal records
of a particular patient collected there, but not the records of care
provided to the patient at other venues. Information regarding the
general population may
be kept in a nationwide or regional health information system. Depending
on the goal, service, venue, and role of the user, EHR can have different
data formats, presentations, and level of detail.
Components of EHR
The main purpose of an EHR is to support clinical care and billing.
EHR systems also include other functionalities, such as improving the
quality and convenience of patient care, the accuracy of diagnoses and
health outcomes, care coordination and patient participation, cost
savings, and, ultimately, the general health of the population.
Most modern EHR systems are designed to integrate data from different
components, such as administrative, nursing, pharmacy, laboratory,
radiology, and physicians’ entries.
Electronic records may be generated by any department. Hospitals and
clinics may use a number of different ancillary system providers; in that
case, these systems are not necessarily integrated with the main EHR
system. These systems may be stand-alone and may use different
vocabulary standards.
If appropriate interfaces are provided, data from these systems can be
incorporated in a consolidated fashion; otherwise a clinician has to open
and log into a series of applications to get the complete patient record.
The number of components present may also vary depending on the
service provided.
Administrative System Components
Administrative data such as patient registration, admission, discharge, and
transfer data are key components of the EHR. These also include name,
demographics, employer history, chief complaint, patient disposition, etc., along
with the patient billing information. Social history data such as marital status,
home environment, daily routine, dietary patterns, sleep patterns, exercise
patterns, tobacco use, alcohol use, and drug use, and family history data such as
personal health history, hereditary diseases, father, mother, and sibling(s) health
status, age, and cause of death can also be a part of it. Apart from fields like
“comments” or “description,” these data generally consist of name-value pairs. This
information is used to identify and assess a patient, and for all other
administrative purposes. During the registration process, a patient is generally
assigned a unique identification key comprising a numeric or alphanumeric
sequence. This key helps to link all the components across different platforms.
For example, lab test data can create an electronic record; and another record is
created from radiology results. Both records will have the same identifier key to
represent a single patient. Records of a previous encounter are also pulled up
using this key. It is often referred to as the medical record number or master
patient index (MPI). Administrative data allows the aggregation of a person’s
health information for clinical analysis and research.
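The record-linking role of the master patient index can be sketched as a simple aggregation keyed on the shared identifier. The record structures, field names, and identifier values below are illustrative assumptions; real systems use standardized messaging and far richer schemas.

```python
from collections import defaultdict

# Link records from separate ancillary systems by a shared MPI key.
# All identifiers and field names here are invented for illustration.
lab_records = [
    {"mpi": "A1001", "source": "lab", "test": "glucose", "value": 5.4},
    {"mpi": "A1002", "source": "lab", "test": "glucose", "value": 7.9},
]
radiology_records = [
    {"mpi": "A1001", "source": "radiology", "report": "chest X-ray: clear"},
]

def link_by_mpi(*record_sets):
    """Aggregate records from every system under the shared patient key."""
    patients = defaultdict(list)
    for records in record_sets:
        for rec in records:
            patients[rec["mpi"]].append(rec)
    return dict(patients)

linked = link_by_mpi(lab_records, radiology_records)
```

Aggregating by the identifier key is what makes a consolidated, per-patient view possible across otherwise independent departmental systems.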
Laboratory System Components & Vital Signs
Generally, laboratory systems are stand-alone systems interfaced with the
central EHR system. Lab data are structured data that can be expressed using
standard terminology and stored in the form of name-value pairs. Lab data play
an extremely important part in the clinical care process, providing professionals
with the information needed for prevention, diagnosis, treatment, and health
management. About 60% to 70% of medical decisions are based on laboratory
test results [7]. Electronic lab data has several benefits, including improved
presentation and a reduction in errors from manual data entry. A physician can
easily compare results from previous tests and, if the option is provided, can
also automatically check whether results fall within the normal range.
The most common coding system used to represent laboratory test data
is Logical Observation Identifiers Names and Codes (LOINC). Many hospitals use
their local dictionaries as well to encode variables. A 2009–2010 Vanderbilt
University Medical Center data standardization study found that for simple
concepts such as “weight” and “height,” there were more than five internal
representations. In different places there are different field names for the same
feature and the values
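The name-value storage and automatic range checking described above can be sketched as follows. The LOINC-style codes, reference ranges, and units below are illustrative assumptions for the sketch, not clinical reference values.

```python
# Lab results as name-value pairs keyed by a LOINC-style code, with
# automatic range flagging. Codes and ranges are illustrative only.
REFERENCE_RANGES = {
    # code: (name, low, high, unit)  -- assumed values for the sketch
    "2345-7": ("glucose", 3.9, 5.6, "mmol/L"),
    "718-7": ("hemoglobin", 12.0, 17.5, "g/dL"),
}

def flag_result(code, value):
    """Return 'normal', 'low', or 'high' for a coded lab result."""
    _name, low, high, _unit = REFERENCE_RANGES[code]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

results = [("2345-7", 6.2), ("718-7", 13.1)]  # (code, value) pairs
flags = {code: flag_result(code, value) for code, value in results}
```

Keying results on a standard code rather than a free-text field name is precisely what avoids the multiple-internal-representations problem noted in the Vanderbilt study above.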