
758 IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 26, NO. 4, APRIL 2018

A Deep Learning Architecture for Temporal Sleep Stage Classification Using Multivariate and Multimodal Time Series
Stanislas Chambon , Mathieu N. Galtier, Pierrick J. Arnal, Gilles Wainrib, and Alexandre Gramfort

Abstract — Sleep stage classification constitutes an important preliminary exam in the diagnosis of sleep disorders. It is traditionally performed by a sleep expert who assigns a sleep stage to each 30 s of signal, based on the visual inspection of signals such as electroencephalograms (EEGs), electrooculograms (EOGs), electrocardiograms, and electromyograms (EMGs). We introduce here the first deep learning approach for sleep stage classification that learns end-to-end without computing spectrograms or extracting handcrafted features, that exploits all multivariate and multimodal polysomnography (PSG) signals (EEG, EMG, and EOG), and that can exploit the temporal context of each 30-s window of data. For each modality, the first layer learns linear spatial filters that exploit the array of sensors to increase the signal-to-noise ratio, and the last layer feeds the learnt representation to a softmax classifier. Our model is compared to alternative automatic approaches based on convolutional networks or decision trees. Results obtained on 61 publicly available PSG records with up to 20 EEG channels demonstrate that our network architecture yields state-of-the-art performance. Our study reveals a number of insights on the spatiotemporal distribution of the signal of interest: a good tradeoff for optimal classification performance measured with balanced accuracy is to use 6 EEG channels with 2 EOG (left and right) and 3 EMG chin channels. Also, exploiting 1 min of data before and after each data segment offers the strongest improvement when a limited number of channels are available. Like sleep experts, our system exploits the multivariate and multimodal nature of PSG signals in order to deliver state-of-the-art classification performance with a small computational cost.

Index Terms — Sleep stage classification, multivariate time series, deep learning, spatio-temporal data, transfer learning, EEG, EOG, EMG.

Manuscript received July 10, 2017; revised November 27, 2017; accepted January 26, 2018. Date of publication March 7, 2018; date of current version April 6, 2018. This work was supported by the French Association Nationale de la Recherche et de la Technologie under Grant 2015/1005. (Corresponding author: Stanislas Chambon.)

S. Chambon is with the Research & Algorithms Team, Rythm Inc., Paris, France, and also with the Laboratoire Traitement et Communication de l'Information, Télécom ParisTech, Université Paris-Saclay, Paris, France (e-mail: stanislas@rythm.co).

M. N. Galtier and P. J. Arnal are with the Research & Algorithms Team, Rythm Inc., Paris, France (e-mail: mathieu@rythm.co; pierrick@rythm.co).

G. Wainrib is with the DATA Team, Département d'Informatique, Ecole Normale Supérieure, 75005 Paris, France (e-mail: gilles.wainrib@ens.fr).

A. Gramfort is with LTCI, Télécom ParisTech, Université Paris-Saclay, Paris, France, with INRIA, Université Paris-Saclay, Paris, France, and also with CEA, Université Paris-Saclay, Paris, France (e-mail: alexandre.gramfort@inria.fr).

Digital Object Identifier 10.1109/TNSRE.2018.2813138

1534-4320 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

SLEEP stage identification, a.k.a. sleep scoring or sleep stage classification, is of great interest to better understand sleep and its disorders. Indeed, the construction of a hypnogram, the sequence of sleep stages over a night, is often involved, as a preliminary exam, in the diagnosis of sleep disorders such as insomnia or sleep apnea [1]. Traditionally, this exam is performed as follows. First, a subject sleeps with a medical device which performs a polysomnography (PSG), i.e., it records electroencephalography (EEG) signals at different locations over the head, electrooculography (EOG) signals, electromyography (EMG) signals, and eventually more. Second, a human sleep expert looks at the different time series recorded over the night and assigns to each 30 s time segment a sleep stage following a reference nomenclature such as the American Academy of Sleep Medicine (AASM) rules [2] or the Rechtschaffen and Kales (RK) rules [3]. Regarding the AASM rules, 5 stages are identified: Wake (W), Rapid Eye Movements (REM), Non REM1 (N1), Non REM2 (N2) and Non REM3 (N3), also known as slow wave sleep or even deep sleep. They are characterized by distinct time and frequency patterns and they also differ in proportions over a night. For instance, transitory stages such as N1 are less frequent than REM or N2. In the case of the AASM rules, the transitions between two different stages are also documented and the transition rules may modulate the final decision of a human scorer. Indeed, some transitions are prohibited and others are strengthened depending on the occurrence of some events such as arousals, K-complexes or spindles regarding the transition N1-N2 [2], [4]. Although very precious information is collected thanks to this exam, sleep scoring is a tedious and time consuming task which is furthermore subject to the scorer's subjectivity and variability [5], [6].

The use of automatic sleep scoring methods, or at least an automatic assistance, has been investigated for several years and has driven much interest. From a statistical machine learning perspective, the problem is an imbalanced multi-class prediction problem. State-of-the-art automatic approaches can be classified into two categories depending on whether the features used for classification are extracted using expert knowledge or learnt from the raw signals. Methods of the first category rely on a priori knowledge about the signals and events that enables the design of hand-crafted features

(see [7] for a very extensive list of references). Methods in the second category consist in learning appropriate feature representations from transformed data [5], [8]–[10] or directly from raw data with convolutional neural networks [11]–[13]. Recently, another method was proposed to perform sleep stage classification on radio wave signals, with an adversarial deep neural network [14].

One of the main statistical learning challenges is the imbalanced nature of the classification task, which has important practical implications for this application. Typically, sleep stages such as N1 are rare compared to N2 stages. When learning a predictive algorithm with very imbalanced classes, what classically happens is that the resulting system tends to never predict the rarest classes. One way to address this issue is to reweight the model loss function so that the cost of making an error on a rare sample is larger [15]. With an online training approach as used with neural networks, one way to achieve this is to employ balanced sampling, i.e. to feed the network with batches of data which contain equally many data points from each class [4], [5], [9]–[13]. This indeed prevents the predictive models from being biased towards the most frequent stages. Yet, such a strategy raises the question of the choice of the evaluation metric used. The standard Accuracy metric (Acc.) considers that any prediction mistake has the same cost. Imagine that N2 represented 90% of the data: always predicting N2 would lead to a 90% accuracy, which is obviously bad. A natural way to better evaluate a model in the presence of imbalanced classes is to use the Balanced Accuracy (B. Acc.) metric. With this metric the cost of a mistake on a sample of type N2 is inversely proportional to the fraction of samples of type N2 in the data. By doing so, every sleep stage has the same impact on the final figure of merit [16].

Another statistical learning challenge concerns the way transition rules are handled. Indeed, as the transition rules may impact the final decision of a scorer, a predictive model might take them into account in order to increase its performance. Doing so is possible by feeding the final classifier with the features from the neighboring time segments [4], [5], [9]–[13]. This is referred to as temporal sleep stage classification.

A number of public sleep datasets contain PSG records with several EEG channels, and additional modalities such as EOG or EMG channels [17]. Although these modalities are used by human experts for sleep scoring, seldom are they considered by automatic systems [16]. Focusing only on the EEG modality, it is natural to think that the multivariate nature of EEG data carries precious information. This can be exploited at least to cope with electrode removal or bad channels, and thus improve the robustness of the prediction algorithm. However, it can also be exploited as leverage to improve the predictive capacities of the algorithm. Indeed, the EEG community has designed a number of methods to increase the signal-to-noise ratio (SNR) of an effect of interest from a full array of sensors. Among these methods are so-called linear spatial filters, which include classical techniques such as PCA/ICA [18], Common Spatial Patterns for BCI applications [19] or beamforming methods for source localization [20]. Less classically and more recently, various deep learning approaches have been proposed to learn from EEG data [21]–[24] and some of these contributions use a first layer that boils down to a spatial filter [25]–[30]. Note that using a deep neural network to learn a feature representation and classify sleep stages on data coming from multiple sensors has been recently investigated in parallel with our work [5], [9]. Yet our study further investigates and quantifies how much using a spatial filtering step enhances the prediction performance.

This paper is organized as follows. First we introduce our end-to-end deep learning approach to perform temporal sleep stage classification using multivariate time series coming from multiple modalities (EEG, EOG, EMG). We furthermore detail how the temporal context of each segment of data can be exploited by our model. Then, we benchmark our approach on publicly available data and compare it to state-of-the-art sleep stage classification methods. Finally, we explore the dependencies of our approach regarding the spatial context, the temporal context and the amount of training data at hand.

Notation: We denote by X ∈ R^(C×T) a segment of multivariate time series with its label y ∈ Y, which maps to the set {W, N1, N2, N3, REM}. Here, X corresponds to a sample lasting 30 seconds and Y = {y ∈ R^5_+ : Σ_{i=1}^5 y_i = 1} corresponds to the probability simplex. Precisely, each label is encoded as a vector of R^5 with 4 coefficients equal to 0 and a single coefficient equal to 1 which indicates the sleep stage. Here C refers to the number of channels and T to the number of time steps. S_t^k = {X_{t−k}, …, X_t, …, X_{t+k}} stands for an ordered sequence of 2k + 1 neighboring segments of signal. X_k = (R^(C×T))^(2k+1) is the space of 2k + 1 neighboring segments of signal. Finally, ℓ stands for the categorical cross entropy loss function. Given a true label y ∈ Y and a predicted label p ∈ Y, it is defined as: ℓ(y, p) = −Σ_{i=1}^5 y_i log p_i.

II. MATERIAL AND METHODS

In this section, we present a deep learning architecture to perform temporal sleep stage classification from multivariate and multimodal time series. We first define formally the classification problem addressed here. Then we present the network architecture used to predict without temporal context (k = 0). Then we describe the time distributed multivariate network proposed to perform temporal sleep stage classification (k > 0). Finally, we present and discuss the alternative state-of-the-art methods used for comparison in our experiments.

A. Machine Learning Problem

In this paragraph, we formalize in mathematical terms the temporal classification task considered here. Let k be a non-negative integer. Let f : X_k → Y stand for a predictive model that belongs to a parametric set denoted F. Here f takes as input an ordered sequence of 2k + 1 neighboring segments of signal, and outputs a probability vector p ∈ Y. For simplicity the parameters of the network are not written. The machine learning problem tackled then reads:

    f̂ = argmin_{f ∈ F} E_{(x, y) ∈ X_k × Y} [ℓ(f(x), y)].   (1)
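The loss ℓ and the Balanced Accuracy metric discussed above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the 90%-N2 scenario mirrors the example given in the text.

```python
import numpy as np

def categorical_cross_entropy(y, p):
    """l(y, p) = -sum_i y_i log p_i for a one-hot label y and probability vector p."""
    return -np.sum(y * np.log(p))

def balanced_accuracy(y_true, y_pred, n_classes=5):
    """Mean of per-class recalls: a mistake on a rare stage (e.g. N1)
    weighs as much as a mistake on a frequent one (e.g. N2)."""
    recalls = []
    for c in range(n_classes):
        mask = (y_true == c)
        if mask.any():
            recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))

# A degenerate classifier that always predicts N2 (class index 2):
y_true = np.array([2] * 90 + [1] * 10)    # 90% N2, 10% N1
y_pred = np.full(100, 2)
print(np.mean(y_true == y_pred))          # accuracy: 0.9
print(balanced_accuracy(y_true, y_pred))  # balanced accuracy: 0.5
```

The degenerate classifier reaches 90% plain accuracy but only 50% balanced accuracy, which is why the latter is the figure of merit used throughout the paper.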

TABLE I
Detailed architecture of the feature extractor for C EEG channels with time series of length T. The same architecture is employed for C′ EMG channels. When both EEG/EOG and EMG are considered, the outputs of the dropout layers are concatenated and fed into the final classifier. The number of parameters of the final dense layer thus becomes equal to 5 × ((C + C′) × (T // 256) × 8).
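The shape bookkeeping behind the caption's parameter count can be checked with a few lines of Python. This is a sketch under the assumptions stated in the text (two temporal convolution blocks of 8 kernels, each followed by max pooling of size 16, and a 5-class dense output) with a hypothetical single-modality input of C = 6 EEG channels; it is not the authors' code.

```python
# Feature-map bookkeeping for the architecture described in the text.
C, T = 6, 128 * 30          # e.g. 6 EEG channels, 30 s at 128 Hz -> T = 3840

# Layer 3: spatial filtering as a 2D "valid" convolution with kernels of
# shape (C, 1): it mixes channels, the time axis is untouched.
time_steps = T

# Two blocks of temporal convolution (8 kernels of length 64, stride 1,
# length-preserving), each followed by max pooling of size 16 without overlap.
for _ in range(2):
    time_steps //= 16       # max pooling shrinks the time axis

assert time_steps == T // 256

# Final dense layer: 5 neurons fed with C * (T // 256) * 8 features for a
# single modality (weight count only, as in the caption's formula).
n_params_dense = 5 * (C * (T // 256) * 8)
print(time_steps, n_params_dense)   # 15 3600
```

With both modalities concatenated, C would be replaced by C + C′ in the last line, recovering the caption's expression.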

B. Multivariate Network Architecture

The deep network architecture we propose to perform sleep stage classification from multivariate time series without temporal context (k = 0) has three key features: linear spatial filtering to estimate so-called virtual channels, convolutive layers to capture spectral features, and separate pipelines for EEG/EOG and EMG respectively. This network constitutes a general feature extractor we denote by Z : R^(C×T) → R^D, where D is the size of the estimated feature space. Our network can handle various numbers of input channels and several modalities at the same time. The general architecture is represented in Fig. 1.

Fig. 1. Network general architecture: the network processes C EEG/EOG channels and C′ EMG channels through separate pipelines. For each modality, it performs spatial filtering and applies convolutions, non-linear operations and max pooling (MP) over the time axis. The outputs of the different pipelines are finally concatenated to feed a softmax classifier.

We now detail the different blocks of the network, which are summarized in Tab. I. The first layer of the network is a time-independent linear operation that outputs a set of virtual channels, each obtained by linear combination of the original input channels. It implements a spatial filtering driven by the classification task to perform [25]–[30]. In our experiments, the number of virtual channels was set to the number of input channels, making the first layer a multiplication with a square matrix. This square matrix plays the same role as the unmixing matrix estimated by ICA algorithms. This step will be further discussed in the discussion. Note that this first layer based on spatial filters can be implemented with a 2D valid convolution with kernels of shape (C, 1), see layer 3 in Tab. I.

Following this linear operation, the dimensions are permuted, see layer 4 in Tab. I. Then two blocks of temporal convolution followed by non-linearity and max pooling are consecutively applied. The parameters have been set for signals sampled at 128 Hz. In this case the number of time steps is T = 128 × 30 = 3840. Each block first convolves its input signal with 8 estimated kernels of length 64 with stride 1 (∼0.5 s of record) before applying a rectified linear unit, a.k.a. ReLU non-linearity x → max(x, 0) [31]. The outputs are then reduced along the time axis with a max pooling layer (size of 16 without overlap). The output of the two convolution blocks is finally passed through a dropout layer [32] which randomly prevents updates of 25% of its output neurons at each gradient step.

As represented in Fig. 1, we process the EEG and EOG time series jointly since these modalities are comparable in magnitude and both measure similar signals, namely electric potentials of up to a few hundreds of microvolts on the surface of the scalp. The same idea is used by EEG practitioners when the EOG channels are kept in the ICA decomposition to better reject EOG artifacts [33]. The EMG time series, which have different statistical and spectral properties, are processed in a parallel pipeline.

The resulting outputs are then concatenated to form the feature space of dimension D before being fed into a final layer with 5 neurons and a softmax non-linearity to obtain a probability vector which sums to one. This final layer is referred to as a softmax classifier [34]. Let a ∈ R^5 be the

pre-activation of the last layer. The output of the network is a vector p ∈ Y, obtained as: p_i = exp(a_i) / Σ_{j=1}^5 exp(a_j).

C. Time Distributed Multivariate Network

In this paragraph, we describe the Time Distributed Multivariate Network we propose to perform temporal sleep stage classification (k > 0). It builds on the Multivariate Network Architecture presented previously and distributes it in time to take into account the temporal context. Indeed, a sample of class N2 is very likely to be close to another N2 sample, but also to an N1 or an N3 sample [2].

Fig. 2. Time distributed architecture to process a sequence of inputs S_t^k = {X_{t−k}, …, X_t, …, X_{t+k}}, here with k = 1. X_t stands for the multivariate input data over 30 s that is fed into the feature extractor Z. Features are extracted from consecutive 30 s samples X_{t−k}, …, X_t, …, X_{t+k}. Then the obtained features are aggregated: z_{t−k}, …, z_t, …, z_{t+k}. The resulting aggregation of features is finally fed into a classifier to predict the label y_t associated with the sample X_t.

To take into account the statistical properties of the signals before and after the sample of interest, we propose to aggregate the features extracted by Z over a number of time segments preceding or following the sample of interest. More formally, let S_t^k = {X_{t−k}, …, X_t, …, X_{t+k}} ∈ X_k be a sequence of 2k + 1 neighboring samples (k samples in the past and k samples in the future). Distributing the feature extractor in time consists in applying Z to each sample in S_t^k and aggregating the 2k + 1 outputs to form a vector of size D(2k + 1). The obtained vector is then fed into the final softmax classifier. This is summarized in Fig. 2.

D. Training

The minimization in (1) is done with an online procedure based on stochastic gradient descent using mini-batches of data. Yet, to be able to learn to discriminate under-represented classes (typically W and N1 stages), and since we are interested in optimizing the balanced accuracy, we propose to balance the distribution of classes in mini-batches of size 128. As we have 5 classes, this means that during training each batch contains about 20% of samples from each class. The Adam optimizer [35] is used for optimization with the following parameters: α = 0.001 (learning rate), β1 = 0.9, β2 = 0.999 and ε = 10^−8.

An early stopping callback on the validation loss with a patience of 5 epochs was used to stop the training process when no improvements were detected. Weights were initialized with a normal distribution with mean μ = 0 and standard deviation σ = 0.1. These values were obtained empirically by monitoring the loss during training. The implementation was written in Keras [36] with a Tensorflow backend [37].

The training of the time distributed network was done in two steps. First, we trained the multivariate network, in particular its feature extractor part Z, without temporal context (k = 0). The trained model was then used to set the weights of the feature extractor distributed in time. Second, we froze the weights of the feature extractor distributed in time and trained the final softmax classifier on the aggregated features.

III. EXPERIMENTS

In this section, we first introduce the dataset and the preprocessing steps used. Then, we present the different feature extractors from the literature which we use in our benchmark. We then present the experiments, which aim at (i) establishing a general benchmark of our feature extractor against state-of-the-art approaches in univariate (single derivation) and bivariate (2 channels) contexts, (ii) studying the influence of the spatial context, (iii) evaluating the gain obtained by using the temporal context and (iv) evaluating the impact of the quantity of training data.

A. Data and Preprocessing Steps

Data used in our experiments is the publicly available MASS dataset - session 3 [17]. It corresponds to 62 night records, each one coming from a different subject. Because of preprocessing issues we removed the record 01-03-0034. Each record contains data from 20 EEG channels which were referenced with respect to the A2 electrode. We did not modify the referencing scheme, hence removed the A2 electrode from our study. Each record also includes signals from 2 EOG (horizontal left and right) and 3 EMG (chin) channels that we considered as additional modalities.

The time series from all the available sensors were first low-pass filtered with a 30 Hz cutoff frequency. Then they were downsampled to a sampling rate of 128 Hz. The downsampling step speeds up the computations for the neural networks, while keeping the information up to 64 Hz (Nyquist frequency). Downsampling and low / band pass filtering are commonly used preprocessing steps [5], [16]. The data extraction and the filtering steps were performed with the MNE software [38]. The filter employed was a zero-phase finite impulse response (FIR) filter with a transition bandwidth of approximately 7 Hz. Sleep stages were marked according to the AASM rules by a single sleep expert per record [2], [17]. When investigating the use of temporal context by feeding the predictors with sequences of consecutive samples S^k, we used zero padding to complete the samples at the beginning and at the end of the night. This enables feeding the models with all the samples of a night record while keeping the dimension of the input batches fixed.

The time series fed into the different neural networks were additionally standardized. Indeed, for each channel, every

30 s sample is standardized individually such that it has zero mean and unit variance. For the specific task of sleep stage classification this is particularly relevant since records are carried out over nearly 8 hours. During such a long period the recording conditions vary, such as skin humidity, body temperature, body movements or, even worse, electrode contact loss. Giving each 30 s time series the same first and second order moments enables coping with this likely covariate shift that may occur during a night record. This operation only rescales the frequency powers in every frequency band, without altering their relative amplitudes, where the discriminant information for the considered sleep stage classification task lies (see Parseval's theorem). Note that this preprocessing step can be done online before feeding the network with a batch of data.

Cross-validation was used to have an unbiased estimate of the performance of our model on unseen records. To reduce variance in the reported scores, the data were randomly split 5 times between training, validation and testing sets. The splits were performed with respect to records in order to guarantee that a record used in the training set was never used in the validation or the testing set. For each split, 41 records were included in the training set, 10 records in the validation set and 10 records in the testing set.

B. Related Work and Compared Approaches

We now introduce the three state-of-the-art approaches that we used for comparison with our approach: a gradient boosting classifier [39] trained on hand-crafted features and two convolutional networks trained on raw univariate time series following the approaches of [11] and [12].

1) Features Based Approach: The Gradient Boosting model was learnt on hand-crafted features: time domain features and frequency domain features computed for each input sensor as described in [16]. More precisely, we extracted from each channel the power and the relative power in 5 bands: δ (0.5–4.5 Hz), θ (4.5–8.5 Hz), α (8.5–11.5 Hz), σ (11.5–15.5 Hz), β (15.5–30 Hz), giving 5 features of each kind. We furthermore extracted power ratios between these bands (which amount to 5 × 4/2 = 10 supplementary features) and spectral entropy features, as well as statistics such as mean, variance, skewness, kurtosis and the 75% quantile. This gives in the end a set of 26 features per channel.

The implementation used is from the XGBoost package [40], which internally employs decision trees. This model is known for its high predictive performance, robustness to outliers, robustness to unbalanced classes and parallel search of the best split. Training was also performed by minimizing the categorical cross entropy. The training set was balanced using under sampling. The maximum number of trees in the model was set to 1000. An early stopping callback on the validation categorical cross entropy with patience equal to 10 was used to stop the training when no improvement was observed. Training never led to more than 1000 trees in a model.

The model has several hyper-parameters that need to be tuned to improve classification performance and cope with unbalanced classes. To find the best hyper-parameters for each experiment, we performed random searches with the hyperopt Python package [41]. Concretely, we considered only the data from the training and validation subjects at hand. For each set of hyper-parameters, we trained and evaluated the classifier on data from 5 different splits of training and evaluation subjects (80% for training, 20% for evaluation). The search was done with 50 sets of hyper-parameters and the set which achieved the best balanced accuracy averaged over the 5 splits was selected. The following parameters were tuned: the learning rate in the interval [10^−4, 10^−1], the minimum weight of a child tree in the set {1, 2, …, 10}, the maximum depth of trees in {1, 2, …, 10}, the regularization parameter in [0, 1], the subsampling parameter in [0.5, 1] and the sampling level of columns by tree in [0.5, 1].

2) Convolutional Networks on Raw Univariate Time Series: We reimplemented and benchmarked 2 end-to-end deep learning approaches. We detail each of them in the following paragraphs and explain how we used these methods.

a) Tsinalis et al. 2016: The approach by Tsinalis et al. [11] is a deep convolutional network that processes univariate time series (a single EEG signal). It was reimplemented according to the paper details. The approach originally takes into account the temporal context by feeding the network with 150 s of signal, i.e. the sample to classify plus the 2 previous and 2 following samples. When used without temporal context in the experiments, the network is fed with 30 s samples.

Training was performed by minimizing the categorical cross entropy, and a similar balanced sampling strategy with the Adam optimizer was used. An additional ℓ2 regularization set to 0.01 was applied to the convolution filters [11]. The code was written in Keras [36] with a Tensorflow backend [37].

b) Supratak et al. 2017: The approach by Supratak et al. [12] is also an end-to-end deep convolutional network which contains two blocks: a feature extractor that processes the frequency content of the signal and a recurrent neural network that processes a sequence of consecutive 30 s samples of signal. The feature extractor processes low frequency information and high frequency information in two distinct convolutional sub-networks before merging the feature representations. The resulting tensor is then fed into a softmax classifier. This block is trained with balanced sampling. Then the feature extractor is linked to a recurrent neural network composed of 2 bi-LSTM layers. The whole architecture is fed with sequences of 25 consecutive 30 s samples from the same record.

The first block was used for comparison in our experiment. Its training was performed by minimizing the categorical cross entropy, and a balanced sampling strategy with the Adam optimizer was used. The code was written in Keras [36] with a Tensorflow backend [37].

C. Experiment 1: Comparison of Feature Extractors on the Fz/Cz Channels

In this experiment, we perform a general benchmark of our feature extractor against hand-crafted features classified with Gradient Boosting, and the two network architectures just described [11], [12]. The purpose of this experiment is to benchmark different feature representations on a similar spatial context, Fz-Cz, without using the temporal context, and to

emphasize the benefits of processing multivariate time series instead of a pre-computed derivation.

Fig. 3. General classification and run time metrics of several feature extractors benchmarked on the Fz-Cz derivation or the Fz-A2, Cz-A2 channels. The proposed approach trained on the Fz-A2, Cz-A2 channels obtained higher classification performance than the other feature extractors trained on the Fz-Cz derivation, including its univariate counterpart, while having a very low number of parameters and run time at training and prediction time.

Only time series coming from the channels Fz and Cz are considered here. First, the four predictive models were fed with the time series or the features from the derivation Fz-Cz that was computed manually. Second, our approach was fed with the time series from the derivations Fz-A2 and Cz-A2, i.e., the original time series of the dataset with pre-computed references. This version of our approach is referred to as Proposed approach - multivariate. No temporal context was used for this experiment (k = 0).

Finally, the experiment was carried out using balanced sampling at training time. For Gradient Boosting, an under sampling strategy was used to balance the training and the validation sets.

The performance of the different algorithms is evaluated with general classification metrics: Accuracy, Balanced Accuracy, Cohen Kappa and F1 score. Furthermore, run time metrics were computed, such as the number of parameters, the total training time, the training time per pass over the training set (called an epoch), and the prediction time per record (nearly 1k samples). These metrics are reported in Fig. 3. Finally, per-class metrics are reported in Fig. 4.

Besides, in Fig. 4, the univariate proposed method, trained on Fz-Cz, yields equal or higher diagonal coefficients in its confusion matrix than the other feature extractors for sleep stages W, N1 and N3. Supratak et al. 2017 outperforms the different univariate approaches on N1 and N3.

Moreover, the multivariate proposed approach yields higher diagonal coefficients in its confusion matrix than its univariate counterpart and the other feature extractors, except for N1, where Supratak et al. 2017 exhibits the highest classification accuracy. The analysis of the other per-class metrics agrees with these facts.

D. Experiment 2: More Sensors Increase Performance

In this experiment, we investigated the influence of the multivariate spatial context on the performance of our approach. We considered 7 different configurations of EEG sensors which varied both in the number of recording sensors, from 2 to 20, as well as in their positions over the head. We report the classification results for each configuration in Fig. 5.

One observes that both Gradient Boosting and our approach benefit from the increased number of EEG sensors. However, the B. Acc. obtained with our approach does not improve once we have 6 well distributed channels. This is certainly due to the redundancy of the EEG channels, yet more channels could make the model more robust to the presence of bad sensors on some data. First, this demonstrates that it is worth adding more EEG sensors, but only up to a certain point. Second, it shows that our approach exploits well the multivariate nature of the signals to improve classification performance. Third, it shows that the channel-agnostic feature extractor, i.e. the use of the spatial projection together with the feature extractor, is a good option to fully exploit the spatial distribution of the sensors.

Restricting the number of EEG channels to 6 and 20, we further investigated the influence of additional modalities (EOG, EMG). Classification results are provided in Fig. 6.

Considering additional modalities also increases the classification performance of the considered classifiers. It gives them a significant boost of performance, especially when the EMG modality is considered. This means that both approaches
class metrics were used: F1, Precision, Sensitivity, Specificity successfully integrate the new features with the previous
along with confusion matrices (C.M.). The C.M. were obtained ones. This suggests that our feature extractor was sufficiently
by (i) normalizing the C.M. evaluated per testing subject such data agnostic and versatile to handle both modalities. Finally,
that its rows sum up to 1, (ii) computing the average C.M. it again stresses the importance of considering the spatial con-
over all testing subjects. These metrics are reported in Fig. 4. text, here the additionnal modalities, to improve classification
It can be observed in Fig. 3 that our feature extrac- performances.
tor reaches classification performance comparable to that Interestingly, the boost of performance is more important
obtained by Supratak et al. 2017 and higher than those from in the 6 channel setting rather than in the 20 channel setting.
Tsinalis et al. 2016 and Gradient Boosting on the Fz-Cz We further observe that both EEG configuration with EOG and
derivation. It also uses a very low number of parameters and EMG modalities reach the same performances. Thus, the use
a low training and prediction run time compared to the other of additional modalities compensate the use of a larger spatial
deep learning approaches. context in this situation. Practically speaking, to obtain the
Furthermore, the proposed feature extractor trained on the highest performances at a reduced computational cost, one
Fz-A2, Cz-A2 channels, i.e. that is fed with multivariate shall consider few well located EEG sensors with additional
time series, significantly outperforms its univariate counterpart modalities.
and the other feature extractors which receive univariate time
series. Processing two channels instead of a single induces E. Experiment 3: Temporal Context Boosts Performance
a limited increase in number of parameters, training and In this experiment, we investigate the influence of the
prediction run time. temporal context on the classification performances and
764 IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 26, NO. 4, APRIL 2018

Fig. 4. Per class metrics of several feature extractor trained on the Fz-Cz derivation or Fz-A2, Cz-A2 channels.

Fig. 5. Influence of channel selection on the classification performances:


Fig. 7. Influence of temporal context: considering the close temporal
increasing the number of EEG sensors increases B. Acc. context induces a boost in performance especially when the spatial
context is limited. From left to right: spatial configuration with 2 frontal
EEG channels, 6 EEG channels, 6 EEG channels plus 2 EOG and 3
EMG channels.

We furthermore evaluated the spatial configuration with


only 2 frontal EEG channels for which we report the average
confusion matrices as well as the average transition matrices
of the predicted hypnograms. We additionally included the
transition matrix of the true hypnogram according to the labels
given by the sleep expert. The matrices are presented in Fig 8.
We observe in Fig. 7 that considering the close temporal
context induces a boost in classification performances whereas
Fig. 6. Influence of additional modalities on the classification perfor- considering a too large temporal context induces a decrease
mances: adding EOG and EMG induces a boost in performance in performance. The gain strongly depends on the spatial
context taken into account. Indeed, our model trained on
demonstrate that considering the data from the neighboring 2 frontal channels with −30/ + 30 s of context achieves
samples increases classification performances especially if the similar performances than with the 6 EEG channel montage
spatial context is limited. We also report what is the impact without temporal context. On the other hand, when considering
of temporal context on confusion matrices, and also on the an extended spatial context, the gain due to the temporal
matrices of transition probabilities between sleep stages. The context turns out to be limited, as the performances of our
coefficient Pi j of the transition matrix P ∈ R5 is equal to the approach or Gradient Boosting with the 6 EEG channels + 2
probability of going from a sleep stage i to a sleep stage j . EOG and 3 EMG channels suggest.
We considered the spatial configurations with 2 frontal EEG The finer analysis operated on the confusion matrices and
channels, 6 EEG channels, and 6 EEG channels plus 2 EOG transition matrices indicates a trade-off when integrating some
and 3 EMG channels. We varied the size of the temporal input temporal context: integrating the close temporal context brings
sequence Sk from k = 0, i.e. without temporal context, up to benefits in the detection of some sleep stages specifically
k = 5. The classification results are reported in Fig. 7. (N1, N2, REM) but a too large temporal context has a
CHAMBON et al.: DEEP LEARNING ARCHITECTURE FOR TEMPORAL SLEEP STAGE CLASSIFICATION 765

Fig. 8. Influence of temporal context on the confusion matrices (top row) and the transition matrices (bottom row). Including more temporal context
induces an increase of performance in the discrimination of stages N1, N2 and REM whereas it induces a slight decrease in the discrimination of W
and N3 when the temporal context is too wide. Including more temporal context smooths the hypnogram.

negative effect on the detection of W and N3 as emphasized


by Fig. 8.
Besides, the transition matrices of predictions compared to
the true transition matrix in Fig. 8 indicate that processing a
larger temporal context smooths the hypnogram. This corre-
sponds to an increase of the diagonal coefficient in the tran-
sition matrices. As a consequence, the transition probabilities
from stages W, N1, N2 and REM are improved but on the
other hand, the transition probabilities from N3 (especially
from N3 to N3) are negatively impacted.
Fig. 9. Influence of the number of training records: the more training
F. Experiment 4: More Training Data Boost Performance records the better performances are.

In this experiment, we investigated the influence of the


quantity of data on the classification performances of our with many training records and few channels. Said differently,
approach. To do this we considered the spatial configu- a rich spatial context can compensate for the scarcity of train-
rations with 2 frontal EEG channels, 6 EEG channels, ing data. Indeed, the input configuration with 6 EEG channels
and 6 EEG channels plus 2 EOG and 3 EMG channels. plus 2 EOG and 3 EMG channels with only 12 training
Concretely, we varied the number of training records n in subjects (right sub-figure) reaches the same performance as
{3, 12, 22, 31, 41}. We considered the same number of records the 2 EEG channels input configuration (left sub-figure) with
for validation and testing as previously, i.e. 10. We furthermore 41 training subjects.
carried out the experiments over 5 random splits of training,
validation and testing subjects. The classification results are G. Experiment 5: Opening the Model Box
reported in Fig. 9. In this experiment, we aimed at understanding what the deep
Every algorithm with any spatial context exhibits an neural network learns. More precisely, we want to understand
increase in performance when there is more training data. how the predictor relates a specific frequency content to the
Gradient Boosting is more resilient than the proposed different sleep stages. We did so by occluding almost the
approach to the little data situation especially with a large whole frequency content, except a specific frequency band
spatial context. On the other hand, our deep learning model and monitoring the classification performances of the network
exhibits stronger increase in performance as a function of the while predicting on the filtered data. Such an operation,
quantity of data. referred to as occlusion sensitivity has been successfully
Furthermore, it appears that having few training records but used to better understand how deep neural networks classify
an extended spatial context delivers as good performances as images [42].
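As a rough sketch of this band-keeping occlusion step, the snippet below band-pass filters an epoch so that only one frequency band survives, after which the already trained classifier would re-predict on the filtered data. The filter design and function names are our own illustrative assumptions, not the authors' implementation; the 0.5–4.5 Hz delta band and the 128 Hz sampling frequency are values mentioned in the text.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def keep_band(signals, low_hz, high_hz, sfreq):
    """Return a copy of `signals` with frequency content outside
    [low_hz, high_hz] removed (zero-phase Butterworth band-pass)."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sfreq, output="sos")
    return sosfiltfilt(sos, signals, axis=-1)

# Illustrative usage: keep only the delta band of a 30 s, 2-channel epoch
sfreq = 128.0  # sampling frequency in Hz
rng = np.random.default_rng(0)
epoch = rng.standard_normal((2, int(30 * sfreq)))  # (n_channels, n_times)
delta_only = keep_band(epoch, 0.5, 4.5, sfreq)
# A trained network would then predict on `delta_only`, and the resulting
# confusion matrix would be compared against the original labels.
```

Each candidate band is probed the same way: filter the test epochs, re-run the frozen model, and inspect how the predictions shift.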
Fig. 10. Prediction on filtered data: confusion matrices associated with unfiltered and filtered signals from testing records.

We occluded almost the whole frequency domain and just kept a specific frequency band: either δ (0.5–4.5 Hz), θ (4.5–8.5 Hz), α (8.5–11.5 Hz), σ (11.5–15.5 Hz) or β (15.5–30 Hz). Each time, we took the neural network trained on the original signal, and made it predict on signals obtained after applying a band-pass filter with cutoff frequencies given by the considered frequency band. This means that for any filtered sample, the frequency content outside this frequency band was removed. We compared the predictions on the filtered signals with the original labels. The confusion matrices associated with the different band-pass filters are reported in Fig. 10.

Using the network on filtered signals reveals the relationship between a specific frequency content and the sleep stages predicted by the network. Indeed, when only the delta band is kept, the network assigns N2 or N3 to all the samples. This implies that the network associates a low frequency content with the N2 and N3 stages, where there are actually low frequency events such as slow oscillations or K-complexes. Similarly, we observe that when the network predicts on signals where only the alpha band is kept, the network predicts mostly W. This is in agreement with the rules human scorers follow. A similar approach could be performed with much finer frequency bands.

Thus, despite the black-box nature of the proposed approach, this occluding procedure allows one to open the box and to reveal interesting insights about how the model relates a particular frequency content to the different sleep stages.

IV. DISCUSSION

In this section, we discuss the architecture characteristics of our approach and put them in perspective with state-of-the-art methods. We furthermore discuss the use of temporal context to take into account transitions between sleep stages and discuss its use for applications. Finally, we discuss points about the training of the proposed architecture and how it can meet a specific need.

A. Spatial Filtering

The proposed architecture was designed to handle a multivariate input thanks to a spatial filtering step. This step is motivated by the fact that a linear combination of the input channels should enhance the information useful for the task, all the more so if the spatial filters are optimized via back propagation on the training data. Motivated by simplicity, we chose the number of virtual channels equal to the number of input channels. Yet, this constitutes a degree of freedom one may play with to increase the performance of the network, as was explored in [26].

As a comparison, Biswal et al. [9] average the input time series to obtain a single one, which is then fed into a 1D convolutional network. This can be seen as a particular case of our spatial filtering step where the number of virtual channels is equal to 1 and where the unique spatial filter coefficients are fixed to 1/C, with C the number of input channels. On the contrary, Stephansen et al. [5] proposed an approach that also takes as input a multivariate time series but does not perform any particular spatial processing.

B. Feature Extractor Architecture

The proposed feature extractor exhibits a simple and versatile two-layer architecture. Considering fewer or more layers was explored but did not deliver any extra gain in performance. We furthermore opted to perform spatial and temporal convolutions strictly separately. By doing so we replaced possibly expensive 2D convolutions by a 1D spatial convolution and a 1D temporal convolution. Such a low rank spatio-temporal convolution strategy turned out to be successful in our experiments.

Regarding the dimensions of the convolution filters and pooling regions, our approach was motivated by the ability of neural networks to learn a hierarchical representation of input data, extracting low level and small scale patterns in the shallow layers and more complex and large scale patterns in the deep layers. Our strategy is quite different from [11] and [12], which use large temporal convolution filters. Despite the use of smaller filters, Fig. 3 and Fig. 10 demonstrate that our architecture is able to discriminate stages with low frequency content, such as N3, from stages with higher frequency content, such as N2 due to the presence of spindles, or even from W and N1 with the presence of α (8–12 Hz) bursts. Besides, our proposed architecture turns out to be data agnostic and handles well both EEG, EOG and EMG signals, as shown by the results of experiment 2, see Fig. 5 and Fig. 6.

Yet it is to be noticed that recent approaches use even smaller convolution filters, of size 2, 3, 5, or 7 [5], [9], [13]. On the contrary, they also use a larger number of feature maps, from 64 up to 512 [5], [13]. The use of small filters in combination with a larger number of feature maps is worth investigating and quantifying, and might result in more signal-agnostic neural networks.

C. Number of Parameters

The complexity of the proposed network and its number of parameters are quite small thanks to specific architecture choices. The overall network does not exhibit more than ∼10^4 parameters when considering an extended spatial context, and not more than ∼10^5 parameters when considering both an extended spatial context and an extended temporal context.
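A back-of-the-envelope count illustrates why the separable spatial/temporal design described above keeps the number of parameters low compared to joint 2D spatio-temporal kernels. All sizes below are illustrative assumptions, not the exact architecture.

```python
# Parameter count of a joint 2D spatio-temporal convolution versus the
# low-rank alternative: a 1D spatial convolution followed by a 1D
# temporal convolution. Sizes are illustrative assumptions only.
n_channels = 6       # input channels mixed by the spatial step
kernel_t = 64        # temporal filter length (time steps)
n_maps = 8           # hypothetical number of feature maps

joint_2d = n_maps * n_channels * kernel_t      # one (channels x time) kernel per map
separable = n_maps * (n_channels + kernel_t)   # spatial filter + temporal filter per map

print(joint_2d, separable)  # prints: 3072 560
```

The gap widens as either the number of channels or the temporal filter length grows, which is consistent with the low overall parameter counts reported here.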
This is quite simple and compact compared to the recent approaches in [11], which has up to ∼14×10^7 parameters, and [12], which exhibits ∼6×10^5 parameters for the feature extractor and 2×10^7 parameters for the sequence learning part using BiLSTM. This significant difference with [11] is mainly due to our choice of using small convolution filters (64 time steps after low pass filtering and downsampling), large pooling regions (pooling over 16 time steps) according to the 128 Hz sampling frequency, and removing the penultimate fully connected layers before the final softmax classifier. Such a strategy has already been successful in computer vision [43] and EEG [30].

D. Classification Metrics

The proposed approach yields equal (univariate) or higher (multivariate) classification metrics than the other benchmarked feature extractors while presenting a limited training run time per epoch or prediction time per night record (cf. Fig. 3). The analysis of per-class metrics shows that the proposed approach might not reach the highest performance on every stage (cf. Fig. 4). Indeed, Supratak et al. 2017 outperforms on N1, and Gradient Boosting exhibits a similar accuracy in N3. However, the proposed approach performs globally well and appears to be quite robust in comparison to the other approaches.

The proposed approach is particularly good at detecting W (high sensitivity 0.85 and specificity close to 1). This characteristic might be particularly interesting for clinical applications where a diagnosis of fragmented sleep might rely on the detection of W.

In order to measure the relevance of our approach for different types of subjects, we monitored the balanced accuracy of a subject as a function of the sleep fragmentation index (total number of awakenings and sleep stage shifts divided by total sleep time) [44]. The results (not shown) did not exhibit a particular correlation between this measure of sleep quality and the classification performances. This indicates that the proposed approach could be used for clinical purposes with patients whose sleep exhibits abnormal structures.

Unfortunately, the different classification performances cannot be compared with inter-scorer agreement on this dataset since the night records have only been annotated by a single expert. Yet, a 0.80 agreement has been reported between scorers [6]. Furthermore, Stephansen et al. [5] monitored the classification accuracy of their model as a function of the consensus from 1 to 6 scorers. The reported curve was linearly increasing from 0.76 accuracy for 1 scorer up to 0.87 accuracy for a 6 scorer consensus. We shall reproduce such an experiment with the proposed approach in our future work.

E. Temporal Context and Transitions

Our architecture naturally allows learning from the temporal context, as it only relies on the aggregation of temporal features and a softmax classifier. Such a choice enabled us to measure the influence of the close temporal context and better understand its impact. It differs from the approaches proposed by Tsinalis et al. [11] and Sors et al. [13], as our feature extractor always receives 30 s of signals, and is therefore applied to a sequence of neighboring 30 s samples. On the contrary, Tsinalis et al. [11] and Sors et al. [13] extended the feature extractor input window to 150 s and 120 s, respectively. In [12], a temporal context of 25 neighboring 30 s samples is processed.

Our experiment on temporal context highlights a trade-off when integrating some temporal context: integrating some temporal context brings benefits in the detection of some sleep stages specifically (N1, N2, REM), but a too large temporal context has a negative effect on the detection of W and N3 stages, as emphasized by Fig. 8. This naturally translates to the balanced accuracy scores, which exhibit a significant increase for small temporal context and no increase, or even a decrease, for large temporal context (cf. Fig. 7). Looking at the transition matrices, it appears that more temporal context smooths the hypnograms, which might be detrimental to the quality of the system. For these reasons, temporal context should be used, but its width must be cross-validated.

Besides, some subjects might exhibit abnormal sleep structures related to a sleep disorder [6]. There is thus a trade-off between boosting the classification performance by integrating as much context as possible and not over-fitting sleep transitions in order not to miss a sleep disorder related to a fragmented sleep. This is an additional argument in favor of cross-validating the temporal context width.

An extension of our approach, for example to capture complex stage transitions or long term dependencies, would be to employ a recurrent network architecture. Along these lines, recent approaches have proposed more complex strategies to integrate the temporal context with LSTM unit cells or Bi-LSTM unit cells [5], [9], [10], [12], [45]. Integrating our feature extractor with such recurrent networks remains to be done and should lead to further performance improvements.

F. Influence of Dataset

Figure 9 raises an important question: how much data is needed to establish a correct benchmark of predictive models for sleep stage classification? This is particularly interesting concerning the deep learning approaches. Indeed, Gradient Boosting handles the small data situation quite well and does not exhibit a huge increase in performances with the increase of the number of training records. On the contrary, our approach delivers particularly good performances if enough training data are available. Extrapolation of the learning curves (performance as a function of the number of training records) in Fig. 9 suggests that one could expect better performances if more data were accessible. This forces us to reconsider the way we compare predictive models when training dataset sizes differ between experiments, since the quantity of training data plays the role of a hyper-parameter for some algorithms like ours. Some algorithms become indeed better when more data are available (see for example [46, Fig. 1]).

G. Choice of Sampling and Metrics

Our approach was particularly motivated by the accurate detection of any sleep stage independently of its proportion. To achieve this goal, all approaches have been trained
using balanced sampling and evaluated with balanced metrics (except for experiment 1, where more metrics have been used). We observed that the choice of sampling strategies employed during online learning impacts the evaluation metrics, and conversely the choice of metrics should motivate the choice of sampling strategies. Indeed, balanced sampling should be used to optimize the balanced accuracy of the model. On the other hand, random sampling should be used to boost the accuracy. The use of balanced sampling has been reported or commented on in [11]–[13].

Nonetheless, for a specific clinical application, one may decide that errors on a minor stage, such as N1, are not so dramatic and hence prefer to train the network with random batches of data. On the contrary, one might want to discriminate as accurately as possible N1 stages from W or REM, and therefore one should use balanced sampling, or oversampling of N1.

Sampling strategy and evaluation metrics are degrees of freedom one can play with to adapt the network to one's own experimental or clinical purposes.

V. CONCLUSION

In this study we introduced a deep neural network to perform temporal sleep stage classification from multimodal and multivariate time series. The model pools information from different sensors thanks to a linear spatial filtering operation and builds a hierarchical feature representation of PSG data thanks to temporal convolutions. It additionally pools information from different modalities processed with separate pipelines.

The proposed approach in this paper exhibits strong classification performances compared to the state-of-the-art, with little run time and computational cost. This makes the approach a good potential candidate for being used in a portable device and performing online sleep stage classification.

Our approach makes it possible to quantify the benefit of using multiple EEG channels and additional modalities such as EOG and EMG. Interestingly, it appears that a limited number of EEG channels (6 EEG: F3, F4, C3, C4, O1, O2) gives performances similar to 20 EEG channels. Furthermore, using EMG channels boosts the model performances.

The use of temporal context is analyzed and quantified, and appears to give a significant increase in performance when the spatial context is limited. It is to be noticed that the temporal context as explored in this paper might not be directly suitable for online prediction, but it is easily usable for offline prediction.

REFERENCES

[1] C. Berthomier et al., “Automatic analysis of single-channel sleep EEG: Validation in healthy individuals,” Sleep, vol. 30, no. 11, pp. 1587–1595, 2007.
[2] C. Iber, S. Ancoli-Israel, A. Chesson, and S. F. Quan, “The AASM manual for the scoring of sleep and associated events: Rules, terminology and technical specifications,” Amer. Acad. Sleep Med., Darien, IL, USA, Tech. Rep., 2007.
[3] J. Allan Hobson, “A manual of standardized terminology, techniques and scoring system for sleep stages of human subjects,” Electroencephalogr. Clin. Neurophysiol., vol. 26, no. 6, p. 644, Jun. 1969.
[4] O. Tsinalis, P. M. Matthews, and Y. Guo, “Automatic sleep stage scoring using time-frequency analysis and stacked sparse autoencoders,” Ann. Biomed. Eng., vol. 44, no. 5, pp. 1587–1597, 2016.
[5] J. B. Stephansen et al., “The use of neural networks in the analysis of sleep stages and the diagnosis of narcolepsy,” CoRR, pp. 1–41, 2017. [Online]. Available: https://arxiv.org/abs/1710.02094
[6] R. S. Rosenberg and S. Van Hout, “The American academy of sleep medicine inter-scorer reliability program: Sleep stage scoring,” J. Clin. Sleep Med., vol. 10, no. 4, pp. 447–454, 2014.
[7] K. A. I. Aboalayon, M. Faezipour, W. S. Almuhammadi, and S. Moslehpour, “Sleep stage classification using EEG signal analysis: A comprehensive survey and new investigation,” Entropy, vol. 18, no. 9, p. 272, 2016.
[8] A. Vilamala, K. H. Madsen, and L. K. Hansen, “Deep convolutional neural networks for interpretable analysis of EEG sleep stage scoring,” CoRR, pp. 1–6, 2017. [Online]. Available: https://arxiv.org/abs/1710.00633
[9] S. Biswal et al., “SLEEPNET: Automated sleep staging system via deep learning,” CoRR, pp. 1–17, 2017. [Online]. Available: https://arxiv.org/abs/1707.08262
[10] H. Dong, A. Supratak, W. Pan, C. Wu, P. M. Matthews, and Y. Guo. (2016). “Mixed neural network approach for temporal sleep stage classification.” [Online]. Available: https://arxiv.org/abs/1610.06421
[11] O. Tsinalis, P. M. Matthews, Y. Guo, and S. Zafeiriou. (2016). “Automatic sleep stage scoring with single-channel EEG using convolutional neural networks,” pp. 1–10. [Online]. Available: https://arxiv.org/abs/1610.01683
[12] A. Supratak, H. Dong, C. Wu, and Y. Guo, “DeepSleepNet: A model for automatic sleep stage scoring based on raw single-channel EEG,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 25, no. 11, pp. 1998–2008, Nov. 2017.
[13] A. Sors, S. Bonnet, S. Mirek, L. Vercueil, and J.-F. Payen, “A convolutional neural network for sleep stage scoring from raw single-channel EEG,” Biomed. Signal Process. Control, vol. 42, pp. 107–114, 2018.
[14] M. Zhao, S. Yue, D. Katabi, T. S. Jaakkola, and M. T. Bianchi, “Learning sleep stages from radio signals: A conditional adversarial architecture,” in Proc. 34th Int. Conf. Mach. Learn., vol. 70, Aug. 2017, pp. 4100–4109.
[15] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[16] T. Lajnef et al., “Learning machines and sleeping brains: Automatic sleep stage classification using decision-tree multi-class support vector machines,” J. Neurosci. Methods, vol. 250, pp. 94–105, Nov. 2015.
[17] C. O’Reilly, N. Gosselin, J. Carrier, and T. Nielsen, “Montreal archive of sleep studies: An open-access resource for instrument benchmarking and exploratory research,” J. Sleep Res., vol. 23, no. 6, pp. 628–635, 2014.
[18] L. C. Parra, C. D. Spence, A. D. Gerson, and P. Sajda, “Recipes for the linear analysis of EEG,” NeuroImage, vol. 28, no. 2, pp. 326–341, 2005.
[19] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and K. R. Müller, “Optimizing spatial filters for robust EEG single-trial analysis,” IEEE Signal Process. Mag., vol. 25, no. 1, pp. 41–56, Jan. 2008.
[20] B. D. Van Veen, W. van Drongelen, M. Yuchtman, and A. Suzuki, “Localization of brain electrical activity via linearly constrained minimum variance spatial filtering,” IEEE Trans. Biomed. Eng., vol. 44, no. 9, pp. 867–880, Sep. 1997.
[21] P. Mirowski, D. Madhavan, Y. LeCun, and R. Kuzniecky, “Classification of patterns of EEG synchronization for seizure prediction,” Clin. Neurophysiol., vol. 120, no. 11, pp. 1927–1940, 2009.
[22] D. F. Wulsin, J. R. Gupta, R. Mani, J. A. Blanco, and B. Litt, “Modeling electroencephalography waveforms with semi-supervised deep belief nets: Fast classification and anomaly measurement,” J. Neural Eng., vol. 8, no. 3, p. 036015, 2011.
[23] W. Zheng, J. Zhu, Y. Peng, and B. Lu, “EEG-based emotion classification using deep belief networks,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2014, pp. 1–6.
[24] P. Bashivan, I. Rish, M. Yeasin, and N. Codella, “Learning representations from EEG with deep recurrent-convolutional neural networks,” in Proc. ICLR, 2016, pp. 1–15.
[25] H. Cecotti and A. Gräser, “Convolutional neural network with embedded Fourier transform for EEG classification,” in Proc. 19th Int. Conf. Pattern Recognit., Dec. 2008, pp. 1–4.
[26] H. Cecotti and A. Gräser, “Convolutional neural networks for P300 detection with application to brain-computer interfaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 433–445, Mar. 2011.
[27] S. Stober, D. J. Cameron, and J. A. Grahn, “Using convolutional neural networks to recognize rhythm stimuli from electroencephalography recordings,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1449–1457.
[28] R. Manor and A. B. Geva, “Convolutional neural network for multi-category rapid serial visual presentation BCI,” Frontiers Comput. Neurosci., vol. 9, p. 146, Dec. 2015.
[29] S. Stober, A. Sternin, A. M. Owen, and J. A. Grahn. (2016). “Deep feature learning for EEG recordings,” pp. 1–24. [Online]. Available: https://arxiv.org/abs/1511.04306
[30] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance. (2016). “EEGNet: A compact convolutional network for EEG-based brain-computer interfaces,” pp. 1–20. [Online]. Available: https://arxiv.org/abs/1611.08024
[31] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, 2010, pp. 807–814.
[32] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, Jun. 2014.
[33] C. A. Joyce, I. F. Gorodnitsky, and M. Kutas, “Automatic removal of eye movement and blink artifacts from EEG data using blind component separation,” Psychophysiology, vol. 41, no. 2, pp. 313–325, 2004.
[34] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, pp. 1–15, 2014. [Online]. Available: https://arxiv.org/abs/1412.6980
[36] F. Chollet. (2015). Keras. [Online]. Available: https://github.com/fchollet/keras
[37] M. Abadi et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. [Online]. Available: http://tensorflow.org
[38] A. Gramfort et al., “MNE software for processing MEG and EEG data,” NeuroImage, vol. 86, pp. 446–460, Feb. 2014.
[39] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Ann. Statist., vol. 29, no. 5, pp. 1189–1232, Oct. 2001.
[40] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proc. 22nd Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 785–794.
[41] J. Bergstra, D. Yamins, and D. D. Cox, “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures,” in Proc. ICML, 2013, pp. 1–9.
[42] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. 13th Eur. Conf. Comput. Vis. (ECCV), Zurich, Switzerland, Sep. 2014, pp. 818–833.
[43] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” in Proc. ICLR, 2015, pp. 1–14.
[44] J. Haba-Rubio, V. Ibanez, and E. Sforza, “An alternative measure of sleep fragmentation in clinical practice: The sleep fragmentation index,” Sleep Med., vol. 5, no. 6, pp. 577–581, 2004.
[45] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[46] M. Banko and E. Brill, “Scaling to very very large corpora for natural language disambiguation,” in Proc. 39th Annu. Meeting Assoc. Comput. Linguistics (ACL), 2001, pp. 26–33.
