A Deep Learning Architecture For Temporal
1534-4320 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
CHAMBON et al.: DEEP LEARNING ARCHITECTURE FOR TEMPORAL SLEEP STAGE CLASSIFICATION 759
(see [7] for a very extensive list of references). Methods in the second category consist in learning appropriate feature representations from transformed data [5], [8]–[10] or directly from raw data with convolutional neural networks [11]–[13]. Recently, another method was proposed to perform sleep stage classification from radio wave signals, with an adversarial deep neural network [14].

One of the main statistical learning challenges is the imbalanced nature of the classification task, which has important practical implications for this application. Typically, sleep stages such as N1 are rare compared to N2 stages. When learning a predictive algorithm with very imbalanced classes, what classically happens is that the resulting system tends to never predict the rarest classes. One way to address this issue is to reweight the model loss function so that the cost of making an error on a rare sample is larger [15]. With an online training approach as used with neural networks, one way to achieve this is to employ balanced sampling, i.e. to feed the network with batches of data which contain as many data points from each class [4], [5], [9]–[13]. This indeed prevents the predictive models from being biased towards the most frequent stages. Yet, such a strategy raises the question of the choice of the evaluation metric used. The standard Accuracy metric (Acc.) considers that any prediction mistake has the same cost. If N2 represented 90 % of the data, always predicting N2 would lead to a 90 % accuracy, even though such a predictor is useless. A natural way to better evaluate a model in the presence of imbalanced classes is to use the Balanced Accuracy (B. Acc.) metric. With this metric the cost of a mistake on a sample of type N2 is inversely proportional to the fraction of samples of type N2 in the data. By doing so, every sleep stage has the same impact on the final figure of merit [16].

Another statistical learning challenge concerns the way transition rules are handled. Indeed, as the transition rules may impact the final decision of a scorer, a predictive model might take them into account in order to increase its performance. Doing so is possible by feeding the final classifier with the features from the neighboring time segments [4], [5], [9]–[13]. This is referred to as temporal sleep stage classification.

A number of public sleep datasets contain PSG records with several EEG channels, and additional modalities such as EOG or EMG channels [17]. Although these modalities are used by human experts for sleep scoring, seldom are they considered by automatic systems [16]. Focusing only on the EEG modality, it is natural to think that the multivariate nature of EEG data does carry precious information. This can be exploited at least to cope with electrode removal or bad channels, and thus improve the robustness of the prediction algorithm. However, this can also be exploited as a lever to improve the predictive capacities of the algorithm. Indeed, the EEG community has designed a number of methods to increase the signal-to-noise ratio (SNR) of an effect of interest from a full array of sensors. Among these methods are so-called linear spatial filters, which include classical techniques such as PCA/ICA [18], Common Spatial Patterns for BCI applications [19] or beamforming methods for source localization [20]. Less classically and more recently, various deep learning approaches have been proposed to learn from EEG data [21]–[24] and some of these contributions use a first layer that boils down to a spatial filter [25]–[30]. Note that using a deep neural network to learn a feature representation and classify sleep stages on data coming from multiple sensors has been recently investigated in parallel with our work [5], [9]. Yet our study further investigates and quantifies how much using a spatial filtering step enhances the prediction performance.

This paper is organized as follows. First we introduce our end-to-end deep learning approach to perform temporal sleep stage classification using multivariate time series coming from multiple modalities (EEG, EOG, EMG). We furthermore detail how the temporal context of each segment of data can be exploited by our model. Then, we benchmark our approach on publicly available data and compare it to state-of-the-art sleep stage classification methods. Finally, we explore the dependencies of our approach regarding the spatial context, the temporal context and the amount of training data at hand.

Notation: We denote by $X \in \mathbb{R}^{C \times T}$ a segment of multivariate time series with its label $y \in \mathcal{Y}$ which maps to the set {W, N1, N2, N3, REM}. Here, $X$ corresponds to a sample lasting 30 seconds and $\mathcal{Y} = \{y \in \mathbb{R}_+^5 : \sum_{i=1}^{5} y_i = 1\}$ corresponds to the probability simplex. Precisely, each label is encoded as a vector of $\mathbb{R}^5$ with 4 coefficients equal to 0 and a single coefficient equal to 1 which indicates the sleep stage. Here $C$ refers to the number of channels and $T$ to the number of time steps. $S_t^k = \{X_{t-k}, \ldots, X_t, \ldots, X_{t+k}\}$ stands for an ordered sequence of $2k + 1$ neighboring segments of signal. $\mathcal{X}_k = (\mathbb{R}^{C \times T})^{2k+1}$ is the space of $2k + 1$ neighboring segments of signal. Finally, $\ell$ stands for the categorical cross entropy loss function. Given a true label $y \in \mathcal{Y}$ and a predicted label $p \in \mathcal{Y}$ it is defined as: $\ell(y, p) = -\sum_{i=1}^{5} y_i \log p_i$.

II. MATERIAL AND METHODS

In this section, we present a deep learning architecture to perform temporal sleep stage classification from multivariate and multimodal time series. We first define formally the classification problem addressed here. Then we present the network architecture used to predict without temporal context (k = 0). Then we describe the time distributed multivariate network proposed to perform temporal sleep stage classification (k > 0). Finally, we present and discuss the alternative state-of-the-art methods used for comparison in our experiments.

A. Machine Learning Problem

In this paragraph, we formalize in mathematical terms the temporal classification task considered here. Let $k$ be a non-negative integer. Let $f : \mathcal{X}_k \to \mathcal{Y}$ stand for a predictive model that belongs to a parametric set denoted $\mathcal{F}$. Here $f$ takes as input an ordered sequence of $2k + 1$ neighboring segments of signal, and outputs a probability vector $p \in \mathcal{Y}$. For simplicity the parameters of the network are not written. The machine learning problem tackled then reads:

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \; \mathbb{E}_{(x, y) \in \mathcal{X}_k \times \mathcal{Y}} \left[ \ell(y, f(x)) \right]. \qquad (1)$$
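The contrast between accuracy and balanced accuracy, and the cross entropy loss from the Notation paragraph, can be illustrated with a minimal NumPy sketch. It is only an illustration under stated assumptions: balanced accuracy is taken here as the mean of the per-class recalls over the classes present in the labels, the 90 % / 10 % toy hypnogram is invented, and all function names are hypothetical.

```python
import numpy as np

# Stage indices follow the order {W, N1, N2, N3, REM} used in the text.
STAGES = ["W", "N1", "N2", "N3", "REM"]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted stage equals the true stage."""
    return float(np.mean(y_true == y_pred))


def balanced_accuracy(y_true, y_pred, n_classes=5):
    """Mean of the per-class recalls (assumed definition), over classes present in y_true."""
    recalls = []
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))


def cross_entropy(y, p, eps=1e-12):
    """Categorical cross entropy -sum_i y_i log p_i for a one-hot y and p on the simplex."""
    return float(-np.sum(y * np.log(np.clip(p, eps, 1.0))))


# Invented toy hypnogram: 90 % N2 (index 2) and 10 % N1 (index 1).
y_true = np.array([2] * 90 + [1] * 10)
y_pred = np.full_like(y_true, 2)  # a predictor that always answers N2

acc = accuracy(y_true, y_pred)            # high, despite never detecting N1
bacc = balanced_accuracy(y_true, y_pred)  # recall is 1 on N2 and 0 on N1

# Loss of a uniform prediction when the true stage is N1.
y_onehot = np.eye(5)[1]
p_uniform = np.full(5, 0.2)
loss = cross_entropy(y_onehot, p_uniform)
```

With these toy labels the always-N2 predictor reaches an accuracy of 0.9 but a balanced accuracy of only 0.5, which is the effect the text describes.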
760 IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 26, NO. 4, APRIL 2018
TABLE I
DETAILED ARCHITECTURE FOR THE FEATURE EXTRACTOR FOR C EEG CHANNELS WITH TIME SERIES OF LENGTH T. THE SAME ARCHITECTURE IS EMPLOYED FOR C′ EMG CHANNELS. WHEN BOTH EEG/EOG AND EMG ARE CONSIDERED, THE OUTPUTS OF THE DROPOUT LAYERS ARE CONCATENATED AND FED INTO THE FINAL CLASSIFIER. THE NUMBER OF PARAMETERS OF THE FINAL DENSE LAYER BECOMES THUS EQUAL TO 5 × ((C + C′) × (T // 256) × 8)
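As a quick check of the expression in the caption, the short sketch below evaluates it for one configuration. The function name and the example channel counts (6 EEG, 3 EMG) are illustrative assumptions. The constants 5, 256 and 8 are copied verbatim from the caption, and the time length assumes 30 s samples at the 128 Hz sampling rate reported in the preprocessing section.

```python
def final_dense_layer_params(n_eeg_channels, n_emg_channels, n_times):
    """Evaluate the caption's expression 5 * ((C + C') * (T // 256) * 8).

    The constants are taken as written in the caption; no bias terms are added.
    """
    return 5 * ((n_eeg_channels + n_emg_channels) * (n_times // 256) * 8)


# Assumed example: 30 s samples at 128 Hz give T = 3840 time steps.
n_times = 30 * 128
n_params = final_dense_layer_params(n_eeg_channels=6, n_emg_channels=3, n_times=n_times)
```

For this assumed configuration, T // 256 equals 15, so the expression evaluates to 5 × (9 × 15 × 8) = 5400.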
Fig. 2. Time distributed architecture to process a sequence of inputs $S_t^k = X_{t-k}, \ldots, X_t, \ldots, X_{t+k}$ with k = 1. $X_k$ stands for the multivariate input data over 30 s that is fed into the feature extractor Z. Features are extracted from consecutive 30 s samples: $X_{t-k}, \ldots, X_t, \ldots, X_{t+k}$. Then the obtained features are aggregated $z_{t-k}, \ldots, z_t, \ldots, z_{t+k}$. The resulting aggregation of features is finally fed into a classifier to predict the label $y_t$ associated to the sample $X_t$.

pre-activation of the last layer. The output of the network is a vector $p \in \mathcal{Y}$. $p$ is obtained as: $p_i = \exp(a_i) / \sum_{j=1}^{5} \exp(a_j)$.

C. Time Distributed Multivariate Network

In this paragraph, we describe the Time Distributed Multivariate Network we propose to perform temporal sleep stage classification (k > 0). It builds on the Multivariate Network Architecture presented previously and distributes it in time to take into account the temporal context. Indeed a sample of class N2 is very likely to be close to another N2 sample, but also to an N1 or an N3 sample [2].

To take into account the statistical properties of the signals before and after the sample of interest, we propose to aggregate the different features extracted by Z on a number of time segments preceding or following the sample of interest. More formally, let $S_t^k = \{X_{t-k}, \ldots, X_t, \ldots, X_{t+k}\} \in \mathcal{X}_k$ be a sequence of $2k + 1$ neighboring samples (k samples in the past and k samples in the future). Distributing the feature extractor in time consists in applying Z to each sample in $S_t^k$ and aggregating the $2k + 1$ outputs, forming a vector of size $D(2k + 1)$. Then, the obtained vector is fed into the final softmax classifier. This is summarized in Fig. 2.

D. Training

The minimization in (1) is done with an online procedure based on stochastic gradient descent using mini batches of data. Yet, to be able to learn to discriminate under-represented classes (typically W and N1 stages), and since we are interested in optimizing the balanced accuracy, we propose to balance the distribution of each class in minibatches of size 128. As we have 5 classes it means that during training, each batch has about 20% of samples of each class. The Adam optimizer [35] is used for optimization with the following parameters: $\alpha = 0.001$ (learning rate), $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$.

III. EXPERIMENTS

In this section, we first introduce the dataset and the preprocessing steps used. Then, we present the different feature extractors of the literature which we use in our benchmark. We then present the experiments which aim at (i) establishing a general benchmark of our feature extractor against state-of-the-art approaches in univariate (single derivation) and bivariate (2 channels) contexts, (ii) studying the influence of the spatial context, (iii) evaluating the gain obtained by using the temporal context and (iv) evaluating the impact of the quantity of training data.

A. Data and Preprocessing Steps

The data used in our experiments is the publicly available MASS dataset - session 3 [17]. It corresponds to 62 night records, each one coming from a different subject. Because of preprocessing issues we removed the record 01-03-0034. Each record contains data from 20 EEG channels which were referenced with respect to the A2 electrode. We did not modify the referencing scheme, hence removed the A2 electrode from our study. Each record also includes signals from 2 EOG (horizontal left and right) and 3 EMG channels (chin channels) that we considered as additional modalities.

The time series from all the available sensors were first low-pass filtered with a 30 Hz cutoff frequency. Then they were downsampled to a sampling rate of 128 Hz. The downsampling step speeds up the computations for the neural networks, while keeping the information up to 64 Hz (Nyquist frequency). Downsampling and low / band pass filtering are commonly used preprocessing steps [5], [16]. The data extraction and the filtering steps were performed with the MNE software [38]. The filter employed was a zero-phase finite impulse response (FIR) filter with transition bandwidth of approximately 7 Hz. Sleep stages were marked according to the AASM rules by a single sleep expert per record [2], [17]. When investigating the use of temporal context by feeding the predictors with sequences of consecutive samples $S_k$, we used zero padding to complete the samples at the beginning and at the end of the night. This makes it possible to feed the models with all the samples of a night record while keeping the dimension of the input batches fixed.

The time series fed into the different neural networks were additionally standardized. Indeed, for each channel, every
30 s sample is standardized individually such that it has zero mean and unit variance. For the specific task of sleep stage classification this is particularly relevant since records are carried out over nearly 8 hours. During such a long period the recording conditions vary, for instance skin humidity, body temperature, body movements or, even worse, electrode contact loss. Giving each 30 s time series the same first and second order moments makes it possible to cope with this likely covariate shift that may occur during a night record. This operation only rescales the frequency powers in every frequency band, without altering their relative amplitudes where the discriminant information for the considered sleep stage classification task lies (see Parseval’s theorem). Note that this preprocessing step can be done online before feeding the network with a batch of data.

Cross-validation was used to have an unbiased estimate of the performance of our model on unseen records. To reduce variance in the reported scores, the data were randomly split 5 times between training, validation and testing sets. The splits were performed with respect to records in order to guarantee that a record used in the training set was never used in the validation or the testing set. For each split, 41 records were included in the training set, 10 records in the validation set and 10 records in the testing set.

B. Related Work and Compared Approaches

We now introduce the three state-of-the-art approaches that we used for comparison with our approach: a gradient boosting classifier [39] trained on hand-crafted features and two convolutional networks trained on raw univariate time series following the approach of [11] and [12].

1) Features Based Approach: The Gradient Boosting model was learnt on hand-crafted features: time domain features and frequency domain features computed for each input sensor as described in [16]. More precisely, we extracted from each channel the power and relative power in 5 bands: δ (0.5–4.5 Hz), θ (4.5–8.5 Hz), α (8.5–11.5 Hz), σ (11.5–15.5 Hz), β (15.5–30 Hz), giving 5 features each. We furthermore extracted power ratios between these bands (which amount to 5 × 4/2 = 10 supplementary features) and spectral entropy features as well as statistics such as mean, variance, skewness, kurtosis, 75% quantile. This gives in the end a set of 26 features per channel.

The implementation used is from the XGBoost package [40], which internally employs decision trees. This model is known for its high predictive performance, robustness to outliers, robustness to unbalanced classes and parallel search of the best split. Training was also performed by minimizing the categorical cross entropy. The training set was balanced using under sampling. The maximum number of trees in the model was set to 1000. An early stopping callback on the validation categorical cross entropy with patience equal to 10 was used to stop the training when no improvement was observed. Training never led to more than 1000 trees in a model.

The model has several hyper-parameters that need to be tuned to improve classification performances and cope with unbalanced classes. To find the best hyper-parameters for each experiment, we performed random searches with the hyperopt Python package [41]. Concretely, we considered only the data from the training and validation subjects at hand. For each set of hyper-parameters, we trained and evaluated the classifier on data from 5 different splits of training and evaluation subjects (80% for training, 20% for evaluation). The search was done with 50 sets of hyper-parameters and the set which achieved the best balanced accuracy averaged on the 5 splits was selected. The following parameters were tuned: the learning rate in the interval $[10^{-4}, 10^{-1}]$, the minimum weight of a child tree in the set {1, 2, ..., 10}, the maximum depth of trees in {1, 2, ..., 10}, the regularization parameter in [0, 1], the subsampling parameter in [0.5, 1], and the sampling level of columns by tree in [0.5, 1].

2) Convolutional Networks on Raw Univariate Time Series: We reimplemented and benchmarked 2 end-to-end deep learning approaches. We detail each of them in the following paragraphs and explain how we used these methods.

a) Tsinalis et al. 2016: The approach by Tsinalis et al. [11] is a deep convolutional network that processes univariate time series (a single EEG signal). It was reimplemented according to the paper details. The approach originally takes into account the temporal context, by feeding the network with 150 s of signals, i.e. the sample to classify plus the 2 previous and 2 following samples. When used without temporal context in the experiments, the network is fed with 30 s samples.

Training was performed by minimizing the categorical cross entropy, and a similar balanced sampling strategy with the Adam optimizer was used. An additional $\ell_2$ regularization set to 0.01 was applied to the convolution filters [11]. The code was written in Keras [36] with a Tensorflow backend [37].

b) Supratak et al. 2017: The approach by Supratak et al. [12] is also an end-to-end deep convolutional network which contains two blocks: a feature extractor that processes the frequency content of the signal and a recurrent neural network that processes a sequence of consecutive 30 s samples of signal. The feature extractor processes low frequency information and high frequency information in two distinct convolutional sub-networks before merging the feature representations. The resulting tensor is then fed into a softmax classifier. This block is trained with balanced sampling. Then the feature extractor is linked to a recurrent neural network composed of 2 bi-LSTM layers. The whole architecture is fed with sequences of 25 consecutive 30 s samples from the same record.

The first block was used for comparison in our experiment. Its training was performed by minimizing the categorical cross entropy, and a balanced sampling strategy with the Adam optimizer was used. The code was written in Keras [36] with a Tensorflow backend [37].

C. Experiment 1: Comparison of Feature Extractors on the Fz/Cz Channels

In this experiment, we perform a general benchmark of our feature extractor against hand-crafted features classified with Gradient Boosting, and the two network architectures just described [11], [12]. The purpose of this experiment is to benchmark different feature representations on a similar spatial context, Fz-Cz, without using the temporal context, and to
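Two of the steps described in this section, the standardization of every 30 s sample and the class-balanced sampling of minibatches of size 128, can be sketched in NumPy as follows. This is a minimal illustration and not the authors' code. The function names, the small `eps` guard, sampling with replacement within a class and the random handling of the 3 leftover slots (128 is not a multiple of 5) are all assumptions, and the toy data are synthetic.

```python
import numpy as np


def standardize_per_sample(X, eps=1e-8):
    """Give each channel of each 30 s sample zero mean and unit variance.

    X has shape (n_samples, n_channels, n_times). The eps guard (an assumption)
    avoids dividing by zero on flat segments such as zero padding.
    """
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)


def balanced_batch_indices(y, batch_size=128, rng=None):
    """Draw sample indices so that every class in y is (about) equally represented."""
    rng = np.random.default_rng() if rng is None else rng
    classes = np.unique(y)
    per_class = batch_size // len(classes)
    # Sampling with replacement (assumption) so that rare stages can fill their share.
    idx = np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=per_class, replace=True) for c in classes]
    )
    # Fill the leftover slots with one extra sample from distinct random classes.
    n_missing = batch_size - idx.size
    if n_missing > 0:
        extra_classes = rng.choice(classes, size=n_missing, replace=False)
        extra = [rng.choice(np.flatnonzero(y == c)) for c in extra_classes]
        idx = np.concatenate([idx, np.array(extra, dtype=idx.dtype)])
    rng.shuffle(idx)
    return idx


# Synthetic toy data: 200 samples, 2 channels, 30 s at 128 Hz, imbalanced labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=5.0, size=(200, 2, 30 * 128))
y = np.repeat(np.arange(5), [20, 10, 110, 30, 30])

X_std = standardize_per_sample(X)
batch_idx = balanced_batch_indices(y, batch_size=128, rng=rng)
batch_counts = np.bincount(y[batch_idx], minlength=5)  # 25 or 26 samples per stage
```

Because the standardization only needs the samples of the current batch, it can be applied online, as noted in the preprocessing description.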
Fig. 4. Per class metrics of several feature extractors trained on the Fz-Cz derivation or Fz-A2, Cz-A2 channels.
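Per class metrics of this kind are typically derived from a confusion matrix in a one-versus-rest fashion. The sketch below computes per class sensitivity and specificity. It is a hedged illustration: the function name and the 3-class toy matrix are invented, rows are assumed to index true stages and columns predicted stages, and the exact set of metrics plotted in Fig. 4 is not restated here.

```python
import numpy as np


def per_class_sensitivity_specificity(conf):
    """One-versus-rest sensitivity and specificity from a confusion matrix.

    conf[i, j] counts samples of true class i predicted as class j (assumed layout).
    Every class is assumed to appear at least once, so no denominator is zero.
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fn = conf.sum(axis=1) - tp
    fp = conf.sum(axis=0) - tp
    tn = conf.sum() - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


# Invented 3-class confusion matrix, for illustration only.
toy_conf = [[8, 2, 0],
            [1, 5, 4],
            [0, 3, 7]]
sensitivity, specificity = per_class_sensitivity_specificity(toy_conf)
```

For the first class of this toy matrix the sensitivity is 8 / 10 = 0.8 and the specificity is 19 / 20 = 0.95.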
Fig. 8. Influence of temporal context on the confusion matrices (top row) and the transition matrices (bottom row). Including more temporal context
induces an increase of performance in the discrimination of stages N1, N2 and REM whereas it induces a slight decrease in the discrimination of W
and N3 when the temporal context is too wide. Including more temporal context smooths the hypnogram.
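A transition matrix of the kind shown in the bottom row of Fig. 8 can be estimated from a hypnogram by counting consecutive stage pairs. The sketch below is one plausible way to do it and is not taken from the paper: the row normalization, the function name and the toy hypnogram are assumptions.

```python
import numpy as np


def transition_matrix(hypnogram, n_stages=5):
    """Row-normalized counts of transitions stage[t] -> stage[t + 1].

    Rows of stages that never occur as a source are left at zero.
    """
    hypnogram = np.asarray(hypnogram)
    counts = np.zeros((n_stages, n_stages))
    # Accumulate one count per pair of consecutive 30 s samples.
    np.add.at(counts, (hypnogram[:-1], hypnogram[1:]), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


# Invented toy hypnogram with stage indices in the order {W, N1, N2, N3, REM}.
toy_hypnogram = [0, 0, 2, 2, 2, 1, 2, 2, 3, 3]
P = transition_matrix(toy_hypnogram)
```

Comparing such a matrix for the predicted and the scored hypnograms is one way to see the smoothing effect: a smoother prediction puts more mass on the diagonal.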
This is quite simple and compact compared to the recent approaches in [11], which has up to $\sim 14 \cdot 10^7$ parameters, and [12], which exhibits $\sim 6 \cdot 10^5$ parameters for the feature extractor and $2 \cdot 10^7$ parameters for the sequence learning part using BiLSTM. This significant difference with [11] is mainly due to our choice of using small convolution filters (64 time steps after low pass filtering and downsampling), large pooling regions (pooling over 16 time steps) according to the 128 Hz sampling frequency, and removing the penultimate fully connected layers before the final softmax classifier. Such a strategy has already been successful in computer vision [43] and EEG [30].

D. Classification Metrics

The proposed approach yields equal (univariate) or higher (multivariate) classification metrics than the other benchmarked feature extractors while presenting a limited training run time per epoch or prediction time per night record (cf. Fig. 3). The analysis of per class metrics shows that the proposed approach might not reach the highest performance on every stage (cf. Fig. 4). Indeed, Supratak et al. 2017 outperforms on N1, and Gradient Boosting exhibits a similar accuracy in N3. However, the proposed approach performs globally well and appears to be quite robust in comparison to the other approaches.

The proposed approach is particularly good at detecting W (high sensitivity 0.85 and specificity close to 1). This characteristic might be particularly interesting for clinical applications where a diagnosis of fragmented sleep might rely on the detection of W.

In order to measure the relevance of our approach for different types of subjects, we monitored the balanced accuracy of a subject as a function of the sleep fragmentation index (total number of awakenings and sleep stage shifts divided by total sleep time) [44]. The results (not shown) did not exhibit a particular correlation between this measure of sleep quality and the classification performances. This indicates that the proposed approach could be used for clinical purposes with patients whose sleep exhibits abnormal structures.

Unfortunately, the different classification performances cannot be compared with inter-scorer agreement on this dataset since the night records have only been annotated by a single expert. Yet, a 0.80 agreement has been reported between scorers [6]. Furthermore, Stephansen et al. [5] monitored the classification accuracy of their model as a function of the consensus from 1 to 6 scorers. The reported curve was linearly increasing from 0.76 accuracy for 1 scorer up to 0.87 accuracy for a 6 scorer consensus. We shall reproduce such an experiment with the proposed approach in our future work.

E. Temporal Context and Transitions

Our architecture naturally allows learning from the temporal context as it only relies on the aggregation of temporal features and a softmax classifier. Such a choice enabled us to measure the influence of the close temporal context and better understand its impact. It differs from the approaches proposed by Tsinalis et al. [11] and Sors et al. [13] as our feature extractor always receives 30 s of signals, and is therefore applied to a sequence of neighboring 30 s samples. On the contrary, Tsinalis et al. [11] and Sors et al. [13] extended the feature extractor input window to 150 s and 120 s, respectively. In [12], a temporal context of 25 neighboring 30 s samples is processed.

Our experiment on temporal context highlights a trade-off: integrating some temporal context brings benefits in the detection of some sleep stages specifically (N1, N2, REM) but too large a temporal context has a negative effect on the detection of W and N3 stages, as emphasized by Fig. 8. This naturally translates to the balanced accuracy scores, which exhibit a significant increase for small temporal context and no increase, or even a decrease, for large temporal context (cf. Fig. 7). Looking at the transition matrices, it appears that more temporal context smooths the hypnograms, which might be detrimental to the quality of the system. For these reasons, temporal context should be used, but its width must be cross-validated.

Besides, some subjects might exhibit abnormal sleep structures related to a sleep disorder [6]. There is thus a trade-off between boosting the classification performance by integrating as much context as possible and not over-fitting sleep transitions, in order not to miss a sleep disorder related to a fragmented sleep. This is an additional argument in favor of cross-validating the temporal context width.

An extension of our approach, for example to capture complex stage transitions or long term dependencies, would be to employ a recurrent network architecture. Along these lines, recent approaches have proposed more complex strategies to integrate the temporal context with LSTM unit cells or Bi-LSTM unit cells [5], [9], [10], [12], [45]. Integrating our feature extractor with such recurrent networks remains to be done and should lead to further performance improvements.

F. Influence of Dataset

Figure 9 raises an important question: how much data is needed to establish a correct benchmark of predictive models for sleep stage classification? This is particularly interesting concerning the deep learning approaches. Indeed, the Gradient Boosting handles the small data situation quite well and does not exhibit a huge increase in performance as the number of training records increases. On the contrary, our approach delivers particularly good performances if enough training data are available. Extrapolation of the learning curves (performance as a function of the number of training records) in Fig. 9 suggests that one could expect better performances if more data were accessible. This forces us to reconsider the way we compare predictive models when training dataset sizes differ between experiments, since the quantity of training data plays the role of a hyper-parameter for some algorithms like ours. Some algorithms indeed become better when more data are available (see for example [46, Fig. 1]).

G. Choice of Sampling and Metrics

Our approach was particularly motivated by the accurate detection of any sleep stage independently of its proportion. To achieve this goal, all approaches have been trained
using balanced sampling and evaluated with balanced metrics (except for experiment 1 where more metrics have been used).

We observed that the choice of sampling strategies employed during online learning impacts the evaluation metrics and, conversely, the choice of metrics should motivate the choice of sampling strategies. Indeed, balanced sampling should be used to optimize the balanced accuracy of the model. On the other hand, random sampling should be used to boost the accuracy. The use of balanced sampling has been reported or commented on in [11]–[13].

Nonetheless, for a specific clinical application, one may decide that errors on a minor stage, such as N1, are not so dramatic and hence prefer to train the network with random batches of data. On the contrary, one might want to discriminate as accurately as possible N1 stages from W or REM and therefore one should use balanced sampling, or over sampling of N1.

The sampling strategy and the evaluation metric are degrees of freedom one can play with to adapt the network to one's own experimental or clinical purposes.

V. CONCLUSION

In this study we introduced a deep neural network to perform temporal sleep stage classification from multimodal and multivariate time series. The model pools information from different sensors thanks to a linear spatial filtering operation and builds a hierarchical feature representation of PSG data thanks to temporal convolutions. It additionally pools information from different modalities processed with separate pipelines.

The approach proposed in this paper exhibits strong classification performances compared to the state-of-the-art with little run time and computational cost. This makes the approach a potentially good candidate for being used in a portable device and performing online sleep stage classification.

Our approach makes it possible to quantify the use of multiple EEG channels and additional modalities such as EOG and EMG. Interestingly, it appears that a limited number of EEG channels (6 EEG: F3, F4, C3, C4, O1, O2) gives performances similar to 20 EEG channels. Furthermore, using EMG channels boosts the model performances.

The use of temporal context is analyzed and quantified and appears to give a significant increase in performance when the spatial context is limited. It should be noted that the temporal context as explored in this paper might not be directly suitable for online prediction, but it is easily usable for offline prediction.

REFERENCES

[1] C. Berthomier et al., “Automatic analysis of single-channel sleep EEG: Validation in healthy individuals,” Sleep, vol. 30, no. 11, pp. 1587–1595, 2007.
[2] C. Iber, S. Ancoli-Israel, A. Chesson, and S. F. Quan, “The AASM manual for the scoring of sleep and associated events: Rules, terminology and technical specifications,” Amer. Acad. Sleep Med., Darien, IL, USA, Tech. Rep., 2007.
[3] J. Allan Hobson, “A manual of standardized terminology, techniques and scoring system for sleep stages of human subjects,” Electroencephalogr. Clin. Neurophysiol., vol. 26, no. 6, p. 644, Jun. 1969.
[4] O. Tsinalis, P. M. Matthews, and Y. Guo, “Automatic sleep stage scoring using time-frequency analysis and stacked sparse autoencoders,” Ann. Biomed. Eng., vol. 44, no. 5, pp. 1587–1597, 2016.
[5] J. B. Stephansen et al., “The use of neural networks in the analysis of sleep stages and the diagnosis of narcolepsy,” CoRR, pp. 1–41, 2017. [Online]. Available: https://arxiv.org/abs/1710.02094
[6] R. S. Rosenberg and S. Van Hout, “The American academy of sleep medicine inter-scorer reliability program: Sleep stage scoring,” J. Clin. Sleep Med., vol. 10, no. 4, pp. 447–454, 2014.
[7] K. A. I. Aboalayon, M. Faezipour, W. S. Almuhammadi, and S. Moslehpour, “Sleep stage classification using EEG signal analysis: A comprehensive survey and new investigation,” Entropy, vol. 18, no. 9, p. 272, 2016.
[8] A. Vilamala, K. H. Madsen, and L. K. Hansen, “Deep convolutional neural networks for interpretable analysis of EEG sleep stage scoring,” CoRR, pp. 1–6, 2017. [Online]. Available: https://arxiv.org/abs/1710.00633
[9] S. Biswal et al., “SLEEPNET: Automated sleep staging system via deep learning,” CoRR, pp. 1–17, 2017. [Online]. Available: https://arxiv.org/abs/1707.08262
[10] H. Dong, A. Supratak, W. Pan, C. Wu, P. M. Matthews, and Y. Guo. (2016). “Mixed neural network approach for temporal sleep stage classification.” [Online]. Available: https://arxiv.org/abs/1610.06421
[11] O. Tsinalis, P. M. Matthews, Y. Guo, and S. Zafeiriou. (2016). “Automatic sleep stage scoring with single-channel EEG using convolutional neural networks.” pp. 1–10. [Online]. Available: https://arxiv.org/abs/1610.01683
[12] A. Supratak, H. Dong, C. Wu, and Y. Guo, “DeepSleepNet: A model for automatic sleep stage scoring based on raw single-channel EEG,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 25, no. 11, pp. 1998–2008, Nov. 2017.
[13] A. Sors, S. Bonnet, S. Mirek, L. Vercueil, and J.-F. Payen, “A convolutional neural network for sleep stage scoring from raw single-channel EEG,” Biomed. Signal Process. Control, vol. 42, pp. 107–114, 2018.
[14] M. Zhao, S. Yue, D. Katabi, T. S. Jaakkola, and M. T. Bianchi, “Learning sleep stages from radio signals: A conditional adversarial architecture,” in Proc. 34th Int. Conf. Mach. Learn., vol. 70. Aug. 2017, pp. 4100–4109.
[15] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[16] T. Lajnef et al., “Learning machines and sleeping brains: Automatic sleep stage classification using decision-tree multi-class support vector machines,” J. Neurosci. Methods, vol. 250, pp. 94–105, Nov. 2015.
[17] C. O’Reilly, N. Gosselin, J. Carrier, and T. Nielsen, “Montreal archive of sleep studies: An open-access resource for instrument benchmarking and exploratory research,” J. Sleep Res., vol. 23, no. 6, pp. 628–635, 2014.
[18] L. C. Parra, C. D. Spence, A. D. Gerson, and P. Sajda, “Recipes for the linear analysis of EEG,” NeuroImage, vol. 28, no. 2, pp. 326–341, 2005.
[19] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and K. R. Müller, “Optimizing spatial filters for robust EEG single-trial analysis,” IEEE Signal Process. Mag., vol. 25, no. 1, pp. 41–56, Jan. 2008.
[20] B. D. Van Veen, W. van Drongelen, M. Yuchtman, and A. Suzuki, “Localization of brain electrical activity via linearly constrained minimum variance spatial filtering,” IEEE Trans. Biomed. Eng., vol. 44, no. 9, pp. 867–880, Sep. 1997.
[21] P. Mirowski, D. Madhavan, Y. LeCun, and R. Kuzniecky, “Classification of patterns of EEG synchronization for seizure prediction,” Clin. Neurophysiol., vol. 120, no. 11, pp. 1927–1940, 2009.
[22] D. F. Wulsin, J. R. Gupta, R. Mani, J. A. Blanco, and B. Litt, “Modeling electroencephalography waveforms with semi-supervised deep belief nets: Fast classification and anomaly measurement,” J. Neural Eng., vol. 8, no. 3, p. 036015, 2011.
[23] W. Zheng, J. Zhu, Y. Peng, and B. Lu, “EEG-based emotion classification using deep belief networks,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2014, pp. 1–6.
[24] P. Bashivan, I. Rish, M. Yeasin, and N. Codella, “Learning representations from EEG with deep recurrent-convolutional neural networks,” in Proc. ICLR, 2016, pp. 1–15.
[25] H. Cecotti and A. Gräser, “Convolutional neural network with embedded Fourier transform for EEG classification,” in Proc. 19th Int. Conf. Pattern Recognit., Dec. 2008, pp. 1–4.
[26] H. Cecotti and A. Gräser, “Convolutional neural networks for P300 detection with application to brain-computer interfaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 433–445, Mar. 2011.
[27] S. Stober, D. J. Cameron, and J. A. Grahn, “Using convolutional neural networks to recognize rhythm stimuli from electroencephalography recordings,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1449–1457.
[28] R. Manor and A. B. Geva, “Convolutional neural network for multi-category rapid serial visual presentation BCI,” Frontiers Comput. Neurosci., vol. 9, p. 146, Dec. 2015.
[29] S. Stober, A. Sternin, A. M. Owen, and J. A. Grahn. (2016). “Deep feature learning for EEG recordings,” pp. 1–24. [Online]. Available: https://arxiv.org/abs/1511.04306
[30] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance. (2016). “EEGNet: A compact convolutional network for EEG-based brain-computer interfaces.” pp. 1–20. [Online]. Available: https://arxiv.org/abs/1611.08024
[31] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, 2010, pp. 807–814.
[32] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, Jun. 2014.
[33] C. A. Joyce, I. F. Gorodnitsky, and M. Kutas, “Automatic removal of eye movement and blink artifacts from EEG data using blind component separation,” Psychophysiology, vol. 41, no. 2, pp. 313–325, 2004.
[34] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, pp. 1–15, 2014. [Online]. Available: https://arxiv.org/abs/1412.6980
[36] F. Chollet. (2015). Keras. [Online]. Available: https://github.com/fchollet/keras
[37] M. Abadi et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. [Online]. Available: http://tensorflow.org.
[38] A. Gramfort et al., “MNE software for processing MEG and EEG data,” NeuroImage, vol. 86, pp. 446–460, Feb. 2014.
[39] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Ann. Statist., vol. 29, no. 5, pp. 1189–1232, Oct. 2001.
[40] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proc. 22nd Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 785–794.
[41] J. Bergstra, D. Yamins, and D. D. Cox, “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures,” in Proc. ICML, 2013, pp. 1–9.
[42] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. 13th Eur. Conf. Comput. Vis. (ECCV), Zurich, Switzerland, Sep. 2014, pp. 818–833.
[43] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” in Proc. ICLR, 2015, pp. 1–14.
[44] J. Haba-Rubio, V. Ibanez, and E. Sforza, “An alternative measure of sleep fragmentation in clinical practice: The sleep fragmentation index,” Sleep Med., vol. 5, no. 6, pp. 577–581, 2004.
[45] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[46] M. Banko and E. Brill, “Scaling to very very large corpora for natural language disambiguation,” in Proc. 39th Annu. Meeting Assoc. Comput. Linguistics (ACL), 2001, pp. 26–33.