Data Mining: Mining Stream, Time Series, and Sequence Data

Mining Stream, Time Series, and Sequence Data

Methodologies for Stream Data Processing and Stream Data Systems

- Random sampling
- Sliding windows
- Histograms
- Multiresolution methods
- Sketches and synopses

Randomized Algorithms to Analyze Data Streams

Randomized algorithms, in the form of random sampling and sketching, are often used to deal with massive, high-dimensional data streams.
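
As one illustration of random sampling over a stream, the sketch below uses reservoir sampling, a standard technique that the slides do not name explicitly. It keeps a uniform random sample of k items without knowing the stream length in advance. The function name and the optional `rng` parameter are assumptions made for this example.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 5)
```

Each element is seen once and only k items are held in memory, which matches the one-pass constraint of stream processing.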

Data Stream Management Systems and Stream Queries

In traditional database systems, data are stored in finite, persistent databases. Stream data, by contrast, are unbounded and cannot be stored in full in a database. A Data Stream Management System (DSMS) may handle multiple data streams. Once an element from a data stream has been processed, it is discarded or archived, and it cannot be easily retrieved unless it is explicitly stored in memory.

Critical Layers of a Stream Data Cube

Two critical cuboids (or layers):

- The first layer, called the minimal interest layer, is the minimally interesting layer that an analyst would like to study.
- The second layer, called the observation layer, is the layer at which an analyst (or an automated system) would like to continuously study the data.

Hoeffding Tree Algorithm

The Hoeffding tree algorithm is a decision tree learning method for stream data classification. It was initially used to track Web click streams and to construct models that predict which Web hosts and Web sites a user is likely to access. It typically runs in sublinear time and produces a decision tree nearly identical to that of traditional batch learners. It uses Hoeffding trees, which exploit the idea that a small sample can often be enough to choose an optimal splitting attribute.
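
The "small sample is enough" idea rests on the Hoeffding bound. The sketch below assumes the standard form of the bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), which the slides do not spell out. The function names and the split rule shown are illustrative, not the authors' implementation.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with range value_range lies within epsilon of the mean
    observed over n independent examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def can_split(best_gain, second_gain, value_range, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)

# Epsilon shrinks as more examples reach the leaf.
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
```

As n grows, epsilon shrinks, so a clear winner among the candidate attributes can be confirmed after relatively few examples.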

Very Fast Decision Tree (VFDT)

The VFDT (Very Fast Decision Tree) algorithm makes several modifications to the Hoeffding tree algorithm. The modifications include breaking near-ties during attribute selection more aggressively, recomputing the split-evaluation function G only after a number of training examples have arrived, deactivating the least promising leaves whenever memory is running low, dropping poor splitting attributes, and improving the initialization method. VFDT works well on stream data and also compares extremely well to traditional classifiers in both speed and accuracy.

Concept-adapting Very Fast Decision Tree algorithm (CVFDT)

To adapt to concept-drifting data streams, VFDT was extended into the Concept-adapting Very Fast Decision Tree algorithm (CVFDT). CVFDT uses a sliding window approach; however, it does not construct a new model from scratch each time the window moves. Rather, it updates statistics at the nodes by incrementing the counts associated with new examples and decrementing the counts associated with old ones. Therefore, if there is a concept drift, some nodes may no longer pass the Hoeffding bound. When this happens, an alternate subtree will be grown, with the new best splitting attribute at the root.
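
The increment/decrement bookkeeping can be sketched as follows. This is a minimal illustration for a single attribute at a single node, not the full CVFDT algorithm; the class name `WindowedClassCounts` and its methods are hypothetical.

```python
from collections import Counter, deque

class WindowedClassCounts:
    """Counts of (attribute value, class label) pairs over a sliding window."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()     # examples currently inside the window
        self.counts = Counter()   # sufficient statistics for those examples

    def add(self, attribute_value, label):
        key = (attribute_value, label)
        self.window.append(key)
        self.counts[key] += 1             # increment for the new example
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1         # decrement for the expired example
            if self.counts[old] == 0:
                del self.counts[old]

stats = WindowedClassCounts(window_size=3)
for value, label in [("a", "yes"), ("a", "yes"), ("b", "no"), ("b", "no")]:
    stats.add(value, label)
```

Because the counts always describe only the current window, split decisions recomputed from them reflect the most recent concept rather than the whole history.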

A Classifier Ensemble Approach to Stream Data Classification

The idea is to train an ensemble, or group, of classifiers (using, say, naïve Bayes) from sequential chunks of the data stream. Whenever a new chunk arrives, we build a new classifier from it. The individual classifiers are weighted based on their expected classification accuracy in a time-changing environment. Only the top-k classifiers are kept. The decisions are then based on the weighted votes of the classifiers.
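
A rough sketch of this chunk-based scheme follows. Several simplifications are assumed: a trivial majority-class learner stands in for naïve Bayes, and each classifier's weight is simply its accuracy on the newest chunk (which is optimistic for the classifier trained on that chunk). All names here are hypothetical.

```python
from collections import Counter, defaultdict

def train_majority(chunk):
    """Stand-in learner: always predicts the chunk's most common label."""
    label = Counter(y for _, y in chunk).most_common(1)[0][0]
    return lambda x: label

def accuracy(model, chunk):
    return sum(model(x) == y for x, y in chunk) / len(chunk)

class ChunkEnsemble:
    def __init__(self, k, train=train_majority):
        self.k = k
        self.train = train
        self.members = []  # list of (weight, model) pairs

    def update(self, chunk):
        """Train on the new chunk, reweight all models, keep the top k."""
        models = [m for _, m in self.members] + [self.train(chunk)]
        scored = [(accuracy(m, chunk), m) for m in models]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        self.members = scored[: self.k]

    def predict(self, x):
        """Weighted vote of the retained classifiers."""
        votes = defaultdict(float)
        for weight, model in self.members:
            votes[model(x)] += weight
        return max(votes, key=votes.get)

ensemble = ChunkEnsemble(k=2)
ensemble.update([(0, "a"), (1, "a"), (2, "a"), (3, "a")])
ensemble.update([(4, "b"), (5, "b"), (6, "b"), (7, "a")])
```

Reweighting on the newest chunk lets classifiers trained on outdated concepts lose influence and eventually drop out of the top k.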

Clustering in Evolving Data Streams

- Compute and store summaries of past data
- Apply a divide-and-conquer strategy
- Cluster incoming data streams incrementally
- Perform microclustering as well as macroclustering analysis
- Explore multiple time granularities for the analysis of cluster evolution
- Divide stream clustering into on-line and off-line processes

Mining Time-Series Data

A time-series database consists of sequences of values or events obtained over repeated measurements of time.

- Trend analysis
- Similarity search in time-series analysis

Markov Chain for Sequence Analysis

A Markov chain is a model that generates sequences in which the probability of a symbol depends only on the previous symbol.
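
Under this first-order assumption, the probability of a sequence is the probability of its first symbol times the product of the transition probabilities between consecutive symbols. The sketch below illustrates that computation; the two-symbol model and its probabilities are made up for the example.

```python
def sequence_probability(seq, initial, transition):
    """P(x) = P(x1) * product over i of P(x_i | x_{i-1})."""
    if not seq:
        return 1.0
    prob = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob

# Hypothetical two-symbol chain.
INITIAL = {"A": 0.5, "B": 0.5}
TRANSITION = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.4, "B": 0.6},
}

p = sequence_probability("AAB", INITIAL, TRANSITION)  # 0.5 * 0.9 * 0.1
```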

Tasks using hidden Markov models include:

- Evaluation: Given a sequence, x, determine the probability, P(x), of obtaining x in the model.
- Decoding: Given a sequence, determine the most probable path through the model that produced the sequence.
- Learning: Given a model and a set of training sequences, find the model parameters (i.e., the transition and emission probabilities) that explain the training sequences with relatively high probability.

Algorithms Used in Sequence Analysis

- Forward algorithm (evaluation)
- Viterbi algorithm (decoding)
- Baum-Welch algorithm (learning)
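
The forward and Viterbi algorithms can be sketched compactly for a small hidden Markov model. This is a minimal illustration without the log-space scaling that long sequences would need; the two-state model and its probabilities are invented for the example, and Baum-Welch is omitted for brevity.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Evaluation: P(obs), summed over all hidden state paths."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Decoding: (probability, path) of the single most probable state path."""
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for symbol in obs[1:]:
        step = {}
        for s in states:
            # Best predecessor for state s at this position.
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda pair: pair[0],
            )
            step[s] = (prob * emit_p[s][symbol], path + [s])
        best = step
    return max(best.values(), key=lambda pair: pair[0])

# Hypothetical two-state model over the symbols "a" and "b".
STATES = ("s1", "s2")
START = {"s1": 0.6, "s2": 0.4}
TRANS = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
EMIT = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}

likelihood = forward("ab", STATES, START, TRANS, EMIT)
best_prob, best_path = viterbi("ab", STATES, START, TRANS, EMIT)
```

The two functions share the same recursion; the forward algorithm sums over predecessor states where Viterbi takes the maximum and remembers the path.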

Visit more self-help tutorials

Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free and self-guiding, and will not involve any additional support. Visit us at www.dataminingtools.net
