
An introduction to

Kolmogorov complexity
Liliana Salvador and Gustavo Lacerda
Complex Systems Summer School 2009
Outline
Symbols, strings and languages

Turing machines

Kolmogorov complexity

Incompressibility

Kolmogorov Complexity and Shannon entropy

Symmetry of Information

Time-bounded Kolmogorov complexity

Learning as compression

Universal learning

Universal measures of similarity

Clustering by compression
Symbols, Strings and
Languages
An alphabet A is a finite set of symbols, e.g. A = {0,1}.

A string over an alphabet A is a finite ordered sequence of symbols from A.

The empty string, denoted by ε, is the (unique) string of length zero.

Given an alphabet A, we define
A^0 = {ε}
A^(n+1) = A·A^n
A∗ = ∪_{n≥0} A^n, the set of all finite strings over A.

A language L over an alphabet A is a subset of A∗. That is, L ⊂ A∗.
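As a concrete illustration (my own, not from the original slides), a few lines of Python enumerate A^n and a finite prefix of A∗ for the binary alphabet:

from itertools import product

A = ['0', '1']                      # the binary alphabet

def strings_of_length(n):
    """All strings in A^n, as Python strings."""
    return [''.join(t) for t in product(A, repeat=n)]

print(strings_of_length(0))         # ['']: A^0 = {ε}
print(strings_of_length(2))         # ['00', '01', '10', '11']

# A∗ is infinite; here is the finite prefix A^0 ∪ A^1 ∪ A^2 ∪ A^3.
A_star_prefix = [s for n in range(4) for s in strings_of_length(n)]
print(len(A_star_prefix))           # 1 + 2 + 4 + 8 = 15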
The Roots…
Probability theory, information theory, philosophical
notions of randomness, theory of algorithms
Regular sequence: Pr(00000000000000000000) = 1/2^20

Regular sequence: Pr(01100110011001100110) = 1/2^20

Random sequence: Pr(01000110101111100001) = 1/2^20

Classical probability theory cannot express the notion of randomness of an
individual sequence. It can only express expectations of properties of the
total set of sequences under some distribution.
How to measure information
of a single object?
It has to be an attribute of the object itself.

It should be independent of the description method.

It is defined as the length of the shortest binary program from which the
object can be effectively reconstructed.

Example:
Consider the ensemble of all binary strings of length 9999999999999999.
By Shannon's measure, we require 9999999999999999 bits on average to encode
a string in such an ensemble. However, the length itself can be described in
about 55 bits by expressing it in binary, or even more compactly by
exploiting the repeated pattern: 3^2 × 1111111111111111 (a string of 2^4 1's).
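A quick sanity check of that arithmetic (my own illustration, not part of the slides):

import math

n = 9999999999999999              # the string length used in the example
print(n.bit_length())             # 54: writing n in binary takes ~54 bits
print(math.log2(n))               # ≈ 53.15

# The repeated-digit structure gives an even shorter description:
print(3**2 * int('1' * 2**4) == n)   # True: n = 3^2 × 1111111111111111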
Turing Machine
A Turing machine
T = (S, A, δ, s0, b, F)
consists of:
 S, a finite set of states
 A, an alphabet
 δ : S × (A ∪ {b}) → S × (A ∪ {b}) × {L, R}, the transition function
 s0 ∈ S, the start state
 b, the blank symbol marking unused tape cells
 F ⊂ S, the halting and/or accepting states
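To make the definition concrete, here is a minimal simulator sketch in Python (my own illustration; the transition-table convention and the example machine, a unary successor, are assumptions, not from the slides):

def run_tm(delta, s0, accept, tape, blank='b', max_steps=10_000):
    """Run a deterministic single-tape Turing machine.
    delta maps (state, symbol) -> (new_state, new_symbol, 'L' or 'R').
    Halts when no transition is defined; returns (state, tape, accepted)."""
    tape = dict(enumerate(tape))      # sparse tape: position -> symbol
    state, head, steps = s0, 0, 0
    while (state, tape.get(head, blank)) in delta and steps < max_steps:
        new_state, new_symbol, move = delta[(state, tape.get(head, blank))]
        tape[head] = new_symbol
        head += 1 if move == 'R' else -1
        state, steps = new_state, steps + 1
    cells = ''.join(tape[i] for i in sorted(tape))
    return state, cells, state in accept

# Example machine: scan right over 1's, append one more 1 (unary successor).
delta = {
    ('q0', '1'): ('q0', '1', 'R'),    # keep moving right over the input
    ('q0', 'b'): ('qf', '1', 'R'),    # write a 1 on the first blank, halt
}
print(run_tm(delta, 'q0', {'qf'}, '111'))   # ('qf', '1111', True)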
Happy Birthday Alan!
Alan Turing (23 June 1912 – 7 June 1954) was a British mathematician,
logician, cryptanalyst and computer scientist.

Considered to be the father of modern computer science.

Provided an influential formalization of the concepts of algorithm and
computation with the Turing machine.

Turing was homosexual, living in an era when homosexuality was considered a
mental illness and homosexual acts were illegal. He was criminally
prosecuted, which essentially ended his career.

He died not long afterwards of what was officially declared to be
self-induced cyanide poisoning, although the circumstances of his death
remain ambiguous.
Halting Problem
Given a program and an input to the program, decide whether the program
will eventually halt when run with that input.

The halting problem is famous because it was one of the first problems
proven to be algorithmically undecidable (not computable).

This means there is no algorithm that can be applied to an arbitrary program
and input to decide whether the program stops when run with that input.
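The standard undecidability argument can be sketched in code (my own illustration; halts is a hypothetical decider which, as the argument shows, cannot exist):

def halts(program_source: str, input_data: str) -> bool:
    """Hypothetical halting decider: True iff the given program, run on the
    given input, eventually halts. The diagonal argument below shows that
    no such function can be implemented."""
    raise NotImplementedError  # placeholder only

def diagonal(source: str) -> None:
    """Do the opposite of what halts() predicts for a program run on its
    own source code."""
    if halts(source, source):
        while True:          # predicted to halt -> loop forever
            pass
    # predicted to loop -> halt immediately

# Let D be the source of diagonal (together with halts). Then diagonal(D)
# halts iff halts(D, D) is False iff diagonal(D) does not halt -- a
# contradiction. Hence halts cannot be implemented.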
Occam's Razor
William of Ockham (1290–1349)

 "Entities should not be multiplied beyond necessity."

 Commonly explained as: when you have a choice of theories, choose the
simplest one.

 Bertrand Russell: "It is vain to do with more what can be done with fewer."

 Newton (Principia): "Natura enim simplex est, et rerum causis superfluis
non luxuriat." (Nature is simple, and does not luxuriate in superfluous
causes.)
Ray Solomonoff
(born 1926, Cleveland, Ohio)

Algorithmic probability

Theory of inductive inference

Attended the first meeting where AI became a field
(the 1956 Dartmouth workshop)

…
Andrey Nikolaevich Kolmogorov
(1903, Tambov, Russia – 1987, Moscow)

 Measure Theory
 Probability
 Analysis
 Intuitionistic Logic
 Cohomology
 Dynamical Systems
 Hydrodynamics
 Kolmogorov complexity
Gregory Chaitin
(born 1947, Argentine-American)

Algorithmic information theory


Chaitin’s constant Ω
Biology
Neuroscience
Philosophy
……
Algorithmic information
(Kolmogorov complexity)
Proof of the Invariance theorem
Fix an effective enumeration of all Turing machines (TMs): T1, T2, …
Define K_T(x) = min {|p| : T(p) = x}, the length of a shortest program p on
which T produces x.
Let U be an optimal universal TM such that U(1^n 0 p) = T_n(p).
Then for all x:
K_U(x) ≤ K_{T_n}(x) + n + 1, and |K_U(x) – K_{U'}(x)| ≤ c for any two such
universal machines U, U'.
Fixing U, we write K(x) instead of K_U(x).
[Paul Vitanyi slide]
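A small sketch of the self-delimiting encoding used in this construction, i.e. how U recovers the machine index n and the program p from the single input 1^n 0 p (my own illustration, not from the slides):

def encode(n: int, p: str) -> str:
    """Encode machine index n and program p as the single string 1^n 0 p,
    adding n + 1 bits of overhead -- the source of the '+ n + 1' term."""
    return '1' * n + '0' + p

def decode(w: str) -> tuple[int, str]:
    """Recover (n, p): count leading 1's up to the first 0; the rest is p."""
    n = 0
    while w[n] == '1':
        n += 1
    return n, w[n + 1:]

w = encode(3, '010011')
print(w)                        # '1110010011'
print(decode(w))                # (3, '010011')
print(len(w) - len('010011'))   # 4 = n + 1 extra bits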
Properties and examples
 Intuitively, K(x) is the length of a shortest description of x:
K(x) ≤ |x| + O(1)

 K(x|y) is the length of a shortest description of x given y:
K(x|y) ≤ K(x) + O(1)

 For all x:
 K(x|x) = O(1)
 K(x|ε) = K(x); K(ε|x) = O(1)

 K(x,x) = K(x) + O(1)

 K(x,y) ≤ K(x) + K(y) + O(log(min{K(x), K(y)}))
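K(x) itself is uncomputable, but any real compressor gives an effective upper bound in the spirit of K(x) ≤ |x| + O(1). A small sketch with Python's zlib (my own illustration; zlib is only a crude stand-in for a shortest program):

import zlib

def compressed_len(s: str) -> int:
    """Length in bytes of the zlib-compressed encoding of s: an effective
    upper bound, up to an additive constant, on a description of s."""
    return len(zlib.compress(s.encode(), 9))

regular = '01' * 10_000                        # 20,000 characters, very regular
print(len(regular), compressed_len(regular))   # 20000 vs. a few dozen bytes
print(compressed_len('0' * 100), compressed_len('0' * 100_000))
# the compressed size grows far more slowly than the string itself:
# in spirit, K of such strings is roughly K(n) + O(1)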
Randomness
 Randomness of a string means that it does not contain regularities.
 If the regularities are not effective, then we cannot use them.

 Hence, we consider randomness of strings as the lack of effective
regularities (ones that can be exploited).

 For example: a random string cannot be compressed by any known or unknown
real-world compressor.

[Paul Vitanyi slide]

Intuition:
Randomness = Incompressibility

 For a constant c > 0, a string x ∈ {0,1}* is c-incompressible if
K(x) ≥ |x| - c.
For small constant c, we simply say that x is incompressible.
(We will call incompressible strings random strings.)
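The counting argument behind this (my own sketch of the standard argument): there are 2^n strings of length n but fewer than 2^(n-c) programs of length less than n - c, so at most a 2^(-c) fraction of the strings can be compressed by c or more bits. A random string therefore almost certainly does not compress, which zlib illustrates empirically:

import os, zlib

def compressible_fraction_bound(c: int) -> float:
    """At most 2^(n-c) - 1 programs are shorter than n - c bits, so at most
    a 2^(-c) fraction of all length-n strings can be compressed by >= c bits."""
    return 2.0 ** (-c)

print(compressible_fraction_bound(10))   # at most ~0.1% compress by 10+ bits

# Empirically: random bytes do not compress (zlib even adds a little overhead).
x = os.urandom(10_000)
print(len(x), len(zlib.compress(x, 9)))  # e.g. 10000 vs. roughly 10005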
Shannon Entropy and Kolmogorov
complexity
Shannon entropy of a random variable X over a sample space S:

H(X) = ∑ P(X=x) log 1/P(X=x)

H(X) bits are necessary on P-average to describe the outcome x.

Example: for P uniform over a finite S, we have

H(X) = ∑ (1/|S|) log |S| = log |S|.

Kolmogorov complexity is the minimum description (smallest program) for one
fixed x, and the P-expectation of K(x) converges to the entropy:

H(P) = -∑ P(x) log P(x) is asymptotically equal to the expected complexity
∑_x P(x) K(x); for computable P, the expected complexity lies between H(P)
and H(P) + K(P) + O(1).
[Paul Vitanyi slide]
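A rough empirical illustration of this correspondence (my own sketch, with zlib standing in for K): the per-symbol compressed length of a long i.i.d. sample approaches the entropy of the source.

import math, random, zlib

def entropy_bernoulli(p: float) -> float:
    """Shannon entropy H(p) in bits per symbol of a Bernoulli(p) source."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bits_per_symbol(p: float, n: int = 200_000) -> float:
    """Compressed size in bits per symbol of an n-symbol Bernoulli(p) sample.
    We pack 8 symbols per byte so zlib sees raw bits, not ASCII '0'/'1'."""
    random.seed(0)
    bits = [random.random() < p for _ in range(n)]
    packed = bytes(
        sum(b << j for j, b in enumerate(bits[i:i + 8]))
        for i in range(0, n, 8)
    )
    return 8 * len(zlib.compress(packed, 9)) / n

for p in (0.5, 0.9, 0.99):
    print(p, round(entropy_bernoulli(p), 3), round(bits_per_symbol(p), 3))
# H(0.5) = 1.0, H(0.9) ≈ 0.469, H(0.99) ≈ 0.081; the compressed rates track
# these values (zlib is not an ideal coder, so they sit somewhat above).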


Symmetry of information
I(x;y) = K(y) - K(y|x)
       = K(x) - K(x|y)
       = I(y;x) (up to an additive log term)

(the first expression, K(y) - K(y|x), is read as "the information x knows
about y")


Entropy or KC?
"It has been shown that although in practice we can't be guaranteed to get
the right answer to either the entropy or computational complexity values,
we can be sure that they are (essentially) equal to each other, so both
methods can be useful, depending on what we know about the system, and what
our local goals are."
• Tom Carter's notes
Resource-bounded KC
 Kolmogorov complexity assumes unlimited computational resources.
Kolmogorov himself first observed that we can put resource bounds on such
computations. This was subsequently studied by Barzdins, Loveland, Daley,
Levin, and Adleman.

 In the 1960s, two parallel theories were developed:

 Computational complexity (Hartmanis and Stearns, Rabin, Blum), measuring
time/space complexity
 Kolmogorov complexity, measuring information.
 Resource-bounded KC links the two theories.

[Paul Vitanyi slide]


Theory
 C^{t,s}(x|y) is the t-time, s-space bounded Kolmogorov complexity of x
conditional on y, i.e. the length of a shortest program that, with input y,
produces x in time t(n) and space s(n).

 In standard Kolmogorov complexity it does not matter whether we say
"produces x" or "accepts x"; they are the same. But in the resource-bounded
case they are likely to differ. So Sipser defined CD^{t,s}(x|y) to be the
length of a shortest program that, with input y, accepts x in time t(n) and
space s(n).
 When we use just one parameter, such as time, we simply write C^t or CD^t.

[Paul Vitanyi slide]


Learning as (lossless)
compression:
Q: how do you compress well?

A: by finding patterns in the data, i.e. by learning to predict it.

Compression is learning: in the long run, it is impossible to compress a
source without learning the patterns in it.

Learning is compression: the output of learning from data is a more compact
representation of the data.
Universal learning
KC suggests a universal learning algorithm: given data from a source,
search for a TM that outputs the same distribution.

Of course, there are infinitely many such programs.

One way to encode Occam's razor is to select the smallest such program.
This is roughly the idea behind MDL learning (in reality, MDL uses a
restricted class of programs). A toy sketch of the two-part MDL idea
follows.
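A minimal sketch of two-part MDL model selection (my own illustration; the restricted model class, Markov models of order k, and the fixed-precision parameter cost are assumptions, not from the slides). The score is bits-to-describe-the-model plus bits-to-describe-the-data-given-the-model; the smallest total wins.

import math
import random
from collections import Counter, defaultdict

def mdl_markov_order(s, max_order=3, param_bits=8):
    """Choose a Markov order for string s by a crude two-part code length:
    model bits (one fixed-precision parameter per observed context/symbol
    pair) plus data bits (negative log-likelihood under the smoothed model)."""
    alphabet = sorted(set(s))
    scores = {}
    for k in range(max_order + 1):
        counts = defaultdict(Counter)           # context -> next-symbol counts
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
        data_bits = 0.0
        for ctx, c in counts.items():
            total = sum(c.values())
            for sym, n in c.items():
                p = (n + 1) / (total + len(alphabet))   # Laplace smoothing
                data_bits -= n * math.log2(p)
        model_bits = param_bits * len(alphabet) * len(counts)
        scores[k] = model_bits + data_bits
    return min(scores, key=scores.get), scores

# Data from a noisy first-order source: each bit repeats the previous one
# with probability 0.9. Order 0 underfits; orders 2 and 3 pay for extra
# contexts without predicting better, so the two-part score picks order 1.
random.seed(0)
bits = ['0']
for _ in range(4999):
    flip = random.random() < 0.1
    bits.append(bits[-1] if not flip else '10'[int(bits[-1])])
best, scores = mdl_markov_order(''.join(bits))
print(best)                                     # 1
print({k: round(v) for k, v in scores.items()})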
Universal Measures of
Similarity
Normalized Compression Distance (NCD):

Since KC is uncomputable, we estimate it using gzip (this gives an
overestimate).

Grabbing random Wikipedia articles in 4 languages (Portuguese, Spanish,
Dutch, German), we compute the NCD and find the following distances:

NL1-NL2  0.9062
PT-ES    0.9774, 0.9698
NL-DE    0.9801, 0.9812
PT-NL    0.9872, 0.9871
PT-DE    0.9965, 0.9957
ES-NL    0.9917, 0.9961
ES-DE    0.9975, 1.000
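A minimal NCD sketch (my own illustration; the example strings are made up, not the Wikipedia data above) using the standard formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with Python's zlib (the same DEFLATE algorithm gzip uses) standing in for C:

import zlib

def C(data: bytes) -> int:
    """Compressed length in bytes: an upper-bound proxy for K(data)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for (near-)identical inputs,
    approaching 1 for unrelated inputs."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

en1 = b"the quick brown fox jumps over the lazy dog " * 50
en2 = b"the quick brown fox leaps over the sleepy cat " * 50
de  = b"ein ganz anderer Text ohne gemeinsame Muster " * 50
print(round(ncd(en1, en1), 3))   # near 0: identical inputs
print(round(ncd(en1, en2), 3))   # small: heavily shared vocabulary
print(round(ncd(en1, de), 3))    # larger: little shared structure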
Clustering by Compression

(figure: clustering music by NCD)
Warnings!
Warning: gzip uses superficial features! It won't capture deeper
similarities, since that would require lots of data and computing time.

This is universal learning: an ideal compression algorithm will find and
exploit any pattern. Of course, this means that this method really sucks in
practice!
