Text Representation and Compression
Module 2
Dr Pushpavathi.K.P, BMSCE
Types of text
• Unformatted text – Plain text.
• Formatted text – Rich text.
• Hypertext.
Dr Pushpavathi.K.P, BMSCE
Unformatted text
• ASCII character set.
• Printable characters- alphabetic, numeric, punctuation.
• Control characters:
1. Format control character- BS, SP, DEL, ESC.
2. Information separators - FS, RS.
3. Transmission control characters – SOH, STX, ETX, ACK, NAK,
SYN.
• Mosaic characters- supplementary version – to create relatively
simple graphical images.
Dr Pushpavathi.K.P, BMSCE
Character set to produce unformatted text
ASCII
character set
Dr Pushpavathi.K.P, BMSCE
Character set to produce unformatted text (cntd.)
Mosaic characters
Dr Pushpavathi.K.P, BMSCE
Formatted text
• Publishing sector.
• Characters of different styles and variable size.
• Structure a document into chapters, paragraph.
• Graphics and pictures inserted at appropriate points.
• Enter a specific command.
• Reserved format control character.
• Numeric or alphabetic character.
Dr Pushpavathi.K.P, BMSCE
Formatted text (cntd.)
a) formatted text string b) printed version of string
Dr Pushpavathi.K.P, BMSCE
Hyper Text
• Pages
• Defined linkage points-hyperlinks
• Electronic version of the documents.
• Browser.
• Home page
• URL
• HTML- presentation of the document.
• Directives – page formatting commands sandwiched between pair of
tags , eg: <p> , <B>text</B>
Dr Pushpavathi.K.P, BMSCE
Electronic document written using hypertext
Dr Pushpavathi.K.P, BMSCE
Dr Pushpavathi.K.P, BMSCE
Digital document
Dr Pushpavathi.K.P, BMSCE
Digital document (cntd.)
• Vertical resolution – 3.85 or 7.7 lines/mm or 100 or 200 lines/inch
• o/p of scanner – Resolution of apprx. 8 pels
• Bitonical images –black and white
• One binary digit –represent each pel
0 for white , 1 for black pel
• Typicla image – stream of 2 million bits - uncompressed
Dr Pushpavathi.K.P, BMSCE
Facsimile conversion codes
• ITU-T group 3 and group 4- standards
a) Termination codes – run length – 0 to 63
b) Make-up codes – run length – multiples of 64
• Table of code words –produced - based on relative frequency of
occurrence of no. of white and black pels in scanned lines.
• Code words are fixed
Dr Pushpavathi.K.P, BMSCE
Facsimile conversion codes (cntd.)
Dr Pushpavathi.K.P, BMSCE
Compression principles
• Source encoders and destination decoders.
• Lossless and lossy compression.
• Entropy encoding.
• Source encoding.
Dr Pushpavathi.K.P, BMSCE
Source Encoders and Destination Decoders
Dr Pushpavathi.K.P, BMSCE
Lossless and lossy compression
• Lossless compression algorithm – reversible.
• Example – text file.
• Lossy compression algorithm – version of it.
• Higher level of compression – more approximated the received
version.
• Example – images, audio and video streams.
Dr Pushpavathi.K.P, BMSCE
Entropy encoding:
• Lossless and independent.
• Information representation.
• Two examples:
1. Run length encoding.
2. Statistical encoding.
Dr Pushpavathi.K.P, BMSCE
Entropy encoding (cntd.)
• Entropy of the source – minimum average no of bits required to
transmit a particular source stream
• Qualitatively - defined as the useful information
• Ebits(C) = -log2(PC) , where Ebits(C) –entropy of character C, - in bits
PC – probability of character C in given data
• If the stream contains n characters or symbols ,
Η = 𝜂 = − σ𝑛𝑖=1 𝑃𝑖𝑙𝑜𝑔2(𝑃𝑖 ) = σ𝑛𝑖=1 𝑃𝑖𝑙𝑜𝑔2(1/𝑃𝑖 )
• Higher the probability of occurrence of a symbol, lower the entropy
Dr Pushpavathi.K.P, BMSCE
Run length encoding:
• Repetitive Character encoding
• Used - Source information comprises of long substrings of same
character or bit.
• Transmitted as different set of code words.
• Indicates particular transmitted bit and the number of characters in
substring.
• Destination knows the set of code words being used.
Dr Pushpavathi.K.P, BMSCE
Statistical encoding:
• Variable length code word.
• Prefix property.
• Huffman encoding algorithm.
• Entropy.
entropy of source
• Efficiency of encoding scheme =
average number of bits per codeword
Where, average number of bits per codeword = σ𝑛𝑖=1 𝑁𝑖𝑃𝑖
Dr Pushpavathi.K.P, BMSCE
Source encoding:
• Differential encoding.
• Transform encoding.
Dr Pushpavathi.K.P, BMSCE
Differential encoding:
• Amplitude of signal covers larger range, but difference in amplitude
between successive values is relatively small.
• Smaller set of codewords.
• Indicates difference in amplitude between current signal and
immediately preceding value.
• Can be lossless or lossy – depending on number of bits to encode
different value.
Dr Pushpavathi.K.P, BMSCE
Transform encoding
• Transforming the source information.
• No loss of information.
• Digitization of image produces two dimensional matrix.
• Rate of change in magnitude – Spatial frequency.
• Horizontal and vertical frequency components.
• Human eye is less sensitive to higher spatial frequency.
• DCT.
Dr Pushpavathi.K.P, BMSCE
Transform encoding: pixel pattern
Dr Pushpavathi.K.P, BMSCE
Transform encoding : DCT principle
Dr Pushpavathi.K.P, BMSCE
Text Compression
• Compression algorithm must be lossless.
• Entropy encoding – Statistical encoding.
• Two types of Statistical encoding:
1. Optimum set of codewords – Huffman coding, Arithmetic coding
2. Variable length characters – Lempel-Ziv (LZ) algoritm.
• Two types of coding :
1. Static coding.
2. Dynamic or adaptive coding.
Dr Pushpavathi.K.P, BMSCE
Static Huffman Coding
• Transmitted character string is analyzed.
• Determine character types and relative frequency.
• Coding operation involves creating an unbalanced tree.
• Huffman code tree – Binary tree.
• Branches are assigned the value 0 or 1.
• Root node (top), branch node , leaf node (termination point).
Dr Pushpavathi.K.P, BMSCE
Huffman code tree
Example - character string : AAAABBCD (Final tree with codes)
Dr Pushpavathi.K.P, BMSCE
Dr Pushpavathi.K.P, BMSCE
Decoding Algorithm
• Assume code words is available at receiver.
• Received bit stream – variable BIT-STREAM.
• Bits in each code word – variable CODEWORD.
• ASCII code word – variable RECEIVE-BUFFER.
Dr Pushpavathi.K.P, BMSCE
Dr Pushpavathi.K.P, BMSCE
Dr Pushpavathi.K.P, BMSCE
Decoding (cntd.)
• code words related to data.
• Done in two ways.
• Adaptive Compression.
• Disadvantage- new set of code words for new data- overheads
Dr Pushpavathi.K.P, BMSCE
Dynamic (adaptive) Huffman Coding
• Both transmitter and receiver build the Huffman tree –dynamically.
• If the character to be transmitted is present in tree – its codeword is
determined and sent in normal way.
• If not present (first occurrence) – character is transmitted in
uncompressed form.
• Tree is updated by – 1. incrementing the frequency of occurrence
or 2. introducing the new character to the tree.
Dr Pushpavathi.K.P, BMSCE
Adaptive Huffman tree
Dr Pushpavathi.K.P, BMSCE
Ref: multimedia communication , components, Techniques, standards – Krishna kumar D N
Incrementing the count : if next symbol is ‘A’
Dr Pushpavathi.K.P, BMSCE
Ref: multimedia communication , components, Techniques, standards – Krishna kumar D N
Sibling property - swapping
One more increment in ‘A’
Dr Pushpavathi.K.P, BMSCE
Ref: multimedia communication , components, Techniques, standards – Krishna kumar D N
Swapping of internal nodes
Receiving two more ‘A’ s
Dr Pushpavathi.K.P, BMSCE
Ref: multimedia communication , components, Techniques, standards – Krishna kumar D N
Arithmetic Coding
• Complicated than Huffman coding.
• Assigns one code for each encoded string of characters – unlike
Huffman that use separate codeword for each character.
• Coding start with a certain interval – read input symbol by symbol –
use the probability of each symbol to narrow the interval
• High probability symbols contribute fewer bits to output
• Output is interpretedas a number in the range [0,1).
Dr Pushpavathi.K.P, BMSCE
Arithmetic Coding (cntd.)
e=0.3, n=0.3, t=0.2, w=0.1, .=0.1 ‘WENT’
Dr Pushpavathi.K.P, BMSCE
Arithmetic Coding (decoding)
• Decoder knows:
• Set of characters present in the received encoded messages
• Segment to which each character has been assigned
• Its related range.
• Follow the same procedure as in encoder.
• No. of decimal digits increase linearly with no. of characters in string.
• Complex messages –first fragmented into multiple smaller strings –
encoded separately
• Resulting set of code words – sent as blocks of floating point numbers
(binary) in known formats.
Dr Pushpavathi.K.P, BMSCE
Lempel-Ziv Coding
• Uses strings of characters for the coding instead of single characters.
• Table holds character string of the text (all possible words).
• Encoder sends only the index of the word.
• Decoder use this index to access the word.
• Table is used as dictionary.
• Dictionary based compression algorithm.
• Word processing packages have dictionary-15 bits (25000 words).
• Shorter words – lower compression ratio, longer words – higher.
Dr Pushpavathi.K.P, BMSCE
Lempel-Ziv-Welsh Coding
• Contents of dictionary is built dynamically by encoder and decoder.
• Initially dictionary consists of only the character set-ASCII.
• To determine compression level - no. of entries in dictionary – no.of
bits needed for index.
• Ex : this is simple as it is
Dr Pushpavathi.K.P, BMSCE
Lempel-Ziv-Welsh (LZW) compression algorithm
Basic operation
Dr Pushpavathi.K.P, BMSCE
Dr Pushpavathi.K.P, BMSCE
Reference
• Multimedia communications- Applications , networks, Protocols and
Standards – Fred Halsall
Dr Pushpavathi.K.P, BMSCE