I want to know if I am calculating the results are right?

I have applied run length encoding on a string or characters. Let it be
'ttttttPPttttttPPNNNtttttt'
which after Run length encoding turns to be t6P2t6P2N3t6
so I calculated the compression ratio to be (12/25)*100= 48%
Is it correct???

 Réponse acceptée

Walter Roberson
Walter Roberson le 6 Mai 2015
Modifié(e) : Walter Roberson le 6 Mai 2015
Close. The original is length 25 not 26.
But how are you going to handle counts of greater than 9? And how are you going to handle characters that do not repeat?

8 commentaires

Why would it not work for counts greater than then, you could look in the string for something that's not a number and then select the number that comes after, untill another not a number is found. Characters that do not repeat don't have a number at all so they will have a 1 behind them, or if you code it a bit better, they won't have a number behind it at all.
Yes, you are right. I didn't think of this. I am executing the code below....can you help me to convert this to a program which is taking care of both point you mentioned. Please
close all;
clear all;
clc;
X= [ 12 1 2 2 2 2 2 2 2 2 2 2; % 12 1 2 4 12 1 2 4
12 1 2 2 2 2 2 2 2 2 2 2];
[D]=X';
lengh=numel(D);
aaAA=estring(D);
zz=numel(aaAA);
comp_ratio=(lengh)/(zz)
%%%%%------------------------------------------
function y = estring(str)
len = numel(str); %65536
i = 0;
count = zeros(1,len);
y=[];
while( i<len )
j=0;
count(i+1) = 1;
while( true )
j = j + 1;
if( i+j+1 > len )
break;
end
if( str(i+j+1)==str(i+1) )
count(i+1) = count(i+1) + 1;
else
break;
end
end
if( count(i+1)==1 )
a=str(i+1);
length(a);
y = [y a];
i = i + 1;
else
a=str(i+1);
b=count(i+1);
y =[y a b];
i = i + b;
end
end
end
Thomas, what if there was a digit in the original string? For example,
pp222222t
then the original encoding scheme would encode as p226t1 and that is fine as long as you are not allowed to have more than 9 of a character in succession, as you could count on it alternating source-letter and single digit of count. But if your count hit (say) 13, then you start to need two digits of count: ppppppppppppp222222t would encode as p1326t1 if you worked in a straight forward manner. Now try to decode that. Is that 'p1' '326' 't1' ? p33333333333333333333333333t would encode to the same thing. But so would repmat('p',1326) followed by 't'.
A standard way to handle this is to use byte pairs, with the first byte indicating the count, and the second indicating the original byte to be repeated. That limits you to 255 (maximum binary value of an 8 bit byte) repetitions, but since a repetition count of 0 will never occur you can encode the count as 1 lower than it actually is, allowing 1 to 256 repetitions. And if your block of bytes has more than 256 the same, then you then just start a new count for the same character.
There are also adaptations for the case where there are a lot of different bytes in a row but occasional bursts of repeated bytes.
I understand the issue if there are digits in the original string, but OP never gave this as a possibility. My solution only works if this is the case.
Tina, in your code, if the count goes past 2^53-1 then the double precision floating point counter would not be able to keep track of the count:
R = 2^53-1;
(R+1)==(R+2) %returns 1, indicating that it cannot tell the difference between the two values
This is really a hint about something else that is probably wrong with the code, but we cannot say because you did not document the conditions under which the code is expected to work. What is the source of input to be, what data type? What is the output to be?
I am applying run length encoding after EZW ( embedded zero tree wavelength encoding). EZW gives a dominant list D as output. the D contains the string containing mixture of the symbols T n Z P p N( 6 symbols), as I given in my example. Earlier,I didn't give attention to both points of Walter Roberson. But now I want to make changes in my code given earlier. Please help me.
Walter....I checked my code and I came to know that for each repetition it is giving a space in the output....either it is 10 times repetition or 1 time. So your second doubt of greater than 9 is solved here. Now please help me to solve your doubt no 1. that is putting non repeating character with 1 count....e.g. ttPtt is my input string than I am getting output as t Pt * . I want to change this to *t P t
Your aaAA is a vector that contains a mix of original characters and binary counts. Because of the details of how you built it, the vector will be of type char(). The spaces or not that you see are the results of looking at char() of the binary counts. For example if the count happened to be 9 then the corresponding position would hold char(9) which is the tab character. If you want the character '9' to show up instead of binary 9, then you need to change your program, and you need to take into account that counts of more than 9 might occur so that multiple output character positions in the string might be required to do the encoding.

Connectez-vous pour commenter.

Plus de réponses (1)

a='ttttttPPttttttPPNNNtttttt';
b='t6P2t6P2N3t6';
A=whos('a');
B=whos('b');
compression=B.bytes/A.bytes*100;
compression =
48

7 commentaires

compressio ratio can be up to 100?
So I actually made a mistake, it should be A/B instead of B/A. But you get the idea!
in this case compression ratio would be 208. Is it too possible? CR can be greater than 100?
yes, but you would probaly write it as 2.08, don't multiply my 100 :)
you sure about this?
What Thomas had written originally was the calculation for percentage of the original file size, but as I explained earlier the standard for compression ratio is number of original bytes to number of compressed bytes which is the inverse and has no percentage calculation. If the compressed file was 50% of the original size then it is half of the original size, so two bytes of original data are represented by 1 byte of compressed data for a ratio of 2:1. Better compression would have the first number increase. 4:1 would mean 4 bytes of original became 1 compressed, so the compressed file was 1/4 (25%) of the original size.
In general, using the whos() byte count to define the compression ratio is wrong. whos() will report the number of bytes used to store the data in the data type you used for your storage, but if you did not pack that datatype completely full then you could create a byte array that was smaller and would be the "real" size.
For example it might be convenient (or sloppy programming) to use double precision values to store results that are integral from 0 to 2^24-1 . That's only 3 bytes worth of value stored in 64 bits (8 bytes). The 3 useful bytes from each value could be saved to an array or written to a file, producing a smaller result.
When you use the whos() byte count you need to understand how much information you have packed into those bytes and adjust accordingly.
For this reason, the calculations would more often be done not by asking the programming language for the number of bytes used for storage (which might include overheads), but instead by multiplying the number of elements stored by the number of used bytes per element. In some cases it instead becomes more useful to multiply the count of the number of elements stored by the number of useful bits per element, and divide the total by the bits-per-byte of 8, rounding up: the result is the number of bytes that the information could be written to in a file, after going through procedures to pack down the information. It isn't always worth going through that trouble of packing down the information just to prove that you could. For example you might run different algorithms, find the one with the smallest theoretical size, and only then bother to do the packing down, since packing down can be a sort-of-expensive operation when you can't access hardware bit-rotate instructions.

Connectez-vous pour commenter.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by