Checking repetition of random data
I need your help, please. I generated random data, for example T1 = randn(1000,1); T2 = randn(1000,1); ... T100 = randn(1000,1); and I want to check whether any of the T's is repeated and, if so, remove it. How can I do that? Thanks in advance :)
Regards, Ahmed
11 comments
Adam
on 4 Jan 2018
doc unique
Student for ever
on 4 Jan 2018
Jan
on 4 Jan 2018
Do you mean repetition inside each vector, or between elements of all different vectors?
Student for ever
on 4 Jan 2018
Image Analyst
on 4 Jan 2018
Are you setting a seed? Do you know what a seed is?
@Ahmed: This does not answer my question. Do you want to avoid repetitions of elements inside each T, or should different T vectors not have the same value at the same index, or should the elements of each T not appear anywhere in any other T, or should the vectors T be different while single values can be identical? Which kind of "repetition" has to be avoided in your problem? Should the corresponding T vector be removed, or replaced by new data, or combined, or re-ordered?
Are you talking about time series or data created by randn?
Student for ever
on 7 Jan 2018
Student for ever
on 7 Jan 2018
Student for ever
on 7 Jan 2018
Star Strider
on 7 Jan 2018
@Ahmed: See the documentation on rng and, more generally, the discussion on Generate Random Numbers That Are Repeatable.
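To illustrate why the seed matters for this question, here is a minimal sketch, assuming the default generator. The seed value 1 and the variable names A, B and C are arbitrary choices made for this example only.

```matlab
rng(1);                 % fix the seed (the value 1 is arbitrary)
A = randn(1000, 1);
rng(1);                 % same seed again: the generator restarts from the same state
B = randn(1000, 1);
C = randn(1000, 1);     % drawn without reseeding: continues the random stream

sameAB = isequal(A, B); % true, because B repeats A exactly
sameAC = isequal(A, C); % false, because C is a fresh draw from the stream
```

So exact repetitions between randn vectors are to be expected mainly when the seed is reset between the calls.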
Student for ever
on 7 Jan 2018
Edited: Jan
on 7 Jan 2018
Accepted Answer
More Answers (2)
Birdman
on 4 Jan 2018
Firstly, generate the random data as follows:
T=randn(1000,100);
Secondly, as Adam said, use the unique function to check for repetitions.
Tun=unique(T,'stable');
The 'stable' flag preserves the original order of the values.
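A small sketch of what this call returns for a matrix may help here. It assumes the sequences are stored as columns, as in T=randn(1000,100); the tiny matrix and the names vals and cols are made up for illustration.

```matlab
T = [1 2 1;
     3 4 3];                             % columns 1 and 3 are identical

vals = unique(T, 'stable');              % unique scalar values, returned as a column vector
cols = unique(T.', 'rows', 'stable').';  % unique columns, original order kept

disp(vals.');                            % 1 3 2 4
disp(cols);                              % [1 2; 3 4]
```

Without the 'rows' option, unique compares single elements, not whole vectors, so whether it fits depends on which kind of repetition is meant.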
5 comments
Student for ever
on 4 Jan 2018
Yes Jan, exactly. But of course the initial vector T can also be overwritten. It is up to the user.
You are welcome Ahmed.
Edit: Jan, using the 'stable' flag is just a habit of mine. I do not want to lose the order of the data I am working with, so I use it, but of course it can be removed if wanted.
Jan
on 4 Jan 2018
A "habit"? :-) I'd suggest using time-consuming methods only if they are needed for the results.
Birdman
on 4 Jan 2018
It is needed for the result, exactly.
John BG
on 5 Jan 2018
Hi Ahmed
So far, the supplied answers increase the probability of generating random Ts that are all different.
Each of the answers improves the randomness of the generation. Yet if you really want to make sure that all T sequences are different once generated, for instance because you do not really have control over the randomness of the data and the suggested randn(1000,1) is only your model, then there is no other way than comparing them in pairs.
1. Let N be the number of T sequences:
N=5
2. Then all possible pairs of T sequences are:
L=combinator(N,2,'c')
L =
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
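combinator is not a built-in function (it comes from the File Exchange). As a hedged alternative, assuming only base MATLAB is available, the same list of index pairs can be sketched with nchoosek:

```matlab
N = 5;
L = nchoosek(1:N, 2);   % each row is one pair [i j] with i < j
nPairs = size(L, 1);    % N*(N-1)/2, i.e. 10 pairs for N = 5
disp(L);
```

The number of pairs grows quadratically with N, which matters for the run time of any loop over L.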
3. As Jan Simon mentions, it is often more practical to put all data in a single array that can be indexed, instead of working with N separately named sequences.
Let T be all your input Ti sequences compiled into a single matrix, one sequence per row:
T=randi([1 10],N)
T =
8 2 3 9 3
3 5 8 10 9
7 10 3 6 3
7 4 6 2 9
2 6 7 2 3
4. Checking that no two sequences are equal:
D=[0 0]; % dummy first row, removed in step 5
for k=1:1:size(L,1)
    if isequal(T(L(k,1),:),T(L(k,2),:))
        D=[D;L(k,:)]; % record the pair of identical rows
    end
end
5. Removing repeated sequences:
if size(D,1)>1
    D(1,:)=[]; % drop the dummy first row
    T(D(:,1),:)=[]; % remove the first member of each identical pair
end
T
Ahmed, I overwrote some sequences on purpose to test this: D lists the repeated pairs that were found, and these few lines remove every repetition while keeping one copy of each sequence, even when a sequence is repeated more than once.
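The test with overwritten sequences is not shown in the post, so the following is only a sketch of how it might look. The planted duplicates in rows 4 and 5, the use of nchoosek in place of combinator, and the empty start value of D are assumptions of this example.

```matlab
N = 5;
T = randi([1 10], N);        % one sequence per row
T(4,:) = T(2,:);             % planted duplicate of row 2
T(5,:) = T(2,:);             % second planted duplicate of row 2
expected = size(unique(T, 'rows'), 1);   % number of distinct rows

L = nchoosek(1:N, 2);        % assumption: built-in stand-in for combinator(N,2,'c')
D = zeros(0, 2);             % empty list of identical pairs
for k = 1:size(L, 1)
    if isequal(T(L(k,1),:), T(L(k,2),:))
        D = [D; L(k,:)];     %#ok<AGROW> small example only
    end
end
T(D(:,1), :) = [];           % drop the first member of each identical pair
```

For a group of identical rows, every row except the last one appears as the first member of some pair, so exactly one copy of each sequence remains.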
If you find this answer useful would you please be so kind to consider marking my answer as Accepted Answer?
To any other reader, if you find this answer useful please consider clicking on the thumbs-up vote link
thanks in advance for time and attention
John BG
12 comments
If a loop is wanted for any reason, growing arrays iteratively should be avoided, because it is extremely inefficient. Step 4 could be:
D = zeros(size(L, 1), 2); % Pre-allocation!!!
iD = 0;
for k = 1:size(L,1)
    if isequal(T(L(k,1),:),T(L(k,2),:))
        iD = iD + 1;
        D(iD, :) = L(k, :);
    end
end
D = D(1:iD, :); % Crop unneeded elements
Or even leaner by storing only one logical flag per pair:
dup = false(size(L, 1), 1); % Pre-allocation: one flag per pair
for k = 1:size(L,1)
    if isequal(T(L(k,1),:),T(L(k,2),:))
        dup(k) = true; % no break here, so that all identical pairs are found
    end
end
L = L(dup, :);
Two nested loops are easy here, so calling combinator is not needed:
nT = size(T, 1);
keep = true(nT, 1); % Pre-allocation!!!
for i1 = 1:nT
    Ti1 = T(i1, :);
    for i2 = i1 + 1:nT
        if isequal(Ti1, T(i2, :))
            keep(i1) = false;
            break; % No need to proceed the search
        end
    end
end
T = T(keep, :);
But the set of unique vectors can be obtained much more easily with a single built-in function:
[T, Idx] = unique(T, 'rows')
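One detail worth knowing: by default unique returns the rows sorted, while the 'stable' option keeps the order of first appearance. A small sketch with made-up data (the names Tsorted, Tstable and the index variables are hypothetical):

```matlab
T = [3 4; 1 2; 3 4; 1 2; 5 6];                       % rows 3 and 4 repeat rows 1 and 2

[Tsorted, idxSorted] = unique(T, 'rows');            % distinct rows, sorted
[Tstable, idxStable] = unique(T, 'rows', 'stable');  % distinct rows, original order

disp(Tsorted);   % [1 2; 3 4; 5 6]
disp(Tstable);   % [3 4; 1 2; 5 6]
```

In both cases the second output indexes into the original matrix, so T(idxStable,:) reproduces Tstable.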
John BG
on 6 Jan 2018
Timing the variants for 100 sequences shows that unique is the fastest option:
N=100
M=1000
% T=randi([1000 9999],N,M);
T=repmat(randi([1000 9999],1,M),N,1);
tic
D=[0 0];
L=combinator(N,2,'c');
for k=1:1:size(L,1)
    if isequal(T(L(k,1),:),T(L(k,2),:))
        D=[D;L(k,:)];
    end
end
if size(D,1)>1
    D(1,:)=[];
    T(D(:,1),:)=[]; % removing one of the repeated identical pairs
end
toc
100: Elapsed time is 0.068783 seconds.
1000: Elapsed time is 7.953736 seconds.
single string repeated 100 times: Elapsed time is 0.112921 seconds.
tic
L=combinator(N,2,'c');
D = zeros(size(L, 1), 2); % Pre-allocation!!!
iD = 0;
for k = 1:size(L,1)
    if isequal(T(L(k,1),:),T(L(k,2),:))
        iD = iD + 1;
        D(iD, :) = L(k, :);
    end
end
D = D(1:iD, :);
toc
100: Elapsed time is 0.075002 seconds.
1000: Elapsed time is 7.884542 seconds.
single string repeated 100 times: Elapsed time is 0.085103 seconds.
tic
L=combinator(N,2,'c');
dup = false(size(L, 1), 2); % Pre-allocation!!!
for k = 1:size(L,1)
    if isequal(T(L(k,1),:),T(L(k,2),:))
        dup(k) = true;
        break;
    end
end
L = L(dup, :);
toc
100: Elapsed time is 0.062778 seconds.
1000: Elapsed time is 7.863167 seconds.
single string repeated 100 times: Elapsed time is 0.030683 seconds.
tic
nT = size(T, 1);
keep = true(nT, 1); % Pre-allocation!!!
for i1 = 1:nT
    for i2 = i1 + 1:nT
        if isequal(T(i1, :), T(i2, :))
            keep(i1) = false;
            break; % No need to proceed the search
        end
    end
end
T = T(keep, :);
toc
100: Elapsed time is 0.068909 seconds.
1000: Elapsed time is 7.784486 seconds.
single string repeated 100 times: Elapsed time is 0.034376 seconds.
tic
[T2, Idx] = unique(T, 'rows');
toc
100: Elapsed time is 0.023476 seconds.
1000: Elapsed time is 0.061907 seconds.
single string repeated 100 times: Elapsed time is 0.024031 seconds.
As the number of sequences increases, unique outperforms every other solution in run time.
Regards
John BG
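Single tic/toc measurements can vary quite a bit from run to run. As a hedged alternative, assuming a MATLAB release that ships timeit, the unique variant could be timed like this (the variable name tUnique is made up, and no specific timing is claimed here):

```matlab
N = 100;
M = 1000;
T = randi([1000 9999], N, M);               % same test data shape as above

tUnique = timeit(@() unique(T, 'rows'));    % typical run time over repeated calls, in seconds
fprintf('unique(T,''rows''): %.6f s\n', tUnique);
```

The loop variants would have to be wrapped in functions to be timed the same way.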
Student for ever
on 7 Jan 2018
Student for ever
on 7 Jan 2018
Hi Ahmed
ok, columns
N=200
M=1000
% T=randi([1000 9999],N,M); % test
% T=repmat(randi([1000 9999],1,M),M,N); % test
T=randi([1000 9999],M,N);
tic
D=[0 0];
L=combinator(N,2,'c');
for k=1:1:size(L,1)
    if isequal(T(:,L(k,1)),T(:,L(k,2)))
        D=[D;L(k,:)];
    end
end
if size(D,1)>1
    D(1,:)=[];
    T(:,D(:,1))=[]; % removing repetitions
end
toc
Elapsed time is 0.092280 seconds.
Also
Please correct me if I am wrong, but:
1.- If you had wanted to use the unique command you would already have done so, yet unique requires all series to have the same length.
2.- If the lengths of the series vary, then unique, as suggested here, cannot be used, and even the suggested loops need further refinement.
How would you like to proceed: is the single command unique OK, or do the lengths of the 200 samples vary from sample to sample?
Does the following solve the question? It is quite fast:
D=[0 0];
L=combinator(N,2,'c');
for k=1:1:size(L,1)
    if isequal(T(:,L(k,1)),T(:,L(k,2)))
        D=[D;L(k,:)];
    end
end
if size(D,1)>1
    D(1,:)=[];
    T(:,D(:,1))=[]; % removing repetitions
end
Student for ever
on 8 Jan 2018
Jan
on 8 Jan 2018
@Ahmed: Then see my answer. Just transpose the input.
John BG
on 8 Jan 2018
Hi Ahmed
Correct, what is the point of simulating the data with randn if you can work directly on the data?
Ahmed, would you please be so kind as to confirm that you have accepted the answer using the unique command?
@John BG: Why should Ahmed confirm this? As you know, only the OP can accept an answer in the first week. This fact was mentioned just some days ago: https://www.mathworks.com/matlabcentral/answers/375136-solving-system-of-equations#comment_520771 . You can take a look into "More > Recent Activity" also: See 8 Jan 2018 at 11:30.
You have started too many discussions about accepting answers already.
Student for ever
on 9 Jan 2018
Edited: Student for ever
on 9 Jan 2018