How does 'unique' work under the hood (in general terms)

10 vues (au cours des 30 derniers jours)
Erik Johannes Loo
Erik Johannes Loo le 14 Fév 2022
I am just curious as to how this function works by default, under the hood.
I know that a couple of ways of implementing unique manually is to copy over the elements from one array into a hash table or a binary search tree.
Given that the default implementation of 'unique' also sorts the elements, I would guess that it maybe copies over the elements into a binary search tree.
As a result, if I need to both sort and remove duplicate elements/rows from an array I would only need to call 'unique' instead of sort(unique(...)) or sortrows(unique(...), ...)?
  3 commentaires
Erik Johannes Loo
Erik Johannes Loo le 14 Fév 2022
Thank you for the file - it could come in handy. I was asking in case someone did in fact know how TheMathWorks implemented the function, out of curiosity. It seems that for the 'sorted' case they probably use sort followed by diff.
What is interesting is that using hash should in theory have a smaller run-time complexity, of O(n) instead of O(n log n), and it would also preserve the order which might be the expected default behaviour by most MATLAB users.
Rik
Rik le 14 Fév 2022
Those who know can't tell, those who can tell don't know. You could try to dig in the source code, but you can't do anything with that knowledge.
I would be very hesitant to apply a big O notation on my function. I rolled my own hashing algorithm. You should be deeply, DEEPLY, mistrustful of me. It works for what I do with it: confirm my test functions return the same output. I only use this unique_array function if I'm there to catch any strange errors.
I suspect the unique function will run different code when it is called with the 'stable' flag. Mathworks employs smart people, who're probably aware of this trade-off.

Connectez-vous pour commenter.

Réponse acceptée

Bruno Luong
Bruno Luong le 14 Fév 2022
Modifié(e) : Bruno Luong le 14 Fév 2022
If I have to implement unique, I would implemented by sorting the find the difference similar to this
a=randi(10,1,100);
% u is unique of a
s=sort(a);
j=[true,diff(s)~=0];
u=s(j)
Working with contiguous array if much more efficient than binary search and hash function. But TMW doesn't not document how unique works.
  1 commentaire
Erik Johannes Loo
Erik Johannes Loo le 14 Fév 2022
Thanks for your reply. The fact that TMW does not disclose how 'unique' works explains why I wasn't able to find anything before.
In my mind I can see that the runtime complexity would be the same for your approach and the binary search approach O(n log n) - dominated by either the sort function or the BST insert function, but I have in fact noticed that working with arrays is much faster and can be done allocating minimal additional memory (the boolean j array).

Connectez-vous pour commenter.

Plus de réponses (1)

Walter Roberson
Walter Roberson le 14 Fév 2022
unique() does a sort() and then checks where adjacent elements are different.
You can read the code; it is a .m file.

Catégories

En savoir plus sur Shifting and Sorting Matrices dans Help Center et File Exchange

Produits


Version

R2018b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by