How to efficiently calculate item-based user similarity when there is a huge number of users?

I have about 75000 users, and I want to calculate the similarity between each of them based on whether they liked certain items or not (1 for liked, 0 for not liked). The items can be present multiple times for each user, and the user's responses to each item are independent of those in the previous instances. This is my code to calculate user similarities:
data=csvread('datafile.csv');
users = unique(data(:,10)); % Users are in the 10th column in the dataset
users(:,2) = histc(data(:,10),unique(data(:,10)));
users = flipdim(sortrows(users,2),1); % Arranging users in descending order of 'activity' (i.e., in order of the number of items liked or disliked by the users)
users = users(1:100,:); % For 100 users, it took about 2 hours to finish running.
filename = 'C:\Users\hp1\Desktop\location_similarity.csv'; % The file where I am saving the similarities, as the similarity matrix is too huge for storing in memory
for a=1:size(users,1)
    A=data(data(:,10)==users(a,1),12); % Items are in the 12th column of the dataset
    catA(:,1)=unique(A);
    catA(:,2)=histc(A,unique(A));
    totalA = sum(catA(:,2));
    catA(:,3)=catA(:,2)/totalA; % Calculating the fraction of items the current user liked
    allCatA(:,1)=unique(data(:,2));
    allCatA(:,2)=zeros(size(allCatA,1),1);
    % Calculating the current user's item preferences
    for k=1:size(catA,1)
        for l=1:size(allCatA,1)
            if catA(k,1)==allCatA(l,1)
                allCatA(l,2)=catA(k,3);
            end
        end
    end
    for b=1:size(users,1)
        B=data(data(:,10)==users(b,1),2);
        catB(:,1)=unique(B);
        catB(:,2)=histc(B,unique(B));
        totalB = sum(catB(:,2));
        catB(:,3)=catB(:,2)/totalB;
        allCatB(:,1)=unique(data(:,2));
        allCatB(:,2)=zeros(size(allCatB,1),1);
        for m=1:size(catB,1)
            for n=1:size(allCatB,1)
                if catB(m,1)==allCatB(n,1)
                    allCatB(n,2)=catB(m,3);
                end
            end
        end
        sim(1,b) = corr(allCatA(:,2),allCatB(:,2)); % Similarity between the 2 users based on correlation of their item preference vectors
        clear catB; clear allCatB;
    end
    dlmwrite(filename, sim(1,:), '-append'); % Saving the similarities of the row in the file
    clear catA; clear allCatA;
end
But this code takes a huge time to finish running (about 2 hours to run on only 100 users!). How do I calculate user similarities without using for loops to cut short the time required to run? Any help is appreciated. Thanks.

Answers (2)

Guillaume on 8 Jun 2017
But this code takes a huge time to finish running (about 2 hours to run on only 100 users!).
No wonder! You're calculating your allCat 100x101 times, which is 101 times more than it needs to be. Plus you're doing it very inefficiently.
In fact, your code is full of convoluted statements such as:
users = flipdim(sortrows(users,2),1); % Arranging users in descending order of 'activity'
which is simply:
users = sortrows(users, 2, 'descend');
and plenty of repeated calls to unique on the same data, e.g.:
users = unique(data(:,10)); % Users are in the 10th column in the dataset
users(:,2) = histc(data(:,10),unique(data(:,10)));
which should be:
users = unique(data(:,10)); % Users are in the 10th column in the dataset
users(:,2) = histc(data(:,10), users(:, 1));
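As a possible alternative, the per-user counts can be taken from the third output of unique together with accumarray, so the user column is scanned only once. This is a minimal sketch on made-up data; userCol is a hypothetical stand-in for data(:,10).

```matlab
% Hypothetical user column standing in for data(:,10)
userCol = [5; 3; 5; 5; 9; 3];

% idx maps every record to the row of its user in userIDs
[userIDs, ~, idx] = unique(userCol);

% Count how many records each user has
counts = accumarray(idx, 1);

% Same layout as in the question: [userID, activityCount]
users = [userIDs, counts];
```

The result has one row per distinct user, sorted by user ID, with the record count in the second column.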
But fixing your redundant calculation of allCat should be your priority. You need to go over each user only once, e.g:
allcat = unique(data(:, 2)); %No idea what data(:, 2) is; it needs to hold the same item IDs as column 12 for the matching below to find anything
nusers = size(users, 1);
allcat = [allcat, zeros(size(allcat, 1), nusers)]; %pre-allocate as many additional columns as there are users
for iuser = 1:nusers
    items = data(data(:, 10) == users(iuser, 1), 12);
    uitems = unique(items);
    itemcount = histc(items, uitems);
    itemratio = itemcount / sum(itemcount);
    %the following is a lot more efficient than your double for loops k and l (and m and n):
    [matched, whichitem] = ismember(allcat(:, 1), uitems);
    allcat(matched, iuser + 1) = itemratio(whichitem(matched)); %column 1 holds the item IDs, so user columns start at 2
end
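To illustrate how the two-output form of ismember stands in for the nested matching loops, here is a small sketch on made-up values. The names allItems and prefs, and the numbers themselves, are hypothetical.

```matlab
allItems  = [10; 20; 30; 40];   % hypothetical full item list
uitems    = [20; 40];           % items one user interacted with
itemratio = [0.25; 0.75];       % that user's fraction for each of those items

% matched(i) is true when allItems(i) occurs in uitems;
% whichitem(i) is its position in uitems (0 when not found)
[matched, whichitem] = ismember(allItems, uitems);

% Scatter the user's fractions into a vector aligned with allItems
prefs = zeros(size(allItems));
prefs(matched) = itemratio(whichitem(matched));
```

Items the user never touched keep a preference of zero, which is what the double loop produced as well.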
Then you can do the correlation in just one go between all the user columns:
similarity = corr(allcat(:, 2:end));
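The loop over users could probably be dropped altogether by building the item-by-user count matrix with accumarray. The sketch below runs on toy data and assumes that users and items each sit in a single column; userCol and itemCol are hypothetical stand-ins for data(:,10) and data(:,12). It uses corrcoef from base MATLAB rather than corr, and bsxfun so that it also runs on older releases.

```matlab
% Toy data: one row per (user, item) record
userCol = [1; 1; 1; 2; 2; 3; 3; 3; 3];
itemCol = [10; 10; 20; 10; 30; 20; 20; 30; 10];

[userIDs, ~, u]  = unique(userCol);   % u:  user index of each record
[itemIDs, ~, it] = unique(itemCol);   % it: item index of each record

% counts(i, j) = number of records of user j for item i
counts = accumarray([it, u], 1, [numel(itemIDs), numel(userIDs)]);

% Turn each user's column into fractions, as in the question
ratios = bsxfun(@rdivide, counts, sum(counts, 1));

% Pearson correlation between every pair of user columns
similarity = corrcoef(ratios);
```

For 75000 users the full similarity matrix would have 75000^2 entries, roughly 45 GB in double precision, so in practice it would likely have to be computed and written to file in blocks of users rather than in one call.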
2 Comments
Prasanta Saikia on 8 Jun 2017
Thanks for your answer.
1. I tried using
users = sortrows(users, 2, 'descend');
but for some reason, it gives the following error:
Error using sortrows
Too many input arguments.
That's why I had to use this convoluted way to arrange in descending order. But that was not the factor eating up the time anyway, like you correctly identified.
2. After some modifications to your code, I was able to get it to work. Now, while it is of course faster, there is significant deviation from similarities I obtained using other methods.
By using Pearson correlation based similarity, which is basically:
for a=1:size(users,1)
    for b=1:size(users,1)
        x = data(data(:,10)==users(a,1),12);
        y = data(data(:,10)==users(b,1),12);
        x1 = histc(x,unique(data(:,12)));
        y1 = histc(y,unique(data(:,12)));
        S(a,b) = corr(x1,y1);
    end
end
The correlation between the similarity obtained from your code and that obtained by the Pearson correlation similarity is 72%.
Then, I calculated the similarity using cosine similarity, which is basically:
users=users(1:10,:);
for a=1:size(users,1)
    for b=1:size(users,1)
        x = data(data(:,10)==users(a,1),12);
        y = data(data(:,10)==users(b,1),12);
        x1 = histc(x,unique(data(:,12)));
        y1 = histc(y,unique(data(:,12)));
        S(a,b) = dot(x1,y1)/(norm(x1,2)*norm(y1,2));
    end
end
The correlation between the similarity obtained from your code and that obtained by the cosine similarity is 87%.
And finally, the correlation between the similarity obtained from your code and that obtained by my original code in the question is 92%.
I don't know why this difference occurs. Furthermore, these differences increase when I increase the number of users (for 1000 users, the correlations decrease to 64%, 73% and 77% respectively), whereas the correlations between the similarities obtained using my original code, Pearson correlation similarity, and cosine similarity are always around 80-95%. Would you know why this difference arises with your code?
P.S. I wrote my custom similarity code as it was still faster than the cosine and the Pearson correlation codes I wrote, even with the multiple for loops. Of course, your code with the loops removed is much much faster than all three, but as I said, gives the most different similarity values compared to the other 3.
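One check that may help narrow this down: Pearson correlation and cosine similarity are both unchanged when a vector is multiplied by a positive constant, so dividing each user's counts by their total should not, by itself, change either measure. A minimal sketch with made-up counts (x1 and y1 are hypothetical):

```matlab
x1 = [2; 1; 0; 3];            % hypothetical item counts for user a
y1 = [1; 0; 1; 2];            % hypothetical item counts for user b

xr = x1 / sum(x1);            % the same vectors as fractions
yr = y1 / sum(y1);

% Pearson correlation on counts and on fractions
c  = corrcoef(x1, y1);  pearsonCounts = c(1, 2);
cr = corrcoef(xr, yr);  pearsonRatios = cr(1, 2);

% Cosine similarity on counts and on fractions
cosCounts = dot(x1, y1) / (norm(x1) * norm(y1));
cosRatios = dot(xr, yr) / (norm(xr) * norm(yr));
```

Both pairs agree up to rounding. If the similarities still differ between implementations, the cause is more likely in how the vectors are built, for example an item list taken from column 2 in one version and column 12 in another, or a column offset when filling the preference matrix.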
Guillaume on 8 Jun 2017
sortrows has supported the 'descend' option since R2013b. If you're using an ancient version of MATLAB, you need to say so.
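On releases where the three-argument call is rejected, a negative column index is a long-standing way to ask sortrows for descending order. A minimal sketch on a made-up users matrix (the values are hypothetical):

```matlab
users = [101 3; 102 7; 103 5];    % hypothetical [userID, activityCount] rows

% Negative column index: sort by column 2 in descending order
usersDesc = sortrows(users, -2);

% The flip-based approach from the question, for comparison
% (rows with equal counts may come out in a different order)
usersFlip = flipud(sortrows(users, 2));
```

Without ties in the count column, both variants return the same matrix.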
Please provide some sample data if you want me to test the code I've posted. Obviously, without any data to test with, I've no idea if there are any typos or mistakes. You said you needed to make some modifications. In theory, no modification was needed, so maybe I overlooked something.
If you want to recreate your correlation values with a double loop, you can still do so after the loop that creates allcat:
S = zeros(size(users, 1));
for iuser1 = 1:size(users, 1)
    for iuser2 = 1:size(users, 1)
        %user columns start at column 2 of allcat (column 1 holds the item IDs)
        S(iuser1, iuser2) = yoursimilarityfunction(allcat(:, iuser1 + 1), allcat(:, iuser2 + 1));
    end
end
But as far as I understand, this is what corr does anyway when passed a matrix (I don't have the stats toolbox).
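For readers without the Statistics Toolbox, the same idea can be checked with corrcoef from base MATLAB, which also treats each column as one variable. The sketch below, on a small made-up item-by-user matrix X (hypothetical values), compares the one-call result with an explicit double loop and adds a loop-free cosine similarity for comparison.

```matlab
% Hypothetical item-by-user matrix: one column of item fractions per user
X = [0.50 0.25 0.00;
     0.50 0.25 0.20;
     0.00 0.50 0.80];
nUsers = size(X, 2);

% All-pairs Pearson correlation in one call
R = corrcoef(X);

% The same values from an explicit double loop
Rloop = zeros(nUsers);
for a = 1:nUsers
    for b = 1:nUsers
        c = corrcoef(X(:, a), X(:, b));
        Rloop(a, b) = c(1, 2);
    end
end

% All-pairs cosine similarity without loops
norms = sqrt(sum(X.^2, 1));           % 1-by-nUsers row of column norms
C = (X' * X) ./ (norms' * norms);
```

R and Rloop agree up to rounding, so the double loop is only needed when the similarity function has no matrix form.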



Ayush Jain on 6 Jan 2022
Input:
    User id ‘U1’ /*id for a user*/
    Item id /*all items ratings which are rated by user*/
    Cluster’s array /*using Fuzzy C-Mean*/
    Colony’s array /*Artificial Algae Algorithm*/
Output: Similarity Array for U1 from other users
Calculate Avg. Rating for U1
for each User ‘U2’ from the user set
    Initialization of variables numerator and denominator.
    if U1 and U2 are in the same cluster
        for each item i (int i=1;i<=total items;i++)
            if both U1 and U2 rated that item
                co-rated items incremented
                Calculate numerator and denominator of Eq. (1)
            end
        end
    end
    if numerator or denominator =0
        assign similarity of U1 and U2 is 0
    else
        assign similarity of U1 and U2 from Eq. (1)
    end
    if U1 and U2 are in the same colony
        for each item i (int i=1;i<=total items;i++)
            if both U1 and U2 rated that item
                co-rated items incremented
                Calculate numerator and denominator of Eq. (1)
            end
        end
    end
    if numerator or denominator =0
        Res=0 (similarity of U1 and U2)
    else
        Res= similarity of U1 and U2 from Eq. (1)
    end
    Recalculate similarity using Eq. (6)
    Use function is written in Eq. (5)
End
