How to efficiently calculate item-based user similarity when there is a huge number of users?
I have about 75000 users, and I want to calculate the similarity between each of them based on whether they liked certain items or not (1 for liked, 0 for not liked). The items can be present multiple times for each user, and the user's responses to each item are independent of those in the previous instances. This is my code to calculate user similarities:
data=csvread('datafile.csv');
users = unique(data(:,10)); % Users are in the 10th column in the dataset
users(:,2) = histc(data(:,10),unique(data(:,10)));
users = flipdim(sortrows(users,2),1); % Arranging users in descending order of 'activity' (i.e., in order of the number of items liked or disliked by the users)
users = users(1:100,:); % For 100 users, it took about 2 hours to finish running.
filename = 'C:\Users\hp1\Desktop\location_similarity.csv'; % The file where I am saving the similarities, as the similarity matrix is too huge for storing in memory
for a=1:size(users,1)
    A=data(data(:,10)==users(a,1),12); % Items are in the 12th column of the dataset
    catA(:,1)=unique(A);
    catA(:,2)=histc(A,unique(A));
    totalA = sum(catA(:,2));
    catA(:,3)=catA(:,2)/totalA; % Calculating the fraction of items the current user liked
    allCatA(:,1)=unique(data(:,2));
    allCatA(:,2)=zeros(size(allCatA,1),1);
    % Calculating the current user's item preferences
    for k=1:size(catA,1)
        for l=1:size(allCatA,1)
            if catA(k,1)==allCatA(l,1)
                allCatA(l,2)=catA(k,3);
            end
        end
    end
    for b=1:size(users,1)
        B=data(data(:,10)==users(b,1),2);
        catB(:,1)=unique(B);
        catB(:,2)=histc(B,unique(B));
        totalB = sum(catB(:,2));
        catB(:,3)=catB(:,2)/totalB;
        allCatB(:,1)=unique(data(:,2));
        allCatB(:,2)=zeros(size(allCatB,1),1);
        for m=1:size(catB,1)
            for n=1:size(allCatB,1)
                if catB(m,1)==allCatB(n,1)
                    allCatB(n,2)=catB(m,3);
                end
            end
        end
        sim(1,b) = corr(allCatA(:,2),allCatB(:,2)); % Similarity between the 2 users based on correlation of their item preference vectors
        clear catB; clear allCatB;
    end
    dlmwrite(filename, sim(1,:), '-append'); % Saving the similarities of the row in the file
    clear catA; clear allCatA;
end
But this code takes a huge time to finish running (about 2 hours to run on only 100 users!). How do I calculate user similarities without using for loops, so as to cut down the running time? Any help is appreciated. Thanks.
Answers (2)
Guillaume
on 8 June 2017
But this code takes a huge time to finish running (about 2 hours to run on only 100 users!).
No wonder! You're calculating your allCat 100x101 times, which is 101 times more than it needs to be. Plus you're doing it very inefficiently.
In fact, your code is full of convoluted statements such as:
users = flipdim(sortrows(users,2),1); % Arranging users in descending order of 'activity'
which is simply:
users = sortrows(users, 2, 'descend');
and plenty of repeated calls to unique on the same data, e.g.:
users = unique(data(:,10)); % Users are in the 10th column in the dataset
users(:,2) = histc(data(:,10),unique(data(:,10)));
which should be:
users = unique(data(:,10)); % Users are in the 10th column in the dataset
users(:,2) = histc(data(:,10), users(:, 1));
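The same count can be obtained with a single call to unique by using its third output. The following is only a minimal sketch on made-up data: userIds is a hypothetical stand-in for data(:,10), and accumarray is swapped in for histc.

```matlab
% Toy stand-in for data(:,10): one user id per recorded response
userIds = [7; 3; 7; 7; 5; 3];

% idx maps every row to its position in uniqueUsers, so unique runs only once
[uniqueUsers, ~, idx] = unique(userIds);

% Sum a 1 into the slot of each row's user: number of responses per user
counts = accumarray(idx, 1);

% Most active users first (a negative column index sorts in descending order)
users = sortrows([uniqueUsers, counts], -2);
disp(users)   % rows: [7 3], [3 2], [5 1]
```

The negative column index form of sortrows is used here because it also works on releases that do not accept the 'descend' option.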
But fixing your redundant calculation of allCat should be your priority. You need to go over each user only once, e.g.:
allcat = unique(data(:, 2)); %No idea what data(:, 2) is
allcat = [allcat, zeros(size(allcat, 1), size(users, 1))]; %pre-allocate as many additional columns as there are users
for iuser = 1:size(users, 1)
    items = data(data(:, 10) == users(iuser, 1), 12);
    uitems = unique(items);
    itemcount = histc(items, uitems);
    itemratio = itemcount / sum(itemcount);
    %the following is a lot more efficient than your double for loops k and l (and m and n):
    [matched, whichitem] = ismember(allcat(:, 1), uitems);
    allcat(matched, iuser + 1) = itemratio(whichitem(matched)); %column 1 holds the item ids, so user columns start at 2
end
Then you can do the correlation in just one go between all the user columns:
similarity = corr(allcat(:, 2:end));
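If the remaining loop over users is still too slow, the whole item-by-user matrix can probably be built without any loop. This is a sketch under assumptions, not tested on the real data: userIds and itemIds are hypothetical stand-ins for data(:,10) and data(:,12), and corrcoef (base MATLAB, Pearson correlation between columns) is used in place of corr from the Statistics Toolbox.

```matlab
% Toy stand-ins for data(:,10) (user ids) and data(:,12) (item ids)
userIds = [1; 1; 1; 2; 2; 3; 3; 3; 3];
itemIds = [10; 20; 10; 10; 30; 20; 20; 30; 10];

% Index of each row's user and item within the sorted unique lists
[uniqueUsers, ~, userIdx] = unique(userIds);
[uniqueItems, ~, itemIdx] = unique(itemIds);

% counts(r, c) = number of times user c responded to item r
counts = accumarray([itemIdx, userIdx], 1, ...
                    [numel(uniqueItems), numel(uniqueUsers)]);

% Divide each column by its total to get per-user item fractions
ratios = bsxfun(@rdivide, counts, sum(counts, 1));

% corrcoef treats each column as a variable: result is users-by-users
similarity = corrcoef(ratios);
```

Note that for all 75000 users the full similarity matrix would need about 45 GB as doubles (75000^2 x 8 bytes), so it would still have to be computed and written out in blocks of users rather than in one call.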
2 comments
Guillaume
on 8 June 2017
sortrows has supported the 'descend' option since R2013b. If you're using an ancient version of MATLAB, you need to say so.
Please provide some sample data if you want me to test the code I've posted. Obviously, without any data to test with, I've no idea if there are any typos or mistakes. You said you needed to make some modifications. In theory, no modification was needed, so maybe I overlooked something.
If you want to recreate your correlation values with a double loop, you can still do so after the loop that creates allcat:
S = zeros(size(users, 1));
for iuser1 = 1:size(users, 1)
    for iuser2 = 1:size(users, 1)
        %user columns start at 2 because column 1 of allcat holds the item ids
        S(iuser1, iuser2) = yoursimilarityfunction(allcat(:, iuser1 + 1), allcat(:, iuser2 + 1));
    end
end
But as far as I understand, this is what corr does anyway when passed a matrix (I don't have the stats toolbox).
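For anyone without the Statistics Toolbox, the equivalence can be checked with corrcoef from base MATLAB. The sketch below uses a made-up preference matrix P (one column per user, hypothetical values) and compares the pairwise loop against a single call on the whole matrix.

```matlab
% Hypothetical per-user preference matrix: one column per user
P = [0.50 0.25 0.00;
     0.25 0.25 0.50;
     0.25 0.50 0.50];
nUsers = size(P, 2);

% Pairwise version: every user is perfectly correlated with itself
S = eye(nUsers);
for a = 1:nUsers
    for b = a+1:nUsers              % correlation is symmetric, fill both halves
        c = corrcoef(P(:, a), P(:, b));   % 2x2 matrix, off-diagonal is r
        S(a, b) = c(1, 2);
        S(b, a) = c(1, 2);
    end
end

% One call on the whole matrix should agree up to rounding error
maxDiff = max(max(abs(S - corrcoef(P))));
disp(maxDiff)
```

A user whose preference column is constant has zero variance, so both versions return NaN for that user's correlations.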
Ayush Jain
on 6 Jan 2022
Input:
    User id 'U1'     /* id for a user */
    Item id          /* all items ratings which are rated by user */
    Cluster's array  /* using Fuzzy C-Mean */
    Colony's array   /* Artificial Algae Algorithm */
Output:
    Similarity Array for U1 from other users

Calculate Avg. Rating for U1
for each User 'U2' from the user set
    Initialization of variables numerator and denominator.
    if U1 and U2 are in the same cluster
        for each item i (int i=1; i<=total items; i++)
            if both U1 and U2 rated that item
                co-rated items incremented
                Calculate numerator and denominator of Eq. (1)
            end
        end
    end
    if numerator or denominator = 0
        assign similarity of U1 and U2 is 0
    else
        assign similarity of U1 and U2 from Eq. (1)
    end
    if U1 and U2 are in the same colony
        for each item i (int i=1; i<=total items; i++)
            if both U1 and U2 rated that item
                co-rated items incremented
                Calculate numerator and denominator of Eq. (1)
            end
        end
    end
    if numerator or denominator = 0
        Res = 0 (similarity of U1 and U2)
    else
        Res = similarity of U1 and U2 from Eq. (1)
    end
    Recalculate similarity using Eq. (6)
    Use function is written in Eq. (5)
End