How to efficiently calculate item-based user similarity when there is a huge number of users?

Question

Prasanta Saikia on 8 June 2017


https://fr.mathworks.com/matlabcentral/answers/343821-how-to-efficiently-calculate-item-based-user-similarity-when-there-are-huge-number-of-users

Answered: Ayush Jain on 6 Jan 2022

I have about 75000 users, and I want to calculate the similarity between each of them based on whether they liked certain items or not (1 for liked, 0 for not liked). The items can be present multiple times for each user, and the user's responses to each item are independent of those in the previous instances. This is my code to calculate user similarities:

data=csvread('datafile.csv');
users = unique(data(:,10)); % Users are in the 10th column in the dataset
users(:,2) = histc(data(:,10),unique(data(:,10)));
users = flipdim(sortrows(users,2),1); % Arranging users in descending order of 'activity' (i.e., in order of the number of items liked or disliked by the users)
users = users(1:100,:); % For 100 users, it took about 2 hours to finish running.
filename = 'C:\Users\hp1\Desktop\location_similarity.csv'; % The file where I am saving the similarities, as the similarity matrix is too huge for storing in memory
for a=1:size(users,1)
  A=data(data(:,10)==users(a,1),12); % Items are in the 12th column of the dataset
  catA(:,1)=unique(A);
  catA(:,2)=histc(A,unique(A));
  totalA = sum(catA(:,2));
  catA(:,3)=catA(:,2)/totalA; % Calculating the fraction of items the current user liked
  allCatA(:,1)=unique(data(:,2));
  allCatA(:,2)=zeros(size(allCatA,1),1);
        % Calculating the current user's item preferences
    for k=1:size(catA,1)
      for l=1:size(allCatA,1)
        if catA(k,1)==allCatA(l,1)
          allCatA(l,2)=catA(k,3);
        end
      end
    end
  for b=1:size(users,1)
    B=data(data(:,10)==users(b,1),2);
    catB(:,1)=unique(B);
    catB(:,2)=histc(B,unique(B));
    totalB = sum(catB(:,2));
    catB(:,3)=catB(:,2)/totalB;
    allCatB(:,1)=unique(data(:,2));
    allCatB(:,2)=zeros(size(allCatB,1),1);
    for m=1:size(catB,1)
      for n=1:size(allCatB,1)
        if catB(m,1)==allCatB(n,1)
          allCatB(n,2)=catB(m,3);
        end
      end
    end
    sim(1,b) = corr(allCatA(:,2),allCatB(:,2)); % Similarity between the 2 users based on correlation of their item preference vectors
    clear catB; clear allCatB;
  end
  dlmwrite(filename, sim(1,:), '-append'); % Saving the similarities of the row in the file
  clear catA; clear allCatA;
end

But this code takes a huge time to finish running (about 2 hours to run on only 100 users!). How do I calculate user similarities without using for loops to cut short the time required to run? Any help is appreciated. Thanks.


Answer 1

Guillaume on 8 June 2017


https://fr.mathworks.com/matlabcentral/answers/343821-how-to-efficiently-calculate-item-based-user-similarity-when-there-are-huge-number-of-users#answer_270005


But this code takes a huge time to finish running (about 2 hours to run on only 100 users!).

No wonder! You're calculating your allCat 100x101 times, which is 101 times more than it needs to be. Plus you're doing it very inefficiently.

In fact, your code is full of convoluted statements such as:

users = flipdim(sortrows(users,2),1); % Arranging users in descending order of 'activity'

which is simply:

users = sortrows(users, 2, 'descend');

and plenty of repeated calls to unique on the same data, e.g.:

users = unique(data(:,10)); % Users are in the 10th column in the dataset
users(:,2) = histc(data(:,10),unique(data(:,10)));

which should be:

users = unique(data(:,10)); % Users are in the 10th column in the dataset
users(:,2) = histc(data(:,10), users(:, 1));
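
As a side note, the per-user counts can also be obtained from a single call to unique by reusing its third output with accumarray. The following is only a sketch on made-up data: userColumn is a hypothetical stand-in for data(:,10), and the ~ placeholder assumes R2009b or later.

```matlab
% Toy stand-in for data(:,10); the real dataset is not available here.
userColumn = [7; 3; 7; 7; 5; 3];

% ids: sorted unique user IDs
% idx: for every row, the position of that row's user within ids
[ids, ~, idx] = unique(userColumn);

% Count how many rows belong to each user in one pass
counts = accumarray(idx(:), 1);

% Same two-column layout as the users matrix in the question
users = [ids, counts];
```

Here users comes out with one row per distinct ID and the row count in the second column, without calling unique a second time.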

But fixing your redundant calculation of allCat should be your priority. You need to go over each user only once, e.g.:

allcat = unique(data(:, 2));  % No idea what data(:, 2) is
nusers = size(users, 1);  % users has two columns, so numel(users) would be twice the number of users
allcat = [allcat, zeros(size(allcat, 1), nusers)];  % column 1 holds the categories; pre-allocate one additional column per user
for iuser = 1:nusers
   items = data(data(:, 10) == users(iuser, 1), 12);
   uitems = unique(items);
   itemcount = histc(items, uitems);
   itemratio = itemcount / sum(itemcount);
   % the following is a lot more efficient than your double for loops k and l (and m and n):
   [matched, whichitem] = ismember(allcat(:, 1), uitems);
   allcat(matched, iuser + 1) = itemratio(whichitem(matched));  % +1 because column 1 holds the categories
end

Then you can do the correlation in just one go between all the user columns:

similarity = corr(allcat(:, 2:end));
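
If the per-user loop itself becomes the bottleneck, the whole item-by-user matrix can probably be built in one step with accumarray. This is a minimal sketch on toy data, not a drop-in replacement: userCol and itemCol are hypothetical stand-ins for data(:,10) and data(:,12), and corrcoef (base MATLAB) is used in place of corr so that no toolbox is assumed.

```matlab
% Toy data: one row per (user, item) event
userCol = [1; 1; 2; 2; 2; 3];
itemCol = [10; 20; 10; 10; 30; 20];

% Map user and item IDs to consecutive indices
[userIds, ~, u]  = unique(userCol);
[itemIds, ~, it] = unique(itemCol);

% counts(i, j) = number of times user j appears with item i
counts = accumarray([it(:), u(:)], 1, [numel(itemIds), numel(userIds)]);

% Turn counts into per-user fractions (each column sums to 1)
ratios = bsxfun(@rdivide, counts, sum(counts, 1));

% Pearson correlation between every pair of user columns
similarity = corrcoef(ratios);
```

For the full 75000 users, a dense 75000-by-75000 matrix of doubles would need roughly 45 GB, so the result would presumably still have to be computed and written out in blocks of users rather than in one call.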


Prasanta Saikia on 8 June 2017


Thanks for your answer.

1. I tried using

users = sortrows(users, 2, 'descend');

but for some reason, it gives the following error:

Error using sortrows
Too many input arguments.

That's why I had to use this convoluted way to arrange in descending order. But that was not the factor eating up the time anyway, like you correctly identified.

2. After some modifications to your code, I was able to get it to work. Now, while it is of course faster, there is significant deviation from similarities I obtained using other methods.

By using Pearson correlation based similarity, which is basically:

for a=1:size(users,1)
  for b=1:size(users,1)
    x = data(data(:,10)==users(a,1),12);
    y = data(data(:,10)==users(b,1),12);
    x1 = histc(x,unique(data(:,12)));
    y1 = histc(y,unique(data(:,12)));
    S(a,b) = corr(x1,y1); 
  end
end

the correlation between the similarity obtained from your code with that obtained by the Pearson correlation similarity is 72%.

Then, I calculated the similarity using cosine similarity, which is basically:

users=users(1:10,:);
for a=1:size(users,1)
  for b=1:size(users,1)
    x = data(data(:,10)==users(a,1),12);
    y = data(data(:,10)==users(b,1),12);
    x1 = histc(x,unique(data(:,12)));
    y1 = histc(y,unique(data(:,12)));
    S(a,b) = dot(x1,y1)/(norm(x1,2)*norm(y1,2));
  end
end

the correlation between the similarity obtained from your code with that obtained by the cosine similarity is 87%.
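
The cosine similarity loop above can likely be collapsed into one matrix product once the counts are arranged as an item-by-user matrix. A sketch under that assumption, with a made-up counts matrix standing in for the histc results:

```matlab
% Toy item-by-user count matrix (rows: items, columns: users)
counts = [1 2 0;
          1 0 1;
          0 1 0];

% Euclidean norm of every user column
norms = sqrt(sum(counts.^2, 1));

% Scale each column to unit length; a user with no items would give NaN here
unitCols = bsxfun(@rdivide, counts, norms);

% S(a, b) = cosine of the angle between the count vectors of users a and b
S = unitCols' * unitCols;
```

Each entry S(a,b) corresponds to dot(x1,y1)/(norm(x1,2)*norm(y1,2)) from the double loop, computed for all pairs at once.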

And finally, the correlation between the similarity obtained from your code with that obtained by my original code in the question is 92%.

I don't know why this difference occurs. Furthermore, the differences grow when I increase the number of users (for 1000 users, the correlations decrease to 64%, 73% and 77% respectively), whereas the correlations between the similarities obtained using my original code, the Pearson correlation similarity, and the cosine similarity are always around 80-95%. Would you know why this difference arises with your code?
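
One sanity check that may help narrow this down: both Pearson correlation and cosine similarity are unchanged when a vector is multiplied by a positive constant, so using item fractions instead of raw counts should not, on its own, change either measure. The sketch below illustrates this on two made-up count vectors (x and y are hypothetical); if that holds for the real data too, the deviation would more likely come from an indexing or column mismatch than from the normalization.

```matlab
% Two made-up item-count vectors for two users
x = [1; 1; 0; 4];
y = [2; 0; 1; 3];

% The same vectors expressed as fractions of each user's total
xr = x / sum(x);
yr = y / sum(y);

% Pearson correlation on counts and on fractions (corrcoef is base MATLAB)
c1 = corrcoef(x, y);
c2 = corrcoef(xr, yr);
pearsonCounts = c1(1, 2);
pearsonRatios = c2(1, 2);

% Cosine similarity on counts and on fractions
cosCounts = dot(x, y)   / (norm(x)  * norm(y));
cosRatios = dot(xr, yr) / (norm(xr) * norm(yr));
```

Up to floating-point rounding, pearsonCounts matches pearsonRatios and cosCounts matches cosRatios.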

P.S. I wrote my custom similarity code as it was still faster than the cosine and the Pearson correlation codes I wrote, even with the multiple for loops. Of course, your code with the loops removed is much much faster than all three, but as I said, gives the most different similarity values compared to the other 3.

Guillaume on 8 June 2017


sortrows has supported the 'descend' option since R2013b. If you're using an ancient version of MATLAB, you need to say so.
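
On releases where the direction argument is rejected, a negative column index is a long-documented way to ask sortrows for descending order. A small sketch with a made-up users matrix:

```matlab
% Made-up users matrix: column 1 = user ID, column 2 = activity count
users = [1 5;
         2 9;
         3 7];

% A negative column index sorts that column in descending order
sorted = sortrows(users, -2);
```

This avoids the flipdim workaround without relying on the 'descend' syntax.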

Please provide some sample data if you want me to test the code I've posted. Obviously, without any data to test with, I've no idea if there are any typos or mistakes. You said you needed to make some modifications. In theory, no modification was needed, so maybe I overlooked something.

If you want to recreate your correlation values with a double loop, you can still do so after the loop that creates allcat:

S = zeros(size(users, 1));
for iuser1 = 1:size(users, 1)
   for iuser2 = 1:size(users, 1)
      % +1 skips the category column of allcat
      S(iuser1, iuser2) = yoursimilarityfunction(allcat(:, iuser1 + 1), allcat(:, iuser2 + 1));
   end
end

But as far as I understand, this is what corr does anyway when passed a matrix (I don't have the stats toolbox).
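
For readers without the toolbox, the pairwise Pearson correlation of columns can be reproduced with base functions by centering and normalizing each column and then taking a matrix product. The sketch below does this in blocks of users, which is one way the rows of a large similarity matrix could be produced piece by piece; X and blockSize are made up for illustration.

```matlab
% Toy item-by-user preference matrix (rows: items, columns: users)
X = [0.5  2/3  0  0.25;
     0.5  0    1  0.25;
     0    1/3  0  0.5];

% Center every column, then scale it to unit length.
% A constant column would divide by zero and give NaN.
Xc = bsxfun(@minus, X, mean(X, 1));
Z  = bsxfun(@rdivide, Xc, sqrt(sum(Xc.^2, 1)));

nUsers    = size(Z, 2);
blockSize = 2;              % hypothetical; choose what fits in memory
S = zeros(nUsers);          % only assembled here to check the toy result

for first = 1:blockSize:nUsers
    last = min(first + blockSize - 1, nUsers);
    % Correlations of users first..last against all users;
    % in a large run this block could be appended to a file instead
    S(first:last, :) = Z(:, first:last)' * Z;
end
```

On this toy input, S agrees with corrcoef(X) up to rounding, which is consistent with the pairwise-column behaviour described above.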


Answer 2

Ayush Jain on 6 Jan 2022


https://fr.mathworks.com/matlabcentral/answers/343821-how-to-efficiently-calculate-item-based-user-similarity-when-there-are-huge-number-of-users#answer_869510

Input:
  User id 'U1' /* id for a user */
  Item id /* all item ratings which are rated by user */
  Cluster's array /* using Fuzzy C-Mean */
  Colony's array /* Artificial Algae Algorithm */
Output:
  Similarity array for U1 from other users

Calculate Avg. Rating for U1
for each user 'U2' from the user set
  Initialization of variables numerator and denominator
  if U1 and U2 are in the same cluster
    for each item i (int i=1; i<=total items; i++)
      if both U1 and U2 rated that item
        co-rated items incremented
        Calculate numerator and denominator of Eq. (1)
      end
    end
  end
  if numerator or denominator = 0
    assign similarity of U1 and U2 is 0
  else
    assign similarity of U1 and U2 from Eq. (1)
  end
  if U1 and U2 are in the same colony
    for each item i (int i=1; i<=total items; i++)
      if both U1 and U2 rated that item
        co-rated items incremented
        Calculate numerator and denominator of Eq. (1)
      end
    end
  end
  if numerator or denominator = 0
    Res = 0 (similarity of U1 and U2)
  else
    Res = similarity of U1 and U2 from Eq. (1)
  end
  Recalculate similarity using Eq. (6)
  Use function is written in Eq. (5)
End
