How to find the similarity between two text documents

Jothi on 19 Dec 2012
Commented: info info on 20 Mar 2020
I have two text documents.
For example, a.txt contains 'Hai How R U'
and b.txt contains 'Hai How are U'.
How can I calculate the cosine similarity or the Euclidean distance between these two documents (text files)?
Thanks in advance.
  2 comments
Jan on 19 Dec 2012
The Euclidean distance requires vectors of the same size. There are different edit distances, but I do not know the cosine distance. Perhaps it is better that you explain the details than that we search for them on Wikipedia.
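For reference, cosine similarity is the cosine of the angle between two vectors, and one common way to get vectors of the same size is to count words over the vocabulary shared by both documents. The following is only a minimal sketch in Python, assuming plain word-count vectors and a simple lowercase tokenizer; the function names are illustrative and not part of any toolbox.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def count_vectors(text_a, text_b):
    # Build two equal-length count vectors over the shared vocabulary.
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]

def cosine_similarity(u, v):
    # dot(u, v) / (|u| * |v|); defined as 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def euclidean_distance(u, v):
    # Square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# The two example strings from the question (contents of a.txt and b.txt).
vec_a, vec_b = count_vectors("Hai How R U", "Hai How are U")
print(cosine_similarity(vec_a, vec_b))
print(euclidean_distance(vec_a, vec_b))
```

With these two strings the vocabulary has five words, three of which are shared, so the cosine similarity is 0.75 and the Euclidean distance is the square root of 2. Reading the strings from a.txt and b.txt instead of hard-coding them does not change the rest of the sketch.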
info info on 20 Mar 2020
I think the best way to measure text similarity is "shingling".
Shingling is a common technique for representing documents as sets. Given a document, its k-shingles are all the possible consecutive substrings of length k found within it. An example with k = 3 (here, three consecutive words) is given below:
## $Original
## [1] "The sky is blue and the sun is bright."
##
## $Shingled
## [1] "the sky is" "sky is blue" "is blue and" "blue and the"
## [5] "and the sun" "the sun is" "sun is bright"
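The shingled output above can be reproduced with a few lines of code. This is a hedged sketch in Python, assuming word-level shingles, lowercasing, and punctuation removal (which is what the example output suggests); the function name `shingles` is hypothetical.

```python
import re

def shingles(text, k=3):
    # Lowercase, drop punctuation, and split into words.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of k words over the text, one word at a time.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

original = "The sky is blue and the sun is bright."
for s in shingles(original, k=3):
    print(s)
```

A text with fewer than k words yields an empty list, so very short documents need a smaller k.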
Then we check which shingles occur in each of our texts:
##                doc_1 doc_2 doc_3
## the sky is         1     1     1
## sky is blue        1     0     1
## is blue and        1     0     0
## blue and the       1     0     0
## and the sun        1     0     0
## the sun is         1     0     0
## sun is bright      1     0     1
## the sun in         0     1     0
## sun in the         0     1     0
## in the sky         0     1     0
## sky is bright      0     1     0
## we can see         0     0     1
## can see sun        0     0     1
## see sun is         0     0     1
## is bright the      0     0     1
## bright the sky     0     0     1
Then calculate the similarity for each pair of documents and take the largest value.
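The post does not name the measure to calculate. A common choice for shingle sets is the Jaccard similarity (size of the intersection divided by size of the union), so the sketch below assumes that. The shingle sets are copied from the table above; the Python code and the names `jaccard` and `shingle_sets` are illustrative only.

```python
from itertools import combinations

# Shingles marked 1 for each document in the table above.
shingle_sets = {
    "doc_1": {"the sky is", "sky is blue", "is blue and", "blue and the",
              "and the sun", "the sun is", "sun is bright"},
    "doc_2": {"the sky is", "the sun in", "sun in the", "in the sky",
              "sky is bright"},
    "doc_3": {"the sky is", "sky is blue", "sun is bright", "we can see",
              "can see sun", "see sun is", "is bright the", "bright the sky"},
}

def jaccard(a, b):
    # |A intersect B| / |A union B|; defined as 0.0 for two empty sets.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Score every pair of documents and pick the most similar pair.
scores = {
    (x, y): jaccard(shingle_sets[x], shingle_sets[y])
    for x, y in combinations(shingle_sets, 2)
}
best_pair = max(scores, key=scores.get)

for pair, score in scores.items():
    print(pair, round(score, 3))
print("most similar:", best_pair)
```

Under this assumption doc_1 and doc_3 come out as the closest pair: they share 3 of 12 distinct shingles, a similarity of 0.25, while the other two pairs share only one shingle each.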


Answers (1)

Jan on 19 Dec 2012
