How can I remove websites' links from a text?

Question

Dario Borrelli on 1 Feb 2017

Answered: Christopher Creutzig on 2 Nov 2017

I am trying to remove websites' links from a string. I would like to remove (or replace with a space ' ') every link that starts with 'https:'. I tried using the command regexprep, but I am able to replace only a specific link.

1 comment

Jan on 1 Feb 2017

Please post some relevant part of the text. Is the "https:" included in < and > or in double quotes? Can spaces appear in the links?

Answer 1

Iddo Weiner on 1 Feb 2017

Edited: Iddo Weiner on 1 Feb 2017

Dario, this really depends on what your data looks like. I made an assumption about what your text might look like, so please check out the following method:

text = 'some words https:link some other words https:otherlink final words';
disp(text)

some words https:link some other words https:otherlink final words

text_copy = text; % work on a copy so you always have the original for comparison
base_string = 'https:';
first_del_idx = strfind(text, base_string); %this is where the link string starts
% find the paired last index for each first index
last_del_idx = nan(size(first_del_idx));
for i = length(first_del_idx):-1:1 % loop backwards so deletions do not shift the earlier indices
    next_idx = first_del_idx(i) + length(base_string); % no point in checking before this position
    while true
        % stop at the end of the text, at whitespace, or at a backslash (a literal '\n' at the end of a line)
        if next_idx > length(text_copy) || isspace(text_copy(next_idx)) || text_copy(next_idx) == '\'
            last_del_idx(i) = min(next_idx, length(text_copy)); % never index past the end of the text
            text_copy(first_del_idx(i) : last_del_idx(i)) = []; % this is the actual deletion
            break % out of the while loop
        end
        next_idx = next_idx + 1;
    end
end
% let's see what we're left with
disp(text_copy)

some words some other words final words

Explanation: You might need to adjust a few things in your code, so here's the logic. I assumed you have a base string that can be used to find all link occurrences. I also assumed that links are written without spaces and that a space marks the end of a link - so if you start scanning from "https:" and stop when you reach a space (' '), you have found the full length of the substring to delete. If that is not the situation, you will need a different identifier for the end of a link, maybe '.com' or '/' - I can't know this for sure without seeing your data.

There is at least one edge case I could think of that could create bugs in my code: what if the link is at the end of a line? In that case, instead of ending with a space, it would end with a backslash '\', as part of a \n that marks the beginning of a new line. So I added a condition to guard against this, but then again, your data may not have \n at the end of lines, and then we'd have to think of a different identifier for these cases.

There are two principles I used here that might be a little confusing. Working on a copy (and not on the original data) is good coding practice. And I'd recommend traversing the string backwards, so that erasing one link does not shift the indices of the links you still have to process, which can cause all kinds of unwanted bugs.
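
To make the index-shifting point concrete, here is a minimal sketch on a toy string. The string 'aXbXc' and the variable names are made up for illustration and are not from the question's data; the goal is to delete both 'X' characters using indices computed once up front.

```matlab
s = 'aXbXc';               % hypothetical toy string
idx = strfind(s, 'X');     % idx = [2 4], computed once on the original string

% Forward deletion: after removing s(2), the second 'X' has moved to
% position 3, so the stale index 4 now points at 'c'.
fwd = s;
for i = 1:numel(idx)
    fwd(idx(i)) = [];
end

% Backward deletion: removing the later match first leaves the earlier
% indices untouched.
bwd = s;
for i = numel(idx):-1:1
    bwd(idx(i)) = [];
end

disp(fwd)   % abX  (wrong: an 'X' survives and 'c' is gone)
disp(bwd)   % abc
```

The same reasoning applies to the link ranges above: each deletion only changes positions to its right, so working from the last match to the first keeps the remaining start indices valid.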

I hope this helps

p.s. I worked here with strfind(), but you could substitute it with regular expression based functions, such as regexp() if you prefer. It's essentially the same in this case.
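
As a rough sketch of that substitution, regexp() can return the start and end index of every link in one call, which replaces both strfind() and the inner while loop. The pattern below is my assumption about what a link looks like (it starts with 'https:' and runs to the next whitespace); the optional '\s?' takes one trailing whitespace character along, mirroring what the loop does.

```matlab
% Sample text copied from the example above.
text = 'some words https:link some other words https:otherlink final words';

% 'https:\S*' matches 'https:' plus everything up to the next whitespace;
% '\s?' optionally includes one trailing whitespace character.
[first_idx, last_idx] = regexp(text, 'https:\S*\s?', 'start', 'end');

text_copy = text;                 % work on a copy, as before
for i = numel(first_idx):-1:1     % backwards, so earlier indices stay valid
    text_copy(first_idx(i):last_idx(i)) = [];
end
disp(text_copy)   % some words some other words final words
```

Because '\S*' simply stops when the text ends, a link at the very end of the string needs no special handling here.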

Answer 2

Christopher Creutzig on 2 Nov 2017

The eraseURLs function (in Text Analytics Toolbox) might help, although it does a little more work than what you describe.

Based on your description, the following should work. It uses \S*, the regex notation for “arbitrarily many non-whitespace characters”:

regexprep(str,'https:\S*','')
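
As a quick check of the pattern, here is a minimal sketch on a made-up string. The sample text and the variable names removed, spaced and cleaned are assumptions for illustration, since the question did not include real data. It also shows the replace-with-a-space variant mentioned in the question, plus an optional cleanup of the doubled spaces that deletion leaves behind.

```matlab
% Hypothetical sample text; the second link sits at the very end.
str = 'some words https://example.com/page?x=1 some other words https:otherlink';

% Delete every run of non-whitespace characters that starts with 'https:'.
removed = regexprep(str, 'https:\S*', '');

% Variant: replace each link with a single space instead.
spaced = regexprep(str, 'https:\S*', ' ');

% Optional cleanup: collapse runs of whitespace and trim both ends.
cleaned = strtrim(regexprep(removed, '\s+', ' '));

disp(cleaned)   % some words some other words
```

Note that the pattern only matches links that literally start with 'https:'; plain 'http:' links would need something like 'https?:\S*'.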
