MATLAB Answers

How can I remove websites' links from a text?

Asked by Dario Borrelli on 1 Feb 2017
Latest activity Answered by Christopher Creutzig on 2 Nov 2017
I am trying to remove websites' links from a string. I would like to remove (or replace with a space ' ') every link that starts with 'https:'. I tried using the command regexprep, but I am able to replace only a specific link.

  1 Comment

Jan
on 1 Feb 2017
Please post some relevant part of the text. Is the "https:" included in < and > or in double quotes? Can spaces appear in the links?

2 Answers

Answer by Iddo Weiner on 1 Feb 2017
Edited by Iddo Weiner on 1 Feb 2017

Dario, this really depends on what your data looks like, but I made an assumption about what your text might look like. Please check out the following method:
text = 'some words https:link some other words https:otherlink final words';
disp(text)
% prints: some words https:link some other words https:otherlink final words
text_copy = text; % work on a copy so you always have the original for comparison
base_string = 'https:';
first_del_idx = strfind(text, base_string); % this is where each link string starts
% find the paired last index for each first index
last_del_idx = nan(size(first_del_idx));
for i = length(first_del_idx):-1:1 % the loop works "backwards" so earlier indices stay valid
    next_idx = first_del_idx(i) + length(base_string); % no point in checking before this point
    % advance until a space, a backslash, or the end of the text is reached
    while next_idx <= length(text_copy) && ~any(text_copy(next_idx) == ' \')
        next_idx = next_idx + 1;
    end
    last_del_idx(i) = min(next_idx, length(text_copy)); % guard against a link at the very end
    text_copy(first_del_idx(i):last_del_idx(i)) = []; % this is the actual deletion
end
% let's see what we're left with
disp(text_copy)
% prints: some words some other words final words
Explanation: you might need to adjust a few things in the code, so here is the logic. I assumed you have a base string that can be used to find all link occurrences. I also assumed that links are written without spaces and that a space marks the end of a link, so if you start at "https:" and stop when you reach a space (' '), you have found the full length of the substring to delete. If that is not the situation, you will need a different identifier for the end of a link, maybe '.com' or '/'; I can't know this for sure without seeing your data.
There is at least one edge case I could think of that could create bugs in my code: what if the link is at the end of a line? In that case, instead of ending with a space, it would end with a backslash '\', the first character of a literal \n that marks the beginning of a new line. So I added a condition to protect against this. Then again, your data may not have a literal \n at the end of lines (for example, the line breaks may be actual newline characters, char(10)), and then we'd have to think of a different identifier for these cases.
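To make the end-of-line case concrete, here is a small sketch of the difference between a literal \n and a real newline. The sample strings and variable names are made up for illustration; the idea of using isspace as the end-of-link test is a suggestion, not part of the code above.

```matlab
% Hypothetical sample data: the same text with a literal "\n" and with a real newline
literalText = 'see https:link\nnext line';          % backslash followed by the letter n
realText    = sprintf('see https:link\nnext line'); % sprintf turns \n into char(10)

hasBackslash = any(realText == '\');      % false: no backslash left after sprintf
hasNewline   = any(realText == char(10)); % true: the line break is one character

% isspace is true for spaces, tabs and newlines alike, so it can mark the end
% of a link in either layout. The link starts at index 5 in this sample.
linkStart = 5;
firstWs = find(isspace(realText(linkStart:end)), 1) + linkStart - 1;
```

In this sample, firstWs points at the newline character directly after the link, so a check for '\' alone would not stop there.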
There are some principles I highlighted here that might be a little confusing. Working with a copy (and not on the original data) is good coding practice. I'd also recommend traversing the string backwards so that erasing doesn't shift the indices you still need, which can cause all kinds of unwanted bugs.
I hope this helps.
P.S. I worked here with strfind(), but you could substitute it with regular-expression-based functions, such as regexp(), if you prefer. It's essentially the same in this case.
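As a rough sketch of the regexp() variant mentioned in the P.S.: regexp can return the start and end index of every match, which replaces the manual search for the end of each link. The pattern 'https:\S*' and the variable names are assumptions for illustration, based on the same "links contain no whitespace" assumption as above.

```matlab
text = 'some words https:link some other words https:otherlink final words';

% Start and end index of every run of non-whitespace that begins with 'https:'
[startIdx, endIdx] = regexp(text, 'https:\S*');

cleaned = text; % work on a copy, as before
for k = numel(startIdx):-1:1 % delete backwards so earlier indices stay valid
    cleaned(startIdx(k):endIdx(k)) = [];
end
disp(cleaned)
```

Unlike the strfind() loop, this version removes only the link itself and not the space after it, so doubled spaces remain where a link used to be.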

Answer by Christopher Creutzig on 2 Nov 2017

The eraseURLs function (Text Analytics Toolbox) might help, although it does a little more work than what you describe.
Based on your description, the following should work. It uses \S*, the regex notation for "arbitrarily many non-whitespace characters":
regexprep(str,'https:\S*','')
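For a quick check, the one-liner can be tried on a made-up sample string. The second call is an assumed variation, not part of the answer above: it also consumes the whitespace in front of each link so that no doubled spaces are left behind.

```matlab
str = 'some words https:link some other words https:otherlink final words';

% Remove every run of non-whitespace that starts with 'https:'
noLinks = regexprep(str, 'https:\S*', '');

% Variation: also remove the whitespace preceding each link
noLinksTidy = regexprep(str, '\s*https:\S*', '');

disp(noLinks)
disp(noLinksTidy)
```

The first result keeps two spaces where each link was removed; the second collapses them.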
