Breaking data from a large text file into groups

Question

Neil on 13 Jul 2020

https://fr.mathworks.com/matlabcentral/answers/564551-breaking-data-from-a-large-text-file-into-groups

Commented: Walter Roberson on 15 Jul 2020

I have a text file that has groups of elements formatted like this:

{"1":  [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423], .... }

and so on. The data is all in one row in this format. I just need to determine the group with the highest number of elements and output that list only. However these files can be fairly large (~260 MB) with a couple thousand groups and element numbers up to the millions. I'm struggling to find the best method to break the scan (probably at the double quotes), save that group (probably to a cell), and then move on to the next one.
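Not raised in the thread, but worth noting: the sample line is valid JSON, so MATLAB's jsondecode may be able to parse it directly. The following is a minimal sketch under two assumptions: that the real file is strictly valid JSON like the sample, and that it fits in memory once decoded (untested at 260 MB). jsondecode turns keys such as "1" into valid field names such as x1.

```matlab
% Sample text from the question; for a real file, s = fileread(...) would be
% used instead (assumes the whole file fits in memory).
s = '{"1":  [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';

S = jsondecode(s);               % keys "1","2",... become fields x1, x2, ...
names  = fieldnames(S);
counts = structfun(@numel, S);   % number of elements in each group
[nMax, iMax] = max(counts);      % the first group wins a tie
largest = S.(names{iMax});       % numeric column vector of the largest group
```

Whether this is faster or slower than string splitting on a large file would need to be measured.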

3 comments
Neil on 13 Jul 2020

Yes, in the actual file there is one pair of curly brackets around the entire row. To your second question: both. The first group is typically the largest and may contain most of the elements present.

When the file was smaller, the easiest method was to manually delete everything after the first group, as long as that was the case. Thinking about it more, it would probably be easiest for my code to count the number of elements in each group, then go back to the largest group and write that list to a new file. I'm just not sure of the best way to separate and count them.

Walter Roberson on 15 Jul 2020

data = cat(1, image_patches,labels);

That code is overwriting all of data each iteration.

It looks to me as if data will not be a vector, but I do not seem to be able to locate any hellopatches() function so I cannot tell what shape it will be. As you are not doing imresize() I also cannot be sure that all of the images are the same size, so I cannot be sure that data will be the same size for each iteration. Under the circumstances you should be considering saving into a cell array.

Note: please do not post the same query multiple times. I found at least 10 copies of your query :(

Answer 1 (accepted)

dpb on 14 Jul 2020

https://fr.mathworks.com/matlabcentral/answers/564551-breaking-data-from-a-large-text-file-into-groups#answer_465386

Edited: dpb on 15 Jul 2020

Well, the following is pretty easy as far as counting goes. How it performs on a real file, in terms of speed and whether the record needs to be read piecemeal, I have no clue without a real file to test.

s='{"1":  [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';
s=erase(s,{'{','}'});        % drop the outer braces
ss=split(s,{':',']'});       % pieces alternate: key, list, key, list, ...
ss=ss(2:2:end);              % keep only the list pieces
>> ss
ss =
  4×1 cell array
    {'  [1,2,3,5,10,15,25,37'}
    {' [1,5,10,20'           }
    {' [2000,2170'           }
    {' [35,72,423'           }
>> [~,ixss]=max(cellfun(@(s) sum(s==','),ss))   % most commas = most elements
ixss =
          1

Undoubtedly regular expressions could come to the rescue here as well but I'm no guru...
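As a hedged illustration of the regular-expression route mentioned above, the sketch below captures each group's label and list body as tokens. The pattern and the variable names are my own, not from the thread, and it assumes no group is empty (an empty "[]" would be counted as one element).

```matlab
s = '{"1":  [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';

% Token 1: the quoted label; token 2: everything between the square brackets.
tok = regexp(s, '"([^"]*)"\s*:\s*\[([^\]]*)\]', 'tokens');
tok = vertcat(tok{:});                                  % N-by-2 cell: {label, list}

counts = cellfun(@(c) sum(c == ',') + 1, tok(:,2));     % elements = commas + 1
[nMax, iMax] = max(counts);                             % first group wins a tie

label  = tok{iMax,1};                                   % label of the largest group
values = sscanf(tok{iMax,2}, '%f,').';                  % its elements as a row vector
```

Keeping the label alongside the list avoids assuming that the labels are consecutive integers starting at 1.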

The new(ish) string functions version could be-

ss=extractBetween(s,'[',']');
[~,ixss]=max(cellfun(@(s) sum(s==','),ss))

or, if want the brackets, too, then

ss=extractBetween(s,'[',']','Boundaries','inclusive');
[~,ixss]=max(cellfun(@(s) sum(s==','),ss))
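The snippets above return only the index of the largest group. A minimal sketch of the remaining steps, converting the winning list to numbers and writing it to a new file, might look like this. The output file name is a placeholder generated with tempname, and the bracketed output layout is an assumption.

```matlab
s = '{"1":  [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';

ss = extractBetween(s, '[', ']');                   % list bodies without brackets
[~, ixss] = max(cellfun(@(c) sum(c == ','), ss));   % index of the longest list

v = sscanf(ss{ixss}, '%f,').';                      % elements as a numeric row vector

outName = [tempname '.txt'];                        % hypothetical output file
fid = fopen(outName, 'wt');
assert(fid >= 3, 'could not open output file')
fprintf(fid, '[%s]\n', ss{ixss});                   % write the list text back with brackets
fclose(fid);
```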

1 comment
Neil on 15 Jul 2020

Thanks! This worked well for me and was pretty fast (~30 s for the 260 MB file, including writing a new file).

Answer 2

Stephen23 on 15 Jul 2020

https://fr.mathworks.com/matlabcentral/answers/564551-breaking-data-from-a-large-text-file-into-groups#answer_466124

Edited: Stephen23 on 15 Jul 2020

test.txt

"I just need to determine the group with the highest number of elements and output that list only."

So for your example data this would be the first group?

What happens if multiple groups have the same number of elements?
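To make the tie question concrete: MATLAB's max returns the index of the first maximum, so with the counting approach a tie silently resolves to the earliest group. The counts below are made up for illustration.

```matlab
counts = [4 8 2 8];                     % hypothetical element counts for four groups

[~, first] = max(counts);               % index of the first maximum only
allMax = find(counts == max(counts));   % indices of every group tied for the maximum
```

Here first is 2, while allMax lists both group 2 and group 4.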

I doubt that importing the entire file into MATLAB and doing string operations would be particularly efficient. I would do as much processing as possible with as little data as possible, which means operating at the level of file-reading. For example, reading only one group at a time would likely be efficient, something like this:

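outN = 0;   % label of the largest group found so far
outV = [];  % its elements
[fid,msg] = fopen('test.txt','rt');
assert(fid>=3,msg)
fscanf(fid,'{'); % skip the opening brace
while ~feof(fid)
    tmpN = fscanf(fid,'"%f"%*[: ][');   % group label, then skip ': ['
    tmpV = fscanf(fid,'%f,',[1,Inf]);   % comma-separated elements as a row vector
    fscanf(fid,']%*[, }]');             % skip ']' and the separator or closing brace
    assert(~isempty(tmpN),'could not match number')
    assert(~isempty(tmpV),'could not match vector')
    if numel(tmpV)>numel(outV) % or whatever condition.
        outV = tmpV;
        outN = tmpN;
    end
end
fclose(fid);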

If you could upload a small sample file (a few thousand characters) by clicking the paperclip button, I could test this on it too. Instead, I had to create my own test file (attached) to test my code with (I made the third group have the most elements).

2 comments
dpb on 15 Jul 2020

Indeed. Nothing in the above was intended as anything that would necessarily be fast.

Your approach is similar to what I figured would be necessary: read a block of whatever size is feasible given memory constraints, find the last "]" in the block, and count the commas in each group.

If there's another "[" in the block after the last "]", then that's part of the next block to process.

Rinse and repeat...
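The block-wise idea described above could be sketched roughly as follows. This is not code from the thread: the variable names and the temporary demo file are hypothetical, and the tiny block size is chosen only so that the carry-over logic is exercised on the sample line (a real run would use blocks of many megabytes). It assumes no group is empty.

```matlab
% Demo setup (hypothetical): write the sample line to a temporary file.
fname = [tempname '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, '%s', '{"1":  [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}');
fclose(fid);

blockSize = 16;     % characters per read; deliberately tiny for the demo
bestN = 0;          % element count of the largest group so far
bestList = '';      % its text, without brackets
carry = '';         % unprocessed tail of the previous block

fid = fopen(fname, 'r');
while true
    chunk = fread(fid, [1, blockSize], '*char');
    if isempty(chunk), break, end                 % nothing left to read
    buf = [carry, chunk];
    last = find(buf == ']', 1, 'last');           % end of the last complete group
    if isempty(last)
        carry = buf;                              % no complete group yet, read more
        continue
    end
    lists  = extractBetween(buf(1:last), '[', ']');
    counts = cellfun(@(c) sum(c == ',') + 1, lists);   % elements = commas + 1
    [n, k] = max(counts);
    if n > bestN                                  % strict > keeps the first group on a tie
        bestN = n;
        bestList = lists{k};
    end
    carry = buf(last+1:end);                      % may hold the start of the next group
end
fclose(fid);
delete(fname);
```

Only the current block plus the carried-over tail is held in memory, so the peak footprint is bounded by the block size and the length of the longest single group.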

Neil on 15 Jul 2020

Thank you both for the help! As I responded above, the string editing worked out pretty quickly. But I'll try this out if my file size increases any more.
