
Breaking data from a large text file into groups

Neil on 13 Jul 2020
Commented: Walter Roberson on 15 Jul 2020
I have a text file that has groups of elements formatted like this:
{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423], .... }
and so on. The data is all in one row in this format. I just need to determine the group with the highest number of elements and output that list only. However these files can be fairly large (~260 MB) with a couple thousand groups and element numbers up to the millions. I'm struggling to find the best method to break the scan (probably at the double quotes), save that group (probably to a cell), and then move on to the next one.
  3 comments
Neil on 13 Jul 2020
Yes, there is one pair of curly brackets around the entire row, in the actual file. To your second question, both- the first group is typically the largest and may have most of the elements present.
When the file was smaller, the easiest method was to just manually delete everything after this first group as long as that was the case. I've been thinking and it would probably be easiest for my code to just count the group number and number of elements. Then go back to the largest group and rewrite that list in a new file. Just not sure the best way to separate and count them.
Walter Roberson on 15 Jul 2020
data = cat(1, image_patches,labels);
That code is overwriting all of data each iteration.
It looks to me as if data will not be a vector, but I do not seem to be able to locate any hellopatches() function so I cannot tell what shape it will be. As you are not doing imresize() I also cannot be sure that all of the images are the same size, so I cannot be sure that data will be the same size for each iteration. Under the circumstances you should be considering saving into a cell array.
Note: please do not post the same query multiple times. I found at least 10 copies of your query :(


Accepted Answer

dpb on 14 Jul 2020
Edited: dpb on 15 Jul 2020
Well, the following is pretty easy as far as counting goes. How fast it is on a real file, and whether the record needs to be read piecemeal, I have no clue without a real file to test.
s='{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';
s=erase(s,{'{','}'});
ss=split(s,{':',']'});
ss=ss(2:2:end);
>> ss
ss =
4×1 cell array
{' [1,2,3,5,10,15,25,37'}
{' [1,5,10,20' }
{' [2000,2170' }
{' [35,72,423' }
>>
>> [~,ixss]=max(cellfun(@(s) sum(s==','),ss))
ixss =
1
>>
Undoubtedly regular expressions could come to the rescue here as well but I'm no guru...
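For what it's worth, a regular-expression version might look something like the sketch below. It assumes every group has the form "key": [comma-separated numbers] with no empty groups, and the variable names (tok, keys, bodies, counts) are only illustrative. It has been checked against the short sample string only, not a real file.

```matlab
% Sample text in the same layout as the question (illustrative data only).
s = '{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';

% Capture the quoted key and the text between the square brackets.
% tok is a 1xN cell array; each element is a 1x2 cell {key, body}.
tok = regexp(s, '"(\d+)":\s*\[([^\]]*)\]', 'tokens');

keys   = cellfun(@(t) t{1}, tok, 'UniformOutput', false);
bodies = cellfun(@(t) t{2}, tok, 'UniformOutput', false);

% Element count = number of commas + 1 (assumes no empty groups).
counts = cellfun(@(b) sum(b == ',') + 1, bodies);

[maxCount, ix] = max(counts);   % max returns the first index on ties
largestKey  = keys{ix};         % group label as text
largestList = bodies{ix};       % the list as text, without brackets
```

Capturing the key along with the body means the group label is available directly, rather than being inferred from the position in the list.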
The new(ish) string functions version could be-
ss=extractBetween(s,'[',']');
[~,ixss]=max(cellfun(@(s) sum(s==','),ss))
or, if want the brackets, too, then
ss=extractBetween(s,'[',']','Boundaries','inclusive');
[~,ixss]=max(cellfun(@(s) sum(s==','),ss))
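Once the index of the largest group is known, getting that list out as numbers and into a new file could be sketched as below. The output file name largest_group.txt and its location in tempdir are assumptions made for illustration, and the bracketed output format is just one possible choice.

```matlab
% Sample text in the same layout as the question (illustrative data only).
s = '{"1": [1,2,3,5,10,15,25,37], "2": [1,5,10,20], "3": [2000,2170], "4": [35,72,423]}';

% Text between each pair of brackets (a cell array for char input).
ss = extractBetween(s, '[', ']');

% Index of the group with the most commas, i.e. the most elements.
[~, ix] = max(cellfun(@(c) sum(c == ','), ss));

% Convert the winning list to a numeric row vector.
vals = sscanf(ss{ix}, '%f,').';

% Write the list to a new file (hypothetical name and location).
outFile = fullfile(tempdir, 'largest_group.txt');
fid = fopen(outFile, 'wt');
assert(fid >= 3, 'could not open output file');
fprintf(fid, '[%s]\n', ss{ix});
fclose(fid);
```

Writing the original text back out avoids any reformatting of the numbers; vals is there if the values are needed for further processing.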
  1 comment
Neil on 15 Jul 2020
Thanks! This worked well for me, and was pretty fast (~30 sec for the 260 MB file including writing a new file)


More Answers (1)

Stephen23 on 15 Jul 2020
Edited: Stephen23 on 15 Jul 2020
"I just need to determine the group with the highest number of elements and output that list only."
So for your example data this would be the first group?
What happens if multiple groups have the same number of elements?
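The tie question matters because max returns only the first index when several groups share the maximum. A small sketch with made-up counts shows the difference between taking the first winner and collecting all of them:

```matlab
% Made-up element counts for four groups; groups 2 and 4 are tied.
counts = [3 5 2 5];

% max reports only the first position of the maximum.
[~, firstIx] = max(counts);

% Comparing against the maximum finds every tied group.
allIx = find(counts == max(counts));
```

Which of the two is wanted depends on what should happen with tied groups, which the question does not say.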
I doubt that importing the entire file into MATLAB and doing string operations would be particularly efficient. I would do as much processing as possible with as little data as possible, which means operating at the level of file-reading. For example, reading only one group at a time would likely be efficient, something like this:
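outN = 0;   % label of the largest group found so far
outV = [];  % elements of the largest group found so far
[fid,msg] = fopen('test.txt','rt');
assert(fid>=3,msg)
fscanf(fid,'{'); % skip the opening curly bracket
while ~feof(fid)
    tmpN = fscanf(fid,'"%f"%*[: ][');   % group label, up to the opening [
    tmpV = fscanf(fid,'%f,',[1,Inf]);   % all elements of this group
    fscanf(fid,']%*[, }]');             % closing ] and any separator
    assert(~isempty(tmpN),'could not match number')
    assert(~isempty(tmpV),'could not match vector')
    if numel(tmpV)>numel(outV) % or whatever condition.
        outV = tmpV;
        outN = tmpN;
    end
end
fclose(fid);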
If you could upload a small sample file (a few thousand characters) by clicking the paperclip button then I could test this too. Instead I had to create my own test file (attached) to test my code with (I made the third group have the most elements).
  2 comments
dpb on 15 Jul 2020
Indeed. Nothing in the above was intended as anything that would necessarily be fast.
Your approach is similar to what I figured would be necessary: read a block of whatever size is feasible given memory constraints, find the last "]" in the block and count the commas between groups.
If there's another "[" in the block after the last "]", then that's part of next block to process.
Rinse and repeat...
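A rough sketch of that block-wise idea follows. The sample file name, the tiny blockSize (chosen only so that the short sample spans several blocks) and all variable names are assumptions for illustration; nothing here has been timed on a real 260 MB file.

```matlab
% Create a small sample file (hypothetical name, illustrative data).
fname = fullfile(tempdir, 'groups_sample.txt');
fid = fopen(fname, 'wt');
fprintf(fid, '%s', '{"1": [1,5,10,20], "2": [1,2,3,5,10,15,25,37], "3": [2000,2170]}');
fclose(fid);

blockSize = 16;   % deliberately tiny; a real run would use a far larger block
bestCount = 0;    % most elements seen so far
bestIdx   = 0;    % position (1st, 2nd, ...) of that group in the file
nGroups   = 0;    % running count of complete groups
carry     = '';   % text after the last ']' of the previous block

fid = fopen(fname, 'rt');
assert(fid >= 3, 'could not open sample file');
while true
    chunk = fread(fid, [1, blockSize], '*char');
    if isempty(chunk), break; end
    buf = [carry, chunk];
    lastClose = find(buf == ']', 1, 'last');
    if isempty(lastClose)
        carry = buf;          % no complete group yet, keep accumulating
        continue
    end
    complete = buf(1:lastClose);       % ends on a closed group
    carry    = buf(lastClose+1:end);   % start of the next group, if any
    groups = regexp(complete, '\[[^\]]*\]', 'match');
    for k = 1:numel(groups)
        nGroups = nGroups + 1;
        n = sum(groups{k} == ',') + 1;  % commas + 1 = element count
        if n > bestCount
            bestCount = n;
            bestIdx   = nGroups;
        end
    end
end
fclose(fid);
```

Only the counts are kept here; recovering the winning list itself would need either a second pass or storing the text of the current best group as it is found.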
Neil on 15 Jul 2020
Thank you both for the help! As I responded above the string editing worked out pretty quickly. But I'll try this out if my file size increases any more.

