Streamline for loops and indexing of large files
    6 vues (au cours des 30 derniers jours)
  
       Afficher commentaires plus anciens
    
    Louise Wilson
      
 le 11 Oct 2020
  
    
    
    
    
    Réponse apportée : Seth Furman
    
 le 15 Oct 2020
            I am wondering if anyone has any advice on how I could streamline a script I am working on.
The purpose of a script is to subset a large file (e.g. 7000*72000) according to time values which occur in the first column of the large file and the first column of a second file of interest. So, I have two datasheets where some of the rows match in time, and I want to pull out the rows which match. I have 10 of these very large files and only one or two of them will contain times that match, so the first task is to find out which of the very large files I have, corresponds to the file of interest, before loading it in.
folder=('Y:\AIS data\CPA_FILT\CPA_ACOU\'); %folder containing files of interest
CPA_ACOU_files=dir(fullfile(folder,'*.csv')); %list each file in folder
count=1; %index for output
for j=1:length(CPA_ACOU_files) %for each file
    %read in 'interesting' files
    CPA=readtable(fullfile(folder,CPA_ACOU_files(j).name));
    CPA_t=datetime(CPA.Var1,'ConvertFrom','datenum'); %convert time to datetime
    CPA_times=timeofday(CPA_t(:,1)); %remove times
    CPA_datesonly=CPA_t(:,1)-CPA_times; %get dates of time vector
    %read in times of large datasheets (PSD) and find any that match interesting files
    tvec_folder=('Y:\AIS data\CPA_FILT\TVEC\'); 
    tvec_files=dir(fullfile(tvec_folder,'*.csv'));
    %for each tvec, do ismember, and if any times match interesting times, store name of tvec file
    %(shortcut to load PSD so we don't have to do it manually)
    for a=1:length(tvec_files) 
        tvec{a}=readtable(fullfile(tvec_folder,tvec_files(a).name)); %read in tvec
        tvec{a}=datetime(tvec{a}.Var1,'ConvertFrom','datenum'); %covnert times to datetime
        tvec_times=timeofday(tvec{a});
        tvec_datesonly=tvec{a}-tvec_times;
        [matches]=ismember(CPA_datesonly,tvec_datesonly);
        B=any(matches);
        if (B==1)
           matchingfiles(a,:)=string(tvec_files(a).name);
        else
            %do nothing
        end
    end
    matchingfiles=rmmissing(matchingfiles); 
    %tvec files with times which match interesting file times
    for c=1:length(matchingfiles) %read in corresponding PSD file
        %remove TVEC from start of filename to get filename which matches PSD file
        filename_parts=strsplit(matchingfiles(c),'_'); 
        filename_parts(:,1)=[];
        filename=strjoin(filename_parts,'_');
        %read in PSD file
        PSDfolder=('Y:\SoundTrap\Boats\Hydrophone\PSD Output\Duty cycle data\');
        PSDfiles{c}=readtable(fullfile(PSDfolder,filename));
    end
    %for each PSD, subset by rows which match interesting files
    %(if they occur over multiple PSD files, they are concatenated into one output file)
    for i=1:length(PSDfiles)
        PSD_t=datetime(PSDfiles{1,i}.Var1,'ConvertFrom','datenum');
        PSD_t(:,2)=PSD_t+minutes(2); %time window for filtering PSD files 
        [rows,cols]=size(CPA);
        for k=1:rows %for each hydrophone recording with CPA
            acou_CPA_file=CPA.Var1(k) %get time of file
            acou_CPA_file=datetime(acou_CPA_file,'ConvertFrom','datenum')
            for m=1:length(PSD_t) %check during which PSD file it occurs
                starttime=PSD_t(m,1); %two minute period to check between
                endtime=PSD_t(m,2);
                if (acou_CPA_file>=starttime) && (acou_CPA_file<=endtime)
                    CPA_PSD(count,:)=PSDfiles{1,i}{m,:}; %store PSD data in output
                    count=count+1;
                else
                    %do nothing
                end
            end
        end
    CPA_PSD_t=datetime(CPA_PSD(:,1),'ConvertFrom','datenum'); %check times
    %outfolder=('Y:\AIS data\CPA_FILT\CPA_PSD\');
    %writetable(CPA_PSD,strcat(outfolder,'acouCPA_',PSDfile(i).name));
    end
end
2 commentaires
  Mathieu NOE
      
 le 12 Oct 2020
				hello
this is my 2 cents suggestion ; instead of having to load very large files into matlab, I'd rather try to find a pre processing method to copy the time vector only from each data file into a new data_time file , and run the matlab code on those much smaller files. Once you know the files and time index of interest, you can do the post processing of the data in matlab.
if you have to do that not very often, you can try a powerfull text editor like Textpad (https://www.textpad.com/home)
otherwise maybe a batch exe file could do the trick
Réponse acceptée
  Seth Furman
    
 le 15 Oct 2020
        1) Consider using innerjoin or outerjoin to find the rows with matching row times between two tables.
2) Consider using tall tables for data that are too big to fit into memory.
For example,
>> tt1 = table2timetable(tall(tabularTextDatastore('tt1.csv')))
tt1 =
  M×1 tall timetable
       Time        Data
    ___________    ____
    15-Oct-2020     1  
    16-Oct-2020     2  
    17-Oct-2020     3  
    18-Oct-2020     4  
    19-Oct-2020     5  
    20-Oct-2020     6  
    21-Oct-2020     7  
    22-Oct-2020     8  
         :          :
         :          :
>> tt2 = table2timetable(tall(tabularTextDatastore('tt2.csv')))
tt2 =
  4×1 tall timetable
       Time        Data
    ___________    ____
    16-Oct-2020     11 
    20-Oct-2020     12 
    22-Oct-2020     13 
    25-Oct-2020     14 
>> tJoin = innerjoin(tt1,tt2,'Keys','Time')
tJoin =
  M×2 tall timetable
    Time    Data_tt1    Data_tt2
    ____    ________    ________
     ?         ?           ?    
     ?         ?           ?    
     ?         ?           ?    
     :         :           :
     :         :           :
Preview deferred. Learn more.
>> tJoin = gather(tJoin)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 7: Completed in 0.31 sec
- Pass 2 of 7: Completed in 0.17 sec
- Pass 3 of 7: Completed in 0.39 sec
- Pass 4 of 7: Completed in 0.47 sec
- Pass 5 of 7: Completed in 0.32 sec
- Pass 6 of 7: Completed in 0.49 sec
- Pass 7 of 7: Completed in 0.71 sec
Evaluation completed in 3.4 sec
tJoin =
  3×2 timetable
       Time        Data_tt1    Data_tt2
    ___________    ________    ________
    16-Oct-2020       2           11   
    20-Oct-2020       6           12   
    22-Oct-2020       8           13
0 commentaires
Plus de réponses (0)
Voir également
Catégories
				En savoir plus sur Timetables dans Help Center et File Exchange
			
	Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!


