How to find rows in table quickly based on time ranges specified in another table?
5 vues (au cours des 30 derniers jours)
Afficher commentaires plus anciens
I have a table T that contains a time start, time end and an ID. I have another table S that contains a timestamp and an ID. T typically has 50k rows, while S has ~1 million rows. Simplified example code for tables T and S:
rng(1);
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
timeEnd(randi(numel(timeEnd), 100, 1)) = NaT; % some data will be missing
id = [ 1:floor(numel(timeStart)/2) 1:ceil(numel(timeStart)/2) ]'; % id's are reused
T = table(timeStart, timeEnd, id);
head(T)
size(T)
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
head(S)
size(S)
For each row in T, I would like to find all the rows in S that has a timestamp within the (time start, time end) range and has the same ID as in T. Rows in T that do not have an timeEnd can be ignored. I see a few approaches:
- Use isbetween() in a for loop around all the rows of T for the time-based check, but that was slow and really does not scale well with the number of rows in S (and linearly in the number of rows in T).
- I could achieve this by converting S to a timetable and using timeranges constructed from T.timeStart and T.timeEnd together with a check on the T.id and S.id columns, but that seemed to be just as slow as approach 1.
- I can convert all datetimes to unix time (using posixtime) and instead work with those as real numbers. This seems to be much faster than approaches 1 and 2.
- I can take approach 3 and instead of using a for loop, compare all time ranges against all timestamps and create one giant (1 million by 50k) logical matrix indicating whether a row in S is within the timerange of each row of T. This matrix is sparse. Like all previous approaches, this does not scale well at all. To make memory use manageable, I can create a number of smaller (X by 50k) sparse matrices and then vertcat them at the end.
Does anyone see a good approach that scales well and keeps computation time manageable?
0 commentaires
Réponses (2)
Voss
le 30 Oct 2023
Modifié(e) : Voss
le 30 Oct 2023
% input data construction (copied):
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
id = (1:numel(timeStart))';
T = table(timeStart, timeEnd, id);
rng(1);
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
% find which row of T contains each id in S:
[ism,idx] = ismember(S.id,T.id);
% make sure each id in S exists in T:
assert(all(ism));
% make a new table (S_new) that has the timeStamp from S and everything
% from T, but the rows of T are ordered according to the id in S:
S_new = [S(:,1), T(idx,:)];
head(S_new)
size(S_new)
% now keep only rows in S_new where timeStamp is between timeStart and timeEnd:
to_keep = S_new.timeStamp >= S_new.timeStart & S_new.timeStamp <= S_new.timeEnd;
S_new = S_new(to_keep,:);
head(S_new)
size(S_new)
Voir également
Catégories
En savoir plus sur Data Type Conversion dans Help Center et File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!