Collating data from different sources
Afficher commentaires plus anciens
I have two csv datasets t1 and t2:
t1: t2:
A B C D A B C D
1 5 4.1 "a" 4 3 4.2 "e"
2 3 3.8 "g" 7 5 2.5 "c"
3 2 3.6 "d" 9 4 4.3 "b"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"
So t1 is missing some values but is the 'master' table. t2 has only the rows with missing values from t1, but with the data filled (from another source).
I am trying to fill the missing values in t1 using the values from t2, where variable 'A' is like an index (having unique values for each observation).
I have tried using outerjoin() but the result is:
t3:
tleft_A tleft_B tleft_C tleft_D tright_A tright_B tright_C tright_D
1 5 4.1 "a" NaN NaN NaN NaN
2 3 3.8 "g" NaN NaN NaN NaN
3 2 3.6 "d" NaN NaN NaN NaN
4 NaN 4.2 "e" 4 3 2.5 "e"
5 6 4.2 "h" NaN NaN NaN NaN
6 2 3.6 "d" NaN NaN NaN NaN
7 NaN 2.5 "c" 7 5 2.5 "c"
8 5 1.8 "e" NaN NaN NaN NaN
9 NaN 4.3 "b" 9 4 4.3 "b"
10 6 2.6 "a" NaN NaN NaN NaN
What I am trying to achieve, however, is:
t3:
A B C D
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 3 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 5 2.5 "c"
8 5 1.8 "e"
9 4 4.3 "b"
10 6 2.6 "a"
Is there a simple way to accomplish this, or do I need to extract, index and combine the variables, or approach it programmatically via a loop or something similar? None of the interpolation-type fill methods seem to be appropriate either.
I'm fairly new to MatLab, so if there is any other information I am missing which would help clarify the problem, I'll be happy to update the question with it.
Thanks in advance for any insights you can provide.
2 commentaires
Dyuman Joshi
le 4 Oct 2023
Please attach the csv files, use the paperclip button to do so.
Stephen
le 4 Oct 2023
Réponses (1)
You could use this approach
c1 = {
'A' 'B' 'C' 'D'
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"};
c2 = {
'A' 'B' 'C' 'D'
4 3 4.2 "e"
7 5 2.5 "c"
9 4 4.3 "b"};
t1 = cell2table(c1(2:end,:));
t1.Properties.VariableNames = c1(1,:)
t2 = cell2table(c2(2:end,:));
t2.Properties.VariableNames = c2(1,:)
% Find the indices of the missing data in t1
idxMissing = find(isnan(t1.B))
% Fill the values from t2
t3 = t1;
[~,idx2] = ismember(idxMissing,t2.A)
t3(idxMissing,:) = t2(idx2,:)
7 commentaires
Note this will also work if t2 has extra rows that don't correspond to missing data for example
c1 = {
'A' 'B' 'C' 'D'
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"};
c2 = {
'A' 'B' 'C' 'D'
4 3 4.2 "e"
7 5 2.5 "c"
8 5 1.8 "e"
9 4 4.3 "b"
15 8 6.2 "g"};
t1 = cell2table(c1(2:end,:));
t1.Properties.VariableNames = c1(1,:)
t2 = cell2table(c2(2:end,:));
t2.Properties.VariableNames = c2(1,:)
% Find the indices of the missing data in t1
idxMissing = find(isnan(t1.B))
% Fill the values from t2
t3 = t1;
[~,idx2] = ismember(idxMissing,t2.A)
t3(idxMissing,:) = t2(idx2,:)
Stephen
le 4 Oct 2023
Jon
le 4 Oct 2023
The approach I outline is relatively straightforward, but I understand, it would be nice if there were some more or less "built in" approach for doing this. There may be one, maybe someone else will answer and point out a simpler way to do this.
Stephen
le 4 Oct 2023
Jon
le 5 Oct 2023
No worries, I totally understand that you are thinking there must be nicer way to do something like this. I couldn't think of one, and wanted to make sure you at least had a way to get it done. I'm curious to know also if someone has a simpler approach for this. Let's see, maybe you'll get some more answers.
Stephen
le 5 Oct 2023
Here is another approach, that utilizes operations with timetables, rather than indexing. Still a little arcane, but maybe you prefer this. You may want to remove a few more semicolons at end of statements to understand better what each step is doing
c1 = {
'A' 'B' 'C' 'D'
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"};
c2 = {
'A' 'B' 'C' 'D'
4 3 4.2 "e"
7 5 2.5 "c"
8 5 1.8 "e"
9 4 4.3 "b"
15 8 6.2 "g"};
% Put data into timetables, using A as the time variable
t1 = cell2table(c1(2:end,:),'VariableNames',c1(1,:));
t1.A = days(t1.A); % convert first column to a duration
t1 = table2timetable(t1);
t2 = cell2table(c2(2:end,:),'VariableNames',c2(1,:));
t2.A = days(t2.A); % convert first column to a duration
t2 = table2timetable(t2);
% Combine the tables (makes union of timetables) putting missing data
% wherever data is not found for a given time
t3 = synchronize(t1,t2)
% Replace the missing values using values from t2
% note by defualt, min omits missing, e.g. NaN
t3.B_t1 = min(t3.B_t1,t3.B_t2);
% Remove columns from t2
t3 = t3(:,1:3);
% Remove any rows that were in t2 but not t1
% These will have missing data from the synchronization
t3 = rmmissing(t3);
% Optionally clean up the variable names
t3.Properties.VariableNames = c1(1,2:end)
Catégories
En savoir plus sur Logical dans Centre d'aide et File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!