27 views (last 30 days)

Show older comments

Dear members,

I have a program that allows me to what I set out to do on a single time series. Now that I’m satisfied with that, I’ve got to roll it out on a whole bunch of time series. Say I have an indicator for many countries, where data availability differs, so lots of NaN-there are plenty of members posting answers and questions about getting rid of NaN in case of vectors, or, as far as matrices are concerned, getting rid of entire rows or columns whenever a NaN appears.

For instance:

- https://uk.mathworks.com/matlabcentral/answers/179142-how-can-i-remove-nan-values-from-a-matrix
- https://stackoverflow.com/questions/5202680/matlab-how-to-efficiently-remove-nan-elements-from-matrix#
- http://uk.mathworks.com/matlabcentral/fileexchange/41941-snip-m-snip-elements-out-of-vectors-matrices

However, I don’t want to get rid of whole rows or columns, but simply to purge each column independently of its NaN regardless of what happens to the others. How would I go about doing this? Any ideas would be much appreciated.

I started with something like this:

% clearvars

data = xlsread('DMdata.xlsx','B2:AA270','Sheet1');

datacolumns=size(data,2);

for bycolmn =1:length(datacolumns);

And then I got stuck.

This:

data(isnan(data))= []

deletes the NaN and returns my data into one big long vector. Clearly something of this fashion ought to go in the loop above, but I'm not sure how.

Many thanks in advance

Cedric Wannaz
on 17 Sep 2017

Edited: Cedric Wannaz
on 18 Sep 2017

I think that the major problem is basic data manipulation, and once you have that right you'll be fine for the rest. Here are a few hints and we'll see how it goes from there.

Loading data : you could use the other outputs of XLSREAD; the first is the numeric part, the second the text, and the third raw data (a mix of text and numbers).

[num,txt,raw] = xlsread( 'DMdata.xls' ) ;

If we look at what we get with that, indexing a 3 by 6 block (so we see the top-left part of the output of XLSREAD):

>> raw(1:3,1:6)

ans =

3×6 cell array

Columns 1 through 4

[ NaN] 'AU UNEMPLOYMENT R…' 'AU UNEMPLOYMENT:R…' 'OE UNEMPLOYMENT R…'

'Q2 1950' 'NA' 'NA' 'NA'

'Q3 1950' 'NA' 'NA' 'NA'

Columns 5 through 6

'BG UNEMPLOYMENT R…' 'CN UNEMPLOYMENT:R…'

'NA' 'NA'

'NA' 'NA'

We see that in cell array raw we have everything and the NA values are strings. Doing the same thing for the num array, we get:

>> num(1:3,1:6)

ans =

NaN NaN NaN NaN 4.8000 NaN

NaN NaN NaN NaN 4.4300 NaN

NaN NaN NaN NaN 4.0700 NaN

Well, that's interesting: it appears that the first row doesn't correspond to Q2 1950 but to Q1 1995, so XLSREAD outputs in fact the smallest block that encompasses all numeric values. Good to know, because if you need to associate a date to each row, you'll have to "sync" with what XLSREAD does if you use its numeric output.

Processing row and column headers : taking a little detour, let's imagine that we need the dates and the column headers. The cell array of all dates is column 1 of cell array raw, elements 2 to the end: raw(2:end,1). If we need the quarter IDs and the years as distinct numbers, many approaches are possible. I propose the following: we can use SSCANF to read/convert numbers from a string; let's see how it works on the first date entry: raw{2,1}:

>> sscanf( raw{2,1}, 'Q%d %d' )

ans =

2

1950

Simple, now we need to see if we can apply this to each element of raw(2:end,1). We could loop, or use CELLFUN as follows (more concise):

>> temp = cellfun( @(x) sscanf( x, 'Q%d %d' ), raw(2:end,1), 'UniformOutput', false ) ;

where

- @(x) sscanf( x, 'Q%d %d' ) : is an anonymous function of x. It is almost the same as creating an M-File: function numValues = myScan( x ), numValues = sscanf( x, 'Q%d %d' ) ; end and passing its handle to CELLFUN (handles are roughly "function pointers" in MATLAB, they are one way to pass a function to another function), with the advantage that we can declare it inline with no name.
- raw(2:end,1) : cell array on which CELLFUN must operate. As a reminder, {} indexing accesses the content of a cell, and () indexing is normal block indexing: it returns the (sub-) cell array indexed (and not the content of cells).
- 'UniformOutput', false : tell CELLFUN to output a cell array of outputs of the anonymous function, which is required because the function outputs an array (e.g. [2; 1950] and not a scalar). You'll have to experiment a little with CELLFUN to see what are its limitations.

Now we can for example concatenate the output and create a vertical numeric array out of it:

>> dates = horzcat( temp{:} ).' ;

dates =

2 1950

3 1950

4 1950

1 1951

...

We can also extract the column headers, which are stored in the first row of raw, elements 2 to the end: raw(1,2:end). This could be useful for creating legends, labels, etc.

>> headers = raw(1,2:end)

headers =

1×26 cell array

'AU UNEMPLOYMENT R…' 'AU UNEMPLOYMENT:R…' 'OE UNEMPLOYMENT R…' ...

This is interesting, because now we have all the dates, all the column headers, and we could work without hard coding boundaries like in your code.

>> nCols = numel( headers ) ;

>> nDates = size( dates, 1 ) ;

which can be used when defining loop boundaries:

for colId = 1 : nCols

...

end

Using dates, we can also build vectors of logicals for extracting relevant data, e.g. for a given period:

>> selectYears = dates(:,2) >= 1951 & dates(:,2) <= 1953

selectYears =

269×1 logical array

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

0

0

0

0

...

Processing data : we saw that the numeric array num is defined in a way that can be difficult to use, as XLSREAD truncates it to the smallest block that encompasses all numeric values. For this reason I generally prefer working with raw data. Assuming that you just want raw data, we can update the first call to XLSREAD by the way:

[~,~,raw] = xlsread( 'DMdata.xls' ) ;

so we don't have num and txt that we won't use. Working with raw, the data are in the block raw(2:end,2:end). Let's see how it looks at a place with a few data and a few 'NA':

>> raw(19:22,5:7)

ans =

4×3 cell array

'NA' 'NA' 'NA'

'NA' 'NA' 'NA'

'NA' [4.8000] 'NA'

'NA' [4.4300] 'NA'

we see that it contains strings and numbers. We can check it:

>> class( raw{21,6} )

ans =

double

>> class( raw{21,7} )

ans =

char

There are multiple ways to proceed here; I propose to replace all entries of type/class char by NaN, and then to convert the cell array into a numeric array. To detect if a cell content is char we can use ISCHAR. We also saw how to apply a function to all elements of a cell array:

>> temp = raw(2:end,2:end) ;

>> selectChar = cellfun( @ischar, temp ) ;

where

- @ischar is a shortcut for @(x)ischar(x) (there is a difference but it doesn't matter now).
- There is no need for the 'UniformOutput' argument, because ISCHAR outputs a scalar.

We can see that selectChar is a 269x26 array of logical

>> whos selectChar

Name Size Bytes Class Attributes

selectChar 269x26 6994 logical

that we can use to index (logical indexing) cell array temp for a replacement:

>> temp(selectChar) = {NaN} ;

Now we see what 'NA' strings were replaced with numeric NaN:

>> temp(18:21,4:6)

ans =

4×3 cell array

[NaN] [ NaN] [NaN]

[NaN] [ NaN] [NaN]

[NaN] [4.8000] [NaN]

[NaN] [4.4300] [NaN]

and we can convert this cell array to a numeric array:

>> data = cell2mat( temp ) ;

If you display data, you will see that it contains all the rows and columns present in your Excel file and that there is no truncation.

Summary: I developed a lot so it is clear, but the whole thing until here can be condensed into:

[~,~,raw] = xlsread( 'DMdata.xls' ) ;

dates = cellfun( @(x) sscanf( x, 'Q%d %d' ), raw(2:end,1), 'UniformOutput', false ) ;

dates = horzcat( dates{:} ).' ;

headers = raw(1,2:end) ;

data = raw(2:end,2:end) ; % Cell array mixed str/values.

data(cellfun( @ischar, data )) = {NaN} ; % Cell array NaN/values

data = cell2mat( data ) ; % Numeric array.

[nDates, nCols] = size( data ) ;

which outputs the following when run from scratch:

>> whos

Name Size Bytes Class Attributes

data 269x26 55952 double

dates 269x2 4304 double

headers 1x26 5050 cell

nCols 1x1 8 double

nDates 1x1 8 double

raw 270x27 871220 cell

This approach has the advantage that it outputs "clean" arrays, whose size do match: dates has 269 rows that correspond to the 269 rows of data, headers has 26 columns that correspond to the 26 columns of data, there was no truncation, and we know the size of everything programmatically (no need for hard coding "weird" boundaries).

Further processing: now we can work with this: first, many functions can work with NaN and/or have a special version and/or special parameters that make them support NaNs. However, if we needed to extract non-NaN values, e.g. from column 3, we could proceed as follows:

select = ~isnan( data(:,3) ) ; % ~ = NOT, can be used for logical indexing

values = data(select,3) ;

or directly

values = data(~isnan( data(:,3) ),3) ;

This can easily be used in a loop over columns:

for colId = 1 : nCols

values = data(~isnan( data(:,colId) ),colId) ;

... do something with values

end

Note that if we needed corresponding dates and/or quarters, we could easily get them as well:

select = ~isnan( data(:,3) ) ;

values = data(select,3) ;

datesForValues = dates(select,:) ;

or again in a loop, using the relevant header for a title:

for colId = 1 : nCols

select = ~isnan( data(:,colId ) ) ;

values = data(select,colId ) ;

datesForValues = dates(select,:) ;

results = myAnalyzer( dates(select,:), values ) ;

plot( ... ) ;

title( headers{colId} ) ;

end

My guess is that if you go over these explanations, understand logical indexing, how to operate on the various types of arrays, etc, your problem will disappear!

Cheers, Cedric

Cedric Wannaz
on 17 Sep 2017

Corrected a few typos. Updated a bit the code (in the summary) to simplify.

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!