Linear regression on data with asymmetric measurement error

Question

Katrina on 10 Nov 2023

Answered: Jeff Miller on 14 Nov 2023

I am looking to perform a linear regression on measured data that takes into account an asymmetric error in the data. I've created some dummy data to illustrate what I mean:

The blue curve represents the measured data, while the red curve is the lower bound and is notably closer to the measured data than the orange curve, which represents the upper bound.

Snippet of code to create dummy data:

xdata = linspace(0,10, 20);
ydata = 2*xdata+1.5*rand(1,length(xdata));
y_err_low = 0.3*xdata+1.5*rand(1,length(xdata));
y_err_high = 0.6*xdata+1.5*rand(1,length(xdata));
ylowbnd = ydata - y_err_low;
yupbnd = ydata + y_err_high;
plot(xdata, ydata,'o-', 'LineWidth', 2, 'DisplayName', 'measured data') 
hold on
plot(xdata, ylowbnd, 'x--', 'LineWidth', 2, 'DisplayName', 'lower bound') 
plot(xdata, yupbnd, 's--', 'LineWidth', 2, 'DisplayName', 'upper bound') 
xlabel('x')
ylabel('y')
legend('Location','northwest')

I have linear regression approaches that rely on the error in y being symmetric about the measured datapoint, but am struggling to find a way to weight my regression based on an asymmetric error.

Things I've been digging into:

fmincon (for both fmincon and lsqcurvefit, the bounds, equalities, and inequalities do not appear to accept a bound, etc., given as vectors, e.g., [...], where the anonymous function to fit the data would be [...] and the objective for fmincon would be [...])
lsqcurvefit
Method of Maximum Likelihood (here the examples I've been seeing rely on Gaussian distribution around each ydata point, so not asymmetric)

I would appreciate any help in how I can go about giving the fit more (or less) freedom to roam as matches with the asymmetric error associated with each data point.

Thanks!

Torsten on 14 Nov 2023

Where do the error curves come from? What do they represent?

Answer 1

Mathieu NOE on 10 Nov 2023

Hello Katrina,

Maybe this?

You can pull the mean curve closer to either the upper or the lower bound by adjusting the coefficient a:

a = 0.7; % a = 1 is equivalent to standard linear averaging (no weighting)
% a<1 shifts the mean towards the lower bound, a>1 towards the upper bound
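
To see why a has this effect, here is a minimal sketch of the weighting applied inside a single window. The window values and the names avals and wmean are made up for illustration; only the linspace(1/a, a, ...) ramp over the sorted values is taken from the function in this answer.

```matlab
% Sketch: how the factor a tilts the weights applied to one sorted window.
% The numbers below are arbitrary illustration values, not data from the thread.
tmp   = sort([3 1 4 1 5 9 2]);            % window values in ascending order
avals = [0.5 1 2];                        % a < 1, a = 1, a > 1
wmean = zeros(size(avals));
for k = 1:numel(avals)
    a   = avals(k);
    win = linspace(1/a, a, numel(tmp));   % ramp from 1/a (smallest value) to a (largest value)
    win = win/sum(win);                   % normalize so the weights sum to 1
    wmean(k) = sum(win.*tmp);             % weighted mean of the window
end
disp([avals; wmean])                      % first row: a, second row: weighted mean
```

With a = 1 all weights are equal, so the result is the plain mean of the window. With a < 1 the smallest values get the largest weights and the result drops below the plain mean; with a > 1 the opposite happens.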

Full code (dummy data slightly different from your version, sorry!):

% "true" data
x2 = 0:30;
y2 = 2*x2 + 1.5*rand(1,length(x2));
dx = mean(diff(x2));

% upper bound
x1 = x2 + dx/3;
y1 = 2.6*x1 + 1.5*rand(1,length(x1));

% lower bound
x3 = x2 + dx*2/3;
y3 = 1.7*x3 - 1.5*rand(1,length(x3));

% measurement = all data (concatenated and sorted by x)
x = [x1 x2 x3];
[x,ind] = sort(x);
y = [y1 y2 y3];
y = y(ind);

%%%% main loop %%%%
n = 15;  % buffer size
a = 0.7; % a = 1 is equivalent to standard linear averaging (no weighting)
         % a<1 shifts the mean towards the lower bound, a>1 towards the upper bound
yy = myspecialavg(y, n, a);

plot(x2, y2, 'b', x, y, '*-c', x, yy, 'r', 'LineWidth', 2)
legend('"true data"', 'noisy data', 'my solution', 'Location', 'northwest')
xlabel('x')
ylabel('y')

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function out = myspecialavg(in, N, a)
% OUT = MYSPECIALAVG(IN, N, A)
%
% One-dimensional weighted sliding-window filter. Each output sample is a
% weighted average of the sorted values inside a window centered on that
% sample. Near the start and end of IN the window shrinks symmetrically to
% the samples that are available.
%
% * IN is the numerical data array (vector) to be processed.
% * N is the number of neighboring data points to average over for each point of IN.
% * A is the weighting factor: A = 1 gives a plain average, A < 1 shifts the
%   result towards the lower values, A > 1 towards the upper values.
% * OUT is the output data array, the same size as IN.
%
% For N >= 2*(nx-1), with nx = length(IN), every output value is set to
% mean(IN) (unweighted shortcut).

if isempty(in) || N <= 0   % empty input or non-positive N: stop with an error
    error('myspecialavg:badInput', 'Empty input data or non-positive N.');
end

if N == 1                  % window of one sample: nothing to average
    out = in;
    return;
end

nx = length(in);

if N >= 2*(nx-1)           % window covers all of IN: every output is mean(in)
    out = mean(in)*ones(size(in));
    return;
end

out = zeros(size(in));

% half-width m of the window
if rem(N,2) ~= 1           % N even
    m = N/2;
else                       % N odd (N >= 3)
    m = (N-1)/2;
end

for i = 1:nx
    dist2start = i-1;      % index distance from the current sample to the first one
    dist2end = nx-i;       % index distance from the current sample to the last one
    dd = min([m, dist2start, dist2end]);  % shrink the half-width near both ends
    tmp = sort(in(i-dd:i+dd));            % window values in ascending order
    win = linspace(1/a, a, numel(tmp));   % weights ramp from 1/a (smallest value) to a (largest value)
    win = win/sum(win);                   % normalize the weights to sum to 1
    out(i) = sum(win(:).*tmp(:));         % weighted mean; (:) makes it work for row or column input
end
end

Mathieu NOE on 10 Nov 2023

Same code on another set of dummy data (just for fun), reusing the myspecialavg function from the answer above.

smoothdata, movmean, or any other averaging method will give a centered line, whereas here you can shift the result towards one bound or the other by changing the factor a.

% "true" data
n = 150;
x2 = (1:n)/n;
y2 = 20*x2 + 0.05*randn(1,length(x2));

% with asymmetric noise
x = x2;
y = y2;

% larger-amplitude positive noise at random x indices
ind1 = randi([1,n], round(n/2), 1);
ind1 = unique(ind1);
y(ind1) = y(ind1) + 5*rand(1,length(ind1));

% lower-amplitude negative noise at the remaining x indices
ind2 = 1:n;
ind2(ind1) = [];
y(ind2) = y(ind2) - 1*rand(1,length(ind2));

%%%% main loop %%%%
buff = 25; % buffer size
a = 0.5;   % a = 1 is equivalent to standard linear averaging (no weighting)
           % a<1 shifts the mean towards the lower bound, a>1 towards the upper bound
yy = myspecialavg(y, buff, a);

% compare with smoothdata
ys = smoothdata(y, 'gaussian', buff);

plot(x2, y2, 'b', x, y, '*-c', x, yy, 'r', x, ys, 'k', 'LineWidth', 2)
legend('"true data"', 'noisy data', 'my solution', 'smoothdata', 'Location', 'northwest')
xlabel('x')
ylabel('y')

Mathieu NOE on 14 Nov 2023

Hello Katrina,

Sorry, but for the time being I have no other solution to suggest.

Katrina on 14 Nov 2023

That's fine - thanks!

Answer 2

Jeff Miller on 14 Nov 2023

If you have separate measures of the lower and upper directional error associated with each X value (either empirical or derived from some model), then you can probably use least-squares.
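
One possible reading of this suggestion, sketched under assumptions that are not stated in the answer: treat y_err_low and y_err_high from the question as one-sigma errors below and above each measured point, and scale each squared residual by the error on the side where the fitted line falls. The names model, resid, sigma, cost, p0, and pfit are illustrative, and fminsearch is just one convenient minimizer for this cost.

```matlab
% Sketch: straight-line fit with direction-dependent (asymmetric) errors.
% Assumption: y_err_low / y_err_high act like one-sigma errors below / above
% each measured point. This is an interpretation, not the answer's own method.
rng(0);                                   % reproducible dummy data
xdata = linspace(0, 10, 20);
ydata = 2*xdata + 1.5*rand(1, length(xdata));
y_err_low  = 0.3*xdata + 1.5*rand(1, length(xdata));
y_err_high = 0.6*xdata + 1.5*rand(1, length(xdata));

model = @(p, x) p(1)*x + p(2);            % p = [slope, intercept]
resid = @(p) model(p, xdata) - ydata;     % > 0 where the line lies above the data

% Pick the upper error where the line is above the point, the lower error otherwise.
sigma = @(r) (r > 0).*y_err_high + (r <= 0).*y_err_low;
cost  = @(p) sum((resid(p)./sigma(resid(p))).^2);

p0   = polyfit(xdata, ydata, 1);          % ordinary least squares as the starting guess
pfit = fminsearch(cost, p0);              % derivative-free minimization of the asymmetric cost

fprintf('ordinary fit:   slope = %.3f, intercept = %.3f\n', p0(1), p0(2));
fprintf('asymmetric fit: slope = %.3f, intercept = %.3f\n', pfit(1), pfit(2));
```

Because the upper errors in the dummy data are larger than the lower ones, a line passing above a point is penalized less than one passing the same distance below it. If the errors are exactly zero at some point, this cost is undefined there and would need a floor on sigma.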
