Data analysis
In this worksheet we provide some means to analyze experimental data. It represents an exercise in making use of drawing tools (to plot error bars), in reading data from an ASCII file (e.g., from a computer-interfaced experiment, a spreadsheet/graphing program, etc.), and finally in making use of the statistical analysis package. A hand-written least-squares fitting procedure is attached at the end.
First we make use of the data interface. The file draw1 has to exist in the current directory. We assume four columns of data, containing x_i, y_i, and low and high values for y_i (uncertainties).
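For illustration only (these numbers are made up and are not the actual contents of draw1), a file with the expected layout would look as follows, one data point per line with the four values separated by whitespace:

   0.10   1.25   1.20   1.31
   0.20   1.42   1.35   1.48
   ...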
We follow the PC convention for the directory structure. Note that instead of the forward slash (borrowed from Unix), one could use the double backslash.
> readlib(readdata):
> #data:=readdata(`C:\\ComPhys\\DataAnalysis\\draw1`,float,4): # this works as well
> data:=readdata(`C:/ComPhys/DataAnalysis/draw1`,float,4):
After reading the data they are available as follows:
> data[1],data[2];
> dline:=map(evalf,data[1]);
Unfortunately the Maple development team is not listening to users' complaints about the meagre selection of plotting symbols.
Experimental physicists with data plus uncertainties are the last thing on their minds... So we write our own graphics primitives here (they are not very fast when it comes to plotting many points, but one can survive).
Graphics primitives to draw data points with error bars
To plot a decent-sized open symbol we define our own circle, with the centre location [x, y] passed as the first two arguments and the radius as the third.
A fourth argument can be used to set the colour.
> Circle:=proc() local col;
> if nargs<4 then col:=black else col:=args[4] fi;
> plot([args[3]*cos(t)+args[1],args[3]*sin(t)+args[2],t=0..2*Pi],
>      scaling=CONSTRAINED,colour=col,thickness=2);
> end:
> P1:=Circle(2,3,0.3):
> with(plots):
A line connecting points [x1, y1], [x2, y2], with x1, y1, x2, y2 passed as the first four arguments to the procedure. The fifth argument sets the colour.
> Line:=proc() local col;
> if nargs<5 then col:=black else col:=args[5] fi;
> plot([[args[1],args[2]],[args[3],args[4]]],scaling=CONSTRAINED,colour=col);
> end:
> P2:=Line(-1,-1,2,3,red):
> display({P1,P2},axes=boxed);
Here comes the procedure to plot a data point with an error bar. Five arguments specify the point, and the sixth sets the colour.
The first two arguments provide the location of the point, the next two the low-y and high-y values, and the fifth argument specifies the size of the symbol; the caps of the error bar extend twice that distance to either side.
> Expoint:=proc() local size,col; size:=2*args[5];
> if nargs<6 then col:=black else col:=args[6] fi;
> Circle(args[1],args[2],0.5*args[5],col),Circle(args[1],args[2],args[5],col),
> Line(args[1],args[3],args[1],args[4],col),
> Line(args[1]-size,args[3],args[1]+size,args[3],col),
> Line(args[1]-size,args[4],args[1]+size,args[4],col);
> end:
> Pex:=Expoint(1.5,2.2,1.9,2.4,.1,magenta):
> display({Pex},axes=boxed,view=[-1..2,0..4]);
If desired, one can fill in the circle by drawing more than one concentric circle, as done here, but it will slow the plotting.
We can now plot a circle that will appear filled when reduced in size, and an error bar that indicates the uncertainty of the measurement. It is straightforward to write routines for other symbols (squares and diamonds at least). We generate a colour table and have the luxury of distinguishing subsequent data points by colour:
> coltable:=[violet,magenta,blue,cyan,green,yellow,red,maroon];
> imax:=7;
Now we set up a loop to prepare the data point structure:
> P3:=[]:
> for i from 1 to imax do:
> dline:=data[i]:
> P3:=[Expoint(dline[1],dline[2],dline[3],dline[4],0.01,coltable[i+1]),op(P3)]: od:
> display(P3,axes=boxed);
With the procedures defined in the previous section we can plot the data set read in from the file draw1.
> display(P3,axes=boxed);
The next step is to obtain a linear least squares fit to the data.
Use Maple's stats package to do the linear least squares fit:
Suppose we wish to fit the data with a linear least squares method: Maple has a package for that, but it requires the data in two lists for the x and y values.
> with(stats):
> for i from 1 to imax do:
> xval[i]:=data[i][1]: yval[i]:=data[i][2]: od:
We generate the solution after some trial and error: Maple requires a list of lists in the fit procedure!
> lxval:=convert(xval,list); lyval:=convert(yval,list);
> sol:=fit[leastsquare[[x,y]]]([lxval,lyval]);
> assign(sol);
> y;
> Pfit:=plot(y,x=xval[1]-0.1..xval[imax]+0.1,color=brown):
> display({Pfit,op(P3)},axes=boxed);
A good question: given the uncertainty in the data, what is the uncertainty in the slope and the y intercept? How well are the data fitted by the straight line?
> describe[linearcorrelation](lxval,lyval): evalf(%);
The linear correlation is strong (the number is close to 1; 0 would mean no linear relationship; -1 would indicate a perfect anticorrelation).
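For reference, the quantity computed by describe[linearcorrelation] is the standard (Pearson) linear correlation coefficient,

$$ r = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2 \, \sum_i (y_i-\bar{y})^2}} , $$

where \bar{x} and \bar{y} are the means of the x and y data.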
This illustrates nicely that the question of fitting a straight line to measured data points is non-trivial. Note that the uncertainties have not been used in the calculation. One can weight the data points if one trusts some of them more than others. For example: the points on the right apparently follow a straight line rather well. The fit pays attention to this fact, but it does not 'know' that the two rightmost points have the biggest error bars.
What assumptions enter the standard linear least squares fit? (J.R. Taylor, Introduction to Error Analysis; chapter 8.2)
1) the x values are assumed to be very precise (negligible uncertainties);
2) the y values are assumed to have equal uncertainties, described by a Gaussian with a certain width common to all y_i (if not, weighting should be applied);
3) if the data were exact, they would satisfy y_i = A + B x_i, where A, B are physical constants. If the uncertainties for each measurement y_i are governed by a Gaussian of width sigma, the probability of obtaining the entire data set, for given values of the constants A and B, is the product of the probabilities for the individual points. The product of exponentials can be expressed as a single exponential with argument -chi^2/2, as written out below.
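Written out (a compact restatement of the argument above, following Taylor), the probability of the data set for given A, B is, up to constant factors,

$$ P_{A,B}(y_1,\dots,y_N) \;\propto\; \prod_{i=1}^{N} \frac{1}{\sigma}\, e^{-(y_i - A - B x_i)^2/(2\sigma^2)} \;=\; \sigma^{-N}\, e^{-\chi^2/2}, \qquad \chi^2 = \sum_{i=1}^{N} \frac{(y_i - A - B x_i)^2}{\sigma^2} . $$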
Derivation of expressions for the LLSQ fit with uncertainties:
> y:='y': x:='x': i:='i': chi:='chi': A:='A': B:='B':
The expression for the total error is defined according to the data model (linear relationship, equal Gaussian errors in y_i) described above as:
> E1:=chi^2=Sum((y[i]-A-B*x[i])^2/sigma^2,i=1..N);
The best values for A and B are those for which the probability P_{A,B}(y_1,...,y_N) is a maximum, or chi-squared a minimum. They are found from the necessary condition:
> E2a:=diff(rhs(E1),A)=0;
> E2b:=diff(rhs(E1),B)=0;
> sol:=solve({E2a,E2b},{A,B}):
> assign(sol);
> simplify(A);
> simplify(B);
Everything proceeds as in the textbook, except that the sum of 1 from i=1 to N is not simplified to N. This is because the unevaluated Sum was used.
> value(A);
> value(B);
> A1:=subs(N=imax,value(A)):
> i:='i': for j from 1 to imax do:
> A1:=subs(x[j]=xval[j],y[j]=yval[j],A1); od:
> simplify(A1);
> B1:=subs(N=imax,value(B)):
> i:='i': for j from 1 to imax do:
> B1:=subs(x[j]=xval[j],y[j]=yval[j],B1); od:
> simplify(B1);
Thus, we know how the simple linear least squares result was obtained. Now we are interested in understanding the estimate of uncertainties. The best estimate for sigma is obtained in analogy with the best estimate for A and B: We find the maximum probability to obtain the measured values {y_i} from the necessary condition.
First we indicate schematically how the probability depends on sigma: psi^2 is the same as chi^2 except that the factor 1/sigma^2 has been taken out of the sum:
> Prob:=exp(-psi^2/(2*sigma^2))/sigma^N;
> E3:=diff(Prob,sigma)=0;
> solsig:=solve(E3,sigma);
> E4:=sigma^2=solsig[1]^2;
Thus, the sum of squares (psi^2) divided by the number of data points N provides an estimate for the uncertainty of the data. In fact, since A and B are not known exactly, but are themselves deduced from the data, we need to correct for this by dividing by (N-2) instead.
For our example we have:
> i:='i':
> sigsq:=evalf(Sum((yval[i]-A1-B1*xval[i])^2,i=1..imax)/(imax-2));
This number represents a mean square deviation between the data points and the line fit. That this is a realistic estimate follows from taking a typical deviation from the graph and squaring it (e.g., 0.02 for the green point #4). We are, however, more interested in the uncertainty in the extracted coefficients A and B.
> evalf(yval[4]-A1-B1*xval[4])^2;
Now that we have an error estimate for the data {y_i} we can use standard error propagation to estimate the errors in A and B from their respective expressions. Straightforward calculus results in the following expressions:
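As a sketch of the propagation step (a standard result, cf. Taylor ch. 8): since A and B are linear combinations of the y_i, their variances follow from sigma_X^2 = Sum_j (dX/dy_j)^2 sigma^2, which gives

$$ \sigma_B^2 = \frac{N\,\sigma^2}{\Delta}, \qquad \sigma_A^2 = \frac{\sigma^2 \sum_i x_i^2}{\Delta}, \qquad \Delta = N\sum_i x_i^2 - \Big(\sum_i x_i\Big)^2 , $$

with sigma^2 estimated by sigsq above. These are the expressions evaluated numerically as sigBsq and sigAsq: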
> i:='i':
> sigBsq:=evalf(sigsq*imax/(imax*Sum(xval[i]^2,i=1..imax)-Sum(xval[i],i=1..imax)^2));
> i:='i':
> sigAsq:=evalf(sigsq*Sum(xval[i]^2,i=1..imax)/(imax*Sum(xval[i]^2,i=1..imax)-Sum(xval[i],i=1..imax)^2));
These expressions are not easily derived from A and B within Maple: differentiation with respect to an indexed variable works for an explicitly evaluated sum, but not for a sum with a symbolic upper limit. The following example shows the difficulty:
> diff(sum(z[i]*eta[i],i=1..6),z[2]);
> diff(sum(z[i]*eta[i],i=1..N),z[2]);
Hand-written LLSQ fit procedure with uncertainties:
We define procedures which implement the derived formulae for the LLSQ fit:
> addme:=arg->evalf(Sum(arg[i],i=1..nops(arg)));
> addme([1,2,3,4]);
> addpr:=(arg1,arg2)->evalf(Sum(arg1[i]*arg2[i],i=1..nops(arg1)));
> addpr([1,2,3,4],[1,2,3,4]);
> evalf(sum(i^2,i=1..4));
We are ready to program the standard linear least squares algorithm:
> i:='i':
> LLSQ:=proc(xv,yv) local _i,N,a,b,c,d,Delta,A,B,sq,sigsq,dA,dB;
> N:=nops(xv);
> if nops(yv)<N then RETURN(`Mismatch in sizes of xv and yv`); fi;
> c:=addpr(xv,xv);
> d:=addme(xv);
> a:=addpr(xv,yv);
> b:=addme(yv);
> Delta:=evalf(N*c-d^2);
> A:=(c*b-d*a)/Delta;
> B:=(N*a-d*b)/Delta;
> sq:=[]; for _i from 1 to N do:
> sq:=[op(sq),evalf(yv[_i]-A-B*xv[_i])]: od:
> sigsq:=addpr(sq,sq)/(N-2);
> dA:=sigsq*c/Delta;
> dB:=evalf(N*sigsq)/Delta;
> print(`[A,B,sigma,dA,dB]`);
> [A,B,evalf(sqrt(sigsq),4),evalf(sqrt(dA),4),evalf(sqrt(dB),4)];
> end:
> LLSQ([1,2,3,4],[2.1,4.,5.9,8.2]);
> sol1:=LLSQ(lxval,lyval);
> yfit:=sol1[1]+sol1[2]*x;
Weighted LLSQ fit procedure:
We know that our procedure works, and we now have estimates for the uncertainties in A and B. We can modify the result to take into account individual uncertainties in the data points y_i by means of a weighted least squares fit (J.R. Taylor, problem 8.4). The individual uncertainties have to be known for this method to be meaningful. One uses the inverse of the individual sigma_i squared as the weight for data point i; the resulting formulas are written out below.
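With weights w_i = 1/sigma_i^2 the quantity to minimize is chi^2 = Sum_i w_i (y_i - A - B x_i)^2, and the same steps as before give the weighted fit parameters (these are the expressions implemented in the WLLSQ procedure below):

$$ A = \frac{\sum_i w_i x_i^2 \sum_i w_i y_i - \sum_i w_i x_i \sum_i w_i x_i y_i}{\Delta}, \qquad B = \frac{\sum_i w_i \sum_i w_i x_i y_i - \sum_i w_i x_i \sum_i w_i y_i}{\Delta}, $$

$$ \Delta = \sum_i w_i \sum_i w_i x_i^2 - \Big(\sum_i w_i x_i\Big)^2 . $$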
> i:='i': for i from 1 to imax do:
> wval[i]:=1./(0.5*(data[i][3]-data[i][4]))^2; od:
> lwval:=convert(wval,list);
> addtr:=(a1,a2,a3)->evalf(Sum(a1[i]*a2[i]*a3[i],i=1..nops(a1)));
> WLLSQ:=proc(xv,yv,wv) local Delta,N,a,b,c,d,e,A,B;
> N:=nops(xv);
> if nops(yv)<N then RETURN(`Mismatch in sizes of xv and yv`); fi;
> a:=addtr(wv,xv,xv);
> b:=addpr(wv,yv);
> c:=addpr(wv,xv);
> d:=addtr(wv,xv,yv);
> e:=addme(wv);
> Delta:=evalf(e*a-c^2);
> A:=(a*b-c*d)/Delta;
> B:=(e*d-c*b)/Delta;
> print(`[A,B]`);
> [A,B];
> end:
> i:='i':
> WLLSQ(lxval,lyval,lwval);
> ywlsq:=%[1]+%[2]*x;
> Pwfit:=plot(ywlsq,x=xval[1]-0.1..xval[imax]+0.1,color=red):
> display({Pfit,Pwfit,op(P3)},axes=boxed);
We see that the data points with larger uncertainties on the right have less influence on the slope.