Because Astronomy is an observational, not experimental, science, our conditions are not as well controlled as they are for many physics experiments. In particular, astronomical measurements often deal with large numbers of upper and lower limits. The statistical tools we use are often more obscure (to scientists, if not to statisticians) than those used by physicists (see the Statistical Consulting Center for Astronomy for examples of the types of statistical questions astronomers run up against).
Knowledge of statistics is absolutely essential for an observational astronomer. It is rare when one obtains sufficient signaltonoise that one can dispense with error estimates. Telescope time is sufficiently oversubscribed that one only has time to obtain marginal detections of sources, or one is often working at the very limits of detectability. We detect two things: signal and noise: you need to know how to separate them.
The most important thing you need to know is the question you are asking. A statistic is just a number; if you apply the wrong statistical technique, or apply an inappropriate statistical technique, or your model is wrong, then your answer is meaningless.
Remember that you cannot use statistics to prove that a model is correct; you can only use statistics to reject hypotheses. If you test the wrong model, even if you cannot reject it, then all bets are off. Increasing the number of trials improves estimates of means and variances only up to the limit of your systematic errors. Nonstatistical fluctuations do exist. Tails of distributions are not necessarily representative of the true errors. Finally, it is acceptable to thow away discrepant data. Chaunevet's criterion says that if you expect less than half an event, you can disregard any data in that bin. But, before you do, ask yourself whether your model may not be incorrect, or whether the distribution may be nonGaussian.
Important Concepts
A statistic S is used to measure the goodnessoffit of a model to some data. One minimizes S by adjusting model parameters to better fit the data. If one has N independent data points and m adjustable parameters in the model, one has =Nm degrees of freedom. In the case of data which are normallydistributed, one expects that an acceptable model will have a reduced chisquare, = /(Nm), of unity. Note that this is only a rough rule of thumb: check Table C.4 in Bevington and Robinson. For small , =1 gives a probability of about 30% of exceeding by chance, but for large the probability approaches 50%.
We often measure confidence levels in terms of , which refers to the width of the Gaussian probability distribution function. Consequently, you often hear of 3 results, or 1 upper limits. 68% of the area is at less than 1 from the mean, so a 1 deviation will occur by chance 32% of the time. 2 deviations occur with 5% probability, 3 deviations have a 0.3% probability P of occurring by chance, and 4 deviations have P=0.01%. A quoted 3 detection means that the probability is 0.997 that the detection is real, and only 0.003 that it is corresponds to a random noise fluctuation.
A 1 upper limit means that if you were to repeat the observation, 68% of the time the result would be less than that value, and 32% of the time it would be greater. This is the same as quoting a result with 68% confidence.
Note that, if you have a detector with one million detection elements, such as a 1000X1000 CCD, you expect 3000 3, and 100 4 deviations just by chance. This is why, in imaging surveys, the threshold for source detection is generally higher than 5. Often it is set to that level which will produce the expectation of at most one spurious source.
, or some other statistic S, can be used not only to find the best fit point in parameter space (where S is minimized), but also to find the region of parameter space which provides acceptable fits. You need to
You want to be able to reject unsuitable hypotheses, but not the correct hypothesis.
f(S) is the probability distribution of S. The probability of wrongful rejection is the integral of f(S) over some region dS. In general, we use . In so doing, we assume that the data are normally distributed with the variance equal to the mean, a mean deviation of 0, and the data points are independent of each other. In this case, the statistic S is distributed as . , the level of significance of the test, is defined as . In general, if > 0.1, you should not reject the fit.
If you reject a hypothesis at > 0.1, you have a 90% chance of not rejecting the correct hypothesis.
If you have more than one adjustable parameter, you are testing a composite hypothesis. To reject the composite hypothesis, you must be able to reject the best fit with all the parameters adjusted. Assume that there are p adjustable parameters. You can find the minimum value of your statistic, S_{min} by adjusting all parameeters. S_{min} will be distributed as with Np degrees of freedom. You can reject the model at some level of confidence if S_{min}>_{Np}().
The acceptable region in parameter space is that region wherein your statistic S is less than some threshhold value. Suppose you measure a quantity X from a sample x_{i} with a mean x. then, with 1 confidence,
Given 2 parameters, the region x<X<x+ and y<Y<y+ encloses the 0.68X0.68, or 46% confidence region. Joint confidence regions are larger than single regions, even if all the parameters are independent.
To determine a confidence contour, follow this procedure:
Measure a statistic S_{True} for
p (unknown) parameters. S_{True}
is distributed as with N degrees of freedom. You can minimize
S=S_{min}
to find the best fit parameters.
SS_{True}S_{min}
is distributed as _{p}. This follows because
_{Np}_{p} is approximately distributed as
_{N}.
Note that the S statistic is independent of the
S_{min} statistic.
Because S is distributed as
_{N}, one can determine the confidence of
S as
, where f is the probability density of
.
For some chosen confidence contour, you determine a limiting
area S_{L}=S_{min
}+T
Margon, Lampton, & Bowyer (1976) provide a table giving values of T as a function of parameters p for various values of significance and confidence = 1.
Confidence  p=1  p=2  p=3  p=4  

0.80  0.20 (0.25)  0.06  0.45  1.00  1.65 
0.32  0.68 (1)  1.00  2.3  3.5  4.7 
0.10  0.90 (1.6)  2.71  4.61  6.25  7.78 
0.01  0.99 (2.6)  6.63  9.21  11.3  13.3 
Note that the rule of thumb  to add 1 to

is appropriate when p=1 for a
1 level of
confidence.
Reference
Maximum likelihood is discussed briefly in Bevington & Robinson. For a given set of outcomes, the probability of that outcome is the product of the probabilities of each individual outcome, and the most likely set of outcomes is that which maximizes the product of the individual probabilities. For example, suppose you have a Gaussian parent distribution with mean and variance . We do not know the mean, but can only estimate it (') from the data. The gaussian probability distribution P_{i} =
Astronomical data are usually binned rather than continuous, that is, each event can be placed in a discrete bin. For example, a photon detected by a CCD detector falls in a particular pixel on the detector, with integral values of X and Y coordinates, or a flux is measured during a particular readout of the detector, which falls within a discrete time interval.
There are a number of ways to smooth the data:
Fourier smoothing works in the frequency domain by filtering the Fourier transform of the data. You can remove the pixeltopixel noise by removing (zeroingout) all frequencies higher than the Nyquist (sampling) frequency). Since sharp edges in the transform lead to spurious features in the data, one generally multiplies the Fourier transform of the data by a smoothly varying function that is unity at low frequence and zero at the highest frequency. The inverse transform of this product is the data with high frequency noise removed.
The advantage of Fourier smoothing is that it removes high frequency noise while not affecting the resolution of the data (unless the bandbass of the filter is too narrow). Fourier filtering can also be used to select specific frequencies, or remove specific frequencies, from a data set.
Since there won't be a lot of time available in this course for developing fancy software, feel free to use existing code to fit data. But you should understand how the code works, and be sure that the results are reasonable.
In IDL, the following routines are useful:
In linear regreession analysis, we try to determine the best linear relation between Y and X, Y(X). There are at least 5 different ways to do ordinary least squares (OLS) linear regression, and so long as the linear correlation coefficient R is less than unity, they generally give different answers. In fact, the only times these 5 methods give identical answers for Y(X) if R=1, or if R ne 1 and the variances in X and Y are identical and the slope of the relation is unity.
The methods are:
Feigelson & Babu (1992) discuss proper treatment of leastsquares regression in the presence of known errors.