Because Astronomy is an observational, not experimental, science, our conditions are not as well controlled as they are for many physics experiments. In particular, astronomical measurements often deal with large numbers of upper and lower limits. The statistical tools we use are often more obscure (to scientists, if not to statisticians) than those used by physicists (see the Statistical Consulting Center for Astronomy for examples of the types of statistical questions astronomers run up against).
Knowledge of statistics is absolutely essential for an observational astronomer. It is rare when one obtains sufficient signal-to-noise that one can dispense with error estimates. Telescope time is sufficiently oversubscribed that one only has time to obtain marginal detections of sources, or one is often working at the very limits of detectability. We detect two things: signal and noise: you need to know how to separate them.
The most important thing you need to know is the question you are asking. A statistic is just a number; if you apply the wrong statistical technique, or apply an inappropriate statistical technique, or your model is wrong, then your answer is meaningless.
Remember that you cannot use statistics to prove that a model is correct; you can only use statistics to reject hypotheses. If you test the wrong model, even if you cannot reject it, then all bets are off. Increasing the number of trials improves estimates of means and variances only up to the limit of your systematic errors. Nonstatistical fluctuations do exist. Tails of distributions are not necessarily representative of the true errors. Finally, it is acceptable to thow away discrepant data. Chaunevet's criterion says that if you expect less than half an event, you can disregard any data in that bin. But, before you do, ask yourself whether your model may not be incorrect, or whether the distribution may be non-Gaussian.
Important Concepts
A statistic S is used to measure the goodness-of-fit of
a model to some data. One minimizes S by adjusting model
parameters to better fit the data. If one has N independent
data points and m adjustable parameters in the model, one
has
=N-m degrees of freedom.
In the case of data which are
normally-distributed, one expects that an acceptable model will have a reduced
chi-square,
=
/(N-m), of unity. Note that this is only a
rough rule of thumb: check Table C.4 in Bevington and Robinson. For
small
,
=1
gives a probability of about 30% of exceeding
by
chance, but for
large
the probability approaches 50%.
We often measure confidence levels in terms of
, which
refers to the width
of the Gaussian probability distribution function. Consequently, you often hear
of 3
results, or
1
upper limits.
68% of the area is
at less than 1
from
the mean, so a 1
deviation will occur by chance 32% of the time.
2
deviations occur
with 5% probability, 3
deviations have a 0.3% probability P of occurring by chance, and
4
deviations have
P=0.01%. A quoted 3
detection means that the probability is 0.997 that the detection is real,
and only 0.003 that it is corresponds to a random noise fluctuation.
A 1
upper limit
means that if you were to repeat the observation, 68% of the time the
result would be less than that value, and 32% of the time it would be
greater. This is the same as quoting a result with 68% confidence.
Note that, if you have a detector with one million detection elements,
such as a 1000X1000 CCD, you expect 3000
3
, and
100 4
deviations just
by chance. This is why, in imaging surveys, the threshold for source detection
is generally higher than
5
.
Often it is set to
that level which will produce the expectation of at most one spurious source.
, or some
other statistic S, can be used not only to find the best fit
point in parameter space (where S is minimized), but also to
find the region of parameter space which provides acceptable
fits. You need to
You want to be able to reject unsuitable hypotheses, but not the correct hypothesis.
f(S) is the probability distribution of S.
The probability of wrongful rejection is the integral of
f(S) over some region dS. In general,
we use
. In
so doing, we assume that the data are normally distributed with the
variance equal to the mean, a mean deviation of 0, and the data points are
independent of each other. In this case, the statistic
S is distributed as
.
, the
level of significance of the test, is defined as
.
In general, if
> 0.1, you
should not reject the fit.
If you reject a hypothesis at
> 0.1, you
have a 90% chance of not rejecting the correct hypothesis.
If you have more than one adjustable parameter, you are testing a
composite hypothesis. To reject the composite hypothesis, you must be able
to reject the best fit with all the parameters adjusted. Assume that there
are p adjustable parameters. You can find the minimum
value of your statistic, Smin by adjusting all
parameeters. Smin will be distributed as
with
N-p degrees of freedom. You can reject the model at
some level of confidence
if
Smin>
N-p(
).
The acceptable region in parameter space is that region wherein your
statistic S is less than some threshhold value. Suppose
you measure a quantity X from a sample
xi with a mean x. then, with
1
confidence,
Given 2 parameters, the region
x-
<X<x+
and
y-
<Y<y+
encloses the
0.68X0.68, or 46% confidence region. Joint confidence regions are larger
than single regions, even if all the parameters are independent.
To determine a confidence contour, follow this procedure:
Measure a statistic STrue for
p (unknown) parameters. STrue
is distributed as
with N degrees of freedom. You can minimize
S=Smin
to find the best fit parameters.
S
STrue-Smin
is distributed as
p. This follows because
N-p-
p is approximately distributed as
N.
Note that the
S statistic is independent of the
Smin statistic.
Because
S is distributed as
N, one can determine the confidence of
S as
, where f is the probability density of
.
For some chosen confidence contour, you determine a limiting
area SL=Smin
+T
Margon, Lampton, & Bowyer (1976) provide a table giving values of
T as a function of parameters p
for various values of significance
and
confidence = 1-
.
|
|
Confidence | p=1 | p=2 | p=3 | p=4 |
|---|---|---|---|---|---|
| 0.80 | 0.20
(0.25 |
0.06 | 0.45 | 1.00 | 1.65 |
| 0.32 | 0.68
(1 |
1.00 | 2.3 | 3.5 | 4.7 |
| 0.10 | 0.90
(1.6 |
2.71 | 4.61 | 6.25 | 7.78 |
| 0.01 | 0.99
(2.6 |
6.63 | 9.21 | 11.3 | 13.3 |
Note that the rule of thumb - to add 1 to
-
is appropriate when p=1 for a
1
level of
confidence.
Reference
Maximum likelihood is discussed briefly in Bevington & Robinson. For a given
set of outcomes, the probability of that outcome is the product of the
probabilities of each individual outcome, and the most likely set of outcomes
is that which maximizes the product of the individual probabilities. For
example, suppose you have a Gaussian parent distribution with mean
and variance
. We do not know
the mean, but can only estimate it
(
')
from the data. The gaussian probability
distribution Pi =
Astronomical data are usually binned rather than continuous, that is, each event can be placed in a discrete bin. For example, a photon detected by a CCD detector falls in a particular pixel on the detector, with integral values of X and Y coordinates, or a flux is measured during a particular readout of the detector, which falls within a discrete time interval.
There are a number of ways to smooth the data:
Fourier smoothing works in the frequency domain by filtering the Fourier transform of the data. You can remove the pixel-to-pixel noise by removing (zeroing-out) all frequencies higher than the Nyquist (sampling) frequency). Since sharp edges in the transform lead to spurious features in the data, one generally multiplies the Fourier transform of the data by a smoothly varying function that is unity at low frequence and zero at the highest frequency. The inverse transform of this product is the data with high frequency noise removed.
The advantage of Fourier smoothing is that it removes high frequency noise while not affecting the resolution of the data (unless the bandbass of the filter is too narrow). Fourier filtering can also be used to select specific frequencies, or remove specific frequencies, from a data set.
Since there won't be a lot of time available in this course for developing fancy software, feel free to use existing code to fit data. But you should understand how the code works, and be sure that the results are reasonable.
In IDL, the following routines are useful:
In linear regreession analysis, we try to determine the best linear relation between Y and X, Y(X). There are at least 5 different ways to do ordinary least squares (OLS) linear regression, and so long as the linear correlation coefficient |R| is less than unity, they generally give different answers. In fact, the only times these 5 methods give identical answers for Y(X) if |R|=1, or if |R| ne 1 and the variances in X and Y are identical and the slope of the relation is unity.
The methods are:
Feigelson & Babu (1992) discuss proper treatment of least-squares regression in the presence of known errors.