## Astronomical Statistics

### Contents

General References
• Babu, G.J. & Feigelson, E.D. Astrostatistics, 1996, Chapman and Hall.
• Bevington, P.R. & Robinson, D.K. Data Reduction and Error Analysis for the Physical Sciences, 1992, McGraw-Hill.
• Feigelson, E.D. & Babu, G.J. (eds.) Statistical Challenges in Modern Astronomy, 1992, Springer-Verlag.

Because Astronomy is an observational, not experimental, science, our conditions are not as well controlled as they are for many physics experiments. In particular, astronomical measurements often deal with large numbers of upper and lower limits. The statistical tools we use are often more obscure (to scientists, if not to statisticians) than those used by physicists (see the Statistical Consulting Center for Astronomy for examples of the types of statistical questions astronomers run up against).

Knowledge of statistics is absolutely essential for an observational astronomer. It is rare that one obtains sufficient signal-to-noise to dispense with error estimates. Telescope time is sufficiently oversubscribed that one often has time to obtain only marginal detections of sources, or is working at the very limits of detectability. We detect two things, signal and noise; you need to know how to separate them.

The most important thing you need to know is the question you are asking. A statistic is just a number; if you apply the wrong or an inappropriate statistical technique, or if your model is wrong, then your answer is meaningless.

Remember that you cannot use statistics to prove that a model is correct; you can only use statistics to reject hypotheses. If you test the wrong model, even if you cannot reject it, then all bets are off. Increasing the number of trials improves estimates of means and variances only up to the limit of your systematic errors. Nonstatistical fluctuations do exist. Tails of distributions are not necessarily representative of the true errors. Finally, it is acceptable to throw away discrepant data. Chauvenet's criterion says that if you expect less than half an event, you can disregard any data in that bin. But, before you do, ask yourself whether your model may be incorrect, or whether the distribution may be non-Gaussian.
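As a concrete illustration, Chauvenet's criterion can be applied numerically: reject a point if the expected number of points deviating at least as far from the mean is less than one half. This is a minimal sketch assuming Gaussian errors and the availability of numpy and scipy; the function name and data values are invented for the example.

```python
import numpy as np
from scipy.stats import norm

def chauvenet_keep(x):
    """Mask of points surviving Chauvenet's criterion (Gaussian errors assumed)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = np.abs(x - x.mean()) / x.std()
    # expected number of points deviating at least this far, in either direction
    n_expected = n * 2.0 * norm.sf(z)
    return n_expected >= 0.5

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 15.0])
keep = chauvenet_keep(data)   # the 15.0 outlier fails the criterion
```

Before discarding the flagged point, ask the questions above: the outlier may be telling you the model, or the assumed Gaussian distribution, is wrong.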

Important Concepts

• Random errors. These arise from counting statistics
• Systematic errors. Other errors are lumped in this category. For example, observing through the wrong filter can lead to significant systematic errors.
• Roundoff and significant figures. The advantage of slide rules was that you never knew the answer to more than 3 significant figures. This may not be sufficient for physicists or engineers, but it is usually appropriate for astronomers. If you measure a result x.abcdefg +/- 0.00xy, the digits efg are completely meaningless and should not be reported.
• The difference between accuracy and precision. The HST was launched with a very precisely-figured mirror. Unfortunately, the figure was not very accurate. Alternatively, a digital watch can be read more precisely than can an analog watch without a second hand, but if you set the digital watch to the wrong time, it may be less accurate.

### I. The Basics

You will need to be acquainted with all the concepts in Bevington & Robinson (or the earlier version by Bevington). What follows is a synopsis of the most important concepts:
• Distributions. There is a parent distribution (the truth), and you sample it (x_i). The larger your sample, the closer it will be to the parent distribution. Three important distributions are the binomial, Poisson, and Gaussian distributions. These are well described by Bevington and Robinson. In the limit of a large number N of samples, both the Poisson and binomial distributions approach the Gaussian distribution. In the limit of large N, the uncertainty is given by sqrt(N).
• Most distributions have a mean (the centroid, or average value), a median (the point at which half the sample are above and half below), and a mode (the most probable value).
• An estimate of the scatter of the data is provided by the deviations d_i = x_i − μ, the average deviation (the mean of the absolute values of the d_i), the standard deviation σ (the root-mean-square value of the deviations), and the variance σ² = (1/N) Σ d_i² (N is the number of samples summed). Note that the variance of a sample, s², with N−1 in the denominator, is different than the true variance, because the mean value μ is estimated from the data.
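These quantities are straightforward to compute. A minimal sketch with numpy (the sample values are invented; note the N versus N−1 denominators):

```python
import numpy as np

x = np.array([4.1, 3.8, 4.3, 4.0, 3.9, 4.2, 4.0])   # a small sample x_i

mean = x.mean()                          # centroid of the sample
median = np.median(x)                    # half the points above, half below
avg_dev = np.mean(np.abs(x - mean))      # average deviation
var_pop = np.mean((x - mean) ** 2)       # variance with N in the denominator
var_sample = x.var(ddof=1)               # sample variance: N-1, since the mean is estimated
std = np.sqrt(var_sample)                # standard deviation
```

The sample variance (ddof=1) is always larger than the N-denominator form, reflecting the degree of freedom spent estimating the mean from the same data.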

### II. Propagation of Errors

Whenever you attempt to measure the true brightness S of an object, there is some background (sky, or instrumental background, or a combination) B. You actually observe S+B. You can usually measure, or at least estimate, what B is independently. The true brightness S of a source is given by S=(S+B)-B. Counting statistics apply to both the observed brightness (S+B) and to the observed background B. The variance is then σ² = (S+B) + B (assuming only counting errors), and the signal-to-noise ratio is given by S/sqrt(S+2B). Note that this is always less than the SNR in the absence of background, which is the square root of the total number of counts.
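A short numerical sketch of the background subtraction, assuming pure counting statistics (the count values are invented):

```python
import numpy as np

total = 1200.0   # observed counts from source plus background, (S+B)
bkg = 400.0      # independently measured background counts, B

S = total - bkg                  # S = (S+B) - B
var_S = total + bkg              # counting errors add: sigma_S^2 = (S+B) + B
snr = S / np.sqrt(var_S)         # equals S / sqrt(S + 2B)
snr_no_bkg = np.sqrt(S)          # SNR if there were no background at all
```

Here S = 800 counts and the SNR is 20, noticeably below the background-free value of sqrt(800) ≈ 28.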

### III. Goodness of Fit Criteria and Confidence Contours

A statistic S is used to measure the goodness-of-fit of a model to some data. One minimizes S by adjusting model parameters to better fit the data. If one has N independent data points and m adjustable parameters in the model, one has ν = N−m degrees of freedom. In the case of data which are normally-distributed, one expects that an acceptable model will have a reduced chi-square, χ²_ν = χ²/(N−m), of unity. Note that this is only a rough rule of thumb: check Table C.4 in Bevington and Robinson. For small ν, χ²_ν = 1 gives a probability of about 30% of exceeding χ² by chance, but for large ν the probability approaches 50%.
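The rule of thumb can be checked against the χ² distribution itself. A sketch assuming scipy is available:

```python
from scipy.stats import chi2

# P(chi^2 > nu): the chance of exceeding a reduced chi-square of unity
p_small = chi2.sf(1, df=1)      # about 0.32 for one degree of freedom
p_large = chi2.sf(100, df=100)  # approaches 0.5 for many degrees of freedom
```

This reproduces the behavior quoted in the text: about 30% for small ν, approaching 50% for large ν.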

#### Confidence Contours

References
• Lampton, M., Margon, B., & Bowyer, S. 1976, ApJ, 208, 177.

We often measure confidence levels in terms of σ, which refers to the width of the Gaussian probability distribution function. Consequently, you often hear of 3σ results, or 1σ upper limits. 68% of the area lies at less than 1σ from the mean, so a 1σ deviation will occur by chance 32% of the time. 2σ deviations occur with 5% probability, 3σ deviations have a 0.3% probability P of occurring by chance, and 4σ deviations have P = 0.01%. A quoted 3σ detection means that the probability is 0.997 that the detection is real, and only 0.003 that it corresponds to a random noise fluctuation.

A 1σ upper limit means that if you were to repeat the observation, 68% of the time the result would be less than that value, and 32% of the time it would be greater. This is the same as quoting a result with 68% confidence.

Note that, if you have a detector with one million detection elements, such as a 1000×1000 CCD, you expect 3000 3σ and 100 4σ deviations just by chance. This is why, in imaging surveys, the threshold for source detection is generally set at about 5σ or higher. Often it is set to that level which will produce the expectation of at most one spurious source.
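These expectations follow directly from the Gaussian tail probability. A sketch with scipy, using two-sided tail probabilities (the numbers in the text are rounded):

```python
from scipy.stats import norm

n_pix = 1000 * 1000                      # a 1000x1000 CCD
spurious_3sig = n_pix * 2 * norm.sf(3)   # roughly 2700 deviations beyond 3 sigma
spurious_5sig = n_pix * 2 * norm.sf(5)   # well under one beyond 5 sigma
```

At 5σ the expected number of spurious pixels drops below one, which is why survey detection thresholds sit near that level.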

χ², or some other statistic S, can be used not only to find the best-fit point in parameter space (where S is minimized), but also to find the region of parameter space which provides acceptable fits. You need to

• define some statistic S which is sensitive to significant discrepancies between the model and the data,
• evaluate S, and
• compare S with the theoretical distribution of S, given that your hypothesis (the model) is correct.

You want to be able to reject unsuitable hypotheses, but not the correct hypothesis.

f(S) is the probability distribution of S. The probability of wrongful rejection is the integral of f(S) over some region dS. In general, we use S = χ². In so doing, we assume that the data are normally distributed with the variance equal to the mean, a mean deviation of 0, and that the data points are independent of each other. In this case, the statistic S is distributed as χ². The level of significance of the test, α, is defined as the integral of f(S) from the observed value of S to infinity, i.e., the probability of exceeding S by chance. In general, if α > 0.1, you should not reject the fit.

If you reject only those fits with α < 0.1, you have a 90% chance of not rejecting the correct hypothesis.
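The significance α is just the upper-tail probability of the χ² distribution. A sketch assuming scipy, with invented example values:

```python
from scipy.stats import chi2

def significance(S, nu):
    """alpha: the probability of exceeding the measured statistic S by chance."""
    return chi2.sf(S, df=nu)

alpha_ok = significance(15.0, 10)    # mildly high chi-square: do not reject
alpha_bad = significance(30.0, 10)   # very high chi-square: reject the fit
```

A χ² of 15 for 10 degrees of freedom gives α ≈ 0.13, above the 0.1 threshold, while χ² = 30 gives α below 0.01 and would be rejected.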

The Composite Hypothesis

If you have more than one adjustable parameter, you are testing a composite hypothesis. To reject the composite hypothesis, you must be able to reject the best fit with all the parameters adjusted. Assume that there are p adjustable parameters. You can find the minimum value of your statistic, Smin, by adjusting all p parameters. Smin will be distributed as χ² with N−p degrees of freedom. You can reject the model at significance α if Smin > χ²(α) for N−p degrees of freedom.

Constructing Confidence Contours

The acceptable region in parameter space is that region wherein your statistic S is less than some threshold value. Suppose you measure a quantity X from a sample x_i with a mean x̄. Then, with 1σ (68%) confidence,

• x̄ − σ < X < x̄ + σ (if the model is linear and the x_i are distributed normally, this is the smallest such region),
• x̄ − 0.47σ < X < ∞,
• −∞ < X < x̄ + 0.47σ,
• or an infinite number of other possibilities.

Given 2 parameters, the region x̄−σ < X < x̄+σ and ȳ−σ < Y < ȳ+σ encloses the 0.68 × 0.68, or 46%, confidence region. Joint confidence regions are larger than single-parameter regions, even if all the parameters are independent.

To determine a confidence contour, follow this procedure. Measure a statistic S_true for p (unknown) parameters; S_true is distributed as χ² with N degrees of freedom. You can minimize S = S_min to find the best-fit parameters.
ΔS ≡ S_true − S_min is distributed as χ² with p degrees of freedom. This follows because χ² with N−p degrees of freedom plus χ² with p degrees of freedom is approximately distributed as χ² with N degrees of freedom.
Note that the ΔS statistic is independent of the S_min statistic.

Because ΔS is distributed as χ² with p degrees of freedom, one can determine the confidence of ΔS as the integral of f(χ²) from ΔS to infinity, where f is the probability density of χ² with p degrees of freedom.
For some chosen confidence, you determine a limiting contour S_L = S_min + T.

Lampton, Margon, & Bowyer (1976) provide a table giving values of T as a function of the number of parameters p for various values of the significance α and confidence = 1 − α.

| α    | Confidence    | p=1  | p=2  | p=3  | p=4  |
|------|---------------|------|------|------|------|
| 0.80 | 0.20 (0.25σ)  | 0.06 | 0.45 | 1.00 | 1.65 |
| 0.32 | 0.68 (1σ)     | 1.00 | 2.3  | 3.5  | 4.7  |
| 0.10 | 0.90 (1.6σ)   | 2.71 | 4.61 | 6.25 | 7.78 |
| 0.01 | 0.99 (2.6σ)   | 6.63 | 9.21 | 11.3 | 13.3 |

Note that the rule of thumb, to add 1 to χ²_min, is appropriate when p = 1 for a 1σ level of confidence.
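The tabulated values of T can be reproduced directly from the χ² distribution, since T is the value of χ² with p degrees of freedom at the chosen confidence. A sketch assuming scipy:

```python
from scipy.stats import chi2

def delta_S(confidence, p):
    """Threshold T: contours at S_L = S_min + T bound the confidence region."""
    return chi2.ppf(confidence, df=p)

T_1sigma_1param = delta_S(0.683, 1)   # about 1.0, the "add 1" rule of thumb
T_90pct_2param = delta_S(0.90, 2)     # about 4.61
```

Each table entry follows the same pattern: for example, 99% confidence with p = 4 gives T ≈ 13.3.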

### IV. Maximum Likelihood and the C statistic

Reference

• Cash, W. 1979, ApJ, 228, 939.
From the point of view of an observational astronomer, the main limitations of χ² and Gaussian statistics are that
• the data must be binned, and
• there must be at least 9 or 10 events per bin in order for χ² to be statistically valid.
One can get around these problems by using Poisson statistics and a maximum likelihood analysis.

Maximum likelihood is discussed briefly in Bevington & Robinson. For a given set of outcomes, the probability of that set is the product of the probabilities of each individual outcome, and the most likely set of model parameters is that which maximizes the product of the individual probabilities. For example, suppose you have a Gaussian parent distribution with mean μ and variance σ². We do not know the mean, but can only estimate it (μ′) from the data. The Gaussian probability distribution is P_i = (1/(σ sqrt(2π))) exp(−(x_i − μ′)²/2σ²).
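For the Gaussian case, maximizing the product of the P_i (equivalently, minimizing the negative log-likelihood) recovers the sample mean as the estimate μ′. A sketch assuming numpy and scipy, with invented simulated data:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
sigma = 1.0
x = rng.normal(loc=5.0, scale=sigma, size=200)   # draws from a Gaussian parent

def neg_log_likelihood(mu):
    # minus the log of the product of the individual Gaussian probabilities P_i
    return np.sum((x - mu) ** 2 / (2 * sigma ** 2)
                  + np.log(sigma * np.sqrt(2 * np.pi)))

fit = minimize_scalar(neg_log_likelihood, bounds=(0.0, 10.0), method="bounded")
mu_hat = fit.x   # for a Gaussian, the MLE of the mean equals the sample mean
```

The same machinery applies with a Poisson probability in place of the Gaussian, which is the basis of the C statistic for unbinned or sparse data.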

### V. Smoothing Data

Real data are often noisy. Real data are often oversampled. In either case, it is sometimes advisable to smooth the data before analysis. At the very least, smoothing the data gives you a better idea of what is really in the data.

Astronomical data are usually binned rather than continuous, that is, each event can be placed in a discrete bin. For example, a photon detected by a CCD detector falls in a particular pixel on the detector, with integral values of X and Y coordinates, or a flux is measured during a particular readout of the detector, which falls within a discrete time interval.

There are a number of ways to smooth the data:

• Coaddition. This is particularly useful when the data are oversampled. Merely add together adjacent bins. Add the errors in quadrature. This will decrease the number of pixels, increase the number of counts in each pixel, and improve the S/N in each pixel. But beware: if you coadd too many pixels you could lose resolution.
• Running Mean, or Boxcar Smoothing. Replace each pixel by the mean of N pixels surrounding it. This does not decrease the number of pixels, but increases the S/N in each pixel by roughly a factor of the square root of N. Boxcar smoothing reduces the resolution.
• Convolution. The running mean (above) is the convolution of the data with a step function of width N pixels. It is possible, and often advantageous, to smooth by convolving the data with other functions. Convolving with a triangle or a Gaussian places more weight on the central pixel, and preserves the resolution better, than does smoothing with a boxcar (but the S/N is lower for the same width). Convolution with fancier functions can be used for edge detection, or for detection of features of specified widths.
• Fourier Smoothing. Real data are often affected by 1/f noise, or high-frequency (pixel-to-pixel) noise. Low-frequency (slowly-varying) noise does not greatly affect small features in an image (stars, or narrow spectral lines). By working in the frequency domain, you can preferentially filter out particular frequencies.
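As an illustration of the simplest of these, a boxcar (running mean) smooth is a one-line convolution with numpy; the test signal below is invented:

```python
import numpy as np

def boxcar(y, width):
    """Running mean: convolve with a unit-area step of the given odd width."""
    kernel = np.ones(width) / width
    return np.convolve(y, kernel, mode="same")

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 4 * np.pi, 200))        # the underlying feature
noisy = signal + rng.normal(scale=0.5, size=200)       # with added noise
smooth = boxcar(noisy, 9)                              # same length, lower noise
```

The smoothed series keeps the same number of pixels but has roughly one third the noise of the original (sqrt(9) improvement), at the cost of some resolution.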

Fourier smoothing works in the frequency domain by filtering the Fourier transform of the data. You can remove the pixel-to-pixel noise by removing (zeroing out) the frequencies at and near the Nyquist (sampling) frequency. Since sharp edges in the transform lead to spurious features in the data, one generally multiplies the Fourier transform of the data by a smoothly varying function that is unity at low frequencies and zero at the highest frequency. The inverse transform of this product is the data with the high-frequency noise removed.

The advantage of Fourier smoothing is that it removes high-frequency noise while not affecting the resolution of the data (unless the bandpass of the filter is too narrow). Fourier filtering can also be used to select specific frequencies, or remove specific frequencies, from a data set.
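A minimal sketch of Fourier low-pass filtering with numpy, using a smooth raised-cosine roll-off rather than a sharp cutoff; the cutoff and taper parameters are invented for the example:

```python
import numpy as np

def lowpass(y, cutoff=0.2, taper=0.1):
    """Filter out high frequencies with a smooth roll-off (no sharp edge)."""
    Y = np.fft.rfft(y)
    f = np.arange(Y.size) / Y.size           # 0 at DC, approaching 1 at Nyquist
    t = np.clip(1.0 - (f - cutoff) / taper, 0.0, 1.0)
    filt = 0.5 * (1.0 - np.cos(np.pi * t))   # 1 below cutoff, 0 above cutoff+taper
    return np.fft.irfft(Y * filt, n=y.size)

n = 256
t_arr = np.arange(n)
slow = np.sin(2 * np.pi * 4 * t_arr / n)     # low-frequency feature (kept)
fast = np.sin(2 * np.pi * 80 * t_arr / n)    # pixel-to-pixel noise (removed)
cleaned = lowpass(slow + fast)
```

The low-frequency feature passes through essentially unchanged while the high-frequency component is removed; the smooth taper avoids the ringing that a hard cutoff would introduce.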

### VI. How to Fit Data

In general, we use the technique of least-squares or chi-square minimization. This issue is covered in some detail by Bevington and Robinson (chapters 6-8).

Since there won't be a lot of time available in this course for developing fancy software, feel free to use existing code to fit data. But you should understand how the code works, and be sure that the results are reasonable.

In IDL, the following routines are useful:

• LSQRE does a least-squares for Y given X (the usual case). Type LSQRE,/HELP for brief on-line help. The procedure was written by F. Walter.
• REGRESS does a least-squares multiple linear regression. See the on-line IDL help.
• GAUSSFIT fits a gaussian line and a quadratic background. See the on-line IDL help.
• POLY_FIT fits a polynomial to Y(X) and returns the coefficients. See the on-line IDL help.
• CURVEFIT does a non-linear least-squares fit to a user-supplied function with arbitrary parameters. See the on-line IDL help.

### VII. Issues in Regression Analysis

References
• Isobe, Feigelson, Akritas, & Babu. 1990, ApJ, 364, 104.
• Feigelson, E.D. & Babu, G.J. 1992, ApJ, 397, 55.

In linear regression analysis, we try to determine the best linear relation between Y and X, Y(X). There are at least 5 different ways to do ordinary least squares (OLS) linear regression, and so long as the linear correlation coefficient |R| is less than unity, they generally give different answers. In fact, the only times these 5 methods give identical answers for Y(X) are when |R| = 1, or when |R| ≠ 1 but the variances in X and Y are identical and the slope of the relation is unity.

The methods are:

• OLS(Y|X). This is appropriate if all the errors are in Y. This minimizes the scatter in Y.
• OLS(X|Y). This is appropriate if all the errors are in X. This minimizes the scatter in X.
• The geometric mean, or bisector, of OLS(Y|X) and OLS(X|Y).
• Orthogonal regression, minimizing the perpendicular distances of the points from the fit.
• Reduced Major Axis (RMA), wherein the X and Y deviations are independently minimized.
Reminder: in general, at least in astronomy, there are no perfect measurements. OLS(Y|X) is generally the best choice if
• the true relation is linear,
• X is measured without error,
• errors in Y have zero mean, finite variance, and are independent of all other points, and
• errors in Y do not depend on X (this is often violated).
Note that, even if one or more of these assumptions are violated, OLS(Y|X) can still be used to predict Y(X). Isobe et al. ran a series of Monte Carlo simulations on data consisting of N points, with uncertainties in both X and Y, and found the following:
• In the absence of any additional knowledge, any relation can be used to predict Y(X).
• Try fitting the data using all 5 techniques; if they agree within the errors there is no need to worry about what to use.
• Orthogonal regression is considerably less accurate than the other methods. This works well only for log-log or other scale-free data.
• OLS(X|Y) is unstable if |R| is small or if N is small.
• All methods underestimate the uncertainties in the slopes when N is small. For R = 0.5 and N = 100, the uncertainty on the slope is underestimated by 10%; for N = 20 the uncertainty is underestimated by 40%.
• RMA should not be used. The slope is independent of R, and the technique cannot give any insights into the relation between X and Y.
• If X is clearly causative, use OLS(Y|X).
• To predict Y from X, use OLS(Y|X).
• To determine the functional form of Y(X), use the OLS bisector.
• Always quote the error in your derived slope!
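The disagreement among the OLS methods is easy to demonstrate on simulated data with errors in both X and Y. A sketch assuming numpy; the simulated slopes and scatters are invented, and the bisector slope uses the formula given by Isobe et al. (1990):

```python
import numpy as np

rng = np.random.default_rng(3)
x_true = rng.uniform(0.0, 10.0, 100)
x = x_true + rng.normal(scale=0.5, size=100)              # errors in X
y = 2.0 * x_true + 1.0 + rng.normal(scale=0.5, size=100)  # errors in Y

b_yx = np.polyfit(x, y, 1)[0]          # OLS(Y|X): minimizes scatter in Y
b_xy = 1.0 / np.polyfit(y, x, 1)[0]    # OLS(X|Y), re-expressed as a slope of Y(X)
# OLS bisector (Isobe et al. 1990) of the two regression lines
b_bis = (b_yx * b_xy - 1.0
         + np.sqrt((1.0 + b_yx ** 2) * (1.0 + b_xy ** 2))) / (b_yx + b_xy)
```

OLS(Y|X) is biased shallow by the X errors, OLS(X|Y) is biased steep, and the bisector falls between the two; their ratio is set by R², so the methods only agree as |R| approaches 1.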

Feigelson & Babu (1992) discuss proper treatment of least-squares regression in the presence of known errors.

#### Truncation

Data are truncated if there is an artificial upper or lower limit imposed. An example is a magnitude-limited survey. Results derived from a truncated sample can be biased.

#### Malmquist Bias

Malmquist bias (ref) occurs when the truncation is a function of the independent variable.

#### Censored data

Data that includes either upper or lower limits is called censored data. It is almost never appropriate to ignore limits and only fit the detected points - the only exception would be in the case of meaningless upper limits that lie far above all the other data points. Dealing with censored data is the subject of the next section.

### VIII. Censored Data and Survival Analysis

References
• Schmitt, J.H.M.M. 1985, ApJ, 293, 178.
• Feigelson, E.D. & Nelson, P.I. 1985, ApJ, 293, 192.