Calc/To-Dos/Descriptive Statistics
Descriptive Statistics
Intro
One of the most neglected steps in statistics is an adequate presentation of the group characteristics of the data of interest. Yet half of the statistical work is already done by simply providing adequate descriptions. In the following sections I will therefore emphasize the correct techniques, especially in view of newer methods and possibilities.
Statistical reasoning usually starts by examining the data (see Exploratory Data Analysis).
- graphical methods: the first step involves some graphical methods (see Graphical Presentations)
- For a better understanding, however, I will start this section with a description of the various group indexes.
- summarizing the data: the second step involves summarizing the whole group of data using only a few numbers
- The sole purpose of Descriptive Statistics is to summarise the data in a very simple fashion, using only 2 or 3 values. This is NOT always possible in an optimal way. Also, depending on the purpose of our analysis, we may choose different indexes, which are described briefly next.
Indexes of Central Tendency:
- NOTE: always provide both an Index of Central Tendency and an Index of Spread
Median
- simple
- often an actual value of the data set
- robust: not influenced by outliers
Mean
- before the availability of computers, it was easier to calculate than the median
- used in parametric tests (see Statistical inference later)
- however, modern methods do NOT depend on the mean (see bootstrap procedure, non-parametric tests)
- the software can calculate the mean internally (hidden to the viewer) and use it in a particular test, even if the mean was NOT already calculated/used for descriptive purposes
- heavily influenced by outliers
Trimmed Mean
- x%-trimmed mean:
- rank the data
- trim x% from the lower ranks
- trim x% from the upper ranks
- calculate the mean from the remaining (100 - 2*x)% of the values
- more robust than the mean: much less influenced by outliers
- Median: is just the limiting case, the 50% trimmed mean
Mode
- rarely used (several disadvantages, but may be useful for nominal data)
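The indexes listed above can be compared on a small example. The following is only a minimal sketch in base R; the vectors `ages` and `blood` are made-up illustration data (they are not from this page), and the last value of `ages` is deliberately an implausible outlier.

```r
# Hypothetical sample: nine plausible ages plus one data-entry error (250).
ages <- c(21, 23, 24, 25, 26, 27, 28, 30, 31, 250)

central <- c(
  mean      = mean(ages),               # pulled upwards by the outlier
  trimmed10 = mean(ages, trim = 0.10),  # drops the lowest and highest 10% first
  trimmed25 = mean(ages, trim = 0.25),  # drops the lowest and highest 25% first
  median    = median(ages)              # middle of the ranked data
)
print(central)

# The median is the limiting case of the trimmed mean (trim = 0.5).
print(mean(ages, trim = 0.5) == median(ages))

# Mode for nominal data: the most frequent category (hypothetical blood groups).
blood <- c("A", "O", "O", "B", "A", "O", "AB")
mode_value <- names(which.max(table(blood)))
print(mode_value)
```

Only the mean moves far away from the bulk of the data; the trimmed means and the median stay close to each other.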
Discussion
Most of these are already available in Calc. However, more emphasis should be put on the median; therefore some of the following comments could be included in the help file, too.
- median: is favoured over the mean in many applications (Feinstein AR. Principles of Medical Statistics. Chapman & Hall 2002);
- the help file should state clearly that the median has several advantages over the mean as an index of central tendency and therefore should be used preferentially
- median: provide both the median AND the 50% inter-percentile range (IPR50%) or the IPR95% or the range (see Indexes of Spread/ Inner Location)
- mean: in many applications, the median should be used instead (also rename AVERAGE() to MEAN(), which is more intuitive);
- the mean is usually reported as mean ± SD (standard deviation), however this value has several disadvantages... (see Indexes of Spread)
- for Coefficient of Stability (= SE/Mean) and Coefficient of Variation (=SD/Mean) see Indexes of Spread
- [trimmed mean: e.g. the 25% trimmed mean ignores the lower 25% and the upper 25% of the data; it is more robust than the mean against outliers; however, if there is concern about outliers, the median should be used instead for descriptive purposes. The trimmed mean has a value in bootstrap resampling because the median performs poorly unless the sample size is quite large.]
- mode: rarely used (several disadvantages, but may be useful for nominal data)
- modern methods: bootstrap resampling and jackknife procedures (see later)
- the median (+ an Index of Spread) is still used for descriptive purposes
- however, these newer methods are especially useful to determine the stability of these Indexes
Median vs Mean
This is a very important question, and most likely to be misunderstood. The role of Descriptive Statistics is to provide the best description of our data using only 2 or 3 values: an Index of Central Tendency and an Index of Spread.
- The median is always as good as the mean, and often it is better.
- if the median performs poorly (because there is no good Index of Central Tendency), the mean would perform equally poorly
- NOTE: descriptive statistics is unrelated to inferential statistics; therefore, reasons like minimizing the variance (and hence the standard deviation, SD, or the SE) are not appropriate.
Why is the mean so entrenched in current statistics?
- easy calculation: before the computer age, it was very difficult to calculate the median (although it is calculated from only one or two values from the data set, you need to know from which ones.) In the computer age, this is NOT a reason anymore.
- parametric tests: the second reason was that for classical statistical tests you need the mean, but again, computers can calculate it automatically in the background and use it for the specific statistic (without the user having to do so explicitly). With newer statistical methods, the mean can often be abandoned altogether.
- Additional reasons:
- The mean is more mathematically tractable and can be proven to minimize the sum of squared deviations (and hence the variance and the standard deviation, SD). The sample SD is an unbiased estimator of the population standard deviation. [-- a User, slightly modified]
- Response:
- median:
- is often a value from the data set, therefore this is even more tractable
- minimizes the absolute deviation!!! (therefore it bears similarity to the mean)
- absolute deviation is easier to interpret (though more difficult to calculate without a computer):
- the value is straightforward (same unit as the initial data, as opposed to the variance, whose unit is the square of the data's unit)
- the sample SD is an efficient estimator for the SD of the population ONLY for a perfect Gaussian distribution (strictly speaking, it is the sample variance with the N-1 divisor that is unbiased; the sample SD is slightly biased even then);
- even slight outliers / skewness will render the sample SD a poor estimator for the population parameter;
- the average absolute deviation from the mean (MD) is a better estimator under these non-ideal conditions; as pointed out previously, the median minimizes the average absolute deviation (from the median); a very robust parameter is the median absolute deviation from the median (MAD).
- median:
- Squared error (i.e. the SD) is a suitable measure for many applications in engineering and the physical sciences, but less so for many of the distributions found in medicine and biology. [... though the SD is in many situations actually not better, even in physics; the mean deviation is better in many real-life examples and at the same time simpler; -- Discoleo 21:00, 13 March 2007 (CET) ]
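The two minimization properties mentioned in this exchange can be checked numerically. This is a minimal sketch in base R on a made-up, right-skewed sample `x`; the helper names `sum_abs` and `sum_sq` are introduced here only for illustration.

```r
# Hypothetical right-skewed sample (odd n, so the median is a unique data value).
x <- c(2, 3, 5, 8, 13, 21, 34)

# Cost of summarising all data by a single central value.
sum_abs <- function(centre) sum(abs(x - centre))   # total absolute deviation
sum_sq  <- function(centre) sum((x - centre)^2)    # total squared deviation

# Search numerically for the value that minimizes each cost.
best_abs <- optimize(sum_abs, interval = range(x))$minimum
best_sq  <- optimize(sum_sq,  interval = range(x))$minimum

cat("minimizer of absolute deviations:", best_abs, " median:", median(x), "\n")
cat("minimizer of squared deviations: ", best_sq,  " mean:  ", mean(x), "\n")
```

Up to the numerical tolerance of `optimize()`, the first minimizer coincides with the median and the second with the mean.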
Unfortunately, the previous concerns miss the point. The mean / SD pair does NOT beat the median (+ a spread parameter) for descriptive purposes. For descriptive purposes we do NOT need to minimize some arbitrary parameters. Instead, we wish to represent our data sample as accurately as possible (the actual data, NOT a hypothetical population). This point becomes even more moot when performing Bayesian inference or employing non-parametric methods.
Indexes of Inner Location
- quartiles (25%, 50% = median, 75%)
- percentiles
- rank
- percentrank (≈ cumulative relative frequency)
- standardized z-score
- standardized increment (index of contrast for 2 groups of data)
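Most of these indexes are one-liners in base R. The following is a minimal sketch on a made-up vector `x`; note that `quantile()` offers several interpolation rules (its `type` argument), so other software may return slightly different percentiles, and that `ecdf()` is only one common way to define a percent rank.

```r
# Hypothetical sample of 10 measurements.
x <- c(12, 15, 11, 19, 14, 22, 17, 13, 16, 18)

quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))  # 50% = median
limits95  <- quantile(x, probs = c(0.025, 0.975))      # borders of the inner 95% zone
hinges    <- fivenum(x)[c(2, 4)]                       # Tukey's lower and upper hinges

ranks    <- rank(x)                   # position of each value in the ranked data
pct_rank <- ecdf(x)(x)                # proportion of values <= each value
z        <- (x - mean(x)) / sd(x)     # standardized z-score

print(quartiles)
print(limits95)
print(hinges)
print(data.frame(x, ranks, pct_rank, z = round(z, 2)))
```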
Indexes of Spread
Measure the spread of the data. These parameters are never used alone but usually combined with an index of central tendency. Therefore, we will have parameters that are more suitable to be used with the mean and others more specific to the median. For a more extended discussion see also the Engineering Statistics Handbook (http://www.itl.nist.gov/div898/handbook/eda/section3/eda356.htm).
- used with the median:
- range: use either (max - min) or provide both min and max values
- inner percentile zones: 95% (IPR95), 50% (IPR50 - the interquartile range)
- H zone: zone between the 2 hinges, H2 - H1, see mathworld-H-1 and mathworld-H-2
- H1: lower hinge = the median of the lower half of the ranked data (up to and including the median)
- H2: upper hinge = the median of the upper half of the ranked data (including the median)
- Quartile Variation Coefficient: = 100 * (Q3-Q1)/(Q3+Q1), where Q1 and Q3 are the first and the 3rd quartiles; see also mathworld
- used with the mean:
- mean deviation: MD = sum(|Xi - mean|) / N, rarely used, historically because of difficulty in calculation (absolute values are used), but see also MD vs SD
- although it can be demonstrated that, IF the data have a perfect Gaussian distribution, the sample SD is a better population estimator than the mean deviation, the SD fails miserably even for slight skewness
- variance: PLEASE EXPAND
- for a sample: var = sum((Xi - mean)^2) / (N - 1)
- for the population variance, divide by N (the population size): var = sum((Xi - mean)^2) / N
- a main disadvantage of the variance is its unit (the square of the data's unit), which is hard to interpret
- it puts greater emphasis on the extreme values (and outliers), as it sums the squared residuals
- standard deviation (SD): = square_root(variance)
- standard error (SE): = SD / square_root(N)
- DISADVANTAGES of the (variance, SD, SE)-group of parameters
- senseless values: one of the main disadvantages comes from impossible values for many non-Gaussian distributions, i.e. the computed values for (mean + 2*SD) or (mean - 2*SD) fall well outside the plausible values for the measured variable (e.g. negative age or body mass)
- put greater emphasis on extreme values (as residuals get squared)
- Complex measures:
- MD (mean deviation): absolute deviation of the mean = sum(|Xi - mean|) / N, see previous paragraph
- AAD: average absolute deviation of the median = sum(|Xi - median|) / N
- median absolute deviation of the mean: = median(all |Xi - mean|)
- MAD: median absolute deviation of the median = median(all |Xi - median|)
- this is a robust estimator
- highly insensitive to outliers
- NOTES
- Coefficient of stability for the mean: = SE/Mean
- Coefficient of variation: CV = SD/Mean
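The indexes of spread defined so far can be computed side by side. This is a minimal sketch in base R on a made-up sample `x` with one high outlier; the variable names are chosen for illustration only. Note that R's `mad()` multiplies by 1.4826 by default (to match the SD for Gaussian data), so `constant = 1` is needed to obtain the plain MAD as defined above.

```r
# Hypothetical sample with one high outlier (12.6).
x <- c(4.1, 4.8, 5.0, 5.2, 5.3, 5.5, 5.9, 6.4, 7.0, 12.6)
n <- length(x)

# Indexes usually paired with the median
rng      <- range(x)                                       # min and max
ipr50    <- unname(diff(quantile(x, c(0.25, 0.75))))       # interquartile range
ipr95    <- unname(diff(quantile(x, c(0.025, 0.975))))     # inner 95% zone
h_spread <- diff(fivenum(x)[c(2, 4)])                      # H2 - H1 (hinges)
q        <- unname(quantile(x, c(0.25, 0.75)))
qvc      <- 100 * (q[2] - q[1]) / (q[2] + q[1])            # quartile variation coeff.

# Indexes usually paired with the mean
v  <- var(x)           # sample variance, divisor N - 1
s  <- sd(x)            # standard deviation = sqrt(variance)
se <- s / sqrt(n)      # standard error of the mean
cv <- s / mean(x)      # coefficient of variation

# Absolute-deviation measures
md      <- mean(abs(x - mean(x)))      # mean deviation (around the mean)
aad     <- mean(abs(x - median(x)))    # average absolute deviation (around the median)
mad_raw <- mad(x, constant = 1)        # median(|Xi - median|), unscaled

print(round(c(IPR50 = ipr50, IPR95 = ipr95, H = h_spread, QVC = qvc,
              SD = s, SE = se, CV = cv, MD = md, AAD = aad, MAD = mad_raw), 3))
```

In this example the single outlier inflates the SD far more than the MAD or the interquartile range.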
- modern methods: these methods do not depend on any index of central tendency!!!
- Gini's mean difference = sum(|Xi - Xj|) / (n*(n-1)), where the sum runs over all pairs with i ≠ j
- please note that both Xi and Xj are members of the original data (differing therefore from the mean deviation, where the mean is subtracted from every element), see Fast R-Implementation. (see here for the exact formula and also Wikipedia)
- Gini's Index is widely used in economics, see http://economics.dal.ca/RePEc/dal/wparch/howgini.pdf
- Walsh averages = array of all (Xi+Xj)/2 for each pair of the data (including each value paired with itself) -> especially useful for non-Gaussian data;
- other modern methods: bootstrap resampling and jackknife procedures
- bootstrap resampling can be used to compute robust confidence intervals for the various indexes described previously (like the mean);
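Both pairwise measures are easy to build with `outer()`. The sketch below uses base R and a made-up vector `x`; it is a direct, O(n^2) illustration and not the fast implementation referenced above. Taking the median of the Walsh averages (often called the Hodges-Lehmann estimator) is added here only as one typical use of them.

```r
# Hypothetical sample.
x <- c(3, 7, 8, 12, 20)
n <- length(x)

# Gini's mean difference: average |Xi - Xj| over all ordered pairs with i != j.
abs_diff <- abs(outer(x, x, "-"))            # n x n matrix, zeros on the diagonal
gini_md  <- sum(abs_diff) / (n * (n - 1))

# Walsh averages: (Xi + Xj) / 2 for every pair i <= j (each value also with itself).
pair_means <- outer(x, x, "+") / 2
walsh      <- pair_means[upper.tri(pair_means, diag = TRUE)]

# One typical use: the median of the Walsh averages as a robust location estimate.
hl_estimate <- median(walsh)

print(gini_md)
print(sort(walsh))
print(hl_estimate)
```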
What should I use in my work?
There is NO universal index. However some comments are pertinent. For continuous data:
- the median is always at least as good as the mean, and often much better, so you are on the safe side if you use the median
- beyond median / mean, you must provide a second index
- if you wish to point out the outliers, use the median (and the range)
- if outliers are a cause of concern and should not influence any statistical decisions, use the median (and the interquartile range = inner percentile range 50%, IPR50%) or the median (and IPR95%)
- if the group is large and normally distributed, the CV (SD / mean) is < ~0.5 (for very stringent conditions use 0.15 - 0.30) and mean ± 2*SD evaluates to acceptable values, you can provide the mean (and SD), but the median (and IPR95%) or the median (and range) are equally well suited;
For binary data
- only proportions are available
- do NOT provide both proportions, but only the one significant for your work (as they are redundant, p=1-q)
- state the group size, too (sometimes as n / Ntotal (x%), where Ntotal is the size of the group)
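The reporting format suggested above can be produced as follows. This is a minimal sketch in base R; the logical vector `smoker` is made-up illustration data.

```r
# Hypothetical binary variable for a group of 12 subjects.
smoker <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,
            FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)

n       <- sum(smoker)      # number of subjects with the characteristic
n_total <- length(smoker)   # size of the whole group

# Report only the proportion of interest, together with the group size.
report <- sprintf("%d / %d (%.1f%%)", n, n_total, 100 * n / n_total)
print(report)
```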
Outliers
Another important issue is the search for outliers. There are various tests, e.g. Dixon's Q-Test. For full details see the R-Project Newsletter Vol 6/2, May 2006, "Processing data for outliers", p10-13, freely available at http://cran.r-project.org/doc/Rnews/Rnews_2006-2.pdf . The paper discusses other tests (like the Grubbs and Cochran tests), too. It offers some examples describing the R-implementation of the tests (package outliers). Historically, Chauvenet's Criterion was widely used. It converges for large N to Grubbs' test at confidence 50%.
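To illustrate the idea behind such tests without installing the package outliers, here is a minimal sketch of the two-sided Grubbs test for a single outlier, written in base R. The function name `grubbs_check` and the sample data are hypothetical, and the test assumes that the data (apart from the suspect value) are approximately Gaussian.

```r
# Two-sided Grubbs test for one outlier (sketch; assumes roughly Gaussian data).
grubbs_check <- function(x, alpha = 0.05) {
  n <- length(x)
  dev <- abs(x - mean(x))
  g <- max(dev) / sd(x)                       # test statistic
  # Critical value derived from the t distribution with n - 2 degrees of freedom.
  t_crit <- qt(alpha / (2 * n), df = n - 2, lower.tail = FALSE)
  g_crit <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
  list(statistic = g, critical = g_crit,
       suspect = x[which.max(dev)], is_outlier = g > g_crit)
}

# Hypothetical measurements; 15.0 looks suspicious.
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 15.0)
res <- grubbs_check(x)
print(res)

# The same data without the suspect value.
res_clean <- grubbs_check(x[1:6])
print(res_clean$is_outlier)
```

For real work, the tested implementations in the package outliers described in the newsletter article are preferable to this sketch.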
Graphical presentations
- Basics: VOLUNTEERS NEEDED: please expand!!!
- Graphical Analysis
- Stem-and-Leaf plots: see http://regentsprep.org/Regents/math/data/stemleaf.htm
- Histogram
- spectrum (distribution)
- Frequency/ Cumulative Frequency
- Box-Plots
- Bar charts
- Relationship:
- scatterplot
- Categorical data
- Frequency/ Cumulative Frequency
- Pie charts: are disfavoured in serious statistical analysis
- Venn/Euler diagrams: for overlapping categories, see http://en.wikipedia.org/wiki/Venn_diagram for a detailed explanation
- Time-Series:
- Kaplan-Meier curves
- plotting data for multiple groups/ multivariate data:
- ggpoint(), gghistogram(), ... in R-package ggplot (regression curves can be fitted with ggsmooth(), too); please run the examples from the ggplot help to see some real applications.
- Boxplot: ggboxplot() in R-package ggplot and boxplot() in R-package graphics. Run the examples in the help file for full details.
Intro
The first step in performing a statistical analysis is to visually inspect the actual data. This is usually done with some simple graphical methods. Probably the most widely known are the stem-and-leaf plot for dimensional data and one-way frequency tables / cumulative frequencies for non-dimensional data. More complex methods involve histograms and boxplots.
After inspecting the data, the second step involves describing the group of data as a whole using some special group values (summary indexes), collectively known as indexes of central tendency and indexes of spread (see the paragraph on Indexes of Central Tendency).
In the next paragraphs I will describe the basic graphical methods with reference to the corresponding R-functions:
STEM-LEAF PLOT
R function:
stem(data, scale, ...) in package graphics, where
# data = numeric vector of data for the plot,
# scale = controls the length of the plot
HISTOGRAMS
R functions:
truehist(data, nbins, ...) in package MASS, where
# data = numeric vector of data for the histogram,
# nbins = the suggested number of bins.
hist(data, breaks, ...) in package graphics, where
# data = numeric vector of data for the histogram,
# breaks = a vector with the breakpoints between the cells or the number of cells (or ...)
Various other plotting methods:
barplot2() from package gplots, lines() from package graphics, ...
For multivariate data see the function gghistogram() in package ggplot.
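The basic plots can be tried with base R alone. This is a minimal sketch; the vector `x` is made-up illustration data, and `pdf(NULL)` is used only so that the example also runs without a screen (omit it, together with `dev.off()`, to see the plots interactively).

```r
# Hypothetical sample of 20 measurements.
x <- c(12, 15, 11, 19, 14, 22, 17, 13, 16, 18,
       21, 15, 14, 16, 35, 13, 17, 15, 16, 14)

stem(x, scale = 1)                  # stem-and-leaf plot, printed as text

pdf(NULL)                           # throw-away graphics device (no file is written)
h <- hist(x, breaks = 5,            # breaks = suggested number of cells
          main = "Histogram of x", xlab = "x")
b <- boxplot(x, horizontal = TRUE,  # box = hinges, thick line = median
             main = "Boxplot of x")
invisible(dev.off())

print(h$counts)                     # number of values in each cell
print(b$out)                        # points drawn as potential outliers
```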
Modern Methods: Bootstrap Resampling
The classical statistical methods have some important intellectual disadvantages. Recent advances in computational power, however, allow the use of robust non-parametric methods. More details on these methods can be found here. See also later. Below I will describe bootstrap resampling:
Creating robust confidence intervals:
R functions: boot() and boot.ci() from the package boot;
boot_outobj <- boot(data, stat_function, R = 999, stype = "w", ...), where
# data is the vector containing the data
# stat_function is the name of our statistical function to be bootstrapped, e.g.:
# for the mean we need to define a custom function mymean(): mymean <- function(x, w) { sum(x * w) / sum(w) }
# boot_outobj: the function outputs an object of class boot;
boot.ci(boot_outobj, conf = 0.95, ...), where
# boot_outobj is an object of class boot
# conf = the desired confidence level (e.g. 95%)
R functions: bootstrap() from package bootstrap;
boot_outobj <- bootstrap(data, nboot = 999, theta = stat_function, ...), where
# data is the vector containing the data
# nboot = 999, the number of bootstrap samples desired
# stat_function is the name of our statistical function to be bootstrapped:
# e.g. mean, median or a user-defined function;
# boot_outobj: the output value is a list with various components,
# most important is thetastar;
summary(boot_outobj$thetastar) gives a summary of the result;
quantile(boot_outobj$thetastar, .025) and quantile(boot_outobj$thetastar, .975)
# will construct the 95% CI for our statistic (for stat_function).
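The principle behind both packages can be shown in a few lines of base R. This is a minimal sketch of a percentile bootstrap for the median, assuming made-up data `x`; it does not use boot() or bootstrap(), and the percentile interval is only the simplest of the interval types that boot.ci() offers.

```r
set.seed(42)                        # make the resampling reproducible

# Hypothetical skewed sample.
x <- c(4.1, 4.8, 5.0, 5.2, 5.3, 5.5, 5.9, 6.4, 7.0, 12.6)
R <- 999                            # number of bootstrap samples

# Resample the data with replacement and recompute the statistic each time.
boot_medians <- replicate(R, median(sample(x, replace = TRUE)))

# Percentile 95% confidence interval for the median.
ci <- quantile(boot_medians, probs = c(0.025, 0.975))

summary(boot_medians)
print(ci)
```

Replacing `median` with `mean` or any user-defined function gives the corresponding interval for that statistic.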