Calc/To-Dos/Statistical Data Analysis Tool

The purpose of this page is to outline the requirements of Calc's Statistical Data Analysis Tool (working name), which is yet to be developed by someone with the willingness and skills. Feel free to contribute to this page: what features you want to see in this tool, how such a tool could be developed, or anything else that needs to be added.

Goal

Large amounts of univariate or multivariate data require convenient functions to analyze them. There are countless statistical methods that accomplish this. Some of them are already integrated, but many are still missing. In most cases they produce one or more values from one or more vectors of numbers (or categorical variables) that describe those vectors or the relationships between them. The aim is a collection of methods that are easy to use.

Statistical methods have evolved significantly over recent years. With the advent of cheap computing power, new computationally intensive methods have become more popular and will ultimately replace the older, severely flawed classical tests. Some of these advances include Bayesian statistics and various resampling procedures (bootstrap and jackknife). I will try to introduce some of these newer methods because of the major intellectual advantages they offer.

The implementation of any of these newer techniques is unfortunately not as easy as for the classical tests. Therefore, strong consideration should be given to implementing such techniques through external programs where free alternatives exist.

Statistical Models

  • Classical Statistics
  • Bayesian Statistics

Due to the complexity of Bayesian statistics, the implementation should be done entirely through external programs (R and/or WinBUGS). [LOW PRIORITY, but the future definitely belongs to Bayesian statistics.] In the rest of this document I will describe only classical statistical methods. Strong consideration should be given to including parts of this document in the help file, too.



Classical statistics

This document is structured to mirror the reasoning used in actual statistical inference. I will deal with the following topics in separate sections:

  • Descriptive statistics
  • Statistical inference I: parametric tests (Gaussian-distribution)
  • Statistical inference II: non-parametric tests

I will often give R-functions as examples. Some of the functions described are already available in Calc, and some new ones could be implemented, but strong consideration should be given to accessing them externally through R. A more detailed description of R is given at the end of this document (section on external software integration).

Descriptive Statistics

One of the most often neglected steps in statistics is the adequate presentation of the group characteristics of the data of interest. However, half of the statistical work is already done simply by providing adequate descriptions. In the following sections I will therefore emphasize the correct techniques, especially in view of newer methods and possibilities.

Statistical reasoning usually starts with examining the data. The first step involves some graphical methods (see Graphical Presentations). However, I will start this section with a description of the various group indices.


Indexes of Central Tendency:

Most of these are already available in Calc. However, more emphasis should be put on the median, so some of the following comments could be included in the help file, too. (A short R sketch of these indexes follows the list below.)

  • median: is favoured over the mean (Feinstein AR. Principles of Medical Statistics. Chapman & Hall 2002); the help file should state clearly that the median has several advantages over the mean as an index of central tendency and therefore should be used preferentially
    • median: provide both the median AND the 50% inter-percentile range (IPR50%) or the IPR95% or the range (see Indexes of Inner Location)
  • mean: in most instances, the median should be used instead (also rename AVERAGE() to MEAN() – it is more intuitive); the mean is usually reported as mean ± SD (standard deviation), however this value has several disadvantages... (see Indexes of Spread)
    • [trimmed mean: e.g. the 25% trimmed mean ignores the lower 25% and the upper 25% of the data; it is more robust against outliers than the mean; however, if there is concern about outliers, the median should be used for descriptive purposes instead. The trimmed mean is valuable in bootstrap resampling, because the median performs poorly there unless the sample size is quite large.]
  • mode: rarely used (several disadvantages, but may be useful for nominal data)
  • modern methods: bootstrap resampling and jackknife procedures (see later)
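
To illustrate, a minimal R sketch of these indexes, assuming a small hypothetical data vector x (the values are invented for illustration):

   x <- c(2, 3, 3, 4, 5, 7, 21)   # hypothetical data with one outlier
   median(x)                      # robust index of central tendency (here 4)
   quantile(x, c(0.25, 0.75))     # bounds of the IPR50% (interquartile range)
   quantile(x, c(0.025, 0.975))   # bounds of the IPR95%
   mean(x)                        # pulled towards the outlier (here about 6.43)
   mean(x, trim = 0.25)           # 25% trimmed mean, more robust against the outlier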

Indexes of Inner Location

  • quartiles (25%, 50% = median, 75%)
  • percentiles
  • rank
  • percentrank (? = cumulative relative frequency)
  • standardized z-score
  • standardized increment (index of contrast for 2 groups of data)
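
A minimal R sketch for several of these indexes, assuming a numeric data vector x:

   quantile(x, c(0.25, 0.5, 0.75))   # quartiles
   quantile(x, 0.9)                  # an arbitrary percentile (here the 90th)
   rank(x)                           # ranks of the data
   ecdf(x)(x)                        # cumulative relative frequency (percent rank)
   (x - mean(x)) / sd(x)             # standardized z-scores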


Indexes of Spread

  • range: min, max
  • inner percentile zones: 95%, 50% (interquartile range)
  • H zone: zone between 2 hinges
  • standard deviation, standard error: ... one of the main disadvantages of the SD is that it can imply impossible values for many non-Gaussian distributions
  • Gini's mean difference = the mean of |Xi - Xj| over all pairs of the data; Walsh averages = the array of all pairwise averages (Xi + Xj)/2 (including each value paired with itself) -> especially useful for non-Gaussian data; these methods do not depend on any index of central tendency! (See the R sketch after this list.)
  • modern methods: bootstrap resampling and jackknife procedures; the bootstrap resampling can be used to compute robust confidence intervals for the various indexes described previously;
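
For the two pair-based indexes above, a minimal R sketch (the helper names gini.md and walsh are assumptions for illustration):

   gini.md <- function(x) {
     d <- abs(outer(x, x, "-"))       # all pairwise absolute differences
     mean(d[lower.tri(d)])            # average over the distinct pairs
   }
   walsh <- function(x) {
     w <- outer(x, x, "+") / 2        # all pairwise averages
     w[lower.tri(w, diag = TRUE)]     # keep each pair once, incl. self-pairs
   }
   median(walsh(x))                   # Hodges-Lehmann estimator of location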

Another important issue is the search for outliers. There are various tests, e.g. Dixon's Q-test. ...
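
As an illustration, a minimal sketch of Dixon's Q statistic in R (the helper name dixon.q is an assumption; the resulting Q must still be compared against tabulated critical values):

   dixon.q <- function(x) {
     x <- sort(x)
     n <- length(x)
     gap <- max(x[2] - x[1], x[n] - x[n - 1])  # gap from the suspect value to its neighbour
     rng <- x[n] - x[1]                        # full range of the data
     gap / rng                                 # Q statistic; look up the critical value
   }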

Graphical presentations:

  • spectrum (distribution)
  • Frequency/ cumulative frequency
  • Stem-Leaf plot
  • Histogram
  • plotting data for multiple groups/ multivariate data:
    • ggpoint(), gghistogram(), ... in R-package ggplot (regression curves can be fitted with ggsmooth(), too); please run the examples from the ggplot help to see some real applications.
    • Boxplot: ggboxplot() in R-package ggplot and boxplot() in R-package graphics. Run the examples in the help file for full details.

The first step in performing a statistical analysis is to visually analyse the actual data. This is usually done with some simple graphical methods. Probably the most widely known are the stem-leaf plot for dimensional data and one-way frequency tables/cumulative frequencies for non-dimensional data. More complex methods involve histograms and boxplots.

After inspecting the data, the second step involves describing the group of data as a whole using some special group values (summary indexes), collectively known as indexes of central tendency and indexes of spread (see the paragraph on Indexes of Central Tendency).

In the next paragraphs I will describe the basic graphical methods with reference to the corresponding R-functions:

STEM-LEAF PLOT

R function:

* stem('data', scale, ...) in package graphics, where
  'data' = numeric vector of data for the plot,
  scale = controls the length of the plot


HISTOGRAMS

R functions:

* truehist('data', nbins, ...) in package MASS, where
   'data' = numeric vector of data for histogram,
   nbins = the suggested number of bins. 
* hist('data', breaks, ...) in package graphics, where
   'data' = numeric vector of data for histogram,
   breaks = a vector with the breakpoints between the cells or the number of cells (or ...)

Various other plotting methods:

* barplot2() and other functions from package gplots; lines() and related functions from base graphics

For multivariate data see the function gghistogram() in package ggplot.
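
A hypothetical usage example of the plotting functions above (the simulated data are an assumption for illustration):

   x <- rnorm(200)           # 200 simulated Gaussian values
   stem(x)                   # stem-leaf plot printed to the console
   hist(x, breaks = 20)      # histogram with about 20 cells
   library(MASS)             # assumes the MASS package is installed
   truehist(x, nbins = 20)   # probability-scaled histogram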

Modern Methods: Bootstrap Resampling

The classical statistical methods have some important intellectual disadvantages. Recent advances in computational power, however, allow the use of robust non-parametric methods. Below I describe bootstrap resampling:

Creating robust confidence intervals:

R functions: boot() and boot.ci() from the package boot;

'boot_outobj' <- boot('data', 'stat_function', R = 999, stype = "w", ...), where
  'data' is the vector containing the data
  'stat_function' is the name of the statistical function to be bootstrapped; e.g.
  for the mean we need to define a custom weighted function mymean():
     mymean <- function(x, w)
        {sum(x * w) / sum(w)}
'boot_outobj': the function returns an object of class boot;

boot.ci('boot_outobj', conf = 0.95, ...), where
  'boot_outobj' is an object of class boot
  conf = the desired confidence level (e.g. 0.95 for a 95% CI)

R function: bootstrap() from package bootstrap;

'boot_outobj' <- bootstrap('data', nboot = 999, 'stat_function', ...), where
  'data' is the vector containing the data
  nboot = the number of bootstrap samples desired (e.g. 999)
  'stat_function' is the name of the statistical function to be bootstrapped:
     e.g. mean, median or a user-defined function;
  'boot_outobj': the returned value is a list with various components,
     the most important being thetastar;

  summary('boot_outobj'$thetastar) gives a summary of the result;
  quantile('boot_outobj'$thetastar, .025) and
  quantile('boot_outobj'$thetastar, .975)
     will construct the 95% CI for our statistic (for 'stat_function').
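
Putting the pieces together, a minimal runnable sketch with the boot package (the simulated sample is an assumption; mymean is the weighted-mean function defined above):

   library(boot)                                  # assumes the 'boot' package is installed
   x <- rnorm(50, mean = 10, sd = 2)              # hypothetical sample
   mymean <- function(x, w) sum(x * w) / sum(w)   # weighted mean, as defined above
   b <- boot(x, mymean, R = 999, stype = "w")     # 999 bootstrap resamples
   boot.ci(b, conf = 0.95, type = "bca")          # robust BCa 95% confidence interval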

Statistical inference: parametric tests (Gaussian)

These methods are already available in Calc. However, many should be superseded by modern methods.

Statistical inference: non-parametric tests (distribution-free)

Unlike parametric models, these tests do NOT depend on any theoretical assumptions about distributions or any other specific population characteristic.

  • Fisher exact test (for proportions)
  • Pitman-Welch permutation test (for dimensional data)
  • Wilcoxon signed-rank test
  • Mann-Whitney-Wilcoxon U test


Fisher Exact Test

The Fisher exact test is the gold standard for the comparison of two proportions. The chi-square test was favoured in the past because it was easier to calculate; however, it is an approximate method that becomes inaccurate especially at small sample sizes, where its underlying approximation breaks down. As computing power is no longer a limitation, the Fisher test should be used instead.

R-syntax:

* fisher.test(matrix(c('nr1', 'nr2', 'nr3', 'nr4'),2)),
* where 'nr1', ..., 'nr4' are the four values from the contingency table
* Return value: a list with class htest containing the following components:
*   p-value
*   confidence interval for the odds ratio 
*   odds ratio
*   (further text descriptions)
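
A hypothetical worked example (the cell counts are invented for illustration):

   # 2x2 contingency table: rows = treatment/control, columns = success/failure
   tab <- matrix(c(9, 1, 4, 6), nrow = 2, byrow = TRUE)
   res <- fisher.test(tab)
   res$p.value     # the p-value
   res$estimate    # the odds ratio
   res$conf.int    # 95% confidence interval for the odds ratio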

Pitman-Welch Permutation Test

This test is a permutation test similar to the Fisher exact test, but for dimensional data. A good description of permutation tests can be found at http://bcs.whfreeman.com/ips5e/content/cat_080/pdf/moore14.pdf .
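
A minimal sketch of a generic two-sample permutation test on the difference of means (perm.test and its argument names are assumptions for illustration, not the exact Pitman-Welch procedure):

   perm.test <- function(x, y, R = 9999) {
     pooled <- c(x, y)
     n <- length(x)
     obs <- mean(x) - mean(y)                     # observed difference
     perms <- replicate(R, {
       idx <- sample(length(pooled), n)           # random relabelling of the groups
       mean(pooled[idx]) - mean(pooled[-idx])
     })
     (sum(abs(perms) >= abs(obs)) + 1) / (R + 1)  # two-sided permutation p-value
   }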

Rank Tests

These tests can be applied without any problem to ordinal data and to dimensional data with non-Gaussian distributions.
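
Both rank tests from the list above are already available in R through wilcox.test() (x and y are assumed data vectors):

   wilcox.test(x, y)                  # Mann-Whitney-Wilcoxon U test (two independent samples)
   wilcox.test(x, y, paired = TRUE)   # Wilcoxon signed-rank test (paired samples)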



Overview of single and multivariate data analysis

just some thoughts ....

Graphics

  • histogram
  • density
  • bubble
  • mosaicplot for crosstables
  • Box and Whiskers Plot
  • Bland and Altman Plot

Models

linear

non-linear

multivariate response variables

  • multivariate tests

Test Characteristics

Rationale

In real life, a test result is not simply positive or negative, but rather produces a value, which is interpreted relative to a given cutoff. Depending on this cutoff, the various test characteristics may change.

It would be nice to automate the computations for these parameters:

  • Sensitivity
  • Specificity
  • Receiver-Operating Characteristics Curves (ROC)
  • Positive Predictive Value (PPV)
  • Negative Predictive Value (NPV)

All these features are already available in R (see later in the section Third Party Library Integration), package ROCR (visualizing classifier performance in R, ROCR-site). A minimal sketch follows:
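
A minimal hypothetical sketch with ROCR (scores and labels are assumed vectors holding the test values and the true 0/1 outcomes):

   library(ROCR)                        # assumes the ROCR package is installed
   # 'scores' = numeric test values, 'labels' = true 0/1 outcomes (assumed vectors)
   pred <- prediction(scores, labels)   # pair test values with true outcomes
   perf <- performance(pred, "tpr", "fpr")
   plot(perf)                           # ROC curve
   performance(pred, "auc")@y.values    # area under the ROC curve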

Receiver-Operating Characteristics Curves (ROC-Curves)

The comparison between two tests is generally done by calculating the area under the ROC curve.

Patterns

Cluster analysis

Desired Features

What features do users need in a statistical data analysis tool?

  • analysis of one-dimensional data
  • Deviation from the median
  • Variation Ratio
  • Range
  • Pearson r²
  • analysis of multidimensional data
  • graphical representation

Task Breakdown

User Input and Output

How this application needs to be structured...

Third Party Library Integration

R

R can be used as a backend statistical analysis engine. Since we can't ship R with OO.o due to licensing incompatibility (it's released under GPL), the location of its executable or shared library needs to be specified by the user so that OO.o can locate it at run-time.


ADVANTAGES of R:

  • offers complex statistical functions
  • over 100 additional packages available
    • e.g. Fisher exact-test, bootstrap procedure and other non-parametric tests
  • linear regression models (including glm, generalized linear regression), as well as non-linear models


DISADVANTAGE

  • R is a statistical environment (more like a programming language) and for beginners it is sometimes difficult to use

Calc should therefore ease this use by:

  • providing easy access to basic R functions from within Calc


Example of Fisher-test:

  • syntax in R: fisher.test(matrix(c('nr1', 'nr2', 'nr3', 'nr4'),2)), where 'nr1-4' are the corresponding values to be tested.
  • the user should be able to select the 4 cells in Calc and have a Menu-Item 'Fisher Test'; Calc should automatically pipeline the data (including correct syntax) into R
  • a special note on this test: the Fisher exact test should be used in preference to the inaccurate Chi-Square test for contingency tables (I actually always use the Fisher test)


Example of Regression:

  • linear regression:
    • 'regression-object'<- lm(y ~ x1 + x2 + ...), where y, x1, x2, ... are vectors containing our data
    • 'regression-object' is an object taking the results
    • details can be obtained using the command: summary('our-object'), in this case summary('regression-object')
    • lm() has of course numerous options not covered here
  • generalized linear regression:
    • 'object'<- glm(formula, other options...)
  • non-linear models: e.g. nls(y ~ 1/(1 + exp(a + k*x)), start = list(a = 0, k = 1)), a sigmoid curve; nls() requires starting values for the parameters (here a and k); a worked example of the linear case follows
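
A hypothetical worked example of the linear case (the data vectors are invented for illustration):

   x1  <- c(1, 2, 3, 4, 5)             # hypothetical predictor
   y   <- c(2.1, 3.9, 6.2, 8.1, 9.8)   # hypothetical response
   fit <- lm(y ~ x1)                   # fit the linear model
   summary(fit)                        # coefficients, R-squared, p-values
   coef(fit)                           # intercept and slope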