Calc/To-Dos/Statistical Data Analysis Tool

From Apache OpenOffice Wiki
Revision as of 13:18, 11 July 2006 by Discoleo (Talk | contribs)

The purpose of this page is to outline the requirements of Calc's Statistical Data Analysis Tool (working name), which has yet to be developed by someone with the willingness and skills. Feel free to add the features you want to see in this tool, ideas on how such a tool could be developed, or anything else you think is needed.

Goal

Large amounts of univariate or multivariate data require convenient functions to analyze them. Countless statistical methods exist for this purpose. Some of them are already integrated into Calc, but many are still missing. In most cases they produce one or more values from one or more vectors of numbers (or categorical variables), describing those vectors or the relationships between them. The aim is a collection of methods that are easy to use.

Statistical Models

  • Classical Statistics
  • Bayesian Statistics

Due to the complexity of Bayesian statistics, the implementation should be done entirely through external programs (R and/or WinBUGS). [LOW PRIORITY, but the future definitely belongs to Bayesian statistics]


Classical statistics

  • Descriptive statistics
  • Statistical inference I: parametric tests (Gaussian distribution)
  • Statistical inference II: non-parametric tests

Descriptive Statistics

Indexes of central tendency:

Most of these are already available in Calc. However, more emphasis should be put on the median, so some of the following comments could be included in the help file, too.

  • median: favoured over the mean (Feinstein AR. Principles of Medical Statistics. Chapman & Hall 2002); the help file should state clearly that the median has several advantages over the mean as an index of central tendency and should therefore be used preferentially
    • median: report the median together with a measure of spread, i.e. the 50% inter-percentile range (IPR50%), the IPR95%, or the range
    • range: min, max
    • quartiles (25%, 50% = median, 75%)
    • percentiles
  • rank
    • percentrank (? = cumulative frequency)
  • mean: in most instances, the median should be used instead (also consider renaming AVERAGE() to MEAN(), which is more intuitive)
  • mode: rarely used (several disadvantages)
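As an illustration of the summary proposed in the list above (median reported together with quartiles, IPR50% and range), here is a minimal Python sketch. It is not Calc or R code; the function name `describe` and the choice of the "inclusive" quantile method are assumptions made for this example only.

```python
import statistics

def describe(values):
    """Return a median-centred summary of a list of numbers (hypothetical helper)."""
    data = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
        "ipr50": q3 - q1,  # 50% inter-percentile range
        "min": data[0],
        "max": data[-1],
    }

# One extreme value (40) pulls the mean upwards but leaves the median in place.
sample = [2, 4, 4, 5, 7, 9, 10, 12, 40]
summary = describe(sample)
mean = statistics.mean(sample)
```

For this sample the median stays at the centre of the bulk of the data while the mean is shifted by the single outlier, which is the argument made above for preferring the median.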

Graphical presentations:

  • spectrum (distribution)
  • Frequency/ cumulative frequency
  • Stem-Leaf plot
  • Histogram
  • plotting data for multiple groups/ multivariate data:
    • ggpoint(), gghistogram(), ... in R-package ggplot (regression curves can be fitted with ggsmooth(), too); please run the examples from the ggplot help to see some real applications.
    • Boxplot: ggboxplot() in R-package ggplot and boxplot() in R-package graphics. Run the examples in the help file for full details.
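The frequency and cumulative frequency entries in the list above could be backed by a small table like the one computed below. This is only a Python sketch of the underlying counting; `frequency_table` is a hypothetical name, not an existing Calc or R function.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(values):
    """Return (value, frequency, cumulative frequency) rows, sorted by value."""
    counts = Counter(values)
    keys = sorted(counts)
    freqs = [counts[k] for k in keys]
    cumulative = list(accumulate(freqs))  # running total of the frequencies
    return list(zip(keys, freqs, cumulative))

table = frequency_table([1, 2, 2, 3, 3, 3])
```

A histogram or cumulative frequency chart would then plot the second or third column against the first.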

STEM-LEAF PLOT


R function: stem('data', scale, ...) in package graphics, where 'data' is the numeric vector to be plotted and scale controls the length of the plot
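To show what such a plot contains, here is a minimal Python sketch of a stem-and-leaf display. It is a simplification, not a port of R's stem(): it assumes non-negative integers, uses the tens as stems and the units as leaves, and has no scale option. The name `stem_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_leaf(values):
    """Return the lines of a simple stem-and-leaf display.

    Assumes non-negative integers; stem = value // 10, leaf = value % 10.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

plot = stem_leaf([12, 15, 21, 23, 23, 38, 41])
print("\n".join(plot))
```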


Overview single and multivariate data analysis

just some thoughts ....

Graphics

  • histogram
  • density
  • bubble
  • mosaicplot for crosstables
  • Box and Whiskers Plot
  • Bland and Altman Plot

Models

linear

non-linear

multivariate response variables

  • multivariate tests

Test Characteristics

Rationale

In real life, a test result is not simply positive or negative, but rather produces a value, which is interpreted depending on a given cutoff. Depending on this cutoff, the various test characteristics may change.

It would be nice to automate the computations for these parameters:

  • Sensitivity
  • Specificity
  • Receiver-Operating Characteristics Curves (ROC)
  • Positive Predictive Value (PPV)
  • Negative Predictive Value (NPV)

All these features are already available in R (see the section Third Party Library Integration below) through the package ROCR (visualizing classifier performance in R; see the ROCR site).
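The dependence of these parameters on the cutoff can be sketched as follows. This is an illustrative Python example, not ROCR code; the function name, the convention that a value at or above the cutoff counts as a positive test, and the sample data are all assumptions.

```python
def characteristics_at_cutoff(values, diseased, cutoff):
    """Compute sensitivity, specificity, PPV and NPV for one cutoff.

    values   -- measured test values
    diseased -- parallel list of booleans (True = condition present)
    A value >= cutoff is treated as a positive test (assumed convention).
    """
    tp = fp = tn = fn = 0
    for value, has_condition in zip(values, diseased):
        positive = value >= cutoff
        if positive and has_condition:
            tp += 1
        elif positive:
            fp += 1
        elif has_condition:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

values = [1, 2, 3, 4, 5, 6, 7, 8]
diseased = [False, False, False, True, False, True, True, True]
low = characteristics_at_cutoff(values, diseased, cutoff=4)
high = characteristics_at_cutoff(values, diseased, cutoff=6)
```

Raising the cutoff from 4 to 6 in this made-up data trades sensitivity for specificity, which is the behaviour described in the Rationale.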

Receiver-Operating Characteristics Curves (ROC-Curves)

Two tests are generally compared by calculating the area under the ROC curve (AUC) of each test.
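One way to compute the area under the ROC curve without drawing the curve is the rank-based (Mann-Whitney) formulation: the proportion of diseased/non-diseased pairs in which the diseased subject has the higher test value, with ties counted as one half. This equals the trapezoidal area under the empirical ROC curve. The Python sketch below assumes that higher values indicate disease; `roc_auc` and the sample data are hypothetical.

```python
def roc_auc(values, diseased):
    """Area under the empirical ROC curve via pairwise comparison."""
    pos = [v for v, d in zip(values, diseased) if d]
    neg = [v for v, d in zip(values, diseased) if not d]
    if not pos or not neg:
        raise ValueError("need at least one diseased and one non-diseased case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

values = [1, 2, 3, 4, 5, 6, 7, 8]
diseased = [False, False, False, True, False, True, True, True]
auc = roc_auc(values, diseased)
```

The double loop is quadratic in the number of cases, which is acceptable for a sketch; a sort-based rank sum would be the choice for large data sets.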

Patterns

Cluster analysis

Desired Features

What features do users need as statistical data analysis tools?

  • analysis of one-dimensional data
  • Deviation from the median
  • Variation Ratio
  • Range
  • Pearson r²
  • analysis of multidimensional data
  • graphical representation
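Two of the measures in the list above, sketched in Python under the usual textbook definitions: the variation ratio as 1 minus the relative frequency of the mode, and Pearson r² as the squared correlation coefficient. The function names are hypothetical.

```python
from collections import Counter

def variation_ratio(values):
    """1 - (frequency of the modal category / number of observations)."""
    counts = Counter(values)
    return 1 - max(counts.values()) / len(values)

def pearson_r_squared(x, y):
    """Squared Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined for a constant vector")
    return (sxy * sxy) / (sxx * syy)

vr = variation_ratio(["a", "a", "b", "c"])
r2 = pearson_r_squared([1, 2, 3, 4], [2, 4, 6, 8])
```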

Task Breakdown

User Input and Output

How this application needs to be structured...

Third Party Library Integration

R

R can be used as a backend statistical analysis engine. Since we can't ship R with OO.o due to licensing incompatibility (it's released under GPL), the location of its executable or shared library needs to be specified by the user so that OO.o can locate it at run-time.
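The run-time lookup described above might work roughly as in the following Python sketch: a user-configured path takes precedence, otherwise the search path is consulted for Rscript, R's command-line front end. The function name and the precedence rule are assumptions, not an existing OO.o mechanism.

```python
import os
import shutil

def find_rscript(user_path=None):
    """Return a usable path to the Rscript executable, or None.

    user_path -- location configured by the user (hypothetical setting).
    """
    if user_path:
        # A configured location wins, but only if it is an executable file.
        if os.path.isfile(user_path) and os.access(user_path, os.X_OK):
            return user_path
        return None
    # Fall back to searching the directories listed in PATH.
    return shutil.which("Rscript")

rscript = find_rscript()
if rscript is None:
    print("R not found; ask the user to configure its location")
```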


ADVANTAGES of R:

  • offers complex statistical functions
  • over 100 additional packages available
    • e.g. Fisher exact-test, bootstrap procedure and other non-parametric tests
  • linear regression models (including glm, generalized linear models), as well as non-linear models


DISADVANTAGE

  • R is a statistical environment (more like a programming language) and for beginners it is sometimes difficult to use

Calc should therefore ease this use by:

  • providing easy access within Calc to basic R functions


Example of Fisher-test:

  • syntax in R: fisher.test(matrix(c('nr1', 'nr2', 'nr3', 'nr4'),2)), where 'nr1-4' are the corresponding values to be tested.
  • the user should be able to select the 4 cells in Calc and have a Menu-Item 'Fisher Test'; Calc should automatically pipeline the data (including correct syntax) into R
  • a special note on this test: the Fisher exact test should be used in preference to the inaccurate Chi-Square test for contingency tables (I actually always use the Fisher test)
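The "pipeline the data (including correct syntax) into R" step could look roughly like the Python sketch below: the four selected cell values are formatted into the R expression shown above and handed to Rscript with its -e option. The function names are hypothetical, and how Calc would actually talk to R is not decided on this page.

```python
import subprocess

def fisher_expression(cells):
    """Build the R call for a 2x2 Fisher exact test from four cell values."""
    if len(cells) != 4:
        raise ValueError("exactly four cells are required")
    counts = [int(c) for c in cells]
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    numbers = ", ".join(str(c) for c in counts)
    return f"fisher.test(matrix(c({numbers}), 2))"

def run_in_r(expression, rscript="Rscript"):
    """Evaluate an R expression via Rscript and return its printed output.

    Requires a local R installation; not executed in this sketch.
    """
    result = subprocess.run(
        [rscript, "-e", expression],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

expression = fisher_expression([3, 1, 1, 3])
```

Validating the cells before building the expression keeps arbitrary cell content from being passed to R as code.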


Example of Regression:

  • linear regression:
    • 'regression-object'<- lm(y ~ x1 + x2 + ...), where y, x1, x2, ... are vectors containing our data
    • 'regression-object' is an object taking the results
    • details can be obtained using the command: summary('our-object'), in this case summary('regression-object')
    • lm() has of course numerous options not covered here
  • generalized linear regression:
    • 'object'<- glm(formula, other options...)
  • non-linear models: e.g. nls(y ~ 1/(1 + exp(a + k*x))), a sigmoid curve
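For orientation, the simplest case of the linear regression above, lm(y ~ x) with a single predictor, amounts to an ordinary least-squares fit of a straight line. The Python sketch below shows only that calculation (intercept and slope); it does not reproduce the standard errors and tests that summary() reports in R, and `fit_line` is a hypothetical name.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Points lying exactly on y = 1 + 2x
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```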