Calc/To-Dos/Statistical Data Analysis Tool

From Apache OpenOffice Wiki
Revision as of 12:13, 28 June 2006 by Discoleo (Talk | contribs)

The purpose of this page is to outline the requirements of Calc's Statistical Data Analysis Tool (working name), which has yet to be developed by someone with the willingness and the skills. Feel free to add the features you want to see in this tool, ideas on how such a tool could be developed, or anything else you think is needed.

Goal

Large amounts of univariate or multivariate data require convenient functions to analyze them. Countless statistical methods exist for this purpose. Some of them are already integrated in Calc, but many are still missing. In most cases they produce one or more values from one or more vectors of numbers (or categorical variables); these values describe the vectors or the relationships between them. The aim is a collection of methods that are easy to use.

Overview of single and multivariate data analysis

Just some thoughts...

Graphics

  • histogram
  • density
  • bubble
  • Box and Whiskers Plot
  • Bland and Altman Plot
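
As a rough illustration of what these plots need from the data, the following R sketch computes the numbers behind a histogram, a density estimate, a box-and-whisker plot and a Bland-Altman plot using base R only. The two measurement vectors are made up for illustration, and the 1.96 multiplier for the limits of agreement assumes roughly normally distributed differences.

```r
# Made-up measurements from two methods on the same 10 subjects (illustrative only)
method_a <- c(10.1, 12.3, 9.8, 11.5, 13.0, 10.7, 12.1, 11.9, 10.4, 12.8)
method_b <- c(10.4, 12.0, 10.1, 11.9, 12.6, 10.9, 12.5, 11.6, 10.8, 12.9)

# Histogram: compute the bin counts without drawing anything
h <- hist(method_a, plot = FALSE)

# Kernel density estimate
d <- density(method_a)

# Box-and-whisker: five-number summary (b$stats) and outliers (b$out)
b <- boxplot.stats(method_a)

# Bland-Altman: differences against pairwise means, with limits of agreement
diffs <- method_a - method_b
means <- (method_a + method_b) / 2
bias  <- mean(diffs)
loa   <- bias + c(-1.96, 1.96) * sd(diffs)

# Drawing the plots themselves (uncomment in an interactive session)
# hist(method_a); plot(d); boxplot(method_a)
# plot(means, diffs); abline(h = c(bias, loa), lty = c(1, 2, 2))
```

A Calc front end would only have to hand over the selected cell ranges as vectors; the computation and drawing would be left to R.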

Models

linear

non-linear

multivariate response variables

  • multivariate tests

Test Characteristics

  • Sensitivity
  • Specificity
  • Receiver-Operating Curves (ROC)
  • Positive Predictive Value (PPV)
  • Negative Predictive Value (NPV)

All of these features are already available in R (see below) through the package ROCR (visualizing classifier performance in R; see the ROCR site).
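
To make the four point measures concrete, here is a minimal base R sketch that derives them from a 2x2 table of test result versus true status. The counts and variable names are hypothetical; ROC curves are not covered here because they need scores at several thresholds.

```r
# Hypothetical 2x2 counts: diagnostic test result versus true disease status
tp <- 90   # test positive, disease present
fn <- 10   # test negative, disease present
fp <- 30   # test positive, disease absent
tn <- 170  # test negative, disease absent

sensitivity <- tp / (tp + fn)  # share of diseased subjects the test detects
specificity <- tn / (tn + fp)  # share of healthy subjects the test clears
ppv <- tp / (tp + fp)          # probability of disease given a positive test
npv <- tn / (tn + fn)          # probability of no disease given a negative test

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv), 3)
```

In Calc, the user would select the four cells of the table and the tool would return these four values.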

Patterns

Cluster analysis

Desired Features

What features do users need as statistical data analysis tools?

  • analysis of one-dimensional data
  • Deviation from the median
  • Variation Ratio
  • Range
  • Pearson r²
  • analysis of multidimensional data
  • graphical representation
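
The measures listed above can be sketched in a few lines of base R. The data are made up, and two interpretations are assumptions: "deviation from the median" is taken as the median of the absolute deviations, and "variation ratio" as one minus the relative frequency of the most common category.

```r
# Made-up data (illustrative only)
x <- c(2, 4, 4, 5, 7, 9)
y <- c(1, 3, 2, 5, 6, 8)
colour <- c("red", "blue", "red", "green", "red", "blue")

# Deviation from the median: absolute deviations and their median
abs_dev <- abs(x - median(x))
mad_raw <- mad(x, constant = 1)  # constant = 1 gives the unscaled median deviation

# Variation ratio for categorical data: 1 - (frequency of the mode / n)
variation_ratio <- 1 - max(table(colour)) / length(colour)

# Range as a single number (max - min)
x_range <- diff(range(x))

# Pearson r squared between two numeric vectors
r_squared <- cor(x, y)^2
```

For a single predictor, `r_squared` equals the R² reported by `summary(lm(y ~ x))`, which gives a quick consistency check.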

Task Breakdown

User Input and Output

How this application needs to be structured...

Third Party Library Integration

R

R can be used as a backend statistical analysis engine. Since we can't ship R with OO.o due to licensing incompatibility (it's released under GPL), the location of its executable or shared library needs to be specified by the user so that OO.o can locate it at run-time.


ADVANTAGES of R:

  • offers complex statistical functions
  • over 100 additional packages available
    • e.g. Fisher exact-test, bootstrap procedure and other non-parametric tests
  • linear regression models (lm), generalized linear models (glm), as well as non-linear models


DISADVANTAGE

  • R is a statistical environment (more like a programming language) and for beginners it is sometimes difficult to use

Calc should therefore make R easier to use by:

  • providing easy access, from within Calc, to basic R functions


Example of Fisher-test:

  • syntax in R: fisher.test(matrix(c(nr1, nr2, nr3, nr4), 2)), where nr1 to nr4 are the four cell counts to be tested (matrix() fills the 2x2 table column by column)
  • the user should be able to select the 4 cells in Calc and choose a menu item 'Fisher Test'; Calc should then automatically pipe the data (with the correct syntax) into R
  • a special note on this test: for contingency tables with small counts, the Fisher exact test should be used in preference to the Chi-Square test, whose approximation becomes inaccurate there (I actually always use the Fisher test)
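
The Fisher test call described above might look like this once the four cell values have been handed to R. The counts are hypothetical, and the variable names nr1 to nr4 follow the example above.

```r
# Hypothetical counts taken from four selected Calc cells
nr1 <- 3
nr2 <- 1
nr3 <- 1
nr4 <- 3

# matrix() fills column by column: nr1 and nr2 form the first column
tab <- matrix(c(nr1, nr2, nr3, nr4), nrow = 2)

# Two-sided Fisher exact test on the 2x2 table
result <- fisher.test(tab)

result$p.value   # the value Calc would most likely report back
result$estimate  # conditional maximum likelihood estimate of the odds ratio
```

One possible design, stated here only as an assumption, is for Calc to generate such a call as text and pass it to the user's R installation (for example through `Rscript -e`), then parse the printed p-value.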


Example of Regression:

  • linear regression:
    • regression_object <- lm(y ~ x1 + x2 + ...), where y, x1, x2, ... are vectors containing our data
    • regression_object is the object that stores the results
    • details can be obtained with the command summary(regression_object)
    • lm() of course has numerous options not covered here
  • generalized linear models:
    • object <- glm(formula, other options...)
  • non-linear models: e.g. nls(y ~ 1/(1 + exp(a + k*x)), start = list(a = ..., k = ...)), a sigmoid curve; nls() expects starting values for the parameters a and k
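
The three model types above can be tried out with the following sketch. All data are simulated, and the coefficients, the binomial family and the starting values are choices made for this illustration only. The sigmoid is written in the logistic form 1/(1 + exp(a + k*x)).

```r
set.seed(1)  # make the simulated data reproducible

# Simulated predictors and response (illustrative only)
n  <- 50
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n, sd = 0.5)

# Linear regression; summary() prints coefficients, standard errors and R squared
fit_lm <- lm(y ~ x1 + x2)
summary(fit_lm)

# Generalized linear model: logistic regression on a simulated binary outcome
success <- rbinom(n, size = 1, prob = plogis(-2 + 0.5 * x1))
fit_glm <- glm(success ~ x1, family = binomial)
coef(fit_glm)

# Non-linear least squares: sigmoid curve fitted from explicit starting values
ys <- 1 / (1 + exp(3 - 0.8 * x1)) + rnorm(n, sd = 0.02)
fit_nls <- nls(ys ~ 1 / (1 + exp(a + k * x1)), start = list(a = 2, k = -0.5))
coef(fit_nls)
```

For a Calc tool, the formula would be assembled from the selected columns, while choices such as the family or the starting values would have to be requested from the user in a dialog.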