Talk:Calc/To-Dos/Statistical Data Analysis Tool

From Apache OpenOffice Wiki
Revision as of 01:14, 12 August 2006 by Discoleo (Talk | contribs)


some discussions about the content:

How to structure the content of the page

First, we should discuss how the tool is structured. Many of these tools will produce more than a single number as output, and they will require input collected through dialogs, among other things.

Should it consist of multiple separate dialogs, or one dialog with tabbed pages? I personally prefer the multiple separate dialogs for different types of analyses. --Kohei 02:44, 24 June 2006 (CEST)

Which Methods?

  • Tests
    • t-test
    • ANOVA
  • Certain Distributions?
    • Poisson distribution
    • Gaussian distribution

Models

  • linear

Comment from a User

Hi. I'm a biologist and a teacher. I am somewhat concerned by the plan shown in the Statistical Data Analysis Tool article. It looks like heavy emphasis is to be put on "modern methods." I have two concerns with this approach:

1. While it would be nice to have these implemented, I would hate to see them hold up production. The fact is that most users who turn to a spreadsheet for stats don't want to use "the latest."

Currently, the spreadsheet of choice for Introductory Statistics students is (naturally) MS Excel. Why? Because it handles all the basic tests, measures and procedures. Currently, OOO does not. As a result, Professors are making students go out and buy MS Office, when they should be showing them how cool OOO is!

In my humble opinion, the priority should be catching up with the functionality of Excel. We can work on advanced Bayesian Stats later. We also should focus equally on parametric and non-parametric methods, because they are both important. Each has its own utility.

I think so, too. There is really a lot missing (histograms, ...). Can you or somebody else build a list of must-have methods/functions/procedures to catch up with Excel? I think there should first be a consensus on what has to be done first. --phatsphere 13:01, 2 August 2006 (CEST)

2. Similarly, there is a suggestion that the help function should steer people away from the mean in favor of the median. I'm not sure that this is appropriate. Each has its use. We should try to indicate when each is useful, and when each is prone to error.

I must admit that I am not a programmer, but I do support OOO by using and recommending it. I cannot do this if the app does not perform the functions that are needed. I hope that this edit will be taken as constructive criticism, and as a plea for more good work. Thank you! --ericpaulkatz

Reply

referring to point 2

The mean is a bad representation of a group of data in many instances (and actually in most instances where it is used). In most of those instances, the median is a better estimate. Many prominent statisticians have discussed this issue, including the late Prof. Feinstein (Feinstein AR. Principles of Medical Statistics. Chapman & Hall 2002).

When you supply the mean, you MUST also tell the standard error (SE) OR the standard deviation (SD), and optimally also the sample size. Without these variables, the mean is meaningless.

Consider the following data:

1, 10, 100

37, 37, 37

-17.5, 37, 91.5

10, 10, 10, 10, 185


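The first three groups ALL have the same mean (37), but very different SD and SE; the fourth comes out at 45 as written, and would also be 37 if its last value were 145. (With the exception of group 2, the mean is a poor summary of every one of these groups.) Also, groups 1 and 3 have the same mean and nearly the same SD and SE (SD of about 54.7 versus 54.5), but the 2 data sets are quite different. You see, the mean is quite often insufficient and inaccurate.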
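The figures for these groups are easy to check. Here is a minimal Python sketch (not Calc code; the function name describe is made up for this illustration) that prints the mean, median, sample SD and SE for each of the four groups exactly as they are listed above. It assumes the sample SD (n - 1 in the denominator) and SE = SD / sqrt(n).

```python
import math
import statistics

# The four example groups exactly as listed above.
groups = [
    [1, 10, 100],
    [37, 37, 37],
    [-17.5, 37, 91.5],
    [10, 10, 10, 10, 185],
]

def describe(values):
    """Return n, mean, median, sample SD and standard error of the mean."""
    n = len(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    return {
        "n": n,
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": sd,
        "se": sd / math.sqrt(n),
    }

for number, values in enumerate(groups, start=1):
    d = describe(values)
    print(
        f"group {number}: n={d['n']} mean={d['mean']:.2f} "
        f"median={d['median']:.2f} sd={d['sd']:.2f} se={d['se']:.2f}"
    )
```

Printing the median next to the mean makes it obvious where the two disagree, which is the point of the discussion here.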

I believe that any good software that implements statistical techniques should convey some basic statistical knowledge and point to potential problems. Therefore, the help file should state:

  • what the user should use for a descriptive statistic (see Descriptive Statistics): IF user decides for mean, he has to provide at least SD or SE (optimally also the size of the group), NOT just the mean
  • median is often superior to mean and is therefore preferred most of the time

When should the mean NOT be used at all:

  • when SD / mean > 0.1 - 0.3 (depending on authorities): I frequently encounter papers stating something like: 1 (+-) 2, this is completely senseless; when SD > mean, you can forget the mean;
  • when mean + 2*SD (or mean - 2*SD) gives an absurd value: for roughly normal data this range covers about 95% of the group, and IF it includes impossible values (e.g. negative ages), the mean won't describe our group accurately; forget it again for descriptive statistics;
  • Descriptive Statistics should offer, in a simple fashion, the best description of our group of data; that is its sole purpose!
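The two rules of thumb above could be turned into an automatic warning. The following Python sketch is only an illustration: the function name mean_warnings and the parameters max_cv and lowest_possible are hypothetical, and it reads the first rule as SD divided by mean, which is the reading consistent with the "1 (+-) 2" example.

```python
import statistics

def mean_warnings(values, max_cv=0.3, lowest_possible=None):
    """Return a list of reasons why the mean may describe `values` poorly.

    max_cv and lowest_possible are hypothetical parameters for this sketch:
    max_cv is the largest tolerated SD / mean ratio, lowest_possible is the
    smallest value that can occur at all (e.g. 0 for ages).
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    warnings = []
    # Rule 1: the spread is large relative to the mean (SD / mean).
    if mean != 0 and sd / abs(mean) > max_cv:
        warnings.append("SD is large relative to the mean")
    # Rule 2: mean - 2*SD falls below a value that cannot occur.
    if lowest_possible is not None and mean - 2 * sd < lowest_possible:
        warnings.append("mean - 2*SD is an impossible value")
    return warnings

print(mean_warnings([1, 10, 100], lowest_possible=0))
print(mean_warnings([37, 37, 37], lowest_possible=0))
```

A help text or dialog could show such warnings next to the result instead of refusing to compute the mean.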


referring to point 1

  • I consider Bayesian statistics low priority right now, but it definitely belongs to the future, and falling behind would be a terrible mistake
  • the classical non-parametric tests and the newer techniques like permutation tests (also non-parametric tests) are indeed high priority
  • for any serious statistics, you need these last methods (including bootstrapping and permutation tests); their implementation is, however, difficult
  • it is always easier to use a dedicated program (that will always be superior to any local attempt at implementation) than to devise and implement a new algorithm;
  • a mechanism to communicate with R, once in place, will practically open all doors for any statistical test; it would be easy to extend the functionality;

Therefore, this last issue should have the highest priority. For someone who performs a statistical test once in a lifetime, the t-test (although it would most likely be inadequate) would be just right, and the t-test is already implemented. However, users who do real statistics rarely perform a t-test nowadays. For users in between, this would be the right moment to learn about newer methods.

Second Comment from same user

Concerning the Use of R.

R is a very powerful tool. It would be delightful to create an OOO front-end for R. However, as noted in the article, we cannot include R with OOO.

I'm not sure that I understand the intentions of the section on R, but it seems to me that it implies making the statistical data analysis tool dependent on R. I would like to suggest that this might be a flawed approach.

If R cannot be supplied by OOO, then it will have to be installed separately, and OOO will have to know where it is in order to use it. To accomplish this, OOO will have to be redesigned to find R on any of three or four operating systems, based on at least 5 different package managers. But, since the power of R is best tapped by using R itself, I'm not sure what is accomplished by using it as the backend here. Yes, it would be easier to design the tool, but either there would be huge over-head in the installation, or low-power users would not be able to access the functions. --ericpaulkatz

I'm not a developer, but R is in my view a good reference implementation of those methods, that is, for testing, as a source of ideas and so on. I don't think that a heavy dependency on R has any use, either. --phatsphere 12:59, 2 August 2006 (CEST)

Reply

The question is: Do low-power users need anything beyond the t-test and Fisher test?

I believe NOT. (Fisher-test instead of the chi-square test; it could be implemented in Calc).

Someone who really needs an analysis of variance (ANOVA), would probably look for the newer methods as well. And I wouldn't call him a low-power user.

I also hope that serious journals embrace permutation tests in the coming years (they offer great intellectual advantages over the t-test). IF OOo Calc is pioneering here, it is only to its advantage, and, hopefully, other people will begin to learn and understand these new tests. (They are not really new, but until recently it was not practical to compute them.)
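For readers who have not met a permutation test, the idea fits in a few lines. This is a minimal Python sketch under my own assumptions (the function name, the number of rounds and the fixed seed are all made up for illustration): it compares two groups on the difference of their means by repeatedly reshuffling the pooled values.

```python
import random

def permutation_test(sample_a, sample_b, rounds=10000, seed=0):
    """Approximate two-sided p-value for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_a = len(sample_a)
    pooled = list(sample_a) + list(sample_b)

    def mean_difference(values):
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = mean_difference(pooled)
    extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # random relabelling of the two groups
        if mean_difference(pooled) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (rounds + 1)

print(permutation_test([1, 2, 3, 4, 5], [101, 102, 103, 104, 105]))
```

Replacing mean_difference with a difference of medians, or any other statistic, needs no new theory, which is what makes the approach attractive.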

Regarding R

> OOO will have to know where it is

High-power users already use R, and providing the path to R wouldn't be a problem. As I said, low-power users already have (almost) all they need. If you're in between, I suggest learning the better statistical methods in good time. I really hope that older methods get abandoned in the near future, especially by high-ranked journals.
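As an illustration of "providing the path", the lookup could be as simple as the following Python sketch. It is not OOo code: the function name find_rscript and the user_path setting are hypothetical, and it assumes that R's command-line front end is installed under the name Rscript.

```python
import os
import shutil

def find_rscript(user_path=None):
    """Return a usable path to the Rscript executable, or None if not found.

    user_path stands for a hypothetical setting in which the user typed
    the location of R; when it is empty, the system PATH is searched.
    """
    if user_path:
        # Trust the user's setting only if it points at an executable file.
        if os.path.isfile(user_path) and os.access(user_path, os.X_OK):
            return user_path
        return None
    return shutil.which("Rscript")

print(find_rscript())
```

Returning None for a wrong user-supplied path, instead of silently falling back to the PATH search, keeps a mistyped setting visible to the user.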

One last remark about newer tests:

  • you can use the classical tests only for statistics where the distribution is known (this is mainly the mean or differences of mean); (this issue is NOT about the accuracy of newer methods!!!)
  • whenever you wish to calculate a statistic on a different parameter, you CAN'T use these classical tests; you will need to turn to the newer methods
  • in recent times, it has become increasingly necessary to compute statistics for a diverse range of variables (because the mean is such a poor candidate OR because you need a different parameter)
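As an example of a statistic on a different parameter, here is a rough Python sketch of a percentile bootstrap confidence interval for the median. The function name, the number of rounds, the seed and the simple percentile rule are my own assumptions for illustration; more refined bootstrap intervals exist.

```python
import random
import statistics

def bootstrap_ci(values, statistic=statistics.median, rounds=5000,
                 level=0.95, seed=0):
    """Percentile bootstrap confidence interval for any statistic."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        statistic(rng.choices(values, k=n)) for _ in range(rounds)
    )
    tail = (1 - level) / 2
    lower = estimates[int(tail * rounds)]
    upper = estimates[min(rounds - 1, int((1 - tail) * rounds))]
    return lower, upper

print(bootstrap_ci([10, 10, 10, 10, 185]))
```

Passing a different function as statistic gives an interval for that parameter instead, with no change to the rest of the code.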

Teaching is a problem, too. Professors often put too much emphasis on classical tests. If they discussed just a few real-life problems, you would notice that these tests fail much too often and much too badly. With the current approach, Calc would make some of these users aware of all those problems and solutions. I hope that people gain a better understanding of statistics this way and that their own work will be of higher quality.

Many thanks,

Leonard Mada

Remark

  • Tests
    • t-test: ttest() is a function that is available within Calc
    • ANOVA
  • Certain Distributions?
    • Poisson distribution: poisson() is a function that is available within Calc
    • Gaussian distribution: gauss() is a function that is available within Calc

Maybe someone wants to implement a function anova() as an add-in or as a macro?
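As a starting point, the core of a one-way ANOVA is short. The following is a minimal Python sketch, not an actual Calc add-in: the function name one_way_anova is made up, and it only returns the F statistic and the two degrees of freedom. Turning those into a p-value would need an F distribution function; I assume Calc's FDIST() could serve for that.

```python
def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    Assumes at least two groups and some variation within the groups;
    otherwise the division below fails.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Variation of the values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    )
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

print(one_way_anova([1, 2, 3], [2, 3, 4], [3, 4, 5]))
```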
