Talk:Calc/To-Dos/Statistical Data Analysis Tool
some discussions about the content:
How to structure the content of the page
- first, it should be discussed how it is structured:
- many of these tools will have more than a number as output
- they will require some input, based on dialogs. ... and much more
Should it consist of multiple separate dialogs, or one dialog with tabbed pages? I personally prefer the multiple separate dialogs for different types of analyses. --Kohei 02:44, 24 June 2006 (CEST)
Which Methods?
- Tests
- One/Two Groups of Data: t-test is already implemented (TTEST());
- ANOVA: see http://courses.ncssm.edu/math/Stat_Inst/PDFS/NEWANOVA.pdf for a good tutorial on one-way ANOVA;
- Non-Parametric: see Main Statistical Page
- Certain Distributions?
- Gaussian distribution; Poisson distribution
- already available (TDIST(), NORMDIST(), POISSON(), and GAUSS()), however
- a function to generate random numbers from a Gaussian distribution is not available
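As a rough illustration of the missing Gaussian random number function, the sketch below uses the Box-Muller transform to turn two uniform random numbers into one normally distributed value. It is written in Python only to show the idea; the function name `gaussian_random` and its parameters are assumptions made for this example and are not part of Calc.

```python
import math
import random

def gaussian_random(mean=0.0, sd=1.0, rng=random):
    """Draw one normally distributed value using the Box-Muller transform.

    The name and parameters are hypothetical, chosen for this sketch only.
    """
    # 1 - random() lies in (0, 1], which keeps log() away from zero.
    u1 = 1.0 - rng.random()
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mean + sd * z

# Usage: draw a few values with mean 100 and SD 15.
random.seed(1)
print([round(gaussian_random(100.0, 15.0), 1) for _ in range(5)])
```

Any uniform generator could be plugged in through `rng`, as long as it offers a `random()` method returning values in [0, 1).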
Models
- linear multivariate
- non-linear multivariate
- Cox-Proportional Hazards
Comment from a User
Hi. I'm a biologist and a teacher. I am somewhat concerned by the plan shown in the Statistical Data Analysis Tool article. It looks like heavy emphasis is to be put on "modern methods." I have two concerns with this approach:
1. While it would be nice to have these implemented, I would hate to see them hold up production. The fact is that most users who turn to a spreadsheet for stats don't want to use "the latest."
Currently, the spreadsheet of choice for Introductory Statistics students is (naturally) MS Excel. Why? Because it handles all the basic tests, measures and procedures. Currently, OOO does not. As a result, Professors are making students go out and buy MS Office, when they should be showing them how cool OOO is!
In my humble opinion, the priority should be catching up with the functionality of Excel. We can work on advanced Bayesian Stats later. We also should focus equally on parametric and non-parametric methods, because they are both important. Each has its own utility.
- I think so, too. There is really a lot missing (histograms, ...). Can you or somebody else build a list of must-have methods/functions/procedures to catch up with Excel? I think there should first be a consensus on what has to be done first. --phatsphere 13:01, 2 August 2006 (CEST)
2. Similarly, there is a suggestion that the help function should steer people away from the mean in favor of the median. I'm not sure that this is appropriate. Each has its use. We should try to indicate when each is useful, and when each is prone to error.
I must admit that I am not a programmer, but I do support OOO by using and recommending it. I cannot do this if the app does not perform the functions that are needed. I hope that this edit will be taken as constructive criticism, and as a plea for more good work. Thank you! --ericpaulkatz
Reply
- referring to point 2
- The mean is a bad representation for the group of data in many instances (even in most instances where it is used). The median is a better estimate. Many prominent statisticians discussed this issue, including the late Prof. Feinstein (Feinstein AR. Principles of Medical Statistics. Chapman & Hall 2002).
- When you supply the mean, you MUST also tell the standard error (SE) OR the standard deviation (SD), and optimally also the sample size. Without these variables, the mean is meaningless. (There is also an ongoing discussion on variance vs mean deviation, see Mean Deviation on main statistical wiki page.)
- Consider the following data:
- I: 1, 10, 100
- II: 37, 37, 37
- III: -17.5, 37, 91.5
- IV: 10, 10, 10, 10, 145
- ALL have the same mean, but very different SD and SE. (With the exception of group 2, the mean is senseless for all the other groups, too.) Also, groups 1 and 3 have the same mean and almost the same sample SD and SE (SD of about 54.7 vs. 54.5), but the 2 data sets are quite different. You see, the mean is quite often insufficient and inaccurate.
- The median would be adequate for groups 2 and 4, and the median (with the range) would differentiate between groups 1 and 3. [Although the group sizes are small and all indexes are very unstable, the median (range) combination is far superior to the mean (SD).]
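The comparison of groups I and III can be checked with a few lines of code. The sketch below is only an illustration in Python (the helper name `describe` is an assumption made for this example); it computes the mean, sample SD, SE, median and range, so that the similarity in mean (SD) and the difference in median (range) become visible.

```python
import math
import statistics

def describe(values):
    """Return mean, sample SD, SE, median and range for a list of numbers."""
    n = len(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return {
        "n": n,
        "mean": statistics.mean(values),
        "sd": sd,
        "se": sd / math.sqrt(n),
        "median": statistics.median(values),
        "range": (min(values), max(values)),
    }

group_1 = [1, 10, 100]
group_3 = [-17.5, 37, 91.5]

for name, data in (("I", group_1), ("III", group_3)):
    print(name, describe(data))
```

Both groups report a mean of 37 and an SD close to 54.5, while the medians (10 and 37) and the ranges clearly separate them.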
- I believe that any good software that implements statistical techniques should provide some basic statistical explanations and try to enhance the knowledge of its users, pointing out the potential problems. Therefore, the help file should state:
- what the user should use for a descriptive statistic (see Descriptive Statistics): IF user decides for mean, he has to provide at least SD or SE (optimally also the size of the group), NOT just the mean
- median is often superior to mean and is therefore preferred most of the time
- When should the mean NOT be used at all:
- when the coefficient of variation (CV) = SD / mean > ~0.5 (depending on authorities): I frequently encounter papers stating something like: 1 ± 2 (where mean < SD!), this is completely senseless; when SD > mean, you can forget the mean;
- mean + 2*SD (or mean - 2*SD) creates an absurd value: this interval covers roughly 95% of our group (assuming a normal distribution), and IF it includes impossible values (e.g. negative ages), the mean won't accurately describe our group; forget it again for descriptive statistics;
- Descriptive Statistics should offer in a simple fashion the best description for our group of data; that is its sole purpose!
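The two rules of thumb above (CV above roughly 0.5, and mean - 2*SD reaching impossible values) could be turned into an automatic warning. The following is a minimal sketch in Python, not a proposal for the actual Calc implementation; the function name, the `cv_limit` default and the `lower_bound` parameter are assumptions made for illustration.

```python
import statistics

def mean_is_questionable(values, cv_limit=0.5, lower_bound=None):
    """Return the reasons why the mean may be a poor descriptor.

    cv_limit: hypothetical cut-off, set to the approximate 0.5 quoted above.
    lower_bound: smallest value that can occur at all (e.g. 0 for ages),
    or None if there is no such limit.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD
    reasons = []
    # Rule 1: coefficient of variation (SD / mean) above the cut-off.
    if mean != 0 and sd / abs(mean) > cv_limit:
        reasons.append("CV above limit")
    # Rule 2: mean - 2*SD falls below a value that cannot occur.
    if lower_bound is not None and mean - 2 * sd < lower_bound:
        reasons.append("mean - 2*SD below the possible minimum")
    return reasons

# Usage with made-up ages (illustrative data, not from the discussion).
ages = [1, 2, 2, 3, 40]
print(mean_is_questionable(ages, lower_bound=0))
```

An empty list means neither rule fired; it does not by itself prove that the mean is a good summary.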
- referring to point 1
- I consider Bayesian statistics low priority right now, but it definitely belongs to the future, and missing out on it would be a terrible mistake
- the classical non-parametric tests and the newer techniques like permutation tests (also non-parametric tests) are indeed high priority
- for any serious statistics, you need these latter methods (including bootstrapping and permutation tests); their implementation is however difficult
- it is always easier to use a dedicated program (that will always be superior to any local attempt of implementation) than to devise and implement a new algorithm;
- a mechanism to communicate with R, once in place, will practically open all doors for any statistical test; it would be easy to extend the functionality;
- Therefore, this last issue should have the highest priority. For someone who performs a statistical analysis once in a lifetime, using the t-test (although it would most likely be inadequate) would be just right. The t-test is already implemented. However, users who do real statistics rarely perform a t-test nowadays. For users in between, this would be the right moment to learn about newer methods.
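To give an idea of what a permutation test involves, here is a minimal Monte Carlo sketch in Python for the difference between two group means. It is one simple variant among several, shown under the assumption that random reshuffling (rather than full enumeration) is acceptable; the function name and parameters are made up for this example.

```python
import random

def permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)
    n_a, n_b = len(a), len(b)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Reassign the pooled values to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_resamples + 1)

# Usage with illustrative data.
print(permutation_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))
```

The same loop works for any other statistic (for example a difference in medians) by replacing the two lines that compute the means, which is the main attraction of the approach.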
Second Comment from same user
Concerning the Use of R.
R is a very powerful tool. It would be delightful to create a OOO front-end for R. However, as noted in the article, we cannot include R with OOO.
I'm not sure that I understand the intentions of the section on R, but it seems to me that it implies making the statistical data analysis tool dependent on R. I would like to suggest that this might be a flawed approach.
If R cannot be supplied by OOO, then it will have to be installed separately, and OOO will have to know where it is in order to use it. To accomplish this, OOO will have to be redesigned to find R on any of three or four operating systems, based on at least 5 different package managers. But, since the power of R is best tapped by using R itself, I'm not sure what is accomplished by using it as the backend here. Yes, it would be easier to design the tool, but either there would be huge overhead in the installation, or low-power users would not be able to access the functions. --ericpaulkatz
- I'm not a developer, but R is in my view a good reference implementation of those methods, that is, useful for testing, as a source of ideas and so on. I don't think that a heavy dependency on R is of any use, either. --phatsphere 12:59, 2 August 2006 (CEST)
Reply
The question is: Do low-power users need anything beyond the t-test and Fisher test?
I believe NOT. (Fisher-test instead of the chi-square test; it could be implemented in Calc).
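For reference, the Fisher exact test for a 2x2 table needs nothing more than binomial coefficients, which is why it looks feasible as a built-in function. The sketch below is a Python illustration of one common two-sided definition (summing all tables that are no more likely than the observed one); the function name is an assumption made for this example.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is at most as likely as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )

# Usage with an illustrative table.
print(fisher_exact_2x2(3, 1, 1, 3))
```

Other two-sided conventions exist (for example doubling the one-sided p-value), so results may differ slightly between programs.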
Someone who really needs an analysis of variance (ANOVA), would probably look for the newer methods as well. And I wouldn't call him a low-power user.
I also hope that serious journals will embrace permutation tests in the coming years (they offer great intellectual advantages over the t-test). IF OOo Calc is pioneering here, it is only to its advantage, and, hopefully, some other folks will begin to learn and understand these new tests. (They are not really new, but until recently it was not quite possible to calculate them.)
Regarding R
>> OOO will have to know where it [R] is
High-power users already use R, and providing the path to R wouldn't be a problem. As I said, low-power users already have (almost) all that they need. If you're in between, I suggest learning these more accurate statistical methods in good time. I really hope that older methods are abandoned in the near future, especially by high-ranked journals.
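How the path to R might be resolved can be sketched in a few lines: use a path supplied by the user if there is one, otherwise search the system PATH. The Python snippet below is only an illustration of that idea and says nothing about how OOo itself would do it; `find_rscript` and `user_path` are hypothetical names, while `Rscript` is the command-line executable that ships with R.

```python
import os
import shutil

def find_rscript(user_path=None):
    """Return a usable path to the Rscript executable, or None if not found.

    user_path is a hypothetical setting that a user could fill in by hand.
    """
    # An explicitly configured path wins if it points at an executable file.
    if user_path and os.path.isfile(user_path) and os.access(user_path, os.X_OK):
        return user_path
    # Otherwise search the directories listed in the PATH variable.
    return shutil.which("Rscript")

# Usage: prints the location of Rscript, or None if R is not installed.
print(find_rscript())
```

This covers only the simplest cases; installations outside PATH would still need the manual setting.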
Continuous Education
One last remark about the newer tests is pertinent:
- you can use the classical tests only for statistics where the distribution is known (this is mainly the mean or differences and proportions of 2 means); (this issue is NOT about the accuracy of newer methods, it is about the non-applicability of older ones in these circumstances!!!)
- whenever you wish to calculate a statistic on a different parameter, you CAN'T use these classical tests; you will need to turn to the newer methods
- in recent times, it has become increasingly often necessary to compute statistics for a diverse range of variables (either because the mean is such a poor candidate OR because you simply need a different parameter)
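As an example of a newer method that works for a parameter other than the mean, the sketch below computes a percentile bootstrap confidence interval for the median. It is a minimal Python illustration with assumed names and defaults (`bootstrap_ci`, 5000 resamples, 95% level), not a reference implementation.

```python
import random
import statistics

def bootstrap_ci(values, statistic=statistics.median, n_resamples=5000,
                 level=0.95, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        statistic(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = estimates[int(alpha * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper

# Usage with made-up, skewed data (illustrative only).
data = [2, 3, 3, 4, 5, 7, 9, 12, 20, 41]
print(statistics.median(data), bootstrap_ci(data))
```

Passing a different function as `statistic` (a trimmed mean, a quantile, a ratio) reuses the same procedure, which is the point made in the list above.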
The current teaching of statistics is another problem, especially in low-level introductory courses. Professors often put too much emphasis on classical tests. If they discussed just a few real-life problems, they would notice that these tests fail far too often and far too badly. With the current approach, Calc would sensitise some of these users to those problems and offer alternative solutions. I hope that people gain a better understanding of statistics this way and that their own work will be of higher quality.
Many thanks,
"Disadvantages of R" >>"R is a statistical environment (more like a programming language) and for beginners it is sometimes difficult to use."
Ooh wee is that an understatement! I would rephrase it "and for beginners it is extraordinarily, excruciatingly, horrifyingly difficult to use." I am well aware of R's many, many advantages. However, despite my PhD, my 12 years of teaching statistics, and my ability to do a (very) little bit of programming, I still can't reliably even OPEN my data in R. Ouch. Soooo, I would LOVE to see some sort of user-friendly (or friendlier) implementation of R in OOo. Wouldn't that be nice....
Remark
This is a VERY important functionality that should be added to OO as soon as possible, to mirror Excel's 'Analysis ToolPak'.--Piotrus 21:07, 24 February 2007 (CET)