R and Calc Documentation

From Apache OpenOffice Wiki
Revision as of 13:47, 4 April 2008 by ErAck (talk | contribs) (Category Add-In singular.)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Please note that this documentation is for Version 0.1.1. See the main R and Calc page for information on installation and setup.

Cell Functions

Note that the function names may change in future versions... I'm not really concerned about properly naming everything for users yet, but making things simpler for development.

  • RNDBLMAT( <command> , <matrix> , <matrix>): passes a command and two matrices. For example, RNDBLMAT("cov";A1:A100;B1:B100).
  • RNMAT( <command> , <matrix> ): passes a command and matrix as a parameter. For example, RNNUM("var";A1:A100).
  • RNNUM( <command> , <number> ): passes a command and a numeric parameter. For example, RNNUM("rnorm"; 1).
  • RNUMPROP( <command> , <property> ): passes a command to R and prints a specific property from the output. This is useful if the values returned from R aren't numeric but contain a variety of information. For example, RNUMPROP("cor.test(c(1,2,3,4), c(1,2,2,3))"; "p.value").
  • RSIMPLENUM( <command> ): passes a command to R, prints a number if one is returned.
  • RTESTCONNECTION(): tests R connection (for RServe).

R Add-On Menu

  • Correlation: an advanced correlation tool that allows you to see p values and 95% confidence intervals. Also allows you to change the type of correlation.
  • Linear Regression: a simple tool that prints the slope and intercept of a line of best fit. It's a good example to start from if you plan on making your own scripts (the code is in <base dir>/RAddon/rcall_linreg.txt).
  • General Linear Model: fits a general linear model (GLM) to your data and allows you to see the p value for the model.

ANOVA Submenu

  • One-Way ANOVA
  • Two-Way ANOVA

Advanced Submenu

  • Load External Script: allows you to load an external GUI script for custom R commands.
  • Dump R Output: creates a new spreadsheet with the object structure returned by a specific R script.

Advanced

Coding Window

Intro

The coding window allows performing more complex operations in R. The user is able to:

  • read data from the spreadsheet (cells and cell-ranges) and import the data into R
  • issue R-commands
  • export the results back into Calc

Command Description

There are generally 4 types of commands:

  • data input commands
  • data output commands
  • specific R-commands
  • and specific escape commands

Data Input To read a data-range in R from the spreadsheet, one uses:

new_variable_name <- {#range_start_cell:range_end_cell}
where the data is read from the cell range range_start_cell:range_end_cell
and stored in the variable new_variable_name

Implementation Detail: the resulting R-variable is of type matrix

my.variable.class <- class(new_variable_name)
{$OUT#C1} <- {$BASE}

OUTPUT: matrix

To store the variable as a list or as a data.frame, use:

my.list <- list( {#cell_begin:cell_end} )
my.data.frame <- data.frame( {#cell_begin:cell_end} )

Data Output Similarly, to write the output back to a Calc cell, one uses:

{$OUT#results_cell}<-name_of_variable

Alternatively, when the returned object has just one value:

{$OUT#results_cell}<-{$BASE}
where name_of_variable is one of the variable names from the returned object (aka one of the returned strings from names(variable), respectively the name_of_variable in the R object: variable$name_of_variable)
{$BASE} is the returned value, when there is just one value

'NOTE for v0.1.6: Any space before the _space_{$OUT:...}' will prevent the code from returning any output. Hope this gets fixed.

Examples

Some examples may explain better the concept. Lets consider that the user data is in the cells A1:A20 and B1:B20. The scripts look then as follows:

Wilcoxon test:
wilcox.test({#A1:A20},{#B1:B20})
{$OUT#A22} <- p.value
Correlation
cor.test({#A1:A20},{#B1:B20})
{$OUT#B22} <- p.value

Lets say cells C1, C2, D1 and D2 containing count data forming a contingency table. The Fisher-exact test for count data can be computed as follows: (ATTENTION: computing the Fisher-exact test with large data values may take some time!)

Fisher-exact test:
fisher.test({#C1:D2})
{$OUT#C22} <- p.value
Chi-Square Test
chisq.test({#C1:D2})
{$OUT#D22} <- p.value

Complex Examples

The following script will perform a Multivariate Linear Regression (least-squares-fitting). Suppose the user data is in cells A1:C100, while D1:D100 contains the dependent variable. We will perform a multivariate regression of the form D = I + x1* A + x2 * B + x3 * C, where:

I is the Intercept
x1-x3 are the corresponding regression coefficients


Multivariate Linear Regression
considering the variable names are stored in A1:D1 (in the column headers)

my.data <- data.frame({#A2:D100})
l<-length(my.data[1,])
// will be used in a future version instead of hardcoding INDEXes

linear.model<-lm(my.data[,4]~my.data[,1]+my.data[,2]+my.data[,3])
sx<-summary(linear.model)[[4]]

// ANNOTATION
my.labels<-c("pm"="Parameter","Intercept"="Intercept",c({#A1:C1}))
// STRING IMPORT IS BROKEN
//{$OUT#F1:5}<-{$BASE} // DOES NOT FUNCTION
{$OUT#F1}<-pm
{$OUT#F2}<-Intercept

s<-list("b"="beta-coeff", "bIntercept"=sx[[1]], "b1"=sx[[2]], "b2"=sx[[3]], "b3"=sx[[4]] )
 s<-c(s,"p"="p-value", "pIntercept"=sx[[13]], "p1"=sx[[14]], "p2"=sx[[15]], "p3"=sx[[16]])
// beta-COEFFICIENTS
{$OUT#G1}<-b
{$OUT#G2}<-bIntercept
{$OUT#G3}<-b1
{$OUT#G4}<-b2
{$OUT#G5}<-b3

// p-VALUES
{$OUT#H1}<-p
{$OUT#H2}<-pIntercept
{$OUT#H3}<-p1
{$OUT#H4}<-p2
{$OUT#H5}<-p3

Finally, to test the validity of the regression model, one has to run an ANOVA-test (Analysis of Variance) on the model. We will do this using the R aov-function:

Testing the model

my.aov<-summary(aov(linear.model))[[1]]
mx<-my.aov[[5]]
s<-list("m"="ANOVA","m1"=mx[[1]],"m2"=mx[[2]],"m3"=mx[[3]])
{$OUT#I1}<-m
{$OUT#I3}<-m1
{$OUT#I4}<-m2
{$OUT#I5}<-m3

That's it.

Tips & Tricks

Running External Libraries

At this time, it's not possible to load libraries ahead of time using Rserve. However, you can load a library through a function cell. The example below loads a library called sna and calls a function, passing a matrix as a parameter. Note that the library needs to be installed prior to using it.

RNMAT("library(sna);gden"; A1:D4)

summary() Hack for Complex R Objects

There are times when the R object structure you are dealing with is very complex, to the point that it's difficult to find specific values. One such example is the nonlinear least squares function (used for curve fitting, for example). You use this function by calling nls() and one of the most useful things it gives is values for coefficients in an equation. The only problem is the object returned by nls() is organized in a way that makes it difficult to find values for these coefficients.

One workaround is to call nls() with summary()... This will allow you to see some of the variables (such as coefficient values) and makes it much easier to extract information using the R/Calc interface.

See the example implementation with one variable: http://www.11-55.org/rcalc/2007-06-01/rcall_nls.txt

Random Numbers

If you try to generate random numbers with a cell function like rsimplenum("rnorm(1)") and try to expand it to multiple cells by dragging, you'll notice the numbers are all the same (at least in Linux!). This occurs because R is using the same seed number for every cell, causing all the numbers to be the same. There are a number of ways to fix this.

If you just need a list of random numbers, the easiest way to get one is to go to R Add-On > Advanced > Dump R Output and type your function here. For example, if you want 100 normally distributed numbers, type rnorm(100) and click Run. This will create a new spreadsheet with your numbers.

A second, harder way is setting the R seed with each cell call. Pick a cell and write rsimplenum("set.seed(543);rnorm(1)") you will first set the random number generator's seed to 543, and then get a random number. In this way, however, the random number will not change if you refresh the function because the seed does not change. Unfortunately, the only way to change the seed in this method is manually for every cell.

Creating Your Own Dialogs and Scripts

Using a basic text file, it is possible to create your own scripts and functions that will then call R and provide output. A basic correlation example is provided, and an outline of the tile structure is given below.

First and foremost, there are four sections to a file:

  • __ETC: ignored -- a set of comments.
  • __GUI: for user interface objects.
  • __CALL: allows you to define the R function call based on GUI inputs.
  • __OUTPUT: allows you to set where the output will go.

Using __GUI:

Currently, the following GUI objects can be defined in the file:

  • TextField: for text-based arguments.
  • NumberField: for numeric arguments -- either actual numbers or a reference to a sheet cell.
  • ArrayField: refers to data in the spreadsheet (current active sheet).
  • Label: a basic label.
  • RunButton: runs the code and prints output.
  • CancelButton: closes the dialog.
  • Dialog: allows you to define the size and location of the dialog.
  • OutputField: allows for the user to input information for outputs.
  • ComboBox: creates a combo box. Rather than passing label_text, pass a list of options with ";" as the delimiter. For example, "a;b;c;d" will have the combo box show four options: a, b, c, and d.

The format for the above is as follows, with italic fields representing variables and parameters: Object_Name<-Label( x position , y position , width , height, label_text ). If you want to use special symbols in labels or other parts of the GUI, the following characters need to be replaced with the corresponding variables:

  • ( {$BL}
  • ) {$BR}
  • = {$EQ}

Using __CALL:

In this case, simply write your call text, with variables being represented as {$variable-name} where the name itself corresponds to the names you gave the GUI objects above. This only works with TextField, NumberField, and ArrayField. You can have multiple lines of R scripts here, but can only use the output of the final line of code in this section.

Using __OUTPUT:

In this case, you write every line like so: Cell-Address<-Value where Value is either the name of a specific property in the R object structure (e.g. p.value or estimate) or a general piece of text. The software will always test to see if your text is a property, and if it is not, the text itself will be printed.

Personal tools