1 #+OPTIONS: H:3 num:nil toc:t \n:nil @:t ::t |:t ^:t -:t f:t *:t TeX:t LaTeX:t skip:nil d:(HIDE) tags:not-in-toc
2 #+STARTUP: align fold nodlcheck hidestars oddeven lognotestate
3 #+SEQ_TODO: TODO(t) INPROGRESS(i) WAITING(w@) | DONE(d) CANCELED(c@)
4 #+TAGS: Write(w) Update(u) Fix(f) Check(c)
5 #+TITLE: org-R: Computing and data visualisation in Org-mode using R
7 #+EMAIL: davison@stats.ox.ac.uk
10 #+CATEGORY: worg-tutorial
12 # #+INFOJS_OPT: view:overview
17 Org-R has been replaced by [[file:../../org-contrib/babel/index.org][Org-babel]] which provides a much improved
18 environment for executing code in many languages (including R) in
19 Org documents. The off-the-shelf plotting functions of Org-R have
20 not yet been transferred to org-babel (i.e. you'd have to construct
21 the R code yourself). If you have requests for any plotting or other
22 analysis features that you would like to see added to Org-babel,
23 please send them to the Org mailing list. Fragments of code (in
24 whatever language) which allow specific tasks such as data plotting
to be accomplished in Org-mode will be maintained in the [[file:/usr/local/src/Worg/org-contrib/babel/library-of-babel.org][Library of Babel]].
Org-R has now been removed from the contrib directory of the Org
distribution; an unmaintained copy is still available [[http://www.stats.ox.ac.uk/~davison/software/org-R/org-R.el][here]].
32 org-R is an org-mode extension that performs numerical computations
33 and generates graphics. Numerical output may be stored in the org
34 buffer in org tables, and the input can also come from an org
35 table. Rather than starting off by documenting everything
36 systematically, I'll provide several commented examples. Towards the
37 end there are lists of [[*Table of available actions][available actions]] and [[*Table of available options][other options]].
39 Although, behind the scenes, it uses [[http://www.r-project.org][R]], you do not need to know
40 anything about R. Common operations are provided `off the shelf' by
41 specifying options on lines starting with #+R:. Having said that,
42 org-R also accepts raw R code (#+RR: lines). For those who don't
43 yet know R, but think they might be interested, try the showcode:t
44 option. It displays the R code corresponding to the action you
45 requested, and so provides a good starting point for fine-tuning
46 your analysis. But that's getting ahead of things.
48 My hope is, of course, that this will be of use to people. So at
49 this stage any comments, ideas, feedback, bug reports etc would be
50 very welcome. I'd be happy to help anyone that's interested in
51 using this, via the Org mailing list.
53 If you'd like to try out these commands yourself, the Org file that
54 created this web page is @<a href="org-R.org">here@</a>.
The code is currently [[http://www.stats.ox.ac.uk/~davison/software/org-R/org-R.el][here]]. Soon it will be in the contrib
directory. The other things you need are R (Windows / OS X binaries
available on the [[http://www.r-project.org][R website]]; widely available in Linux package
repositories) and the Emacs mode [[http://ess.r-project.org/][Emacs Speaks Statistics]] (ESS). ESS
installation instructions are [[http://ess.r-project.org/Manual/readme.html#Installation][here]]. Personally, under Linux, I have
something like this in my Emacs init file:
#+BEGIN_SRC emacs-lisp
(add-to-list 'load-path "/path/to/ess/lisp")
(require 'ess-site)
#+END_SRC
70 org-R uses two different option lines to specify an analysis/plot:
71 #+R: and #+RR:. #+RR: is the one that accepts R code, so we'll
72 ignore that for now. To make the action happen, issue C-c C-c with
73 point in the #+R: line (this calls org-R-apply). There are also
74 org-R-apply-in-subtree and org-R-apply-in-buffer, which visit each
75 org-R block that they find in the current subtree/buffer, calling
76 org-R-apply in that block (suggestion from Tom Short). So, first
79 * Computing on org tables: tabulating values
80 Here's a command to tabulate the values in the second column. Issue
81 M-x org-R-apply in the following #+R line.
90 #+R: action:tabulate columns:2
So the values in column 2 were tabulated as requested. However,
106 the original data got overwritten. That leads us to
We can specify input data for analysis/plotting in 3 different ways:
113 1. by providing a reference to an org table with the intable:
114 option. You can optionally specify the org file that the table
115 is in with infile:/path/to/file.org
117 2. by pointing it to a csv file, locally or via http:, using
118 infile:/path/to/file.csv or e.g.
119 infile:http://www.stats.ox.ac.uk/~davison/software/org-R/file.csv
121 3. by doing neither, in which case it looks for a table immediately
122 above the #+R(R) line(s).
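
In plain R terms, all three routes end the same way: the data arrive
as a data frame called x. The following is only a rough sketch of the
csv case, not the code that org-R itself runs; the temporary file and
its contents are made up for illustration.

#+BEGIN_SRC R
# Hypothetical csv file standing in for infile:/path/to/file.csv
csv.file <- tempfile(fileext = ".csv")
writeLines(c("col1,col2", "A,B", "A,A", "B,B"), csv.file)

# Read it into a data frame; the first row supplies the column names
x <- read.csv(csv.file, header = TRUE, stringsAsFactors = FALSE)
print(x)
#+END_SRC

read.csv also accepts a URL in place of a local path, which is the
plain-R counterpart of the infile:http://... form.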
124 Case (3) is what happened above -- the input data came from a table
125 immediately above the #+R line. The default behaviour is to replace
126 any such table with the output; this allows us to tweak the option
127 line and update the analysis. However, normally we'll want to separate
128 the data from the analysis output. So let's keep the data as a named
129 table in the org file, and refer to it by name:
140 [arbitrary other content of org buffer]
142 #+R: intable:data-set-1 action:tabulate
157 Note that this time we did a different analysis: I removed the
158 columns:2 option, so that tabulate was passed the whole table. As a
159 result the output contains counts of joint occurrences of values in
160 the two columns: out of the 4 possibilities, the only one we didn't
161 observe was "B in column 1 and A in column 2". We could have achieved
162 the same result with columns:(1 2). (But don't try to tabulate more
163 than 2 columns: org does not do multi-dimensional tables).
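
For the curious, action:tabulate corresponds to R's table function.
Here is a minimal sketch with made-up data in which, as in the example
above, the combination "B in column 1 and A in column 2" never occurs.
The column names col1 and col2 are invented.

#+BEGIN_SRC R
x <- data.frame(col1 = c("A", "A", "B", "A", "B"),
                col2 = c("A", "B", "B", "B", "B"))

# columns:2 -- counts of the values in the second column only
one.way <- table(x[, 2])

# no columns: option, or columns:(1 2) -- joint counts for both columns
two.way <- table(x[, 1], x[, 2])

print(one.way)
print(two.way)
#+END_SRC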
166 ** Available off-the-shelf plotting commands
167 At the risk of this starting to sound like a dodgy undergraduate
168 statistics textbook, the sort of plots that are appropriate depend
169 on the sort of data. Let's divide it up as
- discrete-valued data
  [e.g. data-set-1 above, or the list of org variables customised by users]
- continuous-valued data
  [e.g. the wing lengths of all Eagle Owls in Europe]
- indexed data
  [e.g. a data set in which each point is a time,
  together with the size of the org source code base at that time]
179 The available off-the-shelf actions are listed [[*Table of available actions][here]].
181 ** Continuous data example:
:PROPERTIES:
:ID: 2ce0fc04-b308-4b8d-8acc-805a9e5fed7d
:END:
185 We're going to need some data. So let's prove that org can also
186 speak statistics and use org-R to simulate the data. This
requires some raw R code, so skip this bit if you're not interested.
190 The following #+RR line simulates 10 values from a Normal
191 distribution with mean -3, and 10 values from a Normal
192 distribution with mean 3, and lumps them together. The point is that
193 the numbers we get should be concentrated around two different
values, and we should be able to see that in a histogram and/or density plot.
199 #+RR: x <- c(rnorm(10, mean=-3, sd=1), rnorm(10, mean=3, sd=1))
200 #+R: title:"continuous-data" output-to-buffer:t
204 Here's what I got. Note that the title: option set the name of the
205 table with "#+TBLNAME"; we'll use that to refer to these data.
#+TBLNAME:continuous-data
| values |
|-------------------|
212 | -2.48627002467785 |
214 | -3.43471960580471 |
215 | -5.21985294534255 |
216 | -3.84201126431028 |
217 | -1.72912705369668 |
218 | -2.86703950990613 |
219 | -2.82292622464752 |
220 | -4.43246430621368 |
221 | -1.03188727658288 |
222 | 0.882823532068805 |
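
The same simulation can be run at an R prompt. This sketch adds a call
to set.seed, which is not part of the org-R example, so that the
numbers are reproducible; they will not match the values in the table.

#+BEGIN_SRC R
set.seed(1)  # arbitrary seed, added here for reproducibility

# 10 draws centred on -3 and 10 draws centred on 3, lumped together
values <- c(rnorm(10, mean = -3, sd = 1), rnorm(10, mean = 3, sd = 1))

# One-column data frame, like the org table named continuous-data
x <- data.frame(values = values)
print(summary(x$values))
#+END_SRC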
234 Now to plot the data. Let's have some colour as well, and this time
235 the title: option will be used to put a title on the plot (and also to
236 name the file link to the graphical output).
241 [[file:tmp.png][histogram example]]
242 #+R: action:hist columns:1 colour:hotpink
243 #+R: intable:continuous-data outfile:"png" title:"histogram example"
246 [[file:../../images/org-R/histogram-example.png]]
248 [Note that you can use multiple #+R lines rather than cramming all
249 the options on to one line.]
An alternative would be to produce a density plot. We don't have
enough data points to justify that here, but we'll do it anyway just
to show the sort of plots that are produced. This time we'll specify
the output file for the png image using the outfile: option. (For the
histogram we used outfile:"png". That's a special case; it doesn't
create a file called "png" but instead uses org-attach to store the
output in the org-attach dir for this entry. The same goes for the other
available output image formats: "jpg", "jpeg", "pdf", "ps", "bmp",
"tiff".)
263 [[file:density.png][density plot example]]
264 #+R: action:density columns:"values" colour:chartreuse4 args:(:lwd 4)
265 #+R: intable:continuous-data outfile:"density.png" title:"density plot example"
268 [[file:../../images/org-R/density.png]]
270 There were a couple of new features there. Firstly, I referred to
271 column 1 using its column label, rather than with the
272 integer 1. Secondly, note the use of the args: option. It takes the
273 form of a lisp property list ("p-list"), specifying extra arguments to
274 pass to the R function (in this case density()). Here we used it to
275 set the line thickness (lwd=4).
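
In R itself the two plots amount to something like the following.
This is a sketch rather than the code org-R generates (use showcode:t
to see that); the simulated data and the temporary file names are
illustrative.

#+BEGIN_SRC R
set.seed(1)
x <- data.frame(values = c(rnorm(10, mean = -3), rnorm(10, mean = 3)))

# action:hist columns:1 colour:hotpink
hist.file <- tempfile(fileext = ".png")
png(hist.file, width = 480, height = 480)
hist(x[, 1], col = "hotpink", main = "histogram example")
dev.off()

# action:density columns:"values" colour:chartreuse4 args:(:lwd 4)
dens.file <- tempfile(fileext = ".png")
png(dens.file, width = 480, height = 480)
plot(density(x[, "values"]), col = "chartreuse4", lwd = 4,
     main = "density plot example")
dev.off()
#+END_SRC

Each key/value pair in the args: p-list becomes a named argument in
the R call, here lwd = 4.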
277 ** Discrete data example: the configuration variables survey
The raw data, as collected by Manish, are in a table called
org-variables-table, in a file called variable-popcon.org. We use the
infile: option to specify the org file containing the data, and the
intable: option to specify the name of the table within that file. [An
alternative would be to give the entry containing the table a unique id with
org-id-get-create, refer to it with intable:<uid>, and rely on the
org-id mechanism to find it.]
287 Now we tabulate the data. (We're not currently taking the sensible
288 step that Manish did of checking whether the variables were given
289 values different from their default).
291 Rather than cluttering up this org file with all the count data,
292 we'll store them in a separate org file:
296 [[file:org-variables-counts.org][org-variables-counts]]
297 #+R: action:tabulate columns:2 sort:t
298 #+R: infile:"variable-popcon.org" intable:"org-variables-table"
299 #+R: outfile:"org-variables-counts.org" title:"org-variables-counts"
302 [[file:org-variables-counts.org]]
304 We can see the top few rows of the table by using action:head
308 | rownames(x) | value | count |
309 |-------------+-----------------------------+-------|
310 | 1 | org-agenda-files | 22 |
311 | 2 | org-agenda-start-on-weekday | 22 |
312 | 3 | org-log-done | 22 |
313 | 4 | org-todo-keywords | 22 |
314 | 5 | org-agenda-include-diary | 19 |
315 | 6 | org-hide-leading-stars | 19 |
#+R: action:head
#+R: infile:"org-variables-counts.org" intable:"org-variables-counts" output-to-buffer:t
321 Here's a barplot of the counts. It makes it clear that over half the
322 org variables are customised by only one or two users.
326 [[file:org-variables-barplot.png][org-variables barplot]]
327 #+R: action:barplot rownames:t columns:1 width:800 col:darkblue
328 #+R: args:(:names.arg "NULL")
329 #+R: infile:"org-variables-counts.org" intable:"org-variables-counts"
330 #+R: outfile:"org-variables-barplot.png" title:"org-variables barplot"
333 [[file:../../images/org-R/org-variables-barplot.png]]
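
A rough R equivalent of the three steps above (tabulate with sort:t,
head, barplot), using a small made-up data set in place of the
survey. The user labels and the column name User are invented; only
the column name Variable is taken from the real table.

#+BEGIN_SRC R
x <- data.frame(
    User     = c("u1", "u1", "u2", "u2", "u3", "u3", "u3"),
    Variable = c("org-agenda-files", "org-log-done",
                 "org-agenda-files", "org-todo-keywords",
                 "org-agenda-files", "org-log-done",
                 "org-hide-leading-stars"),
    stringsAsFactors = FALSE)

# action:tabulate columns:2 sort:t -- counts in decreasing order
counts <- sort(table(x[, 2]), decreasing = TRUE)
counts.df <- data.frame(value = names(counts), count = as.vector(counts))

# action:head -- the first rows (at most 6) of the counts table
print(head(counts.df))

# action:barplot, with names.arg = NULL to suppress the bar labels
plot.file <- tempfile(fileext = ".png")
png(plot.file, width = 800)
barplot(counts.df$count, col = "darkblue", names.arg = NULL)
dev.off()
#+END_SRC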
335 *** Something more complicated: clustering org variables, and org users
337 OK, let's make a bit more use of R's capabilities. We can use the
338 org-variables data set to define distances between pairs of org
339 users (how similar their customisations are), and distances
340 between pairs of org variables (the extent to which people who
341 customise one of them customise the other). Then we can use those
342 distance matrices to cluster org users, and org variables.
344 First, let's create a table that's restricted to variables that
345 were customised by more than four users. This isn't necessary,
346 but there are a lot of org-variables! This is going to require a
bit of R code to count the variables and then subset the raw data accordingly.
352 [[file:variable-popcon-restricted.org][org-variables-table]]
353 #+R: infile:"variable-popcon.org" intable:"org-variables-table"
354 #+R: outfile:"variable-popcon-restricted.org" title:"org-variables-table"
355 #+RR: tab <- table(x[,2])
356 #+RR: x <- subset(x, Variable %in% names(tab[tab > 4]))
359 [[file:variable-popcon-restricted.org][org-variables-table]]
361 Now let's make a table with a row for each variable, and a column for
362 each org user, and fill it with 1s and 0s according to whether user j
363 customised variable i. We can do that without writing any R code:
367 [[file:org-variables-incidence.org][incidence-matrix]]
368 #+R: action:tabulate columns:(1 2) rownames:t
369 #+R: infile:"variable-popcon-restricted.org" intable:"org-variables-table"
370 #+R: outfile:"org-variables-incidence.org" title:"incidence-matrix"
373 [[file:org-variables-incidence.org][incidence-matrix]]
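
The two steps above, restricting to frequently customised variables
and then cross-tabulating, look roughly like this on a toy data set.
The data, the column name User and the threshold of 1 (instead of 4)
are made up so that the example stays small. In this sketch the rows
of the result are users and the columns are variables; org-R's own
output may be laid out differently.

#+BEGIN_SRC R
x <- data.frame(
    User     = c("u1", "u1", "u2", "u2", "u3", "u3", "u3"),
    Variable = c("org-agenda-files", "org-log-done",
                 "org-agenda-files", "org-todo-keywords",
                 "org-agenda-files", "org-log-done",
                 "org-hide-leading-stars"),
    stringsAsFactors = FALSE)

# Keep only the variables customised by more than one user
tab <- table(x[, 2])
x <- subset(x, Variable %in% names(tab[tab > 1]))

# Cross-tabulate both columns: 1 where a user customised a variable,
# 0 otherwise (each user/variable pair occurs at most once here)
incidence <- table(x[, 1], x[, 2])
print(incidence)
#+END_SRC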
375 First we'll cluster org users. We use the R function dist to compute a
376 distance matrix from the incidence matrix, then hclust to run a
hierarchical clustering algorithm, and then plot to plot the results as a dendrogram.
382 [[file:org-users-tree.png][org-users-tree.png]]
383 #+RR: par(bg="gray15", fg="turquoise2")
384 #+RR: plot(hclust(dist(x, method="binary")), ann=FALSE)
385 #+R: infile:"org-variables-incidence.org" intable:"incidence-matrix" rownames:t
386 #+R: outfile:"org-users-tree.png" title:"org-users-tree.png"
389 [[file:../../images/org-R/org-users-tree.png]]
391 And to cluster org variables, we use the transpose of that incidence matrix:
395 [[file:org-variables-tree.png][org-variables-tree.png]]
396 #+RR: par(bg="gray15", fg="turquoise2")
397 #+RR: plot(hclust(dist(t(x), method="binary")), ann=FALSE)
398 #+R: infile:"org-variables-incidence.org" intable:"incidence-matrix" rownames:t
399 #+R: outfile:"org-variables-tree.png" title:"org-variables-tree.png" width:1000
402 [[file:../../images/org-R/org-variables-tree.png]]
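
To make the clustering steps concrete, here is a sketch on a tiny
invented incidence matrix (rows are users, columns are variables).
With method="binary", the distance between two rows is the proportion
of positions in which only one of them has a 1, among the positions
in which at least one of them has a 1.

#+BEGIN_SRC R
# Toy incidence matrix; all values and names are invented
x <- matrix(c(1, 1, 0, 0,
              1, 1, 1, 0,
              0, 0, 1, 1,
              0, 1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("user", 1:4), paste0("var", 1:4)))

# Distances between users (rows), then hierarchical clustering
d.users <- dist(x, method = "binary")
tree.users <- hclust(d.users)

# Transposing clusters the variables instead of the users
tree.vars <- hclust(dist(t(x), method = "binary"))

plot.file <- tempfile(fileext = ".png")
png(plot.file)
plot(tree.users, ann = FALSE)
dev.off()
#+END_SRC

For example, user1 and user2 differ in one of the three variables
that either of them customised, so their distance is 1/3.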
405 Please note that my main aim here was to give some examples of using
406 org-R, rather than to show how the org variables data should be mined
407 for useful information! The org-variables dendrogram does seem to have
408 made some sensible clusterings (e.g. the clusters of agenda-related
409 commands), but I'm going to leave it to others to decide whether this
410 exercise really served to do more than illustrate org-R. Does anyone
411 recognise any usage affinities between the clustered org users?
413 ** Indexed data example
:PROPERTIES:
:ID: 45f39291-3abc-4d5b-96c9-3a32f77877a5
:END:
417 Let's plot the same data as Eric Schulte used in the [[../org-plot.org][org-plot tutorial]] on worg.
421 [[file:/usr/local/src/org-etc/Worg/org-tutorials/org-R/data/45/f39291-3abc-4d5b-96c9-3a32f77877a5/org-R-output-8119M2O.png][An example from the org-plot tutorial, plotted using org-R]]
422 #+R: action:lines columns:((1)(2 3))
423 #+R: infile:"../org-plot.org"
424 #+R: intable:"org-plot-example-1" outfile:"png"
425 #+R: title:"An example from the org-plot tutorial, plotted using org-R"
428 [[file:../../images/org-R/org-plot-example-1.png]]
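
columns:((1)(2 3)) asks for columns 2 and 3 on the y-axis against
column 1 on the x-axis. One way to draw such a plot directly in R is
matplot; whether org-R uses matplot internally is not something this
tutorial states, and the data below are invented.

#+BEGIN_SRC R
x <- data.frame(time    = 1:5,
                series1 = c(2, 4, 3, 6, 8),
                series2 = c(1, 2, 4, 4, 5))

plot.file <- tempfile(fileext = ".png")
png(plot.file)
# Column 1 on the x-axis; columns 2 and 3 drawn as lines on the y-axis
matplot(x[, 1], x[, c(2, 3)], type = "l", lty = 1, col = 1:2,
        xlab = names(x)[1], ylab = "")
legend("topleft", legend = names(x)[2:3], col = 1:2, lty = 1)
dev.off()
#+END_SRC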
430 * Table of available options
In addition to the action:<some-action> option (described [[*Table of available actions][here]]), the
432 following options are available:
|-------------------------------------------------------------------+------------------------------------------------------------|
| *Input options* | |
|-------------------------------------------------------------------+------------------------------------------------------------|
| infile:/path/to/file.csv | input data comes from file.csv |
| infile:http://www.stats.ox.ac.uk/~davison/software/org-R/file.csv | input data comes from file.csv somewhere on the web |
| infile:/path/to/file.org | input data comes from file.org; must also specify table with intable:<name-or-id> |
| intable:table-name | input data is in table named with #+TBLNAME:table-name (in same buffer unless infile:/path/to/file.org is specified) |
| intable:table-id | input data is first table under entry with table-id as unique ID. Doesn't make sense with infile:/path/to/file.org |
| rownames:t | does first column contain row names? (default: nil). If t, other column indices are as if the first column were not present (this may change) |
| colnames:nil | does first row contain column names? (default: t) |
| columns:2 columns:(2) | operate only on column 2 |
| columns:"wing length" columns:("wing length") | operate only on column named "wing length" |
| columns:((1)(2 3)) | (when plotting) plot columns 2 and 3 on y-axis against column 1 on x-axis |
| columns:(("age")("wing length" "fierceness")) | (when plotting) plot columns named "wing length" and "fierceness" on y-axis against "age" on x-axis |
|-------------------------------------------------------------------+------------------------------------------------------------|
| *Action options* | |
|-------------------------------------------------------------------+------------------------------------------------------------|
| action:some-action | off-the-shelf plotting action or computation (see [[*Table of available actions][separate list]]), or any R function that makes sense (e.g. head, summary) |
| lines:t | (when plotting) join points with lines (similar to action:lines) |
| args:(:xlab "\"the x axis title\"" :lwd 4) | provide extra arguments as a p-list (note the need to quote strings if they are to appear as strings in R) |
|-------------------------------------------------------------------+------------------------------------------------------------|
| *Output options* | |
|-------------------------------------------------------------------+------------------------------------------------------------|
| outfile:/path/to/image.png | save image to file and insert link into org buffer (also: .pdf, .ps, .jpg, .jpeg, .bmp, .tiff) |
| outfile:png | save image to file in org-attach directory and insert link |
| outfile:/path/to/file.csv | would make sense but not implemented yet |
| height:1000 | set height of graphical output (in pixels for png, jpeg, bmp, tiff, default 480; in inches for pdf, ps, default 7) |
| width:1000 | set width of graphical output in pixels (default 480 for png) |
| title:"title of table/plot" | title to be used in plot, and as #+TBLNAME of table output, and as name of link to output |
| colour:hotpink col:hotpink color:hotpink | main colour for plot (i.e. `col' argument in R; enter colors() at R prompt for list of available colours) |
| sort:t | with action:tabulate, sort in decreasing count order (default is alphabetical on names) |
| output-to-buffer:t | force numerical output to org buffer (shouldn't be necessary) |
| inline:t | don't name links to output (so that graphics are inline when exported to HTML) |
|-------------------------------------------------------------------+------------------------------------------------------------|
| showcode:t | Display a buffer containing the R code that was generated to do what was requested. |
|-------------------------------------------------------------------+------------------------------------------------------------|
471 * Table of available actions
To specify an action from the following list, use e.g. action:hist on
a #+R line.
|------------------------------------------+------------------------------------------------------------|
| *Actions that generate numerical output* | |
|------------------------------------------+------------------------------------------------------------|
| tabulate | count occurrences of distinct input values. Input data should be discrete. This is function table in R. |
| summary | summarise data in columns (minimum, 1st quartile, median, mean, 3rd quartile, max) |
| head | show first 6 rows of a larger table |
| transpose | transpose a table |
|------------------------------------------+------------------------------------------------------------|
| *Actions that generate graphical output* | |
|------------------------------------------+------------------------------------------------------------|
| *Discrete data* | |
| barplot | produces 'side-by-side' bar plots if multiple columns selected |
|------------------------------------------+------------------------------------------------------------|
| *Indexed data* | |
| plot | if only 1 column selected, index is automatic: 1,2,... |
| lines | same as plot |
| points | same as plot but don't join points with lines |
|------------------------------------------+------------------------------------------------------------|
| *Continuous data* | |
| hist | histogram |
| density | like a smoothed histogram (i.e. a curve) |
|------------------------------------------+------------------------------------------------------------|
| *Grid of values* | |
| image | a grid image, with cells coloured according to their numerical values |
|------------------------------------------+------------------------------------------------------------|
Apart from tabulate, the action: names are the same as the names of
the R functions which implement them. `tabulate' is really called
table in R.
Note that, in addition to the actions listed above, you can also use
action:R-function, where "R-function" is the name of any existing R
function. The function must be able to take a data frame as its first
argument, and must not *require* any further arguments (i.e. any
further arguments must have suitable default values). Any numerical
output will be sent to the org buffer (use output-to-buffer:t to force
this, although if that is necessary then that is a bug).
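
To make that requirement concrete, the sketch below calls two
standard R functions that accept a data frame on its own, and one
that does not. The data are invented, and the example only exercises
R; it does not show how org-R itself reports errors.

#+BEGIN_SRC R
x <- data.frame(age = c(1.5, 3.25, 2, 4),
                wing.length = c(30, 42, 35, 44))

# These take a data frame as first argument and need nothing else,
# which is what action:summary and action:head rely on
print(summary(x))
print(head(x))

# merge() also takes a data frame first, but it requires a second
# argument (y), so calling it with x alone fails
result <- tryCatch(merge(x), error = function(e) e)
print(inherits(result, "error"))
#+END_SRC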
514 * More detailed description of org-R
515 My aim with org-R is to provide a fairly general facility for using
516 R with Org. The #+R lines and #+RR lines together specify an R
517 function, which may take numerical input, and may generate
518 graphical output, or numerical output, or both.
520 If any input data have been specified, then the R function receives
521 those data as its first argument. The input data may come from an
522 Org table, or from a csv spreadsheet file. In either case they are
523 tabular (1- or 2-dimensional). The input data are passed to the
524 function as an R data frame (a table-like structure in which
525 different columns may contain different types of data -- numeric,
526 character, etc). Inside the R function, that data frame is called
'x'. 'x' is also the return value of the R function. Therefore the
numerical output of org-R is determined by the modifications to the
variable x that are made inside the function (any graphical output
is produced as a side effect).
532 It's worth noting that one mode of using org-R would be to write your
533 own code in a separate file, and use the source() function on a #+RR
534 line to evaluate the code in that file.
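
The function model described in this section can be pictured as
follows. This is a conceptual sketch only: the code that org-R
actually generates will differ in detail (use showcode:t to see it),
and the function name org.R.sketch is invented.

#+BEGIN_SRC R
# Hypothetical stand-in for the function built from #+R/#+RR lines
org.R.sketch <- function(x) {
    # Code from #+RR: lines would go here; as an example,
    # keep only the rows whose first column is positive
    x <- x[x[, 1] > 0, , drop = FALSE]
    # The (possibly modified) x is returned as the numerical output
    x
}

input <- data.frame(values = c(-2.5, 0.9, 3.1))
output <- org.R.sketch(input)
print(output)
#+END_SRC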
536 Numerical output of the function should also be tabular, and may be
537 received by the Org buffer as an Org table, or sent to file in Org
538 table or csv format. R deals transparently with multi-dimensional
539 arrays, but Org table and csv format do not.
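
To illustrate the last point: tabulating three columns in R gives a
three-dimensional array, which has no direct Org table or csv
representation. One possible workaround, offered here as a suggestion
rather than as an org-R feature, is to flatten the array into a
data frame with one row per cell. The data are invented.

#+BEGIN_SRC R
x <- data.frame(a = c("p", "q", "p"),
                b = c("u", "u", "v"),
                c = c("y", "z", "y"))

# A 2 x 2 x 2 array of counts
three.way <- table(x)
print(dim(three.way))

# One row per cell: columns a, b, c and the count in Freq
flat <- as.data.frame(three.way)
print(flat)
#+END_SRC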
541 Unless an output file has been specified, graphical output will be
544 * Getting help with R
545 - Bring up an R prompt with R at a shell prompt, or M-x R in emacs (if you have installed ESS)
546 - Enter ?function.name for help on function `function.name'
547 - Enter RSiteSearch("words") for online help matching "words"
548 - Enter ?par to see the full list of graphical parameters
549 - Follow the Documentation link on the left hand side of the R
550 website for "An Introduction to R", and other more technical manuals.
552 Seeing as this has made use of R, I'll briefly say my bit on it for
553 those who are unfamiliar.
1. It's good for simple numerical work, as well as having
   implementations of a very large range of more sophisticated
   mathematical and statistical procedures.
557 2. It's good for producing graphics quickly, and for fine tuning
558 every last detail of the graphics for publication.
3. It's a syntactically reasonable, user-friendly, interpreted
   programming language, that is often used interactively (it comes
   with its own shell/command-line environment, and runs within
   Emacs via ESS).
563 4. It's a good language for a functional style of programming (in
564 fact I'd say that's how it should be used), which might well
565 appeal to elisp programmers. For example, you want to construct
566 an arbitrarily nested data structure, then pass some function
567 over the tips, returning a data structure of the same shape as
568 the input? No problem ([[http://stat.ethz.ch/R-manual/R-patched/library/base/html/rapply.html][rapply]]).
5. There are a *lot* of add-on packages for it (CRAN link on left hand
   side of [[http://www.r-project.org/][website]]).
571 6. How many programming languages will get [[http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html][their own article]] in the
572 New York Times this year?
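
As a small illustration of point 4, here is rapply applied to an
invented nested list. With how="list" the result has the same shape
and names as the input.

#+BEGIN_SRC R
nested <- list(a = 1:3, b = list(c = 4, d = list(e = 5:6)))

# Double every leaf, keeping the nesting intact
doubled <- rapply(nested, function(v) v * 2, how = "list")
str(doubled)
#+END_SRC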