inst/doc/filehash.Rnw

   1 \documentclass{article}
   2
   3 %%\VignetteIndexEntry{The filehash Package}
   4 %%\VignetteDepends{filehash}
   5
   6 \usepackage{charter}
   7 \usepackage{courier}
   8 \usepackage[noae]{Sweave}
   9 \usepackage[margin=1in]{geometry}
  10 \usepackage{natbib}
  11
  12 \title{Interacting with Data using the \textbf{filehash} Package for
  13 R}
  14
  15 \author{Roger D. Peng $<$rpeng@jhsph.edu$>$\\\textit{Department of
  16 Biostatistics}\\\textit{Johns Hopkins Bloomberg School of Public Health}}
  17
  18 \date{}
  19
  20 \newcommand{\pkg}{\textbf}
  21 \newcommand{\code}{\texttt}
  22
  23 \begin{document}
  24
  25 \maketitle
  26
  27 \begin{abstract}
  28 The \pkg{filehash} package for R implements a simple key-value style
  29 database where character string keys are associated with data values
  30 that are stored on the disk.  A simple interface is provided for
  31 inserting, retrieving, and deleting data from the database.  Utilities
are provided that allow \pkg{filehash} databases to be treated much
like the environments and lists already used in R.  These utilities
  34 are provided to encourage interactive and exploratory analysis on
  35 large datasets.  Three different file formats for representing the
  36 database are currently available and new formats can easily be
  37 incorporated by third parties for use in the \pkg{filehash} framework.
  38 \end{abstract}
  39
  40 <<options,results=hide,echo=false>>=
  41 options(width=60)
  42 @
  43
  44 \section{Overview and Motivation}
  45
  46 Working with large datasets in R can be cumbersome because of the need
  47 to keep objects in physical memory.  While many might generally see
  48 that as a feature of the system, the need to keep whole objects in
  49 memory creates challenges to those who might want to work
  50 interactively with large datasets.  Here we take a simple definition
  51 of ``large dataset'' to be any dataset that cannot be loaded into R as
  52 a single R object because of memory limitations.  For example, a very
  53 large data frame might be too large for all of the columns and rows to
  54 be loaded at once.  In such a situation, one might load only a subset
  55 of the rows or columns, if that is possible.
  56
  57 In a key-value database, an arbitrary data object (a ``value'') has a
  58 ``key'' associated with it, usually a character string.  When one
  59 requests the value associated with a particular key, it is the
  60 database's job to match up the key with the correct value and return
  61 the value to the requester.
  62
  63 The most straightforward example of a key-value database in R is the
  64 global environment.  Every object in R has a name and a value
  65 associated with it.  When you execute at the R prompt
  66 <<exampleGlobalEnv,results=hide>>=
  67 x <- 1
  68 print(x)
  69 @
  70 the first line assigns the value 1 to the name/key ``x''.  The second
  71 line requests the value of ``x'' and prints out 1 to the console.  R
  72 handles the task of finding the appropriate value for ``x'' by
  73 searching through a series of environments, including the namespaces
  74 of the packages on the search list.
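
The same insert, fetch, and delete operations can be carried out on an
explicitly created environment, which makes the key-value analogy
concrete.  The following is only an illustrative sketch in base R; the
name \code{kv} is made up for this example and is not part of
\pkg{filehash}.
<<envAsKeyValue>>=
kv <- new.env()
assign("x", 1, envir = kv)                  # insert a value under the key "x"
get("x", envir = kv)                        # fetch the value for the key "x"
exists("x", envir = kv, inherits = FALSE)   # is the key present in 'kv' itself?
rm(list = "x", envir = kv)                  # delete the key-value pair
ls(kv)                                      # list the remaining keys
@
Setting \code{inherits = FALSE} restricts the lookup to \code{kv}
rather than letting R continue through the enclosing environments.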
  75
  76 In most cases, R stores the values associated with keys in memory, so
  77 that the value of \code{x} in the example above was stored in and
  78 retrieved from physical memory.  However, the idea of a key-value
  79 database can be generalized beyond this particular configuration.  For
  80 example, as of R 2.0.0, much of the R code for R packages is stored in
  81 a lazy-loaded database, where the values are initially stored on disk
  82 and loaded into memory on first access~\citep{Rnews:Ripley:2004}.
  83 Hence, when R starts up, it uses relatively little memory, while the
  84 memory usage increases as more objects are requested.  Data could also
  85 be stored on other computers (e.g. websites) and retrieved over the
  86 network.
  87
  88 The general S language concept of a database is described in Chapter 5
  89 of the Green Book~\citep{cham:1998} and earlier in~\cite{cham:1991}.
  90 Although the S and R languages have different semantics with respect
  91 to how variable names are looked up and bound to values, the general
  92 concept of using a key-value database applies to both languages.
  93 Duncan Temple Lang has implemented this general database framework for
  94 R in the \pkg{RObjectTables} package of
  95 Omegahat~\citep{TempleLang:2002}. The \pkg{RObjectTables} package
  96 provides an interface for connecting R with arbitrary backend systems,
  97 allowing data values to be stored in potentially any format or
  98 location.  While the package itself does not include a specific
  99 implementation, some examples are provided on the package's website.
 100
 101 The \pkg{filehash} package provides a full read-write implementation
 102 of a key-value database for R.  The package does not depend on any
 103 external packages (beyond those provided in a standard R installation)
 104 or software systems and is written entirely in R, making it readily
 105 usable on most platforms.  The \pkg{filehash} package can be thought
 106 of as a specific implementation of the database concept described
 107 in~\cite{cham:1991}, taking a slightly different approach to the
 108 problem.  Both~\cite{TempleLang:2002} and~\cite{cham:1991} focus on
 109 generalizing the notion of ``attach()-ing'' a database in an R/S
 110 session so that variable names can be looked up automatically via the
 111 search list.  The \pkg{filehash} package represents a database as an
 112 instance of an S4 class and operates directly on the S4 object via
 113 various methods.
 114
 115 Key-value databases are sometimes called hash tables and indeed, the
 116 name of the package comes from the idea of having a ``file-based hash
 117 table''.  With \pkg{filehash} the values are stored in a file on the
 118 disk rather than in memory.  When a user requests the values
 119 associated with a key, \pkg{filehash} finds the object on the disk,
 120 loads the value into R and returns it to the user.  The package offers
 121 two formats for storing data on the disk: The values can be stored (1)
 122 concatenated together in a single file or (2) separately as a
 123 directory of files.
 124
 125
 126
 127
 128 \section{Related R packages}
 129
 130 There are other packages on CRAN designed specifically to help users
 131 work with large datasets.  Two packages that come immediately to mind
 132 are the \pkg{g.data} package by David Brahm~\citep{brahm:2002} and the
 133 \pkg{biglm} package by Thomas Lumley.  The \pkg{g.data} package takes
 134 advantage of the lazy evaluation mechanism in R via the
 135 \code{delayedAssign} function.  Briefly, objects are loaded into R as
 136 promises to load the actual data associated with an object name.  The
 137 first time an object is requested, the promise is evaluated and the
 138 data are loaded.  From then on, the data reside in memory.  The
 139 mechanism used in \pkg{g.data} is similar to the one used by the
 140 lazy-loaded databases described in~\cite{Rnews:Ripley:2004}.  The
 141 \pkg{biglm} package allows users to fit linear models on datasets that
 142 are too large to fit in memory.  However, the \pkg{biglm} package does
 143 not provide methods for dealing with large datasets in general.  The
 144 \pkg{filehash} package also draws inspiration from Luke Tierney's
 145 experimental \pkg{gdbm} package which implements a key-value database
 146 via the GNU dbm (GDBM) library.  The use of GDBM creates an external
 147 dependence since the GDBM C library has to be compiled on each system.
 148 In addition, I encountered a problem where databases created on 32-bit
 149 machines could not be transferred to and read on 64-bit machines (and
 150 vice versa).  However, with the increasing use of 64-bit machines in
 151 the future, it seems this problem will eventually go away.
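
The promise mechanism behind \code{delayedAssign} can be seen in a
small base R sketch.  The names \code{n.loads} and \code{lazy.data}
are invented for this illustration, and the counter simply stands in
for the cost of reading data from disk; this is not how \pkg{g.data}
itself is written.
<<delayedAssignSketch>>=
n.loads <- 0
delayedAssign("lazy.data", {
    n.loads <- n.loads + 1    # runs only when the promise is forced
    seq_len(1000)
})
n.loads                       # nothing has been loaded yet
total1 <- sum(lazy.data)      # first access forces the promise
total2 <- sum(lazy.data)      # later accesses reuse the value in memory
n.loads
@
The expression is evaluated a single time, on first access, after
which the value stays in memory for the rest of the session.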
 152
 153 The R Special Interest Group on Databases has developed a number of
 154 packages that provide an R interface to commonly used relational
 155 database management systems (RDBMS) such as MySQL (\pkg{RMySQL}),
 156 PostgreSQL (\pkg{RPgSQL}), and Oracle (\pkg{ROracle}).  These packages
 157 use the S4 classes and generics defined in the \pkg{DBI} package and
 158 have the advantage that they offer much better database functionality,
 159 inherited via the use of a true database management system.  However,
 160 this benefit comes with the cost of having to install and use
 161 third-party software.  While installing an RDBMS may not be an
 162 issue---many systems have them pre-installed and the \pkg{RSQLite}
 163 package comes bundled with the source for the RDBMS---the need for the
 164 RDBMS and knowledge of structured query language (SQL) nevertheless
 165 adds some overhead.  This overhead may serve as an impediment for
 166 users in need of a database for simpler applications.
 167
 168
 169
 170 \section{Creating a filehash database}
 171
 172 Databases can be created with \pkg{filehash} using the \code{dbCreate}
 173 function.  The one required argument is the name of the database,
 174 which we call here ``mydb''.
 175 <<create>>=
 176 library(filehash)
 177 dbCreate("mydb")
 178 db <- dbInit("mydb")
 179 @
 180 You can also specify the \code{type} argument which controls how the
 181 database is represented on the backend.  We will discuss the different
 182 backends in further detail later.  For now, we use the default backend
 183 which is called ``DB1''.
 184
 185 Once the database is created, it must be initialized in order to be
 186 accessed.  The \code{dbInit} function returns an S4 object inheriting
 187 from class ``filehash''.  Since this is a newly created database,
 188 there are no objects in it.
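
As a brief sketch of the \code{type} argument, a database can be
created with the ``RDS'' backend instead of the default.  The database
name ``mydbRDS'' and the object name \code{rdsdb} are arbitrary
choices for this example, and the database is removed again at the
end.
<<createRDS>>=
library(filehash)
dbCreate("mydbRDS", type = "RDS")
rdsdb <- dbInit("mydbRDS", type = "RDS")
is(rdsdb, "filehash")
n.keys <- length(dbList(rdsdb))   # a newly created database has no keys
n.keys
dbUnlink(rdsdb)                   # remove the example database from disk
@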
 189
 190 \section{Accessing a filehash database}
 191
 192 <<setseed1,results=hide,echo=false>>=
 193 set.seed(100)
 194 @
 195
 196 The primary interface to filehash databases consists of the functions
 197 \code{dbFetch}, \code{dbInsert}, \code{dbExists}, \code{dbList}, and
\code{dbDelete}.  These functions are all generic; specific methods
exist for each type of database backend.  They all take as their
first argument an object of class ``filehash''.  To insert some data
into the database we can simply call \code{dbInsert}:
 202 <<insert>>=
 203 dbInsert(db, "a", rnorm(100))
 204 @
 205 Here we have associated with the key ``a'' 100 standard normal random
 206 variates.  We can retrieve those values with \code{dbFetch}.
 207 <<fetch>>=
 208 value <- dbFetch(db, "a")
 209 mean(value)
 210 @
 211
 212 The function \code{dbList} lists all of the keys that are available in
 213 the database, \code{dbExists} tests to see if a given key is in the
database, and \code{dbDelete} deletes a key-value pair from the
database.
 216 <<delete>>=
 217 dbInsert(db, "b", 123)
 218 dbDelete(db, "a")
 219 dbList(db)
 220 dbExists(db, "a")
 221 @
 222
 223 While using functions like \code{dbInsert} and \code{dbFetch} is
straightforward, it can often be easier on the fingers to use standard
 225 R subset and accessor functions like \code{\$}, \code{[[}, and
 226 \code{[}. Filehash databases have methods for these functions so that
 227 objects can be accessed in a more compact manner. Similarly,
 228 replacement methods for these functions are also available. The
 229 \verb+[+ function can be used to access multiple objects from the
 230 database, in which case a list is returned.
 231
 232 <<accessors>>=
 233 db$a <- rnorm(100, 1)
 234 mean(db$a)
 235 mean(db[["a"]])
 236 db$b <- rnorm(100, 2)
 237 dbList(db)
 238 @
 239 For all of the accessor functions, only character indices are allowed.
 240 Numeric indices are caught and an error is given.
 241 <<characteronly>>=
 242 e <- local({
 243     err <- function(e) e
 244     tryCatch(db[[1]], error = err)
 245 })
 246 conditionMessage(e)
 247 @
Finally, there is a method for the \code{with} generic function, which
operates much like using \code{with} on lists or environments.
 250
 251 The following three statements all return the same value.
 252 <<with>>=
 253 with(db, c(a = mean(a), b = mean(b)))
 254 @
 255 When using \code{with}, the values of ``a'' and ``b'' are looked up in
 256 the database.
 257 <<sapply>>=
 258 sapply(db[c("a", "b")], mean)
 259 @
 260 Here, using \code{[} on \code{db} returns a list with the values
 261 associated with ``a'' and ``b''.  Then \code{sapply} is applied in the
 262 usual way on the returned list.
 263 <<lapply>>=
 264 unlist(lapply(db, mean))
 265 @
 266 In the last statement we call \code{lapply} directly on the
 267 ``filehash'' object.  The \pkg{filehash} package defines a method for
 268 \code{lapply} that allows the user to apply a function on all the
 269 elements of a database directly.  The method essentially loops through
 270 all the keys in the database, loads each object separately and applies
 271 the supplied function to each object.  \code{lapply} returns a named
 272 list with each element being the result of applying the supplied
 273 function to an object in the database.  There is an argument
 274 \code{keep.names} to the \code{lapply} method which, if set to
 275 \code{FALSE}, will drop all the names from the list.
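
The effect of \code{keep.names} can be sketched with a small
throwaway database.  The database name ``lapplyDB'' and the object
names below are chosen only for this illustration.
<<lapplyKeepNames>>=
library(filehash)
dbCreate("lapplyDB")
ldb <- dbInit("lapplyDB")
ldb$u <- c(1, 2, 3)
ldb$v <- c(10, 20, 30)
named <- lapply(ldb, mean)                         # names are the keys
unnamed <- lapply(ldb, mean, keep.names = FALSE)   # names are dropped
names(named)
names(unnamed)
dbUnlink(ldb)                                      # remove the example database
@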
 276
 277 <<cleanupMyDB,results=hide,echo=false>>=
 278 dbUnlink(db)
 279 rm(list = ls(all = TRUE))
 280 @
 281
 282 \section{Loading filehash databases}
 283
 284 <<setseed2,results=hide,echo=false>>=
 285 set.seed(200)
 286 @
 287
 288 An alternative way of working with a filehash database is to load it
 289 into an environment and access the element names directly, without
 290 having to use any of the accessor functions.  The \pkg{filehash}
 291 function \code{dbLoad} works much like the standard R \code{load}
 292 function except that \code{dbLoad} loads active bindings into a given
 293 environment rather than the actual data.  The active bindings are
 294 created via the \code{makeActiveBinding} function in the \pkg{base}
 295 package.  \code{dbLoad} takes a filehash database and creates symbols
 296 in an environment corresponding to the keys in the database.  It then
 297 calls \code{makeActiveBinding} to associate with each key a function
 298 which loads the data associated with a given key.  Conceptually,
 299 active bindings are like pointers to the database.  After calling
 300 \code{dbLoad}, anytime an object with an active binding is accessed
 301 the associated function (installed by \code{makeActiveBinding}) loads
 302 the data from the database.
 303
 304 We can create a simple database to demonstrate the active binding
 305 mechanism.
 306 <<testDB>>=
 307 dbCreate("testDB")
 308 db <- dbInit("testDB")
 309 db$x <- rnorm(100)
 310 db$y <- runif(100)
 311 db$a <- letters
 312 dbLoad(db)
 313 ls()
 314 @
 315 Notice that we appear to have some additional objects in our
 316 workspace.  However, the values of these objects are not stored in
 317 memory---they are stored in the database.  When one of the objects is
 318 accessed, the value is automatically loaded from the database.
 319 <<accessbinding>>=
 320 mean(y)
 321 sort(a)
 322 @
 323 If I assign a different value to one of these objects, its
 324 associated value is updated in the database via the active binding
 325 mechanism.
 326 <<assignvalue>>=
 327 y <- rnorm(100, 2)
 328 mean(y)
 329 @
If I subsequently clear the workspace and reload the database later,
the updated value for ``y'' persists.
 332 <<removeandload>>=
 333 rm(list = ls())
 334 db <- dbInit("testDB")
 335 dbLoad(db)
 336 ls()
 337 mean(y)
 338 @
 339
 340 Perhaps one disadvantage of the active binding approach taken here is
 341 that whenever an object is accessed, the data must be reloaded into R.
This behavior is distinctly different from the delayed assignment
 343 approach taken in \pkg{g.data} where an object must only be loaded
 344 once and then is subsequently in memory.  However, when using delayed
 345 assignments, if one cycles through all of the objects in the database,
 346 one could eventually exhaust the available memory.
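
The reload-on-every-access behavior can be illustrated with a plain
active binding in base R.  In this sketch the variable \code{backing}
merely stands in for a value stored on disk, and the names
\code{n.reads}, \code{backing}, and \code{ab.env} are invented for the
example; it is not the code that \code{dbLoad} installs.
<<activeBindingSketch>>=
n.reads <- 0
backing <- 1:5                      # stands in for a value stored on disk
ab.env <- new.env()
makeActiveBinding("v", function(value) {
    if (missing(value)) {
        n.reads <<- n.reads + 1     # every read goes back to the store
        backing
    } else {
        backing <<- value           # every assignment is written through
        invisible(value)
    }
}, ab.env)
sum(ab.env$v)
sum(ab.env$v)
n.reads                             # one read per access
ab.env$v <- 10:12
backing                             # the store now holds the new value
@
Unlike a delayed assignment, the binding function runs on each access,
so no copy of the value is retained between accesses.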
 347
 348 <<cleanupTestDB,results=hide,echo=false>>=
 349 dbUnlink(db)
 350 rm(list = ls(all = TRUE))
 351 @
 352
 353 \section{Other filehash utilities}
 354
 355 There are a few other utilities included with the \pkg{filehash}
 356 package.  Two of the utilities, \code{dumpObjects} and
 357 \code{dumpImage}, are analogues of \code{save} and \code{save.image}.
 358 Rather than save objects to an R workspace, \code{dumpObjects} saves
 359 the given objects to a ``filehash'' database so that in the future,
 360 individual objects can be reloaded if desired.  Similarly,
 361 \code{dumpImage} saves the entire workspace to a ``filehash''
 362 database.
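
A short sketch of \code{dumpObjects} follows.  The object names and
the database name ``objectsDB'' are arbitrary, and the database is
deleted at the end of the example.
<<dumpObjectsSketch>>=
library(filehash)
height <- c(150, 160, 170)
weight <- c(55, 62, 70)
odb <- dumpObjects(height, weight, dbName = "objectsDB")
odb <- dbInit("objectsDB")          # reconnect to the database by name
keys <- sort(dbList(odb))
keys
max.height <- max(odb$height)       # reload a single object on demand
max.height
dbUnlink(odb)
@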
 363
 364 The function \code{dumpList} takes a list and creates a ``filehash''
 365 database with values from the list.  The list must have a non-empty
 366 name for every element in order for \code{dumpList} to succeed.
 367 \code{dumpDF} creates a ``filehash'' database from a data frame where
 368 each column of the data frame is an element in the database.
 369 Essentially, \code{dumpDF} converts the data frame to a list and calls
 370 \code{dumpList}.
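
The following sketch shows \code{dumpDF} and \code{dumpList} on small
made-up inputs.  The data, the database names ``scoresDB'' and
``listDB'', and the object names are assumptions for this example
only.
<<dumpDFSketch>>=
library(filehash)
scores <- data.frame(id = 1:5, score = c(2.5, 3.1, 4.8, 1.2, 3.9))
sdb <- dumpDF(scores, dbName = "scoresDB")
sdb <- dbInit("scoresDB")           # reconnect to the database by name
df.keys <- sort(dbList(sdb))        # one key per column of the data frame
df.keys
max.score <- max(sdb$score)         # only the 'score' column is read
max.score

ldb <- dumpList(list(alpha = 1:3, beta = letters), dbName = "listDB")
ldb <- dbInit("listDB")
list.keys <- sort(dbList(ldb))      # one key per named list element
list.keys

dbUnlink(sdb)
dbUnlink(ldb)
@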
 371
 372
 373 \section{Filehash database backends}
 374
Currently, the \pkg{filehash} package can represent databases in two
principal formats, along with an older format retained for existing
databases.  The default format is called ``DB1'' and it stores
 377 the keys and values in a single file.  From experience, this format
 378 works well overall but can be a little slow to initialize when there
 379 are many thousands of keys.  Briefly, the ``filehash'' object in R
 380 stores a map which associates keys with a byte location in the
 381 database file where the corresponding value is stored.  Given the byte
 382 location, we can \code{seek} to that location in the file and read the
 383 data directly.  Before reading in the data, a check is made to make
 384 sure that the map is up to date.  This format depends critically on
 385 having a working \code{ftell} at the system level and a crude check is
 386 made when trying to initialize a database of this format.
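
The idea of a map from keys to byte locations can be sketched in base
R as follows.  This is a simplified illustration of the technique and
not the actual ``DB1'' file layout; the names \code{index} and
\code{fetch} are hypothetical, and the sketch records each value's
length alongside its offset for convenience.
<<byteOffsetSketch>>=
values <- list(a = 1:10, b = letters, c = c(pi, exp(1)))
path <- tempfile()
con <- file(path, "wb")
index <- list()                     # key -> byte offset and length
pos <- 0
for (key in names(values)) {
    bytes <- serialize(values[[key]], NULL)
    writeBin(bytes, con)            # append the serialized value
    index[[key]] <- c(offset = pos, length = length(bytes))
    pos <- pos + length(bytes)
}
close(con)

fetch <- function(key) {
    con <- file(path, "rb")
    on.exit(close(con))
    seek(con, where = index[[key]][["offset"]])   # jump to the value
    unserialize(readBin(con, "raw", n = index[[key]][["length"]]))
}
got.b <- fetch("b")
identical(got.b, letters)
unlink(path)
@
Only the requested value is read from the file; the other values are
never loaded into memory.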
 387
 388 The second format is called ``RDS'' and it stores objects as separate
 389 files on the disk in a directory with the same name as the database.
 390 This format is the most straightforward and simple of the available
 391 formats.  When a request is made for a specific key, \pkg{filehash}
 392 finds the appropriate file in the directory and reads the file into R.
 393 The only catch is that on operating systems that use case-insensitive
 394 file names, objects whose names differ only in case will collide on
the filesystem.  To work around this, object names with capital letters
 396 are stored with mangled names on the disk.  An advantage of this
 397 format is that most of the organizational work is delegated to the
 398 filesystem.
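
A directory-of-files store of this kind can be sketched in a few lines
of base R.  The mangling rule used here (prefixing each capital letter
with ``@'') is a hypothetical scheme chosen for illustration, and the
function names are invented; neither is presented as what
\pkg{filehash} does internally.
<<rdsDirSketch>>=
dbdir <- tempfile("rdsdemo")
dir.create(dbdir)
## Hypothetical mangling: mark capitals so "a" and "A" map to
## file names that differ by more than case
mangle <- function(key) gsub("([A-Z])", "@\\1", key)
put.rds <- function(key, value) saveRDS(value, file.path(dbdir, mangle(key)))
fetch.rds <- function(key) readRDS(file.path(dbdir, mangle(key)))

put.rds("a", 1:3)
put.rds("A", c("x", "y", "z"))
n.files <- length(list.files(dbdir))   # one file per key
n.files
got.lower <- fetch.rds("a")
got.upper <- fetch.rds("A")
got.upper
unlink(dbdir, recursive = TRUE)
@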
 399
 400 There is a third format called ``DB'' and it is a predecessor of the
 401 ``DB1'' format.  This format is like the ``DB1'' format except the map
 402 which associates keys to byte locations is stored in a separate file.
 403 Therefore, each database is represented by two separate files---an
index file and a data file.  This format is retained for backward
compatibility, but users should generally try to use the ``DB1'' format
 406 instead.
 407
 408
 409 \section{Extending filehash}
 410
 411 The \pkg{filehash} package has a mechanism for developing new backend
 412 formats, should the need arise.  The function \code{registerFormatDB}
 413 can be used to make \pkg{filehash} aware of a new database format that
 414 may be implemented in a separate R package or a file.
 415 \code{registerFormatDB} takes two arguments: a \code{name} for the new
 416 format (like ``DB1'' or ``RDS'') and a list of functions.  The list
 417 should contain two functions: one function named ``create'' for
 418 creating a database, given the database name, and another function
 419 named ``initialize'' for initializing the database.  In addition, one
 420 needs to define methods for \code{dbInsert}, \code{dbFetch}, etc.
 421
 422 A list of available backend formats can be obtained via the
 423 \code{filehashFormats} function.  Upon registering a new backend
 424 format, the new format will be listed when \code{filehashFormats} is
 425 called.
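
The registration step might look like the following sketch.  The
format name ``demoFormat'' and both placeholder functions are
hypothetical and do nothing useful; the sketch also assumes that
\code{filehashFormats} returns a list named by format, which should be
checked against the installed version of the package.
<<registerSketch>>=
library(filehash)
placeholder <- function(dbName, ...) {
    stop("'demoFormat' is only a placeholder")
}
registerFormatDB("demoFormat",
                 list(create = placeholder, initialize = placeholder))
## Assumption: the registered formats are returned as a named list
registered <- "demoFormat" %in% names(filehashFormats())
registered
@
A usable format would additionally need working \code{create} and
\code{initialize} functions and methods for the generics listed
above.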
 426
 427 The interface for registering new backend formats is still
 428 experimental and could change in the future.
 429
 430
 431 \section{Discussion}
 432
The \pkg{filehash} package has been designed to be useful in both a
 434 programming setting and an interactive setting.  Its main purpose is
 435 to allow for simpler interaction with large datasets where
 436 simultaneous access to the full dataset is not needed.  While the
 437 package may not be optimal for all settings, one goal was to write a
simple package in pure R that users could install with minimal
 439 overhead.  In the future I hope to add functionality for interacting
 440 with databases stored on remote computers and perhaps incorporate a
 441 ``real'' database backend.  Some work has already begun on developing
 442 a backend based on the \pkg{RSQLite} package.
 443
 444
 445
 446 \bibliographystyle{asa}
 447 \bibliography{combined}
 448
 449
 450 \end{document}
 451