======================================
Working with Databases and Annotations
======================================

Purpose
^^^^^^^

This tutorial will teach you how to use Pygr to work with databases
stored in memory, on disk, and in SQL databases.  The examples will
focus on setting up simple annotation databases, both by plugging in
external data sources and by creating new databases.  No previous
knowledge of Pygr is required (although you may want to read
the annotation tutorial to learn what you can do with annotations).

Using dict as a Database
^^^^^^^^^^^^^^^^^^^^^^^^

You may have noticed an analogy between traditional databases,
which associate a unique identifier (the "primary key") with each row,
and Python dictionaries, which map a unique key to an associated
value.  Pygr builds on this analogy by adopting the Python dictionary
interface (the "Mapping Protocol") as its standard database interface.
That means you can use a Python ``dict`` anywhere that Pygr
expects a "database" object.

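To make the idea concrete, here is a minimal sketch in plain Python 3 (standard library only, no Pygr required).  The names ``describe_db`` and ``ComputedDB`` are hypothetical and exist only for this illustration.  The point is that code written against the dictionary interface cannot tell a real ``dict`` from any other object that implements the same methods.

```python
from collections.abc import Mapping


def describe_db(db):
    """Summarize any "database" using only the mapping protocol."""
    return sorted((key, db[key]) for key in db)


class ComputedDB(Mapping):
    """A read-only "database" whose values are computed on demand."""

    def __init__(self, ids):
        self._ids = list(ids)

    def __getitem__(self, key):
        if key not in self._ids:
            raise KeyError(key)
        return len(key)  # stand-in for an expensive lookup

    def __iter__(self):
        return iter(self._ids)

    def __len__(self):
        return len(self._ids)


in_memory = {'A': 1, 'BB': 2}        # an ordinary dict
computed = ComputedDB(['A', 'BB'])   # a custom mapping

print(describe_db(in_memory))
print(describe_db(computed))
```

Both calls print the same list, because ``describe_db`` relies only on iteration and ``db[key]``.
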
Example: An Annotation Database based on dict
---------------------------------------------

For example, Pygr annotation databases are themselves built on top of two
"databases": a *slice information* database that gives the coordinates
of an annotation interval for each key; and a *sequence* database
on which to apply those coordinates.  So we can build an annotation
database by supplying a dictionary containing some slice information.
We just need to create a class that stores the slice coordinate attributes
expected by the annotation database.  Here is the content of our
simple module ``slice_pickle_obj.py``::

  class MySliceInfo(object):
      def __init__(self, seq_id, start, stop, orientation):
          (self.id, self.start, self.stop, self.orientation) = \
              (seq_id, start, stop, orientation)

Let's use this to create a dict "database"::

  >>> from slice_pickle_obj import MySliceInfo
  >>> seq_id = 'gi|171854975|dbj|AB364477.1|'
  >>> slice1 = MySliceInfo(seq_id, 0, 50, +1)
  >>> slice2 = MySliceInfo(seq_id, 300, 400, -1)
  >>> slice_db = dict(A=slice1, B=slice2)

Now all we have to do is open the sequence database and
create the annotation database object::

  >>> from pygr import seqdb, annotation
  >>> dna_db = seqdb.SequenceFileDB('../tests/data/hbb1_mouse.fa')
  >>> annodb = annotation.AnnotationDB(slice_db, dna_db)

We can get our annotations and their associated
sequence intervals::

  >>> annodb.keys()
  ['A', 'B']
  >>> a = annodb['A']
  >>> len(a)
  50
  >>> s = a.sequence
  >>> print repr(s), str(s)
  gi|171854975|dbj|AB364477.1|[0:50] ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTGGCCTGTGGGG

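If you are curious how two plain "databases" can combine into annotations, the following Python 3 sketch mimics the idea with ordinary dictionaries.  It is not Pygr's implementation: ``SliceInfo``, ``annotation_sequence`` and the toy sequence ``chrToy`` are invented for illustration, and the sketch assumes that a negative orientation means "take the interval on the forward strand, then reverse-complement it".

```python
from collections import namedtuple

# Same four attributes as MySliceInfo, as a lightweight record.
SliceInfo = namedtuple('SliceInfo', 'id start stop orientation')

COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def annotation_sequence(slice_db, seq_db, key):
    """Return the sequence text that annotation `key` refers to."""
    info = slice_db[key]                # 1. look up the coordinates
    seq = seq_db[info.id]               # 2. look up the full sequence
    piece = seq[info.start:info.stop]   # 3. apply the coordinates
    if info.orientation < 0:
        # Negative orientation: report the reverse-complement strand.
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece


seq_db = {'chrToy': 'AACCGGTTAC'}
slice_db = {'A': SliceInfo('chrToy', 0, 4, +1),
            'B': SliceInfo('chrToy', 6, 10, -1)}

print(annotation_sequence(slice_db, seq_db, 'A'))   # AACC
print(annotation_sequence(slice_db, seq_db, 'B'))   # GTAA
```

Because both inputs are accessed only through ``db[key]``, either one could be swapped for a disk-backed or SQL-backed mapping without changing the function.
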
Using Collection as a Persistent Database
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Of course, in real life you probably need to worry about scalability --
we'd like to be able to build annotation databases that are much larger
than will fit in memory, by storing them on disk and using fast indexing
methods to retrieve data from them.

For this purpose Pygr provides its :class:`mapping.Collection` class.
It adds a few crucial features on top of Python's ``shelve`` persistent
dictionary interface:

* Unlike shelve objects, :class:`mapping.Collection` objects are
  picklable, so they can be stored in :mod:`worldbase`.

* Unlike shelve dictionaries, :class:`mapping.Collection` can
  work with integer keys, if you pass the ``intKeys=True`` argument.

So let's modify our previous example to work with
:class:`mapping.Collection`.  All we need to do is create the
Collection with a ``filename`` argument, and it will be stored on disk
(in that file).  We use the standard shelve argument ``mode='c'``
to tell it to open the file for reading and writing, creating it
if it does not already exist.  (With the standard ``shelve`` flags,
``'c'`` keeps an existing file; ``'n'`` is the flag that always
starts from a new, empty file.)  ::

   >>> from pygr import mapping
   >>> slice_db = mapping.Collection(filename='myshelve', mode='c')
   >>> slice_db['A'] = slice1
   >>> slice_db['B'] = slice2
   >>> slice_db.close()

Closing the database is essential to ensuring that all data has been
written to disk.  Now we can re-open the Collection in read-only mode,
and use it as the back-end for our annotation database::

   >>> slice_db = mapping.Collection(filename='myshelve', mode='r')
   >>> annodb = annotation.AnnotationDB(slice_db, dna_db)
   >>> for k in annodb:
   ...     print repr(annodb[k]), repr(annodb[k].sequence)
   annotA[0:50] gi|171854975|dbj|AB364477.1|[0:50]
   annotB[0:100] -gi|171854975|dbj|AB364477.1|[300:400]


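The write, close, re-open cycle above can be tried with nothing but the standard library.  The sketch below uses Python 3's ``shelve`` directly; ``IntKeyShelf`` is a hypothetical toy wrapper, not Pygr's ``intKeys`` implementation, that shows one simple way integer keys could be layered on top of a store that only accepts string keys.

```python
import os
import shelve
import tempfile


class IntKeyShelf(object):
    """Toy wrapper that lets a shelve file accept integer keys."""

    def __init__(self, filename, flag='c'):
        self._shelf = shelve.open(filename, flag=flag)

    def __setitem__(self, key, value):
        self._shelf[str(key)] = value   # shelve itself needs string keys

    def __getitem__(self, key):
        return self._shelf[str(key)]

    def keys(self):
        return sorted(int(k) for k in self._shelf)

    def close(self):
        self._shelf.close()


path = os.path.join(tempfile.mkdtemp(), 'toy_shelve')

db = IntKeyShelf(path, flag='c')   # 'c': create the file if it is missing
db[7] = {'start': 0, 'stop': 50}
db[42] = {'start': 300, 'stop': 400}
db.close()                         # make sure everything reaches the disk

db = IntKeyShelf(path, flag='r')   # re-open read-only
print(db.keys())
print(db[42]['stop'])
db.close()
```

The values survive the ``close()`` because they were written to the file, not kept in the first ``db`` object.
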
Accessing SQL Databases
^^^^^^^^^^^^^^^^^^^^^^^

In many cases, you'll want to access data stored in external
database servers via SQL.  Pygr makes this very easy.  The first
thing you need is a connection to the database server.  Pygr
uses a standard class :class:`sqlgraph.DBServerInfo` (and its
subclasses) for this::

   >>> from pygr import sqlgraph
   >>> serverInfo = sqlgraph.DBServerInfo(host='genome-mysql.cse.ucsc.edu',
   ...                                    user='genome')

In this case, it enables us to connect to UCSC's Genome Browser
MySQL database.

:class:`sqlgraph.DBServerInfo` adds several capabilities on top of
the standard Python DB API 2.0 "database connection" and Cursor
objects:

* It helps Pygr automatically figure out the schema of the target
  database, and enables it to work with different databases
  (e.g. MySQL, sqlite) that have slight differences in SQL syntax.

* It is guaranteed to be picklable (unlike Cursor or Connection objects),
  and therefore can be stored in worldbase.  That is, it stores whatever
  information is necessary to re-connect to the target database
  server at a later time, in a form that can be pickled and unpickled.

* It can automatically use your saved authentication information
  (e.g. for MySQL, in your ~/.my.cnf file) to connect to your database
  server.

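The "picklable connection" idea can be sketched with the standard ``sqlite3`` module in Python 3.  ``ToyServerInfo`` is a hypothetical stand-in written for this tutorial, not the real :class:`sqlgraph.DBServerInfo`: it remembers only the parameters needed to connect and opens the actual connection lazily, so the part that gets saved is plain data.

```python
import pickle
import sqlite3


class ToyServerInfo(object):
    """Remembers how to connect, rather than the connection itself."""

    def __init__(self, database):
        self.params = {'database': database}   # all we need to reconnect
        self._connection = None                # opened lazily on first use

    def cursor(self):
        if self._connection is None:
            self._connection = sqlite3.connect(self.params['database'])
        return self._connection.cursor()

    def dumps(self):
        # Serialize only the plain-data connection parameters.
        return pickle.dumps(self.params)

    @classmethod
    def loads(cls, blob):
        return cls(**pickle.loads(blob))


info = ToyServerInfo(':memory:')
info.cursor().execute('SELECT 1')       # forces a live connection

try:
    pickle.dumps(info._connection)      # a live connection cannot be pickled
except TypeError as err:
    print('cannot pickle connection:', type(err).__name__)

restored = ToyServerInfo.loads(info.dumps())
print(restored.cursor().execute('SELECT 41 + 1').fetchone()[0])
```

The restored object opens a fresh connection of its own the first time ``cursor()`` is called.
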
Let's use this to connect to UCSC "known genes" annotations for
human genome draft 18.  We simply create a :class:`sqlgraph.SQLTable`
object with the desired table name::

   >>> genes = sqlgraph.SQLTable('hg18.knownGene', serverInfo=serverInfo)
   >>> len(genes)
   66803
   >>> genes.columnName
   ['name', 'chrom', 'strand', 'txStart', 'txEnd', 'cdsStart', 'cdsEnd', 'exonCount', 'exonStarts', 'exonEnds', 'proteinID', 'alignID']
   >>> genes.primary_key
   None

As you can see, :class:`sqlgraph.SQLTable` has automatically analyzed
the table's schema, determining that the table lacks a primary key.
We can force it to use ``name`` as the default column for looking up
identifiers, by simply setting the ``primary_key`` attribute::

   >>> genes.primary_key = 'name'

Now we can look up rows directly by ``name``.  If a lookup matched
more than one row, it would raise a ``KeyError``::

   >>> tx = genes['uc009vjh.1']
   >>> tx.chrom
   'chr1'
   >>> tx.txStart
   55424L
   >>> tx.txEnd
   59692L
   >>> tx.strand
   '+'

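To see what "a table as a dictionary" means without connecting to UCSC, here is a small Python 3 sketch on an in-memory ``sqlite3`` database.  ``ToyTable`` and the three toy rows are invented for illustration and are not how :class:`sqlgraph.SQLTable` is implemented; the sketch only reproduces the behavior described above, including the ``KeyError`` when a key matches more than one row.

```python
import sqlite3


class ToyTable(object):
    """Dictionary-style, read-only view of one SQL table."""

    def __init__(self, connection, name, primary_key):
        self.connection = connection
        self.name = name                  # trusted identifiers only
        self.primary_key = primary_key

    def __len__(self):
        sql = 'SELECT COUNT(*) FROM %s' % self.name
        return self.connection.execute(sql).fetchone()[0]

    def __getitem__(self, key):
        sql = 'SELECT * FROM %s WHERE %s = ?' % (self.name, self.primary_key)
        rows = self.connection.execute(sql, (key,)).fetchall()
        if len(rows) != 1:                # missing or ambiguous key
            raise KeyError(key)
        return rows[0]


conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row            # rows allow access by column name
conn.execute('CREATE TABLE genes (name TEXT, chrom TEXT, txStart INTEGER)')
conn.executemany('INSERT INTO genes VALUES (?, ?, ?)',
                 [('g1', 'chr1', 100),
                  ('g2', 'chr2', 200),
                  ('g2', 'chr3', 300)])   # 'g2' is deliberately duplicated

genes = ToyTable(conn, 'genes', primary_key='name')
print(len(genes))
print(genes['g1']['chrom'])
try:
    genes['g2']
except KeyError:
    print('g2 matches more than one row')
```

The key value is passed as a query parameter (``?``), while the table and column names are formatted into the SQL text, so they must come from trusted code rather than user input.
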
Customizing SQL Database Access
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's use this table as the back-end for gene annotations on the
human genome draft 18.  We have to solve a few problems:

* Note that the attribute names used by UCSC
  (``chrom``, ``txStart``, ``txEnd``, ``strand``) are different
  from what :class:`annotation.AnnotationDB` expects.

  This is easy to fix.  :class:`annotation.AnnotationDB` accepts
  a ``sliceAttrDict`` dictionary that can provide aliases.
  For example ``sliceAttrDict=dict(id='chrom')`` would make it
  use the ``chrom`` attribute as the sequence ID.

* A more basic problem: UCSC's ``strand`` attribute returns a string
  ``'+'`` or ``'-'``, instead of an integer (1 or -1) as
  :class:`annotation.AnnotationDB` expects.  That requires writing a little
  code to translate it.  All we have to do is write a Python
  descriptor class to perform this translation::

    class UCSCStrandDescr(object):
        def __get__(self, obj, objtype):
            if obj.strand == '+':
                return 1
            else:
                return -1

Next we create a subclass of Pygr's standard SQL row class,
:class:`sqlgraph.TupleO`, with this descriptor bound as its
``orientation`` attribute::

   class UCSCSeqIntervalRow(sqlgraph.TupleO):
       orientation = UCSCStrandDescr()

Finally, we just tell :class:`sqlgraph.SQLTable` to use our new
row class::

   >>> txInfo = sqlgraph.SQLTable('hg18.knownGene', serverInfo=serverInfo,
   ...                             itemClass=UCSCSeqIntervalRow)
   ...
   >>> txInfo.primary_key = 'name'
   >>> tx = txInfo['uc009vjh.1']
   >>> tx.orientation
   1

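If descriptors are new to you, the mechanism can be tried in isolation.  The following Python 3 sketch uses a hypothetical ``ToyRow`` class in place of :class:`sqlgraph.TupleO`, so it runs without Pygr or a database.  It also handles the case where the attribute is looked up on the class itself (``obj`` is ``None``), which the short version above leaves out.

```python
class StrandDescr(object):
    """Translate a '+'/'-' strand string into +1/-1 on attribute access."""

    def __get__(self, obj, objtype=None):
        if obj is None:          # accessed on the class, not an instance
            return self
        return 1 if obj.strand == '+' else -1


class ToyRow(object):
    """Stand-in for a database row with a UCSC-style strand column."""

    orientation = StrandDescr()  # computed from self.strand on each access

    def __init__(self, name, strand):
        self.name = name
        self.strand = strand


fwd = ToyRow('tx1', '+')
rev = ToyRow('tx2', '-')
print(fwd.orientation, rev.orientation)   # 1 -1
```

Nothing is stored under the name ``orientation`` on the instances; the value is derived from ``strand`` every time it is read, so the two can never disagree.
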
OK, now we can use this as the slice database for our annotation.
Let's get the human genome database, and create our annotation database::

   >>> from pygr import worldbase
   >>> hg18 = worldbase.Bio.Seq.Genome.HUMAN.hg18()
   >>> annodb = annotation.AnnotationDB(txInfo, hg18,
   ...                                  sliceAttrDict=
   ...                                  dict(id='chrom', start='txStart',
   ...                                       stop='txEnd'))
   ...
   >>> gene = annodb['uc009vjh.1']
   >>> print repr(gene.sequence), str(gene.sequence)[:50]
   chr1[55424:59692] GTTATGAAGAAGGTAGGTGGAAACAAAGACAAAACACATATATTAGAAGA

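The role of ``sliceAttrDict`` can be pictured as a simple attribute-name translation.  The Python 3 sketch below is only an illustration of that idea: ``get_slice_attr`` and ``ToyGeneRow`` are hypothetical names, the coordinates are made-up toy values, and Pygr's own lookup code may work differently.

```python
def get_slice_attr(row, attr, alias_dict):
    """Fetch a slice attribute from `row`, honoring any alias."""
    return getattr(row, alias_dict.get(attr, attr))


class ToyGeneRow(object):
    """Stand-in for a UCSC-style row (toy values, not real data)."""
    chrom = 'chr1'
    txStart = 100
    txEnd = 250
    orientation = 1


aliases = dict(id='chrom', start='txStart', stop='txEnd')
row = ToyGeneRow()

# 'orientation' has no alias, so it is looked up under its own name.
coords = [get_slice_attr(row, name, aliases)
          for name in ('id', 'start', 'stop', 'orientation')]
print(coords)   # ['chr1', 100, 250, 1]
```

Only the names that differ need an entry in the alias dictionary; every other attribute falls through unchanged.
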
Victory!  We are able to serve up gene annotations over the whole
genome on our local machine, simply by plugging in to UCSC's database server!