pygr/seqdb.py

   1 """
   2 seqdb contains a set of classes for interacting with sequence databases.
   3
   4 Primary sequence database classes:
   5
   6   - SequenceDB         - base class for sequence databases
   7   - SequenceFileDB     - file-based sequence database
   8   - PrefixUnionDict    - container to combine multiple sequence databases
   9   - XMLRPCSequenceDB   - XML-RPC-accessible sequence database
  10
  11 Extensions:
  12
  13   - SeqPrefixUnionDict - extends PrefixUnionDict to automatically add seqs
  14   - BlastDB            - implements NCBI-style name munging for lookups
  15
  16 Associated sequence classes:
  17
  18   - FileDBSequence     - sequence associated with a SequenceFileDB
  19   - XMLRPCSequence     - sequence associated with a XMLRPCSequenceDB
  20
  21 ----
  22
  23 SequenceDB provides some basic behaviors for sequence databases:
  24 dictionary behavior, an invert operator interface, and caching for
  25 both sequence objects and sequence intervals.  It also relies on a
  26 'seqInfoDict' attribute to contain summary information about
  27 sequences, so that e.g. slice operations can be done without loading
  28 the entire sequence into memory.  (See below for more info on
  29 seqInfoDict.)
  30
  31 SequenceFileDB extends SequenceDB to contain a file-based database of
  32 sequences.  It constructs a seqLenDict that allows direct on-disk lookup
  33 of sequence intervals.  (See below for more info on seqLenDict.)
  34
  35 PrefixUnionDict provides a unified SequenceDB-like interface to a
  36 collection of sequence databases by combining the database name with
  37 the sequence ID into a new sequence id.  For example, the ID
  38 'genome.chrI' would return the sequence 'chrI' in the 'genome'
  39 database.  This is particularly handy for situations where you want to
  40 have seqdbs of multiple sequence types (DNA, protein, annotations,
  41 etc.) all associated together.
  42
  43 @CTB document XMLRPCSequenceDB.
  44 @CTB document SeqPrefixUnionDict.
  45 @CTB document BlastDB.
  46
  47 ----
  48
  49 The seqInfoDict interface
  50 -------------------------
  51
  52 The seqInfoDict attribute of a SequenceDB is a dictionary-like object,
  53 keyed by sequence IDs, with associated values being an information
  54 object containing various attributes.  seqInfoDict is essentially an
  55 optimization that permits other pygr-aware components to access
  56 information *about* sequences without actually loading the entire
  57 sequence.
  58
  59 The only required attribute at the moment is 'length', which is
  60 required by some of the NLMSA code.  However, seqInfoDict is a good
  61 mechanism for the storage of any summary information on a sequence,
  62 and so it may be expanded in the future.
  63
  64 The seqLenDict interface
  65 ------------------------
  66
  67 The seqLenDict attribute is specific to a SequenceFileDB, where it
  68 provides a file-backed storage of length and offset sequence metadata.
  69 It is used to implement a key optimization in SequenceFileDB, in which
  70 a sequence's offset within a file is used to read only the required
  71 part of the sequence into memory.  This optimization is particularly
  72 important for large sequences, e.g. chromosomes, where reading the
  73 entire sequence into memory shouldn't be done unless it's necessary.
  74
  75 The seqLenDict is keyed by sequence ID and the associated values are a
  76 2-tuple (length, offset), where the offset indicates the byte offset
  77 within the '.pureseq' index file created for each SequenceFileDB.
  78
  79 get_bound_subclass and the 'self.db' attribute
  80 ----------------------------------------------
  81
  82 The SequenceDB constructor calls classutil.get_bound_subclass on its
  83 itemClass.  What does that do, and what is it for?
  84
  85 get_bound_subclass takes an existing class, makes a new subclass of
  86 it, binds the variable 'db' to it, and then calls the _init_subclass
  87 classmethod (if it exists) on the new class.  This has the effect of
  88 creating a new class for each SequenceDB instance, tied specifically
  89 to that instance and initialized by the _init_subclass method.
  90
  91 The main effect of this for SequenceDBs is that for any SequenceDB
  92 descendant, the '.db' attribute is automatically set for each Sequence
  93 retrieved from the database.
  94
  95 CTB: I think this is an optimization?
  96
  97 Caching
  98 -------
  99
 100 @CTB discuss caching.
 101
 102 Pickling sequence databases and sequences
 103 -----------------------------------------
 104
 105 @CTB document pickling issues.
 106 programmer notes:
 107
 108 extending SequenceDB
 109  - seqInfoDict, itemclass
 110
 111 extending SequenceFileDB
 112  - seqLenDict
 113  - using your own itemclass
 114  - using your own reader
 115
 116 doctests & examples
 117 -------------------
 118
 119 update docs for these classes!
 120
 121 intro:
 122  - loading a FASTA file
 123  - using a PUD
 124    + combining dbs, etc.
 125    + inverse
 126
 127 Code review issues, short term:
 128
 129  - @CTB get_bound_subclass stuff refers directly to itemClass to set 'db'.
 130  - @CTB fix 'import *'s
 131  - @CTB run lint/checker?
 132  - @CTB test _create_seqLenDict
 133  - @CTB XMLRPCSequenceDB vs SequenceFileDB
 134
 135 Some long term issues:
 136
 137  - it should be possible to remove _SeqLenDictSaver and just combine its
 138    functionality with _store_seqlen_dict.  --titus 3/21/09
 139
 140 """
 141
 142 from __future__ import generators
 143 import sys
 144 import os
 145 import UserDict
 146 import weakref
 147
 148 from sequence import *                  # @CTB
 149 from sqlgraph import *                  # @CTB
 150 import classutil
 151 from annotation import AnnotationDB, AnnotationSeq, AnnotationSlice, \
 152      AnnotationServer, AnnotationClient
 153 import logger
 154 import seqfmt
 155
 156 from dbfile import NoSuchFileError
 157
 158 ####
 159 #
 160 # SequenceDB and associated support classes.
 161 #
 162
 163 class _SequenceDBInverse(object):
 164     """Implements __inverse__ on SequenceDB objects, returning seq name."""
 165     def __init__(self, db):
 166         self.db = db
 167
 168     def __getitem__(self, seq):
 169         return seq.pathForward.id
 170
 171     def __contains__(self, seq):
 172         try:
 173             return seq.pathForward.db is self.db
 174         except AttributeError:
 175             return False
 176
 177 class SequenceDB(object, UserDict.DictMixin):
 178     """Base class for sequence databases.
 179
 180     SequenceDB provides a few basic (base) behaviors:
 181       - dict-like interface to sequence objects each with an ID
 182       - the ~ (invert) operator returns an 'inverted' database, which is a
 183         dict-like object that returns sequence names when given a sequence.
 184       - weakref-based automatic flushing of seq objects no longer in use;
 185         use autoGC=0 to turn this off.
 186       - cacheHint() system for caching a given set of sequence
 187         intervals associated with an owner object, which are flushed
 188         from cache if the owner object is garbage-collected.
 189
 190     For subclassing, note that self.seqInfoDict must be set before
 191     SequenceDB.__init__ is called!
 192
 193     """
 194     # class to use for database-linked sequences; no default.
 195     itemClass = None
 196     # class to use for sequence slices; see sequence.SeqPath.classySlice.
 197     itemSliceClass = SeqDBSlice
 198
 199     # pickling methods & what attributes to pickle.
 200     __getstate__ = classutil.standard_getstate
 201     __setstate__ = classutil.standard_setstate
 202     _pickleAttrs = dict(autoGC=0)
 203
 204     # define ~ (invert) operator to return a lazily-created _SequenceDBInverse
 205     __invert__ = classutil.lazy_create_invert(_SequenceDBInverse)
 206
 207     def __init__(self, autoGC=True, dbname=None, **kwargs):
 208         """Initialize seq db from filepath or ifile."""
 209         if autoGC: # automatically garbage collect unused objects
 210             self._weakValueDict = classutil.RecentValueDictionary(autoGC)
 211         else:
 212             self._weakValueDict = {}    # object cache @CTB not tested
 213         self.autoGC = autoGC
 214
 215         # override itemClass and itemSliceClass if specified
 216         self.itemClass = kwargs.get('itemClass', self.itemClass)
 217         self.itemSliceClass = kwargs.get('itemSliceClass', self.itemSliceClass)
 218
 219         if self.itemClass is None:
 220             raise TypeError, "must supply itemClass to SequenceDB"
 221
 222         # get a copy we can modify w/o side effects and bind itemClass.
 223         kwargs = kwargs.copy()
 224         kwargs['db'] = self
 225         classutil.get_bound_subclass(self, 'itemClass', dbname,
 226                                      subclassArgs=kwargs)
 227
 228         # guess the sequence type
 229         self._set_seqtype()
 230
 231     def __hash__(self):
 232         """Define a hash function to allow this object to be used as a key."""
 233         return id(self)
 234
 235     def _set_seqtype(self):
 236         """Guess the seqtype from 100 chars of 1st seq if not already known."""
 237         seqtype = getattr(self, '_seqtype', None)
 238         if seqtype is not None:
 239             return
 240
 241         for seqID in self: # get an iterator
 242             seq = self[seqID] # get the 1st sequence
 243             ch100 = str(seq[:100])
 244             self._seqtype = guess_seqtype(ch100)
 245             break # only process the 1st sequence!!!
 246
 247     _cache_max=10000                    # @CTB move? make settable?
 248     def cacheHint(self, ivalDict, owner):
 249         """Save a cache hint dict: {id: (start, stop)}.
 250
 251         @CTB document!
 252         """
 253         d={}
 254         # @CTB refactor, test
 255         for id, ival in ivalDict.items(): # BUILD THE CACHE DICTIONARY FOR owner
 256             if ival[0] < 0: # FORCE IVAL INTO POSITIVE ORIENTATION
 257                 ival=(-ival[1], -ival[0])        # @CTB untested
 258             if ival[1]-ival[0] > self._cache_max: # TRUNCATE EXCESSIVE LENGTH
 259                 ival=(ival[0], ival[0] + self._cache_max) # @CTB untested
 260             d[id]=[ival[0], ival[1]]
 261         try:
 262             self._cache[owner] = d # ADD TO EXISTING CACHE
 263         except AttributeError:
 264             self._cache = weakref.WeakKeyDictionary()  # AUTOMATICALLY REMOVE
 265             self._cache[owner] = d # FROM CACHE IF owner GOES OUT OF SCOPE
 266     def strsliceCache(self, seq, start, stop):
 267         """Get strslice using cache hints, if any available."""
 268         try:
 269             cacheDict=self._cache
 270         except AttributeError:
 271             raise IndexError('no cache present')
 272         for owner, d in cacheDict.items():
 273             try:
 274                 ival = d[seq.id]
 275             except KeyError:
 276                 continue # NOT IN THIS CACHE, SO SKIP  @CTB untested
 277             ival_start, ival_stop = ival[:2]
 278             if start >= ival_start and stop <= ival_stop: # CONTAINED IN ival
 279                 try:
 280                     s = ival[2] # get seq string from our cache
 281                 except IndexError: # use strslice() to retrieve from storage
 282                     s = seq.strslice(ival_start, ival_stop, useCache=False)
 283                     ival.append(s)
 284                     try: # does owner want to reference this cached seq?
 285                         save_f = owner.cache_reference
 286                     except AttributeError:
 287                         pass # no, so nothing to do
 288                     else: # let owner control caching in our _weakValueDict
 289                         save_f(seq)     # # @CTB untested
 290                 return s[start - ival_start:stop - ival_start]
 291         raise IndexError('interval not found in cache') # @CTB untested
 292
 293     # these methods should all be implemented on all SequenceDBs.
 294     def close(self):
 295         pass # subclass should implement closing of its open resources!
 296
 297     def __iter__(self):
 298         return iter(self.seqInfoDict)
 299
 300     def iteritems(self):
 301         for seqID in self:
 302             yield seqID, self[seqID]
 303
 304     def __len__(self):
 305         return len(self.seqInfoDict)
 306
 307     def __getitem__(self, seqID):
 308         """Retrieve sequence by id, using cache if available."""
 309         try: # for speed, default case (cache hit) should return immediately
 310             return self._weakValueDict[seqID]
 311         except KeyError: # not in cache?  try loading.
 312             try:
 313                 s = self.itemClass(self, seqID)
 314             except KeyError:
 315                 raise KeyError("no key '%s' in database %s"
 316                                % (seqID, repr(self)))
 317             self._weakValueDict[seqID] = s # save in cache.
 318             return s
 319
 320     def keys(self):
 321         return self.seqInfoDict.keys()
 322
 323     def __contains__(self, key):
 324         return key in self.seqInfoDict
 325
 326     def __repr__(self):
 327         return "<%s '%s'>" % (self.__class__.__name__,
 328                               self.itemClass.__class__.__name__)
 329
 330     def clear_cache(self):
 331         """Empty the cache."""
 332         self._weakValueDict.clear()
 333
 334     # these methods should not be implemented for read-only database.
 335     clear = setdefault = pop = popitem = copy = update = \
 336             classutil.read_only_error
 337
 338 ####
 339 #
 340 # FileDBSequence and SequenceFileDB, and associated support classes.
 341 #
 342
 343 class _FileDBSeqDescriptor(object):
 344     """Descriptor to retrieve entire sequence from seqLenDict."""
 345     def __get__(self, obj, objtype):
 346         length = obj.db.seqLenDict[obj.id][0]
 347         return obj.strslice(0, length)
 348
 349 class FileDBSequence(SequenceBase):
 350     """Default sequence object for file-based storage mechanism.
 351
 352     See SequenceFileDB for the associated database class.
 353
 354     In general, you should not create objects from this class directly;
 355     retrieve them from SequenceFileDB objects, instead.
 356
 357     NOTE: 'self.db' is attached to all instances of this class that come
 358     from a particular database by 'classutil.get_bound_subclass'.
 359
 360     """
 361     seq = _FileDBSeqDescriptor()        # dynamically retrieve 'seq'.
 362     __reduce__ = classutil.item_reducer # for pickling purposes.
 363
 364     def __init__(self, db, id):
 365         self.id = id
 366         SequenceBase.__init__(self)
 367         if self.id not in self.db.seqLenDict:
 368             raise KeyError('sequence %s not in db %s' % (self.id, self.db))
 369
 370     def __len__(self):
 371         """Unpack this sequence's length from the seqLenDict."""
 372         return self.db.seqLenDict[self.id][0]
 373
 374     def strslice(self, start, end, useCache=True):
 375         """Access slice of a sequence efficiently, using seqLenDict info."""
 376         if useCache:                    # If it's in the cache, use that!
 377             try:
 378                 return self.db.strsliceCache(self, start, end)
 379             except IndexError:
 380                 pass
 381
 382         return self.db.strslice(self.id, start, end)
 383
 384 class SequenceFileDB(SequenceDB):
 385     """Main class for file-based storage of a sequence database.
 386
 387     By default, SequenceFileDB uses a seqLenDict, a.k.a. a shelve
 388     index of sequence lengths and offsets, to retrieve sequence slices
 389     with fseek.  Thus entire chromosomes (for example) do not have to
 390     be loaded to retrieve a subslice.
 391
 392     Takes one required argument, 'filepath', which should be the name
 393     of a FASTA file (or a file whose format is understood by your
 394     custom reader; see 'reader' kw arg, and the _store_seqlen_dict
 395     function).
 396
 397     The SequenceFileDB seqInfoDict interface is a wrapper around the
 398     seqLenDict created by the __init__ function.
 399
 400     """
 401     itemClass = FileDBSequence
 402
 403     # copy _pickleAttrs and add 'filepath'
 404     _pickleAttrs = SequenceDB._pickleAttrs.copy()
 405     _pickleAttrs['filepath'] = 0
 406
 407     def __init__(self, filepath, reader=None, **kwargs):
 408         # make filepath a pickleable attribute.
 409         self.filepath = classutil.SourceFileName(str(filepath))
 410
 411         fullpath = self.filepath + '.seqlen'
 412         # build the seqLenDict if it doesn't already exist
 413         try:
 414             seqLenDict = classutil.open_shelve(fullpath, 'r')
 415         except NoSuchFileError:
 416             seqLenDict = self._create_seqLenDict(fullpath, filepath, reader)
 417
 418         self.seqLenDict = seqLenDict
 419         self.seqInfoDict = _SeqLenDictWrapper(self) # standard interface
 420
 421         # initialize base class.
 422         dbname = os.path.basename(filepath)
 423         SequenceDB.__init__(self, filepath=filepath, dbname=dbname, **kwargs)
 424
 425     def close(self):
 426         '''close our open shelve index file and _pureseq...'''
 427         self.seqLenDict.close()
 428         try:
 429             do_close = self._pureseq.close
 430         except AttributeError:
 431             pass # _pureseq not open yet, so nothing to do
 432         else:
 433             do_close()
 434
 435     def __repr__(self):
 436         return "<%s '%s'>" % (self.__class__.__name__, self.filepath)
 437
 438     def _create_seqLenDict(self, dictpath, seqpath, reader=None):
 439         """Create a seqLenDict from 'seqpath' and store in 'dictpath'."""
 440         seqLenDict = classutil.open_shelve(dictpath, 'n')
 441         try:
 442             logger.debug('Building sequence length index...')
 443             _store_seqlen_dict(seqLenDict, seqpath, reader)
 444         finally:
 445             seqLenDict.close() # close after writing, no matter what!
 446         return classutil.open_shelve(dictpath, 'r') # re-open read-only
 447
 448     def strslice(self, seqID, start, end, useCache=True):
 449         """Access slice of a sequence efficiently, using seqLenDict info."""
 450         # Retrieve sequence from the .pureseq file based on seqLenDict
 451         # information.
 452         try:
 453             ifile=self._pureseq
 454         except AttributeError:
 455             fullpath = self.filepath + '.pureseq'
 456             ifile = file(fullpath, 'rb')
 457             self._pureseq = ifile
 458
 459         # Now, read in the actual slice.
 460         offset = self.seqLenDict[seqID][1]
 461         ifile.seek(offset + start)
 462         return ifile.read(end - start)
 463
 464 # Some support classes for the SeqLenDict mechanism.
 465
 466 class BasicSeqInfo(object):
 467     """Wrapper to provide the correct seqInfoDict-style object information.
 468
 469     This boils down to providing id, db, length, and possibly offset.
 470
 471     """
 472     def __init__(self, seqID, seqDB, length=None):
 473         self.id = seqID
 474         self.db = seqDB
 475         if length is None:
 476             self.length = len(seqDB[seqID]) # generic but possibly slow
 477         else:
 478             self.length = length
 479
 480 class _SeqLenObject(BasicSeqInfo):
 481     """Wrapper for use with a seqLenDict """
 482     def __init__(self, seqID, seqDB):
 483         length, self.offset = seqDB.seqLenDict[seqID]
 484         BasicSeqInfo.__init__(self, seqID, seqDB, length)
 485
 486 class BasicSeqInfoDict(object, UserDict.DictMixin):
 487     """Wrapper around SequenceDB.seqLenDict to provide seqInfoDict behavior.
 488     This basic version just gets the length from the sequence object itself.
 489     """
 490     itemClass = BasicSeqInfo
 491     def __init__(self, db):
 492         self.seqDB = db
 493
 494     def __getitem__(self, k):
 495         return self.itemClass(k, self.seqDB)
 496
 497     def __len__(self):
 498         return len(self.seqDB.seqLenDict)
 499
 500     def __iter__(self):
 501         return iter(self.seqDB.seqLenDict)
 502
 503     def keys(self):
 504         return self.seqDB.seqLenDict.keys()
 505
 506 class _SeqLenDictWrapper(BasicSeqInfoDict):
 507     """
 508     The default storage mechanism for sequences implemented by FileDBSequence
 509     and SequenceFileDB puts everything in seqLenDict, a shelve index of
 510     lengths and offsets.  This class wraps that dictionary to provide the
 511     interface that SequenceDB expects to see.
 512
 513  """
 514     itemClass = _SeqLenObject
 515
 516 class _SeqLenDictSaver(object):
 517     """Support for generic reading functions, called by _store_seqlen_dict.
 518
 519     This allows you to specify your own 'reader' function when
 520     constructing a FileSequenceDB, e.g. so that you could read
 521     something other than FASTA files into a seqLenDict.  Pass in this
 522     function as the 'reader' kwarg to FileSequenceDB.
 523
 524     Custom reader functions should take a file handle and a filename,
 525     and return a list of sequence info objects with 'id', 'length',
 526     and 'sequence' attributes for each sequence in the given
 527     file/filename.  _SeqLenDictSaver will then construct a '.pureseq'
 528     file containing the concatenated sequences and fill in the
 529     seqLenDict appropriately.
 530
 531     """
 532     def __init__(self, reader):
 533         self.reader = reader
 534
 535     def __call__(self, d, ifile, filename):
 536         offset = 0L
 537         pureseq_fp = file(filename + '.pureseq', 'wb')
 538         try:
 539             for o in self.reader(ifile, filename):
 540                 # store the length & offset in the seqLenDict
 541                 d[o.id] = o.length, offset
 542                 offset += o.length
 543                 if o.length != len(o.sequence):
 544                     raise ValueError('length does not match sequence: %s,%d'
 545                                      % (o.id, o.length))
 546                 pureseq_fp.write(o.sequence)
 547         finally:
 548             pureseq_fp.close()
 549
 550 def _store_seqlen_dict(d, filename, reader=None, mode='rU'):
 551     """Store sequence lengths in a dictionary, e.g. a seqLenDict.
 552
 553     Used by SequenceFileDB._create_seqLenDict.
 554
 555     The 'reader' function implements a custom sequence format reader;
 556     by default, _store_seqlen_dict uses seqfmt.read_fasta_lengths,
 557     which reads FASTA-format files.  See _SeqLenDictSaver for
 558     information on building a custom 'reader', and see the seqdb docs
 559     for an example.
 560
 561     """
 562     # if a custom reader function was passed in, run that.
 563     builder = seqfmt.read_fasta_lengths
 564     if reader is not None:
 565         builder = _SeqLenDictSaver(reader)
 566
 567     ifile = file(filename, mode)
 568     try:
 569         builder(d, ifile, filename) # run the builder on our sequence set
 570     finally:
 571         ifile.close()
 572
 573 ####
 574 #
 575 # class PrefixUnionDict and associated support classes.
 576 #
 577
 578 class _PrefixUnionDictInverse(object):
 579     """Provide inverse (~) operator behavior for PrefixUnionDicts.
 580
 581     This enables ~pud to return a database that, given a sequence
 582     object, returns the corresponding key (prefix.id) to retrieve that
 583     sequence object in the pud.
 584
 585     """
 586
 587     def __init__(self, db):
 588         self.db = db
 589
 590     def __getitem__(self, ival):
 591         seq = ival.pathForward # get the top-level sequence object
 592         try: # for speed, normal case should execute immediately
 593             prefix = self.db.dicts[seq.db]
 594         except KeyError:
 595             try:
 596                 # @CTB abstraction boundary violation! keep? how test?
 597                 if seq.pathForward._anno_seq.db in self.db.dicts:
 598                     raise KeyError('''\
 599 this annotation is not in the PrefixUnion, but its sequence is.
 600 You can get that using its \'sequence\' attribute.''')
 601             except AttributeError:
 602                 pass
 603             raise KeyError('seq.db not in PrefixUnionDict')
 604
 605         return prefix + self.db.separator + str(seq.id)
 606
 607     def __contains__(self, seq):
 608         try:
 609             return seq.pathForward.db in self.db.dicts
 610         except AttributeError:
 611             return False
 612
 613
 614 class _PrefixUnionMemberDict(object, UserDict.DictMixin):
 615     """
 616     @CTB confusing/inappropriate use of a dict interface! keep??
 617     @CTB document.
 618     'd[prefix]=value; d[k] returns value if k is a member of prefix'
 619     """
 620     def __init__(self, puDict, default=None,
 621                  attrMethod=lambda x:x.pathForward.db):
 622         self.values = {}
 623         self.puDict = puDict
 624         self._attrMethod = attrMethod
 625         if default is not None:
 626             self.default = default      # @CTB can we use setdefault for this?
 627
 628     def keys(self):
 629         return self.puDict.prefixDict.keys()
 630     possibleKeys = keys                 # legacy interface (?)
 631
 632     def __setitem__(self, k, v):
 633         if k not in self.puDict.prefixDict:
 634             raise KeyError('key %s is not a valid union prefix string!' % k)
 635         new_k = self.puDict.prefixDict[k]
 636         self.values[new_k] = v
 637
 638     def __getitem__(self, k):
 639         try:
 640             db = self._attrMethod(k)
 641         except AttributeError:
 642             raise TypeError('wrong key type? _attrMethod() failed.')
 643
 644         if db not in self.values:
 645             try:
 646                 return self.default
 647             except AttributeError:      # no default value - raise KeyError.
 648                 raise KeyError('key not a member of this union!')
 649
 650         return self.values[db]
 651
 652 class PrefixUnionDict(object, UserDict.DictMixin):
 653     """Interface to a set of sequence DBs, each assigned a unique prefix.
 654
 655     For example, the sequence ID 'foo.bar' would unpack to ID 'bar' in
 656     the dictionary associated with the prefix 'foo'.  This is a useful
 657     way to combine disparate seqdbs into a single db, without actually
 658     altering the individual seqdbs.
 659
 660     PrefixUnionDicts can be created in one of two ways: either
 661       - pass in a dictionary containing prefix-to-seqdb mappings as
 662         'prefixDict', or
 663       - pass in a header file containing the information necessary to create
 664         such a dictionary.
 665
 666     In the latter case, see the 'writeHeaderFile' method for format
 667     information.  The optional kwarg 'trypath' contains a list of
 668     directories to search for the database file named in each line.
 669     The optional kwarg 'dbClass' specifies the database class to use
 670     to load each sequence file; it defaults to SequenceFileDB.
 671
 672     The default ID separator is '.'; use the 'separator' kwarg to
 673     change it.
 674
 675     @CTB trypath => trypaths?
 676
 677     """
 678     # define ~ (invert) operator to return a lazily-created _PUDInverse.
 679     __invert__ = classutil.lazy_create_invert(_PrefixUnionDictInverse)
 680
 681     def __init__(self, prefixDict=None, separator='.', filename=None,
 682                  dbClass=SequenceFileDB, trypath=None):
 683         # read union header file
 684         if filename is not None:
 685             if prefixDict:
 686                 raise TypeError('''
 687 cannot create with prefixDict and filename both!''')
 688
 689             if trypath is None:
 690                 trypath = [os.path.dirname(filename)]
 691             ifile = file(filename, 'rU')
 692             try:
 693                 it = iter(ifile)
 694                 separator = it.next().strip('\r\n') # remove leading/trailing CR
 695                 prefixDict = {}
 696                 for line in it:
 697                     prefix, filepath=line.strip().split('\t')[:2]
 698                     try:
 699                         dbfile = classutil.search_dirs_for_file(filepath, trypath)
 700                         db = dbClass(dbfile)
 701                     except IOError:
 702                         for db in prefixDict.values():
 703                             db.close() # close databases before exiting
 704                         raise IOError('''\
 705     unable to open database %s: check path or privileges.
 706     Set 'trypath' to give a list of directories to search.''' % filepath)
 707                     else:
 708                         prefixDict[prefix] = db
 709             finally:
 710                 ifile.close()
 711
 712         self.separator = separator
 713         if prefixDict is not None:
 714             self.prefixDict = prefixDict
 715         else:
 716             self.prefixDict = {}
 717
 718         # also create a reverse mapping
 719         d = {}
 720         for k, v in self.prefixDict.items():
 721             d[v] = k
 722
 723         self.dicts = d
 724         self.seqInfoDict = _PUDSeqInfoDict(self) # supply standard interface
 725
 726     def format_id(self, prefix, seqID):
 727         return prefix + self.separator + seqID
 728
 729     def get_prefix_id(self, k):
 730         """Subdivide a key into a prefix and ID using the given separator."""
 731         try:
 732             t = k.split(self.separator, 2) # split into no more than 3 fields
 733         except AttributeError:
 734             raise KeyError('key should be a string! ' + repr(k))
 735         l = len(t)
 736         if l == 2:
 737             return t
 738         elif l<2:
 739             raise KeyError('invalid id format; no prefix: ' + k)
 740         else: # id contains separator character?
 741             prefix = t[0] # assume prefix doesn't contain separator @CTB untested
 742             seqID = k[len(prefix) + 1:] # skip past prefix
 743             return prefix, seqID
 744
 745     def get_subitem(self, d, seqID):
 746         # try int key first
 747         try:
 748             return d[int(seqID)]
 749         except (ValueError, KeyError, TypeError):
 750             pass
 751
 752         # otherwise, use default (str) key
 753         try:
 754             return d[seqID]
 755         except KeyError:
 756             raise KeyError, "no key '%s' in %s" % (seqID, repr(d))
 757
 758     def __getitem__(self, k):
 759         """For 'foo.bar', return 'bar' in dict associated with prefix 'foo'"""
 760         prefix, seqID = self.get_prefix_id(k)
 761         try:
 762             d = self.prefixDict[prefix]
 763         except KeyError, e:
 764             raise KeyError("no key '%s' in %s" % (k, repr(self)))
 765         return self.get_subitem(d, seqID)
 766
 767     def __contains__(self, k):
 768         """Is the given ID in our PrefixUnionDict?"""
 769         # try it out as an ID.
 770         if isinstance(k, str):
 771             try:
 772                 (prefix, id) = self.get_prefix_id(k)
 773                 return id in self.prefixDict[prefix]
 774             except KeyError:
 775                 return False
 776
 777         # otherwise, try treating key as a sequence.
 778         # @CTB inconsistent with 'getitem'.
 779         try:
 780             db = k.pathForward.db
 781         except AttributeError:
 782             raise AttributeError('key must be a sequence with db attribute!')
 783         return db in self.dicts
 784
 785     def has_key(self, k):
 786         return self.__contains__(k)
 787
 788     def __iter__(self):
 789         for p, d in self.prefixDict.items():
 790             for id in d:
 791                 yield self.format_id(p, id)
 792
 793     def keys(self):
 794         return list(self.iterkeys())
 795
 796     def iterkeys(self):
 797         return iter(self)
 798
 799     def iteritems(self):
 800         for p,d in self.prefixDict.items():
 801             for id, seq in d.iteritems():
 802                 yield self.format_id(p, id), seq
 803
 804     def getName(self, ival):
 805         """For a given sequence, return a fully qualified name, 'prefix.id'."""
 806         seq = ival.pathForward # get the top-level sequence object
 807         return self.dicts[seq.db] + self.separator + seq.id
 808
 809     def newMemberDict(self, **kwargs):  # @CTB not used; necessary?
 810         """return a new member dictionary (empty)"""
 811         return _PrefixUnionMemberDict(self, **kwargs)
 812
 813     def writeHeaderFile(self,filename):  # @CTB not used; necessary?
 814         """Save a header file, suitable for later re-creation."""
 815         ifile = file(filename, 'w')
 816         print >>ifile, self.separator
 817         for k, v in self.prefixDict.items():
 818             try:
 819                 print >>ifile, '%s\t%s' % (k, v.filepath)
 820             except AttributeError:
 821                 raise AttributeError('''\
 822 seq db '%s' has no filepath; you may be able to save this to worldbase,
 823 but not to a text HeaderFile!''' % k)
 824         ifile.close()
 825
 826     def __len__(self):
 827         n=0
 828         for db in self.dicts:
 829             n += len(db)
 830         return n
 831
 832     def cacheHint(self, ivalDict, owner=None):  # @CTB untested
 833         '''save a cache hint dict of {id:(start,stop)}'''
 834         d={}
 835         # extract separate cache hint dict for each prefix
 836         for longID,ival in ivalDict.items():
 837             prefix, seqID = self.get_prefix_id(longID)
 838             d.setdefault(prefix, {})[seqID] = ival
 839         for prefix,seqDict in d.items():
 840             try:
 841                 m = self.prefixDict[prefix].cacheHint
 842             except AttributeError: # subdict can't cacheHint(), so just ignore
 843                 pass
 844             else:
 845                 # pass cache hint down to subdictionary
 846                 m(seqDict, owner)
 847
 848     # not clear what this should do for PrefixUnionDict
 849     copy = setdefault = update = classutil.method_not_implemented
 850
 851     # these methods should not be implemented for read-only database.
 852     clear = pop = popitem = classutil.read_only_error
 853
 854
 855 class _PrefixDictInverseAdder(_PrefixUnionDictInverse):
 856     """Inverse class for SeqPrefixUnionDict; adds sequences when looked up.
 857
 858     @CTB is getName only used by __getitem__?  Make private?
 859     """
 860
 861     def getName(self, seq):
 862         """Find in or add the given sequence to the inverse of a PUD."""
 863         try:
 864             return _PrefixUnionDictInverse.__getitem__(self, seq)
 865         except AttributeError: # no seq.db?  treat as a user sequence.
 866             new_id = 'user' + self.db.separator + seq.pathForward.id
 867             # check to make sure it's already in the user seq db...
 868             _ = self.db[new_id]
 869             return new_id
 870
 871     def __getitem__(self, seq):
 872         """__getitem__ interface that calls getName."""
 873         try:
 874             return self.getName(seq)
 875         except KeyError:
 876             if not self.db.addAll:
 877                 raise
 878
 879             # if we should add, add seq & re-try.
 880             self.db += seq
 881             return self.getName(seq)
 882
 883
 884 class SeqPrefixUnionDict(PrefixUnionDict):
 885     """SeqPrefixUnionDict provides += functionality to add seqs to a PUD.
 886
 887     See the __iadd__ method for details.
 888
 889     If addAll is True, then looking a sequence up in the inverse db will
 890     automatically add it to the PrefixUnionDict.
 891     """
 892
 893     __invert__ = classutil.lazy_create_invert(_PrefixDictInverseAdder)
 894
 895     def __init__(self, addAll=False, **kwargs):
 896         PrefixUnionDict.__init__(self, **kwargs)
 897
 898         # override default PrefixUnionDict __invert__ to add sequences;
 899         # see classutil.lazy_create_invert.
 900         self.addAll = addAll  # see _PrefixDictInverseAdder behavior.
 901
 902     def __iadd__(self, k):
 903         """Add a sequence or database to the PUD, with a unique prefix.
 904
 905         NOTE: __iadd__ must return self.
 906
 907         """
 908         # seq or db already present?
 909         if k in (~self):
 910             return self
 911
 912         db = getattr(k, 'db', None)
 913         if db is None:                  # annotation sequence?
 914             db = getattr(k.pathForward, 'db', None) # @CTB untested
 915
 916         if db is None:  # this is a user sequence, with no container; create.
 917             if not self.prefixDict.has_key('user'):
 918                 d = KeepUniqueDict()
 919                 self._add_prefix_dict('user', d)
 920             else:
 921                 d = self.prefixDict['user']
 922
 923             # now add the sequence
 924             d[k.pathForward.id] = k.pathForward
 925             return self
 926
 927         # already contain?  nothing to do.
 928         if db in self.dicts:            # @CTB can this 'if' ever be true?
 929             return self
 930
 931         # ok, not present; add, with a unique name.  does it have
 932         # _persistent_id?
 933         try:
 934             name = db._persistent_id.split('.')[-1]
 935         except AttributeError:          # no; retrieve from filepath?
 936             name = getattr(db, 'filepath', None)
 937             if name:                    # got one; clean up.
 938                 name = os.path.basename(name)
 939                 name = name.split('.')[0]
 940             else:                       # generate one.
 941                 name = 'noname%d' % len(self.dicts)
 942
 943             if name in self.prefixDict:
 944                 logger.debug('''
 945 It appears that two different sequence databases are being assigned
 946 the same prefix ("%s").  For this reason, the attempted automatic
 947 construction of a PrefixUnionDict for you cannot be completed!  You
 948 should instead construct a PrefixUnionDict that assigns a unique
 949 prefix to each sequence database, and supply it directly as the
 950 seqDict argument to the NLMSA constructor.''' % id)
 951                 raise ValueError('''\
 952 cannot automatically construct PrefixUnionDict''')
 953
 954         self._add_prefix_dict(name, db)
 955
 956         return self
 957
 958     def _add_prefix_dict(self, name, d):
 959         self.prefixDict[name] = d
 960         self.dicts[d] = name
 961
 962 class _PUDSeqInfoDict(object, UserDict.DictMixin):
 963     """A wrapper object supplying a standard seqInfoDict interface for PUDs.
 964
 965     This class simply provides a standard dict interface that rewrites
 966     individual sequence IDs into the compound PrefixUnionDict seq IDs
 967     on the fly.
 968
 969     """
 970     def __init__(self, db):
 971         self.seqDB = db
 972
 973     def __iter__(self):
 974         return iter(self.seqDB)
 975
 976     def keys(self):
 977         return list(self.iterkeys())
 978
 979     def iterkeys(self):
 980         for (k, v) in self.iteritems():
 981             yield k
 982
 983     def itervalues(self):
 984         for (k, v) in self.iteritems():
 985             yield v
 986
 987     def iteritems(self):
 988         for p, d in self.seqDB.prefixDict.items():
 989             for seqID, info in d.seqInfoDict.iteritems():
 990                 yield self.seqDB.format_id(p, seqID), info
 991
 992     def __getitem__(self, k):
 993         prefix, seqID = self.seqDB.get_prefix_id(k)
 994         db = self.seqDB.prefixDict[prefix]
 995         return self.seqDB.get_subitem(db.seqInfoDict, seqID)
 996
 997     def has_key(self, k):
 998         return k in self.seqDB
 999
1000 #
1001 # @CTB stopped review here. ###################################################
1002 #
1003
1004 class BlastDB(SequenceFileDB):          # @CTB untested?
1005     '''Deprecated interface provided for backwards compatibility.
1006     Provides blast() and megablast() methods for searching your seq db.
1007     Instead of this, you should use the blast.BlastMapping, which provides
1008     a graph interface to BLAST, or MegablastMapping for megablast.'''
1009     def __reduce__(self): # provided only for compatibility w/ 0.7 clients
1010         return (classutil.ClassicUnpickler, (self.__class__,self.__getstate__()))
1011     def __init__(self, filepath=None, blastReady=False, blastIndexPath=None,
1012                  blastIndexDirs=None, **kwargs):
1013         "format database and build indexes if needed. Provide filepath or file object"
1014         SequenceFileDB.__init__(self, filepath, **kwargs)
1015     def __repr__(self):
1016         return "<BlastDB '%s'>" % (self.filepath)
1017     def blast(self, seq, al=None, blastpath='blastall',
1018               blastprog=None, expmax=0.001, maxseq=None, verbose=True,
1019               opts='', **kwargs):
1020         'run blast with the specified parameters, return NLMSA alignment'
1021         blastmap = self.formatdb()
1022         return blastmap(seq, al, blastpath, blastprog, expmax, maxseq,
1023                         verbose, opts, **kwargs)
1024     def megablast(self,seq, al=None, blastpath='megablast', expmax=1e-20,
1025                   maxseq=None, minIdentity=None, maskOpts='-U T -F m',
1026                   rmPath='RepeatMasker', rmOpts='-xsmall',
1027                   verbose=True, opts='', **kwargs):
1028         'run megablast with the specified parameters, return NLMSA alignment'
1029         from blast import MegablastMapping
1030         blastmap = self.formatdb(attr='megablastMap', mapClass=MegablastMapping)
1031         return blastmap(seq, al, blastpath, expmax, maxseq, minIdentity,
1032                         maskOpts, rmPath, rmOpts, verbose, opts, **kwargs)
1033     def formatdb(self, filepath=None, attr='blastMap', mapClass=None):
1034         'create a blast mapping object if needed, and ensure it is indexed'
1035         try: # see if mapping object already exists
1036             blastmap = getattr(self, attr)
1037         except AttributeError:
1038             if mapClass is None: # default: BlastMapping
1039                 from blast import BlastMapping
1040                 mapClass = BlastMapping
1041             blastmap = mapClass(self)
1042             setattr(self, attr, blastmap) # re-use this in the future
1043         blastmap.formatdb(filepath) # create index file if not already present
1044         return blastmap
1045
1046
1047 class BlastDBXMLRPC(BlastDB):
1048     'XMLRPC server wrapper around a standard BlastDB'
1049     xmlrpc_methods = dict(getSeqLen=0, get_strslice=0, getSeqLenDict=0,
1050                           get_db_size=0, get_seqtype=0, strslice='get_strslice')
1051     def getSeqLen(self,id):
1052         'get sequence length, or -1 if not found'
1053         try:
1054             return len(self[id])
1055         except KeyError:
1056             return -1  # SEQUENCE OBJECT DOES NOT EXIST
1057     def getSeqLenDict(self):
1058         'return seqLenDict over XMLRPC'
1059         d = {}
1060         for k,v in self.seqLenDict.items():
1061             d[k] = v[0],str(v[1]) # CONVERT TO STR TO ALLOW OFFSET>2GB
1062         return d # XML-RPC CANNOT HANDLE INT > 2 GB, SO FORCED TO CONVERT...
1063     def get_db_size(self):
1064         return len(self)
1065     def get_strslice(self,id,start,stop):
1066         'return string sequence for specified interval in the specified sequence'
1067         if start<0: # HANDLE NEGATIVE ORIENTATION
1068             return str((-(self[id]))[-stop:-start])
1069         else: # POSITIVE ORIENTATION
1070             return str(self[id][start:stop])
1071     def get_seqtype(self):
1072         return self._seqtype
1073
1074
1075
1076
1077 class XMLRPCSequence(SequenceBase):
1078     "Represents a sequence in a blast database, accessed via XMLRPC"
1079     def __init__(self, db, id):
1080         self.length = db.server.getSeqLen(id)
1081         if self.length<=0:
1082             raise KeyError('%s not in this database' % id)
1083         self.id = id
1084         SequenceBase.__init__(self)
1085     def strslice(self,start,end,useCache=True):
1086         "XMLRPC access to slice of a sequence"
1087         if useCache:
1088             try:
1089                 return self.db.strsliceCache(self,start,end)
1090             except IndexError: # NOT FOUND IN CACHE
1091                 pass # JUST USE OUR REGULAR XMLRPC METHOD
1092         return self.db.server.get_strslice(self.id,start,end) # GET FROM XMLRPC
1093     def __len__(self):
1094         return self.length
1095
1096 class XMLRPCSeqLenDescr(object):
1097     'descriptor that returns dictionary of remote server seqLenDict'
1098     def __init__(self,attr):
1099         self.attr = attr
1100     def __get__(self,obj,objtype):
1101         'only called if attribute does not already exist. Saves result as attribute'
1102         d = obj.server.getSeqLenDict()
1103         for k,v in d.items():
1104             d[k] = v[0],int(v[1]) # CONVERT OFFSET STR BACK TO INT
1105         obj.__dict__[self.attr] = d # PROVIDE DIRECTLY TO THE __dict__
1106         return d
1107
1108 class XMLRPCSequenceDB(SequenceDB):
1109     'XMLRPC client: access sequence database over XMLRPC'
1110     itemClass = XMLRPCSequence # sequence storage interface
1111     seqLenDict = XMLRPCSeqLenDescr('seqLenDict') # INTERFACE TO SEQLENDICT
1112     def __init__(self, url, name, *args, **kwargs):
1113         import coordinator
1114         self.server = coordinator.get_connection(url, name)
1115         self.url = url
1116         self.name = name
1117         self.seqInfoDict = _SeqLenDictWrapper(self)
1118         SequenceDB.__init__(self, *args, **kwargs)
1119     def __reduce__(self): # provided only for compatibility w/ 0.7 clients
1120         return (classutil.ClassicUnpickler, (self.__class__,self.__getstate__()))
1121     def __getstate__(self): # DO NOT pickle self.itemClass! We provide our own.
1122         return dict(url=self.url, name=self.name) # just need XMLRPC info
1123     def __len__(self):
1124         return self.server.get_db_size()
1125     def __contains__(self, k):
1126         if self.server.getSeqLen(k)>0:
1127             return True
1128         else:
1129             return False
1130     def _set_seqtype(self):
1131         'efficient way to determine sequence type of this database'
1132         try: # if already known, no need to do anything
1133             return self._seqtype
1134         except AttributeError:
1135             self._seqtype = self.server.get_seqtype()
1136             return self._seqtype
1137
1138