protect mysql tests using mysql_enabled()
[pygr.git] / pygr / seqdb.py
blob1c00a8bef5038a56a7a5bbe6c1036ac00101d115
1 """
2 seqdb contains a set of classes for interacting with sequence databases.
4 Primary sequence database classes:
6 - SequenceDB - base class for sequence databases
7 - SequenceFileDB - file-based sequence database
8 - PrefixUnionDict - container to combine multiple sequence databases
9 - XMLRPCSequenceDB - XML-RPC-accessible sequence database
11 Extensions:
13 - SeqPrefixUnionDict - extends PrefixUnionDict to automatically add seqs
14 - BlastDB - implements NCBI-style name munging for lookups
16 Associated sequence classes:
18 - FileDBSequence - sequence associated with a SequenceFileDB
19 - XMLRPCSequence - sequence associated with a XMLRPCSequenceDB
21 ----
23 SequenceDB provides some basic behaviors for sequence databases:
24 dictionary behavior, an invert operator interface, and caching for
25 both sequence objects and sequence intervals. It also relies on a
26 'seqInfoDict' attribute to contain summary information about
27 sequences, so that e.g. slice operations can be done without loading
28 the entire sequence into memory. (See below for more info on
29 seqInfoDict.)
31 SequenceFileDB extends SequenceDB to contain a file-based database of
32 sequences. It constructs a seqLenDict that allows direct on-disk lookup
33 of sequence intervals. (See below for more info on seqLenDict.)
35 PrefixUnionDict provides a unified SequenceDB-like interface to a
36 collection of sequence databases by combining the database name with
37 the sequence ID into a new sequence id. For example, the ID
38 'genome.chrI' would return the sequence 'chrI' in the 'genome'
39 database. This is particularly handy for situations where you want to
40 have seqdbs of multiple sequence types (DNA, protein, annotations,
41 etc.) all associated together.
43 @CTB document XMLRPCSequenceDB.
44 @CTB document SeqPrefixUnionDict.
45 @CTB document BlastDB.
47 ----
49 The seqInfoDict interface
50 -------------------------
52 The seqInfoDict attribute of a SequenceDB is a dictionary-like object,
53 keyed by sequence IDs, with associated values being an information
54 object containing various attributes. seqInfoDict is essentially an
55 optimization that permits other pygr-aware components to access
56 information *about* sequences without actually loading the entire
57 sequence.
59 The only required attribute at the moment is 'length', which is
60 required by some of the NLMSA code. However, seqInfoDict is a good
61 mechanism for the storage of any summary information on a sequence,
62 and so it may be expanded in the future.
64 The seqLenDict interface
65 ------------------------
67 The seqLenDict attribute is specific to a SequenceFileDB, where it
68 provides a file-backed storage of length and offset sequence metadata.
69 It is used to implement a key optimization in SequenceFileDB, in which
70 a sequence's offset within a file is used to read only the required
71 part of the sequence into memory. This optimization is particularly
72 important for large sequences, e.g. chromosomes, where reading the
73 entire sequence into memory shouldn't be done unless it's necessary.
75 The seqLenDict is keyed by sequence ID and the associated values are a
76 2-tuple (length, offset), where the offset indicates the byte offset
77 within the '.pureseq' index file created for each SequenceFileDB.
79 get_bound_subclass and the 'self.db' attribute
80 ----------------------------------------------
82 The SequenceDB constructor calls classutil.get_bound_subclass on its
83 itemClass. What does that do, and what is it for?
85 get_bound_subclass takes an existing class, makes a new subclass of
86 it, binds the variable 'db' to it, and then calls the _init_subclass
87 classmethod (if it exists) on the new class. This has the effect of
88 creating a new class for each SequenceDB instance, tied specifically
89 to that instance and initialized by the _init_subclass method.
91 The main effect of this for SequenceDBs is that for any SequenceDB
92 descendant, the '.db' attribute is automatically set for each Sequence
93 retrieved from the database.
95 CTB: I think this is an optimization?
97 Caching
98 -------
100 @CTB discuss caching.
102 Pickling sequence databases and sequences
103 -----------------------------------------
105 @CTB document pickling issues.
106 programmer notes:
108 extending SequenceDB
109 - seqInfoDict, itemclass
111 extending SequenceFileDB
112 - seqLenDict
113 - using your own itemclass
114 - using your own reader
116 doctests & examples
117 -------------------
119 update docs for these classes!
121 intro:
122 - loading a FASTA file
123 - using a PUD
124 + combining dbs, etc.
125 + inverse
127 Code review issues, short term:
129 - @CTB get_bound_subclass stuff refers directly to itemClass to set 'db'.
130 - @CTB fix 'import *'s
131 - @CTB run lint/checker?
132 - @CTB test _create_seqLenDict
133 - @CTB XMLRPCSequenceDB vs SequenceFileDB
135 Some long term issues:
137 - it should be possible to remove _SeqLenDictSaver and just combine its
138 functionality with _store_seqlen_dict. --titus 3/21/09
142 from __future__ import generators
143 import sys
144 import os
145 import UserDict
146 import weakref
148 from sequence import * # @CTB
149 from sqlgraph import * # @CTB
150 import classutil
151 from annotation import AnnotationDB, AnnotationSeq, AnnotationSlice, \
152 AnnotationServer, AnnotationClient
153 import logger
154 import seqfmt
156 from dbfile import NoSuchFileError
158 ####
160 # SequenceDB and associated support classes.
163 class _SequenceDBInverse(object):
164 """Implements __inverse__ on SequenceDB objects, returning seq name."""
165 def __init__(self, db):
166 self.db = db
168 def __getitem__(self, seq):
169 return seq.pathForward.id
171 def __contains__(self, seq):
172 try:
173 return seq.pathForward.db is self.db
174 except AttributeError:
175 return False
177 class SequenceDB(object, UserDict.DictMixin):
178 """Base class for sequence databases.
180 SequenceDB provides a few basic (base) behaviors:
181 - dict-like interface to sequence objects each with an ID
182 - the ~ (invert) operator returns an 'inverted' database, which is a
183 dict-like object that returns sequence names when given a sequence.
184 - weakref-based automatic flushing of seq objects no longer in use;
185 use autoGC=0 to turn this off.
186 - cacheHint() system for caching a given set of sequence
187 intervals associated with an owner object, which are flushed
188 from cache if the owner object is garbage-collected.
190 For subclassing, note that self.seqInfoDict must be set before
191 SequenceDB.__init__ is called!
194 # class to use for database-linked sequences; no default.
195 itemClass = None
196 # class to use for sequence slices; see sequence.SeqPath.classySlice.
197 itemSliceClass = SeqDBSlice
199 # pickling methods & what attributes to pickle.
200 __getstate__ = classutil.standard_getstate
201 __setstate__ = classutil.standard_setstate
202 _pickleAttrs = dict(autoGC=0)
204 # define ~ (invert) operator to return a lazily-created _SequenceDBInverse
205 __invert__ = classutil.lazy_create_invert(_SequenceDBInverse)
207 def __init__(self, autoGC=True, dbname=None, **kwargs):
208 """Initialize seq db from filepath or ifile."""
209 if autoGC: # automatically garbage collect unused objects
210 self._weakValueDict = classutil.RecentValueDictionary(autoGC)
211 else:
212 self._weakValueDict = {} # object cache @CTB not tested
213 self.autoGC = autoGC
215 # override itemClass and itemSliceClass if specified
216 self.itemClass = kwargs.get('itemClass', self.itemClass)
217 self.itemSliceClass = kwargs.get('itemSliceClass', self.itemSliceClass)
219 if self.itemClass is None:
220 raise TypeError, "must supply itemClass to SequenceDB"
222 # get a copy we can modify w/o side effects and bind itemClass.
223 kwargs = kwargs.copy()
224 kwargs['db'] = self
225 classutil.get_bound_subclass(self, 'itemClass', dbname,
226 subclassArgs=kwargs)
228 # guess the sequence type
229 self._set_seqtype()
231 def __hash__(self):
232 """Define a hash function to allow this object to be used as a key."""
233 return id(self)
235 def _set_seqtype(self):
236 """Guess the seqtype from 100 chars of 1st seq if not already known."""
237 seqtype = getattr(self, '_seqtype', None)
238 if seqtype is not None:
239 return
241 for seqID in self: # get an iterator
242 seq = self[seqID] # get the 1st sequence
243 ch100 = str(seq[:100])
244 self._seqtype = guess_seqtype(ch100)
245 break # only process the 1st sequence!!!
247 _cache_max=10000 # @CTB move? make settable?
248 def cacheHint(self, ivalDict, owner):
249 """Save a cache hint dict: {id: (start, stop)}.
251 @CTB document!
253 d={}
254 # @CTB refactor, test
255 for id, ival in ivalDict.items(): # BUILD THE CACHE DICTIONARY FOR owner
256 if ival[0] < 0: # FORCE IVAL INTO POSITIVE ORIENTATION
257 ival=(-ival[1], -ival[0]) # @CTB untested
258 if ival[1]-ival[0] > self._cache_max: # TRUNCATE EXCESSIVE LENGTH
259 ival=(ival[0], ival[0] + self._cache_max) # @CTB untested
260 d[id]=[ival[0], ival[1]]
261 try:
262 self._cache[owner] = d # ADD TO EXISTING CACHE
263 except AttributeError:
264 self._cache = weakref.WeakKeyDictionary() # AUTOMATICALLY REMOVE
265 self._cache[owner] = d # FROM CACHE IF owner GOES OUT OF SCOPE
266 def strsliceCache(self, seq, start, stop):
267 """Get strslice using cache hints, if any available."""
268 try:
269 cacheDict=self._cache
270 except AttributeError:
271 raise IndexError('no cache present')
272 for owner, d in cacheDict.items():
273 try:
274 ival = d[seq.id]
275 except KeyError:
276 continue # NOT IN THIS CACHE, SO SKIP @CTB untested
277 ival_start, ival_stop = ival[:2]
278 if start >= ival_start and stop <= ival_stop: # CONTAINED IN ival
279 try:
280 s = ival[2] # get seq string from our cache
281 except IndexError: # use strslice() to retrieve from storage
282 s = seq.strslice(ival_start, ival_stop, useCache=False)
283 ival.append(s)
284 try: # does owner want to reference this cached seq?
285 save_f = owner.cache_reference
286 except AttributeError:
287 pass # no, so nothing to do
288 else: # let owner control caching in our _weakValueDict
289 save_f(seq) # # @CTB untested
290 return s[start - ival_start:stop - ival_start]
291 raise IndexError('interval not found in cache') # @CTB untested
293 # these methods should all be implemented on all SequenceDBs.
294 def close(self):
295 pass # subclass should implement closing of its open resources!
297 def __iter__(self):
298 return iter(self.seqInfoDict)
300 def iteritems(self):
301 for seqID in self:
302 yield seqID, self[seqID]
304 def __len__(self):
305 return len(self.seqInfoDict)
307 def __getitem__(self, seqID):
308 """Retrieve sequence by id, using cache if available."""
309 try: # for speed, default case (cache hit) should return immediately
310 return self._weakValueDict[seqID]
311 except KeyError: # not in cache? try loading.
312 try:
313 s = self.itemClass(self, seqID)
314 except KeyError:
315 raise KeyError("no key '%s' in database %s"
316 % (seqID, repr(self)))
317 self._weakValueDict[seqID] = s # save in cache.
318 return s
320 def keys(self):
321 return self.seqInfoDict.keys()
323 def __contains__(self, key):
324 return key in self.seqInfoDict
326 def __repr__(self):
327 return "<%s '%s'>" % (self.__class__.__name__,
328 self.itemClass.__class__.__name__)
330 def clear_cache(self):
331 """Empty the cache."""
332 self._weakValueDict.clear()
334 # these methods should not be implemented for read-only database.
335 clear = setdefault = pop = popitem = copy = update = \
336 classutil.read_only_error
338 ####
340 # FileDBSequence and SequenceFileDB, and associated support classes.
343 class _FileDBSeqDescriptor(object):
344 """Descriptor to retrieve entire sequence from seqLenDict."""
345 def __get__(self, obj, objtype):
346 length = obj.db.seqLenDict[obj.id][0]
347 return obj.strslice(0, length)
349 class FileDBSequence(SequenceBase):
350 """Default sequence object for file-based storage mechanism.
352 See SequenceFileDB for the associated database class.
354 In general, you should not create objects from this class directly;
355 retrieve them from SequenceFileDB objects, instead.
357 NOTE: 'self.db' is attached to all instances of this class that come
358 from a particular database by 'classutil.get_bound_subclass'.
361 seq = _FileDBSeqDescriptor() # dynamically retrieve 'seq'.
362 __reduce__ = classutil.item_reducer # for pickling purposes.
364 def __init__(self, db, id):
365 self.id = id
366 SequenceBase.__init__(self)
367 if self.id not in self.db.seqLenDict:
368 raise KeyError('sequence %s not in db %s' % (self.id, self.db))
370 def __len__(self):
371 """Unpack this sequence's length from the seqLenDict."""
372 return self.db.seqLenDict[self.id][0]
374 def strslice(self, start, end, useCache=True):
375 """Access slice of a sequence efficiently, using seqLenDict info."""
376 if useCache: # If it's in the cache, use that!
377 try:
378 return self.db.strsliceCache(self, start, end)
379 except IndexError:
380 pass
382 return self.db.strslice(self.id, start, end)
384 class SequenceFileDB(SequenceDB):
385 """Main class for file-based storage of a sequence database.
387 By default, SequenceFileDB uses a seqLenDict, a.k.a. a shelve
388 index of sequence lengths and offsets, to retrieve sequence slices
389 with fseek. Thus entire chromosomes (for example) do not have to
390 be loaded to retrieve a subslice.
392 Takes one required argument, 'filepath', which should be the name
393 of a FASTA file (or a file whose format is understood by your
394 custom reader; see 'reader' kw arg, and the _store_seqlen_dict
395 function).
397 The SequenceFileDB seqInfoDict interface is a wrapper around the
398 seqLenDict created by the __init__ function.
401 itemClass = FileDBSequence
403 # copy _pickleAttrs and add 'filepath'
404 _pickleAttrs = SequenceDB._pickleAttrs.copy()
405 _pickleAttrs['filepath'] = 0
407 def __init__(self, filepath, reader=None, **kwargs):
408 # make filepath a pickleable attribute.
409 self.filepath = classutil.SourceFileName(str(filepath))
411 fullpath = self.filepath + '.seqlen'
412 # build the seqLenDict if it doesn't already exist
413 try:
414 seqLenDict = classutil.open_shelve(fullpath, 'r')
415 except NoSuchFileError:
416 seqLenDict = self._create_seqLenDict(fullpath, filepath, reader)
418 self.seqLenDict = seqLenDict
419 self.seqInfoDict = _SeqLenDictWrapper(self) # standard interface
421 # initialize base class.
422 dbname = os.path.basename(filepath)
423 SequenceDB.__init__(self, filepath=filepath, dbname=dbname, **kwargs)
425 def close(self):
426 '''close our open shelve index file and _pureseq...'''
427 self.seqLenDict.close()
428 try:
429 do_close = self._pureseq.close
430 except AttributeError:
431 pass # _pureseq not open yet, so nothing to do
432 else:
433 do_close()
435 def __repr__(self):
436 return "<%s '%s'>" % (self.__class__.__name__, self.filepath)
438 def _create_seqLenDict(self, dictpath, seqpath, reader=None):
439 """Create a seqLenDict from 'seqpath' and store in 'dictpath'."""
440 seqLenDict = classutil.open_shelve(dictpath, 'n')
441 try:
442 logger.debug('Building sequence length index...')
443 _store_seqlen_dict(seqLenDict, seqpath, reader)
444 finally:
445 seqLenDict.close() # close after writing, no matter what!
446 return classutil.open_shelve(dictpath, 'r') # re-open read-only
448 def strslice(self, seqID, start, end, useCache=True):
449 """Access slice of a sequence efficiently, using seqLenDict info."""
450 # Retrieve sequence from the .pureseq file based on seqLenDict
451 # information.
452 try:
453 ifile=self._pureseq
454 except AttributeError:
455 fullpath = self.filepath + '.pureseq'
456 ifile = file(fullpath, 'rb')
457 self._pureseq = ifile
459 # Now, read in the actual slice.
460 offset = self.seqLenDict[seqID][1]
461 ifile.seek(offset + start)
462 return ifile.read(end - start)
464 # Some support classes for the SeqLenDict mechanism.
466 class BasicSeqInfo(object):
467 """Wrapper to provide the correct seqInfoDict-style object information.
469 This boils down to providing id, db, length, and possibly offset.
472 def __init__(self, seqID, seqDB, length=None):
473 self.id = seqID
474 self.db = seqDB
475 if length is None:
476 self.length = len(seqDB[seqID]) # generic but possibly slow
477 else:
478 self.length = length
480 class _SeqLenObject(BasicSeqInfo):
481 """Wrapper for use with a seqLenDict """
482 def __init__(self, seqID, seqDB):
483 length, self.offset = seqDB.seqLenDict[seqID]
484 BasicSeqInfo.__init__(self, seqID, seqDB, length)
486 class BasicSeqInfoDict(object, UserDict.DictMixin):
487 """Wrapper around SequenceDB.seqLenDict to provide seqInfoDict behavior.
488 This basic version just gets the length from the sequence object itself.
490 itemClass = BasicSeqInfo
491 def __init__(self, db):
492 self.seqDB = db
494 def __getitem__(self, k):
495 return self.itemClass(k, self.seqDB)
497 def __len__(self):
498 return len(self.seqDB.seqLenDict)
500 def __iter__(self):
501 return iter(self.seqDB.seqLenDict)
503 def keys(self):
504 return self.seqDB.seqLenDict.keys()
506 class _SeqLenDictWrapper(BasicSeqInfoDict):
508 The default storage mechanism for sequences implemented by FileDBSequence
509 and SequenceFileDB puts everything in seqLenDict, a shelve index of
510 lengths and offsets. This class wraps that dictionary to provide the
511 interface that SequenceDB expects to see.
514 itemClass = _SeqLenObject
516 class _SeqLenDictSaver(object):
517 """Support for generic reading functions, called by _store_seqlen_dict.
519 This allows you to specify your own 'reader' function when
520 constructing a FileSequenceDB, e.g. so that you could read
521 something other than FASTA files into a seqLenDict. Pass in this
522 function as the 'reader' kwarg to FileSequenceDB.
524 Custom reader functions should take a file handle and a filename,
525 and return a list of sequence info objects with 'id', 'length',
526 and 'sequence' attributes for each sequence in the given
527 file/filename. _SeqLenDictSaver will then construct a '.pureseq'
528 file containing the concatenated sequences and fill in the
529 seqLenDict appropriately.
532 def __init__(self, reader):
533 self.reader = reader
535 def __call__(self, d, ifile, filename):
536 offset = 0L
537 pureseq_fp = file(filename + '.pureseq', 'wb')
538 try:
539 for o in self.reader(ifile, filename):
540 # store the length & offset in the seqLenDict
541 d[o.id] = o.length, offset
542 offset += o.length
543 if o.length != len(o.sequence):
544 raise ValueError('length does not match sequence: %s,%d'
545 % (o.id, o.length))
546 pureseq_fp.write(o.sequence)
547 finally:
548 pureseq_fp.close()
550 def _store_seqlen_dict(d, filename, reader=None, mode='rU'):
551 """Store sequence lengths in a dictionary, e.g. a seqLenDict.
553 Used by SequenceFileDB._create_seqLenDict.
555 The 'reader' function implements a custom sequence format reader;
556 by default, _store_seqlen_dict uses seqfmt.read_fasta_lengths,
557 which reads FASTA-format files. See _SeqLenDictSaver for
558 information on building a custom 'reader', and see the seqdb docs
559 for an example.
562 # if a custom reader function was passed in, run that.
563 builder = seqfmt.read_fasta_lengths
564 if reader is not None:
565 builder = _SeqLenDictSaver(reader)
567 ifile = file(filename, mode)
568 try:
569 builder(d, ifile, filename) # run the builder on our sequence set
570 finally:
571 ifile.close()
573 ####
575 # class PrefixUnionDict and associated support classes.
578 class _PrefixUnionDictInverse(object):
579 """Provide inverse (~) operator behavior for PrefixUnionDicts.
581 This enables ~pud to return a database that, given a sequence
582 object, returns the corresponding key (prefix.id) to retrieve that
583 sequence object in the pud.
587 def __init__(self, db):
588 self.db = db
590 def __getitem__(self, ival):
591 seq = ival.pathForward # get the top-level sequence object
592 try: # for speed, normal case should execute immediately
593 prefix = self.db.dicts[seq.db]
594 except KeyError:
595 try:
596 # @CTB abstraction boundary violation! keep? how test?
597 if seq.pathForward._anno_seq.db in self.db.dicts:
598 raise KeyError('''\
599 this annotation is not in the PrefixUnion, but its sequence is.
600 You can get that using its \'sequence\' attribute.''')
601 except AttributeError:
602 pass
603 raise KeyError('seq.db not in PrefixUnionDict')
605 return prefix + self.db.separator + str(seq.id)
607 def __contains__(self, seq):
608 try:
609 return seq.pathForward.db in self.db.dicts
610 except AttributeError:
611 return False
614 class _PrefixUnionMemberDict(object, UserDict.DictMixin):
616 @CTB confusing/inappropriate use of a dict interface! keep??
617 @CTB document.
618 'd[prefix]=value; d[k] returns value if k is a member of prefix'
620 def __init__(self, puDict, default=None,
621 attrMethod=lambda x:x.pathForward.db):
622 self.values = {}
623 self.puDict = puDict
624 self._attrMethod = attrMethod
625 if default is not None:
626 self.default = default # @CTB can we use setdefault for this?
628 def keys(self):
629 return self.puDict.prefixDict.keys()
630 possibleKeys = keys # legacy interface (?)
632 def __setitem__(self, k, v):
633 if k not in self.puDict.prefixDict:
634 raise KeyError('key %s is not a valid union prefix string!' % k)
635 new_k = self.puDict.prefixDict[k]
636 self.values[new_k] = v
638 def __getitem__(self, k):
639 try:
640 db = self._attrMethod(k)
641 except AttributeError:
642 raise TypeError('wrong key type? _attrMethod() failed.')
644 if db not in self.values:
645 try:
646 return self.default
647 except AttributeError: # no default value - raise KeyError.
648 raise KeyError('key not a member of this union!')
650 return self.values[db]
652 class PrefixUnionDict(object, UserDict.DictMixin):
653 """Interface to a set of sequence DBs, each assigned a unique prefix.
655 For example, the sequence ID 'foo.bar' would unpack to ID 'bar' in
656 the dictionary associated with the prefix 'foo'. This is a useful
657 way to combine disparate seqdbs into a single db, without actually
658 altering the individual seqdbs.
660 PrefixUnionDicts can be created in one of two ways: either
661 - pass in a dictionary containing prefix-to-seqdb mappings as
662 'prefixDict', or
663 - pass in a header file containing the information necessary to create
664 such a dictionary.
666 In the latter case, see the 'writeHeaderFile' method for format
667 information. The optional kwarg 'trypath' contains a list of
668 directories to search for the database file named in each line.
669 The optional kwarg 'dbClass' specifies the database class to use
670 to load each sequence file; it defaults to SequenceFileDB.
672 The default ID separator is '.'; use the 'separator' kwarg to
673 change it.
675 @CTB trypath => trypaths?
678 # define ~ (invert) operator to return a lazily-created _PUDInverse.
679 __invert__ = classutil.lazy_create_invert(_PrefixUnionDictInverse)
681 def __init__(self, prefixDict=None, separator='.', filename=None,
682 dbClass=SequenceFileDB, trypath=None):
683 # read union header file
684 if filename is not None:
685 if prefixDict:
686 raise TypeError('''
687 cannot create with prefixDict and filename both!''')
689 if trypath is None:
690 trypath = [os.path.dirname(filename)]
691 ifile = file(filename, 'rU')
692 try:
693 it = iter(ifile)
694 separator = it.next().strip('\r\n') # remove leading/trailing CR
695 prefixDict = {}
696 for line in it:
697 prefix, filepath=line.strip().split('\t')[:2]
698 try:
699 dbfile = classutil.search_dirs_for_file(filepath, trypath)
700 db = dbClass(dbfile)
701 except IOError:
702 for db in prefixDict.values():
703 db.close() # close databases before exiting
704 raise IOError('''\
705 unable to open database %s: check path or privileges.
706 Set 'trypath' to give a list of directories to search.''' % filepath)
707 else:
708 prefixDict[prefix] = db
709 finally:
710 ifile.close()
712 self.separator = separator
713 if prefixDict is not None:
714 self.prefixDict = prefixDict
715 else:
716 self.prefixDict = {}
718 # also create a reverse mapping
719 d = {}
720 for k, v in self.prefixDict.items():
721 d[v] = k
723 self.dicts = d
724 self.seqInfoDict = _PUDSeqInfoDict(self) # supply standard interface
726 def format_id(self, prefix, seqID):
727 return prefix + self.separator + seqID
729 def get_prefix_id(self, k):
730 """Subdivide a key into a prefix and ID using the given separator."""
731 try:
732 t = k.split(self.separator, 2) # split into no more than 3 fields
733 except AttributeError:
734 raise KeyError('key should be a string! ' + repr(k))
735 l = len(t)
736 if l == 2:
737 return t
738 elif l<2:
739 raise KeyError('invalid id format; no prefix: ' + k)
740 else: # id contains separator character?
741 prefix = t[0] # assume prefix doesn't contain separator @CTB untested
742 seqID = k[len(prefix) + 1:] # skip past prefix
743 return prefix, seqID
745 def get_subitem(self, d, seqID):
746 # try int key first
747 try:
748 return d[int(seqID)]
749 except (ValueError, KeyError, TypeError):
750 pass
752 # otherwise, use default (str) key
753 try:
754 return d[seqID]
755 except KeyError:
756 raise KeyError, "no key '%s' in %s" % (seqID, repr(d))
758 def __getitem__(self, k):
759 """For 'foo.bar', return 'bar' in dict associated with prefix 'foo'"""
760 prefix, seqID = self.get_prefix_id(k)
761 try:
762 d = self.prefixDict[prefix]
763 except KeyError, e:
764 raise KeyError("no key '%s' in %s" % (k, repr(self)))
765 return self.get_subitem(d, seqID)
767 def __contains__(self, k):
768 """Is the given ID in our PrefixUnionDict?"""
769 # try it out as an ID.
770 if isinstance(k, str):
771 try:
772 (prefix, id) = self.get_prefix_id(k)
773 return id in self.prefixDict[prefix]
774 except KeyError:
775 return False
777 # otherwise, try treating key as a sequence.
778 # @CTB inconsistent with 'getitem'.
779 try:
780 db = k.pathForward.db
781 except AttributeError:
782 raise AttributeError('key must be a sequence with db attribute!')
783 return db in self.dicts
785 def has_key(self, k):
786 return self.__contains__(k)
788 def __iter__(self):
789 for p, d in self.prefixDict.items():
790 for id in d:
791 yield self.format_id(p, id)
793 def keys(self):
794 return list(self.iterkeys())
796 def iterkeys(self):
797 return iter(self)
799 def iteritems(self):
800 for p,d in self.prefixDict.items():
801 for id, seq in d.iteritems():
802 yield self.format_id(p, id), seq
804 def getName(self, ival):
805 """For a given sequence, return a fully qualified name, 'prefix.id'."""
806 seq = ival.pathForward # get the top-level sequence object
807 return self.dicts[seq.db] + self.separator + seq.id
809 def newMemberDict(self, **kwargs): # @CTB not used; necessary?
810 """return a new member dictionary (empty)"""
811 return _PrefixUnionMemberDict(self, **kwargs)
813 def writeHeaderFile(self,filename): # @CTB not used; necessary?
814 """Save a header file, suitable for later re-creation."""
815 ifile = file(filename, 'w')
816 print >>ifile, self.separator
817 for k, v in self.prefixDict.items():
818 try:
819 print >>ifile, '%s\t%s' % (k, v.filepath)
820 except AttributeError:
821 raise AttributeError('''\
822 seq db '%s' has no filepath; you may be able to save this to worldbase,
823 but not to a text HeaderFile!''' % k)
824 ifile.close()
826 def __len__(self):
828 for db in self.dicts:
829 n += len(db)
830 return n
832 def cacheHint(self, ivalDict, owner=None): # @CTB untested
833 '''save a cache hint dict of {id:(start,stop)}'''
834 d={}
835 # extract separate cache hint dict for each prefix
836 for longID,ival in ivalDict.items():
837 prefix, seqID = self.get_prefix_id(longID)
838 d.setdefault(prefix, {})[seqID] = ival
839 for prefix,seqDict in d.items():
840 try:
841 m = self.prefixDict[prefix].cacheHint
842 except AttributeError: # subdict can't cacheHint(), so just ignore
843 pass
844 else:
845 # pass cache hint down to subdictionary
846 m(seqDict, owner)
848 # not clear what this should do for PrefixUnionDict
849 copy = setdefault = update = classutil.method_not_implemented
851 # these methods should not be implemented for read-only database.
852 clear = pop = popitem = classutil.read_only_error
855 class _PrefixDictInverseAdder(_PrefixUnionDictInverse):
856 """Inverse class for SeqPrefixUnionDict; adds sequences when looked up.
858 @CTB is getName only used by __getitem__? Make private?
861 def getName(self, seq):
862 """Find in or add the given sequence to the inverse of a PUD."""
863 try:
864 return _PrefixUnionDictInverse.__getitem__(self, seq)
865 except AttributeError: # no seq.db? treat as a user sequence.
866 new_id = 'user' + self.db.separator + seq.pathForward.id
867 # check to make sure it's already in the user seq db...
868 _ = self.db[new_id]
869 return new_id
871 def __getitem__(self, seq):
872 """__getitem__ interface that calls getName."""
873 try:
874 return self.getName(seq)
875 except KeyError:
876 if not self.db.addAll:
877 raise
879 # if we should add, add seq & re-try.
880 self.db += seq
881 return self.getName(seq)
884 class SeqPrefixUnionDict(PrefixUnionDict):
885 """SeqPrefixUnionDict provides += functionality to add seqs to a PUD.
887 See the __iadd__ method for details.
889 If addAll is True, then looking a sequence up in the inverse db will
890 automatically add it to the PrefixUnionDict.
893 __invert__ = classutil.lazy_create_invert(_PrefixDictInverseAdder)
895 def __init__(self, addAll=False, **kwargs):
896 PrefixUnionDict.__init__(self, **kwargs)
898 # override default PrefixUnionDict __invert__ to add sequences;
899 # see classutil.lazy_create_invert.
900 self.addAll = addAll # see _PrefixDictInverseAdder behavior.
902 def __iadd__(self, k):
903 """Add a sequence or database to the PUD, with a unique prefix.
905 NOTE: __iadd__ must return self.
908 # seq or db already present?
909 if k in (~self):
910 return self
912 db = getattr(k, 'db', None)
913 if db is None: # annotation sequence?
914 db = getattr(k.pathForward, 'db', None) # @CTB untested
916 if db is None: # this is a user sequence, with no container; create.
917 if not self.prefixDict.has_key('user'):
918 d = KeepUniqueDict()
919 self._add_prefix_dict('user', d)
920 else:
921 d = self.prefixDict['user']
923 # now add the sequence
924 d[k.pathForward.id] = k.pathForward
925 return self
927 # already contain? nothing to do.
928 if db in self.dicts: # @CTB can this 'if' ever be true?
929 return self
931 # ok, not present; add, with a unique name. does it have
932 # _persistent_id?
933 try:
934 name = db._persistent_id.split('.')[-1]
935 except AttributeError: # no; retrieve from filepath?
936 name = getattr(db, 'filepath', None)
937 if name: # got one; clean up.
938 name = os.path.basename(name)
939 name = name.split('.')[0]
940 else: # generate one.
941 name = 'noname%d' % len(self.dicts)
943 if name in self.prefixDict:
944 logger.debug('''
945 It appears that two different sequence databases are being assigned
946 the same prefix ("%s"). For this reason, the attempted automatic
947 construction of a PrefixUnionDict for you cannot be completed! You
948 should instead construct a PrefixUnionDict that assigns a unique
949 prefix to each sequence database, and supply it directly as the
950 seqDict argument to the NLMSA constructor.''' % id)
951 raise ValueError('''\
952 cannot automatically construct PrefixUnionDict''')
954 self._add_prefix_dict(name, db)
956 return self
958 def _add_prefix_dict(self, name, d):
959 self.prefixDict[name] = d
960 self.dicts[d] = name
962 class _PUDSeqInfoDict(object, UserDict.DictMixin):
963 """A wrapper object supplying a standard seqInfoDict interface for PUDs.
965 This class simply provides a standard dict interface that rewrites
966 individual sequence IDs into the compound PrefixUnionDict seq IDs
967 on the fly.
970 def __init__(self, db):
971 self.seqDB = db
973 def __iter__(self):
974 return iter(self.seqDB)
976 def keys(self):
977 return list(self.iterkeys())
979 def iterkeys(self):
980 for (k, v) in self.iteritems():
981 yield k
983 def itervalues(self):
984 for (k, v) in self.iteritems():
985 yield v
987 def iteritems(self):
988 for p, d in self.seqDB.prefixDict.items():
989 for seqID, info in d.seqInfoDict.iteritems():
990 yield self.seqDB.format_id(p, seqID), info
992 def __getitem__(self, k):
993 prefix, seqID = self.seqDB.get_prefix_id(k)
994 db = self.seqDB.prefixDict[prefix]
995 return self.seqDB.get_subitem(db.seqInfoDict, seqID)
997 def has_key(self, k):
998 return k in self.seqDB
1001 # @CTB stopped review here. ###################################################
1004 class BlastDB(SequenceFileDB): # @CTB untested?
1005 '''Deprecated interface provided for backwards compatibility.
1006 Provides blast() and megablast() methods for searching your seq db.
1007 Instead of this, you should use the blast.BlastMapping, which provides
1008 a graph interface to BLAST, or MegablastMapping for megablast.'''
1009 def __reduce__(self): # provided only for compatibility w/ 0.7 clients
1010 return (classutil.ClassicUnpickler, (self.__class__,self.__getstate__()))
1011 def __init__(self, filepath=None, blastReady=False, blastIndexPath=None,
1012 blastIndexDirs=None, **kwargs):
1013 "format database and build indexes if needed. Provide filepath or file object"
1014 SequenceFileDB.__init__(self, filepath, **kwargs)
1015 def __repr__(self):
1016 return "<BlastDB '%s'>" % (self.filepath)
1017 def blast(self, seq, al=None, blastpath='blastall',
1018 blastprog=None, expmax=0.001, maxseq=None, verbose=True,
1019 opts='', **kwargs):
1020 'run blast with the specified parameters, return NLMSA alignment'
1021 blastmap = self.formatdb()
1022 return blastmap(seq, al, blastpath, blastprog, expmax, maxseq,
1023 verbose, opts, **kwargs)
1024 def megablast(self,seq, al=None, blastpath='megablast', expmax=1e-20,
1025 maxseq=None, minIdentity=None, maskOpts='-U T -F m',
1026 rmPath='RepeatMasker', rmOpts='-xsmall',
1027 verbose=True, opts='', **kwargs):
1028 'run megablast with the specified parameters, return NLMSA alignment'
1029 from blast import MegablastMapping
1030 blastmap = self.formatdb(attr='megablastMap', mapClass=MegablastMapping)
1031 return blastmap(seq, al, blastpath, expmax, maxseq, minIdentity,
1032 maskOpts, rmPath, rmOpts, verbose, opts, **kwargs)
1033 def formatdb(self, filepath=None, attr='blastMap', mapClass=None):
1034 'create a blast mapping object if needed, and ensure it is indexed'
1035 try: # see if mapping object already exists
1036 blastmap = getattr(self, attr)
1037 except AttributeError:
1038 if mapClass is None: # default: BlastMapping
1039 from blast import BlastMapping
1040 mapClass = BlastMapping
1041 blastmap = mapClass(self)
1042 setattr(self, attr, blastmap) # re-use this in the future
1043 blastmap.formatdb(filepath) # create index file if not already present
1044 return blastmap
1047 class BlastDBXMLRPC(BlastDB):
1048 'XMLRPC server wrapper around a standard BlastDB'
1049 xmlrpc_methods = dict(getSeqLen=0, get_strslice=0, getSeqLenDict=0,
1050 get_db_size=0, get_seqtype=0, strslice='get_strslice')
1051 def getSeqLen(self,id):
1052 'get sequence length, or -1 if not found'
1053 try:
1054 return len(self[id])
1055 except KeyError:
1056 return -1 # SEQUENCE OBJECT DOES NOT EXIST
1057 def getSeqLenDict(self):
1058 'return seqLenDict over XMLRPC'
1059 d = {}
1060 for k,v in self.seqLenDict.items():
1061 d[k] = v[0],str(v[1]) # CONVERT TO STR TO ALLOW OFFSET>2GB
1062 return d # XML-RPC CANNOT HANDLE INT > 2 GB, SO FORCED TO CONVERT...
1063 def get_db_size(self):
1064 return len(self)
1065 def get_strslice(self,id,start,stop):
1066 'return string sequence for specified interval in the specified sequence'
1067 if start<0: # HANDLE NEGATIVE ORIENTATION
1068 return str((-(self[id]))[-stop:-start])
1069 else: # POSITIVE ORIENTATION
1070 return str(self[id][start:stop])
1071 def get_seqtype(self):
1072 return self._seqtype
1077 class XMLRPCSequence(SequenceBase):
1078 "Represents a sequence in a blast database, accessed via XMLRPC"
1079 def __init__(self, db, id):
1080 self.length = db.server.getSeqLen(id)
1081 if self.length<=0:
1082 raise KeyError('%s not in this database' % id)
1083 self.id = id
1084 SequenceBase.__init__(self)
1085 def strslice(self,start,end,useCache=True):
1086 "XMLRPC access to slice of a sequence"
1087 if useCache:
1088 try:
1089 return self.db.strsliceCache(self,start,end)
1090 except IndexError: # NOT FOUND IN CACHE
1091 pass # JUST USE OUR REGULAR XMLRPC METHOD
1092 return self.db.server.get_strslice(self.id,start,end) # GET FROM XMLRPC
1093 def __len__(self):
1094 return self.length
1096 class XMLRPCSeqLenDescr(object):
1097 'descriptor that returns dictionary of remote server seqLenDict'
1098 def __init__(self,attr):
1099 self.attr = attr
1100 def __get__(self,obj,objtype):
1101 'only called if attribute does not already exist. Saves result as attribute'
1102 d = obj.server.getSeqLenDict()
1103 for k,v in d.items():
1104 d[k] = v[0],int(v[1]) # CONVERT OFFSET STR BACK TO INT
1105 obj.__dict__[self.attr] = d # PROVIDE DIRECTLY TO THE __dict__
1106 return d
1108 class XMLRPCSequenceDB(SequenceDB):
1109 'XMLRPC client: access sequence database over XMLRPC'
1110 itemClass = XMLRPCSequence # sequence storage interface
1111 seqLenDict = XMLRPCSeqLenDescr('seqLenDict') # INTERFACE TO SEQLENDICT
1112 def __init__(self, url, name, *args, **kwargs):
1113 import coordinator
1114 self.server = coordinator.get_connection(url, name)
1115 self.url = url
1116 self.name = name
1117 self.seqInfoDict = _SeqLenDictWrapper(self)
1118 SequenceDB.__init__(self, *args, **kwargs)
1119 def __reduce__(self): # provided only for compatibility w/ 0.7 clients
1120 return (classutil.ClassicUnpickler, (self.__class__,self.__getstate__()))
1121 def __getstate__(self): # DO NOT pickle self.itemClass! We provide our own.
1122 return dict(url=self.url, name=self.name) # just need XMLRPC info
1123 def __len__(self):
1124 return self.server.get_db_size()
1125 def __contains__(self, k):
1126 if self.server.getSeqLen(k)>0:
1127 return True
1128 else:
1129 return False
1130 def _set_seqtype(self):
1131 'efficient way to determine sequence type of this database'
1132 try: # if already known, no need to do anything
1133 return self._seqtype
1134 except AttributeError:
1135 self._seqtype = self.server.get_seqtype()
1136 return self._seqtype