2 seqdb contains a set of classes for interacting with sequence databases.
4 Primary sequence database classes:
6 - SequenceDB - base class for sequence databases
7 - SequenceFileDB - file-based sequence database
8 - PrefixUnionDict - container to combine multiple sequence databases
9 - XMLRPCSequenceDB - XML-RPC-accessible sequence database
13 - SeqPrefixUnionDict - extends PrefixUnionDict to automatically add seqs
14 - BlastDB - implements NCBI-style name munging for lookups
16 Associated sequence classes:
18 - FileDBSequence - sequence associated with a SequenceFileDB
19 - XMLRPCSequence - sequence associated with a XMLRPCSequenceDB
23 SequenceDB provides some basic behaviors for sequence databases:
24 dictionary behavior, an invert operator interface, and caching for
25 both sequence objects and sequence intervals. It also relies on a
26 'seqInfoDict' attribute to contain summary information about
27 sequences, so that e.g. slice operations can be done without loading
28 the entire sequence into memory. (See below for more info on
31 SequenceFileDB extends SequenceDB to contain a file-based database of
32 sequences. It constructs a seqLenDict that allows direct on-disk lookup
33 of sequence intervals. (See below for more info on seqLenDict.)
35 PrefixUnionDict provides a unified SequenceDB-like interface to a
36 collection of sequence databases by combining the database name with
37 the sequence ID into a new sequence id. For example, the ID
38 'genome.chrI' would return the sequence 'chrI' in the 'genome'
39 database. This is particularly handy for situations where you want to
40 have seqdbs of multiple sequence types (DNA, protein, annotations,
41 etc.) all associated together.
43 @CTB document XMLRPCSequenceDB.
44 @CTB document SeqPrefixUnionDict.
45 @CTB document BlastDB.
49 The seqInfoDict interface
50 -------------------------
52 The seqInfoDict attribute of a SequenceDB is a dictionary-like object,
53 keyed by sequence IDs, with associated values being an information
54 object containing various attributes. seqInfoDict is essentially an
55 optimization that permits other pygr-aware components to access
56 information *about* sequences without actually loading the entire
59 The only required attribute at the moment is 'length', which is
60 required by some of the NLMSA code. However, seqInfoDict is a good
61 mechanism for the storage of any summary information on a sequence,
62 and so it may be expanded in the future.
64 The seqLenDict interface
65 ------------------------
67 The seqLenDict attribute is specific to a SequenceFileDB, where it
68 provides a file-backed storage of length and offset sequence metadata.
69 It is used to implement a key optimization in SequenceFileDB, in which
70 a sequence's offset within a file is used to read only the required
71 part of the sequence into memory. This optimization is particularly
72 important for large sequences, e.g. chromosomes, where reading the
73 entire sequence into memory shouldn't be done unless it's necessary.
75 The seqLenDict is keyed by sequence ID and the associated values are a
76 2-tuple (length, offset), where the offset indicates the byte offset
77 within the '.pureseq' index file created for each SequenceFileDB.
79 get_bound_subclass and the 'self.db' attribute
80 ----------------------------------------------
82 The SequenceDB constructor calls classutil.get_bound_subclass on its
83 itemClass. What does that do, and what is it for?
85 get_bound_subclass takes an existing class, makes a new subclass of
86 it, binds the variable 'db' to it, and then calls the _init_subclass
87 classmethod (if it exists) on the new class. This has the effect of
88 creating a new class for each SequenceDB instance, tied specifically
89 to that instance and initialized by the _init_subclass method.
91 The main effect of this for SequenceDBs is that for any SequenceDB
92 descendant, the '.db' attribute is automatically set for each Sequence
93 retrieved from the database.
95 CTB: I think this is an optimization?
100 @CTB discuss caching.
102 Pickling sequence databases and sequences
103 -----------------------------------------
105 @CTB document pickling issues.
109 - seqInfoDict, itemclass
111 extending SequenceFileDB
113 - using your own itemclass
114 - using your own reader
119 update docs for these classes!
122 - loading a FASTA file
124 + combining dbs, etc.
127 Code review issues, short term:
129 - @CTB get_bound_subclass stuff refers directly to itemClass to set 'db'.
130 - @CTB fix 'import *'s
131 - @CTB run lint/checker?
132 - @CTB test _create_seqLenDict
133 - @CTB XMLRPCSequenceDB vs SequenceFileDB
135 Some long term issues:
137 - it should be possible to remove _SeqLenDictSaver and just combine its
138 functionality with _store_seqlen_dict. --titus 3/21/09
142 from __future__
import generators
148 from sequence
import * # @CTB
149 from sqlgraph
import * # @CTB
151 from annotation
import AnnotationDB
, AnnotationSeq
, AnnotationSlice
, \
152 AnnotationServer
, AnnotationClient
156 from dbfile
import NoSuchFileError
160 # SequenceDB and associated support classes.
163 class _SequenceDBInverse(object):
164 """Implements __inverse__ on SequenceDB objects, returning seq name."""
165 def __init__(self
, db
):
168 def __getitem__(self
, seq
):
169 return seq
.pathForward
.id
171 def __contains__(self
, seq
):
173 return seq
.pathForward
.db
is self
.db
174 except AttributeError:
177 class SequenceDB(object, UserDict
.DictMixin
):
178 """Base class for sequence databases.
180 SequenceDB provides a few basic (base) behaviors:
181 - dict-like interface to sequence objects each with an ID
182 - the ~ (invert) operator returns an 'inverted' database, which is a
183 dict-like object that returns sequence names when given a sequence.
184 - weakref-based automatic flushing of seq objects no longer in use;
185 use autoGC=0 to turn this off.
186 - cacheHint() system for caching a given set of sequence
187 intervals associated with an owner object, which are flushed
188 from cache if the owner object is garbage-collected.
190 For subclassing, note that self.seqInfoDict must be set before
191 SequenceDB.__init__ is called!
194 # class to use for database-linked sequences; no default.
196 # class to use for sequence slices; see sequence.SeqPath.classySlice.
197 itemSliceClass
= SeqDBSlice
199 # pickling methods & what attributes to pickle.
200 __getstate__
= classutil
.standard_getstate
201 __setstate__
= classutil
.standard_setstate
202 _pickleAttrs
= dict(autoGC
=0)
204 # define ~ (invert) operator to return a lazily-created _SequenceDBInverse
205 __invert__
= classutil
.lazy_create_invert(_SequenceDBInverse
)
207 def __init__(self
, autoGC
=True, dbname
=None, **kwargs
):
208 """Initialize seq db from filepath or ifile."""
209 if autoGC
: # automatically garbage collect unused objects
210 self
._weakValueDict
= classutil
.RecentValueDictionary(autoGC
)
212 self
._weakValueDict
= {} # object cache @CTB not tested
215 # override itemClass and itemSliceClass if specified
216 self
.itemClass
= kwargs
.get('itemClass', self
.itemClass
)
217 self
.itemSliceClass
= kwargs
.get('itemSliceClass', self
.itemSliceClass
)
219 if self
.itemClass
is None:
220 raise TypeError, "must supply itemClass to SequenceDB"
222 # get a copy we can modify w/o side effects and bind itemClass.
223 kwargs
= kwargs
.copy()
225 classutil
.get_bound_subclass(self
, 'itemClass', dbname
,
228 # guess the sequence type
232 """Define a hash function to allow this object to be used as a key."""
235 def _set_seqtype(self
):
236 """Guess the seqtype from 100 chars of 1st seq if not already known."""
237 seqtype
= getattr(self
, '_seqtype', None)
238 if seqtype
is not None:
241 for seqID
in self
: # get an iterator
242 seq
= self
[seqID
] # get the 1st sequence
243 ch100
= str(seq
[:100])
244 self
._seqtype
= guess_seqtype(ch100
)
245 break # only process the 1st sequence!!!
247 _cache_max
=10000 # @CTB move? make settable?
248 def cacheHint(self
, ivalDict
, owner
):
249 """Save a cache hint dict: {id: (start, stop)}.
254 # @CTB refactor, test
255 for id, ival
in ivalDict
.items(): # BUILD THE CACHE DICTIONARY FOR owner
256 if ival
[0] < 0: # FORCE IVAL INTO POSITIVE ORIENTATION
257 ival
=(-ival
[1], -ival
[0]) # @CTB untested
258 if ival
[1]-ival
[0] > self
._cache
_max
: # TRUNCATE EXCESSIVE LENGTH
259 ival
=(ival
[0], ival
[0] + self
._cache
_max
) # @CTB untested
260 d
[id]=[ival
[0], ival
[1]]
262 self
._cache
[owner
] = d
# ADD TO EXISTING CACHE
263 except AttributeError:
264 self
._cache
= weakref
.WeakKeyDictionary() # AUTOMATICALLY REMOVE
265 self
._cache
[owner
] = d
# FROM CACHE IF owner GOES OUT OF SCOPE
266 def strsliceCache(self
, seq
, start
, stop
):
267 """Get strslice using cache hints, if any available."""
269 cacheDict
=self
._cache
270 except AttributeError:
271 raise IndexError('no cache present')
272 for owner
, d
in cacheDict
.items():
276 continue # NOT IN THIS CACHE, SO SKIP @CTB untested
277 ival_start
, ival_stop
= ival
[:2]
278 if start
>= ival_start
and stop
<= ival_stop
: # CONTAINED IN ival
280 s
= ival
[2] # get seq string from our cache
281 except IndexError: # use strslice() to retrieve from storage
282 s
= seq
.strslice(ival_start
, ival_stop
, useCache
=False)
284 try: # does owner want to reference this cached seq?
285 save_f
= owner
.cache_reference
286 except AttributeError:
287 pass # no, so nothing to do
288 else: # let owner control caching in our _weakValueDict
289 save_f(seq
) # # @CTB untested
290 return s
[start
- ival_start
:stop
- ival_start
]
291 raise IndexError('interval not found in cache') # @CTB untested
293 # these methods should all be implemented on all SequenceDBs.
295 pass # subclass should implement closing of its open resources!
298 return iter(self
.seqInfoDict
)
302 yield seqID
, self
[seqID
]
305 return len(self
.seqInfoDict
)
307 def __getitem__(self
, seqID
):
308 """Retrieve sequence by id, using cache if available."""
309 try: # for speed, default case (cache hit) should return immediately
310 return self
._weakValueDict
[seqID
]
311 except KeyError: # not in cache? try loading.
313 s
= self
.itemClass(self
, seqID
)
315 raise KeyError("no key '%s' in database %s"
316 % (seqID
, repr(self
)))
317 self
._weakValueDict
[seqID
] = s
# save in cache.
321 return self
.seqInfoDict
.keys()
323 def __contains__(self
, key
):
324 return key
in self
.seqInfoDict
327 return "<%s '%s'>" % (self
.__class
__.__name
__,
328 self
.itemClass
.__class
__.__name
__)
330 def clear_cache(self
):
331 """Empty the cache."""
332 self
._weakValueDict
.clear()
334 # these methods should not be implemented for read-only database.
335 clear
= setdefault
= pop
= popitem
= copy
= update
= \
336 classutil
.read_only_error
340 # FileDBSequence and SequenceFileDB, and associated support classes.
343 class _FileDBSeqDescriptor(object):
344 """Descriptor to retrieve entire sequence from seqLenDict."""
345 def __get__(self
, obj
, objtype
):
346 length
= obj
.db
.seqLenDict
[obj
.id][0]
347 return obj
.strslice(0, length
)
349 class FileDBSequence(SequenceBase
):
350 """Default sequence object for file-based storage mechanism.
352 See SequenceFileDB for the associated database class.
354 In general, you should not create objects from this class directly;
355 retrieve them from SequenceFileDB objects, instead.
357 NOTE: 'self.db' is attached to all instances of this class that come
358 from a particular database by 'classutil.get_bound_subclass'.
361 seq
= _FileDBSeqDescriptor() # dynamically retrieve 'seq'.
362 __reduce__
= classutil
.item_reducer
# for pickling purposes.
364 def __init__(self
, db
, id):
366 SequenceBase
.__init
__(self
)
367 if self
.id not in self
.db
.seqLenDict
:
368 raise KeyError('sequence %s not in db %s' % (self
.id, self
.db
))
371 """Unpack this sequence's length from the seqLenDict."""
372 return self
.db
.seqLenDict
[self
.id][0]
374 def strslice(self
, start
, end
, useCache
=True):
375 """Access slice of a sequence efficiently, using seqLenDict info."""
376 if useCache
: # If it's in the cache, use that!
378 return self
.db
.strsliceCache(self
, start
, end
)
382 return self
.db
.strslice(self
.id, start
, end
)
384 class SequenceFileDB(SequenceDB
):
385 """Main class for file-based storage of a sequence database.
387 By default, SequenceFileDB uses a seqLenDict, a.k.a. a shelve
388 index of sequence lengths and offsets, to retrieve sequence slices
389 with fseek. Thus entire chromosomes (for example) do not have to
390 be loaded to retrieve a subslice.
392 Takes one required argument, 'filepath', which should be the name
393 of a FASTA file (or a file whose format is understood by your
394 custom reader; see 'reader' kw arg, and the _store_seqlen_dict
397 The SequenceFileDB seqInfoDict interface is a wrapper around the
398 seqLenDict created by the __init__ function.
401 itemClass
= FileDBSequence
403 # copy _pickleAttrs and add 'filepath'
404 _pickleAttrs
= SequenceDB
._pickleAttrs
.copy()
405 _pickleAttrs
['filepath'] = 0
407 def __init__(self
, filepath
, reader
=None, **kwargs
):
408 # make filepath a pickleable attribute.
409 self
.filepath
= classutil
.SourceFileName(str(filepath
))
411 fullpath
= self
.filepath
+ '.seqlen'
412 # build the seqLenDict if it doesn't already exist
414 seqLenDict
= classutil
.open_shelve(fullpath
, 'r')
415 except NoSuchFileError
:
416 seqLenDict
= self
._create
_seqLenDict
(fullpath
, filepath
, reader
)
418 self
.seqLenDict
= seqLenDict
419 self
.seqInfoDict
= _SeqLenDictWrapper(self
) # standard interface
421 # initialize base class.
422 dbname
= os
.path
.basename(filepath
)
423 SequenceDB
.__init
__(self
, filepath
=filepath
, dbname
=dbname
, **kwargs
)
426 '''close our open shelve index file and _pureseq...'''
427 self
.seqLenDict
.close()
429 do_close
= self
._pureseq
.close
430 except AttributeError:
431 pass # _pureseq not open yet, so nothing to do
436 return "<%s '%s'>" % (self
.__class
__.__name
__, self
.filepath
)
438 def _create_seqLenDict(self
, dictpath
, seqpath
, reader
=None):
439 """Create a seqLenDict from 'seqpath' and store in 'dictpath'."""
440 seqLenDict
= classutil
.open_shelve(dictpath
, 'n')
442 logger
.debug('Building sequence length index...')
443 _store_seqlen_dict(seqLenDict
, seqpath
, reader
)
445 seqLenDict
.close() # close after writing, no matter what!
446 return classutil
.open_shelve(dictpath
, 'r') # re-open read-only
448 def strslice(self
, seqID
, start
, end
, useCache
=True):
449 """Access slice of a sequence efficiently, using seqLenDict info."""
450 # Retrieve sequence from the .pureseq file based on seqLenDict
454 except AttributeError:
455 fullpath
= self
.filepath
+ '.pureseq'
456 ifile
= file(fullpath
, 'rb')
457 self
._pureseq
= ifile
459 # Now, read in the actual slice.
460 offset
= self
.seqLenDict
[seqID
][1]
461 ifile
.seek(offset
+ start
)
462 return ifile
.read(end
- start
)
464 # Some support classes for the SeqLenDict mechanism.
466 class BasicSeqInfo(object):
467 """Wrapper to provide the correct seqInfoDict-style object information.
469 This boils down to providing id, db, length, and possibly offset.
472 def __init__(self
, seqID
, seqDB
, length
=None):
476 self
.length
= len(seqDB
[seqID
]) # generic but possibly slow
480 class _SeqLenObject(BasicSeqInfo
):
481 """Wrapper for use with a seqLenDict """
482 def __init__(self
, seqID
, seqDB
):
483 length
, self
.offset
= seqDB
.seqLenDict
[seqID
]
484 BasicSeqInfo
.__init
__(self
, seqID
, seqDB
, length
)
486 class BasicSeqInfoDict(object, UserDict
.DictMixin
):
487 """Wrapper around SequenceDB.seqLenDict to provide seqInfoDict behavior.
488 This basic version just gets the length from the sequence object itself.
490 itemClass
= BasicSeqInfo
491 def __init__(self
, db
):
494 def __getitem__(self
, k
):
495 return self
.itemClass(k
, self
.seqDB
)
498 return len(self
.seqDB
.seqLenDict
)
501 return iter(self
.seqDB
.seqLenDict
)
504 return self
.seqDB
.seqLenDict
.keys()
506 class _SeqLenDictWrapper(BasicSeqInfoDict
):
508 The default storage mechanism for sequences implemented by FileDBSequence
509 and SequenceFileDB puts everything in seqLenDict, a shelve index of
510 lengths and offsets. This class wraps that dictionary to provide the
511 interface that SequenceDB expects to see.
514 itemClass
= _SeqLenObject
516 class _SeqLenDictSaver(object):
517 """Support for generic reading functions, called by _store_seqlen_dict.
519 This allows you to specify your own 'reader' function when
520 constructing a FileSequenceDB, e.g. so that you could read
521 something other than FASTA files into a seqLenDict. Pass in this
522 function as the 'reader' kwarg to FileSequenceDB.
524 Custom reader functions should take a file handle and a filename,
525 and return a list of sequence info objects with 'id', 'length',
526 and 'sequence' attributes for each sequence in the given
527 file/filename. _SeqLenDictSaver will then construct a '.pureseq'
528 file containing the concatenated sequences and fill in the
529 seqLenDict appropriately.
532 def __init__(self
, reader
):
535 def __call__(self
, d
, ifile
, filename
):
537 pureseq_fp
= file(filename
+ '.pureseq', 'wb')
539 for o
in self
.reader(ifile
, filename
):
540 # store the length & offset in the seqLenDict
541 d
[o
.id] = o
.length
, offset
543 if o
.length
!= len(o
.sequence
):
544 raise ValueError('length does not match sequence: %s,%d'
546 pureseq_fp
.write(o
.sequence
)
550 def _store_seqlen_dict(d
, filename
, reader
=None, mode
='rU'):
551 """Store sequence lengths in a dictionary, e.g. a seqLenDict.
553 Used by SequenceFileDB._create_seqLenDict.
555 The 'reader' function implements a custom sequence format reader;
556 by default, _store_seqlen_dict uses seqfmt.read_fasta_lengths,
557 which reads FASTA-format files. See _SeqLenDictSaver for
558 information on building a custom 'reader', and see the seqdb docs
562 # if a custom reader function was passed in, run that.
563 builder
= seqfmt
.read_fasta_lengths
564 if reader
is not None:
565 builder
= _SeqLenDictSaver(reader
)
567 ifile
= file(filename
, mode
)
569 builder(d
, ifile
, filename
) # run the builder on our sequence set
575 # class PrefixUnionDict and associated support classes.
578 class _PrefixUnionDictInverse(object):
579 """Provide inverse (~) operator behavior for PrefixUnionDicts.
581 This enables ~pud to return a database that, given a sequence
582 object, returns the corresponding key (prefix.id) to retrieve that
583 sequence object in the pud.
587 def __init__(self
, db
):
590 def __getitem__(self
, ival
):
591 seq
= ival
.pathForward
# get the top-level sequence object
592 try: # for speed, normal case should execute immediately
593 prefix
= self
.db
.dicts
[seq
.db
]
596 # @CTB abstraction boundary violation! keep? how test?
597 if seq
.pathForward
._anno
_seq
.db
in self
.db
.dicts
:
599 this annotation is not in the PrefixUnion, but its sequence is.
600 You can get that using its \'sequence\' attribute.''')
601 except AttributeError:
603 raise KeyError('seq.db not in PrefixUnionDict')
605 return prefix
+ self
.db
.separator
+ str(seq
.id)
607 def __contains__(self
, seq
):
609 return seq
.pathForward
.db
in self
.db
.dicts
610 except AttributeError:
614 class _PrefixUnionMemberDict(object, UserDict
.DictMixin
):
616 @CTB confusing/inappropriate use of a dict interface! keep??
618 'd[prefix]=value; d[k] returns value if k is a member of prefix'
620 def __init__(self
, puDict
, default
=None,
621 attrMethod
=lambda x
:x
.pathForward
.db
):
624 self
._attrMethod
= attrMethod
625 if default
is not None:
626 self
.default
= default
# @CTB can we use setdefault for this?
629 return self
.puDict
.prefixDict
.keys()
630 possibleKeys
= keys
# legacy interface (?)
632 def __setitem__(self
, k
, v
):
633 if k
not in self
.puDict
.prefixDict
:
634 raise KeyError('key %s is not a valid union prefix string!' % k
)
635 new_k
= self
.puDict
.prefixDict
[k
]
636 self
.values
[new_k
] = v
638 def __getitem__(self
, k
):
640 db
= self
._attrMethod
(k
)
641 except AttributeError:
642 raise TypeError('wrong key type? _attrMethod() failed.')
644 if db
not in self
.values
:
647 except AttributeError: # no default value - raise KeyError.
648 raise KeyError('key not a member of this union!')
650 return self
.values
[db
]
652 class PrefixUnionDict(object, UserDict
.DictMixin
):
653 """Interface to a set of sequence DBs, each assigned a unique prefix.
655 For example, the sequence ID 'foo.bar' would unpack to ID 'bar' in
656 the dictionary associated with the prefix 'foo'. This is a useful
657 way to combine disparate seqdbs into a single db, without actually
658 altering the individual seqdbs.
660 PrefixUnionDicts can be created in one of two ways: either
661 - pass in a dictionary containing prefix-to-seqdb mappings as
663 - pass in a header file containing the information necessary to create
666 In the latter case, see the 'writeHeaderFile' method for format
667 information. The optional kwarg 'trypath' contains a list of
668 directories to search for the database file named in each line.
669 The optional kwarg 'dbClass' specifies the database class to use
670 to load each sequence file; it defaults to SequenceFileDB.
672 The default ID separator is '.'; use the 'separator' kwarg to
675 @CTB trypath => trypaths?
678 # define ~ (invert) operator to return a lazily-created _PUDInverse.
679 __invert__
= classutil
.lazy_create_invert(_PrefixUnionDictInverse
)
681 def __init__(self
, prefixDict
=None, separator
='.', filename
=None,
682 dbClass
=SequenceFileDB
, trypath
=None):
683 # read union header file
684 if filename
is not None:
687 cannot create with prefixDict and filename both!''')
690 trypath
= [os
.path
.dirname(filename
)]
691 ifile
= file(filename
, 'rU')
694 separator
= it
.next().strip('\r\n') # remove leading/trailing CR
697 prefix
, filepath
=line
.strip().split('\t')[:2]
699 dbfile
= classutil
.search_dirs_for_file(filepath
, trypath
)
702 for db
in prefixDict
.values():
703 db
.close() # close databases before exiting
705 unable to open database %s: check path or privileges.
706 Set 'trypath' to give a list of directories to search.''' % filepath
)
708 prefixDict
[prefix
] = db
712 self
.separator
= separator
713 if prefixDict
is not None:
714 self
.prefixDict
= prefixDict
718 # also create a reverse mapping
720 for k
, v
in self
.prefixDict
.items():
724 self
.seqInfoDict
= _PUDSeqInfoDict(self
) # supply standard interface
726 def format_id(self
, prefix
, seqID
):
727 return prefix
+ self
.separator
+ seqID
729 def get_prefix_id(self
, k
):
730 """Subdivide a key into a prefix and ID using the given separator."""
732 t
= k
.split(self
.separator
, 2) # split into no more than 3 fields
733 except AttributeError:
734 raise KeyError('key should be a string! ' + repr(k
))
739 raise KeyError('invalid id format; no prefix: ' + k
)
740 else: # id contains separator character?
741 prefix
= t
[0] # assume prefix doesn't contain separator @CTB untested
742 seqID
= k
[len(prefix
) + 1:] # skip past prefix
745 def get_subitem(self
, d
, seqID
):
749 except (ValueError, KeyError, TypeError):
752 # otherwise, use default (str) key
756 raise KeyError, "no key '%s' in %s" % (seqID
, repr(d
))
758 def __getitem__(self
, k
):
759 """For 'foo.bar', return 'bar' in dict associated with prefix 'foo'"""
760 prefix
, seqID
= self
.get_prefix_id(k
)
762 d
= self
.prefixDict
[prefix
]
764 raise KeyError("no key '%s' in %s" % (k
, repr(self
)))
765 return self
.get_subitem(d
, seqID
)
767 def __contains__(self
, k
):
768 """Is the given ID in our PrefixUnionDict?"""
769 # try it out as an ID.
770 if isinstance(k
, str):
772 (prefix
, id) = self
.get_prefix_id(k
)
773 return id in self
.prefixDict
[prefix
]
777 # otherwise, try treating key as a sequence.
778 # @CTB inconsistent with 'getitem'.
780 db
= k
.pathForward
.db
781 except AttributeError:
782 raise AttributeError('key must be a sequence with db attribute!')
783 return db
in self
.dicts
785 def has_key(self
, k
):
786 return self
.__contains
__(k
)
789 for p
, d
in self
.prefixDict
.items():
791 yield self
.format_id(p
, id)
794 return list(self
.iterkeys())
800 for p
,d
in self
.prefixDict
.items():
801 for id, seq
in d
.iteritems():
802 yield self
.format_id(p
, id), seq
804 def getName(self
, ival
):
805 """For a given sequence, return a fully qualified name, 'prefix.id'."""
806 seq
= ival
.pathForward
# get the top-level sequence object
807 return self
.dicts
[seq
.db
] + self
.separator
+ seq
.id
809 def newMemberDict(self
, **kwargs
): # @CTB not used; necessary?
810 """return a new member dictionary (empty)"""
811 return _PrefixUnionMemberDict(self
, **kwargs
)
813 def writeHeaderFile(self
,filename
): # @CTB not used; necessary?
814 """Save a header file, suitable for later re-creation."""
815 ifile
= file(filename
, 'w')
816 print >>ifile
, self
.separator
817 for k
, v
in self
.prefixDict
.items():
819 print >>ifile
, '%s\t%s' % (k
, v
.filepath
)
820 except AttributeError:
821 raise AttributeError('''\
822 seq db '%s' has no filepath; you may be able to save this to worldbase,
823 but not to a text HeaderFile!''' % k
)
828 for db
in self
.dicts
:
832 def cacheHint(self
, ivalDict
, owner
=None): # @CTB untested
833 '''save a cache hint dict of {id:(start,stop)}'''
835 # extract separate cache hint dict for each prefix
836 for longID
,ival
in ivalDict
.items():
837 prefix
, seqID
= self
.get_prefix_id(longID
)
838 d
.setdefault(prefix
, {})[seqID
] = ival
839 for prefix
,seqDict
in d
.items():
841 m
= self
.prefixDict
[prefix
].cacheHint
842 except AttributeError: # subdict can't cacheHint(), so just ignore
845 # pass cache hint down to subdictionary
848 # not clear what this should do for PrefixUnionDict
849 copy
= setdefault
= update
= classutil
.method_not_implemented
851 # these methods should not be implemented for read-only database.
852 clear
= pop
= popitem
= classutil
.read_only_error
855 class _PrefixDictInverseAdder(_PrefixUnionDictInverse
):
856 """Inverse class for SeqPrefixUnionDict; adds sequences when looked up.
858 @CTB is getName only used by __getitem__? Make private?
861 def getName(self
, seq
):
862 """Find in or add the given sequence to the inverse of a PUD."""
864 return _PrefixUnionDictInverse
.__getitem
__(self
, seq
)
865 except AttributeError: # no seq.db? treat as a user sequence.
866 new_id
= 'user' + self
.db
.separator
+ seq
.pathForward
.id
867 # check to make sure it's already in the user seq db...
871 def __getitem__(self
, seq
):
872 """__getitem__ interface that calls getName."""
874 return self
.getName(seq
)
876 if not self
.db
.addAll
:
879 # if we should add, add seq & re-try.
881 return self
.getName(seq
)
884 class SeqPrefixUnionDict(PrefixUnionDict
):
885 """SeqPrefixUnionDict provides += functionality to add seqs to a PUD.
887 See the __iadd__ method for details.
889 If addAll is True, then looking a sequence up in the inverse db will
890 automatically add it to the PrefixUnionDict.
893 __invert__
= classutil
.lazy_create_invert(_PrefixDictInverseAdder
)
895 def __init__(self
, addAll
=False, **kwargs
):
896 PrefixUnionDict
.__init
__(self
, **kwargs
)
898 # override default PrefixUnionDict __invert__ to add sequences;
899 # see classutil.lazy_create_invert.
900 self
.addAll
= addAll
# see _PrefixDictInverseAdder behavior.
902 def __iadd__(self
, k
):
903 """Add a sequence or database to the PUD, with a unique prefix.
905 NOTE: __iadd__ must return self.
908 # seq or db already present?
912 db
= getattr(k
, 'db', None)
913 if db
is None: # annotation sequence?
914 db
= getattr(k
.pathForward
, 'db', None) # @CTB untested
916 if db
is None: # this is a user sequence, with no container; create.
917 if not self
.prefixDict
.has_key('user'):
919 self
._add
_prefix
_dict
('user', d
)
921 d
= self
.prefixDict
['user']
923 # now add the sequence
924 d
[k
.pathForward
.id] = k
.pathForward
927 # already contain? nothing to do.
928 if db
in self
.dicts
: # @CTB can this 'if' ever be true?
931 # ok, not present; add, with a unique name. does it have
934 name
= db
._persistent
_id
.split('.')[-1]
935 except AttributeError: # no; retrieve from filepath?
936 name
= getattr(db
, 'filepath', None)
937 if name
: # got one; clean up.
938 name
= os
.path
.basename(name
)
939 name
= name
.split('.')[0]
940 else: # generate one.
941 name
= 'noname%d' % len(self
.dicts
)
943 if name
in self
.prefixDict
:
945 It appears that two different sequence databases are being assigned
946 the same prefix ("%s"). For this reason, the attempted automatic
947 construction of a PrefixUnionDict for you cannot be completed! You
948 should instead construct a PrefixUnionDict that assigns a unique
949 prefix to each sequence database, and supply it directly as the
950 seqDict argument to the NLMSA constructor.''' % id)
951 raise ValueError('''\
952 cannot automatically construct PrefixUnionDict''')
954 self
._add
_prefix
_dict
(name
, db
)
958 def _add_prefix_dict(self
, name
, d
):
959 self
.prefixDict
[name
] = d
962 class _PUDSeqInfoDict(object, UserDict
.DictMixin
):
963 """A wrapper object supplying a standard seqInfoDict interface for PUDs.
965 This class simply provides a standard dict interface that rewrites
966 individual sequence IDs into the compound PrefixUnionDict seq IDs
970 def __init__(self
, db
):
974 return iter(self
.seqDB
)
977 return list(self
.iterkeys())
980 for (k
, v
) in self
.iteritems():
983 def itervalues(self
):
984 for (k
, v
) in self
.iteritems():
988 for p
, d
in self
.seqDB
.prefixDict
.items():
989 for seqID
, info
in d
.seqInfoDict
.iteritems():
990 yield self
.seqDB
.format_id(p
, seqID
), info
992 def __getitem__(self
, k
):
993 prefix
, seqID
= self
.seqDB
.get_prefix_id(k
)
994 db
= self
.seqDB
.prefixDict
[prefix
]
995 return self
.seqDB
.get_subitem(db
.seqInfoDict
, seqID
)
997 def has_key(self
, k
):
998 return k
in self
.seqDB
1001 # @CTB stopped review here. ###################################################
1004 class BlastDB(SequenceFileDB
): # @CTB untested?
1005 '''Deprecated interface provided for backwards compatibility.
1006 Provides blast() and megablast() methods for searching your seq db.
1007 Instead of this, you should use the blast.BlastMapping, which provides
1008 a graph interface to BLAST, or MegablastMapping for megablast.'''
1009 def __reduce__(self
): # provided only for compatibility w/ 0.7 clients
1010 return (classutil
.ClassicUnpickler
, (self
.__class
__,self
.__getstate
__()))
1011 def __init__(self
, filepath
=None, blastReady
=False, blastIndexPath
=None,
1012 blastIndexDirs
=None, **kwargs
):
1013 "format database and build indexes if needed. Provide filepath or file object"
1014 SequenceFileDB
.__init
__(self
, filepath
, **kwargs
)
1016 return "<BlastDB '%s'>" % (self
.filepath
)
1017 def blast(self
, seq
, al
=None, blastpath
='blastall',
1018 blastprog
=None, expmax
=0.001, maxseq
=None, verbose
=True,
1020 'run blast with the specified parameters, return NLMSA alignment'
1021 blastmap
= self
.formatdb()
1022 return blastmap(seq
, al
, blastpath
, blastprog
, expmax
, maxseq
,
1023 verbose
, opts
, **kwargs
)
1024 def megablast(self
,seq
, al
=None, blastpath
='megablast', expmax
=1e-20,
1025 maxseq
=None, minIdentity
=None, maskOpts
='-U T -F m',
1026 rmPath
='RepeatMasker', rmOpts
='-xsmall',
1027 verbose
=True, opts
='', **kwargs
):
1028 'run megablast with the specified parameters, return NLMSA alignment'
1029 from blast
import MegablastMapping
1030 blastmap
= self
.formatdb(attr
='megablastMap', mapClass
=MegablastMapping
)
1031 return blastmap(seq
, al
, blastpath
, expmax
, maxseq
, minIdentity
,
1032 maskOpts
, rmPath
, rmOpts
, verbose
, opts
, **kwargs
)
1033 def formatdb(self
, filepath
=None, attr
='blastMap', mapClass
=None):
1034 'create a blast mapping object if needed, and ensure it is indexed'
1035 try: # see if mapping object already exists
1036 blastmap
= getattr(self
, attr
)
1037 except AttributeError:
1038 if mapClass
is None: # default: BlastMapping
1039 from blast
import BlastMapping
1040 mapClass
= BlastMapping
1041 blastmap
= mapClass(self
)
1042 setattr(self
, attr
, blastmap
) # re-use this in the future
1043 blastmap
.formatdb(filepath
) # create index file if not already present
1047 class BlastDBXMLRPC(BlastDB
):
1048 'XMLRPC server wrapper around a standard BlastDB'
1049 xmlrpc_methods
= dict(getSeqLen
=0, get_strslice
=0, getSeqLenDict
=0,
1050 get_db_size
=0, get_seqtype
=0, strslice
='get_strslice')
1051 def getSeqLen(self
,id):
1052 'get sequence length, or -1 if not found'
1054 return len(self
[id])
1056 return -1 # SEQUENCE OBJECT DOES NOT EXIST
1057 def getSeqLenDict(self
):
1058 'return seqLenDict over XMLRPC'
1060 for k
,v
in self
.seqLenDict
.items():
1061 d
[k
] = v
[0],str(v
[1]) # CONVERT TO STR TO ALLOW OFFSET>2GB
1062 return d
# XML-RPC CANNOT HANDLE INT > 2 GB, SO FORCED TO CONVERT...
1063 def get_db_size(self
):
1065 def get_strslice(self
,id,start
,stop
):
1066 'return string sequence for specified interval in the specified sequence'
1067 if start
<0: # HANDLE NEGATIVE ORIENTATION
1068 return str((-(self
[id]))[-stop
:-start
])
1069 else: # POSITIVE ORIENTATION
1070 return str(self
[id][start
:stop
])
1071 def get_seqtype(self
):
1072 return self
._seqtype
1077 class XMLRPCSequence(SequenceBase
):
1078 "Represents a sequence in a blast database, accessed via XMLRPC"
1079 def __init__(self
, db
, id):
1080 self
.length
= db
.server
.getSeqLen(id)
1082 raise KeyError('%s not in this database' % id)
1084 SequenceBase
.__init
__(self
)
1085 def strslice(self
,start
,end
,useCache
=True):
1086 "XMLRPC access to slice of a sequence"
1089 return self
.db
.strsliceCache(self
,start
,end
)
1090 except IndexError: # NOT FOUND IN CACHE
1091 pass # JUST USE OUR REGULAR XMLRPC METHOD
1092 return self
.db
.server
.get_strslice(self
.id,start
,end
) # GET FROM XMLRPC
1096 class XMLRPCSeqLenDescr(object):
1097 'descriptor that returns dictionary of remote server seqLenDict'
1098 def __init__(self
,attr
):
1100 def __get__(self
,obj
,objtype
):
1101 'only called if attribute does not already exist. Saves result as attribute'
1102 d
= obj
.server
.getSeqLenDict()
1103 for k
,v
in d
.items():
1104 d
[k
] = v
[0],int(v
[1]) # CONVERT OFFSET STR BACK TO INT
1105 obj
.__dict
__[self
.attr
] = d
# PROVIDE DIRECTLY TO THE __dict__
1108 class XMLRPCSequenceDB(SequenceDB
):
1109 'XMLRPC client: access sequence database over XMLRPC'
1110 itemClass
= XMLRPCSequence
# sequence storage interface
1111 seqLenDict
= XMLRPCSeqLenDescr('seqLenDict') # INTERFACE TO SEQLENDICT
1112 def __init__(self
, url
, name
, *args
, **kwargs
):
1114 self
.server
= coordinator
.get_connection(url
, name
)
1117 self
.seqInfoDict
= _SeqLenDictWrapper(self
)
1118 SequenceDB
.__init
__(self
, *args
, **kwargs
)
1119 def __reduce__(self
): # provided only for compatibility w/ 0.7 clients
1120 return (classutil
.ClassicUnpickler
, (self
.__class
__,self
.__getstate
__()))
1121 def __getstate__(self
): # DO NOT pickle self.itemClass! We provide our own.
1122 return dict(url
=self
.url
, name
=self
.name
) # just need XMLRPC info
1124 return self
.server
.get_db_size()
1125 def __contains__(self
, k
):
1126 if self
.server
.getSeqLen(k
)>0:
1130 def _set_seqtype(self
):
1131 'efficient way to determine sequence type of this database'
1132 try: # if already known, no need to do anything
1133 return self
._seqtype
1134 except AttributeError:
1135 self
._seqtype
= self
.server
.get_seqtype()
1136 return self
._seqtype