Bio/Tools/SeqPattern.pm

   1 # $Id$
   2 #
   3 # bioperl module for Bio::Tools::SeqPattern
   4 #
   5 # Cared for by  Steve Chervitz  (sac-at-bioperl.org)
   6 #
   7 # Copyright  Steve Chervitz
   8 #
   9 # You may distribute this module under the same terms as perl itself
  10
  11 # POD documentation - main docs before the code
  12
  13 =head1 NAME
  14
  15 Bio::Tools::SeqPattern - represent a sequence pattern or motif
  16
  17 =head1 SYNOPSIS
  18
  19  use Bio::Tools::SeqPattern;
  20
  21  my $pat1     = 'T[GA]AA...TAAT';
  22  my $pattern1 = Bio::Tools::SeqPattern->new(-SEQ =>$pat1, -TYPE =>'Dna');
  23
  24  my $pat2     = '[VILM]R(GXX){2,3}...[^PG]';
  25  my $pattern2 = Bio::Tools::SeqPattern->new(-SEQ =>$pat2, -TYPE =>'Amino');
  26
  27 =head1 DESCRIPTION
  28
  29 The L<Bio::Tools::SeqPattern> module encapsulates generic data and
  30 methods for manipulating regular expressions describing nucleic or
  31 amino acid sequence patterns (a.k.a. "motifs").
  32
  33 L<Bio::Tools::SeqPattern> is a concrete class that inherits from L<Bio::Root::Root>.
  34
  35 This class grew out of a need to have a standard module for doing routine
  36 tasks with sequence patterns such as:
  37
  38   -- Forming a reverse-complement version of a nucleotide sequence pattern
  39   -- Expanding patterns containing ambiguity codes
  40   -- Checking for invalid regexp characters
  41   -- Untainting yet preserving special characters in the pattern
  42
  43 Other features to look for in the future:
  44
  45   -- Full pattern syntax checking
  46   -- Conversion between expanded and condensed forms of the pattern
  47
  48 =head1 MOTIVATIONS
  49
  50 A key motivation for L<Bio::Tools::SeqPattern> is to have a way to
  51 generate a reverse complement of a nucleotide sequence pattern.
  52 This makes possible simultaneous pattern matching on both sense and
  53 anti-sense strands of a query sequence.
  54
  55 In principle, one could do such a search by testing a pattern
  56 against both sense and anti-sense versions of a sequence.
  57 It is entirely equivalent to test a regexp containing both sense and
  58 anti-sense versions of the *pattern* against one copy of the sequence.
  59 The latter approach is much more efficient since:
  60
  61    1) You need only one copy of the sequence.
  62    2) Only one regexp is executed.
  63    3) Regexp patterns are typically much smaller than sequences.
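
A minimal sketch of the single-regexp approach, in plain Perl without this module. The helper name find_both_strands and the example motif are illustrative assumptions; the anti-sense pattern is derived by hand here, which is the step revcom() is meant to automate.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::Tools::SeqPattern): returns the
# 0-based start offsets of every non-overlapping hit of either pattern.
sub find_both_strands {
    my ($seq, $sense, $antisense) = @_;
    my $re = qr/(?:$sense|$antisense)/;   # one regexp covering both strands
    my @starts;
    while ($seq =~ /$re/g) {
        push @starts, $-[0];              # offset where this match began
    }
    return @starts;
}

my $sense     = 'T[GA]AAC';    # example motif (assumption)
my $antisense = 'GTT[CT]A';    # its reverse complement, derived by hand
my @hits = find_both_strands('CCTGAACGGGTTTACC', $sense, $antisense);
print "hits at: @hits\n";      # prints "hits at: 2 9"
```

The alternation keeps the scan to one pass over one copy of the sequence. Hits that overlap an earlier hit are not reported by this sketch.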
  64
  65 Patterns can be quite complex and it is often difficult to
  66 generate the reverse complement pattern. The Bioperl SeqPattern.pm
  67 addresses this problem, providing a convenient set of tools
  68 for working with biological sequence regular expressions.
  69
  70 Not all patterns have been tested. If you discover a pattern that
  71 is not handled properly by Bio::Tools::SeqPattern.pm, please
  72 send me some email (sac@bioperl.org). Thanks.
  73
  74 =head1 OTHER FEATURES
  75
  76 =head2 Extended Alphabet Support
  77
  78 This module supports the same set of ambiguity codes for nucleotide
  79 sequences as supported by L<Bio::Seq>. These ambiguity codes
  80 define the behavior of the L<expand> method.
  81
  82  ------------------------------------------
  83  Symbol       Meaning      Nucleic Acid
  84  ------------------------------------------
  85   A            A           (A)denine
  86   C            C           (C)ytosine
  87   G            G           (G)uanine
  88   T            T           (T)hymine
  89   U            U           (U)racil
  90   M          A or C        a(M)ino group
  91   R          A or G        pu(R)ine
  92   W          A or T        (W)eak bond
  93   S          C or G        (S)trong bond
  94   Y          C or T        p(Y)rimidine
  95   K          G or T        (K)eto group
  96   V        A or C or G
  97   H        A or C or T
  98   D        A or G or T
  99   B        C or G or T
 100   X      G or A or T or C
 101   N      G or A or T or C
 102   .      G or A or T or C
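
The nucleotide table above can be read as a lookup from symbol to character class. A minimal sketch follows, assuming a hypothetical helper expand_nuc_sketch that is not part of this module. The module's own expansion works through a fixed series of substitutions, so this illustrates the mapping rather than the module's implementation. Mapping N or X inside brackets to ACGT is a choice made for this sketch.

```perl
use strict;
use warnings;

# Two- and three-base ambiguity codes from the table above.
my %IUPAC = (
    R => 'AG',  Y => 'CT',  M => 'AC',  K => 'GT',  S => 'GC',  W => 'AT',
    V => 'ACG', H => 'ACT', D => 'AGT', B => 'CGT',
);

# Hypothetical helper: expands ambiguity codes without nesting brackets.
sub expand_nuc_sketch {
    my ($pat) = @_;
    $pat = uc $pat;
    $pat =~ tr/U/T/;                      # treat uracil as thymine
    my $out      = '';
    my $in_class = 0;                     # true while inside [...]
    for my $ch (split //, $pat) {
        if ($ch eq '[') {
            $in_class = 1;
            $out .= $ch;
        }
        elsif ($ch eq ']') {
            $in_class = 0;
            $out .= $ch;
        }
        elsif ($ch eq 'N' || $ch eq 'X') {
            $out .= $in_class ? 'ACGT' : '.';
        }
        elsif (exists $IUPAC{$ch}) {
            # Inside a class, splice the bases in; outside, wrap them.
            $out .= $in_class ? $IUPAC{$ch} : "[$IUPAC{$ch}]";
        }
        else {
            $out .= $ch;                  # plain base or regexp character
        }
    }
    return $out;
}

print expand_nuc_sketch('SG[YA]'), "\n";   # prints "[GC]G[CTA]"
```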
 103
 104
 105
 106  ------------------------------------------
 107  Symbol           Meaning
 108  ------------------------------------------
 109  A        Alanine
 110  C        Cysteine
 111  D        Aspartic Acid
 112  E        Glutamic Acid
 113  F        Phenylalanine
 114  G        Glycine
 115  H        Histidine
 116  I        Isoleucine
 117  K        Lysine
 118  L        Leucine
 119  M        Methionine
 120  N        Asparagine
 121  P        Proline
 122  Q        Glutamine
 123  R        Arginine
 124  S        Serine
 125  T        Threonine
 126  V        Valine
 127  W        Tryptophan
 128  Y        Tyrosine
 129
 130  B        Aspartic Acid, Asparagine
 131  Z        Glutamic Acid, Glutamine
 132  X        Any amino acid
 133  .        Any amino acid
 134
 135
 136 =head2 Multiple Format Support
 137
 138 Ultimately, this module should be able to build SeqPattern.pm objects
 139 using a variety of pattern formats such as ProSite, Blocks, Prints, GCG, etc.
 140 Currently, this module only supports patterns using a grep-like syntax.
 141
 142 =head1 USAGE
 143
 144 A simple demo script called seq_pattern.pl is included in the examples/
 145 directory of the central Bioperl distribution.
 146
 147 =head1 SEE ALSO
 148
 149 L<Bio::Seq> - Lightweight sequence object.
 150
 151 =head1 FEEDBACK
 152
 153 =head2 Mailing Lists
 154
 155 User feedback is an integral part of the evolution of this and other
 156 Bioperl modules.  Send your comments and suggestions preferably to one
 157 of the Bioperl mailing lists.  Your participation is much appreciated.
 158
 159   bioperl-l@bioperl.org                  - General discussion
 160   http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
 161
 162 =head2 Reporting Bugs
 163
 164 Report bugs to the Bioperl bug tracking system to help us keep track
 165 of the bugs and their resolution. Bug reports can be submitted via the
 166 web:
 167
 168   http://bugzilla.open-bio.org/
 169
 170 =head1 AUTHOR
 171
 172 Steve Chervitz, sac-at-bioperl.org
 173
 174 =head1 COPYRIGHT
 175
 176 Copyright (c) 1997-8 Steve Chervitz. All Rights Reserved.
 177 This module is free software; you can redistribute it and/or
 178 modify it under the same terms as Perl itself.
 179
 180 =cut
 181
 182 #
 183 ##
 184 ###
 185 #### END of main POD documentation.
 186 ###
 187 ##
 188 #'
 189 # CREATED : 28 Aug 1997
 190
 191
 192 package Bio::Tools::SeqPattern;
 193
 194 use base qw(Bio::Root::Root);
 195 use strict;
 196 use vars qw ($ID);
 197 $ID  = 'Bio::Tools::SeqPattern';
 198
 199 ## These constants may be more appropriate in a Bio::Dictionary.pm
 200 ## type of class.
 201 my $PURINES      = 'AG';
 202 my $PYRIMIDINES  = 'CT';
 203 my $BEE      = 'DN';
 204 my $ZED      = 'EQ';
 205 my $Regexp_chars = '\w,.\*()\[\]<>\{\}^\$';  # quoted for use in regexps
 206
 207 ## Package variables used in reverse complementing.
 208 my (%Processed_braces, %Processed_asterics);
 209
 210 #####################################################################################
 211 ##                                 CONSTRUCTOR                                     ##
 212 #####################################################################################
 213
 214
 215 =head1 new
 216
 217  Title     : new
 218  Usage     : my $seqpat = Bio::Tools::SeqPattern->new(-SEQ => $pat, -TYPE => 'Dna');
 219  Purpose   : Calls the superclass (Bio::Root::Root) constructor, then
 220            : normalizes the type and stores the pattern string.
 221  Returns   : A new Bio::Tools::SeqPattern object.
 222  Argument  : Named parameters: -SEQ (pattern string), -TYPE (nucleotide or amino)
 223  Throws    : Exception if the pattern string (seq) is empty.
 224  Comments  : The pattern string is converted to upper case.
 225            : Untainting is performed by alphabet_ok(), not here.
 226
 227 See Also   : L<Bio::Root::Root::new>,
 228              L<alphabet_ok>()
 229
 230 =cut
 231
 232 #----------------
 233 sub new {
 234 #----------------
 235     my($class, %param) = @_;
 236
 237     my $self = $class->SUPER::new(%param);
 238     my ($seq,$type) = $self->_rearrange([qw(SEQ TYPE)], %param);
 239
 240     $seq || $self->throw("Empty pattern.");
 241     my $t;
 242     # Get the type ready for Bio::Seq.pm
 243     if ($type =~ /nuc|[dr]na/i) {
 244         $t = 'Dna';
 245     } elsif ($type =~ /amino|pep|prot/i) {
 246         $t = 'Amino';
 247     }
 248     $seq =~ tr/a-z/A-Z/;  #ps 8/8/00 Canonicalize to upper case
 249     $self->str($seq);
 250     $self->type($t);
 251
 252     return $self;
 253 }
 254
 255
 256 =head1 alphabet_ok
 257
 258  Title     : alphabet_ok
 259  Usage     : $mypat->alphabet_ok;
 260  Purpose   : Checks for invalid regexp characters.
 261            : Unlike a plain sequence alphabet check, this allows
 262            : the regexp characters ,.*()[]<>{}^$
 263            : in addition to the standard genetic alphabet.
 264            : Also untaints the pattern and sets the object's
 265            : pattern string to the untainted string.
 266  Returns   : 1 if the pattern is valid
 267  Argument  : n/a
 268  Throws    : Exception if the pattern contains invalid characters.
 269  Comments  : Actually permits any character matched by \w
 270            : (alphanumerics and underscore), not just the
 271            : standard genetic alphabet.
 272
 273 =cut
 274
 275 #----------------'
 276 sub alphabet_ok {
 277 #----------------
 278     my( $self) = @_;
 279
 280     return 1 if $self->{'_alphabet_checked'};
 281
 282     $self->{'_alphabet_checked'} = 1;
 283
 284     my $pat = $self->str();
 285
 286     if($pat =~ /[^$Regexp_chars]/io) {
 287         $self->throw("Pattern contains invalid characters: $pat",
 288                      'Legal characters: a-z,A-Z,0-9,,.*()[]<>{}^$ ');
 289     }
 290
 291     # Untaint pattern (makes code taint-safe).
 292     $pat  =~ /([$Regexp_chars]+)/io;
 293     $self->str(uc($1));
 294 #    print STDERR "\npattern ok: $pat\n";
 295     1;
 296 }
 297
 298 =head1 expand
 299
 300  Title     : expand
 301  Usage     : $seqpat_object->expand();
 302  Purpose   : Expands the sequence pattern using special ambiguity codes.
 303  Example   : $pat = $seq_pat->expand();
 304  Returns   : String containing fully expanded sequence pattern
 305  Argument  : n/a
 306  Throws    : Exception if sequence type is not recognized
 307            : (i.e., is not one of [DR]NA, Amino)
 308
 309 See Also   : L<Extended Alphabet Support>, L<_expand_pep>(), L<_expand_nuc>()
 310
 311 =cut
 312
 313 #----------
 314 sub expand {
 315 #----------
 316     my $self = shift;
 317
 318     if($self->type =~ /[DR]na/i) { $self->_expand_nuc(); }
 319     elsif($self->type =~ /Amino/i) { $self->_expand_pep(); }
 320     else{
 321         $self->throw("Don't know how to expand ${\$self->type} patterns.\n");
 322     }
 323 }
 324
 325
 326 =head1 _expand_pep
 327
 328  Title     : _expand_pep
 329  Usage     : n/a; automatically called by expand()
 330  Purpose   : Expands peptide patterns
 331  Returns   : String (the expanded pattern)
 332  Argument  : String (the unexpanded pattern)
 333  Throws    : n/a
 334
 335 See Also   : L<expand>(), L<_expand_nuc>()
 336
 337 =cut
 338
 339 #----------------
 340 sub _expand_pep {
 341 #----------------
 342     my ($self,$pat) = @_;
 343     $pat ||= $self->str;
 344     $pat =~ s/X/./g;
 345     $pat =~ s/^</\^/;
 346     $pat =~ s/>$/\$/;
 347
 348     ## Avoid nested situations: [bmnq] --/--> [[$BEE]mnq]
 349     ## Yet correctly deal with: fze[bmnq] ---> f[$ZED]e[$BEEmnq]
 350     if($pat =~ /\[\w*[BZ]\w*\]/) {
 351         $pat =~ s/\[(\w*)B(\w*)\]/\[$1$BEE$2\]/g;
 352         $pat =~ s/\[(\w*)Z(\w*)\]/\[$1$ZED$2\]/g;
 353         $pat =~ s/B/\[$BEE\]/g;
 354         $pat =~ s/Z/\[$ZED\]/g;
 355     } else {
 356         $pat =~ s/B/\[$BEE\]/g;
 357         $pat =~ s/Z/\[$ZED\]/g;
 358     }
 359     $pat =~ s/\((.)\)/$1/g;  ## Doing these last since:
 360     $pat =~ s/\[(.)\]/$1/g;  ## Pattern could contain [B] (for example)
 361
 362     return $pat;
 363 }
 364
 365
 366
 367 =head1 _expand_nuc
 368
 369  Title     : _expand_nuc
 370  Purpose   : Expands nucleotide patterns
 371  Returns   : String (the expanded pattern)
 372  Argument  : String (the unexpanded pattern)
 373  Throws    : n/a
 374
 375 See Also   : L<expand>(), L<_expand_pep>()
 376
 377 =cut
 378
 379 #---------------
 380 sub _expand_nuc {
 381 #---------------
 382     my ($self,$pat) = @_;
 383
 384     $pat ||= $self->str;
 385     $pat =~ s/N|X/./g;
 386     $pat =~ s/pu/R/ig;
 387     $pat =~ s/py/Y/ig;
 388     $pat =~ s/U/T/g;
 389     $pat =~ s/^</\^/;
 390     $pat =~ s/>$/\$/;
 391
 392     ## Avoid nested situations: [ya] --/--> [[ct]a]
 393     ## Yet correctly deal with: sg[ya] ---> [gc]g[cta]
 394     if($pat =~ /\[\w*[RYSWMK]\w*\]/) {
 395         $pat =~ s/\[(\w*)R(\w*)\]/\[$1$PURINES$2\]/g;
 396         $pat =~ s/\[(\w*)Y(\w*)\]/\[$1$PYRIMIDINES$2\]/g;
 397         $pat =~ s/\[(\w*)S(\w*)\]/\[$1GC$2\]/g;
 398         $pat =~ s/\[(\w*)W(\w*)\]/\[$1AT$2\]/g;
 399         $pat =~ s/\[(\w*)M(\w*)\]/\[$1AC$2\]/g;
 400         $pat =~ s/\[(\w*)K(\w*)\]/\[$1GT$2\]/g;
 401         $pat =~ s/\[(\w*)V(\w*)\]/\[$1ACG$2\]/g;
 402         $pat =~ s/\[(\w*)H(\w*)\]/\[$1ACT$2\]/g;
 403         $pat =~ s/\[(\w*)D(\w*)\]/\[$1AGT$2\]/g;
 404         $pat =~ s/\[(\w*)B(\w*)\]/\[$1CGT$2\]/g;
 405         $pat =~ s/R/\[$PURINES\]/g;
 406         $pat =~ s/Y/\[$PYRIMIDINES\]/g;
 407         $pat =~ s/S/\[GC\]/g;
 408         $pat =~ s/W/\[AT\]/g;
 409         $pat =~ s/M/\[AC\]/g;
 410         $pat =~ s/K/\[GT\]/g;
 411         $pat =~ s/V/\[ACG\]/g;
 412         $pat =~ s/H/\[ACT\]/g;
 413         $pat =~ s/D/\[AGT\]/g;
 414         $pat =~ s/B/\[CGT\]/g;
 415     } else {
 416         $pat =~ s/R/\[$PURINES\]/g;
 417         $pat =~ s/Y/\[$PYRIMIDINES\]/g;
 418         $pat =~ s/S/\[GC\]/g;
 419         $pat =~ s/W/\[AT\]/g;
 420         $pat =~ s/M/\[AC\]/g;
 421         $pat =~ s/K/\[GT\]/g;
 422         $pat =~ s/V/\[ACG\]/g;
 423         $pat =~ s/H/\[ACT\]/g;
 424         $pat =~ s/D/\[AGT\]/g;
 425         $pat =~ s/B/\[CGT\]/g;
 426     }
 427     $pat =~ s/\((.)\)/$1/g;  ## Doing these last since:
 428     $pat =~ s/\[(.)\]/$1/g;  ## Pattern could contain [y] (for example)
 429
 430     return $pat;
 431 }
 432
 433
 434
 435 =head1 revcom
 436
 437  Title     : revcom
 438  Usage     : revcom([1]);
 439  Purpose   : Forms a pattern capable of recognizing the reverse complement
 440            : version of a nucleotide sequence pattern.
 441  Example   : $pattern_object->revcom();
 442            : $pattern_object->revcom(1); ## returns expanded rev complement pattern.
 443  Returns   : Object reference for a new Bio::Tools::SeqPattern containing
 444            : the revcom of the current pattern as its sequence.
 445  Argument  : (1) boolean (optional) (default= false)
 446            :     true : expand the pattern before rev-complementing.
 447            :     false: don't expand pattern before or after rev-complementing.
 448  Throws    : Exception if called for amino acid sequence pattern.
 449  Comments  : This method permits the simultaneous searching of both
 450            : sense and anti-sense versions of a nucleotide pattern
 451            : by means of a grep-type of functionality in which any
 452            : number of patterns may be or-ed into the recognition
 453            : pattern.
 454            : Does not modify the current object.
 455            : The order of _fixpat() calls is critical.
 456
 457 See Also   : L<_fixpat_1>(), L<_fixpat_2>(), L<_fixpat_3>(), L<_fixpat_4>(), L<_fixpat_5>(), L<_fixpat_6>()
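
For patterns without quantifiers, the core of the transformation can be sketched in a few lines: complement each symbol, reverse the string, then repair the bracket syntax that the reversal breaks. The helper name revcom_simple is hypothetical and this is not the module's implementation; it assumes the pattern contains only bases, ambiguity codes, dots, and bracketed classes. Quantifiers such as {5,7}, * and ? need the additional _fixpat passes.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::Tools::SeqPattern).
# Assumes no quantifiers ({n,m}, *, ?) and no parentheses in the pattern.
sub revcom_simple {
    my ($pat) = @_;
    my $comp = uc $pat;
    # Complement bases and ambiguity codes (same mapping revcom() uses).
    $comp =~ tr/ACGTRYMKSWHBVDNX/TGCAYRKMSWDVBHNX/;
    my $rev = reverse $comp;              # scalar context: reverses the string
    $rev =~ tr/[]/][/;                    # brackets now face the wrong way; swap them
    $rev =~ s/\[([^\]\^]*)\^\]/[^$1]/g;   # [C^] becomes [^C]
    return $rev;
}

print revcom_simple('T[GA]AAC'), "\n";   # prints "GTT[TC]A"
print revcom_simple('AC[^G]T'), "\n";    # prints "A[^C]GT"
```

Joining a pattern with its reverse complement in one alternation, for example qr/(?:T[GA]AAC|GTT[TC]A)/, then searches both strands in a single pass.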
 458
 459 =cut
 460
 461 #-----------'
 462 sub revcom {
 463 #-----------
 464     my($self,$expand) = @_;
 465
 466     if ($self->type !~ /Dna|Rna/i) {
 467         $self->throw("Can't get revcom for ${\$self->type} sequence types.\n");
 468     }
 469 #    return $self->{'_rev'} if defined $self->{'_rev'};
 470
 471     $expand ||= 0;
 472     my $str = $self->str;
 473     $str =~ tr/acgtrymkswhbvdnxACGTRYMKSWHBVDNX/tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX/;
 474     my $rev = CORE::reverse $str;
 475     $rev    =~ tr/[](){}<>/][)(}{></;
 476
 477     if($expand) {
 478         $rev = $self->_expand_nuc($rev);
 479 #       print "\nExpanded: $rev\n";
 480     }
 481
 482     %Processed_braces = ();
 483     %Processed_asterics = ();
 484
 485     my $fixrev = _fixpat_1($rev);
 486 #   print "FIX 1: $fixrev";<STDIN>;
 487
 488      $fixrev = _fixpat_2($fixrev);
 489 #   print "FIX 2: $fixrev";<STDIN>;
 490
 491      $fixrev = _fixpat_3($fixrev);
 492 #    print "FIX 3: $fixrev";<STDIN>;
 493
 494      $fixrev = _fixpat_4($fixrev);
 495 #    print "FIX 4: $fixrev";<STDIN>;
 496
 497      $fixrev = _fixpat_5($fixrev);
 498 #    print "FIX 5: $fixrev";<STDIN>;
 499
 500 ##### Added by ps 8/7/00 to allow non-greedy matching
 501      $fixrev = _fixpat_6($fixrev);
 502 #    print "FIX 6: $fixrev";<STDIN>;
 503
 504 #    $self->{'_rev'} = $fixrev;
 505
 506      return Bio::Tools::SeqPattern->new(-seq =>$fixrev, -type =>$self->type);
 507 }
 508
 509
 510
 511 =head1 _fixpat_1
 512
 513  Title     : _fixpat_1
 514  Usage     : n/a; called automatically by revcom()
 515  Purpose   : Utility method for revcom()
 516            : Converts all {7,5} --> {5,7}     (Part I)
 517            :           and [T^] --> [^T]      (Part II)
 518            :           and *N   --> N*        (Part III)
 519  Returns   : String (the new, partially reversed pattern)
 520  Argument  : String (the expanded pattern)
 521  Throws    : n/a
 522
 523 See Also   : L<revcom>()
 524
 525 =cut
 526
 527 #--------------
 528 sub _fixpat_1 {
 529 #--------------
 530     my $pat = shift;
 531
 532     ## Part I:
 533     my (@done,@parts);
 534     while(1) {
 535         $pat =~ /(.*)\{(\S+?)\}(.*)/ or do{ push @done, $pat; last; };
 536         $pat = $1.'#{'.reverse($2).'}'.$3;
 537 #       print "1: $1\n2: $2\n3: $3\n";
 538 #       print "modified pat: $pat";<STDIN>;
 539         @parts = split '#', $pat;
 540         push @done, $parts[1];
 541         $pat = $parts[0];
 542 #       print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
 543         last if not $pat;
 544     }
 545     $pat = join('', reverse @done);
 546
 547     ## Part II:
 548     @done = ();
 549     while(1) {
 550         $pat =~ /(.*)\[(\S+?)\](.*)/ or do{ push @done, $pat; last; };
 551         $pat = $1.'#['.reverse($2).']'.$3;
 552 #       print "1: $1\n2: $2\n3: $3\n";
 553 #       print "modified pat: $pat";<STDIN>;
 554         @parts = split '#', $pat;
 555         push @done, $parts[1];
 556         $pat = $parts[0];
 557 #       print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
 558         last if not $pat;
 559     }
 560     $pat = join('', reverse @done);
 561
 562     ## Part III:
 563     @done = ();
 564     while(1) {
 565         $pat =~ /(.*)\*([\w.])(.*)/ or do{ push @done, $pat; last; };
 566         $pat = $1.'#'.$2.'*'.$3;
 567         $Processed_asterics{$2}++;
 568 #       print "1: $1\n2: $2\n3: $3\n";
 569 #       print "modified pat: $pat";<STDIN>;
 570         @parts = split '#', $pat;
 571         push @done, $parts[1];
 572         $pat = $parts[0];
 573 #       print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
 574         last if not $pat;
 575     }
 576     return join('', reverse @done);
 577
 578 }
 579
 580
 581 =head1 _fixpat_2
 582
 583  Title     : _fixpat_2
 584  Usage     : n/a; called automatically by revcom()
 585  Purpose   : Utility method for revcom()
 586            : Converts all {5,7}Y ---> Y{5,7}
 587            :          and {10,}. ---> .{10,}
 588  Returns   : String (the new, partially reversed pattern)
 589  Argument  : String (the expanded, partially reversed pattern)
 590  Throws    : n/a
 591
 592 See Also   : L<revcom>()
 593
 594 =cut
 595
 596 #--------------
 597 sub _fixpat_2 {
 598 #--------------
 599     my $pat = shift;
 600
 601     local($^W) = 0;
 602     my (@done,@parts,$braces);
 603     while(1) {
 604 #       $pat =~ s/(.*)([^])])(\{\S+?\})([\w.])(.*)/$1$2#$4$3$5/ or do{ push @done, $pat; last; };
 605         $pat =~ s/(.*)(\{\S+?\})([\w.])(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
 606         $braces = $2;
 607         $braces =~ s/[{}]//g;
 608         $Processed_braces{"$3$braces"}++;
 609 #       print "modified pat: $pat";<STDIN>;
 610         @parts = split '#', $pat;
 611         push @done, $parts[1];
 612         $pat = $parts[0];
 613 #       print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
 614         last if not $pat;
 615     }
 616     return join('', reverse @done);
 617 }
 618
 619
 620 =head1 _fixpat_3
 621
 622  Title     : _fixpat_3
 623  Usage     : n/a; called automatically by revcom()
 624  Purpose   : Utility method for revcom()
 625            : Converts all {5,7}(XXX) ---> (XXX){5,7}
 626  Returns   : String (the new, partially reversed pattern)
 627  Argument  : String (the expanded, partially reversed pattern)
 628  Throws    : n/a
 629
 630 See Also   : L<revcom>()
 631
 632 =cut
 633
 634 #-------------
 635 sub _fixpat_3 {
 636 #-------------
 637     my $pat = shift;
 638
 639     my (@done,@parts,$braces,$newpat,$oldpat);
 640     while(1) {
 641 #       $pat =~ s/(.+)(\{\S+\})(\(\w+\))(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
 642         if( $pat =~ /(.*)(.)(\{\S+\})(\(\w+\))(.*)/) {
 643             $newpat = "$1#$2$4$3$5";
 644 ##ps        $oldpat = "$1#$2$3$4$5";
 645 #           print "1: $1\n2: $2\n3: $3\n4: $4\n5: $5\n";
 646 ##ps        $braces = $3;
 647 ##ps        $braces =~ s/[{}]//g;
 648 ##ps        if( exists $Processed_braces{"$2$braces"} || exists $Processed_asterics{$2}) {
 649 ##ps            $pat = $oldpat;  # Don't change it. Already processed.
 650 #               print "saved pat: $pat";<STDIN>;
 651 ##ps        } else {
 652 #               print "new pat: $newpat";<STDIN>;
 653                 $pat = $newpat;  # Change it.
 654 ##ps        }
 655         } elsif( $pat =~ /^(\{\S+\})(\(\w+\))(.*)/) {
 656             $pat = "#$2$1$3";
 657         } else {
 658             push @done, $pat; last;
 659         }
 660         @parts = split '#', $pat;
 661         push @done, $parts[1];
 662         $pat = $parts[0];
 663 #       print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
 664         last if not $pat;
 665     }
 666     return join('', reverse @done);
 667 }
 668
 669
 670 =head1 _fixpat_4
 671
 672  Title     : _fixpat_4
 673  Usage     : n/a; called automatically by revcom()
 674  Purpose   : Utility method for revcom()
 675            : Converts all {5,7}[XXX] ---> [XXX]{5,7}
 676  Returns   : String (the new, partially reversed pattern)
 677  Argument  : String (the expanded, partially reversed  pattern)
 678  Throws    : n/a
 679
 680 See Also   : L<revcom>()
 681
 682 =cut
 683
 684 #---------------
 685 sub _fixpat_4 {
 686 #---------------
 687     my $pat = shift;
 688
 689     my (@done,@parts,$braces,$newpat,$oldpat);
 690     while(1) {
 691 #       $pat =~ s/(.*)(\{\S+\})(\[\w+\])(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
 692 #       $pat =~ s/(.*)([^\w.])(\{\S+\})(\[\w+\])(.*)/$1$2#$4$3$5/ or do{ push @done, $pat; last; };
 693         if( $pat =~ /(.*)(.)(\{\S+\})(\[\w+\])(.*)/) {
 694             $newpat = "$1#$2$4$3$5";
 695             $oldpat = "$1#$2$3$4$5";
 696 #           print "1: $1\n2: $2\n3: $3\n4: $4\n5: $5\n";
 697             $braces = $3;
 698             $braces =~ s/[{}]//g;
 699             if( (defined $braces and defined $2) and
 700                 exists $Processed_braces{"$2$braces"} || exists $Processed_asterics{$2}) {
 701                 $pat = $oldpat;  # Don't change it. Already processed.
 702 #               print "saved pat: $pat";<STDIN>;
 703             } else {
 704                 $pat = $newpat;  # Change it.
 705 #               print "new pat: $pat";<STDIN>;
 706             }
 707         } elsif( $pat =~ /^(\{\S+\})(\[\w+\])(.*)/) {
 708             $pat = "#$2$1$3";
 709         } else {
 710             push @done, $pat; last;
 711         }
 712
 713         @parts = split '#', $pat;
 714         push @done, $parts[1];
 715         $pat = $parts[0];
 716 #       print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
 717         last if not $pat;
 718     }
 719     return join('', reverse @done);
 720 }
 721
 722
 723 =head1 _fixpat_5
 724
 725  Title     : _fixpat_5
 726  Usage     : n/a; called automatically by revcom()
 727  Purpose   : Utility method for revcom()
 728            : Converts all *[XXX]  ---> [XXX]*
 729            :          and *(XXX)  ---> (XXX)*
 730  Returns   : String (the new, partially reversed pattern)
 731  Argument  : String (the expanded, partially reversed pattern)
 732  Throws    : n/a
 733
 734 See Also   : L<revcom>()
 735
 736 =cut
 737
 738 #--------------
 739 sub _fixpat_5 {
 740 #--------------
 741     my $pat = shift;
 742
 743     my (@done,@parts,$newpat,$oldpat);
 744     while(1) {
 745 #       $pat =~ s/(.*)(\{\S+\})(\[\w+\])(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
 746 #       $pat =~ s/(.*)([^\w.])(\{\S+\})(\[\w+\])(.*)/$1$2#$4$3$5/ or do{ push @done, $pat; last; };
 747         if( $pat =~ /(.*)(.)\*(\[\w+\]|\(\w+\))(.*)/) {
 748             $newpat = "$1#$2$3*$4";
 749             $oldpat = "$1#$2*$3$4";
 750 #           print "1: $1\n2: $2\n3: $3\n4: $4\n";
 751             if( exists $Processed_asterics{$2}) {
 752                 $pat = $oldpat;  # Don't change it. Already processed.
 753 #               print "saved pat: $pat";<STDIN>;
 754             } else {
 755                 $pat = $newpat;  # Change it.
 756 #               print "new pat: $pat";<STDIN>;
 757             }
 758         } elsif( $pat =~ /^\*(\[\w+\]|\(\w+\))(.*)/) {
 759             $pat = "#$1*$2";
 760         } else {
 761             push @done, $pat; last;
 762         }
 763
 764         @parts = split '#', $pat;
 765         push @done, $parts[1];
 766         $pat = $parts[0];
 767 #       print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
 768         last if not $pat;
 769     }
 770     return join('', reverse @done);
 771 }
 772
 773
 774
 775
 776
 777 ############################
 778 #
 779 #  PS: Added 8/7/00 to allow non-greedy matching patterns
 780 #
 781 ######################################
 782
 783 =head1 _fixpat_6
 784
 785  Title     : _fixpat_6
 786  Usage     : n/a; called automatically by revcom()
 787  Purpose   : Utility method for revcom()
 788            : Converts all ?Y{5,7}  ---> Y{5,7}?
 789            :          and ?(XXX){5,7}  ---> (XXX){5,7}?
 790            :          and ?[XYZ]{5,7}  ---> [XYZ]{5,7}?
 791  Returns   : String (the new, partially reversed pattern)
 792  Argument  : String (the expanded, partially reversed pattern)
 793  Throws    : n/a
 794
 795 See Also   : L<revcom>()
 796
 797 =cut
 798
 799 #--------------
 800 sub _fixpat_6 {
 801 #--------------
 802     my $pat = shift;
 803     my (@done,@parts);
 804
 805     @done = ();
 806     while(1) {
 807         $pat =~ /(.*)\?(\[\w+\]|\(\w+\)|\w)(\{\S+?\})?(.*)/ or do{ push @done, $pat; last; };
 808         my $quantifier = $3 ? $3 : ""; # Avoid a warning if there is no explicit quantifier
 809         $pat = $1.'#'.$2.$quantifier.'?'.$4;
 810 #       $pat = $1.'#'.$2.$3.'?'.$4;
 811
 812 #       print "1: $1\n2: $2\n3: $3\n";
 813 #       print "modified pat: $pat";<STDIN>;
 814         @parts = split '#', $pat;
 815         push @done, $parts[1];
 816         $pat = $parts[0];
 817 #       print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
 818         last if not $pat;
 819     }
 820     return join('', reverse @done);
 821
 822  }
 823
 824 =head2 str
 825
 826  Title   : str
 827  Usage   : $obj->str($newval)
 828  Function:
 829  Returns : value of str
 830  Args    : newvalue (optional)
 831
 832
 833 =cut
 834
 835 sub str{
 836    my $obj = shift;
 837    if( @_ ) {
 838       my $value = shift;
 839       $obj->{'str'} = $value;
 840     }
 841     return $obj->{'str'};
 842
 843 }
 844
 845 =head2 type
 846
 847  Title   : type
 848  Usage   : $obj->type($newval)
 849  Function:
 850  Returns : value of type
 851  Args    : newvalue (optional)
 852
 853
 854 =cut
 855
 856 sub type{
 857    my $obj = shift;
 858    if( @_ ) {
 859       my $value = shift;
 860       $obj->{'type'} = $value;
 861     }
 862     return $obj->{'type'};
 863
 864 }
 865
 866 1;
 867
 868 __END__
 869
 870 #########################################################################
 871 #  End of class
 872 #########################################################################
 873
 874 =head1 FOR DEVELOPERS ONLY
 875
 876 =head2 Data Members
 877
 878 Information about the various data members of this module is provided
 879 for those wishing to modify or understand the code. Two things to bear
 880 in mind:
 881
 882 =over 2
 883
 884 =item 1 Do NOT rely on these in any code outside of this module.
 885
 886 Most data members are prefixed with an underscore to signify that they
 887 are private.  Always use accessor methods. If the accessor doesn't
 888 exist or is inadequate, create or modify an accessor (and let me know,
 889 too!).
 890
 891 =item 2 This documentation may be incomplete and out of date.
 892
 893 It is easy for this documentation to become obsolete as this module is
 894 still evolving.  Always double check this info and search for members
 895 not described here.
 896
 897 =back
 898
 899 An instance of Bio::Tools::SeqPattern.pm is a blessed reference
 900 to a hash containing all or some of the following fields:
 901
 902  FIELD               VALUE
 903  ------------------------------------------------------------------------
 904  str                : The pattern string, converted to upper case by new().
 905  type               : 'Dna' or 'Amino'
 906  _alphabet_checked  : Set to 1 once alphabet_ok() has been called.
 907  _rev               : The corrected reverse complement of the pattern.
 908                     : Not currently set; the caching code in revcom()
 909                     : is commented out.
 910
 911
 912 =cut