9 bp_genbank2gff3.pl -- Genbank-E<gt>gbrowse-friendly GFF3
13 bp_genbank2gff3.pl [options] filename(s)
15 # process a directory containing GenBank flatfiles
16 perl bp_genbank2gff3.pl --dir path_to_files --zip
18 # process a single file, ignore explicit exons and introns
19 perl bp_genbank2gff3.pl --filter exon --filter intron file.gbk.gz
21 # process a list of files
22 perl bp_genbank2gff3.pl *gbk.gz
24 # process data from URL, with Chado GFF model (-noCDS), and pipe to database loader
25 curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
26 | perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
27 | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata
30 --noinfer -r don't infer exon/mRNA subfeatures
31 --conf -i path to the curation configuration file that contains user preferences
32 for Genbank entries (must be YAML format)
33 (if --manual is passed without --ini, user will be prompted to
34 create the file if any manual input is saved)
35 --sofile -l path to to the so.obo file to use for feature type mapping
36 (--sofile live will download the latest online revision)
37 --manual -m when trying to guess the proper SO term, if more than
38 one option matches the primary tag, the converter will
39 wait for user input to choose the correct one
40 (only works with --sofile)
41 --dir -d path to a list of genbank flatfiles
42 --outdir -o location to write GFF files (can be 'stdout' or '-' for pipe)
43 --zip -z compress GFF3 output files with gzip
44 --summary -s print a summary of the features in each contig
45 --filter -x genbank feature type(s) to ignore
46 --split -y split output to separate GFF and fasta files for
48 --nolump -n separate file for each reference sequence
49 (default is to lump all records together into one
50 output file for each input file)
51 --ethresh -e error threshold for unflattener
52 set this high (>2) to ignore all unflattener errors
53 --[no]CDS -c Keep CDS-exons, or convert to alternate gene-RNA-protein-exon
54 model. --CDS is default. Use --CDS to keep default GFF gene model,
55 use --noCDS to convert to g-r-p-e.
56 --format -f Input format (SeqIO types): GenBank, Swiss or Uniprot, EMBL work
58 --GFF_VERSION 3 is default, 2 and 2.5 and other Bio::Tools::GFF versions available
59 --quiet don't talk about what is being processed
60 --typesource SO sequence type for source (e.g. chromosome; region; contig)
61 --help -h display this message
66 This script uses Bio::SeqFeature::Tools::Unflattener and
67 Bio::Tools::GFF to convert GenBank flatfiles to GFF3 with gene
68 containment hierarchies mapped for optimal display in gbrowse.
70 The input files are assumed to be gzipped GenBank flatfiles for refseq
71 contigs. The files may contain multiple GenBank records. Either a
72 single file or an entire directory can be processed. By default, the
73 DNA sequence is embedded in the GFF but it can be saved into separate
74 fasta file with the --split(-y) option.
76 If an input file contains multiple records, the default behaviour is
77 to dump all GFF and sequence to a file of the same name (with .gff
78 appended). Using the 'nolump' option will create a separate file for
79 each genbank record. Using the 'split' option will create separate
80 GFF and Fasta files for each genbank record.
85 =head3 'split' and 'nolump' produce many files
87 In cases where the input files contain many GenBank records (for
88 example, the chromosome files for the mouse genome build), a very
89 large number of output files will be produced if the 'split' or
90 'nolump' options are selected. If you do have lists of files E<gt> 6000,
91 use the --long_list option in bp_bulk_load_gff.pl or
92 bp_fast_load_gff.pl to load the gff and/ or fasta files.
94 =head3 Designed for RefSeq
96 This script is designed for RefSeq genomic sequence entries. It may
97 work for third party annotations but this has not been tested.
98 But see below, Uniprot/Swissprot works, EMBL and possibly EMBL/Ensembl
99 if you don't mind some gene model unflattener errors (dgg).
101 =head3 G-R-P-E Gene Model
103 Don Gilbert worked this over with needs to produce GFF3 suited to
104 loading to GMOD Chado databases. Most of the changes I believe are
105 suited for general use. One main chado-specific addition is the
106 --[no]cds2protein flag
108 My favorite GFF is to set the above as ON by default (disable with --nocds2prot)
109 For general use it probably should be OFF, enabled with --cds2prot.
111 This writes GFF with an alternate, but useful Gene model,
112 instead of the consensus model for GFF3
114 [ gene > mRNA> (exon,CDS,UTR) ]
118 gene > mRNA > polypeptide > exon
120 means the only feature with dna bases is the exon. The others
121 specify only location ranges on a genome. Exon of course is a child
122 of mRNA and protein/peptide.
124 The protein/polypeptide feature is an important one, having all the
125 annotations of the GenBank CDS feature, protein ID, translation, GO
126 terms, Dbxrefs to other proteins.
128 UTRs, introns, CDS-exons are all inferred from the primary exon bases
129 inside/outside appropriate higher feature ranges. Other special gene
130 model features remain the same.
132 Several other improvements and bugfixes, minor but useful are included
135 curl ftp://ncbigenomes/... | bp_genbank2gff3 --in stdin --out stdout | gff2chado ...
137 * GenBank main record fields are added to source feature, e.g. organism, date,
138 and the sourcetype, commonly chromosome for genomes, is used.
140 * Gene Model handling for ncRNA, pseudogenes are added.
142 * GFF header is cleaner, more informative.
143 --GFF_VERSION flag allows choice of v2 as well as default v3
145 * GFF ##FASTA inclusion is improved, and
146 CDS translation sequence is moved to FASTA records.
148 * FT -> GFF attribute mapping is improved.
150 * --format choice of SeqIO input formats (GenBank default).
151 Uniprot/Swissprot and EMBL work and produce useful GFF.
153 * SeqFeature::Tools::TypeMapper has a few FT -> SOFA additions
154 and more flexible usage.
158 =head2 Are these additions desired?
160 * filter input records by taxon (e.g. keep only organism=xxx or taxa level = classYYY
161 * handle Entrezgene, other non-sequence SeqIO structures (really should change
162 those parsers to produce consistent annotation tags).
164 =head2 Related bugfixes/tests
166 These items from Bioperl mail were tested (sample data generating
167 errors), and found corrected:
169 From: Ed Green <green <at> eva.mpg.de>
170 Subject: genbank2gff3.pl on new human RefSeq
171 Date: 2006-03-13 21:22:26 GMT
172 -- unspecified errors (sample data works now).
174 From: Eric Just <e-just <at> northwestern.edu>
175 Subject: genbank2gff3.pl
176 Date: 2007-01-26 17:08:49 GMT
177 -- bug fixed in genbank2gff3 for multi-record handling
179 This error is for a /trans_splice gene that is hard to handle, and
180 unflattner/genbank2 doesn't
182 From: Chad Matsalla <chad <at> dieselwurks.com>
183 Subject: genbank2gff3.PLS and the unflatenner - Inconsistent order?
184 Date: 2005-07-15 19:51:48 GMT
188 Sheldon McKay (mckays@cshl.edu)
190 Copyright (c) 2004 Cold Spring Harbor Laboratory.
192 =head2 AUTHOR of hacks for GFF2Chado loading
194 Don Gilbert (gilbertd@indiana.edu)
202 use lib
"$ENV{HOME}/bioperl-live";
203 # chad put this here to enable situations when this script is tested
204 # against bioperl compiled into blib along with other programs using blib
206 unshift(@INC,'blib/lib');
209 use Bio
::Root
::RootI
;
212 use Bio
::SeqFeature
::Tools
::Unflattener
;
213 use Bio
::SeqFeature
::Tools
::TypeMapper
;
214 use Bio
::SeqFeature
::Tools
::IDHandler
;
215 use Bio
::Location
::SplitLocationI
;
216 use Bio
::Location
::Simple
;
219 use List
::Util
qw(first);
221 use YAML
qw(Dump LoadFile DumpFile);
224 use vars qw
/$split @filter $zip $outdir $help $ethresh
225 $ONTOLOGY %FEATURES %DESCENDANTS @RETURN $MANUAL @GFF_LINE_FEAT
226 $CONF $YAML $TYPE_MAP $SYN_MAP $noinfer $SO_FILE
227 $file @files $dir $summary $nolump
228 $source_type %proteinfa %exonpar $didheader $verbose $DEBUG $GFF_VERSION
229 $gene_id $rna_id $tnum $ncrna_id $rnum %method %id %seen/;
231 use constant SO_URL
=>
232 'http://song.cvs.sourceforge.net/viewvc/*checkout*/song/ontology/so.obo';
233 use constant ALPHABET
=> [qw(a b c d e f g h i j k l m n o p q r s t u v w x y z)];
234 use constant ALPHABET_TO_NUMBER
=> {
235 a
=> 0, b
=> 1, c
=> 2, d
=> 3, e
=> 4, f
=> 5, g
=> 6, h
=> 7, i
=> 8,
236 j
=> 9, k
=> 10, l
=> 11, m
=> 12, n
=> 13, o
=> 14, p
=> 15, q
=> 16,
237 r
=> 17, s
=> 18, t
=> 19, u
=> 20, v
=> 21, w
=> 22, x
=> 23, y
=> 24,
240 use constant ALPHABET_DIVISOR
=> 26;
241 use constant GM_NEW_TOPLEVEL
=> 2;
242 use constant GM_NEW_PART
=> 1;
243 use constant GM_DUP_PART
=> 0;
244 use constant GM_NOT_PART
=> -1;
246 # Options cycle in multiples of 2 because of left side/right side pairing.
247 # You can make this number odd, but displayed matches will still round up
248 use constant OPTION_CYCLE
=> 6;
250 $GFF_VERSION = 3; # allow v2 ...
251 $verbose = 1; # right default? -nov to turn off
253 # dgg: change the gene model to Gene/mRNA/Polypeptide/exons...
254 my $CDSkeep= 1; # default should be ON (prior behavior), see gene_features()
255 my $PROTEIN_TYPE = 'polypeptide'; # for noCDSkeep;
256 # protein = flybase chado usage; GMOD Perls use 'polypeptide' with software support
258 my $FORMAT="GenBank"; # swiss ; embl; genbank ; ** guess from SOURCEID **
259 my $SOURCEID= $FORMAT; # "UniProt" "GenBank" "EMBL" should work
260 # other Bio::SeqIO formats may work. TEST: EntrezGene < problematic tags; InterPro KEGG
266 note
=> 'Note', # also pull GO: ids into Ontology_term
268 symbol
=> 'Alias', # is symbol still used?
269 # protein_id => 'Dbxref', also seen Dbxref tags: EC_number
270 # translation: handled in gene_features
275 my $quiet= !$verbose;
276 my $ok= GetOptions
( 'd|dir|input:s' => \
$dir,
279 's|summary' => \
$summary,
280 'r|noinfer' => \
$noinfer,
281 'i|conf=s' => \
$CONF,
282 'sofile=s' => \
$SO_FILE,
283 'm|manual' => \
$MANUAL,
284 'o|outdir|output:s'=> \
$outdir,
285 'x|filter:s'=> \
@filter,
286 'y|split' => \
$split,
287 "ethresh|e=s"=>\
$ethresh,
288 'c|CDS!' => \
$CDSkeep,
289 'f|format=s' => \
$FORMAT,
290 'typesource=s' => \
$source_type,
291 'GFF_VERSION=s' => \
$GFF_VERSION,
292 'quiet!' => \
$quiet, # swap quiet to verbose
294 'n|nolump' => \
$nolump);
296 my $lump = 1 unless $nolump || $split;
299 # look for help request
300 pod2usage
(2) if $help || !$ok;
302 # keep SOURCEID as-is and change FORMAT for SeqIO types;
303 # note SeqIO uses file.suffix to guess type; not useful here
305 $FORMAT = "swiss" if $FORMAT =~/UniProt|trembl/;
306 $verbose =1 if($DEBUG);
308 # initialize handlers
309 my $unflattener = Bio
::SeqFeature
::Tools
::Unflattener
->new; # for ensembl genomes (-trust_grouptag=>1);
310 $unflattener->error_threshold($ethresh) if $ethresh;
311 $unflattener->verbose(1) if($DEBUG);
312 # $unflattener->group_tag('gene') if($FORMAT =~ /embl/i) ; #? ensembl only?
313 # ensembl parsing is still problematic, forget this
315 my $tm = Bio
::SeqFeature
::Tools
::TypeMapper
->new;
316 my $idh = Bio
::SeqFeature
::Tools
::IDHandler
->new;
319 $source_type ||= "region"; # should really parse from FT.source contents below
321 #my $FTSOmap = $tm->FT_SO_map();
325 if (defined($SO_FILE) && $SO_FILE eq 'live') {
326 print "\nDownloading the latest SO file from ".SO_URL
."\n\n";
328 my $ua = LWP
::UserAgent
->new(timeout
=> 30);
329 my $request = HTTP
::Request
->new(GET
=> SO_URL
);
330 my $response = $ua->request($request);
333 if ($response->status_line =~ /200/) {
334 use File
::Temp qw
/ tempfile /;
335 my ($fh, $fn) = tempfile
();
336 print $fh $response->content;
339 print "Couldn't download SO file online...skipping validation.\n"
340 . "HTTP Status was " . $response->status_line . "\n"
350 my $parser = Bio
::OntologyIO
->new( -format
=> "obo", -file
=> $SO_FILE );
351 $ONTOLOGY = $parser->next_ontology();
353 for ($ONTOLOGY->get_all_terms) {
356 $terms{$feat->name} = $feat->name;
357 #$terms{$feat->name} = $feat;
359 my @syn = $_->each_synonym;
361 push @
{$syn{$_}}, $feat->name for @syn;
362 #push @{$syn{$_}}, $feat for @syn;
366 $FTSOsynonyms = \
%syn;
368 my %hardTerms = %{ $tm->FT_SO_map() };
369 map { $FTSOmap->{$_} ||= $hardTerms{$_} } keys %hardTerms;
372 my %terms = %{ $tm->FT_SO_map() };
373 while (my ($k,$v) = each %terms) {
374 $FTSOmap->{$k} = ref($v) ?
shift @
$v : $v;
378 $TYPE_MAP = $FTSOmap;
379 $SYN_MAP = $FTSOsynonyms;
382 # #convert $FTSOmap undefined to valid SO : moved to TypeMapper->map_types( -undefined => "region")
384 # stringify filter list if applicable
385 my $filter = join ' ', @filter if @filter;
387 # determine input files
388 my $stdin=0; # dgg: let dir == stdin == '-' for pipe use
389 if ($dir && ($dir eq '-' || $dir eq 'stdin')) {
390 $stdin=1; $dir=''; @files=('stdin');
394 opendir DIR
, $dir or die "could not open $dir for reading: $!";
395 @files = map { "$dir/$_";} grep { /\
.gb
.*/ } readdir DIR
;
399 die "$dir is not a directory\n";
407 # we should have some files by now
408 pod2usage
(2) unless @files;
411 my $stdout=0; # dgg: let outdir == stdout == '-' for pipe use
412 if($outdir && ($outdir eq '-' || $outdir eq 'stdout')) {
413 warn("std. output chosen: cannot split\n") if($split);
414 warn("std. output chosen: cannot zip\n") if($zip);
415 warn("std. output chosen: cannot nolump\n") if($nolump);
416 $stdout=1; $lump=1; $split= 0; $zip= 0; # unless we pipe stdout thru gzip
418 } elsif ( $outdir && !-e
$outdir ) {
419 mkdir($outdir) or die "could not create directory $outdir: $!\n";
422 $outdir = $dir || '.';
425 for my $file ( @files ) {
426 # dgg ; allow 'stdin' / '-' input ?
428 die "$! $file" unless($stdin || -e
$file);
429 print "# Input: $file\n" if($verbose);
431 my ($lump_fh, $lumpfa_fh, $outfile, $outfa);
433 $lump_fh= *STDOUT
; $lump="stdout$$";
434 $outfa= "stdout$$.fa"; # this is a temp file ... see below
435 open $lumpfa_fh, ">$outfa" or die "Could not create a lump outfile called ($outfa) because ($!)\n";
438 my ($vol,$dirs,$fileonly) = File
::Spec
->splitpath($file);
439 $lump = File
::Spec
->catfile($outdir, $fileonly.'.gff');
440 ($outfa = $lump) =~ s/\.gff/\.fa/;
441 open $lump_fh, ">$lump" or die "Could not create a lump outfile called ($lump) because ($!)\n";
442 open $lumpfa_fh, ">$outfa" or die "Could not create a lump outfile called ($outfa) because ($!)\n";
446 # open input file, unzip if req'd
449 } elsif ( $file =~ /\.gz/ ) {
450 open FH
, "gunzip -c $file |";
456 my $in = Bio
::SeqIO
->new(-fh
=> \
*FH
, -format
=> $FORMAT, -debug
=>$DEBUG);
457 my $gffio = Bio
::Tools
::GFF
->new( -noparse
=> 1, -gff_version
=> $GFF_VERSION );
459 while ( my $seq = $in->next_seq() ) {
460 my $seq_name = $seq->accession_number;
461 my $end = $seq->length;
464 # arrange disposition of GFF output
465 $outfile = $lump || File
::Spec
->catfile($outdir, $seq_name.'.gff');
473 $outfile = File
::Spec
->catfile($outdir, $seq_name.'.gff');
474 open $out, ">$outfile";
477 # filter out unwanted features
478 my $source_feat= undef;
479 my @source= filter
($seq); $source_feat= $source[0];
481 ($source_type,$source_feat)=
482 getSourceInfo
( $seq, $source_type, $source_feat ) ;
483 # always; here we build main prot $source_feat; # if @source;
485 # abort if there are no features
486 warn "$seq_name has no features, skipping\n" and next
487 if !$seq->all_SeqFeatures;
490 $FTSOmap->{'source'} = $source_type;
491 ## $FTSOmap->{'CDS'}= $PROTEIN_TYPE; # handle this in gene_features
493 # construct a GFF header
494 # add: get source_type from attributes of source feature? chromosome=X tag
495 # also combine 1st ft line here with source ft from $seq ..
496 my($header,$info)= gff_header
($seq_name, $end, $source_type, $source_feat);
498 print "# working on $info\n" if($verbose);
500 # unflatten gene graphs, apply SO types, etc; this also does TypeMapper ..
503 # Note that we use our own get_all_SeqFeatures function
504 # to rescue cloned exons
507 for my $feature ( get_all_SeqFeatures
($seq) ) {
509 my $method = $feature->primary_tag;
510 next if($SOURCEID =~/UniProt|swiss|trembl/i && $method ne $source_type);
512 $feature->seq_id($seq->id) unless($feature->seq_id);
513 $feature->source_tag($SOURCEID);
515 # dgg; need to convert some Genbank to GFF tags: note->Note; db_xref->Dbxref;
516 ## also, pull any GO:000 ids from /note tag and put into Ontology_term
517 maptags2gff
($feature);
519 # current gene name. The unflattened gene features should be in order so any
520 # exons, CDSs, etc that follow will belong to this gene
522 if ( $method eq 'gene' || $method eq 'pseudogene' ) {
523 @to_print= print_held
($out, $gffio, \
@to_print);
524 $gene_id = $gene_name= gene_name
($feature);
526 $gene_name= gene_name
($feature);
529 #?? should gene_name from /locus_tag,/gene,/product,/transposon=xxx
530 # be converted to or added as Name=xxx (if not ID= or as well)
531 ## problematic: convert_to_name ($feature); # drops /locus_tag,/gene, tags
532 convert_to_name
($feature);
534 ## dgg: extended to protein|polypeptide
535 ## this test ($feature->has_tag('gene') ||) is not good: repeat_regions over genes
536 ## in yeast have that genbank tag; why?
537 ## these include pseudogene ...
539 ## Note we also have mapped types to SO, so these RNA's are now transcripts:
540 # pseudomRNA => "pseudogenic_transcript",
541 # pseudotranscript" => "pseudogenic_transcript",
542 # misc_RNA=>'processed_transcript',
544 warn "#at: $method $gene_id/$gene_name\n" if $DEBUG;
546 if ( $method =~ /(gene|RNA|CDS|exon|UTR|protein|polypeptide|transcript)/
547 || ( $gene_id && $gene_name eq $gene_id ) ) {
549 my $action = gene_features
($feature, $gene_id, $gene_name); # -1, 0, 1, 2 result
550 if ($action == GM_DUP_PART
) {
551 # ignore, this is dupl. exon with new parent ...
553 } elsif ($action == GM_NOT_PART
) {
554 add_generic_id
( $feature, $gene_name, "nocount");
555 my $gff = $gffio->gff_string($feature);
556 push @GFF_LINE_FEAT, $feature;
557 #print $out "$gff\n" if $gff;
559 } elsif ($action > 0) {
560 # hold off print because exon etc. may get 2nd, 3rd parents
561 @to_print= print_held
($out, $gffio, \
@to_print) if ($action == GM_NEW_TOPLEVEL
);
562 push(@to_print, $feature);
567 # otherwise handle as generic feats with IDHandler labels
569 add_generic_id
( $feature, $gene_name, "");
570 my $gff= $gffio->gff_string($feature);
571 push @GFF_LINE_FEAT, $feature;
572 #print $out "$gff\n" if $gff;
576 # don't like doing this after others; do after each new gene id?
577 @to_print= print_held
($out, $gffio, \
@to_print);
579 gff_validate
(@GFF_LINE_FEAT);
581 for my $feature (@GFF_LINE_FEAT) {
582 my $gff= $gffio->gff_string($feature);
583 print $out "$gff\n" if $gff;
586 # deal with the corresponding DNA
587 my ($fa_out,$fa_outfile);
589 if($dna || %proteinfa) {
590 $method{'RESIDUES'} += length($dna);
591 $dna =~ s/(\S{60})/$1\n/g;
595 $fa_outfile = $outfile;
596 $fa_outfile =~ s/gff$/fa/;
597 open $fa_out, ">$fa_outfile" or die $!;
598 print $fa_out ">$seq_name\n$dna" if $dna;
599 foreach my $aid (sort keys %proteinfa) {
600 my $aa= delete $proteinfa{$aid};
601 $method{'RESIDUES(tr)'} += length($aa);
602 $aa =~ s/(\S{60})/$1\n/g;
603 print $fa_out ">$aid\n$aa\n";
608 ## problem here when multiple GB Seqs in one file; all FASTA needs to go at end of $out
609 ## see e.g. Mouse: mm_ref_chr19.gbk has NT_082868 and NT_039687 parts in one .gbk
610 ## maybe write this to temp .fa then cat to end of lumped gff $out
611 print $lumpfa_fh ">$seq_name\n$dna" if $dna;
612 foreach my $aid (sort keys %proteinfa) {
613 my $aa= delete $proteinfa{$aid};
614 $method{'RESIDUES(tr)'} += length($aa);
615 $aa =~ s/(\S{60})/$1\n/g;
616 print $lumpfa_fh ">$aid\n$aa\n";
623 if ( $zip && !$lump ) {
624 system "gzip -f $outfile";
625 system "gzip -f $fa_outfile" if($fa_outfile);
627 $fa_outfile .= '.gz' if $split;
630 # print "\n>EOF\n" if($stdout); #?? need this if summary goes to stdout after FASTA
631 print "# GFF3 saved to $outfile" unless( !$verbose || $stdout || $lump);
632 print ($split ?
"; DNA saved to $fa_outfile\n" : "\n") unless($stdout|| $lump);
634 # dgg: moved to after all inputs; here it prints cumulative sum for each record
636 # print "# Summary:\n# Feature\tCount\n# -------\t-----\n";
638 # for ( keys %method ) {
639 # print "# $_ $method{$_}\n";
646 print "# GFF3 saved to $outfile\n" if( $verbose && $lump);
648 print "# Summary:\n# Feature\tCount\n# -------\t-----\n";
649 for ( keys %method ) {
650 print "# $_ $method{$_}\n";
655 ## FIXME for piped output w/ split FA files ...
656 close($lumpfa_fh) if $lumpfa_fh;
657 if (!$split && $outfa && $lump_fh) {
658 print $lump_fh "##FASTA\n"; # GFF3 spec
659 open $lumpfa_fh, $outfa or warn "reading FA $outfa: $!";
660 while( <$lumpfa_fh>) {
662 } # is $lump_fh still open?
668 if ( $zip && $lump ) {
669 system "gzip -f $lump";
680 return 1 if ($_[0] =~ /gene/);
681 return 2 if ($_[0] =~ /RNA|transcript/);
682 return 3 if ($_[0] =~ /protein|peptide/);
683 return 4 if ($_[0] =~ /exon|CDS/);
684 return 3; # default before exon (smallest part)
687 sub sort_by_feattype
{
688 my($at,$bt)= ($a->primary_tag, $b->primary_tag);
689 return (typeorder
($at) <=> typeorder
($bt))
691 ## or ($a->name() cmp $b->name());
695 my($out,$gffio,$to_print)= @_;
696 return unless(@
$to_print);
697 @
$to_print = sort sort_by_feattype @
$to_print; # put exons after mRNA, otherwise chado loader chokes
698 while ( my $feature = shift @
$to_print) {
699 my $gff= $gffio->gff_string($feature); # $gff =~ s/\'/./g; # dang bug in encode
700 push @GFF_LINE_FEAT, $feature;
701 #print $out "$gff\n";
703 return (); # @to_print
708 ## should copy/move locus_tag to Alias, if not ID/Name/Alias already
709 # but see below /gene /locus_tag usage
710 foreach my $tag (keys %TAG_MAP) {
711 if ($f->has_tag($tag)) {
712 my $newtag= $TAG_MAP{$tag};
713 my @v= $f->get_tag_values($tag);
714 $f->remove_tag($tag);
715 $f->add_tag_value($newtag,@v);
717 ## also, pull any GO:000 ids from /note tag and put into Ontology_term
718 ## ncbi syntax in CDS /note is now '[goid GO:0005886]' OR '[goid 0005624]'
719 if ($tag eq 'note') {
720 map { s/\[goid (\d+)/\[goid GO:$1/g; } @v;
721 my @go= map { m/(GO:\d+)/g } @v;
722 $f->add_tag_value('Ontology_term',@go) if(@go);
731 my ($seq, $source_type, $sf) = @_;
733 my $is_swiss= ($SOURCEID =~/UniProt|swiss|trembl/i);
734 my $is_gene = ($SOURCEID =~/entrezgene/i);
735 my $is_rich = (ref($seq) =~ /RichSeq/);
736 my $seq_name= $seq->accession_number();
738 unless($sf) { # make one
739 $source_type= $is_swiss ?
$PROTEIN_TYPE
740 : $is_gene ?
"eneg" # "gene" # "region" #
741 : $is_rich ?
$seq->molecule : $source_type;
742 $sf = Bio
::SeqFeature
::Generic
->direct_new();
744 my $len = $seq->length(); $len=1 if($len<1); my $start = 1; ##$start= $len if ($len<1);
745 my $loc= $seq->can('location') ?
$seq->location()
746 : new Bio
::Location
::Simple
( -start
=> $start, -end
=> $len);
747 $sf->location( $loc );
748 $sf->primary_tag($source_type);
749 $sf->source_tag($SOURCEID);
750 $sf->seq_id( $seq_name);
751 #? $sf->display_name($seq->id()); ## Name or Alias ?
752 $sf->add_tag_value( Alias
=> $seq->id()); # unless id == accession
753 $seq->add_SeqFeature($sf);
754 ## $source_feat= $sf;
757 if ($sf->has_tag("chromosome")) {
758 $source_type= "chromosome";
759 my ($chrname) = $sf->get_tag_values("chromosome");
760 ## PROBLEM with Name <> ID, RefName for Gbrowse; use Alias instead
761 ## e.g. Mouse chr 19 has two IDs in NCBI genbank now
762 $sf->add_tag_value( Alias
=> $chrname );
765 # pull GB Comment, Description for source ft ...
766 # add reference - can be long, not plain string...
767 warn "# $SOURCEID:$seq_name fields = ", join(",", $seq->annotation->get_all_annotation_keys()),"\n" if $DEBUG;
768 # GenBank fields: keyword,comment,reference,date_changed
769 # Entrezgene fields 850293 =ALIAS_SYMBOL,RefSeq status,chromosome,SGD,dblink,Entrez Gene Status,OntologyTerm,LOCUS_SYNONYM
771 # is this just for main $seq object or for all seqfeatures ?
773 'gene_name' => 'Alias',
774 'ALIAS_SYMBOL' => 'Alias', # Entrezgene
775 'LOCUS_SYNONYM' => 'Alias', #?
777 'synonym' => 'Alias',
778 'dblink' => 'Dbxref',
779 'product' => 'product',
780 'Reference' => 'reference',
781 'OntologyTerm' => 'Ontology_term',
783 'comment1' => 'Note',
784 # various map-type locations
785 # gene accession tag is named per source db !??
786 # 'Index terms' => keywords ??
790 my ($desc)= $seq->annotation->get_Annotations("desc") || ( $seq->desc() );
791 my ($date)= $seq->annotation->get_Annotations("dates")
792 || $seq->annotation->get_Annotations("update-date")
793 || $is_rich ?
$seq->get_dates() : ();
794 my ($comment)= $seq->annotation->get_Annotations("comment");
795 my ($species)= $seq->annotation->get_Annotations("species");
797 && $seq->can('species')
798 && defined $seq->species()
799 && $seq->species()->can('binomial') ) {
800 $species= $seq->species()->binomial();
803 # update source feature with main GB fields
804 $sf->add_tag_value( ID
=> $seq_name ) unless $sf->has_tag('ID');
805 $sf->add_tag_value( Note
=> $desc ) if($desc && ! $sf->has_tag('Note'));
806 $sf->add_tag_value( organism
=> $species ) if($species && ! $sf->has_tag('organism'));
807 $sf->add_tag_value( comment1
=> $comment ) if(!$is_swiss && $comment && ! $sf->has_tag('comment1'));
808 $sf->add_tag_value( date
=> $date ) if($date && ! $sf->has_tag('date'));
810 $sf->add_tag_value( Dbxref
=> $SOURCEID.':'.$seq_name ) if $is_swiss || $is_gene;
812 foreach my $atag (sort keys %AnnotTagMap) {
813 my $gtag= $AnnotTagMap{$atag}; next unless($gtag);
815 if (ref $_ && $_->can('get_all_values')) {
816 split( /[,;] */, join ";", $_->get_all_values)
818 elsif (ref $_ && $_->can('display_text')) {
819 split( /[,;] */, $_->display_text)
821 elsif (ref $_ && $_->can('value')) {
822 split( /[,;] */, $_->value)
827 } $seq->annotation->get_Annotations($atag);
828 foreach(@anno) { $sf->add_tag_value( $gtag => $_ ); }
831 #my @genes = map{ split( /[,;] */, "$_"); } $seq->annotation->get_Annotations('gene_name');
832 #$sf->add_tag_value( Alias => $_ ) foreach(@genes);
834 #my @dblink= map { "$_"; } $seq->annotation->get_Annotations("dblink"); # add @all
835 #$sf->add_tag_value( Dbxref => $_ ) foreach(@dblink);
837 return (wantarray)?
($source_type,$sf) : $source_type; #?
842 my ($f, $gene_id, $genelinkID) = @_;
843 local $_ = $f->primary_tag;
847 $f->add_tag_value( ID
=> $gene_id ) unless($f->has_tag('ID')); # check is same value!?
848 $tnum = $rnum= 0; $ncrna_id= $rna_id = '';
849 return GM_NEW_TOPLEVEL
;
852 return GM_NOT_PART
unless $gene_id;
853 return GM_NOT_PART
if($genelinkID && $genelinkID ne $gene_id);
854 ($rna_id = $gene_id ) =~ s/gene/mRNA/;
855 $rna_id .= '.t0' . ++$tnum;
856 $f->add_tag_value( ID
=> $rna_id );
857 $f->add_tag_value( Parent
=> $gene_id );
859 } elsif ( /RNA|transcript/) {
860 ## misc_RNA here; missing exons ... flattener problem?
861 # all of {t,nc,sn}RNA can have gene models now
862 ## but problem in Worm chr: mRNA > misc_RNA > CDS with same locus tag
863 ## CDS needs to use mRNA, not misc_RNA, rna_id ...
864 ## also need to fix cases where tRNA,... lack a 'gene' parent: make this one top-level
867 return GM_NOT_PART
if($genelinkID && $genelinkID ne $gene_id);
868 ($ncrna_id = $gene_id) =~ s/gene/ncRNA/;
869 $ncrna_id .= '.r0' . ++$rnum;
870 $f->add_tag_value( Parent
=> $gene_id );
871 $f->add_tag_value( ID
=> $ncrna_id );
873 unless ($f->has_tag('ID')) {
875 $f->add_tag_value( ID
=> $genelinkID ) ;
877 $idh->generate_unique_persistent_id($f);
880 ($ncrna_id)= $f->get_tag_values('ID');
881 return GM_NEW_TOPLEVEL
;
882 # this feat now acts as gene-top-level; need to print @to_print to flush prior exons?
885 } elsif ( /exon/ ) { # can belong to any kind of RNA
886 return GM_NOT_PART
unless ($rna_id||$ncrna_id);
887 return GM_NOT_PART
if($genelinkID && $genelinkID ne $gene_id);
888 ## we are getting duplicate Parents here, which chokes chado loader, with reason...
889 ## problem is when mRNA and ncRNA have same exons, both ids are active, called twice
891 for my $expar ($rna_id, $ncrna_id) {
893 if ( $exonpar{$expar} and $f->has_tag('Parent') ) {
894 my @vals = $f->get_tag_values('Parent');
895 next if (grep {$expar eq $_} @vals);
898 $f->add_tag_value( Parent
=> $expar);
899 # last; #? could be both
901 # now we can skip cloned exons
902 # dgg note: multiple parents get added and printed for each unique exon
903 return GM_DUP_PART
if ++$seen{$f} > 1;
905 } elsif ( /CDS|protein|polypeptide/ ) {
906 return GM_NOT_PART
unless $rna_id; ## ignore $ncrna_id ??
907 return GM_NOT_PART
if($genelinkID && $genelinkID ne $gene_id); #??
908 (my $pro_id = $rna_id) =~ s/\.t/\.p/;
910 if( ! $CDSkeep && /CDS/) {
911 $f->primary_tag($PROTEIN_TYPE);
913 ## duplicate problem is Location ..
914 if ($f->location->isa("Bio::Location::SplitLocationI")) {
915 # my($b,$e)=($f->start, $f->end); # is this all we need?
917 foreach my $l ($f->location->each_Location) {
918 $b = $l->start if($b<0 || $b > $l->start);
919 $e = $l->end if($e < $l->end);
921 $f->location( Bio
::Location
::Simple
->new(
922 -start
=> $b, -end
=> $e, -strand
=> $f->strand) );
925 $f->add_tag_value( Derives_from
=> $rna_id );
928 $f->add_tag_value( Parent
=> $rna_id );
931 $f->add_tag_value( ID
=> $pro_id );
933 move_translation_fasta
($f, $pro_id);
934 #if( $f->has_tag('translation')) {
935 # my ($aa) = $f->get_tag_values("translation");
936 # $proteinfa{$pro_id}= $aa;
937 # $f->remove_tag("translation");
938 # $f->add_tag_value("translation","length.".length($aa)); # hack for odd chado gbl problem
940 } elsif ( /region/ ) {
941 $f->primary_tag('gene_component_region');
942 $f->add_tag_value( Parent
=> $gene_id );
944 return GM_NOT_PART
unless $gene_id;
945 $f->add_tag_value( Parent
=> $gene_id );
948 ## return GM_DUP_PART if /exon/ && ++$seen{$f} > 1;
953 ## was generic_features > add_generic_id
955 my ($f, $ft_name, $flags) = @_;
956 my $method = $f->primary_tag;
957 $method{$method}++ unless($flags =~ /nocount/); ## double counts GM_NOT_PART from above
959 if ($f->has_tag('ID')) {
962 elsif ( $f->has_tag($method) ) {
963 my ($name) = $f->get_tag_values($method);
964 $f->add_tag_value( ID
=> "$method:$name" );
966 elsif($ft_name) { # is this unique ?
967 $f->add_tag_value( ID
=> $ft_name );
970 $idh->generate_unique_persistent_id($f);
973 move_translation_fasta
( $f, ($f->get_tag_values("ID"))[0] )
974 if($method =~ /CDS/);
976 # return $io->gff_string($f);
979 sub move_translation_fasta
{
980 my ($f, $ft_id) = @_;
981 if( $ft_id && $f->has_tag('translation') ) {
982 my ($aa) = $f->get_tag_values("translation");
983 if($aa && $aa !~ /^length/) {
984 $proteinfa{$ft_id}= $aa;
985 $f->remove_tag("translation");
986 $f->add_tag_value("translation","length.".length($aa)); # hack for odd chado gbl problem
992 my ($name, $end, $source_type, $source_feat) = @_;
993 $source_type ||= "region";
995 my $info = "$source_type:$name";
996 my $head = "##gff-version $GFF_VERSION\n".
997 "##sequence-region $name 1 $end\n".
998 "# conversion-by bp_genbank2gff3.pl\n";
1000 ## dgg: these header comment fields are not useful when have multi-records, diff organisms
1001 for my $key (qw(organism Note date)) {
1003 if ($source_feat->has_tag($key)) {
1004 ($value) = $source_feat->get_tag_values($key);
1007 $head .= "# $key $value\n";
1008 $info .= ", $value";
1011 $head = "" if $didheader;
1013 $head .= "$name\t$SOURCEID\t$source_type\t1\t$end\t.\t.\t.\tID=$name\n";
1016 return (wantarray) ?
($head,$info) : $head;
1022 ## print "# working on $source_type:", $seq->accession, "\n";
1023 my $uh_oh = "Possible gene unflattening error with" . $seq->accession_number .
1024 ": consult STDERR\n";
1027 $unflattener->unflatten_seq( -seq
=> $seq,
1028 -noinfer
=> $noinfer,
1032 # deal with unflattening errors
1034 warn $seq->accession_number . " Unflattening error:\n";
1035 warn "Details: $@\n";
1039 return 0 if !$seq || !$seq->all_SeqFeatures;
1041 # map feature types to the sequence ontology
1042 ## $tm->map_types_to_SO( -seq => $seq );
1043 #$tm->map_types( -seq => $seq, -type_map => $FTSOmap, -undefined => "region" ); #dgg
1048 -type_map
=> $FTSOmap,
1049 -syn_map
=> $FTSOsynonyms,
1050 -undefined
=> "region"
1057 ## return unless $filter;
1059 my @sources; # dgg; pick source features here; only 1 always?
1061 for my $f ( $seq->remove_SeqFeatures ) {
1062 my $m = $f->primary_tag;
1063 push @sources, $f if ($m eq 'source'); # dgg? but leave in @feats ?
1064 push @feats, $f unless $filter =~ /$m/i;
1066 $seq->add_SeqFeature($_) foreach @feats;
1068 for my $f ( $seq->get_SeqFeatures ){
1069 my $m = $f->primary_tag;
1070 push @sources, $f if ($m eq 'source'); # dgg? but leave in @feats ?
1078 # The default behaviour of Bio::FeatureHolderI:get_all_SeqFeatures
1079 # changed to filter out cloned features. We have to implement the old
1080 # method. These two subroutines were adapted from the v1.4 Bio::FeatureHolderI
1081 sub get_all_SeqFeatures
{
1085 foreach my $feat ( $seq->get_SeqFeatures ){
1086 push(@flatarr,$feat);
1087 _add_flattened_SeqFeatures
(\
@flatarr,$feat);
1094 my $gene_id = ''; # zero it;
1096 if ($g->has_tag('locus_tag')) {
1097 ($gene_id) = $g->get_tag_values('locus_tag');
1100 elsif ($g->has_tag('gene')) {
1101 ($gene_id) = $g->get_tag_values('gene');
1103 elsif ($g->has_tag('ID')) { # for non-Genbank > Entrezgene
1104 ($gene_id) = $g->get_tag_values('ID');
1107 ## See Unflattener comment:
1108 # on rare occasions, records will have no /gene or /locus_tag
1109 # but it WILL have /product tags. These serve the same purpose
1110 # for grouping. For an example, see AY763288 (also in t/data)
1111 # eg. product=tRNA-Asp ; product=similar to crooked neck protein
1112 elsif ($g->has_tag('product')) {
1113 my ($name)= $g->get_tag_values('product');
1114 ($gene_id) = $name unless($name =~ / /); # a description not name
1117 ## dgg; also handle transposon=xxxx ID/name
1118 # ID=GenBank:repeat_region:NC_004353:1278337:1281302;transposon=HeT-A{}1685;Dbxref=FLYBASE:FBti0059746
1119 elsif ($g->has_tag('transposon')) {
1120 my ($name)= $g->get_tag_values('transposon');
1121 ($gene_id) = $name unless($name =~ / /); # a description not name
1127 # same list as gene_name .. change tag to generic Name
1128 sub convert_to_name
{
1130 my $gene_id = ''; # zero it;
1132 if ($g->has_tag('gene')) {
1133 ($gene_id) = $g->get_tag_values('gene');
1134 $g->remove_tag('gene');
1135 $g->add_tag_value('Name', $gene_id);
1137 elsif ($g->has_tag('locus_tag')) {
1138 ($gene_id) = $g->get_tag_values('locus_tag');
1139 $g->remove_tag('locus_tag');
1140 $g->add_tag_value('Name', $gene_id);
1143 elsif ($g->has_tag('product')) {
1144 my ($name)= $g->get_tag_values('product');
1145 ($gene_id) = $name unless($name =~ / /); # a description not name
1146 ## $g->remove_tag('product');
1147 $g->add_tag_value('Name', $gene_id);
1150 elsif ($g->has_tag('transposon')) {
1151 my ($name)= $g->get_tag_values('transposon');
1152 ($gene_id) = $name unless($name =~ / /); # a description not name
1153 ## $g->remove_tag('transposon');
1154 $g->add_tag_value('Name', $gene_id);
1156 elsif ($g->has_tag('ID')) {
1157 my ($name)= $g->get_tag_values('ID');
1158 $g->add_tag_value('Name', $name);
1164 sub _add_flattened_SeqFeatures
{
1165 my ($arrayref,$feat) = @_;
1168 if ($feat->isa("Bio::FeatureHolderI")) {
1169 @subs = $feat->get_SeqFeatures;
1171 elsif ($feat->isa("Bio::SeqFeatureI")) {
1172 @subs = $feat->sub_SeqFeature;
1175 warn ref($feat)." is neither a FeatureHolderI nor a SeqFeatureI. ".
1176 "Don't know how to flatten.";
1179 for my $sub (@subs) {
1180 push(@
$arrayref,$sub);
1181 _add_flattened_SeqFeatures
($arrayref,$sub);
1188 my ($self, @args) = @_;
1190 my($sf, $seq, $type_map, $syn_map, $undefmap) =
1191 $self->_rearrange([qw(FEATURE
1199 if (!$sf && !$seq) {
1200 $self->throw("you need to pass in either -feature or -seq");
1205 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
1206 @sfs = $seq->get_all_SeqFeatures;
1208 $type_map = $type_map || $self->typemap; # dgg: was type_map;
1209 foreach my $feat (@sfs) {
1211 $feat->isa("Bio::SeqFeatureI") || $self->throw("$feat NOT A SeqFeatureI");
1212 $feat->isa("Bio::FeatureHolderI") || $self->throw("$feat NOT A FeatureHolderI");
1214 my $primary_tag = $feat->primary_tag;
1216 #if ($primary_tag =~ /^pseudo(.*)$/) {
1217 # $primary_tag = $1;
1218 # $feat->primary_tag($primary_tag);
1221 my $mtype = $type_map->{$primary_tag};
1224 if (ref($mtype) eq 'ARRAY') {
1226 ($mtype, $soID) = @
$mtype;
1228 if ($soID && ref($ONTOLOGY)) {
1229 my ($term) = $ONTOLOGY->find_terms(-identifier
=> $soID);
1230 $mtype = $term->name if $term;
1232 # if SO ID is undefined AND we have an ontology to search, we want to delete
1233 # the feature type hash entry in order to force a fuzzy search
1234 elsif (! defined $soID && ref($ONTOLOGY)) {
1236 delete $type_map->{$primary_tag};
1238 elsif ($undefmap && $mtype eq 'undefined') { # dgg
1242 $type_map->{$primary_tag} = $mtype if $mtype;
1244 elsif (ref($mtype) eq 'CODE') {
1245 $mtype = $mtype->($feat);
1248 $self->throw('must be scalar or CODE ref');
1251 elsif ($undefmap && $mtype eq 'undefined') { # dgg
1254 $feat->primary_tag($mtype);
1259 my %perfect_matches;
1260 while (my ($p_tag,$rules) = each %$YAML) {
1262 for my $rule (@
$rules) {
1263 for my $tags (@
$rule) {
1264 while (my ($tag,$values) = each %$tags) {
1265 for my $value (@
$values) {
1266 if ($feat->has_tag($tag)) {
1267 for ($feat->get_tag_values($tag)) {
1268 next RULE
unless $_ =~ /\Q$value\E/;
1270 } elsif ($tag eq 'primary_tag') {
1271 next RULE
unless $value eq
1273 } elsif ($tag eq 'location') {
1274 next RULE
unless $value eq
1275 $feat->start.'..'.$feat->end;
1276 } else { next RULE
}
1280 $perfect_matches{$p_tag}++;
1283 if (scalar(keys %perfect_matches) == 1) {
1284 $mtype = $_ for keys %perfect_matches;
1285 } elsif (scalar(keys %perfect_matches) > 1) {
1286 warn "There are conflicting rules in the config file for the" .
1287 " following types: ";
1288 warn "\t$_\n" for keys %perfect_matches;
1289 warn "Until conflict resolution is built into the converter," .
1290 " you will have to manually edit the config file to remove the" .
1291 " conflict. Sorry :(. Skipping user preference for this entry";
1296 if ( ! $mtype && $syn_map) {
1297 if ($feat->has_tag('note')) {
1301 my @note = $feat->each_tag_value('note');
1303 for my $k (keys %$syn_map) {
1305 if ($k =~ /"(.+)"/) {
1309 for my $note (@note) {
1311 # look through the notes to see if the description
1312 # is an exact match for synonyms
1313 if ( $syn eq $note ) {
1315 my @map = @
{$syn_map->{$k}};
1318 my $best_guess = $map[0];
1320 unshift @
{$all_matches[-1]}, [$best_guess];
1323 ? manual_curation
($feat, $best_guess, \
@all_matches)
1326 print '#' x
78 . "\nGuessing the proper SO term for GenBank"
1327 . " entry:\n\n" . GenBank_entry
($feat) . "\nis:\t$mtype\n"
1328 . '#' x
78 . "\n\n";
1331 # check both primary tag and and note against
1332 # SO synonyms for best matching description
1334 SO_fuzzy_match
( $k, $primary_tag, $note, $syn, \
@all_matches);
1342 for my $note (@note) {
1343 for my $name (values %$type_map) {
1344 # check primary tag against SO names for best matching
1345 # descriptions //NML also need to check against
1346 # definition && camel case split terms
1348 SO_fuzzy_match
($name, $primary_tag, $note, $name, \
@all_matches);
1353 if (scalar(@all_matches) && !$mtype) {
1355 my $top_matches = first
{ defined $_ } @
{$all_matches[-1]};
1357 my $best_guess = $top_matches->[0];
1361 # if guess has quotes, it is a synonym term. we need to
1362 # look up the corresponding name term
1363 # otherwise, guess is a name, so we can use it directly
1364 if ($best_guess =~ /"(.+)"/) {
1366 $best_guess = $syn_map->{$best_guess}->[0];
1370 @RETURN = @all_matches;
1372 ? manual_curation
($feat, $best_guess, \
@all_matches)
1375 print '#' x
78 . "\nGuessing the proper SO term for GenBank"
1376 . " entry:\n\n" . GenBank_entry
($feat) . "\nis:\t$mtype\n"
1377 . '#' x
78 . "\n\n";
1381 $mtype ||= $undefmap;
1382 $feat->primary_tag($mtype);
1389 sub SO_fuzzy_match
{
1391 my $candidate = shift;
1392 my $primary_tag = shift;
1394 my $SO_terms = shift;
1395 my $best_matches_ref = shift;
1396 my $modifier = shift;
1402 for ( split(" |_", $primary_tag) ) {
1403 #my @camelCase = /(?:[A-Z]|[a-z])(?:[A-Z]+|[a-z]*)(?=$|[A-Z])/g;
1404 my @camelCase = /(?:[A-Z]|[a-z])(?:[A-Z]+|[a-z]*)(?=$|[A-Z]|[;:.,])/g;
1405 push @feat_terms, @camelCase;
1408 for ( split(" |_", $note) ) {
1409 #my @camelCase = /(?:[A-Z]|[a-z])(?:[A-Z]+|[a-z]*)(?=$|[A-Z])/g;
1410 #my @camelCase = /(?:[A-Z]|[a-z])(?:[A-Z]+|[a-z]*)(?=$|[A-Z]|[;:.,])/g;
1411 (my $word = $_) =~ s/[;:.,]//g;
1412 push @feat_terms, $word;
1416 my @SO_terms = split(" |_", $SO_terms);
1418 # fuzzy match works on a simple point system. When 2 words match,
1419 # the $plus counter adds one. When they don't, the $minus counter adds
1420 # one. This is used to sort similar matches together. Better matches
1421 # are found at the end of the array, near the top.
1423 # NML: can we improve best match by using synonym tags
1424 # EXACT,RELATED,NARROW,BROAD?
1426 my ($plus, $minus) = (0, 0);
1431 map {$feat_terms{$_} = 1} @feat_terms;
1432 map {$SO_terms{$_} = 1} @SO_terms;
1434 for my $st (keys %SO_terms) {
1435 for my $ft (keys %feat_terms) {
1437 ($st =~ m/$modifier\Q$ft\E/) ?
$plus++ : $minus++;
1442 push @
{$$best_matches_ref[$plus][$minus]}, $candidate if $plus;
1446 sub manual_curation
{
1448 my ($feat, $default_opt, $all_matches) = @_;
1450 my @all_matches = @
$all_matches;
1452 # convert all SO synonyms into names and filter
1453 # all matches into unique term list because
1454 # synonyms can map to multiple duplicate names
1456 my (@unique_SO_terms, %seen);
1457 for (reverse @all_matches) {
1461 if ($_ =~ /"(.+)"/) {
1462 for (@
{$SYN_MAP->{$_}}) {
1463 push @unique_SO_terms, $_ unless $seen{$_};
1467 push @unique_SO_terms, $_ unless $seen{$_};
1474 my $s = scalar(@unique_SO_terms);
1479 "[a]uto : automatic input (selects best guess for remaining entries)\r" .
1480 "[f]ind : search for other SO terms matching your query (e.g. f gene)\r" .
1481 "[i]nput : add a specific term\r" .
1482 "[r]eset : reset to the beginning of matches\r" .
1483 "[s]kip : skip this entry (selects best guess for this entry)\r"
1487 "[n]ext : view the next ".OPTION_CYCLE
." terms\r" .
1488 "[p]rev : view the previous ".OPTION_CYCLE
." terms" if ($s > OPTION_CYCLE
);
1490 my $msg = #"\n\n" . '-' x 156 . "\n"
1491 "The converter found $s possible matches for the following GenBank entry: ";
1494 "Type a number to select the SO term that best matches"
1495 . " the genbank entry, or use any of the following options:\r" . '_' x
76 . "\r$more";
1498 # lookup filtered list to pull out definitions
1502 for (['name', 'name'], ['def', 'definition'], ['synonym',
1504 my ($label, $method) = @
$_;
1505 $term{$label} = \@
{[$term->$method]};
1507 [++$choice, $_->name, ($_->definition || 'none'), \
%term, $_->each_synonym ];
1508 } map { $ONTOLOGY->find_terms(-name
=> $_) } @unique_SO_terms;
1511 my $option = options_cycle
(0, OPTION_CYCLE
, $msg, $feat, $directions,
1512 $default_opt, @options);
1514 if ($option eq 'skip') { return $default_opt
1515 } elsif ($option eq 'auto') {
1517 return $default_opt;
1518 } else { return $option }
1524 my ($start, $stop, $msg, $feat, $directions, $best_guess, @opt) = @_;
1526 #NML: really should only call GenBank_entry once. Will need to change
1527 #method to return array & shift off header
1528 my $entry = GenBank_entry
($feat, "\r");
1530 my $total = scalar(@opt);
1532 ($start,$stop) = (0, OPTION_CYCLE
)
1533 if ( ($start < 0) && ($stop > 0) );
1535 ($start,$stop) = (0, OPTION_CYCLE
)
1536 if ( ( ($stop - $start) < OPTION_CYCLE
) && $stop < $total);
1538 ($start,$stop) = ($total - OPTION_CYCLE
, $total) if $start < 0;
1539 ($start,$stop) = (0, OPTION_CYCLE
) if $start >= $total;
1541 $stop = $total if $stop > $total;
1543 my $dir_copy = $directions;
1544 my $msg_copy = $msg;
1545 my $format = "format STDOUT = \n" .
1547 '^' . '<' x
77 . '| Available Commands:' . "\n" .
1548 '$msg_copy' . "\n" .
1551 '^' . '<' x
77 . '| ^' . '<' x
75 . '~' . "\n" .
1552 '$entry' . ' ' x
74 . '$dir_copy,' . "\n" .
1553 (' ' x
20 . '^' . '<' x
57 . '| ^' . '<' x
75 . '~' . "\n" .
1554 ' ' x
20 . '$entry,' . ' ' x
53 . '$dir_copy,' . "\n") x
1000 . ".\n";
1557 # eval throws redefined warning that breaks formatting.
1558 # Turning off warnings just for the eval to fix this.
1559 no warnings
'redefine';
1565 print '-' x
156 . "\n" .
1566 'Showing results ' . ( $stop ?
( $start + 1 ) : $start ) .
1567 " - $stop of possible SO term matches: (best guess is \"$best_guess\")" .
1568 "\n" . '-' x
156 . "\n";
1570 for (my $i = $start; $i < $stop; $i+=2) {
1572 my ($left, $right) = @opt[$i,$i+1];
1574 my ($nL, $nmL, $descL, $termL, @synL) = @
$left;
1576 #odd numbered lists can cause fatal undefined errors, so check
1577 #to make sure we have data
1579 my ($nR, $nmR, $descR, $termR, @synR) = ref($right) ? @
$right : (undef, undef, undef);
1582 my $format = "format STDOUT = \n";
1587 '@>>: name: ^' . '<' x
64 . '~' . ' |' .
1588 ( ref($right) ?
('@>>: name: ^' . '<' x
64 . '~' ) : '' ) . "\n" .
1589 '$nL,' . ' ' x
7 . '$nmL,' .
1590 ( ref($right) ?
(' ' x
63 . '$nR,' . ' ' x
7 . "\$nmR,") : '' ) . "\n" .
1592 ' ' x
11 . '^' . '<' x
61 . '...~' . ' |' .
1593 (ref($right) ?
(' ^' . '<' x
61 . '...~') : '') . "\n" .
1594 ' ' x
11 . '$nmL,' .
1595 (ref($right) ?
(' ' x
74 . '$nmR,') : '') . "\n" .
1596 #' ' x 78 . '|' . "\n" .
1599 ' def: ^' . '<' x
65 . ' |' .
1600 (ref($right) ?
(' def: ^' . '<' x
64 . '~') : '') . "\n" .
1601 ' ' x
11 . '$descL,' .
1602 (ref($right) ?
(' ' x
72 . '$descR,') : '') . "\n" .
1605 (' ^' . '<' x
65 . ' |' .
1606 (ref($right) ?
(' ^' . '<' x
64 . '~') : '') . "\n" .
1607 ' ' x
11 . '$descL,' .
1608 (ref($right) ?
(' ' x
72 . '$descR,') : '') . "\n") x
5 .
1611 ' ^' . '<' x
61 . '...~ |' .
1612 (ref($right) ?
(' ^' . '<' x
61 . '...~') : '') . "\n" .
1613 ' ' x
11 . '$descL,' .
1614 (ref($right) ?
(' ' x
72 . '$descR,') : '') . "\n" .
1619 # eval throws redefined warning that breaks formatting.
1620 # Turning off warnings just for the eval to fix this.
1621 no warnings
'redefine';
1627 print '-' x
156 . "\nenter a command:";
1631 (my $input = $_) =~ s/\s+$//;
1633 if ($input =~ /^\d+$/) {
1634 if ( $input && defined $opt[$input-1] ) {
1635 return $opt[$input-1]->[1]
1637 print "\nThat number is not an option. Please enter a valid number.\n:";
1639 } elsif ($input =~ /^n/i | $input =~ /next/i ) {
1640 return options_cycle
($start + OPTION_CYCLE
, $stop + OPTION_CYCLE
, $msg,
1641 $feat, $directions, $best_guess, @opt)
1642 } elsif ($input =~ /^p/i | $input =~ /prev/i ) {
1643 return options_cycle
($start - OPTION_CYCLE
, $stop - OPTION_CYCLE
, $msg,
1644 $feat, $directions, $best_guess, @opt)
1645 } elsif ( $input =~ /^s/i || $input =~ /skip/i ) { return 'skip'
1646 } elsif ( $input =~ /^a/i || $input =~ /auto/i ) { return 'auto'
1647 } elsif ( $input =~ /^r/i || $input =~ /reset/i ) {
1648 return manual_curation
($feat, $best_guess, \
@RETURN );
1649 } elsif ( $input =~ /^f/i || $input =~ /find/i ) {
1651 my ($query, @query_results);
1653 if ($input =~ /(?:^f|find)\s+?(.*?)$/) { $query = $1;
1657 print "Type your search query\n:";
1658 while (<STDIN
>) { chomp($query = $_); last }
1661 for (keys(%$TYPE_MAP), keys(%$SYN_MAP)) {
1662 SO_fuzzy_match
($_, $query, '', $_, \
@query_results, '(?i)');
1665 return manual_curation
($feat, $best_guess, \
@query_results);
1667 } elsif ( $input =~ /^i/i || $input =~ /input/i ) {
1669 #NML fast input for later
1671 #if ($input =~ /(?:^i|input)\s+?(.*?)$/) { $query = $1 };
1674 print "Type the term you want to use\n:";
1676 chomp(my $input = $_);
1678 if (! $TYPE_MAP->{$input}) {
1680 print "\"$input\" doesn't appear to be a valid SO term. Are ".
1681 "you sure you want to use it? (y or n)\n:";
1685 chomp(my $choice = $_);
1687 if ($choice eq 'y') {
1689 "\nWould you like to save your preference for " .
1690 "future use (so you don't have to redo manual " .
1691 "curation for this feature everytime you run " .
1692 "the converter)? (y or n)\n";
1694 #NML: all these while loops are a mess. Really should condense it.
1697 chomp(my $choice = $_);
1699 if ($choice eq 'y') {
1700 curation_save
($feat, $input);
1702 } elsif ($choice eq 'n') {
1705 print "\nDidn't recognize that command. Please " .
1711 } elsif ($choice eq 'n') {
1712 return options_cycle
($start, $stop, $msg, $feat,
1713 $directions, $best_guess, @opt)
1715 print "\nDidn't recognize that command. Please " .
1722 "\nWould you like to save your preference for " .
1723 "future use (so you don't have to redo manual " .
1724 "curation for this feature everytime you run " .
1725 "the converter)? (y or n)\n";
1727 #NML: all these while loops are a mess. Really should condense it.
1730 chomp(my $choice = $_);
1732 if ($choice eq 'y') {
1733 curation_save
($feat, $input);
1735 } elsif ($choice eq 'n') {
1738 print "\nDidn't recognize that command. Please " .
1747 print "\nDidn't recognize that command. Please re-enter your choice.\n:"
1754 my ($f, $delimiter, $num) = @_;
1756 $delimiter ||= "\n";
1761 ($num ?
' [1] ' : ' ' x
5) . $f->primary_tag
1763 ?
' ' x
(12 - length $f->primary_tag ) . ' [2] '
1764 : ' ' x
(15 - length $f->primary_tag)
1766 . $f->start.'..'.$f->end
1771 words_tag
($f, \
$entry);
1773 for my $tag ($f->all_tags) {
1774 for my $val ( $f->each_tag_value($tag) ) {
1776 #$entry .= "/$tag=\"$val\"$delimiter";
1777 $entry .= $val eq '_no_value'
1779 : "/$tag=\"$val\"$delimiter";
1790 warn "Validating GFF file\n" if $DEBUG;
1793 my (%parent2child, %all_ids, %descendants, %reserved);
1796 for my $aTags (['Parent', \
%parent2child], ['ID', \
%all_ids]) {
1797 map { push @
{$$aTags[1]->{$_}}, $f } $f->get_tag_values($$aTags[0])
1798 if $f->has_tag($$aTags[0]);
1803 while (my ($parentID, $aChildren) = each %parent2child) {
1804 parent_validate
($parentID, $aChildren, \
%all_ids, \
%descendants, \
%reserved);
1808 id_validate
(\
%all_ids, \
%reserved);
1811 sub parent_validate
{
1812 my ($parentID, $aChildren, $hAll, $hDescendants, $hReserved) = @_;
1814 my $aParents = $hAll->{$parentID};
1818 $child->add_tag_value( validation_error
=>
1819 "feature tried to add Parent tag, but no Parent found with ID $parentID"
1822 map { $parents{$_} = 1 } $child->get_tag_values('Parent');
1823 delete $parents{$parentID};
1824 my @parents = keys %parents;
1826 $child->remove_tag('Parent');
1828 unless ($child->has_tag('ID')) {
1829 my $id = gene_name
($child);
1830 $child->add_tag_value('ID', $id);
1831 push @
{$hAll->{$id}}, $child
1834 $child->add_tag_value('Parent', @parents) if @parents;
1836 } @
$aChildren and return unless scalar(@
$aParents);
1838 my $par = join(',', map { $_->primary_tag } @
$aParents);
1839 warn scalar(@
$aParents)." POSSIBLE PARENT(S): $par" if $DEBUG;
1841 #NML manual curation needs to happen here
1844 my %parentsToRemove;
1847 for my $child (@
$aChildren) {
1848 my $childType = $child->primary_tag;
1850 warn "WORKING ON $childType at ".$child->start.' to '.$child->end
1853 for (my $i = 0; $i < scalar(@
$aParents); $i++) {
1854 my $parent = $aParents->[$i];
1855 my $parentType = $parent->primary_tag;
1857 warn "CHECKING $childType against $parentType" if $DEBUG;
1859 #cache descendants so we don't have to do repeat searches
1860 unless ($hDescendants->{$parentType}) {
1862 for my $term ($ONTOLOGY->find_terms(
1863 -name
=> $parentType
1867 $hDescendants->{$parentType}{$_->name}++
1868 } $ONTOLOGY->get_descendant_terms($term);
1872 # NML: hopefully temporary fix.
1873 # SO doesn't consider exon/CDS to be a child of mRNA
1874 # even though common knowledge dictates that they are
1875 # This cheat fixes the false positives for now
1876 if ($parentType eq 'mRNA') {
1877 $hDescendants->{$parentType}{'exon'} = 1;
1878 $hDescendants->{$parentType}{'CDS'} = 1;
1883 warn "\tCAN $childType at " . $child->start . ' to ' . $child->end .
1884 " be a child of $parentType?" if $DEBUG;
1885 if ($hDescendants->{$parentType}{$childType}) {
1886 warn "\tYES, $childType can be a child of $parentType" if $DEBUG;
1888 #NML need to deal with multiple children matched to multiple different
1889 #parents. This model only assumes the first parent id that matches a child will
1890 #be the reserved feature.
1892 $hReserved->{$parentID}{$parent}{'parent'} = $parent;
1893 push @
{$hReserved->{$parentID}{$parent}{'children'}}, $child;
1895 #mark parent for later removal from all IDs
1896 #so we don't accidentally change any parents
1898 $parentsToRemove{$i}++;
1906 #NML shouldn't have to check this; somehow child can lose Parent
1907 #it's happening W3110
1908 #need to track this down
1909 if ( $child->has_tag('Parent') ) {
1911 warn "\tNO, @{[$child->primary_tag]} cannot be a child of $parentID"
1916 map { $parents{$_} = 1 } $child->get_tag_values('Parent');
1918 delete $parents{$parentID};
1919 my @parents = keys %parents;
1921 warn 'VALIDATION ERROR '.$child->primary_tag." at ".$child->start .
1922 ' to ' . $child->end . " cannot be a child of ID $parentID"
1925 $child->add_tag_value( validation_error
=>
1926 "feature cannot be a child of $parentID");
1928 $child->remove_tag('Parent');
1930 unless ($child->has_tag('ID')) {
1931 my $id = gene_name
($child);
1932 $child->add_tag_value('ID', $id);
1933 push @
{$hAll->{$id}}, $child
1936 $child->add_tag_value('Parent', @parents) if @parents;
1941 #delete $aParents->[$_] for keys %parentsToRemove;
1942 splice(@
$aParents, $_, 1) for keys %parentsToRemove;
1946 my ($hAll, $hReserved) = @_;
1949 for my $id (keys %$hAll) {
1951 #since 1 feature can have this id,
1952 #let's just shift it off and uniquify
1953 #the rest (unless it's reserved)
1955 shift @
{$hAll->{$id}} unless $hReserved->{$id};
1956 for my $feat (@
{$hAll->{$id}}) {
1957 id_uniquify
(0, $id, $feat, $hAll);
1961 for my $parentID (keys %$hReserved) {
1963 my @keys = keys %{$hReserved->{$parentID}};
1968 my $parent = $hReserved->{$parentID}{$k}{'parent'};
1969 my $aChildren= $hReserved->{$parentID}{$k}{'children'};
1971 my $value = id_uniquify
(0, $parentID, $parent, $hAll);
1972 for my $child (@
$aChildren) {
1975 map { $parents{$_}++ } $child->get_tag_values('Parent');
1976 $child->remove_tag('Parent');
1977 delete $parents{$parentID};
1979 my @parents = keys %parents;
1980 $child->add_tag_value('Parent', @parents);
1988 my ($count, $value, $feat, $hAll) = @_;
1990 warn "UNIQUIFYING $value" if $DEBUG;
1993 $feat->add_tag_value(Alias
=> $value);
1994 $value .= ('.' . $feat->primary_tag)
1995 } elsif ($count == 1) {
2003 warn "ENDED UP WITH $value" if $DEBUG;
2004 if ( $hAll->{$value} ) {
2005 warn "$value IS ALREADY TAKEN" if $DEBUG;
2006 id_uniquify
($count, $value, $feat, $hAll);
2008 #warn "something's breaking ".$feat->primary_tag.' at '.$feat->start.' to '.$feat->end;
2009 $feat->remove_tag('ID');
2010 $feat->add_tag_value('ID', $value);
2011 push @
{$hAll->{$value}}, $value;
2019 print "\nCannot read $CONF. Change file permissions and retry, " .
2020 "or enter another file\n" and conf_locate
() unless -r
$CONF;
2022 print "\nCannot write $CONF. Change file permissions and retry, " .
2023 "or enter another file\n" and conf_locate
() unless -w
$CONF;
2025 $YAML = LoadFile
($CONF);
2031 my ($path, $input) = @_;
2033 print "Cannot write to $path. Change directory permissions and retry " .
2034 "or enter another save path\n" and conf_locate
() unless -w
$path;
2038 open(FH
, '>', $CONF);
2045 sub conf_write
{ DumpFile
($CONF, $YAML) }
2049 print "\nEnter the location of a previously saved config, or a new " .
2050 "path and file name to create a new config (this step is " .
2051 "necessary to save any preferences)";
2053 print "\n\nenter a command:";
2056 chomp(my $input = $_);
2057 my ($fn, $path, $suffix) = fileparse
($input, qr/\.[^.]*/);
2059 if (-e
$input && (! -d
$input)) {
2061 print "\nReading $input...\n";
2067 } elsif (! -d
$input && $fn.$suffix) {
2069 print "Creating $input...\n";
2070 conf_create
($path, $input);
2073 } elsif (-e
$input && -d
$input) {
2074 print "You only entered a directory. " .
2075 "Please enter BOTH a directory and filename\n";
2077 print "$input does not appear to be a valid path. Please enter a " .
2078 "valid directory and filename\n";
2080 print "\nenter a command:";
2086 my ($feat, $input) = @_;
2088 #my $error = "Enter the location of a previously saved config, or a new " .
2089 # "path and file name to create a new config (this step is " .
2090 # "necessary to save any preferences)\n";
2095 } elsif (! -e
$CONF) {
2096 print "\n\nThe config file you have chosen doesn't exist.\n";
2098 } else { conf_read
() }
2100 my $entry = GenBank_entry
($feat, "\r", 1);
2102 my $msg = "Term entered: $input";
2103 my $directions = "Please select any/all tags that provide evidence for the term you
2104 have entered. You may enter multiple tags by separating them by commas/dashes
2105 (e.g 1,3,5-7). For tags with more than one word value (i.e 'note'), you have
2106 the option of either selecting the entire note as evidence, or specific
2107 keywords. If a tag has multiple keywords, they will be tagged alphabetically
2108 for selection. To select a specific keyword in a tag field, you must enter the
2109 tag number followed by the keyword letter (e.g 3a). Multiple keywords may be
2110 selected by entering each letter separated by commas/dashes (e.g 3b,f,4a-c). The more tags you select, the more specific the GenBank entry will have
2111 to be to match your curation. To match the GenBank entry exactly as it
2112 appears, type every number (start-end), or just type 'all'. Remember, once the converter saves your
2113 preference, you will no longer be prompted to choose a feature type for any
2114 matching entries until you edit the curation.ini file.";
2115 my $msg_copy = $msg;
2116 my $dir_copy = $directions;
2118 my $format = "format STDOUT = \n" .
2120 '^' . '<' x
77 . '| Directions:' . "\n" .
2121 '$msg_copy' . "\n" .
2124 '^' . '<' x
77 . '| ^' . '<' x
75 . '~' . "\n" .
2125 '$entry' . ' ' x
74 . '$dir_copy,' . "\n" .
2126 (' ' x
15 . '^' . '<' x
62 . '| ^' . '<' x
75 . '~' . "\n" .
2127 ' ' x
15 . '$entry,' . ' ' x
58 . '$dir_copy,' . "\n") x
20 . ".\n";
2130 # eval throws redefined warning that breaks formatting.
2131 # Turning off warnings just for the eval to fix this.
2132 no warnings
'redefine';
2137 print '-' x
156 . "\nenter a command:";
2139 my @tags = words_tag
($feat);
2145 chomp(my $choice = $_);
2147 if (scalar(keys %$final) && $choice =~ /^y/i) { last
2149 } elsif (scalar(keys %$final) && $choice =~ /^n/i) { curation_save
($feat, $input)
2151 } elsif (scalar(keys %$final)) { print "\nInvalid selection. Please try again\n";
2153 } elsif ($choice eq 'all') {
2156 for (my $i=1; $i < scalar(@tags); $i++) {
2161 #print "CHOICE [$choice]";
2164 for (split(/(?<=\w)[^[:alnum:]\-]+(?=\d)/, $choice)) {
2165 if ($_ =~ /(\d+)(?:\D*)-(\d+)(.*)/) {
2167 for ($1..$2) { push @selections, $_ }
2169 my $dangling_alphas = $3;
2170 alpha_expand
($dangling_alphas, \
@selections);
2173 alpha_expand
($_, \
@selections);
2177 foreach my $numbers (@selections) {
2179 my @c = split(/(?=[\w])/, $numbers);
2180 s/\W+//g foreach @c;
2185 $num = 0 + shift @c;
2188 my $tag = $tags[$num];
2189 if (ref $tag && scalar(@c)) {
2192 if (defined $tag->{$_}) {
2193 $choices .= "${num}[$_] ";
2194 my ($t,$v) = @
{$tag->{$_}};
2195 push @
{${$final->{$input}}[0]{$t}}, $v;
2197 } else { $no_value++ }
2201 _selection_add
($tag,$final,$input,\
$choices,$num);
2202 #my ($t,$v) = @{$tag->{'all'}};
2203 #unless (defined ${$final->{$input}}[0]{$t}) {
2204 #$choices .= "$num, ";
2205 #push @{${$final->{$input}}[0]{$t}}, $v
2209 $choices = substr($choices, 0, -2);
2212 } elsif (ref $tag) {
2213 _selection_add
($tag,$final,$input,\
$choices,$num);
2214 #my ($t,$v) = @{$tag->{'all'}};
2215 #unless (defined ${$final->{$input}}[0]{$t}) {
2216 #$choices .= "$num, ";
2217 #push @{${$final->{$input}}[0]{$t}}, $v
2221 $choices = substr($choices, 0, -2) if $choices;
2223 print "\nYou have chosen the following tags:\n$choices\n";
2224 print "This will be written to the config file as:\n";
2226 print "\nIs this correct? (y or n)\n";
2227 } else { print "\nInvalid selection. Please try again\n" }
2229 push @
{$YAML->{$input}}, $final->{$input};
2233 # words_tag() splits each tag value string into multiple words so that the
2234 # user can select the parts he/she wants to use for curation
2235 # it can tag 702 (a - zz) separate words; this should be enough
2239 my ($feat, $entry) = @_;
2243 @tags[1,2] = ({'all' => ['primary_tag', $feat->primary_tag]}, {'all' => ['location', $feat->start.'..'.$feat->end]});
2245 foreach my $tag ($feat->all_tags) {
2246 foreach my $value ($feat->each_tag_value($tag)) {
2248 my ($string, $tagged_string);
2250 my @words = split(/(?=\w+?)/, $value);
2255 foreach my $word (@words) {
2257 (my $sanitized_word = $word) =~ s/\W+?//g;
2260 my $lead = int($pos/ALPHABET_DIVISOR
);
2261 my $lag = $pos % ALPHABET_DIVISOR
;
2263 my $a = $lead ?
${(ALPHABET
)}[$lead-1] : '';
2264 $a .= $lag ?
${(ALPHABET
)}[$lag] : 'a';
2266 $tagged_string .= " ($a) $word";
2268 $tags[$i]{$a} = [$tag, $sanitized_word];
2272 $value = $tagged_string if scalar(@words) > 1;
2274 $$entry .= "[$i] /$tag=\"$value\"\r";
2276 $tags[$i]{'all'} = [$tag, $string];
2287 my ($dangling_alphas, $selections) = @_;
2289 if (defined($dangling_alphas) && $dangling_alphas =~ /(\d)*([[:alpha:]]+)-([[:alpha:]]+)/) {
2292 push @
$selections, $digit if $digit;
2297 my @starts = split('', $start);
2298 my @stops = split('', $stop);
2300 my ($final_start, $final_stop);
2302 for ([\
$final_start, \
@starts], [\
$final_stop, \
@stops]) {
2304 my ($final, $splits) = @
$_;
2306 my $int = ${(ALPHABET_TO_NUMBER
)}{$$splits[0]};
2311 $rem = ${(ALPHABET_TO_NUMBER
)}{$$splits[1]};
2319 $$final = $int * ALPHABET_DIVISOR
;
2324 my $last_number = pop @
$selections;
2325 for my $pos ($final_start..$final_stop) {
2326 my $lead = int($pos/ALPHABET_DIVISOR
);
2327 my $lag = $pos % ALPHABET_DIVISOR
;
2328 my $alpha = $lead ?
${(ALPHABET
)}[$lead-1] : '';
2329 $alpha .= $lag ?
${(ALPHABET
)}[$lag] : 'a';
2330 push @
$selections, $last_number.$alpha;
2333 } elsif (defined($dangling_alphas)) {
2334 if ($dangling_alphas =~ /^\d/) {
2335 push @
$selections, $dangling_alphas;
2336 } elsif ($dangling_alphas =~ /^\D/) {
2337 #print "$dangling_alphas ".Dumper @$selections;
2338 my $last_number = pop @
$selections;
2339 $last_number ||= '';
2340 push @
$selections, $last_number.$dangling_alphas;
2341 #$$selections[-1] .= $dangling_alphas;
2347 sub _selection_add
{
2349 my ($tag, $final, $input, $choices, $num) = @_;
2350 my ($t,$v) = @
{$tag->{'all'}};
2351 unless (defined ${$final->{$input}}[0]{$t}) {
2352 $$choices .= "$num, ";
2353 push @
{${$final->{$input}}[0]{$t}}, $v