TODO

   1 * Bison 3.5
   2 ** doc
   3 I feel its ugly to use the GNU style to declare functions in the doc.  It
   4 generates tons of white space in the page, and may contribute to bad page
   5 breaks.
   6
   7 Also, we seem to teach YYPRINT very early on, although it should be
   8 considered deprecated: %printer is superior.
   9
  10 ** glr.cc
  11 move glr.c into the yy namespace
  12
  13 ** improve syntax errors (UTF-8, internationalization)
  14 Bison depends on the current locale.  For instance:
  15
  16 %define parse.error verbose
  17 %code top {
  18  #include <stdio.h>
  19  #include <stdlib.h>
  20  void yyerror(const char* msg) { fprintf(stderr, "%s\n", msg); }
  21  int yylex() { return 0; }
  22 }
  23 %%
  24 exp: "↦" | "🎅🐃" | '\n'
  25 %%
  26 int main() { return yyparse(); }
  27
  28 gives different results with/without LC_ALL=C.
  29
  30 $ LC_ALL=C /opt/local/bin/bison /tmp/mangle.y -o ascii.c
  31 $ /opt/local/bin/bison /tmp/mangle.y -o utf8.c
  32 $ diff -u ascii.c utf8.c -I#line
  33 --- ascii.c     2019-01-12 08:15:35.878010093 +0100
  34 +++ utf8.c      2019-01-12 08:15:38.856495929 +0100
  35 @@ -415,9 +415,8 @@
  36     First, the terminals, then, starting at YYNTOKENS, nonterminals.  */
  37  static const char *const yytname[] =
  38  {
  39 -  "$end", "error", "$undefined", "\"\\342\\206\\246\"",
  40 -  "\"\\360\\237\\216\\205\\360\\237\\220\\203\"", "'\\n'", "$accept",
  41 -  "exp", YY_NULLPTR
  42 +  "$end", "error", "$undefined", "\"↦\"", "\"🎅🐃\"", "'\\n'",
  43 +  "$accept", "exp", YY_NULLPTR
  44  };
  45  #endif
  46
  47 $ gcc ascii.c -o ascii && ./ascii
  48 syntax error, unexpected $end, expecting "\342\206\246" or "\360\237\216\205\360\237\220\203" or '\n'
  49 $ gcc utf8.c -o utf8 && ./utf8
  50 syntax error, unexpected $end, expecting ↦ or 🎅🐃 or '\n'
  51
  52
  53 While at it, we should stop using "$end" by default, in favor of "end of
  54 file", or "end of input", whatever.  See how lalr1.java does that.
  55
  56 ** api.token.raw
  57 Maybe we should exhibit the YYUNDEFTOK token.  It could also be assigned a
  58 semantic value so that yyerror could be used to report invalid lexemes.
  59
  60 * Bison 3.6
  61 ** Unit rules
  62 Maybe we could expand unit rules (or "injections", see
  63 https://homepages.cwi.nl/~daybuild/daily-books/syntax/2-sdf/sdf.html), i.e.,
  64 transform
  65
  66         exp: arith | bool;
  67         arith: exp '+' exp;
  68         bool: exp '&' exp;
  69
  70 into
  71
  72         exp: exp '+' exp | exp '&' exp;
  73
  74 when there are no actions.  This can significantly speed up some grammars.
  75 I can't find the papers.  In particular the book 'LR parsing: Theory and
  76 Practice' is impossible to find, but according to 'Parsing Techniques: a
  77 Practical Guide', it includes information about this issue.  Does anybody
  78 have it?
  79
  80 ** Injection rules
  81 See above.
  82
  83 ** clean up
  84 *** lalr.c
  85 Introduce a goto struct, and use it in place of from_state/to_state.
  86 Rename states1 as path, length as pathlen.
  87 Introduce inline functions for things such as nullable[*rp - ntokens]
  88 where we need to map from symbol number to nterm number.
  89
  90 There are probably a significant part of the relations management that
  91 should be migrated on top of a bitsetv.
  92
  93 *** closure
  94 It should probably take a "state*" instead of two arguments.
  95
  96 *** traces
  97 The "automaton" and "set" categories are not so useful.  We should probably
  98 introduce lr(0) and lalr, just the way we have ielr categories.  The
  99 "closure" function is too verbose, it should probably have its own category.
 100
 101 "set" can still be used for summariring the important sets.  That would make
 102 tests easy to maintain.
 103
 104 *** complain.*
 105 Rename these guys as "diagnostics.*" (or "diagnose.*"), since that's the
 106 name they have in gcc, clang, etc.  Likewise for the complain_* series of
 107 functions.
 108
 109 * Modernization
 110 Fix data/skeletons/yacc.c so that it defines YYPTRDIFF_T properly for modern
 111 and older C++ compilers.  Currently the code defaults to defining it to
 112 'long' for non-GCC compilers, but it should use the proper C++ magic to
 113 define it to the same type as the C ptrdiff_t type.
 114
 115 * Completion
 116 Several features are not available in all the backends.
 117
 118 - push parsers: glr.cc, lalr1.cc
 119 - ielr: C++ and Java
 120 - glr: Java
 121 - token constructors: Java and C
 122
 123 * Bugs
 124 ** Autotest has quotation issues
 125 tests/input.at:1730:AT_SETUP([%define errors])
 126
 127 ->
 128
 129 $ ./tests/testsuite -l | grep errors | sed q
 130   38: input.at:1730      errors
 131
 132 * Short term
 133 ** consistency
 134 token vs terminal
 135
 136 ** C++
 137 Move to int everywhere instead of unsigned?  stack_size, etc.  The parser
 138 itself uses int (for yylen for instance), yet stack is based on size_t.
 139
 140 Maybe locations should also move to ints.
 141
 142 ** C
 143 Introduce state_type rather than spreading yytype_int16 everywhere?
 144
 145 ** glr.c
 146 yyspaceLeft should probably be a pointer diff.
 147
 148 ** Graphviz display code thoughts
 149 The code for the --graph option is over two files: print_graph, and
 150 graphviz. This is because Bison used to also produce VCG graphs, but since
 151 this is no longer true, maybe we could consider these files for fusion.
 152
 153 An other consideration worth noting is that print_graph.c (correct me if I
 154 am wrong) should contain generic functions, whereas graphviz.c and other
 155 potential files should contain just the specific code for that output
 156 format. It will probably prove difficult to tell if the implementation is
 157 actually generic whilst only having support for a single format, but it
 158 would be nice to keep stuff a bit tidier: right now, the construction of the
 159 bitset used to show reductions is in the graphviz-specific code, and on the
 160 opposite side we have some use of \l, which is graphviz-specific, in what
 161 should be generic code.
 162
 163 Little effort seems to have been given to factoring these files and their
 164 rint{,-xml} counterpart. We would very much like to re-use the pretty format
 165 of states from .output for the graphs, etc.
 166
 167 Also, the underscore in print_graph.[ch] isn't very fitting considering the
 168 dashes in the other filenames.
 169
 170 Since graphviz dies on medium-to-big grammars, maybe consider an other tool?
 171
 172 ** push-parser
 173 Check it too when checking the different kinds of parsers.  And be
 174 sure to check that the initial-action is performed once per parsing.
 175
 176 ** m4 names
 177 b4_shared_declarations is no longer what it is.  Make it
 178 b4_parser_declaration for instance.
 179
 180 ** yychar in lalr1.cc
 181 There is a large difference bw maint and master on the handling of
 182 yychar (which was removed in lalr1.cc).  See what needs to be
 183 back-ported.
 184
 185
 186     /* User semantic actions sometimes alter yychar, and that requires
 187        that yytoken be updated with the new translation.  We take the
 188        approach of translating immediately before every use of yytoken.
 189        One alternative is translating here after every semantic action,
 190        but that translation would be missed if the semantic action
 191        invokes YYABORT, YYACCEPT, or YYERROR immediately after altering
 192        yychar.  In the case of YYABORT or YYACCEPT, an incorrect
 193        destructor might then be invoked immediately.  In the case of
 194        YYERROR, subsequent parser actions might lead to an incorrect
 195        destructor call or verbose syntax error message before the
 196        lookahead is translated.  */
 197
 198     /* Make sure we have latest lookahead translation.  See comments at
 199        user semantic actions for why this is necessary.  */
 200     yytoken = yytranslate_ (yychar);
 201
 202
 203 ** Get rid of fake #lines [Bison: ...]
 204 Possibly as simple as checking whether the column number is nonnegative.
 205
 206 I have seen messages like the following from GCC.
 207
 208 <built-in>:0: fatal error: opening dependency file .deps/libltdl/argz.Tpo: No such file or directory
 209
 210
 211 ** Discuss about %printer/%destroy in the case of C++.
 212 It would be very nice to provide the symbol classes with an operator<<
 213 and a destructor.  Unfortunately the syntax we have chosen for
 214 %destroy and %printer make them hard to reuse.  For instance, the user
 215 is invited to write something like
 216
 217    %printer { debug_stream() << $$; } <my_type>;
 218
 219 which is hard to reuse elsewhere since it wants to use
 220 "debug_stream()" to find the stream to use.  The same applies to
 221 %destroy: we told the user she could use the members of the Parser
 222 class in the printers/destructors, which is not good for an operator<<
 223 since it is no longer bound to a particular parser, it's just a
 224 (standalone symbol).
 225
 226 * Various
 227 ** Rewrite glr.cc in C++
 228 As a matter of fact, it would be very interesting to see how much we can
 229 share between lalr1.cc and glr.cc.  Most of the skeletons should be common.
 230 It would be a very nice source of inspiration for the other languages.
 231
 232 ** YYERRCODE
 233 Defined to 256, but not used, not documented.  Probably the token
 234 number for the error token, which POSIX wants to be 256, but which
 235 Bison might renumber if the user used number 256.  Keep fix and doc?
 236 Throw away?
 237
 238 Also, why don't we output the token name of the error token in the
 239 output?  It is explicitly skipped:
 240
 241       /* Skip error token and tokens without identifier.  */
 242       if (sym != errtoken && id)
 243
 244 Of course there are issues with name spaces, but if we disable we have
 245 something which seems to be more simpler and more consistent instead
 246 of the special case YYERRCODE.
 247
 248    enum yytokentype {
 249      error = 256,
 250      // ...
 251    };
 252
 253
 254 We could (should?) also treat the case of the undef_token, which is
 255 numbered 257 for yylex, and 2 internal.  Both appear for instance in
 256 toknum:
 257
 258   const unsigned short
 259   parser::yytoken_number_[] =
 260   {
 261        0,   256,   257,   258,   259,   260,   261,   262,   263,   264,
 262
 263 while here
 264
 265    enum yytokentype {
 266      TOK_EOF = 0,
 267      TOK_EQ = 258,
 268
 269 so both 256 and 257 are "mysterious".
 270
 271   const char*
 272   const parser::yytname_[] =
 273   {
 274   "\"end of command\"", "error", "$undefined", "\"=\"", "\"break\"",
 275
 276
 277 ** yychar == yyempty_
 278 The code in yyerrlab reads:
 279
 280       if (yychar <= YYEOF)
 281         {
 282           /* Return failure if at end of input.  */
 283           if (yychar == YYEOF)
 284             YYABORT;
 285         }
 286
 287 There are only two yychar that can be <= YYEOF: YYEMPTY and YYEOF.
 288 But I can't produce the situation where yychar is YYEMPTY here, is it
 289 really possible?  The test suite does not exercise this case.
 290
 291 This shows that it would be interesting to manage to install skeleton
 292 coverage analysis to the test suite.
 293
 294 * From lalr1.cc to yacc.c
 295 ** Single stack
 296 Merging the three stacks in lalr1.cc simplified the code, prompted for
 297 other improvements and also made it faster (probably because memory
 298 management is performed once instead of three times).  I suggest that
 299 we do the same in yacc.c.
 300
 301 ** yysyntax_error
 302 The code bw glr.c and yacc.c is really alike, we can certainly factor
 303 some parts.
 304
 305
 306 * Report
 307
 308 ** Figures
 309 Some statistics about the grammar and the parser would be useful,
 310 especially when asking the user to send some information about the
 311 grammars she is working on.  We should probably also include some
 312 information about the variables (I'm not sure for instance we even
 313 specify what LR variant was used).
 314
 315 ** GLR
 316 How would Paul like to display the conflicted actions?  In particular,
 317 what when two reductions are possible on a given lookahead token, but one is
 318 part of $default.  Should we make the two reductions explicit, or just
 319 keep $default?  See the following point.
 320
 321 ** Disabled Reductions
 322 See 'tests/conflicts.at (Defaulted Conflicted Reduction)', and decide
 323 what we want to do.
 324
 325 ** Documentation
 326 Extend with error productions.  The hard part will probably be finding
 327 the right rule so that a single state does not exhibit too many yet
 328 undocumented ''features''.  Maybe an empty action ought to be
 329 presented too.  Shall we try to make a single grammar with all these
 330 features, or should we have several very small grammars?
 331
 332 ** --report=conflict-path
 333 Provide better assistance for understanding the conflicts by providing
 334 a sample text exhibiting the (LALR) ambiguity.  See the paper from
 335 DeRemer and Penello: they already provide the algorithm.
 336
 337 ** Statically check for potential ambiguities in GLR grammars
 338 See <http://www.lsv.fr/~schmitz/pub/expamb.pdf> for an approach.
 339 An Experimental Ambiguity Detection Tool ∗ Sylvain Schmitz
 340 LORIA, INRIA Nancy - Grand Est, Nancy, France
 341
 342 * Extensions
 343 ** Multiple start symbols
 344 Would be very useful when parsing closely related languages.
 345
 346 ** Better error messages
 347 The users are not provided with enough tools to forge their error messages.
 348 See for instance "Is there an option to change the message produced by
 349 YYERROR_VERBOSE?" by Simon Sobisch, on bison-help.
 350
 351 See also
 352 https://www.cs.tufts.edu/~nr/cs257/archive/clinton-jefferey/lr-error-messages.pdf
 353 and https://research.swtch.com/yyerror.
 354
 355 ** %include
 356 This is a popular demand.  We already made many changes in the parser that
 357 should make this reasonably easy to implement.
 358
 359 Bruce Mardle <marblypup@yahoo.co.uk>
 360 https://lists.gnu.org/archive/html/bison-patches/2015-09/msg00000.html
 361
 362 ** Push parsers
 363 There is demand for push parsers in Java and C++.  And GLR I guess.
 364
 365 ** Generate code instead of tables
 366 This is certainly quite a lot of work.  See
 367 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.4539.
 368
 369 ** $-1
 370 We should find a means to provide an access to values deep in the
 371 stack.  For instance, instead of
 372
 373         baz: qux { $$ = $<foo>-1 + $<bar>0 + $1; }
 374
 375 we should be able to have:
 376
 377   foo($foo) bar($bar) baz($bar): qux($qux) { $baz = $foo + $bar + $qux; }
 378
 379 Or something like this.
 380
 381 ** %if and the like
 382 It should be possible to have %if/%else/%endif.  The implementation is
 383 not clear: should it be lexical or syntactic.  Vadim Maslow thinks it
 384 must be in the scanner: we must not parse what is in a switched off
 385 part of %if.  Akim Demaille thinks it should be in the parser, so as
 386 to avoid falling into another CPP mistake.
 387
 388 ** XML Output
 389 There are couple of available extensions of Bison targeting some XML
 390 output.  Some day we should consider including them.  One issue is
 391 that they seem to be quite orthogonal to the parsing technique, and
 392 seem to depend mostly on the possibility to have some code triggered
 393 for each reduction.  As a matter of fact, such hooks could also be
 394 used to generate the yydebug traces.  Some generic scheme probably
 395 exists in there.
 396
 397 XML output for GNU Bison and gcc
 398    http://www.cs.may.ie/~jpower/Research/bisonXML/
 399
 400 XML output for GNU Bison
 401    http://yaxx.sourceforge.net/
 402
 403 ** Counterexample generation
 404 https://lists.gnu.org/archive/html/bug-bison/2016-06/msg00000.html
 405 http://www.cs.cornell.edu/andru/papers/cupex/
 406
 407 * Coding system independence
 408 Paul notes:
 409
 410         Currently Bison assumes 8-bit bytes (i.e. that UCHAR_MAX is
 411         255).  It also assumes that the 8-bit character encoding is
 412         the same for the invocation of 'bison' as it is for the
 413         invocation of 'cc', but this is not necessarily true when
 414         people run bison on an ASCII host and then use cc on an EBCDIC
 415         host.  I don't think these topics are worth our time
 416         addressing (unless we find a gung-ho volunteer for EBCDIC or
 417         PDP-10 ports :-) but they should probably be documented
 418         somewhere.
 419
 420         More importantly, Bison does not currently allow NUL bytes in
 421         tokens, either via escapes (e.g., "x\0y") or via a NUL byte in
 422         the source code.  This should get fixed.
 423
 424 * Broken options ?
 425 ** %token-table
 426 ** Skeleton strategy
 427 Must we keep %token-table?
 428
 429 * Precedence
 430
 431 ** Partial order
 432 It is unfortunate that there is a total order for precedence.  It
 433 makes it impossible to have modular precedence information.  We should
 434 move to partial orders (sounds like series/parallel orders to me).
 435
 436
 437 * $undefined
 438 From Hans:
 439 - If the Bison generated parser experiences an undefined number in the
 440 character range, that character is written out in diagnostic messages, an
 441 addition to the $undefined value.
 442
 443 Suggest: Change the name $undefined to undefined; looks better in outputs.
 444
 445
 446 * Pre and post actions.
 447 From: Florian Krohm <florian@edamail.fishkill.ibm.com>
 448 Subject: YYACT_EPILOGUE
 449 To: bug-bison@gnu.org
 450 X-Sent: 1 week, 4 days, 14 hours, 38 minutes, 11 seconds ago
 451
 452 The other day I had the need for explicitly building the parse tree. I
 453 used %locations for that and defined YYLLOC_DEFAULT to call a function
 454 that returns the tree node for the production. Easy. But I also needed
 455 to assign the S-attribute to the tree node. That cannot be done in
 456 YYLLOC_DEFAULT, because it is invoked before the action is executed.
 457 The way I solved this was to define a macro YYACT_EPILOGUE that would
 458 be invoked after the action. For reasons of symmetry I also added
 459 YYACT_PROLOGUE. Although I had no use for that I can envision how it
 460 might come in handy for debugging purposes.
 461 All is needed is to add
 462
 463 #if YYLSP_NEEDED
 464     YYACT_EPILOGUE (yyval, (yyvsp - yylen), yylen, yyloc, (yylsp - yylen));
 465 #else
 466     YYACT_EPILOGUE (yyval, (yyvsp - yylen), yylen);
 467 #endif
 468
 469 at the proper place to bison.simple. Ditto for YYACT_PROLOGUE.
 470
 471 I was wondering what you think about adding YYACT_PROLOGUE/EPILOGUE
 472 to bison. If you're interested, I'll work on a patch.
 473
 474 * Better graphics
 475 Equip the parser with a means to create the (visual) parse tree.
 476
 477 * Complaint submessage indentation.
 478 We already have an implementation that works fairly well for named
 479 reference messages, but it would be nice to use it consistently for all
 480 submessages from Bison.  For example, the "previous definition"
 481 submessage or the list of correct values for a %define variable might
 482 look better with indentation.
 483
 484 However, the current implementation makes the assumption that the
 485 location printed on the first line is not usually much shorter than the
 486 locations printed on the submessage lines that follow.  That assumption
 487 may not hold true as often for some kinds of submessages especially if
 488 we ever support multiple grammar files.
 489
 490 Here's a proposal for how a new implementation might look:
 491
 492   http://lists.gnu.org/archive/html/bison-patches/2009-09/msg00086.html
 493
 494
 495 Local Variables:
 496 mode: outline
 497 coding: utf-8
 498 fill-column: 76
 499 End:
 500
 501 -----
 502
 503 Copyright (C) 2001-2004, 2006, 2008-2015, 2018-2019 Free Software
 504 Foundation, Inc.
 505
 506 This file is part of Bison, the GNU Compiler Compiler.
 507
 508 This program is free software: you can redistribute it and/or modify
 509 it under the terms of the GNU General Public License as published by
 510 the Free Software Foundation, either version 3 of the License, or
 511 (at your option) any later version.
 512
 513 This program is distributed in the hope that it will be useful,
 514 but WITHOUT ANY WARRANTY; without even the implied warranty of
 515 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 516 GNU General Public License for more details.
 517
 518 You should have received a copy of the GNU General Public License
 519 along with this program.  If not, see <http://www.gnu.org/licenses/>.