   1                      Basis of AI Backprop\r
   2                    Code from April 10, 1996\r
   3                Documentation from April 10, 1996\r
   4 \r
   5 Copyright (c) 1990-96 by Donald R. Tveter\r
   6 \r
   7 CONTENTS\r
   8 --------\r
   9 1. Introduction\r
  10 2. Making the Simulators\r
  11 3. A Simple Example\r
  12 4. Basic Facilities\r
  13 5. The Format Command\r
  14 6. Taking Training and Testing Patterns from a File\r
  15 7. Saving and Restoring Weights\r
  16 8. Initializing Weights\r
  17 9. The Seed Values\r
  18 10. The Algorithm Command\r
  19 11. The Delta-Bar-Delta Method\r
  20 12. Quickprop\r
  21 13. Making a Network\r
  22 14. Recurrent Networks\r
  23 15. Miscellaneous Commands\r
  24 16. Limitations\r
  25 17. The Pro Version Additions\r
  26 \r
  27 1. Introduction\r
  28 ---------------\r
  29    This manual describes the free version of my Basis of AI Backprop\r
  30 designed to accompany my not yet published (sigh) textbook, _The Basis\r
  31 of AI_.  This program contains enough features for students in an\r
  32 ordinary AI or Neural Networking course.  More serious users will\r
  33 probably need the professional version of this software, see:\r
  34 \r
  35 http://www.mcs.com/~drt/probp.html\r
  36 \r
  37 or send me email at: drt@mcs.com.  Other free NN software for the\r
  38 textbook is also available at:\r
  39 \r
  40 http://www.mcs.com/~drt/svbp.html\r
  41 \r
  42    For more on backprop see my "Backpropagator's Review" at:\r
  43 \r
  44 http://www.mcs.com/~drt/bprefs.html\r
  45 \r
  46    Notice: this is use at your own risk software.  There is no guarantee\r
  47 that it is bug-free.  Use of this software constitutes acceptance for\r
  48 use in an as is condition.  There are no warranties with regard to this\r
  49 software. In no event shall the author be liable for any damages\r
  50 whatsoever arising out of or in connection with the use or performance\r
  51 of this software.\r
  52 \r
  53    There are four simulators that can be constructed from the included\r
  54 files.  The program, bp, does back-propagation using real weights and\r
  55 arithmetic.  The program, ibp, does back-propagation using 16-bit\r
  56 integer weights, 16 and 32-bit integer arithmetic and some floating\r
  57 point arithmetic.  The program, sbp, uses symmetric floating point\r
  58 weights and its sole purpose is to produce weights for two-layer\r
networks for use with the Hopfield and Boltzmann relaxation algorithms
(included in another package).  The program sibp does the same using
16-bit integer weights.  The integer versions are faster on systems
without floating point hardware.  Sometimes, however, these versions
don't have enough range or precision, and then the floating point
versions are necessary.  DOS binaries are included here for systems with
floating point hardware.  If you need other versions, write me.
  66 \r
  67 2. Making the Simulators\r
  68 ------------------------\r
  69    This code has been written to use either 32-bit floating point\r
  70 (float) or 64-bit floating point (double) arithmetic.  On System V\r
  71 machines the standard seems to be that all floating point arithmetic is\r
  72 done with double precision arithmetic so double arithmetic is faster\r
  73 than float and therefore this is the default.  Other versions of C (e.g.\r
  74 ANSI C) will do single precision real arithmetic which will ordinarily\r
  75 be faster on most machines (I think).  To get 32-bit floating point set\r
  76 the compiler flag FLOAT in the makefile.  The function, exp, defined in\r
  77 real.c is double since System V specifies it as double.  If your C uses\r
  78 float, change this definition as well.\r
  79 \r
   For UNIX systems, use either makefile.unx or makereal.unx.
makefile.unx will make any of the programs and will keep the bp object
code files around, while makereal.unx will only make bp, and it also
keeps the bp object code files around.  Also for DOS systems
  84 there are two makefiles to choose from, makefile and makereal.  Makefile\r
  85 is designed to make all four programs but it only leaves around the\r
  86 object files for ibp while erasing object files for sibp, sbp and bp.\r
  87 On the other hand, makereal only makes bp and it leaves its object\r
  88 files around.  For 16-bit DOS you need to set the flag, -DDOS16 and for\r
  89 32-bit DOS, you need to set the flag -DDOS32.  The flags I have in the\r
  90 DOS makefiles are what I use with Zortech C 3.1.  The code is known not\r
  91 to compile with at least one version of Turbo C because of an oddity\r
  92 (or bug?) in the compiler.\r
  93 \r
   There was a problem found with the previous free student version
where it crashed on a Sun when the program hit a call to free in the
file bp.c.  This can be solved by removing the two calls to free; the
amount of space you waste is minimal.  I haven't had a report of such a
problem with this version yet, but if it happens, let me know.  In all
probability removing a call or two to free in the file io.c will solve
the problem.
 101 \r
 102    This code will work with basic C compilers however the libraries\r
 103 sometimes vary from system to system.  DOS systems seem to use the\r
 104 function getch in the conio library for hot key capability.  For a\r
 105 System V UNIX system the code uses a home-made function called getch for\r
 106 hot key capability.  This is the default setting for a UNIX system and\r
it also works with Suns.  If you use BSD UNIX then you need to define
the compiler variable BSD in the cc command by adding the parameter
-DBSD.  To get the hotkey feature to work with a NeXT use the
 110 parameter -DNEXT.  At this point I don't know what other variations of\r
 111 UNIX use so you may need to adapt the ioctl function call in the file\r
 112 io.c and the files rbp.h and ibp.h to make them fit some other version.\r
 113 If your system uses some other standard then if you can send me the\r
 114 documentation I should be able to make it work as well.  If necessary\r
 115 the hot key option can be removed by removing or commenting out the\r
 116 line:\r
 117 \r
 118 #define HOTKEYS\r
 119 \r
 120 in the rbp.h and ibp.h files.\r
 121 \r
 122    There are some other more minor options that can be compiled in or\r
 123 left out but these are mentioned at other points in the documentation.\r
 124 \r
 125    To make a particular executable file use the makefile given with the\r
 126 data files and make any or all of them like so:\r
 127 \r
 128         UNIX                        DOS\r
 129 \r
 130     make -f makereal.unx bp       make -f makereal bp\r
 131     make -f makefile.unx bp       make bp\r
 132     make -f makefile.unx ibp      make ibp\r
 133     make -f makefile.unx sibp     make sibp\r
 134     make -f makefile.unx sbp      make sbp\r
 135 \r
 136 If you do get bugs on an odd system and you can let me telnet in to\r
 137 your system (preferably on a separate login, rather than your personal\r
 138 login) I will try and fix the problem for you.\r
 139 \r
 140 \r
 141 3. A Simple Example\r
 142 -------------------\r
 143    Each version would normally be called with the name of a file to read\r
 144 commands from, as in:\r
 145 \r
 146 bp xor\r
 147 \r
 148 After the data file is read commands are then taken from the keyboard.\r
 149 When no file name is specified bp will take commands from the keyboard\r
 150 (stdin file).  Normally you will find it convenient to put the commands\r
 151 you need to set up the network in a short file however it is possible to\r
 152 type them all in to the program from the keyboard.  If you have more\r
 153 than a tiny amount of data you should have the data ready in a training\r
 154 file and a testing file if you have test data.\r
 155 \r
   The commands are one, two or three letter commands and most of them
have optional parameters.  The `a', `d', `f' and `q' commands allow a
number of sub-commands on a line.  The maximum length of any line is 256
characters.  An `*' starts a comment that runs to the end of the line.
In addition ctrl-R will run the training.
 162 \r
 163    Here is an example of a data file to do the xor problem:\r
 164            \r
 165 * input file for the xor problem\r
 166            \r
 167 m 2 1 1 x     * make a 2-1-1 network with extra input-output connections\r
 168 s 7           * seed the random number function\r
 169 ci            * clear and initialize the network with random weights\r
 170 \r
 171 rt {          * read training patterns into memory\r
 172 1 0 1\r
 173 0 0 0\r
 174 0 1 1\r
 175 1 1 0}\r
 176 \r
 177 e 0.5         * set eta, the learning rate to 0.5 (and eta2 to 0.5)\r
 178 a 0.9         * set alpha, the momentum to 0.9\r
 179 \r
First in this example, the m command will make a network with 2 units in
the input layer, 1 unit in the second layer and 1 unit in the third
layer.  Much of the time a three layer network where the connections are
only between adjacent layers is as complex as a network needs to be.
However, there are problems where having additional connections between
the input units and output units will greatly speed up the learning
process.  The xor problem is one of those problems where the extra
connections help, so the `x' at the end of the command will add these
two extra connections.  The `s' (seed) command sets the seed for the
 189 random number function.  The "ci" command (clear and initialize) clears\r
 190 the existing network weights and initializes the weights to random\r
 191 values between -1 and +1.  The rt (read training set) command gives four\r
 192 new patterns to be read into the program.  All of them are listed\r
 193 between the curly brackets ({}).  The input pattern comes first followed\r
 194 by the output pattern.  The command "e 0.5" sets eta, the learning\r
 195 rate for the upper layer to 0.5 and eta2 for the lower layers to 0.5 as\r
 196 well.  The last line sets alpha, the momentum parameter, to 0.9.\r
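
   The shape of the network built by "m 2 1 1 x" can be sketched in a
few lines of Python.  This is only an illustration and not the code in
bp: the function names are made up here, and the weights are hand-picked
(hypothetical) values chosen so the outputs land within the default
tolerance of 0.1, not the weights the program would learn.

```python
import math


def sigmoid(x):
    """Standard sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def forward_2_1_1_x(inputs, w_hidden, b_hidden, w_direct, w_from_hidden, b_out):
    """Forward pass for a 2-1-1 network with direct input-output links."""
    # The single hidden unit sees both inputs plus its bias.
    hidden = sigmoid(sum(w * x for w, x in zip(w_hidden, inputs)) + b_hidden)
    # The output unit sees the hidden unit, its bias, and (because of the
    # `x' option) the two input units directly.
    net = (sum(w * x for w, x in zip(w_direct, inputs))
           + w_from_hidden * hidden + b_out)
    return sigmoid(net)


# Hand-picked, hypothetical weights (assumption, for illustration only).
W_HIDDEN, B_HIDDEN = (6.0, 6.0), -9.0
W_DIRECT, W_FROM_HIDDEN, B_OUT = (8.0, 8.0), -18.0, -4.0

# Same four patterns as the rt command: two inputs, then the target.
XOR_PATTERNS = [((1, 0), 1), ((0, 0), 0), ((0, 1), 1), ((1, 1), 0)]


def xor_outputs():
    """Return the network output for each of the four xor patterns."""
    return [forward_2_1_1_x(inp, W_HIDDEN, B_HIDDEN,
                            W_DIRECT, W_FROM_HIDDEN, B_OUT)
            for inp, _ in XOR_PATTERNS]


if __name__ == "__main__":
    for (inp, target), out in zip(XOR_PATTERNS, xor_outputs()):
        print(inp, target, round(out, 3))
```

With these weights the hidden unit acts roughly as an "and" detector
and the direct connections carry the "or" part, which is one way to see
why the two extra connections make xor easier.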
 197 \r
   After these commands are executed the following messages and prompt
appear:
 200 \r
 201 Basis of AI Backprop (c) 1990-96 by Donald R. Tveter\r
 202    drt@mcs.com - http://www.mcs.com/~drt/home.html\r
 203                April 10, 1996 version.\r
 204 taking commands from stdin now\r
 205 [ACDFGMNPQTW?!acdefhlmopqrstw]? q\r
 206 \r
 207 The characters within the square brackets are a list of the possible\r
 208 commands.  To run 100 iterations of back-propagation and print out the\r
 209 status of the learning every 10 iterations type "r 100 10" at the\r
 210 prompt:\r
 211 \r
 212 [ACDFGMNPQTW?!acdefhlmopqrstw]? r 100 10\r
 213 \r
 214 This gives:\r
 215  \r
 216 running . . .\r
 217    10      0.00 % 0.49947   \r
 218    20      0.00 % 0.49798   \r
 219    30      0.00 % 0.48713   \r
 220    40      0.00 % 0.37061   \r
 221    50      0.00 % 0.15681   \r
 222    59    100.00 % 0.07121    DONE\r
 223 \r
 224 The program immediately prints out the "running .  .  ." message.  After\r
 225 each 10 iterations a summary of the learning process is printed giving\r
 226 the percentage of patterns that are right and the average value of the\r
 227 absolute values of the errors of the output units.  The program stops\r
when each output for each pattern has been learned to within the
 229 required tolerance, in this case the default value of 0.1.  Sometimes\r
 230 the integer versions will do a few extra iterations before declaring the\r
 231 problem done because of truncation errors in the arithmetic done to\r
 232 check for convergence.  Unlike the previous student version the default\r
 233 for these values is to be "up-to-date" however this can be over-ridden\r
 234 to save a little on CPU time.\r
 235 \r
 236    There are many factors that affect the number of iterations needed\r
 237 for a network to converge.  For instance if your random number function\r
 238 doesn't generate the same values as the one with the Zortech 3.1\r
 239 compiler (which is the same one used by most UNIX C compilers) the\r
 240 number of iterations it takes will be different.  The integer versions\r
produce slightly different results than the floating point versions.
 242 \r
 243 Listing Patterns\r
 244 \r
 245    To get a listing of the status of each pattern use the `p' command\r
 246 to give:\r
 247 \r
 248 [ACDFGMNPQTW?!acdefhlmopqrstw]? p\r
 249     1  0.903  e 0.097 ok\r
 250     2  0.050  e 0.050 ok\r
 251     3  0.935  e 0.065 ok\r
 252     4  0.072  e 0.072 ok\r
 253    59    (TOL) 100.00 % (4 right  0 wrong)  0.07121 err/unit\r
 254 \r
The number following the e (for error) is the sum of the absolute values
 256 of the output errors for each pattern.  An `ok' is given to every\r
 257 pattern that has been learned to within the required tolerance.  To get\r
 258 the status of one pattern, say, the fourth pattern, type "p 4" to give:\r
 259 \r
 260  0.07  (0.072) ok\r
 261 \r
 262 To get a summary without the complete listing use "p 0".  To get the\r
 263 output targets for a given pattern, say pattern 3, use "o 3".\r
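
   The numbers in the `p' listing can be reproduced with a short Python
sketch.  It is not the program's own code: the function name is made up
here, and treating "within tolerance" as a strict less-than test is an
assumption.  The output values are the ones printed in the listing
above, with the xor targets 1, 0, 1, 0.

```python
def pattern_report(outputs, targets, tolerance=0.1):
    """Return per-pattern (error, ok) rows plus summary figures."""
    rows = []
    total_error = 0.0
    unit_count = 0
    for out, tgt in zip(outputs, targets):
        errors = [abs(t - o) for o, t in zip(out, tgt)]
        # Assumption: a pattern is "ok" when every output unit is
        # strictly within the tolerance of its target.
        ok = all(e < tolerance for e in errors)
        rows.append((sum(errors), ok))
        total_error += sum(errors)
        unit_count += len(errors)
    right = sum(1 for _, ok in rows if ok)
    percent_right = 100.0 * right / len(rows)
    err_per_unit = total_error / unit_count
    return rows, right, percent_right, err_per_unit


# Outputs as printed by the `p' command, one output unit per pattern.
OUTPUTS = [[0.903], [0.050], [0.935], [0.072]]
TARGETS = [[1], [0], [1], [0]]

ROWS, RIGHT, PERCENT_RIGHT, ERR_PER_UNIT = pattern_report(OUTPUTS, TARGETS)

if __name__ == "__main__":
    for number, (error, ok) in enumerate(ROWS, start=1):
        print(number, round(error, 3), "ok" if ok else "")
    print(PERCENT_RIGHT, RIGHT, round(ERR_PER_UNIT, 5))
```

The average comes out at about 0.071, which agrees with the 0.07121
err/unit in the listing to the three digits the outputs are printed
with.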
 264 \r
 265    A particular test pattern can be input to the network by giving the\r
 266 pattern at the prompt:\r
 267 \r
 268 [ACDFGMNPQTW?!acdefhlmopqrstw]? 1 0\r
 269        0.903 \r
 270 \r
 271 Examining Weights\r
 272 \r
 273    It is often interesting to see the values of some particular weights\r
 274 in the network.  To see a listing of all the weights in a network you\r
 275 can use the save weights command described later on and then list the\r
file containing the weights.  However, to see the weights leading into a
particular node, say unit 1 of layer 3, use the w command as in:
 278 \r
 279 [ACDFGMNPQTW?!acdefhlmopqrstw]? w 3 1\r
 280 \r
 281 layer unit  inuse  unit value    weight   inuse   input from unit\r
 282   1     1     1      1.00000     5.38258     1        5.38258\r
 283   1     2     1      0.00000    -4.86238     1        0.00000\r
 284   2     1     1      1.00000   -10.86713     1      -10.86710\r
 285   3     b     1      1.00000     7.71563     2        7.71563\r
 286                                               sum =   2.23111\r
 287 \r
 288 This listing also gives data on how the current activation value of the\r
 289 node is computed using the weights and the activations values of the\r
 290 nodes feeding into unit 1 of layer 3.  The `b' unit is the bias (also\r
called the threshold) unit.  The inuse column to the right of the unit
column is 1 when the unit is in use and 0 if it is not in use.  In the
inuse column to the right of the weight column, a 1 indicates a regular
weight in use and a 2 indicates a bias weight in use.  In this free
version there are no commands to take weights out of use.
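
   As a check on how the listing fits together, the following Python
sketch recomputes the sum and the unit's output from the printed
values.  The names are made up for this example, and it assumes the
output unit uses the standard sigmoid (the default shown on the A
screen).  Because the listing prints only five decimal places the
recomputed sum matches the printed 2.23111 only approximately.

```python
import math

# (unit value, weight) pairs copied from the "w 3 1" listing above.
INCOMING = [
    (1.00000, 5.38258),    # layer 1, unit 1
    (0.00000, -4.86238),   # layer 1, unit 2
    (1.00000, -10.86713),  # layer 2, unit 1
    (1.00000, 7.71563),    # bias unit, its value is always 1
]


def net_input(incoming):
    """Sum of unit value times weight over all incoming connections."""
    return sum(value * weight for value, weight in incoming)


def sigmoid(x):
    """Standard sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


NET = net_input(INCOMING)
OUTPUT = sigmoid(NET)

if __name__ == "__main__":
    print(round(NET, 5), round(OUTPUT, 3))
```

The output rounds to 0.903, the same value the network printed for the
pattern "1 0".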
 296 \r
 297    Besides saving weights you can save all the parameters to a file\r
 298 with the save everything command as in:\r
 299 \r
 300    se saved\r
 301 \r
 302 At the same time the weights will be written to the current weights\r
 303 file.  The file saved is virtually the same as the one you get with the\r
 304 '?' command.  To start over from where you left off you can use:\r
 305 \r
 306    bp saved\r
 307 \r
and this also reads in the patterns and weights.  The se command DOES
NOT save training and testing patterns since normally you would have
them in a file of their own.
 311 \r
 312    To get a short online tutorial on how to use the program you can type\r
 313 T at the command prompt and get the listing:\r
 314 \r
 315    A Tutorial\r
 316 \r
 317    The following topics are designed to be read in the order listed.\r
 318 \r
 319    To get help on a topic type the code on the right at the prompt.\r
 320 \r
 321    Understanding the Menus                                         h1\r
 322    Formatting Data for a Classification Problem                    h2\r
 323    Formatting Data for Function Approximation                      h3\r
 324    Formatting Data for a Recurrent Problem                         h4\r
 325    Making a Network for Classification or Function Approximation   h5\r
 326    Making a Recurrent Network                                      h6\r
 327    Reading the Data                                                h7\r
 328    Setting Algorithms and Parameters                               h8\r
 329    Running the Program                                             h9\r
 330    Saving Almost Everything                                        h10\r
 331    To Quit the Program                                             h11\r
 332 \r
 333    To end the program the `q' (for quit) command is entered:\r
 334 \r
 335 [ACDFGMNPQTW?!acdefhlmopqrstw]? q\r
 336 \r
 337 \r
 338 4. Basic Facilities\r
 339 -------------------\r
 340    There are a very large number of parameters that can be set for\r
 341 various algorithms in these programs.  Typing a `?'  will get a compact\r
 342 listing of them all however they are packed rather tight.  To get a\r
 343 better view of the parameters there are now many upper-case letter\r
 344 commands that give a listing of parameters in a less compact form.\r
 345 These screens list parameters, generally on the left of the screen in\r
 346 the form of the commands you would need to set them.  The center of the\r
 347 screen gives a short description of the parameter.  Sometimes one or two\r
 348 lines are inadequate to describe the command so at the far right there\r
 349 may be a sequence you can type to get more help with the command.\r
 350 \r
 351    The most important screen you can look at is the C for commands\r
 352 screen that summarizes what each menu screen will show:\r
 353 \r
 354 [ACDFGMNPQTW?!acdefhlmopqrstw]? C\r
 355 \r
 356 Screen     Includes Information and Parameters on:\r
 357 \r
 358   A        algorithm parameters and tolerance\r
 359   C        this listing of major command groups\r
 360   D        delta-bar-delta parameters\r
 361   F        formats: patterns, output, paging, copying screen i/o\r
 362   G        gradient descent (plain backpropagation)\r
 363   M        miscellaneous commands: shell escape, seed values, clear,\r
 364            clear and initialize, quit, kick a network, run command,\r
 365            save almost everything\r
 366   N        network building: making a network, initializing\r
 367            a network, kicking a network\r
 368   P        pattern commands: reading patterns, testing patterns,\r
 369   Q        quickprop parameters\r
 370   T        a short tutorial\r
 371   W        weight commands: listing, saving, restoring\r
 372   ?        a compact listing of everything\r
 373 \r
 374 One typical menu screen is the A screen that lists the main algorithm\r
 375 parameters:\r
 376 \r
 377 [ACDFGMNPQTW?!acdefhlmopqrstw]? A\r
 378 \r
 379 Algorithm Parameters\r
 380 \r
 381 a a <char>     sets all act. functions to <char>; {ls}             h aa\r
 382 a ah s         hidden layer(s) act. function; {ls}                 h aa\r
 383 a ao s         output layer act. function; {ls}                    h aa\r
 384 a d d          the output layer derivative term; {cdf}             h ad\r
 385 a i -          initializes units before using the training set; {+-}\r
 386 a u p          the weight update algorithm; {Ccdpq}                h au\r
 387 t 0.100        tolerance/unit for successful learning; (0..1)\r
 388 \r
 389 f O -          allows out-of-date statistics to print; {+-}\r
 390 f u -          compute up-to-date statistics; {+-}\r
 391 \r
 392 The first of these listings is the line:\r
 393 \r
 394 a a <char>     sets all act. functions to <char>; {ls}             h aa\r
 395 \r
 396 which doesn't give a parameter value but instead it gives the pattern of\r
 397 a command designed to set the activation function for the entire\r
 398 network.  The first sequence is:\r
 399 \r
 400 a a <char>\r
 401 \r
 402 and this sequence will change the activation function but when you type\r
 403 it in you will have to substitute a character code for the activation\r
 404 function instead of the string <char>.  One other activation function is\r
 405 the linear activation function denoted by the character l, so to get\r
 406 this function you can type in the line:\r
 407 \r
 408 a a l\r
 409 \r
The first `a' codes for the "algorithm" command, the second `a' codes
for the "activation function" and `l' is the letter for the function.  The
 412 idea of putting the variable portion of the command within the angle\r
 413 brackets (<>) is a notation devised by Computer Scientists to describe\r
 414 computer languages.  The word inside these brackets describes the kind\r
 415 of thing that is the variable portion of the command.\r
 416 \r
 417    The middle part of the line:\r
 418 \r
 419 a a <char>     sets all act. functions to <char>; {ls}             h aa\r
 420 \r
gives a short description of the meaning of the command and within the
curly brackets there is a listing of all the codes for the activation
functions, {ls}.  To get a more detailed explanation of the options type
the sequence on the right, `h aa', and the following comes up:
 427 \r
 428    a a <char> sets every activation function to <char>.\r
 429    a ah <char> sets the hidden layer activation function to <char>.\r
 430    a ao <char> sets the output layer activation function to <char>,\r
 431    <char> can be any of the following:\r
 432 \r
 433    <char>                  Function                           Range\r
 434 \r
 435       l    linear function, x                             (-inf..+inf)\r
 436       s    standard sigmoid, 1 / (1 + exp(-x))               (0..+1)\r
 437 \r
Here you get the code for the function, the function and the range of
values the function can take on.  This range portion follows another
standard of notation used by Mathematicians.  A ( or ) next to a number,
say 0, means the range runs very close to 0 but never exactly to 0.  In
other cases (not shown above) a [ or ] next to a number means the value
can range up to exactly the number.  Thus the range:

(0..+1]

means that the range can run from ALMOST EXACTLY 0 up to exactly +1.
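
   The two activation functions and their ranges can be written out as
a small Python sketch.  The dictionary keyed by the codes l and s is
just a convenient way to lay this out here and is an assumption, not a
description of how the program stores them.

```python
import math


def linear(x):
    """Code l: the linear function x, range (-inf..+inf)."""
    return x


def sigmoid(x):
    """Code s: the standard sigmoid 1 / (1 + exp(-x)), range (0..+1)."""
    # Note: math.exp overflows for very large negative x, so this simple
    # form is only meant for moderate inputs.
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical lookup table from the one-letter code to the function.
ACTIVATIONS = {"l": linear, "s": sigmoid}

if __name__ == "__main__":
    for x in (-10.0, 0.0, 10.0):
        print(x, ACTIVATIONS["l"](x), ACTIVATIONS["s"](x))
```

For moderate inputs the sigmoid gets close to 0 and 1 without reaching
them, which is what the round brackets in (0..+1) express.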
 448 \r
 449    If we now return to the A screen, the second line was:\r
 450 \r
 451 a ah s         hidden layer(s) act. function; {ls}                 h aa\r
 452 \r
 453 Here the idea is to indicate that the activation function for the hidden\r
 454 layer (or layers) of the network IS NOW the s (standard sigmoid\r
 455 function).  Again there is a short explanation of this, the set of codes\r
 456 for functions and information about how to get more help.  This line can\r
 457 also be taken as a direction as to how to set the hidden layer\r
 458 activation function as well.  To change it to l you can type in:\r
 459 \r
 460 a ah l\r
 461 \r
 462 (Note: normally you would only use the linear activation function in the\r
 463 output layer of a network.)\r
 464 \r
 465    The third line:\r
 466 \r
a ao s         output layer act. function; {ls}                    h aa

is similar except it states that the activation function for the output
layer is s.
 471 \r
 472 \r
 473 Paging\r
 474 \r
 475    In the student version the paging was a simple version of the System\r
 476 V utility, pg.  Now the paging is more like the common UNIX more\r
 477 command.  The default page size is 24 lines and it can be reset to\r
 478 another value with the format command's paging size sub-command.  For\r
 479 instance to get 12 lines / page instead of 24, use:\r
 480 \r
 481 f P 12\r
 482 \r
 483 To get no paging at all use:\r
 484 \r
 485 f P 0\r
 486 \r
 487 When the page is full you get the prompt:\r
 488 \r
 489 More?\r
 490 \r
 491 At this point you can type:\r
 492 \r
 493    q                  to quit viewing the text if you are in a loop,\r
 494    a blank            to get another page,\r
 495    ^D                 to get another half a page,\r
 496    a carriage return  to get one more line and\r
 497    c                  to continue without paging.\r
 498 \r
 499 Mostly paging is needed for loops within the program, like running a\r
 500 large number of iterations and printing the results, listing the values\r
 501 of all the patterns or listing weights leading into a particular unit.\r
 502 Typing the q quits these loops, however paging can also occur with some\r
 503 of the longer screen menus that are generating lines of output without\r
 504 running a loop.  For these cases the q does not work.\r
 505 \r
 506    Every new command entered from the keyboard sets the page counting\r
 507 variable to 0 however if input is being taken from a file other than\r
 508 stdin the counter is NOT reset.  Most of the time this doesn't matter\r
 509 since the little data files like the xor example used to set up\r
 510 parameters don't produce any output anyway, however if they do paging is\r
 511 in effect.  Having paging here is helpful in case there is a problem\r
 512 with reading the files.\r
 513 \r
 514 Interrupts\r
 515 \r
   In UNIX entering an interrupt will stop the current command and the
program will give the user another prompt.  With DOS entering a ctrl-C
will generate a similar kind of interrupt, but DOS only checks for
this condition when it has to do i/o.  However, when the DOS version is
in a training loop the program also checks to see if a key has been hit
and if that key is the escape key, the program will break the
training loop.
 523 \r
 524 Control Command\r
 525 \r
   One control-key command is available in this version: hitting ctrl-R
will run the training algorithm.  It is shorthand for typing r followed
by a carriage return.
 529 \r
 530 Passing Commands to the Operating System\r
 531 \r
 532    By using the '!' command you can pass commands to the operating\r
 533 system from within the program.  The kind of typical things you might\r
 534 want to do are to list the contents of a directory, list a file or after\r
 535 saving weights to a file you might want to list them or even edit them\r
 536 and read them back in.  Here is what you can say for DOS to list the\r
 537 little data file xor:\r
 538 \r
 539 ! type xor\r
 540 \r
 541 Once a string has been defined with a ! command it can be re-run simply\r
 542 by typing the ! followed immediately by a carriage return.\r
 543 \r
 544 Making a Copy of Your Session\r
 545 \r
   Sometimes you may want to make a copy of everything that you type in
 547 and the program prints out.  For instance you may get exceptionally good\r
 548 or bad results using a certain training sequence and an exact record of\r
 549 what you and the program did could be worth having.  Or you may need an\r
 550 exact copy of the training or testing set values.  Or you may need lots\r
 551 of runs where you average the results using another program.  To turn on\r
 552 the making of a copy use the format command to turn on the copying\r
 553 process:\r
 554 \r
 555 f c+\r
 556 \r
 557 and to turn it off use:\r
 558 \r
 559 f c-\r
 560 \r
 561 The text is written to the file copy.\r
 562 \r
 563 An Alphabetical Listing of the Commands\r
 564 \r
 565    The following listing is designed to give you an idea of the set of\r
 566 commands available.  Details are given in later sections.\r
 567 \r
 568 a <number>       sets the momentum parameter, alpha\r
 569 a <options>      the algorithm command\r
 570 c                clear the network\r
 571 ci               clear and initialize the network\r
 572 d <options>      set delta-bar-delta parameters\r
 573 e <number(s)>    set the learning rate eta\r
 574 f <options>      lots of formatting options\r
 575 h <string>       gives more help with certain options\r
 576 l <layer>        list the values of units on that layer\r
 577 m <numbers>      make a network\r
 578 o <number>       list the output targets of the training set pattern\r
 579 p <options>      list information about training patterns\r
 580 q                quit\r
 581 qp               set quickprop parameters\r
 582 r <options>      run the training algorithm\r
 583 rt <options>     read the training set patterns\r
 584 rw <filename>    read the weights\r
 585 rx <filename>    read the extra training set patterns\r
 586 s <seeds>        seed value\r
 587 sb <real>        set bias weights\r
 588 se <filename>    save almost everything\r
 589 sw <filename>    save weights\r
 590 t <options>      list testing file statistics of various sorts\r
 591 t <real>         tolerance per output unit that must be met\r
 592 tf <filename>    gives the file name with testing patterns\r
 593 tr <int>         special test for a recurrent network\r
 594 trp <int>        special test for a recurrent network\r
 595 w <layer> <unit> list weights leading into unit\r
 596 \r
 597 The Summary Line\r
 598 \r
   The default setting for the summaries produces up-to-date
statistics on the error and on how many patterns are correct.
 601 Here are several lines of summaries from a problem that has training\r
 602 data and test data for a classification problem:\r
 603 \r
 604    10      0.00 %  49.04 % 0.47087       0.00 %  62.50 % 0.38234 \r
 605    20      0.00 %  73.08 % 0.38584       0.00 %  77.88 % 0.38108 \r
 606    30      2.88 %  76.92 % 0.35043       4.81 %  79.81 % 0.33285 \r
 607 \r
 608 The first column is of course the number of iterations, the next column\r
 609 gives the percentage of training patterns that are correct based on the\r
 610 tolerance.  The next column gives the percentage of correct training\r
 611 patterns based on the maximum value of the output units.  The next\r
 612 column is the average absolute value of the error per output unit.  Note\r
 613 that many other programs will report the RMS error.  The columns on the\r
 614 right list the percentage of test set patterns that are correct based on\r
 615 tolerance, the percentage correct based on maximum value and finally the\r
 616 average error on the test set.\r
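
   The difference between the two "percent correct" columns can be
sketched in Python.  The exact rules the program applies are not spelled
out here, so two assumptions are made: a pattern is right by tolerance
when every output unit is strictly within the tolerance of its target,
and right by maximum value when the largest output unit is the same unit
as the largest target.  The sample outputs are made up for the example.

```python
def correct_by_tolerance(outputs, targets, tolerance=0.1):
    """Assumed rule: every output unit is within tolerance of its target."""
    return all(abs(t - o) < tolerance for o, t in zip(outputs, targets))


def correct_by_max(outputs, targets):
    """Assumed rule: the largest output is at the largest target's unit."""
    return outputs.index(max(outputs)) == targets.index(max(targets))


def summarize(patterns, tolerance=0.1):
    """Return (% right by tolerance, % right by max, average abs error)."""
    by_tol = sum(correct_by_tolerance(o, t, tolerance) for o, t in patterns)
    by_max = sum(correct_by_max(o, t) for o, t in patterns)
    total_error = sum(abs(t - o) for outs, tgts in patterns
                      for o, t in zip(outs, tgts))
    unit_count = sum(len(outs) for outs, _ in patterns)
    n = len(patterns)
    return 100.0 * by_tol / n, 100.0 * by_max / n, total_error / unit_count


# Made-up (outputs, targets) pairs for a two-class problem.
SAMPLE = [
    ([0.60, 0.30], [1, 0]),  # right by max value only
    ([0.20, 0.95], [0, 1]),  # right by max value only
    ([0.55, 0.50], [0, 1]),  # wrong both ways
    ([0.05, 0.97], [0, 1]),  # right both ways
]

PCT_TOL, PCT_MAX, AVG_ERR = summarize(SAMPLE)

if __name__ == "__main__":
    print(PCT_TOL, PCT_MAX, round(AVG_ERR, 5))
```

This shows how the tolerance column can sit near zero early in training
while the maximum value column is already fairly high, as in the summary
lines above.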
 617 \r
 618    Some CPU time can be saved by altering certain parameter settings\r
 619 that skip some of the forward passes used to determine the current set\r
 620 of statistics.  The more often you print the statistics the more time\r
 621 you can save by altering these parameter settings.  The penalty is that\r
the statistics will be out of date by one iteration with the quickprop,
 623 delta-bar-delta, supersab and regular periodic update methods and only\r
 624 approximate for both the right and wrong continuous update methods.\r
 625 \r
 626    In all the update methods the training set statistics are computed\r
 627 when the program passes back the error.  However then an update of the\r
 628 weights is done and these numbers are out of date.  So if 100 iterations\r
 629 have been done the program only has the statistics on iteration 99, the\r
 630 values that were true before the weights were changed.  When it is time\r
 631 to print out the program statistics the default is to do another forward\r
 632 pass through the training set to get up to date statistics.  This can be\r
 633 stopped by setting the off by 1 option in the format command like so:\r
 634 \r
 635 f O+\r
 636 \r
 637 The results that print out will now look like this:\r
 638 \r
 639    10 -1   0.00 %  64.42 % 0.42463       0.00 %  62.50 % 0.38234 \r
 640    20 -1   0.00 %  73.08 % 0.39058       0.00 %  77.88 % 0.38108 \r
 641    30 -1   8.65 %  75.00 % 0.35052       4.81 %  79.81 % 0.33285 \r
 642 \r
where the string "-1" comes right after the number of iterations done
and means the numbers shown are for the previous iteration.  For the
test set patterns a separate pass through the test set has to be made
to get statistics on them, so they are always up to date.  Most of the
time you will probably be more interested in the test set results than
in the training set results, so setting the off by 1 option saves a
little time and the off by 1 results on the training set are not
important.
 651 \r
 652    The situation when you use the "right" and "wrong" continuous update\r
 653 methods is even more complicated.  After the forward pass for one\r
 654 pattern is done it is checked to see if it is right and the error is\r
 655 added in to a sum of errors.  Then weights are changed.  Then another\r
 656 pattern is processed in the same way.  When the weight changes are done\r
 657 for this second pattern they may well ruin the right/wrong decision for\r
 658 all the previous patterns.  Thus the number of right and wrong patterns\r
 659 and the average error can be off by quite a lot.  With the off by 1\r
 660 option off ("f O-") the program still does a forward pass to get the\r
 661 up to date statistics on the training set.  However when the off by 1\r
 662 option is on the statistics look like this:\r
 663 \r
 664    10 -1   1.92 %  66.35 % 0.42093 ?    31.73 %  40.38 % 0.46316 \r
 665    20 -1  12.50 %  64.42 % 0.38829 ?    40.38 %  44.23 % 0.53761 \r
 666    30 -1  34.62 %  75.00 % 0.30939 ?    50.96 %  63.46 % 0.37728 \r
 667 \r
 668 where the ? after the training set error flags the fact that the numbers\r
 669 are very suspect.\r
 670 \r
 671    Another option in the program is to do an extra forward pass through\r
 672 the training set even when there are no statistics to print out.  The\r
 673 option to give you up to date statistics is:\r
 674 \r
 675 f u+\r
 676 \r
 677 If you are using the periodic update method, quickprop or dbd you don't\r
 678 need "f u+" as the program will report the correct values anyway.\r
 679 \r
 680    The one line form of the summary is the default but it can be turned\r
 681 off using:\r
 682 \r
 683 f s-\r
 684 \r
 685 With this you get nothing whatsoever and normally you won't want this\r
 686 unless perhaps you are producing your own customized output.\r
 687 \r
 688 5. The Format Command (f)\r
 689 -------------------------\r
 690    There are several ways to input and output patterns, numbers and\r
 691 other values and there is one format command, `f', that is used to set\r
 692 these options.  In the format command a number of options can be given\r
 693 on a single line as for example in:\r
 694 \r
 695 f b+ ir oc wB\r
 696 \r
 697 Input Patterns\r
 698 \r
 699    The programs are able to read pattern values in two different\r
 700 formats.  Real numbers follow the C language notation and must be\r
separated by a space.  The letter `H', used in recurrent networks, is
also allowed, as is the letter `x', which has a default value of 0.5.
The value of `x' can be changed; for example, to make `x' -1 use:
 705 \r
 706 f x -1\r
 707 \r
 708 Real input format is now the default but if you use the other format\r
 709 (a compressed binary format) you can re-set the format to real with:\r
 710 \r
 711 f ir\r
 712 \r
 713    The other format is the compressed format, a format consisting of 1s,\r
 714 0s and the letters `x' and `H'.  In compressed format each value is one\r
 715 character and it is not necessary to have blanks between the characters.\r
 716 For example, in compressed format the patterns for xor could be written\r
 717 out in either of the following ways:\r
 718 \r
 719 101      10 1\r
 720 000      00 0\r
 721 011      01 1\r
 722 110      11 0\r
 723 \r
 724 The second example is preferable because it makes it easier to see the\r
 725 input and the output patterns.\r
 726 \r
 727 To change to compressed format use:\r
 728 \r
 729 f ic\r
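
   The compressed input format described above can be sketched in Python
as follows.  This is only an illustration under stated assumptions: the
function name parse_compressed is hypothetical, `H' is simply passed
through as a marker (how the program later fills it in is not covered
here) and any other character is treated as an error.

```python
def parse_compressed(line, x_value=0.5):
    """Turn one compressed-format pattern line into a list of values.

    Hypothetical helper; x_value plays the role of the value set
    with the "f x" option.
    """
    values = []
    for ch in line:
        if ch in "01":
            values.append(float(ch))
        elif ch == "x":
            values.append(x_value)
        elif ch == "H":
            values.append("H")  # marker only, left for later processing
        elif ch.isspace():
            continue  # blanks are optional separators
        else:
            raise ValueError("unexpected character: %r" % ch)
    return values
```

With this sketch the two ways of writing an xor pattern, "101" and
"10 1", give the same list of three values.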
 730 \r
 731 Output of Patterns\r
 732 \r
 733    Output format is controlled with the `f' command as in:\r
 734 \r
 735 f or   * output node values using real (the C %f) format\r
 736 f oc   * output node values using compressed format\r
 737 f oa   * output node values using analog compressed format\r
 738 f oe   * output values with e notation\r
 739 \r
 740 The first sets the output to real numbers.  The second sets the output\r
 741 to be compressed mode where the value printed will be a `1' when the\r
 742 unit value is greater than 1.0 - tolerance, a `^' when the value is\r
 743 above 0.5 but less than 1.0 - tolerance, a `v' when the value is less\r
 744 than 0.5 but greater than the tolerance.  Below the tolerance value a\r
 745 `0' is printed.  The tolerance can be changed using the `t' command (not\r
 746 a part of the format command).  For example, to make all values greater\r
 747 than 0.8 print as `1' and all values less than 0.2 print as `0' use:\r
 748 \r
 749 t 0.2\r
 750 \r
 751 Of course this same tolerance value is also used to check to see if all\r
 752 the patterns have converged.  The third output format is meant to give\r
 753 "analog compressed" output.  In this format a `c' is printed when a\r
 754 value is close enough to its target value.  Otherwise, if the answer is\r
 755 close to 1, a `1' is printed, if the answer is close to 0, a `0' is\r
 756 printed, if the answer is above the target but not close to 1, a `^' is\r
 757 printed and if the answer is below the target but not close to 0, a `v'\r
 758 is printed.  This output format is designed for problems where the\r
 759 output is a real number, as for instance, when the problem is to make a\r
 760 network learn sin(x).  The format "e" writes out node values using\r
 761 exponential notation with four places to the right of the decimal point.\r
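
   The rule for the compressed ("f oc") output characters can be
written as a short Python sketch.  The function name compressed_char is
hypothetical, and the behavior exactly at the boundaries (for instance
a value of exactly 0.5) is not stated in the text above, so the choices
made here are assumptions.

```python
def compressed_char(value, tolerance=0.1):
    """Map one unit value to its compressed output character."""
    if value > 1.0 - tolerance:
        return "1"
    if value > 0.5:
        return "^"
    # Assumption: exactly 0.5 falls through to this test and prints `v'.
    if value > tolerance:
        return "v"
    return "0"
```

With the tolerance set by "t 0.2", a value of 0.85 prints as `1', 0.6
as `^', 0.3 as `v' and 0.1 as `0'.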
 762 \r
 763 Breaking up the Output Values\r
 764 \r
 765    In the compressed formats the default is to print a blank after every\r
 766 10 values.  This can be altered using the `B' (for inserting breaks)\r
 767 option within the format ('f') command.  The use for this command is to\r
 768 separate output values into logical groups to make the output more\r
 769 readable.  For instance, you may have 24 output units where it makes\r
 770 sense to insert blanks after the 4th, 7th and 19th positions.  To do\r
 771 this, specify:\r
 772 \r
 773 f B 4 7 19\r
 774 \r
 775 Then for example the output will look like:\r
 776 \r
 777   1 10^0 10^ ^000v00000v0 01000 e 0.17577\r
 778   2 1010 01v 0^0000v00000 ^1000 e 0.16341\r
 779   3 0101 10^ 00^00v00000v 00001 e 0.16887\r
 780   4 0100 0^0 000^00000v00 00^00 e 0.19880\r
 781 \r
 782 The break option allows up to 20 break positions to be specified.  The\r
 783 default output format is the real format with 10 numbers per line.  For\r
 784 the output of real values the option specifies when to print a carriage\r
 785 return rather than when to print a blank.\r
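
   For the compressed formats, the effect of the break positions can be
sketched in Python as below.  The function name insert_breaks is
hypothetical and the sketch covers only the blank-insertion case, not
the line breaks used for real output.

```python
def insert_breaks(chars, breaks):
    """Insert a blank after each 1-based position listed in breaks."""
    out = []
    for position, ch in enumerate(chars, start=1):
        out.append(ch)
        # No trailing blank is added after the final character.
        if position in breaks and position < len(chars):
            out.append(" ")
    return "".join(out)
```

Applied to the 24 output characters of the first example row with the
breaks 4, 7 and 19, this reproduces the grouping shown above.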
 786 \r
 787 Pattern Formats\r
 788 \r
 789    There are two different types of problems that back-propagation can\r
 790 handle, the general type of problem where every output unit can take on\r
 791 an arbitrary value and the classification type of problem where the goal\r
 792 is to turn on output unit i and turn off all the other output units when\r
 793 the pattern is of class i.  The xor problem is an example of the general\r
 794 type of problem.  For an example of a classification problem, suppose\r
 795 you have a number of data points scattered about through two-dimensional\r
 796 space and you have to classify the points as either class 1, class 2 or\r
 797 class 3.  For a pattern of class 1 you can always set up the output:\r
 798 "1 0 0", for class 2: "0 1 0" and for class 3: "0 0 1", however doing\r
 799 the translation to bit patterns can be annoying so another notation can\r
 800 be used.  Instead of specifying the bit patterns you can set the pattern\r
 801 format option to classification (as opposed to the default value of\r
 802 general) like so:\r
 803 \r
 804 f pc\r
 805 \r
 806 and then the program will read data in the form:\r
 807 \r
 808    1.33   3.61   1   *  shorthand for 1 0 0\r
 809    0.42  -2.30   2   *  shorthand for 0 1 0\r
 810   -0.31   4.30   3   *  shorthand for 0 0 1\r
 811 \r
 812 and translate it to the bit string form.  To switch to the general form\r
 813 use "f pg".  Another benefit of the classification format is that when\r
 814 the program outputs a status line it will also include the percentage of\r
 815 correct patterns based on the maximum value rather than just on\r
 816 tolerance.\r
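
   The translation that the classification format performs can be
illustrated with a small Python sketch.  The function name
expand_class_pattern is hypothetical, and treating everything after a
`*' as a comment is an assumption based on the examples in this manual.

```python
def expand_class_pattern(line, n_classes):
    """Split a classification-format line into inputs and bit targets."""
    text = line.split("*", 1)[0]  # assumption: `*' starts a comment
    fields = text.split()
    inputs = [float(v) for v in fields[:-1]]
    class_number = int(fields[-1])  # classes are numbered from 1
    targets = [1.0 if i == class_number else 0.0
               for i in range(1, n_classes + 1)]
    return inputs, targets
```

For the line "0.42  -2.30   2" with three classes this gives the inputs
0.42 and -2.30 and the targets 0 1 0.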
 817 \r
 818 \r
 819 Controlling Summaries\r
 820 \r
 821    When the program is learning patterns you normally want to have it\r
 822 print out the status of the learning process at regular intervals.  The\r
 823 default is to print out a one-line summary of how learning is going\r
 824 and this is set by using "f s+".  However if you want to customize\r
 825 exactly what is printed out and you don't want the standard summary, use\r
 826 "f s-".\r
 827 \r
 828 Skipping the "running . . ." Message\r
 829 \r
 830    Normally whenever you run more training iterations the message,\r
 831 "running . . ." prints out to reassure you that something is in fact\r
 832 being done, however this can also be annoying at times.  To get rid of\r
 833 this message use "f R-" and to bring it back use "f R+".\r
 834 \r
 835 Ringing the Bell\r
 836 \r
 837    To ring the bell when the learning has been completed use "f b+" and\r
 838 to turn off the bell use "f b-".\r
 839 \r
 840 Echoing Input\r
 841 \r
 842    When you are reading commands from a file it is sometimes worthwhile\r
 843 to see those commands echoed on the screen.  To do this, use "f e+" and\r
 844 to turn off the echoing, use "f e-".\r
 845 \r
 846 Paging\r
 847 \r
 848    To set the page size to some value, say, 25, use "f P 25" or to skip\r
 849 paging use "f P 0".\r
 850 \r
 851 Making a Copy of Your Session\r
 852 \r
 853    To make a copy of what appears on the screen use "f c+" to start\r
 854 writing to the file "copy" and "f c-" to stop writing to this file.\r
 855 Ending the session automatically closes this file as well.\r
 856 \r
 857 Up-To-Date Statistics\r
 858 \r
 859    During the ith pass thru the network the program will collect\r
 860 statistics on how many patterns are correct and how much error there is.\r
 861 It does this so that it will know when to stop the training.  But it\r
 862 gets these numbers BEFORE the weights are changed in the ith pass.  In\r
 863 the case of periodic update methods (the periodic, delta-bar-delta,\r
 864 quickprop and supersab) this is not much of a problem.  If the off by 1\r
 865 flag is off ("f O-") there is another forward pass done whenever the\r
 866 statistics are printed out so you get up to date statistics anyway.  If\r
 867 the off by 1 flag is on ("f O+") you get the string "-1" after the\r
 868 number of iterations is printed on the summary line.  Getting the\r
 869 statistics in the off by 1 form is harmless and it saves a little CPU\r
 870 time.  When the network converges the "-1" flag will not be shown.\r
 871 \r
 872    However with the continuous update methods the weights are changed\r
 873 after each pattern and this skews the statistics gathered by the\r
training process by quite a lot.  To get an accurate assessment of how
well the training is going when results are printed on the summary line
you either need to have the off by 1 flag off ("f O-") or you need to
set the up to date statistics flag by: "f u+".  The default is to leave
this flag off: "f u-".  Furthermore, if you are training to get an
accurate assessment of how many iterations it takes to learn the
training set you need to set "f u+" (NOT JUST "f O-"!).  The "u+"
setting makes a check after every complete pass through the training
set.  The "f O-" setting only makes a check when it is time to print
the status line.
 884 \r
 885 \r
 886 6. Taking Training and Testing Patterns from Files (rt,rx,tf)\r
 887 -------------------------------------------------------------\r
 888    In the xor example given above the four patterns were part of the\r
 889 data file and to read them in the following lines were used:\r
 890 \r
 891 rt {\r
 892 1 0 1\r
 893 0 0 0\r
 894 0 1 1\r
 895 1 1 0 }\r
 896 \r
 897 However it is also convenient to take patterns from a file that contains\r
 898 nothing but a list of patterns (and possibly comments).  To read a new\r
 899 set of patterns from some file, patterns, use:\r
 900 \r
 901 rt patterns\r
 902 \r
 903 To add an extra group of patterns to the current set you can use:\r
 904 \r
 905 rx patterns\r
 906 \r
 907 To read in test patterns from say the file, xtest, do the following:\r
 908 \r
 909 tf xtest\r
 910 \r
 911 To evaluate all the test patterns without listing them do "t0".  To list\r
 912 them, use "t".  To list one particular test pattern, say pattern 3, do\r
 913 "t 3".\r
 914 \r
 915 \r
7. Saving and Restoring Weights and Related Values (sw,rw,sw+,swe,swem)
-----------------------------------------------------------------------
 918    Sometimes the amount of time and effort needed to produce a set of\r
 919 weights to solve a problem is so great that it is more convenient to\r
 920 save the weights rather than constantly recalculate them.  To save the\r
 921 weights to the current weights file use "sw".  The weights are then\r
 922 written on a file called "weights" or to the last file name you have\r
 923 specified.  The weights file looks like:\r
 924 \r
 925 59r  m 2 1 1 x aahs aos bh 1.000000 bo 1.000000 Dh 1.000000 Do 1.000000  file = ../xor3.new\r
 926  8.926291e+000  1  1 1 to 2 1\r
 927 -7.945858e+000  1  1 2 to 2 1\r
 928  3.898432e+000  2  2 b to 2 1\r
 929  5.382575e+000  1  1 1 to 3 1\r
 930 -4.862383e+000  1  1 2 to 3 1\r
 931 -1.086713e+001  1  2 1 to 3 1\r
 932  7.715632e+000  2  3 b to 3 1\r
 933 \r
 934 To write the weights the program starts with the second layer, writes\r
 935 out the weights leading into these units in order with the threshold\r
 936 weight last, then it moves on to the third layer, and so on.  In\r
 937 addition to writing out the weights the second column lists whether or\r
 938 not the weights are in use.  If the weight is in use it is marked with a\r
 939 1, if it is a bias unit weight it is marked as 2 and if it is not in use\r
 940 it is marked with a 0.  This is not used in this free version.  The last\r
 941 4 numbers on each line tell which units the weights run between.  The\r
 942 first weight listed runs from layer 1 unit 1 to layer 2 unit 1.  The\r
letter b indicates the weight is a bias weight.  These last 4 values on
a line are ignored when the file is read, so if you want to make up
your own weights file you don't need to type them in; they are there
only for human convenience.  However, the in-use values must be present
if you write your own weights file, and you must put only one weight on
each line.
 949 \r
 950    To restore these weights type `rw' for restore weights.  At this time\r
 951 the program reads the header line and sets the total number of\r
 952 iterations the program has gone through to be the first number it finds\r
 953 on the header line.  It then reads the character immediately after the\r
 954 number.  The `r' indicates that the weights will be real numbers\r
 955 represented as character strings.\r
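
   To make the file layout concrete, here is a Python sketch of reading
the header line and one weight line in the way described above.  It is
an illustration only: the function names parse_header and
parse_weight_line are hypothetical and the real program may be stricter
or more lenient about spacing.

```python
import re


def parse_header(line):
    """Return (iteration count, format character) from the header line."""
    match = re.match(r"\s*(\d+)(\S)", line)
    if match is None:
        raise ValueError("header does not start with a number")
    # The rest of the header is only a record and is ignored here.
    return int(match.group(1)), match.group(2)


def parse_weight_line(line):
    """Return (weight, in-use flag); the trailing unit labels are ignored."""
    fields = line.split()
    return float(fields[0]), int(fields[1])
```

Because only the first two fields of a weight line are read, a
hand-written line such as "0.25 1" is enough for this sketch.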
 956 \r
 957    The remaining text on the first line of a weight file is not used by\r
 958 the restore weights command at this time and it is there to give you a\r
 959 record of what size and type the network was.  The fact that the rest of\r
 960 this line is not read by the restore weights program means that before\r
 961 you read in weights you have to make the proper size network with the\r
 962 "m" command.  The "m 2 1 1 x" of course means there are 2 units in the\r
 963 first layer, one in the second, one in the third and the x means there\r
 964 are extra connections from the input units to the output unit.\r
 965 Following that the initial command file that was read in is given.\r
 966 \r
 967    To save weights to a file other than "weights" you can say: "sw\r
 968 <filename>", where, of course, <filename> is the file you want to save\r
 969 to.  To continue saving to the same file you can just do "sw".  If you\r
 970 type "rw" to restore weights they will come from this current weights\r
 971 file as well.  You can restore weights from another file by using: "rw\r
 972 <filename>".  Of course this also sets the name of the file to write to\r
 973 so if you're not careful you could lose your original weights file.\r
 974 \r
 975 \r
 976 8. Initializing Weights (c,ci)\r
 977 ------------------------------\r
 978    All the weights in the network initially start out at 0 and they are\r
 979 also set to 0 by using the clear (c) command.  In some problems where\r
 980 all the weights are 0 the weight changes may cancel themselves out so\r
 981 that no learning takes place.  Moreover, in most problems the training\r
 982 process will usually converge faster if the weights start out with small\r
 983 random values.  To do this use the clear and initialize command as in:\r
 984 \r
 985 ci 0.5\r
 986 \r
 987 where the random initial weights will run from -0.5 to +0.5.  If the\r
 988 value is omitted the last range specified will be used.  The initial\r
 989 value is 1.\r
 990 \r
 991 \r
 992 9. The Seed Value (s)\r
 993 ---------------------\r
   The initial seed value is set to 0 and this value is as good as any
other.  However, networks often do not converge quickly, or at all,
with some sets of initial weights.  To get some other initial random
weights use the seed command as in:
 998 \r
 999 s 7\r
1000 \r
1001 where the seed is set to 7.  The seed value is of type unsigned.\r
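
   How the seed and the "ci" range work together can be sketched in
Python as below.  This is only an analogy: the function name
init_weights is hypothetical, and Python's random number generator is
not the one used by the bp program, so the actual weight values will
differ from what the simulator produces for the same seed.

```python
import random


def init_weights(n_weights, weight_range=1.0, seed=0):
    """Return n_weights random values in [-weight_range, +weight_range]."""
    rng = random.Random(seed)  # same seed, same sequence of weights
    return [rng.uniform(-weight_range, weight_range)
            for _ in range(n_weights)]
```

The point of the sketch is that repeating a run with the same seed
reproduces the same starting weights, while changing the seed gives a
different starting point that may converge better or worse.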
1002 \r
1003 \r
1004 10. The Algorithm Command (a)\r
1005 -----------------------------\r
1006    A number of different variations on the original back-propagation\r
1007 algorithm have been proposed in order to speed up convergence and some\r
1008 of these have been built into these simulators.  These options are set\r
1009 using the `a' command and a number of options can go on the one line.\r
1010 \r
1011 Activation Functions\r
1012 \r
1013    To set the activation functions use:\r
1014 \r
1015 a a <char>  * to set the activation function for all layers to <char>.\r
1016 a ah <char> * to set the hidden layer(s) function to <char>.\r
1017 a ao <char> * to set the output layer function to <char>.\r
1018 \r
1019 where <char> can be:\r
1020 \r
1021    l  for the linear activation function:  x\r
1022    s  for the traditional smooth activation function:\r
      1.0 / (1.0 + exp(-x))
1024 \r
1025    The s function is the standard smooth activation function originally\r
1026 used by researchers and it is still the most commonly used one.  In the\r
1027 bp program it is implemented by a table look-up (default) or if the\r
1028 compiler variable LOOKUP is undefined in the file ibp.h the regular\r
1029 time-consuming real valued calculations are done.\r
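
   The idea of replacing the calculation with a table look-up can be
sketched in Python as follows, using the usual logistic form
1.0 / (1.0 + exp(-x)).  The table range, the step size and the names
used here are assumptions chosen for illustration and are not taken
from ibp.h.

```python
import math

TABLE_MIN = -8.0   # assumed lower end of the table
TABLE_MAX = 8.0    # assumed upper end of the table
TABLE_STEP = 0.01  # assumed spacing between table entries


def sigmoid(x):
    """The standard smooth activation function, computed directly."""
    return 1.0 / (1.0 + math.exp(-x))


N_ENTRIES = int(round((TABLE_MAX - TABLE_MIN) / TABLE_STEP)) + 1
TABLE = [sigmoid(TABLE_MIN + i * TABLE_STEP) for i in range(N_ENTRIES)]


def sigmoid_lookup(x):
    """Approximate sigmoid(x) by the nearest precomputed table entry."""
    if x <= TABLE_MIN:
        return TABLE[0]
    if x >= TABLE_MAX:
        return TABLE[-1]
    return TABLE[int(round((x - TABLE_MIN) / TABLE_STEP))]
```

The look-up trades a small loss of accuracy for speed; with the step
size assumed here the difference from the direct calculation stays
below about 0.002.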
1030 \r
1031    The linear activation function gives networks only a very limited\r
1032 ability to learn patterns and it is therefore hardly ever used by itself\r
in a network.  However, it is often used in the output layer of networks
1034 with 3 or more layers so that the network can give output values beyond\r
1035 the range of the other activation functions.  For instance, suppose you\r
1036 need to train a network to compute some non-linear function but you need\r
1037 to produce outputs in the range -10 to 10.  The usual activation\r
1038 functions are restricted to the range 0 to 1 or -1 to 1 but you can\r
1039 choose a non-linear function for the network's hidden layers and with \r
1040 linear neurons in the output layer the network can produce values\r
1041 in the range -10 to 10.\r
1042 \r
1043 \r
1044 The Derivatives\r
1045 \r
   The correct derivative for the standard activation function is s(1-s)
where s is the activation value of a unit.  However, when s is near 0 or
1 this term gives only very small weight changes during the learning
process.  To counter this problem Fahlman proposed the following
derivative for the output layer:
1051 \r
1052 0.1 + s(1-s)\r
1053 \r
(For the original description of this method see "Faster-Learning
Variations on Back-Propagation:  An Empirical Study", by Scott E.
Fahlman, in Proceedings of the 1988 Connectionist Models Summer School,
Morgan Kaufmann, 1989.)
1058 \r
1059    Besides Fahlman's derivative and the original one the differential\r
1060 step size method (see "Stepsize Variation Methods for Accelerating the\r
1061 Back-Propagation Algorithm", by Chen and Mars, in IJCNN-90-WASH-DC,\r
1062 Lawrence Erlbaum, 1990) takes the derivative to be 1 in the layer going\r
1063 into the output units and uses the correct derivative term for all other\r
layers.  The learning rate for the inner layers, eta2, is normally set
to some smaller value.  To set a value for eta2 give two values in the
`e' command as in:
1067 \r
1068 e 0.1 0.01\r
1069 \r
1070 To set the derivative use the `a' command as in:\r
1071 \r
1072 a dc   * use the correct derivative for whatever function\r
1073 a dd   * use the differential step size derivative (default)\r
1074 a df   * use Fahlman's derivative in only the output layer\r
1075 a do   * use the original derivative (same as `c' above)\r
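
   The three derivative choices can be compared with a short Python
sketch for the standard smooth activation function.  The function name
derivative and its arguments are hypothetical, and the sketch only
restates the formulas given above rather than reproducing the program's
code.

```python
def derivative(s, kind="c", is_output_layer=False):
    """Derivative term for activation value s.

    kind is "c" (correct), "f" (Fahlman) or "d" (differential step
    size), mirroring the letters used with the `a' command.
    """
    correct = s * (1.0 - s)
    if kind == "f" and is_output_layer:
        return 0.1 + correct  # never smaller than 0.1
    if kind == "d" and is_output_layer:
        return 1.0            # derivative taken to be 1 at the output
    return correct            # all other layers use the correct term
```

For a saturated output unit with s = 1.0 the correct derivative is 0,
so no weight change results, while Fahlman's version still gives 0.1.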
1076 \r
1077 Update Methods\r
1078 \r
1079    The choices are the periodic (batch) method, the continuous (online)\r
1080 method, delta-bar-delta and quickprop.  The following commands set the\r
1081 update methods:\r
1082 \r
1083 a uC   * for the "right" continuous update method\r
1084 a uc   * for the "wrong" continuous update method\r
1085 a ud   * for the delta-bar-delta method\r
1086 a up   * for the original periodic update method (default)\r
1087 a uq   * for the quickprop algorithm\r
1088 \r
1089 \r
1090 11. The Delta-Bar-Delta Method (d)\r
1091 ----------------------------------\r
1092    The delta-bar-delta method attempts to find a learning rate, eta, for\r
1093 each individual weight.  The parameters are the initial value for the\r
1094 etas, the amount by which to increase an eta that seems to be too small,\r
1095 the rate at which to decrease an eta that is too large, a maximum value\r
1096 for each eta and a parameter used in keeping a running average of the\r
1097 slopes.  Here are examples of setting these parameters:\r
1098 \r
1099 d d 0.5    * sets the decay rate to 0.5\r
1100 d e 0.1    * sets the initial etas to 0.1\r
1101 d k 0.25   * sets the amount to increase etas by (kappa) to 0.25\r
1102 d m 10     * sets the maximum eta to 10\r
1103 d n 0.005  * an experimental noise parameter\r
1104 d t 0.7    * sets the history parameter, theta, to 0.7\r
1105 \r
1106 These settings can all be placed on one line:\r
1107 \r
1108 d d 0.5  e 0.1  k 0.25  m 10  t 0.7\r
1109 \r
1110 The version implemented here does not use momentum.  The symmetric\r
1111 versions sbp and srbp do not implement delta-bar-delta.\r
1112 \r
1113    The idea behind the delta-bar-delta method is to let the program find\r
1114 its own learning rate for each weight.  The `e' sub-command sets the\r
1115 initial value for each of these learning rates.  When the program sees\r
1116 that the slope of the error surface averages out to be in the same\r
1117 direction for several iterations for a particular weight the program\r
1118 increases the eta value by an amount, kappa, given by the `k' parameter.\r
1119 The network will then move down this slope faster.  When the program\r
1120 finds the slope changes signs the assumption is that the program has\r
1121 stepped over to the other side of the minimum and so it cuts down the\r
1122 learning rate by the decay factor given by the `d' parameter.  For\r
1123 instance, a d value of 0.5 cuts the learning rate for the weight in\r
1124 half.  The `m' parameter specifies the maximum allowable value for an\r
1125 eta.  The `t' parameter (theta) is used to compute a running average of\r
1126 the slope of the weight and must be in the range 0 <= t < 1.  The\r
1127 running average at iteration i, a[i], is defined as:\r
1128 \r
1129 a[i] = (1 - t) * slope[i] + t * a[i-1],\r
1130 \r
1131 so small values for t make the most recent slope more important than the\r
1132 previous average of the slope.  Determining the learning rate for\r
1133 back-propagation automatically is, of course, very desirable and this\r
1134 method often speeds up convergence by quite a lot.  Unfortunately, bad\r
1135 choices for the delta-bar-delta parameters give bad results and a lot of\r
1136 experimentation may be necessary.  If you have n patterns in the\r
1137 training set try starting e and k around 1/n.  The n parameter is an\r
1138 experimental noise term that is only used in the integer version.  It\r
1139 changes a weight in the wrong direction by the amount indicated when the\r
1140 previous weight change was 0 and the new weight change would be 0 and\r
1141 the slope is non-zero.  (I found this to be effective in an integer\r
1142 version of quickprop so I tossed it into delta-bar-delta as well.  If\r
1143 you find this helps please let me know.)  For more on delta-bar-delta\r
1144 see "Increased Rates of Convergence" by Robert A. Jacobs, in Neural\r
1145 Networks, Volume 1, Number 4, 1988.\r
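
   The update for a single weight can be sketched in Python as below.
This is a simplified reading of the description above, not the
program's code: the function name dbd_step is hypothetical, the slope
is taken to be the derivative of the error with respect to the weight,
and the noise parameter is left out.

```python
def dbd_step(weight, eta, avg, slope,
             kappa=0.25, decay=0.5, max_eta=10.0, theta=0.7):
    """Do one delta-bar-delta update and return (weight, eta, avg).

    avg is the running average of earlier slopes, a[i-1].
    """
    if avg * slope > 0:
        # Same direction as the average: raise eta by kappa, up to the max.
        eta = min(eta + kappa, max_eta)
    elif avg * slope < 0:
        # Sign change: assume the minimum was stepped over and cut eta.
        eta = eta * decay
    weight = weight - eta * slope
    # a[i] = (1 - t) * slope[i] + t * a[i-1]
    avg = (1.0 - theta) * slope + theta * avg
    return weight, eta, avg
```

Each weight keeps its own eta and its own running average, so in a full
network this step would be applied separately to every weight.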
1146 \r
1147 \r
1148 12. Quickprop (qp)\r
1149 ------------------\r
    Quickprop (see "Faster-Learning Variations on Back-Propagation: An
Empirical Study", by Scott E. Fahlman, in Proceedings of the 1988
Connectionist Models Summer School, Morgan Kaufmann, 1989 or ftp to
archive.cis.ohio-state.edu and look in the directory pub/neuroprose for
the file fahlman.quickprop-tr.ps.Z) may be one of the fastest network
training algorithms.  It is loosely based on Newton's method.
1156 \r
1157    The parameter mu is used to limit the size of the weight change to\r
1158 less than or equal to mu times the previous weight change.  Fahlman\r
1159 suggests mu = 1.75 is generally quite good so this is the initial value\r
1160 for mu but slightly larger or slightly smaller values are sometimes\r
1161 better.\r
1162 \r
1163    To get the process started quickprop makes the typical backprop\r
1164 weight change of - eta * slope.  I have found that a good value for the\r
1165 quickprop eta value is around 1 / n or 2 / n where n is the number of\r
1166 patterns in the training set.  Other sources often use much larger\r
1167 values.  In addition Fahlman uses this term at other times.  I had to\r
1168 wonder if this was a good idea so in this code I've included a\r
1169 capability to add it in or not add it in.  So far it seems to me that\r
1170 sometimes adding in this extra term helps and sometimes it doesn't.  The\r
1171 default is to use the extra term.\r
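
   A simplified Python sketch of the weight change for one weight is
given below.  It follows the general shape of Fahlman's method (a step
to the minimum of a parabola fitted through the last two slopes,
limited by mu) but it is an assumption-laden illustration, not this
program's code: the function name quickprop_step is hypothetical, the
slope is the derivative of the error with respect to the weight, the
extra term is added on every step when add_slope is true, and weight
decay and noise are left out.

```python
def quickprop_step(slope, prev_slope, prev_dw,
                   eta=0.5, mu=1.75, add_slope=True):
    """Return the next weight change for a single weight."""
    if prev_dw == 0.0:
        # Nothing to extrapolate from: plain backprop step.
        return -eta * slope
    shrink = mu / (1.0 + mu)
    if slope * prev_slope > 0 and abs(slope) > shrink * abs(prev_slope):
        # The parabola step would be too large or point uphill,
        # so take the largest allowed step, mu times the previous one.
        dw = mu * prev_dw
    elif slope == prev_slope:
        dw = 0.0  # only reached when both slopes are zero
    else:
        # Jump to the minimum of the fitted parabola.
        dw = prev_dw * slope / (prev_slope - slope)
    if add_slope:
        dw -= eta * slope  # the optional extra term
    return dw
```

On an exactly quadratic error surface such as E = w * w the parabola
step lands on the minimum in one move, which is why the method can be
so fast when the surface is close to quadratic.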
1172 \r
1173    Another factor involved in quickprop comes about from the fact that\r
1174 the weights often grow very large very quickly.  To minimize this\r
1175 problem there is a decay factor designed to keep the weights small.\r
1176 The weight decay is implemented by decreasing the value of the slope\r
1177 and it is different from the general weight decay that people use and\r
1178 which is also implemented in this software.  Fahlman recently mentioned\r
that he now does not use this unless the weights get very
1180 large.  I've found that too large a decay factor can stall\r
1181 out the learning process so that if your network isn't learning fast\r
1182 enough or isn't learning at all one possible fix is to decrease the\r
decay factor.  Note:  in the old free version the value of the weight
decay constant was the value you entered divided by 1000, in order to
allow small weight decay values in the integer version.  In this version
the problem is handled differently, so what you enter is exactly what
you get, not the value divided by 1000.
1188 \r
1189    I built in one additional feature for the integer version.  I found\r
1190 that by adding small amounts of noise the time to convergence can be\r
1191 brought down and the number of failures can be decreased somewhat.  This\r
1192 seems to be especially true when the weight changes get very small.  The\r
1193 noise consists of moving uphill in terms of error by a small amount when\r
1194 the previous weight change was zero.  Good values for the noise seem to\r
1195 be around 0.005.\r
1196 \r
1197    The parameters for quickprop are all set in the `qp' command like\r
1198 so:\r
1199 \r
1200 qp d <value>  * set the weight decay factor for all layers to <value>\r
1201 qp d h 0      * the default weight decay for hidden layer units\r
1202 qp d o 0.0001 * the default weight decay for output layer units\r
1203 qp e 0.5      * the default value for eta\r
1204 qp m 1.75     * the default value for mu\r
1205 qp n 0        * the default value for noise\r
1206 qp s+         * the default value is to always include the slope\r
1207 \r
1208 or a whole series can go on one line:\r
1209 \r
1210 qp d 0.1 e 0.5 m 1.75 n 0 s+\r
1211 \r
1212 \r
1213 13. Making a Network (m)\r
1214 ------------------------\r
1215    In the simplest form of the make a network command you type an `m'\r
1216 followed by the number of units in each layer as in:\r
1217 \r
1218 m 8 4 4 2\r
1219 \r
Most of the time this type of network is all you will ever need but
there are others that can be tried and which may sometimes work
1222 better.  One innovation that often speeds up learning is to include\r
1223 extra connections between the input and output layers.  To get this\r
1224 type of network you add an x to the end of the m command as in:\r
1225 \r
1226 m 8 4 2 x\r
1227 \r
1228 These extra connections are said to be important when the problem to\r
1229 be solved is almost linear and then the hidden layer units provide some\r
1230 extra corrections to the output neurons to distort the results from a\r
1231 purely linear model.\r
1232 \r
1233    In the student version every time you made a network all the training\r
1234 and testing patterns were thrown out because they were attached to the\r
1235 network.  (Not true in the pro version.)\r
1236 \r
1237    To make a recurrent network with 25 regular input units, twenty\r
1238 hidden layer units (that are copied to the input layer) and 25 output\r
1239 units use:\r
1240 \r
1241 m 25+20 20 25\r
1242 \r
1243 This means that the first layer will have 45 inputs and the first 25 are\r
1244 regular input values but the next 20 come from the first hidden layer.\r
1245 These 20 units are called the short term memory units.  Then there are\r
1246 20 units in the hidden layer.  This value should match the number of\r
1247 units given for the short term memory units.  At the moment there is no\r
1248 check to see that it does.  Finally there are 25 units in the output\r
1249 layer.  This recurrent network notation also requires a change in the\r
1250 way training and testing patterns are written down for input into the\r
1251 program.  For more on this see the next section.\r
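
   The layer sizes implied by the arguments of the `m' command can be
worked out with a small Python sketch.  The function name parse_make is
hypothetical and the sketch adds the consistency check that, as noted
above, the program itself does not make.

```python
def parse_make(args):
    """Return (layer sizes, short term memory units, extra connections)."""
    fields = args.split()
    extra = fields[-1] == "x"  # trailing x: input-to-output connections
    if extra:
        fields = fields[:-1]
    if "+" in fields[0]:
        regular, stm = (int(part) for part in fields[0].split("+"))
    else:
        regular, stm = int(fields[0]), 0
    sizes = [regular + stm] + [int(f) for f in fields[1:]]
    # Check added for illustration only: the copied units should match
    # the size of the first hidden layer.
    if stm and stm != sizes[1]:
        raise ValueError("short term memory units must match hidden layer")
    return sizes, stm, extra
```

For "25+20 20 25" this gives an input layer of 45 units, of which 20
are short term memory units, followed by layers of 20 and 25 units.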
1252 \r
1253 14. Recurrent Networks\r
1254 ----------------------\r
1255    Recurrent back-propagation networks take values from hidden layer\r
1256 and/or output layer units and copy them down to the input layer for use\r
1257 with the next input.  These values that are copied down are a kind of\r
1258 coded record of what the recent inputs to the network have been and this\r
1259 gives a network a simple kind of short-term memory, possibly a little\r
1260 like human short-term memory.  For instance, suppose you want a network\r
1261 to memorize the two short sequences, "acb" and "bcd".  In the middle of\r
1262 both of these sequences is the letter, `c'.  In the first case you want\r
1263 a network to take in `a' and output `c', then take in `c' and output\r
1264 `b'.  In the second case you want a network to take in `b' and output\r
1265 `c', then take in `c' and output `d'.  To do this a network needs a\r
1266 simple memory of what came before the `c'.\r
1267 \r
   Let the network be a 7-3-4 network where input units 1-4 and output
units 1-4 stand for the letters a-d.  In the diagrams below each `h'
stands for the value of a hidden layer unit and `stm' stands for the
three short-term memory input units.  The codes are:
1271 \r
1272 a: 1000\r
1273 b: 0100\r
1274 c: 0010\r
1275 d: 0001\r
1276 \r
In action, the network needs to do the following.  When `a' is input,
1278 `c' must be output:\r
1279 \r
1280    0010     <- output layer\r
1281 \r
1282    hhh      <- hidden layer\r
1283 \r
1284 1000 stm    <- input layer\r
1285 \r
1286 In this context, when `c' is input, `b' should be output:\r
1287 \r
1288    0100\r
1289 \r
1290    hhh\r
1291 \r
1292 0010 stm\r
1293 \r
1294 For the other string, when `b' is input, `c' is output:\r
1295 \r
1296    0010\r
1297 \r
1298    hhh\r
1299 \r
1300 0100 stm\r
1301 \r
and when `c' is input, `d' is output:
1303 \r
1304    0001\r
1305 \r
1306    hhh\r
1307 \r
1308 0010 stm\r
1309 \r
1310 This is easy to do if the network keeps a short-term memory of what its\r
1311 most recent inputs have been.  Suppose we input a and the output is c:\r
1312 \r
1313    0010     <- output layer\r
1314 \r
1315    hhh      <- hidden layer\r
1316 \r
1317 1000 stm    <- input layer\r
1318 \r
1319 Placing `a' on the input layer generates some kind of code (like a hash\r
1320 code) on the 3 units in the hidden layer.  On the other hand, placing\r
1321 `b' on the input units will generate a different code on the hidden\r
1322 units.  All we need to do is save these hidden unit codes and input them\r
1323 with a `c'.  In one case the network will output `b' and in the other\r
1324 case it will output `d'.  In one particular run inputting `a' produced:\r
1325 \r
1326      0  0  1  0\r
1327 \r
1328   0.993 0.973 0.020\r
1329 \r
1330  1  0  0  0  0  0  0\r
1331 \r
1332 When `c' is input the hidden layer units are copied down to input to\r
1333 give:\r
1334 \r
1335         0  1  0  0\r
1336 \r
1337     0.006 0.999 0.461\r
1338 \r
1339 0  0  1  0  0.993 0.973 0.020\r
1340 \r
1341 For the other pattern, inputting `b' gave:\r
1342 \r
1343     0  0  1  0\r
1344 \r
1345   0.986 0.870 0.020\r
1346 \r
1347 0  1  0  0  0  0  0\r
1348 \r
1349 Then the input of `c' gave:\r
1350 \r
1351           0  0  0  1\r
1352 \r
1353       0.005 0.999 0.264\r
1354 \r
1355 0  0  1  0  0.986 0.870 0.020\r
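
   The copy-down step shown in these traces can be sketched in Python as
follows.  This is only an illustration of the idea, not the simulator's
code: the function names are made up, bias weights are omitted, and the
short-term memory is assumed to start at zero, as in the traces above
where the first input is followed by 0 0 0.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(values, weights):
    """weights[j] holds the incoming weights of unit j (no bias, for brevity)."""
    return [sigmoid(sum(w * v for w, v in zip(row, values)))
            for row in weights]

def run_sequence(inputs, w_hidden, w_output):
    """Feed a sequence through the net, copying hidden values down each step.

    Returns one (full_input, hidden, output) tuple per input pattern.
    """
    n_hidden = len(w_hidden)
    stm = [0.0] * n_hidden           # memory is empty before the first input
    trace = []
    for x in inputs:
        full_input = list(x) + stm   # regular inputs followed by the memory
        hidden = layer(full_input, w_hidden)
        output = layer(hidden, w_output)
        trace.append((full_input, hidden, output))
        stm = hidden                 # copy the hidden values down for next time
    return trace
```

For the 7-3-4 example, w_hidden would have 3 rows of 7 weights and
w_output 4 rows of 3 weights.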
1356 \r
1357    This particular problem can be set up as follows:\r
1358 \r
1359 m 7 3 4\r
1360 s 7\r
1361 ci\r
1362 t 0.2\r
1363 rt {\r
1364 1000 H   0010\r
1365 0010 H   0100\r
1366 \r
1367 0100 H   0010\r
1368 0010 H   0001\r
1369 }\r
1370 \r
1371 where the first four values on each line are the normal input.  The H\r
1372 codes for however many hidden layer units there are.  The last four\r
1373 values are the desired outputs.\r
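
   If you have many sequences, pattern lines like these can be generated
rather than typed.  The short Python sketch below rebuilds the rt block
above from the two strings "acb" and "bcd"; the helper names are
hypothetical, and the blank line between sequences simply mirrors the
listing above.

```python
# one-of-four codes for the letters, as given earlier in this section
CODES = {"a": "1000", "b": "0100", "c": "0010", "d": "0001"}

def sequence_patterns(seq):
    """One 'input H target' line for each letter and the letter after it."""
    return [f"{CODES[cur]} H   {CODES[nxt]}"
            for cur, nxt in zip(seq, seq[1:])]

def rt_block(sequences):
    """Wrap the pattern lines of several sequences in an rt { ... } block."""
    groups = ["\n".join(sequence_patterns(s)) for s in sequences]
    return "rt {\n" + "\n\n".join(groups) + "\n}"

if __name__ == "__main__":
    print(rt_block(["acb", "bcd"]))
```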
1374 \r
1375    By the way, this simple problem does not converge particularly fast\r
1376 and you may need to do a number of runs before you hit on initial values\r
1377 that will work quickly.  It will work more reliably with more hidden\r
1378 units.\r
1379 \r
1380    Rather than using recurrent networks to memorize sequences of letters\r
1381 they are probably more useful at predicting the value of some variable\r
at time t+1 given its value at t, t-1, t-2, ... .  A very simple example
of this is to predict the value of sin(t+1) given a recent history of
inputs to the net.  Given a value of sin(t) the curve may be going up or
down and the
1385 net needs to keep track of this in order to correctly predict the next\r
1386 value.  The following setup will do this:\r
1387 \r
1388 m 1+5 5 1\r
1389 f ir\r
1390 a aol dd uq\r
1391 qp e 0.02\r
1392 ci\r
1393 rt {\r
1394    0.00000  H   0.15636\r
1395    0.15636  H   0.30887\r
1396    0.30887  H   0.45378\r
1397 \r
1398    . . .\r
1399 \r
1400   -0.15950  H  -0.00319\r
1401   -0.00319  H   0.15321\r
1402 }\r
1403 \r
1404 and in fact it converges rather rapidly.  The complete set of data can\r
1405 be found in the example file rsin.bp.\r
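
   The values shown above are consistent with sampling the sine curve
every 0.157 radians (sin(0.157) = 0.15636, sin(0.314) = 0.30887 and so
on).  The Python sketch below generates pattern lines on that
assumption.  The step size and the number of patterns are inferred from
the excerpt, not taken from rsin.bp, so treat the output as an
approximation of that file rather than a copy of it.

```python
import math

STEP = 0.157   # assumed: inferred from the first few values listed above
COUNT = 41     # assumed: enough patterns to reach the last line shown

def sine_patterns(step=STEP, count=COUNT):
    """Return (sin(t), sin(t + step)) pairs for t = 0, step, 2*step, ..."""
    return [(math.sin(k * step), math.sin((k + 1) * step))
            for k in range(count)]

def format_pattern(pair):
    """Lay a pair out as 'input  H  target', as in the listing above."""
    x, y = pair
    return f"{x:10.5f}  H {y:9.5f}"

if __name__ == "__main__":
    print("rt {")
    for pair in sine_patterns():
        print(format_pattern(pair))
    print("}")
```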
1406 \r
1407    Another recurrent network included in the examples is one designed to\r
1408 memorize two lines of poetry.  The two lines were:\r
1409 \r
1410    I the heir of all the ages in the foremost files of time\r
1411 \r
1412    For I doubt not through all the ages ones increasing purpose runs\r
1413 \r
but to make the problem simpler each word was shortened to at most 5
characters, giving:
1416 \r
1417    i the heir of all the ages in the frmst files of\r
1418 \r
1419    time for i doubt not thru the ages one incre purpo runs\r
1420 \r
1421 The letters were coded by taking the last 5 bits of their ASCII codes.\r
1422 See the file poetry.bp.  \r
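
   Taking the last 5 bits of an ASCII code amounts to masking with 31.
A minimal Python sketch of the letter coding follows.  The function
names are hypothetical, and how words shorter than 5 letters are padded
is not stated here, so the sketch leaves them unpadded.

```python
def five_bit_code(letter):
    """Low 5 bits of the letter's ASCII code, as a string of 0s and 1s."""
    return format(ord(letter) & 0b11111, "05b")

def word_code(word):
    """Codes for each letter of a word, separated by spaces (no padding)."""
    return " ".join(five_bit_code(ch) for ch in word)

if __name__ == "__main__":
    print(word_code("the"))
```

Since upper and lower case ASCII letters differ only in the sixth bit,
`a' and `A' get the same 5-bit code under this scheme.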
1423 \r
1424    Once upon a time I was wondering what would happen if the poetry\r
1425 network learned its verses and then the program was given several words\r
1426 in the middle of the verses.  Would it pick up the sequence and be able\r
to complete it given 1 or 2 or 3 or n words?  So given, for example, the
short sequence "for i doubt", will it be able to "get on track" and
finish the verse?  To test for this there is an extra pair of commands,
1430 tr and trp.  Given a test set (which should be the training set) they\r
1431 start at every possible place in the test set, input n words and then\r
1432 check to see if the net produces the right answer.  For this example I\r
1433 tried n = 3, 4, 5, 6 and 7 with the following results:\r
1434 \r
1435 [ACDFGMNPQTW?!acdefhlmopqrstw]? tr 3\r
1436  TOL:  81.82 %  ERROR: 0.022967\r
1437 [ACDFGMNPQTW?!acdefhlmopqrstw]? tr 4\r
1438  TOL:  90.48 %  ERROR: 0.005672\r
1439 [ACDFGMNPQTW?!acdefhlmopqrstw]? tr 5\r
1440  TOL:  90.00 %  ERROR: 0.005974\r
1441 [ACDFGMNPQTW?!acdefhlmopqrstw]? tr 6\r
1442  TOL: 100.00 %  ERROR: 0.004256\r
1443 [ACDFGMNPQTW?!acdefhlmopqrstw]? tr 7\r
1444  TOL: 100.00 %  ERROR: 0.004513\r
1445 \r
1446 So after getting just 3 words the program was 81.82% right in predicting\r
1447 the next word to within the desired tolerance.  Given 6 or 7 words it\r
1448 was getting them all right.  The trp command does the same thing except\r
1449 it also prints the final output value for each of the tests made.\r
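
   The kind of test that tr performs can be pictured with the Python
sketch below.  It is only an illustration under stated assumptions: the
class and function names are made up, and the "model" is a toy lookup
table that remembers the last two words, standing in for a trained
recurrent net.  It will not reproduce the figures above.

```python
WORDS = ("i the heir of all the ages in the frmst files of "
         "time for i doubt not thru the ages one incre purpo runs").split()

class PairMemoryModel:
    """Toy stand-in for a trained net: predicts from the last two words."""

    def __init__(self, sequence):
        self.table = {}
        for a, b, c in zip(sequence, sequence[1:], sequence[2:]):
            self.table[(a, b)] = c   # later occurrences overwrite earlier ones
        self.reset()

    def reset(self):
        """Clear the memory, as if no words had been seen yet."""
        self.prev = None
        self.last = None

    def step(self, word):
        """Take in one word and return the predicted next word (or None)."""
        self.prev, self.last = self.last, word
        return self.table.get((self.prev, self.last))

def windowed_score(model, sequence, n):
    """Fraction of start positions where, after n inputs, the model
    predicts the item that actually follows."""
    if n < 1 or n >= len(sequence):
        raise ValueError("n must be at least 1 and less than the sequence length")
    starts = range(len(sequence) - n)
    correct = 0
    for s in starts:
        model.reset()                       # start cold at every position
        for word in sequence[s:s + n]:
            prediction = model.step(word)
        if prediction == sequence[s + n]:
            correct += 1
    return correct / len(starts)

if __name__ == "__main__":
    model = PairMemoryModel(WORDS)
    for n in (1, 2, 3):
        print(n, windowed_score(model, WORDS, n))
```

The toy model can never be fully right because "the ages" is followed by
two different words in the verses, which is exactly the sort of
ambiguity that a longer run of input words helps a real network resolve.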
1450 \r
1451 \r
1452 15. Miscellaneous Commands\r
1453 --------------------------\r
1454    Below is a list of some miscellaneous commands, a short example of\r
1455 each and a short description of the command.\r
1456 \r
1457 \r
1458 !   Example: ! ls\r
1459 \r
1460 Anything after `!' will be passed on to the OS as a command to execute.\r
1461 An ! followed immediately by a carriage-return will repeat the last\r
1462 command sent to the OS.\r
1463 \r
1464 l   Example: l 2\r
1465 \r
1466 Entering "l 2" will print the values of the units on layer 2, or\r
1467 whatever layer is specified.\r
1468 \r
1469 sb  Example: sb -3\r
1470 \r
1471 Entering "sb -3" will set the bias unit weight to -3.  In the symmetric\r
1472 versions the weight will be frozen at this value while in the regular\r
1473 versions it will only be the initial value and should be set after the\r
1474 other weights are initialized.\r
1475 \r
1476 \r
1477 16. Limitations\r
1478 ---------------\r
1479    Weights in the ibp and sibp programs are 16-bit integer weights where\r
1480 the real value of the weight has been multiplied by 1024.  The integer\r
1481 versions cannot handle weights less than -32 or greater than 31.999.\r
1482 The weight changes are all checked for overflow but there are other\r
1483 places in these programs where calculations can possibly overflow as\r
1484 well and none of these places are checked.  Input values for the integer\r
1485 versions can run from -31.992 to 31.999.  Due to the method used to\r
1486 implement recurrent connections, input values in the real version are\r
1487 limited to -31992.0 and above.\r
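
   The weight limits quoted above follow directly from storing a weight
as a 16-bit integer scaled by 1024: the smallest value is -32768/1024 =
-32 and the largest is 32767/1024, about 31.999.  The Python sketch
below shows the arithmetic.  The names are hypothetical and rounding to
the nearest integer is an assumption; the integer programs may truncate
instead.

```python
SCALE = 1024                          # real weight * 1024 = stored integer
INT16_MIN, INT16_MAX = -32768, 32767  # range of a 16-bit signed integer

def to_fixed(weight):
    """Convert a real weight to its scaled 16-bit form, or raise on overflow."""
    scaled = round(weight * SCALE)    # assumed rounding rule
    if not INT16_MIN <= scaled <= INT16_MAX:
        raise OverflowError(f"weight {weight} does not fit in 16 bits")
    return scaled

def to_real(fixed):
    """Convert a stored integer weight back to its real value."""
    return fixed / SCALE

if __name__ == "__main__":
    print(to_real(INT16_MIN), to_real(INT16_MAX))
```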
1488 \r
1489 \r
1490 17. The Pro Version Additions\r
1491 -----------------------------\r
1492    This section lists the additions to the pro version at this time.\r
1493 For a more detailed and more up-to-date description see the online pro\r
1494 version manual at:\r
1495 \r
1496 http://www.mcs.com/~drt/probp.html\r
1497 \r
1498 The additional commands are:\r
1499 \r
1500 ac <units>      add a weight connection between the units\r
1501 ah <layer>      add a hidden unit to <layer>\r
1502 b               benchmarking\r
1503 i <filename>    read input from the file\r
1504 k <numbers>     give the network a kick\r
1505 n <options>     dynamic network building parameters\r
1506 ofu <unit>      turn off a unit\r
1507 onu <unit>      turn on a unit\r
1508 ofw <weight>    turn off a weight\r
1509 onw <weight>    turn on a weight\r
1510 pw <number>     prune weights\r
1511 rp              set rprop parameters\r
1512 s <seeds>       set multiple seed values\r
1513 ss <options>    set SuperSAB parameters\r
1514 swem <option>   save weights every minimum flag\r
1515 sw+             increment the weight file suffix\r
1516 to              overall tolerance to be met (not per pattern, as with t)\r
1517 u               the same as p but for recurrent classification problems\r
1518 v               the same as t but for recurrent classification problems\r
1519 \r
1520 Benchmarking allows you to make multiple runs of a problem and find the\r
1521 mean, standard deviation and average CPU time to converge.  You can also\r
1522 use it to average the outputs of multiple runs and thereby possibly get\r
1523 a better overall answer.\r
1524 \r
1525 You can make networks in a cascade type of architecture.  You can make\r
1526 a new network with a different number of hidden layer units without\r
1527 losing the training and testing patterns.  You can add hidden layer\r
1528 units as the network is trained.  You can turn on and off individual\r
1529 units.\r
1530 \r
1531 \r
The additional options are:
1533 \r
1534 a bh <value>     set the hidden layer bias unit value\r
1535 a bo <value>     set the output layer bias unit value\r
1536 a Dh <value>     set the hidden layer sharpness/gain\r
1537 a Do <value>     set the output layer sharpness/gain\r
1538 a wd <value>     weight decay\r
1539 f t <reals>      set target values for classification problems\r
1540 f wR             saves all weight parameters\r
1541 f wb             saves weights as binary\r
1542 f wB             saves all weight parameters as binary\r
1543 pm               print confusion matrix for training set\r
1544 tm               print confusion matrix for test set\r
1545 \r
1546 The activation functions available are:\r
1547 \r
1548 <char>                  Function                           Range\r
1549 \r
1550    a    an efficient approximation of t            [-0.96016..0.96016]\r
1551    g    Gaussian function, exp(-(D*x)**2)                 (0..+1]\r
1552    l    linear function, D*x                            (-inf..+inf)\r
1553    p    piecewise linear version of s                     [0..+1]\r
1554    s    standard sigmoid, 1 / (1 + exp(-D*x))             (0..+1)\r
1555    t    tanh(D*x)                                         (-1..+1)\r
1556    x    D * x / (1 + |D * x|)                             (-1..+1)\r
1557    y    (D * x / 2) / (1 + |D * x|) + 0.5                 (0..+1)\r
1558    z    (D*x)**2 for x >= 0 and -(D*x)**2 for x < 0     (-inf..+inf)\r
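
   For reference, the functions in the table that are given by an
explicit formula can be written out in Python as below, with D as the
sharpness/gain.  This is a sketch for checking values by hand, not the
program's own code; the function name make_activations is made up, and
the `a' and `p' entries are left out because the table does not give
their formulas.

```python
import math

def make_activations(D=1.0):
    """Return the table's closed-form activation functions for gain D."""
    return {
        "g": lambda x: math.exp(-(D * x) ** 2),                   # Gaussian
        "l": lambda x: D * x,                                     # linear
        "s": lambda x: 1.0 / (1.0 + math.exp(-D * x)),            # sigmoid
        "t": lambda x: math.tanh(D * x),                          # tanh
        "x": lambda x: D * x / (1.0 + abs(D * x)),
        "y": lambda x: (D * x / 2.0) / (1.0 + abs(D * x)) + 0.5,
        "z": lambda x: (D * x) ** 2 if x >= 0 else -((D * x) ** 2),
    }

if __name__ == "__main__":
    acts = make_activations(D=1.0)
    for name in sorted(acts):
        print(name, acts[name](0.5))
```

Note that math.exp raises OverflowError for very large arguments, so the
`s' entry as written here is only safe for moderate values of D*x.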