                      morph-indux
   A tool for matching lemmas from attested word forms
   in corpora to Apertium paradigms.
-----------------------------------------------------------------
Description:

This is a cut-down version of Markus Forsberg's 'Extract' tool.

It is used for matching lemmas to Apertium paradigms.

The program first loads the paradigms (and their stems, i.e. the
endings attached to a root) out of a .dix file, then loads a
wordlist file into a Trie. It iterates through each of the stems
in each of the paradigms and retrieves candidate lemmas from the
Trie.

After building a list of candidate lemmas, it counts, for each
candidate, how many of the paradigm's stems were found in the
wordlist, and outputs the list.
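
The steps above can be sketched in Python. This is only an
illustration, not the tool's actual code: the function and class
names are made up, the .dix content is a tiny inline stand-in, and
storing the words reversed in the trie (so that an ending becomes a
prefix lookup) is an assumption about how candidates are retrieved.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny stand-in for a .dix file: one paradigm with seven endings.
ENDINGS = ["ga", "gama", "ge", "go", "gom", "gu", "zi"]
DIX = (
    '<dictionary><pardefs><pardef n="knji/ga__n">'
    + "".join(f'<e><p><l>{e}</l><r>ga<s n="n"/></r></p></e>' for e in ENDINGS)
    + "</pardef></pardefs></dictionary>"
)


def load_paradigms(dix_text):
    """Map each pardef name to the set of surface endings in its <l> elements."""
    paradigms = {}
    for pardef in ET.fromstring(dix_text).iter("pardef"):
        paradigms[pardef.get("n")] = {
            "".join(left.itertext()) for left in pardef.iter("l")
        }
    return paradigms


class Trie:
    """Minimal trie built from nested dicts; the key None marks end of word."""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True

    def with_prefix(self, prefix):
        """Yield every stored word that starts with the given prefix."""
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            for key, child in current.items():
                if key is None:
                    yield text
                else:
                    stack.append((child, text + key))


def candidate_roots(words, paradigms):
    """Return {(paradigm, root): set of endings attested in the wordlist}."""
    trie = Trie()
    for word in words:
        trie.insert(word[::-1])  # reversed, so endings become prefixes
    found = defaultdict(set)
    for name, endings in paradigms.items():
        for ending in endings:
            for reversed_word in trie.with_prefix(ending[::-1]):
                word = reversed_word[::-1]
                root = word[: len(word) - len(ending)]
                if root:  # skip words that consist of the ending alone
                    found[(name, root)].add(ending)
    return found


WORDS = ["knjiga", "knjigama", "knjigo", "knjigu", "knjizi"]
found = candidate_roots(WORDS, load_paradigms(DIX))
for (name, root), attested in sorted(found.items()):
    print(name, root, sorted(attested))
# prints: knji/ga__n knji ['ga', 'gama', 'go', 'gu', 'zi']
```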

For example:

A wordlist file has the following words:

 knjiga, knjigama, knjigo, knjigu, knjizi

And the paradigm for 'knji/ga__n' in the Apertium dictionary has
the following stems (compressed format for readability):

 [knji/ga__n] (7) [ ga gama ge go gom gu zi ]

Five of the seven stems (ga, gama, go, gu, zi) are attested, so
the score for the root 'knji' will be 5/7 = 0.71
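
The arithmetic in this example can be reproduced with a few lines of
Python. This is a minimal sketch: the function name paradigm_score
is hypothetical, and a plain set stands in for the Trie.

```python
def paradigm_score(root, endings, wordlist):
    """Fraction of a paradigm's endings attested for this root."""
    attested = [e for e in endings if root + e in wordlist]
    return len(attested) / len(endings)


ENDINGS = ["ga", "gama", "ge", "go", "gom", "gu", "zi"]
WORDS = {"knjiga", "knjigama", "knjigo", "knjigu", "knjizi"}

score = paradigm_score("knji", ENDINGS, WORDS)
print(f"{score:.2f}")  # prints 0.71
```

The two missing forms, 'knjige' and 'knjigom', are what keep the
score below 1.0.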

Usage:

$ ./morph-indux -t <threshold> [dictionary] [wordlist]

-----------------------------------------------------------------
Dependencies:

* libxml2

-----------------------------------------------------------------
TODO:

* Perhaps a feature to increase the score of matches when stems
  are long, e.g. a '-hoero' match scores higher than a '-ro'
  match for Tajik.
* Perhaps weight stems with 'more info' (e.g. pl, dom, indef)
  higher than those with less (e.g. just pl).
  ~ these two could be related (longer stem = more info?)
* The Trie needs fixing -- too much mangling when presented with
  long lists.
* Add a feature to deal with non-lists (e.g. web pages etc.)
* File handling needs to be fixed to allow input from stdin
  ~ e.g. cat /tmp/foo.html | apertium-dehtml |
         morph-indux -t 0.9 apertium-en-ca.en.dix
* Perhaps have a feature that works on lemmas -- e.g. take a list
  of words, generate all their possible forms, then look for
  these, either in a corpus or on the internet.
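
One way the first TODO item might look, purely as a sketch: weight
each stem by its length, so that a long stem counts for more than a
short one. Nothing like this exists in the tool; the function name,
the weighting scheme and the placeholder root 'abc' are all
assumptions made for illustration.

```python
def length_weighted_score(root, endings, wordlist):
    """Like the plain score, but each ending counts in proportion to its length."""
    total = sum(len(e) for e in endings)
    attested = sum(len(e) for e in endings if root + e in wordlist)
    return attested / total if total else 0.0


# Hypothetical two-ending paradigm using the stems from the TODO item.
ENDINGS = ["ro", "hoero"]

long_match = length_weighted_score("abc", ENDINGS, {"abchoero"})
short_match = length_weighted_score("abc", ENDINGS, {"abcro"})
print(long_match > short_match)  # prints True
```

With the unweighted score both matches would come out at 1/2; here
the '-hoero' match gets 5/7 and the '-ro' match 2/7.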

DONE:

* Stop it from crashing when presented with DOS files (CRLF line
  endings)

-----------------------------------------------------------------
Issues:

* Contaminated input, e.g. 'knjigih', 'knjigoj' etc.