A tool for matching lemmas from attested word forms
in corpora to Apertium paradigms.
-----------------------------------------------------------------
This is a quick-and-dirty version of Markus Forsberg's 'Extract' tool.

It is used for matching lemmas to Apertium paradigms.

The program first loads the paradigms (and their stems) out of
a .dix file, then loads a wordlist file into a trie. It iterates
through each of the stems in each of the paradigms and retrieves
candidate lemmas from the trie.

After building a list of candidate lemmas, it counts how many of
each paradigm's stems were found for each candidate and outputs
the list.
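
The lookup described above could be sketched roughly as follows. This
is an illustration in Python, not the tool's actual code: the Trie
class, candidate_roots() and count_hits() are hypothetical names, and
the words and stems are the 'knji/ga__n' sample data used in this
README.

```python
class Trie:
    """Minimal character trie over word forms (illustrative only)."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def __contains__(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def words(self, prefix=""):
        """Yield every stored word, depth first."""
        if self.is_word:
            yield prefix
        for ch, child in sorted(self.children.items()):
            yield from child.words(prefix + ch)


def candidate_roots(trie, stems):
    """Roots obtained by stripping any (non-empty) paradigm stem from a word."""
    roots = set()
    for word in trie.words():
        for stem in stems:
            if stem and word.endswith(stem) and len(word) > len(stem):
                roots.add(word[: len(word) - len(stem)])
    return roots


def count_hits(trie, root, stems):
    """Number of paradigm stems for which root + stem is in the wordlist."""
    return sum(1 for stem in stems if root + stem in trie)


# Sample data from this README (assumed to be one word per entry).
wordlist = ["knjiga", "knjigama", "knjigo", "knjigu", "knjizi"]
stems = ["ga", "gama", "ge", "go", "gom", "gu", "zi"]

trie = Trie()
for w in wordlist:
    trie.insert(w)

roots = candidate_roots(trie, stems)
for root in sorted(roots):
    print(root, count_hits(trie, root, stems), "of", len(stems))
```

A plain set of words would answer the same membership questions; the
trie is used here only because the description above mentions one.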
For example, suppose a wordlist file contains the following words:

  knjiga, knjigama, knjigo, knjigu, knjizi

and the paradigm for 'knji/ga__n' in the Apertium dictionary has
the following stems (compressed format for readability):

  [knji/ga__n] (7) [ ga gama ge go gom gu zi ]

Five of the seven stems are attested with the root 'knji' (only
'knjige' and 'knjigom' are missing), so the score for the root
'knji' will be 5/7 = 0.71.
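
The arithmetic of the example can be checked with a few lines of
Python. This is only a sketch: paradigm_score() is a hypothetical
helper, and treating the -t threshold as a minimum score that a
candidate must reach is an assumption, not something this README
states.

```python
def paradigm_score(root, stems, attested):
    """Return (score, hits): the fraction of stems attested with this root."""
    hits = [root + stem for stem in stems if root + stem in attested]
    return len(hits) / len(stems), hits


attested = {"knjiga", "knjigama", "knjigo", "knjigu", "knjizi"}
stems = ["ga", "gama", "ge", "go", "gom", "gu", "zi"]

score, hits = paradigm_score("knji", stems, attested)
print(f"{len(hits)}/{len(stems)} = {score:.2f}")

# Assumption: a candidate is kept when its score reaches the threshold.
threshold = 0.7
keep = score >= threshold
```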
  $ ./morph-indux -t <threshold> [dictionary] [wordlist]

----------------------------------------------------------------

----------------------------------------------------------------
* Perhaps a feature to increase the score of matches when stems
  are long, e.g. a '-hoero' match scores higher than a '-ro' match.

* Perhaps weight stems that carry more information (e.g. 'pl, dom,
  indef') higher than stems that carry less (e.g. 'pl').
  ~ these two could be related (longer stem = more info?)
* Trie needs fixing -- too much mangling when presented with
* Add a feature to deal with input that is not a word list (e.g.
  web pages).
* File handling needs to be fixed to allow input from stdin
  ~ e.g. cat /tmp/foo.html | apertium-deshtml |
         morph-indux -t 0.9 apertium-en-ca.en.dix
* Perhaps have a feature that works on lemmas: given a list of
  words, it generates all the possible forms and then looks for
  them, either in a corpus or on the internet.
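
The last idea in the list above (generate every form of a lemma, then
look for those forms) might look something like this. It is a sketch
only: generate_forms() and attested_forms() are hypothetical names,
and the token list is toy data standing in for a real corpus.

```python
def generate_forms(root, stems):
    """All surface forms a paradigm predicts for this root."""
    return [root + stem for stem in stems]


def attested_forms(forms, corpus_tokens):
    """Subset of the predicted forms that actually occur in the corpus."""
    seen = set(corpus_tokens)
    return [form for form in forms if form in seen]


stems = ["ga", "gama", "ge", "go", "gom", "gu", "zi"]
forms = generate_forms("knji", stems)

# Toy token list (assumption); a real run would tokenise a corpus.
corpus_tokens = ["knjiga", "i", "knjigu", "knjiga"]
found = attested_forms(forms, corpus_tokens)
print(found)
```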
* Stop it from crashing when presented with DOS-format files
  (CRLF line endings).
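
One possible way to make the wordlist reader tolerate DOS line endings
is to strip the trailing carriage return along with the newline. The
sketch below assumes one word per line; read_wordlist() is a
hypothetical name and not a function of the tool.

```python
import io


def read_wordlist(stream):
    """Read one word per line, tolerating both LF and CRLF line endings."""
    words = []
    for line in stream:
        word = line.rstrip("\r\n").strip()
        if word:  # skip blank lines
            words.append(word)
    return words


# In-memory stand-in for a DOS-format wordlist file.
dos_file = io.StringIO("knjiga\r\nknjigama\r\n\r\nknjigo\r\n")
words = read_wordlist(dos_file)
print(words)
```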
* Contaminated input, e.g. 'knjigih', 'knjigoj' etc.