apertium-oc-es/README

TRANSLATOR

You need apertium and lttoolbox, either version 1.0 or 2.0, to use
this language-pair package with Apertium. To compile the linguistic
data, run:

$ ./configure

to generate a Makefile, and then

$ make

inside this directory.
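
Before running ./configure, it can help to confirm that the
prerequisite tools are on your PATH. The sketch below is only an
illustration: the helper name missing_tools is hypothetical, and it
assumes that the apertium package installs a command called "apertium"
and that lttoolbox installs "lt-comp". Adjust the names if your
installation differs.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of Apertium.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names: apertium (from apertium), lt-comp (from lttoolbox).
missing=$(missing_tools apertium lt-comp)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All prerequisite tools found; run ./configure and then make."
fi
```

The check only tells you whether the commands exist; it does not verify
which version is installed.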

TAGGER

To use this language-pair package with Apertium YOU DO NOT NEED TO
RETRAIN THE TAGGER. Probabilities and auxiliary data are provided for
both the oc-ca and the ca-oc translation directions. They should be
acceptable for most applications, and should still work if you change
the dictionaries in a reasonable way.

If for some reason you need to retrain the tagger (for example, you
have made really extensive changes to the dictionaries, such as
creating new lexical categories), you have three alternatives:

* To perform a supervised training:

  For this you need the files specified in the README file inside
  oc-tagger-data and ca-tagger-data, which are not provided. When
  performing a supervised training, the tagged corpora
  (oc-tagger-data/oc.tagged and ca-tagger-data/ca.tagged) could be
  obsolete for some words. If this is the case, the tagger training
  program will show you where the problems are and you will need to
  solve them by hand. Be sure to solve the problems by modifying ONLY
  the .tagged file, NEVER the .untagged file, which is automatically
  generated.

  The supervised training is done by typing:
  make -f oc-ca-supervised.make (for the Occitan part-of-speech tagger)
  make -f ca-oc-supervised.make (for the Catalan part-of-speech tagger)

  This is the training method followed to train the Catalan
  part-of-speech tagger.

* To perform a classical (expectation-maximization) unsupervised training:

  For this purpose you will need to assemble a large (hundreds of
  thousands of words) plain-text corpus for each language (for example,
  using a robot to harvest text from online newspapers) and put them in
  the proper place, for instance oc-tagger-data/oc.crp.txt and
  ca-tagger-data/ca.crp.txt. This type of training does not need human
  intervention but, as expected, results will be less adequate than
  those obtained with the supervised training.

  The unsupervised training is done through the iterative Baum-Welch
  algorithm. By default the number of iterations is set to 8, but you
  can change this value by editing the Makefile and changing the
  value of TAGGER_UNSUPERVISED_ITERATIONS.

  The unsupervised training is done by typing:
  make -f oc-ca-unsupervised.make (for the Occitan part-of-speech tagger)
  make -f ca-oc-unsupervised.make (for the Catalan part-of-speech tagger)

* To perform an unsupervised training by using target-language
  information and the rest of the modules of the Apertium MT engine:

  To do so you need large plain-text corpora in both languages. Please
  download the apertium-tagger-training-tools package and follow the
  instructions provided there. This is the training method followed to
  train the Occitan part-of-speech tagger.
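
For the classical unsupervised training described above, two
preparatory steps lend themselves to a small script: checking that a
corpus is large enough, and changing the iteration count. This is a
minimal sketch, not part of the package. The helper names corpus_words
and set_iterations are hypothetical, the 100000-word threshold in the
usage comment is only one reading of "hundreds of thousands of words",
and the script assumes that the Makefile defines the variable on a
single line such as TAGGER_UNSUPERVISED_ITERATIONS=8 (optionally with
spaces around the equals sign).

```shell
#!/bin/sh
# Count the whitespace-separated words in a plain-text corpus file.
corpus_words() {
    wc -w < "$1" | tr -d ' '
}

# Rewrite the TAGGER_UNSUPERVISED_ITERATIONS assignment in a Makefile.
# $1 = Makefile path, $2 = new number of Baum-Welch iterations.
# Assumes a one-line assignment such as "TAGGER_UNSUPERVISED_ITERATIONS=8".
set_iterations() {
    tmp="$1.tmp.$$"
    sed "s/^\(TAGGER_UNSUPERVISED_ITERATIONS[[:space:]]*=[[:space:]]*\).*/\1$2/" "$1" > "$tmp" &&
        mv "$tmp" "$1"
}

# Example usage (paths taken from this README; uncomment to run):
# [ "$(corpus_words oc-tagger-data/oc.crp.txt)" -ge 100000 ] || echo "corpus may be too small"
# set_iterations Makefile 12
```

Writing to a temporary file and then moving it into place avoids
relying on in-place editing options, which differ between sed
implementations. If ./configure regenerates the Makefile, the change
would need to be applied again.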