runtime/spell/README_tn.txt

Version: 20040329
tn,ZA,tn_ZA,Setswana (Africa),tn_ZA.zip


README for Setswana MySpell dictionary
======================================

This MySpell spell checker was created from the Aspell spell checker and
its wordlist, which are released under the GPL.

1. Copyright
2. Installation and setup
3. Contributing
4. Note on the construction of the wordlist


1. Copyright
------------

Setswana wordlist.in:
Copyright 2004 Kevin P. Scannell <scannell@slu.edu> and
               Thapelo Otlogetswe <Thapelo.Otlogetswe@itri.brighton.ac.uk>

Porting to MySpell and other MySpell specifics:
Copyright 2004 Zuza Software Foundation <info@translate.org.za>

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

2. Installation and setup
-------------------------

Automated
---------
http://lingucomponent.openoffice.org/download_dictionary.html

Use the DicOOo.sxw file to step you through an automatic install process.  If
the Setswana spellchecker is not available online then download the offline
pack from:
http://translate.sourceforge.net/


Non-automated
-------------
For instructions on how to install the Setswana dictionary please visit the
following URL.

http://lingucomponent.openoffice.org/download_dictionary.html#installspell

Spellchecker Selection
----------------------
NOTE: OpenOffice.org does not yet recognise Setswana as a language (this is
expected to change shortly), so for now the dictionary is mapped to Italian.
To use it, make sure the Italian module is enabled:

Tools -> Options -> Language Settings -> Writing Aids

Available language modules -> Edit -> Select Italian -> Ensure it is enabled



3. Contributing
---------------

You can help to make this software better.

If you find errors in the spellchecker or have wordlists that you would like to
contribute to the spellchecker then contact
Dwayne Bailey <dwayne@translate.org.za>

If you would like to assist Kevin Scannell with the automated web crawler then
please read the next section and offer your assistance.


4. Note on the construction of the wordlist
-------------------------------------------

Note: taken from the Aspell package (doc/Crawler.tct) for your information

NOTES ON THE CONSTRUCTION OF THE WORD LIST
   A preliminary version of this spell checking dictionary was assembled
with the help of my web crawler "An Crúbadán":

  http://borel.slu.edu/crubadan/

BUILDING TEXT CORPORA FOR MINORITY LANGUAGES
Initially a small collection of "seed" texts is fed to the crawler
(a few hundred words of running text have been sufficient in practice).
Queries combining words from these texts are generated and passed to
the Google API, which returns a list of documents potentially written
in the target language.  These are downloaded, processed into plain text,
and formatted.  A combination of statistical techniques bootstrapped from
the initial seed texts (and refined as more texts are added to the database)
is used to determine which documents (or sections thereof) are written in
the target language.   The crawler then recursively follows links contained
within documents that are in the target language.   When these run out,
the entire process is repeated, with a new set of Google queries generated
from the new, larger corpus.
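
As an illustration of the query-generation step only, here is a minimal
Python sketch.  It is not the crawler's actual code: the function names
(tokenize, make_queries), the number of words per query, and the use of
uniform random sampling are all assumptions, since the text above does not
say how words are chosen or combined.  The search request and the
language-identification step are deliberately left out.

```python
import random
import re


def tokenize(text):
    """Lowercase the text and return its runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def make_queries(seed_texts, words_per_query=3, num_queries=5, rng_seed=0):
    """Build search queries by combining distinct words from the seed texts.

    Hypothetical helper: uniform random sampling and the default sizes are
    assumptions made for this sketch, not details taken from the README.
    """
    vocabulary = sorted({w for text in seed_texts for w in tokenize(text)})
    if len(vocabulary) < words_per_query:
        raise ValueError("seed texts contain too few distinct words")
    rng = random.Random(rng_seed)  # fixed seed keeps the sketch repeatable
    return [" ".join(rng.sample(vocabulary, words_per_query))
            for _ in range(num_queries)]


if __name__ == "__main__":
    # Placeholder seed text; real seeds would be running text in the
    # target language.
    seeds = ["alpha beta gamma delta", "epsilon zeta eta theta"]
    for query in make_queries(seeds):
        print(query)
```

Each query would then be sent to a search service, and the vocabulary
rebuilt from the enlarged corpus before the next round.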

EXTRACTING A CLEAN WORD LIST
The raw texts downloaded using the scheme just described contain
a lot of pollution and are unsuitable for use without further processing.
I have been able to extract reasonably accurate spell checking dictionaries
by applying a series of simple filters.   First, the texts are tokenized
and used to generate a word list sorted by frequency, and the
lowest-frequency words are filtered out.   Then, depending on the target
language, correctly-spelled words from one or more "polluting" languages
are filtered out to be checked by hand later.  Usually this means English,
but I also filter Dutch from the Frisian corpus, Spanish from Chamorro, etc.
The remaining words are used to generate 3-gram statistics for the target
language.  These are used to flag as "suspect" any remaining words containing
one or more improbable 3-grams.
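
The three filters described above can be sketched in Python as follows.
This is a rough illustration under stated assumptions, not the author's
implementation: the function names, both thresholds, the word-boundary
markers, and the definition of an "improbable" 3-gram (one that occurs in
fewer than min_trigram_count of the remaining words) are all choices made
for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def trigrams(word):
    """Character 3-grams of a word, padded with ^ and $ boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_word_list(texts, polluting_words, min_count=2, min_trigram_count=2):
    """Return (accepted, held_for_review, suspect) lists of words.

    Both thresholds are illustrative assumptions, not values from the README.
    """
    counts = Counter(w for text in texts for w in tokenize(text))

    # 1. Sort by frequency and drop the lowest-frequency words.
    frequent = [w for w, c in counts.most_common() if c >= min_count]

    # 2. Set aside words that are correctly spelled in a polluting
    #    language, so they can be checked by hand later.
    held = [w for w in frequent if w in polluting_words]
    remaining = [w for w in frequent if w not in polluting_words]

    # 3. Collect 3-gram statistics from the remaining words and flag any
    #    word containing at least one rare 3-gram as suspect.
    tri_counts = Counter(t for w in remaining for t in trigrams(w))
    suspect = [w for w in remaining
               if any(tri_counts[t] < min_trigram_count for t in trigrams(w))]
    suspect_set = set(suspect)
    accepted = [w for w in remaining if w not in suspect_set]
    return accepted, held, suspect


if __name__ == "__main__":
    corpus = ["bana bana banana banana the the xq"]  # placeholder text
    accepted, held, suspect = build_word_list(corpus, {"the"})
    print(accepted, held, suspect)
```

With a real corpus the thresholds would need tuning, and the polluting
word set would come from an existing dictionary for that language.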

Please contact me at the address below if you are interested in applying
these techniques to a new language.

Kevin Scannell
<scannell@slu.edu>
March 2004