IndicSpellchecker

From IndLinux

Building a spellchecker for Indian languages

This is also applicable to languages other than Indian ones.

Background
Some more notes on spell-checking (from web archive)

http://web.archive.org/web/20090523083416/http://cmwiki.sarai.net/index.php/SpellCheck
http://web.archive.org/web/20090523074828/http://cmwiki.sarai.net/index.php/PhoneticDetails
http://web.archive.org/web/20090523075848/http://cmwiki.sarai.net/index.php/HindiPhoneticFile

Aspell based

Making a new aspell dictionary in 8 easy steps
This is a cookbook approach to making an aspell dictionary for a new language. Most of this material is covered in the aspell manual[2], especially in the chapter on adding support for other languages. Please note that advanced features of aspell are not covered here.

  1. Install aspell. Version 0.60 or later is highly recommended for its improved Unicode support. The aspell installation directory, by default /usr/local/lib/aspell-0.60 (the version number may differ), will henceforth be referred to as <aspell-dir>.
  2. Check at the aspell homepage[1] that a dictionary for the language does not already exist. To create a new dictionary, you will need a list of words, one per line, in an appropriately encoded file. Certainly for Indian languages, and also for almost any language nowadays, I would use the UTF-8 (Unicode) encoding. Name this file lang.wl, where "lang" is the name for the language, e.g., hi.wl for Hindi. Put it in a separate working directory, where you will also collect the other material for creating the new dictionary.
  3. Download the latest version of aspell-lang from CVS, following the instructions at http://savannah.gnu.org/cvs/?group=aspell, viz., "cvs -z3 -d:pserver:anonymous@cvs.savannah.gnu.org:/sources/aspell co aspell-lang". This will create a sub-directory called aspell-lang. From now on, <aspell-lang-dir> will be used to refer to the directory that aspell-lang uses, e.g., /usr/local/share/aspell-lang.
  4. If needed, and only if needed, create maps/lang.txt in <aspell-lang-dir>, where "lang" is the name for the language. For Indian languages in Unicode, these character-set maps are already available, e.g., maps/u-deva.txt for Hindi, and the mapping is a simple one that associates the 128-character Unicode space for the language with positions 128-255. For languages that use a standardised encoding, such as iso-8859-1, there is no need to prepare such a lang.txt file. Instead, see step 7 below on how to specify a character set and data encoding. More details on the format of the character set data file are available in the aspell manual[2] in the chapter on adding support for new languages. If you did need a lang.txt file, run "perl mkchardata maps/lang.txt" from <aspell-lang-dir>, which will create maps/lang.cmap and maps/lang.cset, e.g., maps/u-deva.cmap and maps/u-deva.cset for Hindi. Copy these files to <aspell-dir>, and also to the working directory where you will be creating the new dictionary.
  5. Change to the working directory. Create the language information file, "info", required by aspell. The "info" file contains various information about the dictionary, including the author name(s), version number, source URL, copyright information, and notes about the completeness and accuracy of the dictionary. aspell requires the URL of a page that briefly describes the dictionary and allows download of the wordlist used in making it, e.g., as in http://oriya.sarovar.org/dictionary.html. You will also need a "Copyright" file describing the terms under which the wordlist is made available. If the licence is a standard one, or is more than a paragraph or so long, the actual licence should be included in a separate file, COPYING. For the standard GNU licences (GPL, LGPL, etc.), this file will be created for you in the processing phase. A detailed description of the "info" file is given in <aspell-lang-dir>/README, and an example "info" file can be seen in any of the language dictionaries available from the aspell homepage[1].
  6. If you needed to create new .cmap and .cset files for your language, these will need to be included in the dictionary distribution. To ensure that this is done, add the following lines to the "info" file:
    • data-file u-deva.cset
    • data-file u-deva.cmap
    Of course, replace the file names above with the actual file names for your language. Also, make a directory called misc, and copy maps/lang.txt from <aspell-lang-dir> to misc/.
  7. Create a language data file, named lang.dat, e.g., hi.dat for Hindi. The format of this file is described in the aspell manual[2] in the chapter on adding support for new languages. It is probably a good idea to pick up the source for the dictionary distribution for a related language, and use that as a starting point. For a minimalist dictionary, it is sufficient to fill in the fields "name" and "charset". The "data-encoding" entry is useful if the script in question uses different encodings. If "data-encoding" is not specified, it defaults to the same value as "charset". Again, examples are found in the dictionary distributions for various languages.
  8. Now you are all set to make the actual dictionary. At this point, you should have, at a minimum, a wordlist, the lang.cmap and lang.cset files created in step 4 (if you needed them), the language data file described in step 7, and the "info" and "Copyright" files. Copy <aspell-lang-dir>/proc to the working directory, and use it to create the dictionary. Thus,
    • cp <aspell-lang-dir>/proc .
    • ./proc create
    which should create various files, including "configure". To make a distribution suitable for release with aspell, run
    • ./configure
    • make dist
    which will create a .tar.bz2 file, e.g., aspell6-hi-0.01.tar.bz2, that can be installed by any user with the usual "./configure; make; make install" cycle. If you get a mysterious error about the .cwl file containing duplicates, make sure that there are no blank lines in the .wl file containing the uncompressed word list. After making the distribution, you should also check that it installs properly. Uncompress and untar the .tar.bz2 file produced in the last step, change into the sub-directory created, e.g., aspell6-hi-0.01, and do
    • ./configure
    • make
    If this completes without errors, you know that the distribution is OK. Optionally, install the dictionary on your system (usually as root) with
    • make install
    You will probably also want to announce the availability of the new dictionary on the aspell-devel mailing list and to Kevin Atkinson, the developer and maintainer of GNU aspell.
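
Steps 2 and 8 both depend on a clean wordlist: one word per line, UTF-8 encoded, with no blank lines. The following is a minimal Python sketch of one way to tidy a wordlist before running "./proc create". It is not part of aspell or aspell-lang. The function names clean_wordlist and clean_wordlist_file are hypothetical, and removing duplicate words and sorting are extra precautions assumed here, not requirements stated in the steps above.

```python
def clean_wordlist(lines):
    """Return the unique, non-blank words from an iterable of lines.

    Leading and trailing whitespace (including line endings) is stripped,
    lines left empty by stripping are dropped, and the result is sorted
    by Unicode code point.
    """
    words = set()
    for line in lines:
        word = line.strip()
        if word:  # skip blank and whitespace-only lines
            words.add(word)
    return sorted(words)


def clean_wordlist_file(src, dst):
    """Read a UTF-8 wordlist from src and write a cleaned copy to dst.

    Returns the number of words written. "utf-8-sig" is used for reading
    so that a leading byte-order mark, if present, is discarded.
    """
    with open(src, encoding="utf-8-sig") as infile:
        words = clean_wordlist(infile)
    with open(dst, "w", encoding="utf-8", newline="\n") as outfile:
        for word in words:
            outfile.write(word + "\n")
    return len(words)
```

For example, calling clean_wordlist_file("hi.wl", "hi.clean.wl") would write a tidied copy of the Hindi wordlist (the output file name is only an illustration). Check the result before replacing hi.wl with it.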

References
[1] The GNU Aspell homepage
[2] The GNU Aspell manual

Copyright (c) 2005 Gora Mohanty.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts and no Back-Cover Texts.
A copy of the license is at http://www.gnu.org/copyleft/fdl.html

Myspell
Used by OpenOffice and Mozilla


Hunspell based

hunspell
Unicode enabled Myspell for OpenOffice

hunspell home

Hunspell doc