Creating the Technical Word List for aspell

1A1. Introduction and Overview

1B1.	Directories and Files used for aspell (at UVSI)
	.aspell.conf personal configuration file example

1C1.	Operating Instructions for aspell (example for UVSI)

1D1.	Master plan to create a technical word-list for aspell

1E1.	Extracting all words from the aspell dictionaries
	- converting to a text file for matching with user words

1E2.	uvhd hex dumps of aspell dictionaries to illustrate binary format

1E3.	Operating Instructions to extract dictionary words and sort

1F1.	Operating Instructions to extract words-used from all user documentation
	- reads all files in directory and sorts
	- writes out multiple words per line to about column 70

1G1.	Illustrated usage of wordxtrct1 and discussion of other uses, such as
	extracting words from binary programs (version# for example).

1G2.	Illustrated usage of wordsort1 and discussion of other uses, such as
	word count analysis.

1H0.	Summary of uvcopy jobs used in this part
	- listings of some jobs in case you wish to inspect the code
	or modify for your own purposes


1A1. WORDjobs: Creating the Technical Word List for aspell

This application should appeal to anyone whose technical text documentation has never been spell checked because of the false alerts caused by technical terms. Some of the components can also be used as standalone general-purpose tools to extract words from binary files or to do word-count analysis on your documents.

Introduction and Overview

Both 'aspell' and 'ispell' are interactive spell checkers for text files on Linux or Unix systems. They highlight misspelled words and let you enter a replacement or, better, pick a suggested replacement by number from a list of alternatives. You may also enter 'I' (Ignore) to suppress error reports for that word for the remainder of the current document. You may find aspell documentation at https://www.aspell.net.

The major problem with using spell checkers for technical documentation is the large number of technical words and acronyms that get reported as errors. You can use the 'I' command to ignore a word for the remainder of any one document, but this gets very tedious when you have dozens or hundreds of files to check.

Aspell provides part of the solution by letting you specify a 'word-list' file of words to be added to the master dictionary supplied with aspell.

However, it is still tedious to prepare the word-list file. In each document you discover more technical words, which you write down so that you can update the word-list before spell checking the next document.

I will later show you how to generate the word-list file of technical words automatically from all your documents, but first I will illustrate how I use aspell at my site.

Using aspell at UV Software

I use the 'vi' editor to maintain my documentation (over 100 text files in the /home/uvadm/doc subdirectory). After significant updates with 'vi', I automatically convert the text files to HTML files and FTP them to my web site. Please see HTMLjobs.htm if you wish to see how legacy text files can be converted to HTML automatically.


1B1. WORDjobs: Creating the Technical Word List for aspell

Directories and Files for aspell (at UVSI)

 /home/uvadm
 :------------>.aspell.conf        - personal config file (in home dir)
 :-----ctl
 :     :------>aspell_words_ok.txt - personal wordlist to add to dictionary
 :     :                           - prepared with vi & input to 'aspell create'
 :     :------>aspell_words_ok     - personal wordlist converted for aspell use
 :-----doc
 :     :------>text-documents      - maintained with 'vi'
 :     :------>...120 files...     - spell checked with aspell
 :     :------>text-documents      - converted to HTML for FTP to web site
 :-----docbak
 :     :------>...backups...
 :-----dochtml
 :     :------>...HTML versions...

aspell personal config file

 # .aspell.conf - personal configuration file for aspell
 #              - to check UVdoc spelling - by OT UVSI May 2004
 # effective file stored at: /home/uvadm/.aspell.conf
 # backup file stored at: /home/uvadm/ctl/aspell.conf
 #
 # vi ctl/aspell_words_ok.txt     <-- create text file of technical words
 # ==========================
 #
 # aspell create personal ctl/aspell_words_ok < ctl/aspell_words_ok.txt
 # ====================================================================
 #       - convert text file to binary format required by aspell
 #
 # aspell -x check filename.doc   <-- invoke aspell
 # ============================
 #
 ignore-case
 run-together
 personal /home/uvadm/ctl/aspell_words_ok
 #


1C1. WORDjobs: Creating the Technical Word List for aspell

Operating Instructions for aspell


 #0. login uvadm --> /home/uvadm


 #1. cp -r doc docbak              <-- backup all doc files (see -x option below)
     ================                - once before aspell on 100 files


 #2. vi ctl/aspell_words_ok.txt    <-- create/update text file of technical words
     ==========================      - as necessary during 100 file checks


 #3. aspell create personal ctl/aspell_words_ok < ctl/aspell_words_ok.txt
     ====================================================================
            - convert text file to binary format required by aspell


 #4. aspell -x check filename.doc  <-- invoke aspell
     ============================


 #5. diff docbak/filename.doc doc/filename.doc | more
     ================================================
            - verify aspell actions (optional)

Problem and Solution

When I first started to spell-check my documents, I used 'vi' to create 'aspell_words_ok.txt', which I then converted to the binary file version with the 'aspell create personal' command.

For the first version of aspell_words_ok.txt, I entered all the technical words that I could remember using in my documents.

But in each aspell session I found more technical words, which had to be written down so that I could update the text file and recreate the binary file version at the end of the session.

That was when I decided there had to be a better way. That 'better way' is described on the following pages.


1D1. WORDjobs: Creating the Technical Word List for aspell

Here is an overview of the plan to automatically generate the word-list from your documents and drop out the properly spelled words by matching to words extracted from the aspell master dictionary. You then specify the resulting file as your personal word-list in the aspell personal configuration file.

We will accomplish our task using the 'uvhd' utility and three pre-programmed uvcopy jobs (wordsort1, wordxtrct1, and wordmerge1). We will provide the exact operating instructions later; first we present an overview and some explanations.

Master Plan - to create your word-list for aspell

 1. Use 'uvhd' to extract the portions of the aspell dictionary binary files that contain the desired words (omitting the phonetic control portions).
 2. Concatenate the extracted dictionary files into one file.
 3. Use 'wordxtrct1' to convert the binary word format to an all-text file of words separated by spaces, multiple words per line to about column 70.
 4. Use 'wordsort1' to sort the dictionary file. This is not strictly required (since the wordmerge1 job below includes a sort), but it makes a nice master list if you want to check whether words are present or absent.
 5. Use 'wordsort1' to read all files in your directory of text doc files, sort, drop duplicates, and write out tmp2/aspell_words_used with blank-separated multiple words per line to about column 70.
 6. Use 'wordmerge1' to sort/merge the 'words-used' file with the dictionary words, dropping duplicates and retaining only the unmatched words from the 'words-used' file, writing the output to tmp2/aspell_words_ok.txt (which is then copied to the ctl subdirectory).
 7. Edit the 'words_ok' file to remove the misspelled words, leaving only the technical words you wish to have accepted without generating errors.
 8. Convert the text file of 'words-ok' to the binary file format required by aspell (using the 'aspell create personal' command).
 9. Run aspell on your dozens or hundreds of text documents without error messages being generated by your technical words.


1E1. WORDjobs: Creating the Technical Word List for aspell

The previous page outlined our plan to automatically create the technical word-list for aspell from our documents and the aspell dictionaries. The first task is to convert the binary aspell dictionaries to a text file of words separated by spaces (not nulls), so we can match to our words-used list.

extracting words from aspell dictionaries

The aspell dictionaries can be found at /usr/lib/aspell. These are binary files with the words separated by nulls (not spaces). They also contain a lot of phonetic control codes that we will omit. Fortunately, the words are grouped in contiguous blocks near the beginning of the file.

We can use 'uvhd' to locate the beginning and ending block numbers. Then we can reposition to the first block of words and use the 'w'rite command to write out the calculated number of blocks. Here are the aspell dictionary files (and the beginning and ending block numbers of the words) that I found on my system (Red Hat Enterprise Linux 3.0).

aspell files in /usr/lib/aspell

                               words   words    write
 aspell filename   total-blks  start     end    blocks
 ==============================================================
 american-med-only  |   592  |   17  |   217  |   201  |
 english-med-only   | 15824  |   17  |  5685  |  5669  |
 english-variant-0  |    96  |   17  |    30  |    14  |
 english-variant-1  |   400  |   17  |   119  |   103  |
 english-variant-2  |   400  |   17  |   125  |   109  |
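
The 'write blocks' column is an inclusive count (end - start + 1), and with the r256 record size block n begins at byte (n-1)*256. Here is a minimal sketch of that arithmetic in shell. The function names are hypothetical helpers for illustration only; they are not part of uvhd.

```shell
# Hypothetical helpers (not part of uvhd): block arithmetic for 256-byte blocks.

# blocks_to_write START END -> number of blocks from START to END inclusive
blocks_to_write() {
    echo $(( $2 - $1 + 1 ))
}

# block_offset BLOCK [RSIZE] -> byte offset of the first byte of a 1-based block
block_offset() {
    echo $(( ($1 - 1) * ${2:-256} ))
}

blocks_to_write 17 5685    # english-med-only: prints 5669
block_offset 17            # start of the words: prints 4096
```

The same calculation reproduces every row of the table above, for example 217 - 17 + 1 = 201 for american-med-only.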

Here are the uvhd operating instructions to write out the words from the biggest file. Note that uvhd writes output files to the 'tmp' subdirectory using the same name, with a date/time stamp appended.


 uvhd /usr/lib/aspell/english-med-only r256
 ==========================================
      --> 17      <-- determine start of words (block 17)
      --> 5685    <-- browse to end of words (block 5685)
      --> 17      <-- reposition to start of words
      --> w5669   <-- write out dictionary words (5685-17+1=5669)
      --> q       <-- quit uvhd


 ls -l tmp        <-- observe filename written by uvhd
 =========
 tmp/english-med-only.yymmddhhmmW   <-- note date/time stamped file in tmp
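
If you do not have uvhd, the same range of blocks can probably be copied with the standard 'dd' utility. The following is only a sketch under that assumption: the function name 'extract_blocks' and the output filename in the usage comment are hypothetical, and the block numbers must still be found by inspecting your own dictionary files.

```shell
# extract_blocks FILE START END [RSIZE]
# Hypothetical helper: copy blocks START..END (1-based, inclusive) of FILE
# to standard output, using RSIZE-byte blocks (default 256, as with r256).
extract_blocks() {
    rsize=${4:-256}
    # dd counts 'skip' and 'count' in units of bs, so skip START-1 blocks
    dd if="$1" bs="$rsize" skip=$(( $2 - 1 )) count=$(( $3 - $2 + 1 )) 2>/dev/null
}

# Usage (path as on the author's system, output name is illustrative):
#   extract_blocks /usr/lib/aspell/english-med-only 17 5685 > tmp/english-med-only.words
```

Unlike uvhd, this writes to standard output and does not append a date/time stamp to the filename.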

The next page illustrates the 'uvhd' browsing and writing --->

See the uvhd instructions for all five files two pages ahead --->


1E2. WORDjobs: Creating the Technical Word List for aspell

sample uvhd to extract dictionary words


 uvhd /usr/lib/aspell/english-med-only r256
 ==========================================

                      10        20        30        40        50        60
 r#        1 0123456789012345678901234567890123456789012345678901234567890123
           0 aspell rowl 1.3.......=..0...G..{....p.......P.......s..........
             677666276762323001000C30031094007100071010000510AC00D7000D000000
             1305CC02F7C01E30000000D000604720BB400000E00000105D00731000508000
          64 ................english.phonet.1.1..............................
             00000000FFFF0000666667607666670323000000000000000000000000000000
             70004000FFFF10005E7C938008FE5401E1000000000000000000000000000000
         128 ................................................................
             0000000000000000000000000000000000000000000000000000000000000000
             0000000000000000000000000000000000000000000000000000000000000000

 --> 17  <-- position to block #17
                      10        20        30        40        50        60
 r#       17 0123456789012345678901234567890123456789012345678901234567890123
        4096 .Hauser's.refinances.catchall.subtrahend.Brandea's.rarebits.petr
             0467767270766666666706676666607767766666047666662707676667707677
             08153527302569E1E3530314381CC035242185E40221E4517302125294300542
          64 ologist's.tolls.fodders.allovers.hearse.disincline.millpond's.mi
             6666677270766670666667706666767706667760667666666606666766627066
             FCF79347304FCC306F4452301CCF6523085123504939E3C9E50D9CC0FE4730D9
         128 strial's.grandfathered.Grenier's.empirical.domestication.carnall
             7776662706766666766766047666672706676766660666677666766606676666
             34291C730721E4614852540725E9527305D092931C04FD53493149FE0312E1CC
         192 y.synonymy.inadequacies.Palmerston's.Mundt's.hearts.tone's.allot
             7077666767066666776666705666677766270476672706667770766627066667
             9039EFE9D909E1451513953001CD5234FE730D5E4473085124304FE57301CCF4

 --> w5669 <-- write 5669 blocks
                      10        20        30        40        50        60
 r#     5685 0123456789012345678901234567890123456789012345678901234567890123
     1455104 ract.Ryley's.countersign.annihilations.unrulier.venal.summerhous
             7667057667270667676776660666666667666707677666707666607766676677
             2134029C597303F5E452397E01EE989C149FE305E25C952065E1C035DD528F53
          64 es.telltales.Tome's.overprinted.refinanced.tomboy.thousandfold.F
             6707666766670566627067677766766076666666660766667076677666666604
             53045CC41C5304FD5730F652029E45402569E1E35404FD2F9048F531E46FC406
         128 loyd's.Cthrine's.infant's.mistress's............................
             6676270476766627066666727066777677270000000000000000000000000000
             CF9473034829E57309E61E4730D9342533730000000000000000000000000000
         192 ................................................................
             0000000000000000000000000000000000000000000000000000000000000000
             0000000000000000000000000000000000000000000000000000000000000000

 w5669 5669 written, tmp/english-med-only.0405241908W

 rec#=5685 rcount=15824 rsize=256 fsize=4050944 /usr/lib/aspell/english-med-only
 null=next,r#=rec,s=search,u=update,p=print,i=iprint,w=write,t=tally,c=checkseq
 ,R#=Recsize,h1=char,h2=hex,q=quit,?=help --> q    <-- quit
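
The status line above is consistent with the record size: rcount x rsize = 15824 x 256 = 4050944 = fsize. A small sketch of the same check with standard tools follows; 'block_count' is a hypothetical helper name, not a uvhd command.

```shell
# block_count FILE [RSIZE]
# Hypothetical helper: number of RSIZE-byte blocks in FILE (default 256);
# a final partial block is counted as one block.
block_count() {
    rsize=${2:-256}
    fsize=$(( $(wc -c < "$1") ))          # file size in bytes
    echo $(( (fsize + rsize - 1) / rsize ))
}

# Usage (path as on the author's system):
#   block_count /usr/lib/aspell/english-med-only
```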


1E3. WORDjobs: Creating the Technical Word List for aspell

using uvhd to extract words from dictionary files

Please refer back to page '1E1' for the list of aspell dictionary files and the relevant start/end block numbers where the contiguous words are found.


 #1a. rm -f tmp/*        <-- clear out the 'tmp' subdir


 #1b. mkdir tmp1 tmp2    <-- make 2 additional tmp subdirs


 #2a. uvhd /usr/lib/aspell/american-med-only r256
      ==========================================
      --> 17      <-- position to start of words
      --> w201    <-- write out dictionary words
      --> q       <-- quit uvhd


 #2b. uvhd /usr/lib/aspell/english-med-only r256
      ==========================================
      --> 17    --> w5669  --> q


 #2c. uvhd /usr/lib/aspell/english-variant-0 r256
      ==========================================
      --> 17    --> w14    --> q


 #2d. uvhd /usr/lib/aspell/english-variant-1 r256
      ==========================================
      --> 17    --> w103   --> q


 #2e. uvhd /usr/lib/aspell/english-variant-2 r256
      ==========================================
      --> 17    --> w109   --> q


 #3.  ls -l tmp   <-- observe filename written by uvhd
      =========
      tmp/american-med-only.yymmddhhmmW
      tmp/english-med-only.yymmddhhmmW
      tmp/english-variant-0.yymmddhhmmW
      tmp/english-variant-1.yymmddhhmmW
      tmp/english-variant-2.yymmddhhmmW


 #4.  cat tmp/* >tmp1/aspell_dict_raw


 #5.  uvcopy wordxtrct1,fili1=tmp1/aspell_dict_raw,filo1=tmp1/aspell_dict_text
      ========================================================================
             - convert the binary file to a text file
             - space separated multiple words per line to about column 70


 #6.  uvcopy wordsort1,fili1=tmp1/aspell_dict_text,filo1=tmp2/aspell_dict_sorted
      ==========================================================================
             - sort the dictionary words and write output in the same format
             - space separated multiple words per line to about column 70
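
For readers who want to see roughly what step #5 does, here is a sketch using standard tools in place of uvcopy. It is not the wordxtrct1 code. It assumes that every character other than a letter or apostrophe is a word separator (apostrophes are kept because the sample dictionary words include forms such as Hauser's), drops 1-character words, and translates to lower case, similar to the job's d1 and t1 defaults. The function name 'words_to_text' is hypothetical.

```shell
# words_to_text [WIDTH]
# Hypothetical stand-in for wordxtrct1: read binary or null-separated input
# on stdin, write blank-separated lower-case words, starting a new line
# whenever the next word would pass WIDTH columns (default 70).
words_to_text() {
    width=${1:-70}
    # turn every run of non-letter, non-apostrophe bytes into one newline
    LC_ALL=C tr -cs "A-Za-z'" '\n' |
    awk -v width="$width" '
        length($0) < 2 { next }            # drop empty and 1-character words
        { w = tolower($0) }                # translate to lower case
        len > 0 && len + 1 + length(w) > width { printf "\n"; len = 0 }
        { if (len > 0) { printf " "; len++ }
          printf "%s", w; len += length(w) }
        END { if (len > 0) printf "\n" }'
}

# Usage, with the file names from step #5:
#   words_to_text < tmp1/aspell_dict_raw > tmp1/aspell_dict_text
```

This sketch does not sort; the order of the words in the output is the order in which they appear in the input.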


1F1. WORDjobs: Creating the Technical Word List for aspell

Please refer back to our master plan on page '1D1'. We have just completed the conversion of aspell dictionary files to a text file and we are now ready to extract the 'words-used' file from all files in our doc directory.

Here are the operating instructions to automatically create a technical word-list from your documents and the aspell master dictionary.

extract words_used, match to master, creating words_ok


 #1. uvcopy wordsort1,fild1=doc,filo1=tmp2/aspell_words_used
     =======================================================
            - create file of all words used in all your text files
              (technical words, properly spelled words, and misspelled words)
            - sorted and duplicates removed


 #2. uvcopy wordmerge1,fili1=tmp2/aspell_words_used,fili2=tmp2/aspell_dict_sorted
     ============================================================================
                      ,filo1=tmp2/aspell_words_ok.txt
                      ===============================
            - sort/merge words_used with the aspell dictionary, drop duplicates
            - write out only the unmatched words from the words_used input file


 #3. cp tmp2/aspell_words_ok.txt ctl   <-- copy to permanent subdir (ctl)
     ===============================       for extended use


 #4. vi ctl/aspell_words_ok.txt        <-- drop the misspelled words
     ==========================            retaining the technical words
                                           (to be considered OK)


 #5. aspell create personal ctl/aspell_words_ok < ctl/aspell_words_ok.txt
     ====================================================================
            - convert the text file to the binary format required by aspell


 #6. aspell -x check doc/filename.doc  <-- invoke aspell for each file
     ================================    - repeat for 100 files
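
The matching idea in step #2 (keep only the words-used entries that are absent from the dictionary) can be sketched with the standard 'sort' and 'comm' utilities. This is an approximation for illustration, not the wordmerge1 code: the function name 'unmatched_words' is hypothetical, and the output is one word per line rather than multiple words per line.

```shell
# unmatched_words USED DICT
# Hypothetical stand-in for wordmerge1. USED and DICT are text files of
# blank-separated words (several per line). Prints, one per line and in
# sorted order, the words that occur in USED but not in DICT.
unmatched_words() {
    used_tmp=$(mktemp) || return 1
    dict_tmp=$(mktemp) || return 1
    # split each file into one word per line, then sort and drop duplicates
    awk '{ for (i = 1; i <= NF; i++) print $i }' "$1" | LC_ALL=C sort -u > "$used_tmp"
    awk '{ for (i = 1; i <= NF; i++) print $i }' "$2" | LC_ALL=C sort -u > "$dict_tmp"
    # comm -23 keeps only the lines unique to the first file
    LC_ALL=C comm -23 "$used_tmp" "$dict_tmp"
    rm -f "$used_tmp" "$dict_tmp"
}

# Usage, with the file names from step #2:
#   unmatched_words tmp2/aspell_words_used tmp2/aspell_dict_sorted > tmp2/aspell_words_ok.txt
```

Both inputs must use the same letter case for the comparison to work, which is the situation when both files were produced with lower-case translation.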


1G1. WORDjobs: Creating the Technical Word List for aspell

wordxtrct1 - other uses

'wordxtrct1' was executed on page '1E3' as part of the plan to create an aspell technical word-list to enable easier spell checking of user documents. 'wordxtrct1' is also intended as a general-purpose pre-programmed job that you might find useful for applications other than spell checking.

The wordxtrct1 command is re-executed here to illustrate the options and show some of the output file. We will then discuss other possible uses.


 #5. uvcopy wordxtrct1,fili1=tmp1/aspell_dict_raw,filo1=tmp1/aspell_dict_text
     ========================================================================

Option Defaults display and prompt

 uop=d1t1 - option defaults
     d1   - delete single character words
     d2   - delete 2 character words (d0=no deletes, d3=max char deletes)
       t1 - translate to lower case (t0 do not translate)
 User OPtion (uop) defaults  = q1d1t1
  null to accept or re-specify (1 or more) -->     <-- null to accept defaults

aspell_dict_text output - first few lines

 psychoanalyzed nasalizing compartmentalization specializing epitomize
 succored leukemia glamorization fossilized temporizing's eyer
 polymerization's raveling flavor's criminalization vulcanization's
 motorizing succorer criticizinglies resymbolizations colorfastnesses
 sensitize homeostatic vialed unmechanizes programming colorfastness
 colorfully popularization organizing donutting individualizer
 diagonalize analogized anesthesia's analogizes revisualizes
 centerboard cenobites evangelizing theorizer behoove modernizations
 apologized mesmerizer's

This is just the first few lines of 22,000 total lines and 160,000 total words. You will notice that the words are not in sequence. I think the placement is determined by the phonetic algorithms used by aspell. The next step (#6) of the operating instructions on page '1E3' will sort them into sequence.

other uses for wordxtrct1

You might find other uses for wordxtrct1. For example, you might want to extract text information from an executable program, such as version number, or help information. You could try this on some of the unix/linux bin programs:


 1. uvcopy wordxtrct1,fili1=/bin/cat,filo1=tmp/cat.text
    ===================================================


 2. vi tmp/cat.text
    ===============


1G2. WORDjobs: Creating the Technical Word List for aspell

wordsort1 - sample use

'wordsort1' was executed on page '1F1' as part of the plan to create an aspell technical word-list to enable easier spell checking of user documents. 'wordsort1' is also intended as a general-purpose pre-programmed job that you might find useful for applications other than spell checking.

The wordsort1 command is re-executed here to illustrate the options and show some of the output file. We will then discuss other possible uses.


 #1. uvcopy wordsort1,fild1=doc,filo1=tmp2/aspell_words_used
     =======================================================

Option Defaults display and prompt

 uop=a1c65d1m2n30s0t1 - option defaults
     a1               - accept alpha characters (default for aspell)
     a2               - accept numerics (use a3 for alphanumeric a1+a2)
     a4               - accept punctuation (use a7 for all chars a1+a2+a4)
       c65            - max column exceeded to output multi-word lines
          d1          - drop 1 character words (d0=do NOT drop)
            m2        - minimum word length (drop if less)
              n30     - maximum word length (drop if more)
                 s1   - statistics, word count appended at end each word(9)
                 s0   - statistics turned off
                   t1 - translate to lower case (t0 do not translate)
 User OPtion (uop) defaults  = q1a1c65d1m2n30s0t1
  null to accept or re-specify (1 or more) -->      <-- null to accept defaults

aspell_words_used output - first few lines

 ab abab abanta abbreviated abbreviation abbreviations abc abcco abcd
 abcde abcdef abcdefghi abcfile abcxyz abend abended ability able
 abndcd abnormal abnormally abort aborted aborting about above abrasrt
 absence absent absolute absolutely abterm abudfil abudrec abudytd ac
 academic acaps acc accept accept'ed accept's acceptable accepted
 accepting accepts acceptu access accessed accesses accessible
 accessing accfix accidental accidentally accname accommodate
 accommodated accomodate accomodated accompanied accompany accomplish
 accomplished accordingly account accounted accounting accountno
 accounts accouts accpare accrdt accreg accross accrual acct acctest
 acctmas acctmstr acctounting
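
A rough equivalent of the words-used extraction can be put together from standard tools. The sketch below is an approximation, not the wordsort1 code. It assumes letters and apostrophes form words (the sample output contains accept's), translates to lower case, keeps words of 2 to 30 characters (like the m2 and n30 defaults), and writes one word per line instead of multiple words per line. The function name 'words_used' is hypothetical.

```shell
# words_used DIR [MINLEN] [MAXLEN]
# Hypothetical stand-in for wordsort1 with fild1=DIR: print every distinct
# word found in the files under DIR, lower case, sorted, one per line.
words_used() {
    find "$1" -type f -exec cat {} + |
    LC_ALL=C tr -cs "A-Za-z'" '\n' |       # one candidate word per line
    LC_ALL=C tr 'A-Z' 'a-z' |              # translate to lower case
    awk -v min="${2:-2}" -v max="${3:-30}" \
        'length($0) >= min && length($0) <= max' |
    LC_ALL=C sort -u                       # sort and drop duplicates
}

# Usage, with the names from the command above:
#   words_used doc > tmp2/aspell_words_used
```

Quoted text in the documents will leave stray leading or trailing apostrophes on some words; the sketch makes no attempt to trim them.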

other uses for wordsort1

You might use wordsort1 to determine word usage statistics. Please see the example on the next page, where we will rerun the same job shown above, using the statistics option.


1G3. WORDjobs: Creating the Technical Word List for aspell

wordsort1 - with 'statistics' option

We will re-execute 'wordsort1' with the 'statistics' option, which appends the duplicate count onto the end of each word.

We will specify the 's1' option by appending ',uop=s1' onto the command line. Alternatively we could enter 's1' at the options prompt (see previous page).


 #1. uvcopy wordsort1,fild1=doc,filo1=tmp2/aspell_words_used,uop=s1
     ========================================================******

aspell_words_used output - first few lines

 ab(89) abab(2) abanta(2) abbreviated(1) abbreviation(2)
 abbreviations(4) abc(66) abcco(4) abcd(5) abcde(1) abcdef(4)
 abcdefghi(12) abcfile(1) abcxyz(10) abend(18) abended(1) ability(7)
 able(30) abndcd(1) abnormal(3) abnormally(14) abort(1) aborted(17)
 aborting(1) about(139) above(965) abrasrt(1) absence(38) absent(148)
 absolute(14) absolutely(1) abterm(56) abudfil(4) abudrec(2) abudytd(3)
 ac(6) academic(3) acaps(4) acc(72) accept(432) accept'ed(1)
 accept's(7) acceptable(1) accepted(9) accepting(1) accepts(13)
 acceptu(1) access(151) accessed(11) accesses(12) accessible(3)
 accessing(8) accfix(6) accidental(6) accidentally(4) accname(3)
 accommodate(8) accommodated(2) accomodate(2) accomodated(1)
 accompanied(2) accompany(2)
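
The word(count) format shown above can be imitated with 'sort' and 'uniq -c'. This is a sketch only, not the wordsort1 code. The function name 'word_counts' is hypothetical, and it expects one word per line on standard input with the duplicates still present.

```shell
# word_counts
# Hypothetical helper: read one word per line on stdin (duplicates included)
# and print each distinct word once as word(count), sorted by word.
word_counts() {
    LC_ALL=C sort | uniq -c | awk '{ printf "%s(%d)\n", $2, $1 }'
}

# Usage (the input file name is illustrative):
#   tr -cs "A-Za-z'" '\n' < doc/filename.doc | tr 'A-Z' 'a-z' | word_counts

printf 'abc\nab\nabc\nabend\nab\nab\n' | word_counts
# prints:
#   ab(3)
#   abc(2)
#   abend(1)
```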


1H0. WORDjobs: Creating the Technical Word List for aspell

uvcopy job Summary and Listings

wordxtrct1

- extracts words from binary files
- translates all non-alphas to blanks (by default)
- options to allow numerics or punctuation
- outputs multiple words per line to about column 70
- see uvcopy code listing at code/wordxtrct1

wordsort1

- sorts words from 1 text file or all files in a directory
- drops duplicates and outputs multiple words per line to about column 70
- option to append the count of duplicates to the end of each word, as in word(9)
- options to drop 1, 2, or 3 character words
- see uvcopy code listing at code/wordsort1

wordmerge1

- sort/merges the 'words-used' file with the dictionary words
- drops duplicates and retains only unmatched words from 'words-used'
- output file is used to create the aspell personal word-list file, which causes aspell not to report errors on your technical words
- see uvcopy code listing at code/wordmerge1

If you have the Vancouver Utilities installed on your machine, you can view or list as follows (using wordxtrct1 as an example).


 vi /home/uvadm/pf/util/wordxtrct1       <-- inspect with vi
 =================================


 uvlp12 /home/uvadm/pf/util/wordxtrct1   <-- list at 12 cpi
 =====================================

