Can aspell output line number and not offset in pipe mode?

Asked 6/4, 2011 at 14:14 Answered 4/6, 2022 at 17:44

Can aspell output line numbers instead of offsets in pipe mode for HTML and XML files? I can't feed the file to aspell line by line, because then aspell can't identify a closing tag when the tag is on the next line.

Thresathresh answered 6/4, 2011 at 14:14 Comment(1)

I'm adding an aspell spell-check for documentation as part of my build process and also would be interested in an answer to this question, so I have started a bounty. – Thyme 13/6, 2011 at 15:59

This will output all occurrences of misspelt words with line numbers:

# Get aspell output...
<my_document.txt aspell pipe list -d en_GB --personal=./aspell.ignore.txt |

# Process the aspell output...
grep '[a-zA-Z]\+ [0-9]\+ [0-9]\+' -oh | \
grep '[a-zA-Z]\+' -o | \
while read word; do grep -on "\<$word\>" my_document.txt; done

Where:

my_document.txt is your original document
en_GB is your primary dictionary choice (e.g. try en_US)
aspell.ignore.txt is an aspell personal dictionary (example below)
aspell_output.txt is the output of aspell in pipe mode (ispell style)
result.txt is a final results file

aspell.ignore.txt example:

personal_ws-1.1 en 500
foo
bar

Example result.txt output (for an en_GB dictionary):

238:color
302:writeable
355:backends
433:dataonly

You can also print the whole line by changing the last grep -on into grep -n.

Thyme answered 19/6, 2011 at 18:57 Comment(2)

If you do not want to see duplicate occurrences, you can extend this with "sort | uniq":

grep '[a-zA-Z]\+ [0-9]\+ [0-9]\+' aspell_output.txt -oh | grep '[a-zA-Z]\+' -o | sort | uniq | while read a; do grep -no "$a" my_document.txt; done > result.txt

– Selfrighteous 18/3, 2014 at 11:14

This works for text files only. If applied to HTML files, a stray th would, for example, list all table header elements. It also produces bogus output when the misspelled word is part of another word. The grep for the misspelled word should check for word boundaries (I edited the code accordingly). Finally, piping to a file is unnecessary if the file is read only once, so I changed that as well. – Croat 16/8, 2015 at 12:24
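
The deduplication and word-boundary points from the comments above can be combined into one lookup step. The following is only a sketch: the function name locate_words is made up for illustration, and it assumes a grep that accepts -w and -o together (GNU grep does). It reads the misspelled words on stdin, one per line, so it can be tried without aspell.

```shell
#!/bin/sh
# locate_words FILE
# Hypothetical helper: reads one word per line on stdin and prints
# "line:word" for every whole-word occurrence of that word in FILE.
locate_words() {
    file=$1
    # sort -u drops repeated words so each one is looked up only once
    sort -u | while IFS= read -r word; do
        [ -n "$word" ] || continue
        # -F: treat the word literally, -w: whole words only,
        # -n: prefix the line number, -o: print only the match
        grep -nowF -- "$word" "$file"
    done
}
```

For example, aspell list < my_document.txt | locate_words my_document.txt would feed it the words aspell reports. Because of -w, a reported th no longer matches inside words such as the or other.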

This is just an idea; I haven't tried it yet (I'm on a Windows machine :(). But maybe you could pipe the HTML file through head (with a byte limit) and count newlines using grep to find your line number. It's neither efficient nor pretty, but it might just work.

cat icantspell.html | head -c <offset from aspell> | egrep -Uc "$"

Aile answered 18/6, 2011 at 4:36 Comment(1)

Ahh! Unfortunately the byte offset is actually per line rather than global to the document, so this won't work after all. You get points for trying though; I thought this was a clever solution, but ispell-style output is quite unintuitive. – Thyme 19/6, 2011 at 19:0

I use the following script to perform spell-checking and to work around the awkward output of aspell -a / ispell. The script also works around the problem that aspell doesn't recognize ordinals like 2nd, by simply ignoring everything aspell reports that is not a word of its own.

#!/bin/bash

set +o pipefail

if [ -t 1 ] ; then
    color="--color=always"
fi

! for file in "$@" ; do
    <"$file" aspell pipe list -p ./dict --mode=html |
    grep '[[:alpha:]]\+ [0-9]\+ [0-9]\+' -oh |
    grep '[[:alpha:]]\+' -o |
    while read word ; do
        grep $color -n "\<$word\>" "$file"
    done
done | grep .

You even get colored output if the stdout of the script is a terminal, and you get an exit status of 1 in case the script found spelling mistakes, otherwise the exit status of the script is 0.

Also, the script protects itself from pipefail, an option that is somewhat popular (e.g. in a Makefile) but doesn't work for this script. Last but not least, this script explicitly uses [[:alpha:]] instead of [a-zA-Z], which is less confusing when it also matches non-ASCII characters like German äöüÄÖÜß and others. [a-zA-Z] matches them as well, but that comes as something of a surprise.
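
The exit-status trick can be isolated in a few lines. This is a sketch rather than part of the script above, and the function name fail_if_output is hypothetical. grep . succeeds only if it reads at least one non-empty line, and the leading ! inverts the status of the whole pipeline, so any output means failure. It assumes the wrapped command itself succeeds and that pipefail is off, as in the script.

```shell
#!/bin/sh
# fail_if_output CMD [ARGS...]
# Hypothetical wrapper: runs CMD, passes its output through, and returns
# 1 if CMD printed at least one non-empty line, 0 if it printed nothing.
fail_if_output() {
    # grep . exits 0 when some line contains at least one character;
    # the leading ! negates the exit status of the whole pipeline.
    ! "$@" | grep .
}
```

A build step could then call something like fail_if_output ./spellcheck.sh docs/*.html (script name assumed) and stop whenever a misspelling is listed.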

Croat answered 16/8, 2015 at 13:13 Comment(0)

aspell pipe / aspell -a / ispell output one empty line for each input line (after reporting the errors of the line).

Demonstration printing the line number with awk:

$ aspell pipe < testFile.txt |
awk '/^$/ { countedLine=countedLine+1; print "#L=" countedLine; next; } //'

produces this output:

@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
& iinternational 7 0: international, Internationale, internationally, internationals, intentional, international's, Internationale's
#L=1
*
*
*
& reelly 22 11: Reilly, really, reel, rely, rally, relay, resell, retell, Riley, rel, regally, Riel, freely, real, rill, roll, reels, reply, Greeley, cruelly, reel's, Reilly's
#L=2
*
#L=3
*
*
& sometypo 18 8: some typo, some-typo, setup, sometime, someday, smote, meetup, smarty, stupor, Smetana, somatic, symmetry, mistype, smutty, smite, Sumter, smut, steppe
#L=4

with testFile.txt

iinternational
I say this reelly.
hello
here is sometypo.

(Still not as nice as hunspell -u (https://mcmap.net/q/1317545/-is-it-possible-to-make-hunspell-print-the-line-numbers-of-the-misspelled-words). But hunspell lacks some command line options I like.)
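
Building on the blank-line observation, the line counter can be combined with the per-line offset that aspell reports. The sketch below assumes the ispell-style layout shown above ("& word count offset: suggestions", plus "# word offset" when there are no suggestions) and that offsets are 0-based, as the sample output suggests. The function name ispell_to_positions is made up for illustration. It reads the pipe output on stdin, so it can be tried on saved output without aspell installed.

```shell
#!/bin/sh
# ispell_to_positions
# Hypothetical filter: turns ispell-style pipe output on stdin into
# "line:column:word" records, one per reported misspelling.
ispell_to_positions() {
    awk '
        BEGIN { line = 1 }
        # An empty line ends the results for one input line.
        /^$/ { line++; next }
        # "& word count offset: suggestions..."
        /^& / { sub(/:$/, "", $4); printf "%d:%d:%s\n", line, $4 + 1, $2; next }
        # "# word offset" (no suggestions available)
        /^# / { printf "%d:%d:%s\n", line, $3 + 1, $2 }
    '
}
```

Used as aspell pipe < testFile.txt | ispell_to_positions, the sample output above would become 1:1:iinternational, 2:12:reelly and 4:9:sometypo, with the column counted from 1.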

Dennadennard answered 13/5, 2020 at 7:34 Comment(0)

For others using aspell with one of the filter modes (tex, html, etc), here's a way to only print line numbers for misspelled words in the filtered text. So for example, it won't print misspellings in the comments.

ASPELL_ARGS="--mode=html --personal=./.aspell.en.pws"

for file in "$@"; do
  for word in $(aspell $ASPELL_ARGS list < "$file" | sort -u); do
      grep -no "\<$word\>" <(aspell $ASPELL_ARGS filter < "$file")
  done | sort -n
done

This works because aspell filter does not delete empty lines. I realize this isn't using aspell pipe as requested by OP, but it's in the same spirit of making aspell print line numbers.

Ambrose answered 4/6, 2022 at 17:44 Comment(0)
