Here's the approach I settled on after some experimenting. It combines basic spelling rules, "fixes", and synonyms. First, apply a char_filter to enforce a set of basic spelling rules. It's not 100% correct, but it does the job pretty well:
"char_filter": {
"en_char_filter": { "type": "mapping", "mappings": [
# fixes
"aerie=>axerie", "aeroplane=>airplane", "aloe=>aloxe", "canoe=>canoxe", "coerce=>coxerce", "poem=>poxem", "prise=>prixse",
# whole words
"armour=>armor", "behaviour=>behavior", "centre=>center", "colour=>color", "clamour=>clamor", "draught=>draft", "endeavour=>endeavor", "favour=>favor", "flavour=>flavor", "harbour=>harbor", "honour=>honor",
"humour=>humor", "labour=>labor", "litre=>liter", "metre=>meter", "mould=>mold", "neighbour=>neighbor", "plough=>plow", "saviour=>savior", "savour=>savor",
# generic transformations
"ae=>e", "ction=>xion", "disc=>disk", "gramme=>gram", "isable=>izable", "isation=>ization", "ise=>ize", "ising=>izing", "ll=>l", "oe=>e", "ogue=>og", "sation=>zation", "yse=>yze", "ysing=>yzing"
] }
}
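The "fixes" entries are there to prevent other rules from being applied incorrectly. E.g. "prise=>prixse" prevents "prise" from being changed into "prize", which has a different meaning. This works because the mapping char filter matches greedily: the longest pattern matching at a given position wins, so the longer fix consumes the word before the shorter "ise=>ize" rule can apply. You may need to adapt these entries to your own needs.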
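To see how a fix interacts with the generic rules, here is a rough sketch in plain Python. It only approximates the mapping char filter (greedy, longest pattern at each position wins, replaced text is not rescanned) and is not Elasticsearch code. The function name `apply_mappings` and the shortened rule list are made up for illustration.

```python
def apply_mappings(text, rules):
    """Approximate a mapping char filter: at each position the longest
    matching pattern wins, and replaced text is not scanned again."""
    table = dict(rule.split("=>", 1) for rule in rules)
    # Try longer patterns first so "prise" beats "ise" at the same position.
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No pattern matches here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# A small subset of the rules above (assumption: illustration only).
rules = ["prise=>prixse", "ise=>ize", "colour=>color"]

print(apply_mappings("prise", rules))         # fix applies: prixse
print(apply_mappings("realise", rules))       # generic rule applies: realize
print(apply_mappings("prise", ["ise=>ize"]))  # without the fix: prize
```

Note that a fix also affects longer words containing the same letters ("surprise" would become "surprixse" here), which is harmless as long as the same filter runs at both index and search time.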
Next, include a synonym filter for catching the most frequently used exceptions:
"en_synonym_filter": { "type": "synonym", "synonyms": EN_SYNONYMS }
Here's our list of synonyms that includes the most important keywords for our use case. You may wish to adapt this list to your needs:
EN_SYNONYMS = (
"accolade, prize => award",
"accoutrement => accouterment",
"aching, pain => hurt",
"acw, anticlockwise, counterclockwise, counter-clockwise => ccw",
"adaptor => adapter",
"advocate, attorney, barrister, procurator, solicitor => lawyer",
"ageing => aging",
"agendas, agendum => agenda",
"almanack => almanac",
"aluminium => aluminum",
"america, united states, usa",
"amphitheatre => amphitheater",
"anti-aliased, anti-aliasing => antialiased",
"arbour => arbor",
"ardour => ardor",
"arse => ass",
"artefact => artifact",
"aubergine => eggplant",
"automobile, motorcar => car",
"axe => ax",
"bannister => banister",
"barbecue => bbq",
"battleaxe => battleax",
"baulk => balk",
"beetroot => beet",
"biassed => biased",
"biassing => biasing",
"biscuit => cookie",
"black american, african american, afro-american, negro",
"bobsleigh => bobsled",
"bonnet => hood",
"bulb, electric bulb, light bulb, lightbulb",
"burned => burnt",
"bussines, bussiness => business",
"business man, business people, businessman",
"business woman, business people, businesswoman",
"bussing => busing",
"cactus, cactuses => cacti",
"calibre => caliber",
"candour => candor",
"candy floss, candyfloss, cotton candy",
"car park, parking area, parking ground, parking lot, parking-lot, parking place, parking",
"carburettor => carburetor",
"castor => caster",
"cataloguing => cataloging",
"catboat, sailboat, sailing boat",
"champion, gainer, victor, win, winner => victory",
"chat => talk",
"chequebook => checkbook",
"chequer => checker",
"chequerboard => checkerboard",
"chequered => checkered",
"christmas tree ball, christmas tree ball ornament, christmas ball ornament, christmas bauble",
"christmas, x-mas => xmas",
"cinema => movies",
"clangour => clangor",
"clarinettist => clarinetist",
"conditioning => conditioner",
"conference => meeting",
"coriander => cilantro",
"corporate => company",
"cosmos, universe => outer space",
"cosy, cosiness => cozy",
"criminal => crime",
"curriculums => curricula",
"cypher => cipher",
"daddy, father, pa, papa => dad",
"defence => defense",
"defenceless => defenseless",
"demeanour => demeanor",
"departure platform, station platform, train platform, train station",
"dishrag => dish cloth",
"dishtowel, dishcloth => dish towel",
"doughnut => donut",
"downspout => drainpipe",
"drugstore => pharmacy",
"e-mail => email",
"enamoured => enamored",
"england => britain",
"english => british",
"epaulette => epaulet",
"exercise, excercise, training, workout => fitness",
"expressway, motorway, highway => freeway",
"facebook => facebook, social media",
"fanny => buttocks",
"fanny pack => bum bag",
"farmyard => barnyard",
"faucet => tap",
"fervour => fervor",
"fibre => fiber",
"fibreglass => fiberglass",
"flashlight => torch",
"flautist => flutist",
"flier => flyer",
"flower fly, hoverfly, syrphid fly, syrphus fly",
"foot-walk, sidewalk, sideway => pavement",
"football, soccer",
"forums => fora",
"fourth => 4",
"freshman => fresher",
"chips, fries, french fries",
"gaol => jail",
"gaolbird => jailbird",
"gaolbreak => jailbreak",
"gaoler => jailer",
"garbage, rubbish => trash",
"gasoline => petrol",
"gases, gasses",
"gauge => gage",
"gauged => gaged",
"gauging => gaging",
"gipsy, gipsies, gypsies => gypsy",
"glamour => glamor",
"glueing => gluing",
"gravesite, sepulchre, sepulture => sepulcher",
"grey => gray",
"greyish => grayish",
"greyness => grayness",
"groyne => groin",
"gryphon, griffon => griffin",
"hand shake, shake hands, shaking hands, handshake",
"haulier => hauler",
"hobo, homeless, tramp => bum",
"new year, new year's eve, hogmanay, silvester, sylvester",
"holiday => vacation",
"holidaymaker, holiday-maker, vacationer, vacationist => tourist",
"homosexual, fag => gay",
"inbox, letterbox, outbox, postbox => mailbox",
"independence day, 4th of july, fourth of july, july 4th, july 4, 4th july, july fourth, forth of july, 4 july, fourth july",
"infant, suckling, toddler => baby",
"infeasible => unfeasible",
"inquire, inquiry => enquire",
"insure => ensure",
"internet, website => www",
"jelly => jam",
"jewelery, jewellery => jewelry",
"jogging => running",
"journey => travel",
"judgement => judgment",
"kerb => curb",
"kiwifruit => kiwi",
"laborer => worker",
"lacklustre => lackluster",
"ladybeetle, ladybird, ladybug => ladybird beetle",
"larrikin, scalawag, rascal, scallywag => naughty boy",
"leaf => leaves",
"licence, licenced, licencing => license",
"liquorice => licorice",
"lorry => truck",
"loupe, magnifier, magnifying, magnifying glass, magnifying lens, zoom",
"louvred => louvered",
"louvres => louver",
"lustre => luster",
"mail => post",
"mailman => postman",
"marriage, married, marry, marrying, wedding => wed",
"mayonaise => mayo",
"meagre => meager",
"misdemeanour => misdemeanor",
"mitre => miter",
"mom, momma, mummy, mother => mum",
"moonlight => moon light",
"moult => molt",
"moustache, moustached => mustache",
"nappy => diaper",
"nightlife => night life",
"normalcy => normality",
"octopus => kraken",
"odour => odor",
"odourless => odorless",
"offence => offense",
"omelette => omelet",
# fix torres del paine
"paine => painee",
"pajamas => pyjamas",
"pantyhose => tights",
"parenthesis, parentheses => bracket",
"parliament => congress",
"parlour => parlor",
"persnickety => pernickety",
"philtre => filter",
"phoney => phony",
"popsicle => iced-lolly",
"porch => veranda",
"pretence => pretense",
"pullover, jumper => sweater",
"pyjama => pajama",
"railway => railroad",
"rancour => rancor",
"rappel => abseil",
"row house, serial house, terrace house, terraced house, terraced housing, town house",
"rigour => rigor",
"rumour => rumor",
"sabre => saber",
"saltpetre => saltpeter",
"sanitarium => sanatorium",
"santa, santa claus, st nicholas, st nicholas day",
"sceptic, sceptical, scepticism, sceptics => skeptic",
"sceptre => scepter",
"shaikh, sheikh => sheik",
"shivaree => charivari",
"silverware, flatware => cutlery",
"simultaneous => simultanous",
"sleigh => sled",
"smoulder, smouldering => smolder",
"sombre => somber",
"speciality => specialty",
"spectre => specter",
"splendour => splendor",
"spoilt => spoiled",
"street => road",
"streetcar, tramway, tram => trolley-car",
"succour => succor",
"sulphate, sulphide, sulphur, sulphurous, sulfurous => sulfur",
"super hero, superhero => hero",
"surname => last name",
"sweets => candy",
"syphon => siphon",
"syphoning => siphoning",
"tack, thumb-tack, thumbtack => drawing pin",
"tailpipe => exhaust pipe",
"taleban => taliban",
"teenager => teen",
"television => tv",
"thank you, thanks",
"theatre => theater",
"tickbox => checkbox",
"ticked => checked",
"timetable => schedule",
"tinned => canned",
"titbit => tidbit",
"toffee => taffy",
"tonne => ton",
"transportation => transport",
"trapezium => trapezoid",
"trousers => pants",
"tumour => tumor",
"twitter => twitter, social media",
"tyre => tire",
"tyres => tires",
"undershirt => singlet",
"university => college",
"upmarket => upscale",
"valour => valor",
"vapour => vapor",
"vigour => vigor",
"waggon => wagon",
"windscreen, windshield => front shield",
"world championship, world cup, worldcup",
"worshipper, worshipping => worshiping",
"yoghourt, yoghurt => yogurt",
"zip, zip code, postal code, postcode",
"zucchini => courgette"
)
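For completeness, here is one way the two pieces might be wired into a custom analyzer. This is a sketch under assumptions, not the exact settings we use: the analyzer name `en_analyzer`, the `standard` tokenizer, and the `lowercase` filter placed before the synonym filter are my own choices, and the two short sample lists stand in for the full mappings and `EN_SYNONYMS` above.

```python
import json

# Shortened stand-ins (assumption): pass the full mapping list and the
# full EN_SYNONYMS tuple from above in a real setup.
SAMPLE_MAPPINGS = ["prise=>prixse", "colour=>color", "ise=>ize"]
SAMPLE_SYNONYMS = ("aubergine => eggplant", "football, soccer")

settings = {
    "analysis": {
        "char_filter": {
            "en_char_filter": {"type": "mapping", "mappings": SAMPLE_MAPPINGS}
        },
        "filter": {
            "en_synonym_filter": {
                "type": "synonym",
                "synonyms": list(SAMPLE_SYNONYMS),
            }
        },
        "analyzer": {
            # "en_analyzer" is a hypothetical name, not from the original setup.
            "en_analyzer": {
                "type": "custom",
                # Char filters run on the raw text before tokenization.
                "char_filter": ["en_char_filter"],
                "tokenizer": "standard",
                # Lowercase first, since the synonym rules are all lowercase.
                "filter": ["lowercase", "en_synonym_filter"],
            }
        },
    }
}

# Request body for creating an index with these settings.
print(json.dumps({"settings": settings}, indent=2))
```

Because the char filter runs before the synonym filter, keep in mind that the text reaching the synonym rules has already been rewritten by the spelling rules.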