Remove stop words without stemming in PostgreSQL

I want to remove the stop words from my data, but I do not want to stem the words, since the exact words matter to me. I used this query:

SELECT to_tsvector('english', colName) FROM tblName ORDER BY lower ASC;

Is there any way to remove stop words without stemming the words?

thanks

Labiovelar asked 5/2, 2017 at 12:41

Create your own text search dictionary and configuration:

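-- Dictionary that drops English stop words but does no stemming
CREATE TEXT SEARCH DICTIONARY simple_english
   (TEMPLATE = pg_catalog.simple, STOPWORDS = english);

-- Start from a copy of the built-in english configuration
CREATE TEXT SEARCH CONFIGURATION simple_english
   (COPY = english);
-- Route the word token types to the new dictionary instead of the stemmer
ALTER TEXT SEARCH CONFIGURATION simple_english
   ALTER MAPPING FOR asciihword, asciiword, hword, hword_asciipart, hword_part, word
   WITH simple_english;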

It works like this:

SELECT to_tsvector('simple_english', 'many an ox eats the houses');
┌─────────────────────────────────────┐
│             to_tsvector             │
├─────────────────────────────────────┤
│ 'eats':4 'houses':5 'many':1 'ox':3 │
└─────────────────────────────────────┘
(1 row)
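
To see what each dictionary does to a single token, ts_lexize can be called directly. This is a small sketch that assumes the simple_english dictionary above has been created and that the built-in english_stem dictionary is available, as it is in a default installation:

```sql
-- Assumes the simple_english dictionary from the answer already exists.
SELECT ts_lexize('simple_english', 'houses');  -- {houses}: the word is kept as is
SELECT ts_lexize('simple_english', 'the');     -- {}: recognized as a stop word and dropped
SELECT ts_lexize('english_stem', 'houses');    -- {hous}: the built-in stemmer shortens it
```

An empty array means the dictionary recognized the token as a stop word, which is why 'an' and 'the' are missing from the tsvector above.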

You can set the parameter default_text_search_config to simple_english to make it your default text search configuration.
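
As a sketch of that setting, the following changes the default for the current session only. It assumes the simple_english configuration was created in a schema on your search_path; the database name mydb in the commented line is hypothetical.

```sql
-- Session-level default; assumes simple_english is visible via search_path.
SET default_text_search_config = 'simple_english';

-- With the default set, the configuration argument can be omitted.
SELECT to_tsvector('many an ox eats the houses');

-- To persist the default for one database (mydb is a hypothetical name):
-- ALTER DATABASE mydb SET default_text_search_config = 'simple_english';
```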

Lauretta answered 6/2, 2017 at 8:58 Comment(6)
I have done it, and then I ran a query like this: update tblName set cilName= to_tsvector('simple_english',colName); It then returned this error: value too long for type character varying(254)! - Labiovelar
It doesn't make much sense to store a tsvector in a varchar column, particularly if you define it so short that it cannot hold the value. What are you trying to do? - Lauretta
Thanks for the quick reply. I have a column of tags; they can be a few characters or even whole sentences. Some tags are really the same, but stop words and extra characters make them look different. I want to remove the stop words and the extra characters and find the distinct tags. - Labiovelar
Should I change it to text? Is it possible to change it now without losing any information? - Labiovelar
Then you probably want to filter out the :3 and similar. And you should define the column large enough to contain all words. Why not use type text? - Lauretta
Changing to text won't lose any information since varchar and text are the same thing anyway, the only difference being the length limit. - Lauretta
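
Putting the comments together, one possible way to get distinct tags without the position markers is sketched below. It is not from the answer itself: it assumes the simple_english configuration above exists and PostgreSQL 9.6 or later (for tsvector_to_array), and it reuses the tblName and colName placeholders from the question.

```sql
-- Widen the column first so the cleaned value always fits.
ALTER TABLE tblName ALTER COLUMN colName TYPE text;

-- Preview the cleaned tags: lexemes only, without ':3'-style positions.
SELECT DISTINCT
       array_to_string(
           tsvector_to_array(to_tsvector('simple_english', colName)),
           ' ') AS clean_tag
FROM tblName;
```

Note that a tsvector stores its lexemes lowercased, sorted and deduplicated, so clean_tag does not preserve the original word order. That is usually fine for spotting duplicate tags, but it is not a faithful copy of the text.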
