SQL count number of words in field
Asked Answered
C

11

13

I'd like to make an SQL query where the condition is that column1 contains three or more words. Is there something to do that?

Caffeine answered 7/1, 2015 at 19:43 Comment(5)
Yes, you should first decide on the database you are using and tag your question appropriately. MySQL? SQL Server? Postgres? Something else.Linder
dev.mysql.com/doc/refman/5.0/en/string-functions.htmlMisgiving
maybe use a regular expression - then count the matchesFluecure
possible duplicate of Using SQL to determine word count stats of a text fieldTry
You will not find a single SQL statement which does this reliably on all mentioned DBMSLibrium
T
32

maybe try counting spaces ?

SELECT * 
FROM table
WHERE (LENGTH(column1) - LENGTH(replace(column1, ' ', ''))) > 1

and assume words is number of spaces + 1

Troublesome answered 7/1, 2015 at 19:47 Comment(6)
Wont handle multiple spaces between wordsPalacio
@Palacio . . . I don't think that limitation is really worth a downvote, given that the OP doesn't specify what the delimiters are. One space is a reasonable assumption (although the answers could make that explicit).Linder
len() is a microsoft invention, it's not portableWorkmanship
Well ... only if one does not care about accuracyPalacio
It may not produce 100% accurate results, but is there any better solution? I upvoted. My main gripe about this approach is it will be very slow for anything but very small tables.Resupine
it has a syntax issue, one more bracket ) is required after )).Jello
L
5

If you want a condition that a column contains three or more words and you want it to work in a bunch of databases and we assume that words are separated by single spaces, then you can use like:

where column1 like '% % %'
Linder answered 7/1, 2015 at 19:56 Comment(3)
that would match a column containing only two spaces where column1 like '%_ %_ %_' is only slightly better.Workmanship
@Workmanship . . . Read the answer. The assumption is clearly stated that words are separated by a single space. A string consisting only of spaces would not meet that condition.Linder
I must admit that your answer is the only one that will execute in all three databases.Workmanship
M
5

I think David nailed it above. However, as a more complete answer:

LENGTH(RTRIM(LTRIM(REPLACE(column1,'  ', ' ')))) - LENGTH(REPLACE(RTRIM(LTRIM(REPLACE(column1, '  ', ' '))), ' ', '')) + 1 AS number_of_words

This will remove double spaces, as well as leading and trailing spaces in your string.

Of course, you may go further by adding replacements for more than 2 spaces in a row...

Marven answered 20/3, 2019 at 1:44 Comment(0)
L
4

In Postgres you can use regexp_split_to_array() for this:

select *
from the_table
where array_length(regexp_split_to_array(the_column, '\s+'), 1) >= 3;

This will split the contents of the column the_column into array elements. One ore more whitespace are used as the delimiter. It won't respect "quoted" spaces though. The value 'one "two three" four' will be counted as four words.

Librium answered 8/1, 2015 at 7:32 Comment(2)
Can you explain to me in detail what that "1" flag at the end of regexp_split_to_array does? I can't find the documentation on itDrafty
@Tom: that's a parameter for array_length() not for regexp_split_to_array()Librium
R
3

The best way to do this, is to NOT do this.

Instead, you should use the application layer to count the words during INSERT and save the word count into its own column.

While I like, and upvoted, some of the answers here, all of them will be very slow and not 100% accurate.

I know people want a simple answer to SELECT the word count, but it just is NOT POSSIBLE with accuracy and speed.

If you want it to be 100% accurate, and very fast, then use this solution.

Steps to solve:

  1. Add a column to your table and index it: ALTER TABLE tablename ADD COLUMN wordcount INT UNSIGNED NULL, ADD INDEX idxtablename_count (wordcount ASC);.
  2. Before doing your INSERT, count the number of words using your application. For example in PHP: $count = str_word_count($somevalue);
  3. During the INSERT, include the value of $count for the column wordcount like insert into tablename (col1, col2, col3, wordcount) values (val1, val2, val3, $count);

Then your select statement becomes super easy, clean, uber-fast, and 100% accurate.

select * from tablename where wordcount >= 3;

Also remember when you are updating any rows that you will need to recount the words for that column.

Resupine answered 7/3, 2016 at 23:15 Comment(8)
This is the best answer because it is the only one that is 100% accurate and does not do a full table scan (and so it is fast, unlike the other solutions)Resupine
This is not the best answer because it does not answer the question.Shermy
@KenJohnson "I'd like to make an SQL query where the condition is that column1 contains three or more words. Is there something to do that?" was the question and select * from tablename where wordcount >= 3; is my answer, if you want to be semantically accurate. OP asked for a query. OP got a query.Resupine
@KenJohnson, I'm a little late here, but sometimes a question can be answered by asking a different question and answering that one. In my experience, some hard problems become easy when adding a layer of abstraction or solving the problem at a different stage in the code's lifecycle.Langur
@JaredMenard I totally agree with that sentiment, and i believe that this answer has a place in this thread. I merely contest the statement that this is the "best answer" since it answers it in a roundabout way, with contrived constraints (albeit constraints that others may identify with).Shermy
@KenJohnson Then "best" without further specification is a subjective term.Sweetscented
Given the potential for existing data, I feel this answer could be improved by adding a step between 1 and 2 using a slow, 100% accurate solution (Like @Madhivanan's) to update the new column for existing rows. Slow once, but then future queries will work on old data.Clipper
@IsaacReefman For existing data would it be better to run a one-time update overnight or something that populates the count for all existing rows? Rather doing it here for every call in the future?Resupine
L
1

For "n" or more words

select *
from table
where (length(column)- length(replace(column, " ", "")) + 1) >= n

PS: This would not work if words have multiple spaces between them.

Logging answered 11/4, 2018 at 18:21 Comment(0)
B
1

With ClickHouse DB You can use splitByWhitespace() function.

Refer : https://clickhouse.com/docs/en/sql-reference/functions/splitting-merging-functions#splitbywhitespaces

Bireme answered 4/8, 2022 at 4:54 Comment(2)
Your answer could be improved by adding an example of query (relevant to the question) using this function in the answer itself.Stickpin
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From ReviewSweaty
P
1

None of the other answers seem to take multiple spaces into account. For example, a lot of people use two spaces between sentences; these space-counters would count an extra word per sentence. "Also, scenarios such as spaces around a hyphen - like that. "

For my purposes, this was far more accurate:

SELECT 
  LENGTH(REGEXP_REPLACE(myText, '[ \n\t\|\-]{1,}',' ')) - 
  LENGTH(REGEXP_REPLACE(myText, '[ \n\t\|\-]{1,}', '')) wordCount FROM myTable;

It counts any sets of 1 or more consecutive characters from any of: [space, linefeed, tab, pipe, or hyphen] and counts it as one word.

Porch answered 3/9, 2022 at 11:20 Comment(0)
M
0

This can work:

SUM(LENGTH(a) - LENGTH(REPLACE(a, ' ', '')) + 1)

Where a is the string column. It will count the number of spaces, which is 1 less than the number of words.

Maeganmaelstrom answered 7/1, 2015 at 19:47 Comment(3)
won't handle multiple spaces between "words"Palacio
Microsoft's server doesn't do Length(),Workmanship
You shouldn't have SUM in there.Midday
C
0

I like @ashleydawg's answer with REGEX but stackoverflow won't let me comment or like it due to "reputation points".. so annoying!

Only thing to change there -- add 1 to the result. In this example, it returns 3 instead of 4.

set @mystring = "This is four words";
select LENGTH(REGEXP_REPLACE(@mystring, '[ \n\t\|\-]{1,}',' ')) - 
  LENGTH(REGEXP_REPLACE(@mystring, '[ \n\t\|\-]{1,}', '')) wordCount
Catholicon answered 5/6, 2024 at 21:25 Comment(0)
F
-1

To handle multiple spaces too, use the method shown here

Declare @s varchar(100)
set @s='  See      how many                        words this      has  '
set @s=ltrim(rtrim(@s))

while charindex('  ',@s)>0
Begin
    set @s=replace(@s,'  ',' ')
end

select len(@s)-len(replace(@s,' ',''))+1 as word_count

https://exploresql.com/2018/07/31/how-to-count-number-of-words-in-a-sentence/

Fourpenny answered 30/9, 2018 at 17:27 Comment(1)
This is the only solution I've seen that allows for an arbitrary number of spaces, but not only does it use string manipulation (which is expensive) but also a loop, (which is unknowably expensive). However, using this together with @Evan de la Cruz's answer would get the best of all worlds IMHO.Clipper

© 2022 - 2025 — McMap. All rights reserved.