I have this text:
$text = "Başka, küskün otomobil kaçtı buraya küskün otomobil neden kaçtı
kaçtı buraya, oraya KISMEN @here #there J.J.Johanson hep.
Danny:Where is mom? I don't know! Café est weiß for 2 €uros.
My 2nd nickname is mike18.";
Until recently I was using this:
$a1= array_count_values(str_word_count($text, 1, 'ÇçÖöŞşİIıĞğÜü@#é߀1234567890'));
arsort($a1);
You can check with this fiddle:
http://ideone.com/oVUGYa
But this solution doesn't solve all UTF-8 problems: I can't pass the whole UTF-8 character set to str_word_count as its character-list parameter.
So I created this:
$wordsArray = explode(" ",$text);
foreach ($wordsArray as $k => $w) {
    $wordsArray[$k] = str_replace(array(",", "."), "", $w);
}
$wordsArray2 = array_count_values($wordsArray);
arsort($wordsArray2);
Output should be like this:
Array (
[kaçtı] => 3
[küskün] => 2
[buraya] => 2
[@here] => 1
[#there] => 1
[Danny] => 1
[mom] => 1
[don't] => 1
[know] => 1
...
...
)
This works well, but it doesn't cover all sentence and word problems. For example, I removed commas and dots with str_replace.
That approach doesn't handle input like this: Hello Mike,how are you ?
Mike and how won't be treated as separate words.
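To make the failure concrete, here is a minimal reproduction of the explode approach on that sentence. The variable names are only for illustration.

```php
<?php
// Same steps as the explode()/str_replace() attempt, on one sentence.
$sentence = "Hello Mike,how are you ?";

$tokens = explode(" ", $sentence);
foreach ($tokens as $k => $w) {
    // Stripping "," and "." glues the two neighbours together.
    $tokens[$k] = str_replace(array(",", "."), "", $w);
}

// "Mike,how" contains no space, so it is never split and ends up as
// the single token "Mikehow"; the detached "?" is counted as a word.
print_r($tokens);
```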
This isn't covered by the str_word_count solution: KISMEN @here #there. The at sign and hash sign won't be taken into consideration.
This will not be covered either: J.J.Johanson. Although it is a single word, it will be treated as JJJohanson.
Question marks and exclamation marks should be removed from words.
Is there a better way to get str_word_count behaviour with UTF-8 support? The $text at the top of this question is my reference input.
(It would be better if you can provide a fiddle with your answer)
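One possible direction, offered only as a sketch: match words with a Unicode-aware regular expression instead of listing characters by hand. The function name countWordsUtf8 is hypothetical, and the pattern is an assumption about what should count as a word: an optional @ or # prefix, then letters, marks, and digits from any script, with an inner apostrophe or dot allowed only when more word characters follow. It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not a PHP built-in): counts words in a UTF-8 string.
function countWordsUtf8(string $text): array
{
    // [@#]?                optional mention/hashtag prefix
    // [\p{L}\p{M}\p{N}]+   letters, combining marks, digits of any script
    // (?:['.] ... )*       inner apostrophe or dot, kept only when followed
    //                      by more word characters (don't, J.J.Johanson)
    $pattern = '/[@#]?[\p{L}\p{M}\p{N}]+(?:[\'.][\p{L}\p{M}\p{N}]+)*/u';

    // preg_match_all() returns false on failure, e.g. invalid UTF-8 input.
    if (preg_match_all($pattern, $text, $matches) === false) {
        return [];
    }

    $counts = array_count_values($matches[0]);
    arsort($counts); // most frequent first
    return $counts;
}

$text = "Başka, küskün otomobil kaçtı buraya küskün otomobil neden kaçtı
kaçtı buraya, oraya KISMEN @here #there J.J.Johanson hep.
Danny:Where is mom? I don't know! Café est weiß for 2 €uros.
My 2nd nickname is mike18.";

$counts = countWordsUtf8($text);
print_r($counts);
```

With this pattern, trailing punctuation (commas, final dots, ? and !) is dropped, and Danny:Where splits into two words. Two trade-offs follow from the assumptions: a sentence break written without a space (for example oraya.Sonra) would be kept as one word, and the currency sign is not matched, so €uros is counted as uros unless \p{Sc} is added to the prefix class.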
here & there instead of @here & #there, would this be acceptable? – Feverwort
@here & #there. Because mostly we analyze tweets. – Myriam