Processing Emojis in SQLite
Asked Answered
M

3

10

I am hoping to identify which emojis are used most in a text conversation using SQL Lite. I am using DB Browser and the emojis show up like they do in iMessage (see below picture), but I am stumped on how to count them.

I was thinking if there was a way to check and see if a character is not a letter/number/punctuation, then I could count the frequency of all characters that don't fit the prerequisite list. That said, I am unfamiliar with SQLite commands and how I can accomplish that.

Is there a better way to go about this? Let me know if you need more context to answer this question.

Emoji Example

Maher answered 21/4, 2018 at 16:32 Comment(3)
Related: Detecting all emojis – Szombathely
Do you have to do this once, or on an ongoing basis? – Cyn
@Cyn not 100% sure what you mean, but just once. I want to run the analysis once on a single file. – Maher
C
0

The only way I can see to do this with SQLite directly would be to compile SQLite from the source code so you could add support for regex_replace.

However, you only plan to do it once, and recompiling SQLite might be a bit overkill.

Instead, you could copy your text column into a plain text file, and run the following command:

sed 's/\(.\)/\1\n/g' temp.txt | sed 's/[[:alnum:].-]//g' | sort -r | uniq -c

This would turn the following:

Hello! Are you stuck? πŸ€” I saw 🐻🐻🐻 in the park!!!!! πŸŽ‚πŸŽ‚πŸŽ‚πŸŽ‚πŸŽ‚πŸŽ‚ - all lies. Easy as 123! 😎😎😎😎😎😎😎😎😎😎😎

into:

  1 πŸ€”
 11 😎
  3 🐻
  6 πŸŽ‚
  1 ?
  7 !
 17
 50

Which would hopefully be close enough to get you to your goal. The last two entries are for tabs and spaces.

sed is a linux command, so if you are running windows you may want to get a windows version here: https://github.com/mbuilov/sed-windows

Cyn answered 24/4, 2018 at 13:37 Comment(0)
B
0

Build emojis table, with contains emojis strings in text fields, then try this:

SELECT emojis.text, sum(length(replace(message.text, emojis.text, emojis.text || '_')) - length( message.text))
FROM message
INNER JOIN emojis ON message.text LIKE '%' || emojis.text || '%'
GROUP BY emojis.text

You can use this method to count occurrences of any text in messages...

Brittanybritte answered 26/4, 2018 at 1:15 Comment(2)
Here is a sqlfiddle validating the answer: sqlfiddle.com/#!5/eccd6/2 – Brittanybritte
I think the problem here is that "text" field will also have normal text, so I will need to differentiate somehow. – Maher
S
0

Possibly not the most efficient example but you could use the group_concat function to merge the text into a single string then use a recursive cte to turn that into a single table of chars. You can filter out all the 'normal' acii chars using the hex function greater than 7F( see asii table) you might need to play around with that depending on what your emojis are and whether you have loads of non latin unicode in there. Then it is a simple case of a count if you just want the total or a group by with a count if you want to know what the individual emoji count is.

WITH RECURSIVE theChars (
    position
    ,charValue
    )
AS (
    SELECT 2
        ,substr((
                SELECT group_concat(TEXT, '')
                FROM message
                ), 1, 1)

    UNION ALL

    SELECT position + 1
        ,substr((SELECT group_concat(TEXT, '') FROM message)
            ,position
            ,1)
    FROM theChars LIMIT length((SELECT group_concat(TEXT, '') FROM message))
    )
SELECT charValue,count(*)
FROM theChars
WHERE hex(charValue) > '7F'
group by charValue;
Sannyasi answered 27/11, 2019 at 1:42 Comment(0)

© 2022 - 2024 β€” McMap. All rights reserved.