How to treat numbers inside text strings when vectorizing words?
T

3

10

If I have a text string to be vectorized, how should I handle numbers inside it? Or if I feed a Neural Network with numbers and words, how can I keep the numbers as numbers?

I am planning on making a dictionary of all my words (as suggested here). In this case all strings will become arrays of numbers. How should I handle characters that are numbers? How can I output a vector that does not mix up word indices with digit characters?

Does converting numbers to strings weaken the information I feed the network?

Toothed answered 1/7, 2017 at 22:16 Comment(1)
In many applications, words that don't exist in the dictionary are converted to <unknown>. In the same way, depending on your application, it could be convenient to convert all the numbers to a special token, like <number>. – Selfpollination
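A minimal sketch of the special-token idea from the comment above. The token spellings `<number>` and `<unknown>` come from the comment; the helper name `normalize`, the regex, and the toy vocabulary are assumptions made for illustration only.

```python
import re

NUMBER_TOKEN = "<number>"
UNKNOWN_TOKEN = "<unknown>"

# Assumption: a "number" is an ASCII integer or simple decimal, e.g. "3" or "3.14".
NUMBER_RE = re.compile(r"[0-9]+(\.[0-9]+)?")


def normalize(tokens, vocabulary):
    """Map numeric tokens to <number> and out-of-vocabulary tokens to <unknown>."""
    out = []
    for tok in tokens:
        if NUMBER_RE.fullmatch(tok):
            out.append(NUMBER_TOKEN)
        elif tok in vocabulary:
            out.append(tok)
        else:
            out.append(UNKNOWN_TOKEN)
    return out


vocabulary = {"my", "car", "number"}  # hypothetical toy vocabulary
print(normalize("my car number 3".split(" "), vocabulary))
print(normalize("my bike number 42".split(" "), vocabulary))
```

With this scheme every number shares one dictionary entry, so the network sees "a number occurred here" but not which one. Whether that loss matters depends on the task.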
X
6

Expanding on your discussion with @user1735003, let's consider both ways of representing numbers:

  1. Treating the number as a string, that is, as just another word, and assigning it an ID when building the dictionary.
  2. Converting the number to its word form: '1' becomes 'one', '2' becomes 'two', and so on.

Does the second option change the context in any way? To check, we can measure the similarity between the two representations using word2vec. The score will be high if they are used in similar contexts.

For example, '1' and 'one' have a similarity score of 0.17, and '2' and 'two' have a similarity score of 0.23. These low scores suggest that a digit and its spelled-out form are used in quite different contexts.

By treating numbers as just another word, you leave the context unchanged. With any other transformation of those numbers, you can't guarantee the change is for the better. So it's better to leave them untouched and treat each one as another word.

Note: Both word2vec and GloVe were trained by treating numbers as strings (case 1).

Xyloid answered 14/7, 2017 at 21:22 Comment(0)
W
2

The link you provide suggests that everything resulting from a .split(' ') is indexed: words, but also numbers, possibly smileys, and so on. (I would still take care of punctuation marks.) Unless you have more prior knowledge about your data or your problem, you could start with that.

EDIT

Example literally using your string and their code:

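corpus = {'my car number 3'}
dictionary = {}
i = 1
for tweet in corpus:
  for word in tweet.split(" "):
    # Assign the next free index only to words not seen before,
    # so repeated words do not leave gaps in the numbering.
    if word not in dictionary:
      dictionary[word] = i
      i += 1
print(dictionary)
# {'my': 1, 'car': 2, 'number': 3, '3': 4}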
Wizen answered 2/7, 2017 at 7:56 Comment(5)
But imagine I have the word "car", which gets index 3 in my dictionary. If I also have the number 3 in the text (in a phrase like "The car number 3"), it makes no sense to feed the network a vector with falsely repeated numbers like [12, 3, 11, 3], which could be interpreted as "The car number car". – Toothed
You don't have the number 3, you have the string "3", which may be indexed by any number. – Wizen
But that removes the type from the input; being a number is information in itself. – Toothed
Not really. "3" is a string. What it means depends on the context. It could be a number, maybe. So is "three", a good old alphabetical string, mind you. It could be some sort of ID, in which case it should probably not be considered a number. The idea of ML is to let the computer grasp the meaning itself from the context, without resorting to hand-made rules. – Wizen
More pragmatically, if you reserve id n to represent the string "n" for all numbers n, you have no ids left for other strings. – Wizen
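To make the point of this comment thread concrete, here is a small sketch showing that a token's dictionary id is unrelated to its numeric value. The helpers `build_vocab` and `encode` are hypothetical names, and the parallel `is_number` / `values` channels are one possible way (an assumption, not something proposed in the thread) to keep the numeric type and value available next to the ids.

```python
import re

# Assumption: only ASCII integers and simple decimals count as numbers.
NUMBER_RE = re.compile(r"[0-9]+(\.[0-9]+)?")


def build_vocab(sentences):
    """Give each distinct token the next free id, starting at 1."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split(" "):
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(sentence, vocab):
    """Return token ids plus two parallel channels: a numeric flag and the value."""
    ids, is_number, values = [], [], []
    for token in sentence.split(" "):
        ids.append(vocab[token])
        numeric = NUMBER_RE.fullmatch(token) is not None
        is_number.append(1 if numeric else 0)
        values.append(float(token) if numeric else 0.0)
    return ids, is_number, values


vocab = build_vocab(["the new car number 3"])
print(vocab["car"], vocab["3"])  # "car" happens to get id 3; the string "3" gets a different id
print(encode("the car number 3", vocab))
```

The ids only index rows of an embedding table, so "car" having id 3 never collides with the token "3". The extra channels are optional and only worth feeding to the network if the magnitude of the number matters for the task.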
M
1

The following paper can be helpful: http://people.csail.mit.edu/mcollins/6864/slides/bikel.pdf

Specifically, page 7.

Before falling back to an <unknown> tag, they try to replace alphanumeric symbol combinations with tags named after common patterns, such as:

FourDigits (good for years)

I've tried implementing it, and it gave great results.
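A rough sketch of what such pattern tags could look like in practice. Only the tag name FourDigits appears above; the other tag names, the regexes, and the helper `tag_token` are hypothetical and are not taken from the slides.

```python
import re

# Ordered from most to least specific; the first matching pattern wins.
# Only "FourDigits" is named in the answer; the other names are made up for illustration.
PATTERNS = [
    ("FourDigits", re.compile(r"[0-9]{4}")),
    ("TwoDigits", re.compile(r"[0-9]{2}")),
    ("OtherNumber", re.compile(r"[0-9]+([.,][0-9]+)*")),
    ("DigitsAndLetters", re.compile(r"(?=.*[0-9])(?=.*[A-Za-z])[0-9A-Za-z-]+")),
]


def tag_token(token, vocabulary):
    """Keep known words; otherwise try the pattern tags before using <unknown>."""
    if token in vocabulary:
        return token
    for name, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return "<" + name + ">"
    return "<unknown>"


vocabulary = {"my", "car", "number"}  # hypothetical toy vocabulary
for tok in ["car", "2017", "17", "1,000.50", "A8956-67", "zebra"]:
    print(tok, "->", tag_token(tok, vocabulary))
```

Compared with a single number token, this keeps a little more of the shape of the original string (a four-digit token is often a year) while still keeping the dictionary small.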

Magill answered 15/7, 2017 at 9:7 Comment(0)
