It might make sense to include all letters of the alphabet. For example, if you're interested in calculating the cosine difference between word distributions you typically require all letters.
You can use this method:
from collections import Counter
def character_distribution_of_string(pass_string):
letters = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
chars_in_string = Counter(pass_string)
res = {}
for letter in letters:
if(letter in chars_in_string):
res[letter] = chars_in_string[letter]
else:
res[letter] = 0
return(res)
Usage:
character_distribution_of_string("This is a string that I want to know about")
Full Character Distribution
{'a': 4,
'b': 1,
'c': 0,
'd': 0,
'e': 0,
'f': 0,
'g': 1,
'h': 2,
'i': 3,
'j': 0,
'k': 1,
'l': 0,
'm': 0,
'n': 3,
'o': 3,
'p': 0,
'q': 0,
'r': 1,
's': 3,
't': 6,
'u': 1,
'v': 0,
'w': 2,
'x': 0,
'y': 0,
'z': 0}
You can extract the character vector easily:
list(character_distribution_of_string("This is a string that I want to know about").values())
giving...
[4, 1, 0, 0, 0, 0, 1, 2, 3, 0, 1, 0, 0, 3, 3, 0, 0, 1, 3, 6, 1, 0, 2, 0, 0, 0]
collections.Counter
. – Kootenay