Building a Transition Matrix using words in Python/Numpy
M

6

6

I'm trying to build a 3x3 transition matrix with this data:

days=['rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds', 
  'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun', 
  'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
  'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain', 
  'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain', 
  'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun', 
  'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun', 
  'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 
  'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun', 
  'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain', 
  'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain', 
  'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
  'sun', 'sun', 'rain']

Currently, I'm doing it with some temporary dictionaries and lists that calculate the probability of each weather type separately. It's not a pretty solution. Can someone please guide me to a more reasonable solution to this problem?

self.transitionMatrix=np.zeros((3,3))

#the columns are today
sun_total_count = 0
temp_dict={'sun':0, 'clouds':0, 'rain':0}
total_runs = 0
for (x, y), c in Counter(zip(data, data[1:])).items():
    #if column 0 is sun
    if x == 'sun':  # compare strings with ==, not `is`
        #find the sum of all the numbers in this column
        sun_total_count +=  c
        total_runs += 1
        if y == 'sun':
            temp_dict['sun'] = c
        if y == 'clouds':
            temp_dict['clouds'] = c
        if y == 'rain':
            temp_dict['rain'] = c

        if total_runs == 3:
            self.transitionMatrix[0][0] = temp_dict['sun']/sun_total_count
            self.transitionMatrix[1][0] = temp_dict['clouds']/sun_total_count
            self.transitionMatrix[2][0] = temp_dict['rain']/sun_total_count

return self.transitionMatrix

For every type of weather, I need to calculate the probability for the next day.

Mayday answered 15/11, 2017 at 0:36 Comment(3)
Does your solution work? - Derek
@Derek Yeah, it works. But as you can see, it only calculates the first column; now I'll have to make two new dicts for the second and third columns, then go through a whole bunch of if statements for them. It's going to get a lot messier :( I was wondering if there is a more elegant method. - Mayday
Put the dict construction code in a function, then iterate over the columns, passing the relevant data to that function. - Derek
H
6

I like a combination of pandas and itertools for this. The code block is a bit longer than the above, but don't conflate verbosity with speed. (The window func should be very fast; the pandas portion will be slower admittedly.)

First, make a "window" function. Here's one from the itertools cookbook. This gets you to a list of tuples of transitions (state1 to state2).

from itertools import islice

def window(seq, n=2):
    """Sliding window width n from seq.  From old itertools recipes."""
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

# list(window(days))
# [('rain', 'rain'),
#  ('rain', 'rain'),
#  ('rain', 'clouds'),
#  ('clouds', 'rain'),
#  ('rain', 'sun'),
# ...

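Then use a pandas groupby + value_counts operation to count the transitions from each state1 to each state2. Note that dividing by counts.sum() gives each pair's share of all transitions (joint probabilities), not probabilities conditional on state1: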

import pandas as pd

pairs = pd.DataFrame(window(days), columns=['state1', 'state2'])
counts = pairs.groupby('state1')['state2'].value_counts()
probs = (counts / counts.sum()).unstack()

Your result looks like this:

print(probs)
state2  clouds  rain   sun
state1                    
clouds    0.13  0.09  0.10
rain      0.06  0.11  0.09
sun       0.13  0.06  0.23
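
The table above holds joint frequencies: all nine cells together sum to 1. If the conditional probability of state2 given state1 is wanted instead, one option is to divide each row of the unstacked counts by its own total. A minimal sketch, assuming a short made-up sequence seq in place of days and using zip instead of window to stay self-contained:

```python
import pandas as pd

# Hypothetical toy sequence standing in for `days`
seq = ['a', 'b', 'a', 'a', 'b', 'b']

# Consecutive pairs (state1 -> state2)
pairs = pd.DataFrame(list(zip(seq, seq[1:])), columns=['state1', 'state2'])

# Raw transition counts as a state1 x state2 table; missing pairs become 0
counts = pairs.groupby('state1')['state2'].value_counts().unstack(fill_value=0)

# Divide each row by its total so that every row sums to 1
cond = counts.div(counts.sum(axis=1), axis=0)
print(cond)
```

Each row of cond then reads as "given state1 today, how likely is each state2 tomorrow".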
Hassan answered 15/11, 2017 at 1:58 Comment(1)
How fast is this compared to https://mcmap.net/q/545666/-generating-markov-transition-matrix-in-python? - Hoxsie
C
20

If you don't mind using pandas, there's a one-liner for extracting the transition probabilities:

import pandas as pd

pd.crosstab(pd.Series(days[1:],name='Tomorrow'),
            pd.Series(days[:-1],name='Today'),normalize=1)

Output:

Today      clouds      rain       sun
Tomorrow                             
clouds    0.40625  0.230769  0.309524
rain      0.28125  0.423077  0.142857
sun       0.31250  0.346154  0.547619

Here the (forward) probability that tomorrow will be sunny given that today it rained is found at the column 'rain', row 'sun'. If you would like to have backward probabilities (what might have been the weather yesterday given the weather today), switch the first two parameters.

If you would like the probabilities stored in rows rather than columns, set normalize=0. Note, however, that doing so directly on this example yields backward probabilities stored as rows. To obtain the same result as above but transposed, you can either a) transpose the result or b) switch the order of the first two parameters and set normalize to 0.

If you just want to keep the results as a numpy 2-d array (and not as a pandas dataframe), append .values after the last parenthesis.
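
For reference, option b) combined with the conversion to a numpy array could look like the sketch below. It swaps the first two arguments and uses normalize='index', the string form of normalize=0. The toy list seq is made up for illustration, and .to_numpy() is used here as the newer equivalent of .values.

```python
import pandas as pd

seq = ['a', 'b', 'a', 'a', 'b', 'b']  # hypothetical stand-in for `days`

today = pd.Series(seq[:-1], name='Today')
tomorrow = pd.Series(seq[1:], name='Tomorrow')

# Rows are today's state, columns are tomorrow's state; each row sums to 1
forward = pd.crosstab(today, tomorrow, normalize='index')

# Plain 2-d numpy array without the pandas labels
matrix = forward.to_numpy()
print(forward)
```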

Cutshall answered 10/6, 2018 at 13:33 Comment(2)
What is the purpose of using days[1:] and days[:-1] when accessing the data, instead of just passing days? - Reflectance
@Will-i-am, if you try as you say, you will get a three-by-three identity matrix. The purpose of accessing the data that way is to create two series, where one contains at position t the entry for day t while the other contains the entry for day t+1. - Cutshall
B
3

Here is a "pure" numpy solution. It creates 3x3 tables where the zeroth dim (row number) corresponds to today and the last dim (column number) corresponds to tomorrow.

The conversion from words to indices is done by truncating after the first letter and then using a lookup table.

For counting, numpy.add.at is used.

This was written with efficiency in mind. It does a million words in less than a second.

import numpy as np

report = [
  'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds', 
  'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun', 
  'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
  'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain', 
  'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain', 
  'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun', 
  'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun', 
  'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 
  'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun', 
  'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain', 
  'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain', 
  'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
  'sun', 'sun', 'rain']

# create np array, keep only first letter (by forcing dtype)
# obviously, this only works because rain, sun, clouds start with different
# letters
# cast to int type so we can use for indexing
ri = np.array(report, dtype='|S1').view(np.uint8)
# create lookup
c, r, s = 99, 114, 115 # you can verify this using chr and ord
lookup = np.empty((s+1,), dtype=int)
lookup[[c, r, s]] = np.arange(3)
# translate c, r, s to 0, 1, 2
rc = lookup[ri]
# get counts (of pairs (today, tomorrow))
cnts = np.zeros((3, 3), dtype=int)
np.add.at(cnts, (rc[:-1], rc[1:]), 1)
# or as probs
probs = cnts / cnts.sum()
# or as conditional probs (if today is sun how probable is rain tomorrow etc.)
cond = cnts / cnts.sum(axis=-1, keepdims=True)

print(cnts)
print(probs)
print(cond)

# [[13  9 10]
#  [ 6 11  9]
#  [13  6 23]]
# [[ 0.13  0.09  0.1 ]
#  [ 0.06  0.11  0.09]
#  [ 0.13  0.06  0.23]]
# [[ 0.40625     0.28125     0.3125    ]
#  [ 0.23076923  0.42307692  0.34615385]
#  [ 0.30952381  0.14285714  0.54761905]]
Backfill answered 15/11, 2017 at 2:37 Comment(4)
This is fast. Can you elaborate on why you mapped from the original strings to the chr of their first letter? - Hassan
Also: you can map to ints with np.unique(report, return_inverse=True). - Hassan
@BradSolomon 1. Truncating makes the array constructor much faster. I think if you don't force the dtype then numpy has to do two passes, one just to find the longest string, so it knows how large to make the dtype. 2. The one-letter ids can be used as indices into a lookup table; since we can very cheaply (view casting in numpy is essentially free) interpret these characters as numbers which are not very large, the lookup can be done by pointer arithmetic. As an added benefit, this kind of lookup is precisely what numpy fancy indexing does, so it's a one-liner that loops at C speed. HTH - Backfill
@BradSolomon Since np.unique is more general than what we are doing here, I would expect it to be slower. For example, I don't think it can use the lookup trick I just described. - Backfill
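
For state names that do not start with distinct letters, the np.unique suggestion from the comments can be combined with the same np.add.at counting step. This is a sketch of that more general (and, per the comment above, probably slower) variant; the function name transition_counts and the toy input are made up for illustration, and it assumes every state occurs at least once as a "today" so that no row total is zero.

```python
import numpy as np

def transition_counts(seq):
    """Return (states, counts), where counts[i, j] is the number of
    times states[i] is immediately followed by states[j]."""
    # states is sorted; codes maps each element of seq to its index in states
    states, codes = np.unique(seq, return_inverse=True)
    n = len(states)
    counts = np.zeros((n, n), dtype=int)
    # Unbuffered add, so repeated (today, tomorrow) pairs are all counted
    np.add.at(counts, (codes[:-1], codes[1:]), 1)
    return states, counts

states, cnts = transition_counts(['a', 'b', 'a', 'a', 'b', 'b'])
# Conditional probabilities: normalize each row (today) to sum to 1
cond = cnts / cnts.sum(axis=-1, keepdims=True)
print(states)
print(cnts)
print(cond)
```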
P
1
  1. Convert the reports from the days into index codes.
  2. Iterate through the array, grabbing the codes for yesterday's weather and today's.
  3. Use those indices to tally the combination in your 3x3 matrix.

Here's the coding set-up to get you started.

report = [
  'rain', 'rain', 'rain', 'clouds', 'rain', 'sun', 'clouds', 'clouds', 
  'rain', 'sun', 'rain', 'rain', 'clouds', 'clouds', 'sun', 'sun', 
  'clouds', 'clouds', 'rain', 'clouds', 'sun', 'rain', 'rain', 'sun',
  'sun', 'clouds', 'clouds', 'rain', 'rain', 'sun', 'sun', 'rain', 
  'rain', 'sun', 'clouds', 'clouds', 'sun', 'sun', 'clouds', 'rain', 
  'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 'sun', 
  'clouds', 'clouds', 'sun', 'clouds', 'rain', 'sun', 'sun', 'sun', 
  'clouds', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'clouds', 
  'rain', 'clouds', 'clouds', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun', 
  'clouds', 'clouds', 'clouds', 'clouds', 'clouds', 'sun', 'rain', 
  'rain', 'rain', 'clouds', 'sun', 'clouds', 'clouds', 'clouds', 'rain', 
  'clouds', 'rain', 'sun', 'sun', 'clouds', 'sun', 'sun', 'sun', 'sun',
  'sun', 'sun', 'rain']

weather_dict = {"sun":0, "clouds":1, "rain": 2}
weather_code = [weather_dict[day] for day in report]
print(weather_code)

for n in range(1, len(weather_code)):
    yesterday_code = weather_code[n-1]
    today_code     = weather_code[n]

# You now have the indices you need for your 3x3 matrix.
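
One way the remaining tally step might be filled in is sketched below. The function name tally_transitions and the short report in the usage lines are assumptions made for illustration; the state-to-index mapping is the one from the set-up above.

```python
import numpy as np

weather_dict = {"sun": 0, "clouds": 1, "rain": 2}

def tally_transitions(report, codes=weather_dict):
    """Count transitions: matrix[i, j] is how often code i (yesterday)
    is followed by code j (today)."""
    weather_code = [codes[day] for day in report]
    matrix = np.zeros((len(codes), len(codes)), dtype=int)
    for n in range(1, len(weather_code)):
        yesterday_code = weather_code[n - 1]
        today_code = weather_code[n]
        matrix[yesterday_code, today_code] += 1
    return matrix

# Hypothetical short report, just to show the call
small = ['sun', 'rain', 'rain', 'clouds', 'sun']
counts = tally_transitions(small)
print(counts)
```

Dividing each row of counts by its row sum would then turn the tallies into probabilities, provided no row is all zeros.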
Phonsa answered 15/11, 2017 at 0:50 Comment(0)
R
0

It seems you want to create a matrix of the probability of rain coming after sun, clouds coming after sun, and so on. You can spit out the probability matrix (not a math term) like so:

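from collections import Counter

def probabilityMatrix():
    # `data` is the list of weather strings from the question
    # count only days that have a successor, so each state's probabilities sum to 1
    occurancesOfEach = Counter(data[:-1])
    myMatrix = Counter(zip(data, data[1:]))
    probabilityMatrix = {key : myMatrix[key] / occurancesOfEach[key[0]] for key in myMatrix}
    return probabilityMatrix

print(probabilityMatrix())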

However, you probably want to spit out the probability for every type of weather based on today's weather:

def getTomorrowsProbability(weather):
    probMatrix = probabilityMatrix()
    return {key[1] : probMatrix[key]  for key in probMatrix if key[0] == weather}

print(getTomorrowsProbability('sun'))
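
The same idea can be arranged as a nested dictionary so that every starting state is covered in one pass. This is only a sketch: the name transition_table and the toy sequence are made up, and the sequence is passed in as an argument rather than read from a global data list.

```python
from collections import Counter, defaultdict

def transition_table(seq):
    """Map each state to a dict {next_state: probability}."""
    pair_counts = Counter(zip(seq, seq[1:]))
    # Only states that have a successor are counted as a "today"
    from_counts = Counter(seq[:-1])
    table = defaultdict(dict)
    for (today, tomorrow), c in pair_counts.items():
        table[today][tomorrow] = c / from_counts[today]
    return dict(table)

table = transition_table(['a', 'b', 'a', 'a', 'b', 'b'])
print(table['a'])
```

Here table['sun'] on the weather data would play the role of getTomorrowsProbability('sun').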
Randolphrandom answered 15/11, 2017 at 0:56 Comment(0)
B
0

Below is another alternative using pandas. The transitions list can be replaced with 'rain', 'clouds', etc.

import pandas as pd
transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D'] * 2
df = pd.DataFrame(columns = ['state', 'next_state'])
for i, val in enumerate(transitions[:-1]): # We don't care about last state
    df_stg = pd.DataFrame(index=[0])
    df_stg['state'], df_stg['next_state'] = transitions[i], transitions[i+1]
    df = pd.concat([df, df_stg], axis = 0)
cross_tab = pd.crosstab(df['state'], df['next_state'])
cross_tab.div(cross_tab.sum(axis=1), axis=0)
Beauchamp answered 20/4, 2018 at 11:23 Comment(1)
This is horribly inefficient due to the pd.concat() line, especially if you are trying to read a corpus line by line. @HerrIvan's answer was much much faster. - Dairymaid
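
The repeated pd.concat() can be avoided by building the two columns in one step from shifted slices of the list. A sketch of that variant, keeping the same transitions list and crosstab normalization as the answer above (the name probs is introduced here for illustration):

```python
import pandas as pd

transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D'] * 2

# Pair each state with its successor in a single DataFrame construction
df = pd.DataFrame({'state': transitions[:-1],
                   'next_state': transitions[1:]})

cross_tab = pd.crosstab(df['state'], df['next_state'])
# Normalize each row so it sums to 1
probs = cross_tab.div(cross_tab.sum(axis=1), axis=0)
print(probs)
```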
