Machine Learning: normalize target var based on the impact of independent var

I have a data set of driver trip information, shown below. My objective is to come up with an adjusted mileage that takes into account the load a driver is carrying and the vehicle he/she is driving. We found a negative correlation between mileage and load: the more load you carry, the less mileage you are likely to get. The type of vehicle might affect performance as well. In effect, we are trying to normalize the mileage so that a driver who is given a heavy load, and gets lower mileage because of it, is not penalized on mileage. So far I have used linear regression and correlation to look at the relationship between mileage and load; the correlation was -0.6. The dependent variable is miles per gallon, and the independent variables are load and vehicle.

Drv Miles per Gal   Load(lbs)   Vehicle
A        7           1500   2016 Tundra
B        8           1300   2016 Tundra
C        8           1400   2016 Tundra
D        9           1200   2016 Tundra
E       10           1000   2016 Tundra
F        6           1500   2017 F150
G        6           1300   2017 F150
H        7           1400   2017 F150
I        9           1300   2017 F150
J       10           1100   2017 F150

The results might be like this.

Drv Result-New Mileage
A   7.8
B   8.1
C   8.3
D   8.9
E   9.1
F   8.3
G   7.8
H   8
I   8.5
J   9

So far I am a little unsure how I should use the slopes from the linear regression to normalize these scores. Any other feedback on the approach would be helpful.

Our ultimate goal is to rank the drivers based on miles per gallon, taking into account the effects of load and vehicle.

Thanks Jay

Salutation answered 22/12, 2017 at 13:55 Comment(4)
What is your end goal? If you just want to take into account the impact of load on miles per gallon, why not use miles per gallon per pound as your metric? – Valerievalerio
Hi Pault! Our end goal is to provide an adjusted miles per gallon which takes into account the impact of the load a driver is carrying. For example, if we use LR to predict MPG using load, we can compare the predicted value with the actual value. Basically, if a driver is carrying a huge load and gets low MPG because of that, we want to give them credit. Our ultimate goal is to rank the drivers based on MPG. – Salutation
It's still not clear what your end goal is. How will you evaluate your new adjusted MPG metric? How do you know if you've built a good model? First you need to define how you will measure success. Without that or any further context, it seems to me that using LR is overkill for this case. – Valerievalerio
The main objective is to improve MPG. This depends on a lot of factors, like driver behavior (speed, braking), routes (miles, traffic, weather), load and equipment. The routes for drivers are static, so we created clusters using miles, traffic and weather. Every cluster has a separate model. Driver stats are compared to each other within a cluster and are scored. From the data we found that MPG is negatively correlated with load, and also with a few old-model vehicles. So if a driver is carrying a huge load and driving an old vehicle, we want to give credit in terms of MPG. Did I answer your question? – Salutation

There could be many ways to "normalize scores", and the best one would be highly dependent on what exactly you're trying to achieve (which isn't clear from the question). However, having said that, I'd like to suggest a simple, practical approach.

Starting with the utopian case: say you had lots of data, all of it perfectly linear - i.e., showing a neat linear relation between load and MPG per vehicle type. In that case, you would have a strong prediction regarding the expected MPG per vehicle type, given some load. You could compare the actual MPG to the expected value, and "score" based on the ratio, e.g. actual MPG / expected MPG.
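
As a minimal sketch of the ratio score described above, assuming the question's table is loaded into a data frame `df` with columns named `Drv`, `Miles.per.Gal`, `Load` and `Vehicle` (the column names and the additive `Load + Vehicle` model are assumptions for illustration, not the asker's actual setup):

```r
# The question's sample data, typed in by hand (column names are assumed).
df <- data.frame(
  Drv = LETTERS[1:10],
  Miles.per.Gal = c(7, 8, 8, 9, 10, 6, 6, 7, 9, 10),
  Load = c(1500, 1300, 1400, 1200, 1000, 1500, 1300, 1400, 1300, 1100),
  Vehicle = rep(c("2016 Tundra", "2017 F150"), each = 5)
)

# Expected MPG given load and vehicle, from a plain linear model.
md <- lm(Miles.per.Gal ~ Load + Vehicle, data = df)
df$expected <- predict(md, newdata = df)

# Ratio score: above 1 means better MPG than the model expects
# for that load and vehicle, below 1 means worse.
df$ratio <- df$Miles.per.Gal / df$expected

# Drivers ordered from best to worst ratio.
ranked <- df[order(-df$ratio), c("Drv", "Miles.per.Gal", "expected", "ratio")]
print(ranked)
```

With only ten rows the fit is rough, so the resulting order should be read as an illustration of the mechanics rather than a meaningful ranking.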

Practically, however, data is never perfect. So you could build a model based on the available data and get a prediction, but instead of using a point estimate as the basis for scoring, you could use a confidence interval. For instance: the expected MPG given a model and some load is between 9 and 11 MPG with 95% confidence. In some cases (where more data is available, or it's more linear) the confidence interval may be narrow; in others, it'll be wider.

Then you could take an action (e.g. "punish" as you put it), say, only if MPG is out of the expected range.
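
The interval-based rule above could be sketched roughly as follows. The data frame, its column names and the `status` labels are assumptions made for this illustration; the 95% level is just the example figure used above.

```r
# The question's sample data, typed in by hand (column names are assumed).
df <- data.frame(
  Drv = LETTERS[1:10],
  Miles.per.Gal = c(7, 8, 8, 9, 10, 6, 6, 7, 9, 10),
  Load = c(1500, 1300, 1400, 1200, 1000, 1500, 1300, 1400, 1300, 1100),
  Vehicle = rep(c("2016 Tundra", "2017 F150"), each = 5)
)

md <- lm(Miles.per.Gal ~ Load + Vehicle, data = df)

# predict() returns a matrix with columns fit, lwr and upr
# when an interval is requested.
ci <- predict(md, newdata = df, interval = "confidence", level = 0.95)
df$fit <- ci[, "fit"]
df$lwr <- ci[, "lwr"]
df$upr <- ci[, "upr"]

# Act only on drivers whose actual MPG falls outside the expected range.
df$status <- ifelse(df$Miles.per.Gal < df$lwr, "below",
             ifelse(df$Miles.per.Gal > df$upr, "above", "within"))

print(df[, c("Drv", "Miles.per.Gal", "lwr", "upr", "status")])
```

Note that `interval = "confidence"` describes the uncertainty of the mean expected MPG; `interval = "prediction"` gives a wider band for an individual trip and may be the more forgiving choice when judging single drivers.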

EDIT: an illustration (code in R):

# df contains the data above.
library(ggplot2)

# fit a linear model (note that 'Vehicle' is categorical, not numerical)
md <- lm(data=df, Miles.per.Gal ~ Load + Vehicle)

# generate predictions with a confidence interval; for this illustration, plotting only for the '2016 Tundra'
newx <- seq(min(df$Load), max(df$Load), length.out=100)
preds_df <- as.data.frame(predict(md, newdata = data.frame(Load=newx, Vehicle="2016 Tundra"), interval="confidence"))
# keep the load values alongside fit / lwr / upr so they can be used as the x axis
preds_df$x <- newx

# plot
# fit + confidence
plt <- ggplot(data=preds_df) + geom_line(aes(x=x, y=fit)) + geom_ribbon(aes(x=x, ymin=lwr, ymax=upr), alpha=0.3)
# points for illustration
plt + geom_point(aes(x=1100, y=7.8), color="red", size=4) + geom_point(aes(x=1300, y=4), color="blue", size=4) + geom_point(aes(x=1400, y=9), color="green", size=4)

[Figure: fitted line with its confidence band, plus the red, blue and green illustration points]

So based on this data, the red driver's fuel consumption (7.8 MPG with 1100 load) is significantly worse than expected, the blue one (9 MPG with 1300 load) is within expected range, and the green driver (9 MPG with 1400 load) has better MPG than expected. Of course, depending on the amount of data you have and the goodness of fit, you could use more elaborate models, but the idea can remain the same.

EDIT 2: fixed the mixup between green and red (as higher MPG is better, not worse)

Also, regarding the question in the comments about "scoring" drivers: a reasonable scheme may be to either use a ratio vs. the predicted point, or - maybe even better - normalize the difference by the standard deviation (i.e. the difference from expected, in standard-deviation units). So in the example above, a driver 10% above the line with load 1250 will have a better score than a driver 10% above the line with load 1500, since the uncertainty at 1500 is larger (so 10% there is closer to the range of "expected").
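
One possible reading of "difference from expected in standard-deviation units" is sketched below. The choice of denominator (the standard error of the fitted mean combined with the residual standard deviation) is an assumption of this sketch, as are the data frame and column names; other definitions of the standard deviation would be equally defensible.

```r
# The question's sample data, typed in by hand (column names are assumed).
df <- data.frame(
  Drv = LETTERS[1:10],
  Miles.per.Gal = c(7, 8, 8, 9, 10, 6, 6, 7, 9, 10),
  Load = c(1500, 1300, 1400, 1200, 1000, 1500, 1300, 1400, 1300, 1100),
  Vehicle = rep(c("2016 Tundra", "2017 F150"), each = 5)
)

md <- lm(Miles.per.Gal ~ Load + Vehicle, data = df)

# se.fit = TRUE also returns the standard error of each fitted value
# and the residual standard deviation (residual.scale).
p <- predict(md, newdata = df, se.fit = TRUE)

# Assumed uncertainty for one observation at each load/vehicle combination:
# grows where the fitted line itself is less certain.
pred_sd <- sqrt(p$se.fit^2 + p$residual.scale^2)

# Score = (actual - expected) in standard-deviation units; higher is better.
df$z <- (df$Miles.per.Gal - p$fit) / pred_sd
df$rank <- rank(-df$z, ties.method = "min")

print(df[order(df$rank), c("Drv", "Miles.per.Gal", "Load", "Vehicle", "z", "rank")])
```

Because the denominator is larger where the model is less certain, the same raw gap above the line earns a smaller score at loads where the data is sparse, which matches the intent described above.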

Asteroid answered 27/12, 2017 at 22:23 Comment(7)
Thank you, Etov! That's the approach we have taken so far. We have been using LR to predict the MPG using load. I posted this question to make sure this approach is right, or to find out whether there is a better way to do it. In our case we have another variable (Vehicle), which is categorical. I have provided the data above. How can we normalize the scores and penalize the drivers who are driving a better vehicle? Should we use Naive Bayes to see the relationship between MPG and vehicles? – Salutation
LR can handle categorical variables - akin to fitting a different intercept (and, with an interaction term, a different slope) for each vehicle type. Anyway, the question is: "a better way" in what respect? What are you after? What in the LR approach seems suboptimal with regard to your goal? – Asteroid
The main objective is to improve MPG. This depends on a lot of factors, like driver behavior (speed, braking), routes (miles, traffic, weather), load and equipment. The routes for drivers are static, so we created clusters using miles, traffic and weather. Every cluster has a separate model. Driver stats are compared to each other within a cluster and are scored. From the data we found that MPG is negatively correlated with load, and also with a few old-model vehicles. So if a driver is carrying a huge load and driving an old vehicle, we want to give credit in terms of MPG. – Salutation
Thanks, this does clarify your goal; to me, the approach you've taken makes sense in this context. – Asteroid
Hi Etov! I think in your example the MPG for the red driver should be better than expected even though it is out of range, because it's carrying less weight than blue and green and giving a better MPG? But based on the regression line, the predicted values will be lower for red than for blue and green. So I think a ratio or difference would be a better metric, because we need to give credit to red for giving a better MPG, right? So the difference between the actual and predicted values can be considered their score, and then we can normalize that to see the rankings. – Salutation
If we use the predicted values only, then the higher the weight, the lower the values will be, and we are punishing the drivers rather than giving them credit. If we only punish the drivers whose MPG is out of limits, then we will not be able to rank them properly. What do you think? – Salutation
Added an edit to the answer; I think that even if you use just the point prediction, the uncertainty should be taken into account, e.g. by using it for normalization. – Asteroid

The term you are looking for is decorrelation. You are trying to decorrelate MPG and load. One approach is to train a linear model, as you have done, and subtract the predictions of this model from the original MPG values, thus removing the impact of load (according to the linear model). The Wikipedia article lists this as "Linear predictive coders". If you want to get fancy, you can try the same idea with more complex models if you think MPG and load don't actually have a linear relation.
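
A rough sketch of that subtraction, assuming the question's table is in a data frame `df` with the column names used below. Including `Vehicle` in the model and adding the overall mean back (so the result stays on an MPG-like scale) are choices made for this illustration, not part of the description above.

```r
# The question's sample data, typed in by hand (column names are assumed).
df <- data.frame(
  Drv = LETTERS[1:10],
  Miles.per.Gal = c(7, 8, 8, 9, 10, 6, 6, 7, 9, 10),
  Load = c(1500, 1300, 1400, 1200, 1000, 1500, 1300, 1400, 1300, 1100),
  Vehicle = rep(c("2016 Tundra", "2017 F150"), each = 5)
)

md <- lm(Miles.per.Gal ~ Load + Vehicle, data = df)

# Actual MPG minus the model's prediction: the part of MPG that
# load and vehicle do not explain.
df$residual <- residuals(md)

# Shift back to an MPG-like scale by adding the overall mean
# (an illustrative choice; ranking on the residual alone gives the same order).
df$adjusted <- mean(df$Miles.per.Gal) + df$residual

# The adjusted values are uncorrelated with load by construction.
print(cor(df$adjusted, df$Load))  # essentially zero, up to floating-point error

print(df[order(-df$adjusted), c("Drv", "Miles.per.Gal", "Load", "Vehicle", "adjusted")])
```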

Flyblown answered 3/1, 2018 at 9:56 Comment(0)
