Random state (pseudo-random number) in scikit-learn

I want to implement a machine learning algorithm in scikit-learn, but I don't understand what the parameter random_state does. Why should I use it?

I also could not understand what a pseudo-random number is.

Bathyal answered 21/1, 2015 at 10:17 Comment(0)

train_test_split splits arrays or matrices into random train and test subsets. That means that every time you run it without specifying random_state, you will get a different result; this is expected behavior. For example:

Run 1:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> a, b = np.arange(10).reshape((5, 2)), range(5)
>>> train_test_split(a, b)
[array([[6, 7],
        [8, 9],
        [4, 5]]),
 array([[2, 3],
        [0, 1]]), [3, 4, 2], [1, 0]]

Run 2:

>>> train_test_split(a, b)
[array([[8, 9],
        [4, 5],
        [0, 1]]),
 array([[6, 7],
        [2, 3]]), [4, 2, 0], [3, 1]]

It changes. On the other hand, if you use random_state=some_number, then you can guarantee that the output of Run 1 will be equal to the output of Run 2, i.e. your split will always be the same. It doesn't matter what the actual random_state number is: 42, 0, 21, ... The important thing is that every time you use 42, you will get the same output as the first time you made the split. This is useful if you want reproducible results, for example in the documentation, so that everybody can consistently see the same numbers when they run the examples. In practice I would say: set random_state to some fixed number while you test stuff, but remove it in production if you really need a random (and not a fixed) split.
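As a quick check, here is a minimal sketch (reusing a and b from above; the exact arrays are omitted since they depend on the split) showing that the same seed yields the same result on every call:

>>> s1 = train_test_split(a, b, random_state=42)
>>> s2 = train_test_split(a, b, random_state=42)
>>> all(np.array_equal(x, y) for x, y in zip(s1, s2))
True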

Regarding your second question: a pseudo-random number generator is an algorithm that produces a sequence of numbers approximating the statistical properties of truly random numbers, but the sequence is completely determined by an initial value called the seed, which is what makes it reproducible. Why such numbers are not truly random is out of the scope of this question and probably won't matter in your case; you can take a look here for more details.
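To illustrate, a minimal sketch using NumPy's RandomState directly (the generator class scikit-learn uses internally): two generators created with the same seed produce exactly the same sequence:

>>> rng1 = np.random.RandomState(42)  # seeded pseudo-random generator
>>> rng2 = np.random.RandomState(42)  # a second generator with the same seed
>>> (rng1.randint(0, 100, 5) == rng2.randint(0, 100, 5)).all()
True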

Ship answered 21/1, 2015 at 14:10 Comment(14)
So what random state should I set? I commonly see the number 42.Bathyal
@ElizabethSusanJoseph, it doesn't matter much; I always use 0 if I want reproducibility or None otherwise. Maybe the scikit-learn guys like 42.Ship
This probably explains the number 42 being used so often: en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_GalaxyJonijonie
Good one, here are more possibilities.Ship
Is this consistent across machines/architectures/OSes/...?Condescend
Is it possible to keep the original order without randomizing it?Chide
@Condescend That's a tough question. The core PRNG stuff is based on numpy, which is consistent (they introduced many checks for this after some problems in the past). If there are no errors in usage within sklearn, it will behave consistently too. I would assume this (especially for the less complex functions like train_test_split and co). Edit: oops, a bit late :-)Fridlund
@Chide That's a bad idea. Of course you can do this, but be aware of the consequences. If you do this for cross-validation, it can mean that you build partial splits which all share the same time (if your original data is ordered by time).Fridlund
@Fridlund How to do this in train_test_split? From its docs it seems that, whether you feed it a random state or not, the result will not be in the original order (let's say the original data is ordered by time, so it may be sensible to keep the original order when needed).Chide
@Chide Don't use this function then; it's not built for this. Others, like KFold, can do this if shuffle=False.Fridlund
@Fridlund KFold seems to always split evenly, with no way to specify a different percentage. Its shuffle option is good, though; if only train_test_split had that.Chide
@Fridlund It is nice to know when sharing scripts with others who run them on their machines. But thanks! I guess the answer would be: maybe it is not cross-machine/os/architecture/numpy-version consistent, and hence you can't assume it is :)Condescend
@Condescend I would see it more like: numpy is designed exactly for this, and I trust the sklearn devs not to throw this away. But your view is legitimate, although your use case is one of the less mission-critical ones. Of course there may be complications in complex scripts where e.g. liblinear (or some other external lib) is used and there are decisions based on its output (the only setting which scares me). Not sure if liblinear does the same fp math on all architectures (but I have more hope for random sampling, which is, for example, used internally when doing probability estimation).Fridlund
I wonder: for example, if I use a regression tree, from sklearn.tree import DecisionTreeRegressor as DTR; model = DTR(), and I already have my train and test data, then why should model.fit(X_train, Y_train) split the data into train and test? All it should do is perform the best splits the desired number of times.Jacqulynjactation

If you don't specify random_state in your code, then every time you run (execute) it a new random value is generated, and the train and test datasets will have different values each time.

However, if a fixed value is assigned, like random_state=42, then no matter how many times you execute your code the result will be the same, i.e. the same values in the train and test datasets.
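A minimal sketch of this behavior (the data here is just illustrative):

from sklearn.model_selection import train_test_split

data = list(range(10))
train_a, test_a = train_test_split(data, random_state=42)
train_b, test_b = train_test_split(data, random_state=42)
assert train_a == train_b and test_a == test_b  # identical split on every run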

Luxembourg answered 4/6, 2018 at 0:39 Comment(0)

Well, the question of what "random state" is and why it is used has been answered nicely by the people above. I will try to answer the question "Why do we so often choose a random state of 42 when training a machine learning model? Why don't we choose 12 or 32 or 5?" Is there a scientific explanation?

Many students and practitioners use this number (42) as the random state because it is used by many instructors in online courses. They often set the random state or numpy seed to the number 42, and learners follow the same practice without giving it much thought.

To be specific, 42 has nothing to do with AI or ML. It is actually a generic number. In machine learning, it doesn't matter what the actual random number is; as mentioned in the scikit-learn API docs, any integer is sufficient for the task at hand.

42 is a reference to The Hitchhiker's Guide to the Galaxy, where it is the answer to life, the universe and everything; it is meant as a joke and has no other significance.

References:

  1. Wikipedia: The Hitchhiker's Guide to the Galaxy
  2. Stack Exchange: Why the number 42 is preferred when indicating something random
  3. Why the number 42
  4. Quora: Why the number 42 is preferred when indicating something random
  5. YouTube: a nice, simple video explaining the use of random state in train_test_split

The significance of the number 42!

Brunabrunch answered 11/7, 2021 at 11:21 Comment(0)

If you don't mention random_state in the code, then whenever you execute your code a new random value is generated, and the train and test datasets will have different values each time.

However, if you use a particular value for random_state (random_state=1 or any other value), the result will be the same every time, i.e. the same values in the train and test datasets. Refer to the code below:

import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))
# Same seed for both calls, so the underlying shuffle is identical.
size30split = train_test_split(test_series, random_state=1, test_size=0.3)
size25split = train_test_split(test_series, random_state=1, test_size=0.25)
# The 70-element training set is contained in the 75-element one.
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))

No matter how many times you run the code, the output will be 70.

70

Try to remove the random_state and run the code.

import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))
# Without random_state, each call shuffles with a fresh, unpredictable seed.
size30split = train_test_split(test_series, test_size=0.3)
size25split = train_test_split(test_series, test_size=0.25)
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))

Now the output will be different each time you execute the code.

Francesfrancesca answered 13/12, 2018 at 4:9 Comment(0)

random_state controls the shuffling applied to the data before the train/test split. In addition to what is explained here, it is important to remember that the random_state value can have a significant effect on the quality of your model (by quality I essentially mean accuracy of prediction). For instance, if you take a certain dataset and train a regression model with it without specifying the random_state value, there is the potential that every time you will get a different accuracy result for your trained model on the test data. So it is important to find the random_state value that provides you with the most accurate model. That number can then be used to reproduce your model on another occasion, such as another research experiment. To do so, it is possible to split and train the model in a for-loop, assigning successive numbers to the random_state parameter:

import numpy as np
from sklearn.linear_model import LarsCV
from sklearn.model_selection import train_test_split

# X, y are assumed to be your features and target.
tr_score, ts_score = [], []
for j in range(1000):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=j, test_size=0.35)
    lr = LarsCV().fit(X_train, y_train)
    tr_score.append(lr.score(X_train, y_train))
    ts_score.append(lr.score(X_test, y_test))

# Pick the seed that produced the best test score and refit with it.
J = ts_score.index(np.max(ts_score))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=J, test_size=0.35)
M = LarsCV().fit(X_train, y_train)
y_pred = M.predict(X_test)

Homotaxis answered 29/1, 2019 at 14:1 Comment(0)

If no random_state is provided, the system will use one that is generated internally. So when you run the program multiple times you might see different train/test data points, and the behavior will be unpredictable. If you have an issue with your model, you will not be able to recreate it, since you do not know the random number that was generated when you ran the program.

If you look at the tree classifiers - either DT or RF - they try to build a tree using an optimal plan. Although this plan is most often the same, there can be instances where the tree is different, and so are the predictions. When you try to debug your model, you may then be unable to recreate the same instance for which a tree was built. So, to avoid all this hassle, we use a random_state while building a DecisionTreeClassifier or RandomForestClassifier.

PS: You can go a bit more in depth into how the tree is built in DecisionTree to understand this better.
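As a minimal sketch of this (the dataset below is synthetic and purely illustrative), fixing random_state makes two independently trained forests behave identically:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, itself seeded so the example is fully reproducible.
X, y = make_classification(n_samples=200, random_state=0)

# The same seed drives bootstrap sampling and feature subsampling in both fits.
clf1 = RandomForestClassifier(random_state=42).fit(X, y)
clf2 = RandomForestClassifier(random_state=42).fit(X, y)
assert (clf1.predict(X) == clf2.predict(X)).all()  # identical predictions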

random_state is basically used to reproduce your problem the same way every time it is run. If you do not use a random_state in train_test_split, every time you make the split you might get a different set of train and test data points, which will not help you in debugging in case you get an issue.

From the docs:

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
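To make those three options concrete, a small sketch (the data is illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

data = list(range(10))
train_test_split(data, random_state=0)     # int: fixed seed, reproducible split
train_test_split(data, random_state=np.random.RandomState(0))  # explicit generator
train_test_split(data, random_state=None)  # None: uses the global np.random state

Note that passing a RandomState instance advances its internal state, so calling the function twice with the same instance produces two different splits.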

Meath answered 12/9, 2019 at 16:7 Comment(1)
Good explanation. I would only add that one reason to pass the random state is that if we, for example, try to optimize hyperparameters, we don't want fluctuations in the score due to different initializations based on random numbers, which could cover or hide the effect of the actual optimization; then we could not identify what part of the score change was due to the parameter change and what was due to the different start state of the RNG.Fierce

Consider a scenario where we have a dataset of 10 numbers ranging from 1 to 10, and we want to split it into a training dataset and a testing dataset, where the size of the testing dataset is 20% of the entire dataset.

The training dataset will have 8 data samples and the testing dataset will have 2. Fixing the seed ensures that the random process outputs the same result every time, which makes the code reproducible. Without a fixed seed, the shuffle produces different datasets every time, and it's not good to train the model with different data each time.

Each random_state value corresponds to one fixed shuffle of the data. This means that a given random_state value always yields the same datasets, so every time we run the code with random_state value 1, it will produce the same split.
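A minimal sketch of exactly this scenario (the names are illustrative):

from sklearn.model_selection import train_test_split

data = list(range(1, 11))  # 10 samples: 1..10
# 80/20 split: 8 training samples and 2 test samples, fixed by the seed.
train, test = train_test_split(data, test_size=0.2, random_state=1)
print(train, test)  # the same 8/2 split on every run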

The image below shows everything that random_state does:

[image: illustration of what random_state does]

See also: What is random_state?

Splenetic answered 2/3, 2023 at 20:26 Comment(0)
sklearn.model_selection.train_test_split(*arrays, **options)

Split arrays or matrices into random train and test subsets

Parameters: ... 
    random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Source: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

"Regarding the random state, it is used in many randomized algorithms in sklearn to determine the random seed passed to the pseudo-random number generator. Therefore, it does not govern any aspect of the algorithm's behavior. As a consequence, random state values which performed well in the validation set do not correspond to those which would perform well in a new, unseen test set. Indeed, depending on the algorithm, you might see completely different results by just changing the ordering of the training samples."

Source: https://stats.stackexchange.com/questions/263999/is-random-state-a-parameter-to-tune
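As a small sketch of that point (synthetic data, illustrative names): validation scores fluctuate across seeds purely by chance, so a seed that happens to score well says nothing about new, unseen data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    score = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print(seed, round(score, 3))  # the score varies from seed to seed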

Housebound answered 30/9, 2018 at 13:52 Comment(0)
