Using Python to generate clusters of data?

Asked 4/11, 2017 at 19:57 Answered 18/1, 2019 at 6:3

I'm working on a Python function that should model a Gaussian distribution, but I'm stuck.

import numpy.random as rnd
import numpy as np
import matplotlib.pyplot as plt

def genData(co1, co2, M):
  X = rnd.randn(2, 2M + 1)
  t = rnd.randn(1, 2M + 1)
  numpy.concatenate(X, co1)
  numpy.concatenate(X, co2)
  return(X, t)

I'm trying to generate two clusters of size M: cluster 1 is centered at co1 and cluster 2 is centered at co2. X should hold the data points I'm going to graph, and t the target values (1 if cluster 1, 2 if cluster 2) so I can color the points by cluster.

In that case, t is an array of 2M labels (1s and 2s) and X holds 2M points, where t[i] is 1 if X[i] is in cluster 1 and 2 if X[i] is in cluster 2.

I figured the best way to start is to generate the array using NumPy's random module. What I'm confused about is how to get the points centered on each cluster.

Would the best way be to generate a cluster sized M, then add co1 to each of the points? How would I make it random though, and make sure t[i] is colored in properly?

I'm using this function to graph the data:

def graphData():
    co1 = (0.5, -0.5)
    co2 = (-0.5, 0.5)
    M = 1000
    X, t = genData(co1, co2, M)
    colors = np.array(['r', 'b'])
    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], color = colors[t], s = 10)

Doomsday answered 4/11, 2017 at 19:57 Comment(2)

Use numpy.random.multivariate_normal. Give the mean argument as a vector of length 2; that will be the location of the cluster. – Hargett 4/11, 2017 at 20:32

@WarrenWeckesser Thanks Warren, but how will I make it so X is random and t will tell me which cluster it belongs to? – Doomsday 4/11, 2017 at 20:58
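One way to combine the comment's suggestion with the labels asked about is sketched below. It is a minimal sketch, not the commenter's code: the helper name gen_data, the scale parameter, and the isotropic covariance are assumptions made for illustration, and the labels are 0/1 rather than 1/2 so that they can index a color array directly.

```python
import numpy as np

def gen_data(co1, co2, M, scale=0.1, seed=None):
    """Return (X, t): X has shape (2M, 2); t has shape (2M,) with labels 0 or 1."""
    rng = np.random.default_rng(seed)
    # Assumed covariance: the question does not specify one, so use scale * identity.
    cov = scale * np.eye(2)
    X1 = rng.multivariate_normal(co1, cov, size=M)  # M points around co1
    X2 = rng.multivariate_normal(co2, cov, size=M)  # M points around co2
    X = np.vstack([X1, X2])
    t = np.concatenate([np.zeros(M, dtype=int), np.ones(M, dtype=int)])
    # Shuffle points and labels with the same permutation so t[i] still describes X[i].
    perm = rng.permutation(2 * M)
    return X[perm], t[perm]

X, t = gen_data((0.5, -0.5), (-0.5, 0.5), M=1000, seed=0)
```

Because the same permutation is applied to X and t, the order is random while each label stays attached to its point.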

For your purpose, I would go for the sklearn sample generator make_blobs:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

centers = [(-5, -5), (5, 5)]
cluster_std = [0.8, 1]

# X holds the points; y holds the cluster index (0 or 1) of each point
X, y = make_blobs(n_samples=100, cluster_std=cluster_std, centers=centers, n_features=2, random_state=1)

plt.scatter(X[y == 0, 0], X[y == 0, 1], color="red", s=10, label="Cluster1")
plt.scatter(X[y == 1, 0], X[y == 1, 1], color="blue", s=10, label="Cluster2")
plt.legend()
plt.show()

You can generate multi-dimensional clusters with this. X holds the data points, and y indicates which cluster the corresponding point in X belongs to.

This might be more than you need in this case, but in general I think it's better to rely on general-purpose, well-tested library code that can be reused in other cases as well.
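Applied to the setup from the question, this could look roughly like the sketch below. The cluster_std value of 0.3 is an assumption (the question does not give a spread), and the labels come back as 0/1, which happens to match the two-element colors array used in the question's plotting function.

```python
import numpy as np
from sklearn.datasets import make_blobs

co1 = (0.5, -0.5)
co2 = (-0.5, 0.5)
M = 1000

# 2*M samples split evenly between the two centers; cluster_std=0.3 is an assumed spread.
X, t = make_blobs(n_samples=2 * M, centers=[co1, co2], cluster_std=0.3, random_state=0)

# t contains 0 for points drawn around co1 and 1 for points drawn around co2,
# so it can index a color array directly.
colors = np.array(["r", "b"])
point_colors = colors[t]
```

The result can then be passed to the question's plotting call, for example plt.scatter(X[:, 0], X[:, 1], color=point_colors, s=10).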

Samuelson answered 18/1, 2019 at 6:3 Comment(1)

Works great. The only minor thing is samples_generator is now deprecated. Should use from sklearn.datasets import make_blobs instead. – Jarrell 28/8, 2020 at 12:27

You can use something like the following code:

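import numpy as np
import matplotlib.pyplot as plt

center1 = (50, 60)
center2 = (80, 20)
distance = 20

# x is drawn uniformly from [center, center + distance);
# y is drawn from a normal distribution with mean center and standard deviation distance
x1 = np.random.uniform(center1[0], center1[0] + distance, size=(100,))
y1 = np.random.normal(center1[1], distance, size=(100,))

x2 = np.random.uniform(center2[0], center2[0] + distance, size=(100,))
y2 = np.random.normal(center2[1], distance, size=(100,))

plt.scatter(x1, y1)
plt.scatter(x2, y2)
plt.show()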

Salpingotomy answered 18/1, 2019 at 4:57 Comment(0)
