I am using Spark ML to run some ML experiments. On a small 20 MB dataset (the Poker dataset), fitting a Random Forest over a parameter grid takes 1 hour and 30 minutes to finish. A similar experiment in scikit-learn takes much less time.
In terms of environment, I was testing with 2 slaves, 15 GB of memory each, 24 cores. I assume it is not supposed to take that long, and since I am fairly new to Spark, I am wondering whether the problem lies in my code.
Here is the code:
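import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# The UCI file has no header row; the column names below are assumed
# (suit and rank of each of the 5 cards, then the hand class as 'label').
columns = ["S1", "C1", "S2", "C2", "S3", "C3", "S4", "C4", "S5", "C5", "label"]
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-testing.data", header=None, names=columns)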
dataframe = sqlContext.createDataFrame(df)
train, test = dataframe.randomSplit([0.7, 0.3])
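# Split the feature columns into categorical (string) and numeric ones.
columnTypes = dataframe.dtypes
categoricalCols = []
numericCols = []
for ct in columnTypes:
    if ct[1] == 'string' and ct[0] != 'label':
        categoricalCols += [ct[0]]
    elif ct[0] != 'label':
        numericCols += [ct[0]]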
# Index every categorical column, then assemble all features into one vector.
stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
    stages += [stringIndexer]
assemblerInputs = [c + "Index" for c in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel', handleInvalid='skip')
stages += [labelIndexer]
estimator = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features")
stages += [estimator]
# Build the grid: 4 x 4 x 3 = 48 parameter combinations.
parameters = {"maxDepth" : [3, 5, 10, 15], "maxBins" : [6, 12, 24, 32], "numTrees" : [3, 5, 10]}
paramGrid = ParamGridBuilder()
for key, value in parameters.items():
    paramGrid.addGrid(estimator.getParam(key), value)
estimatorParamMaps = paramGrid.build()
pipeline = Pipeline(stages=stages)
crossValidator = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=estimatorParamMaps,
    evaluator=MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1'),
    numFolds=3)
pipelineModel = crossValidator.fit(train)
predictions = pipelineModel.transform(test)
# The evaluator belongs to the CrossValidator, not to the Pipeline.
f1 = crossValidator.getEvaluator().evaluate(predictions)
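For context on the running time, here is a rough sketch of how much work this grid implies. It assumes that CrossValidator fits the whole pipeline once per parameter combination per fold and then refits the best combination once on the full training set. The grid values and fold count are copied from the code above; the helper names (`num_folds`, `total_fits`, `cv_trees`) are only for illustration.

```python
from itertools import product

# Grid values and fold count copied from the code above.
parameters = {
    "maxDepth": [3, 5, 10, 15],
    "maxBins": [6, 12, 24, 32],
    "numTrees": [3, 5, 10],
}
num_folds = 3

# Every combination of the three parameter lists is one candidate model.
keys = list(parameters)
combinations = [dict(zip(keys, values))
                for values in product(*(parameters[k] for k in keys))]
num_combinations = len(combinations)

# Each candidate is fit once per fold; the best one is then refit once
# on the whole training set (assumption stated in the text above).
total_fits = num_combinations * num_folds + 1

# Trees grown during cross-validation only (the final refit is excluded,
# since its numTrees depends on which combination wins).
cv_trees = num_folds * sum(c["numTrees"] for c in combinations)

print("parameter combinations:", num_combinations)
print("pipeline fits:", total_fits)
print("trees grown during cross-validation:", cv_trees)
```

So the 1 hour and 30 minutes is spread over well more than a hundred pipeline fits, not over a single Random Forest.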
Thanks in advance, any comments/suggestions are highly appreciated :)