If I understood correctly you do not want to convert 1 categorical column in several dummy columns. You want spark to understand that the column is categorical and not numerical.
I think it depends on the algorithm you want to use right now. For example random Forest and GBT have both categoricalFeaturesInfo as a parameter check it here:
https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$
so for example:
categoricalFeaturesInfo = Map[Int, Int]((1,2),(2,5))
is actually saying that second column of your features (index starts in 0, so 1 is second column) is a categorical one with 2 levels, and 3rd is also a categorical feature with 5 levels. You can specify these parameters when you train your randomForest or GBT.
You need to make sure your levels are mapped to 0,1,2... so if you have something like ("good","medium","bad") map it into (0,1,2).
Now in your case you want to use LogisticRegressionWithLBFGS. In this case my suggestion is to actually transform categorical columns into dummy columns. For example a single column with 3 levels ("good","medium","bad") into 3 columns with 1/0 depending on which one hits. I do not have an example to work with so here is a sample code in scala that should work:
val dummygen = (data : DataFrame, col:Array[String]) => {
var temp = data
for(i <- 0 until col.length) {
val N = data.select(col(i)).distinct.count.toInt
for (j<- 0 until N)
temp = temp.withColumn(col(i) + "_" + j.toString, callUDF(index(j), DoubleType, data(col(i))))
}
temp
}
val index = (value:Double) => {(a:Double) => {
if (value==a) {
1
} else{
0
}
}}
That you can call it like:
val results = dummygen(data, Array("CategoricalColumn1","CategoricalColumn2"))
Here I do it for a list of Categorical Columns (just in case you have more than 1 in your features list). First "for loop" goes through each categorical column, second "for loop" goes through each level in the column and creates a number of columns equals to the number of levels for each column.
Important!!! that it assumes that you first mapped them to 0,1,2...
You can then run your LogisticRegressionWithLBFGS using this new features set. This approach also helps with SVM.