I have a data set in the form of nested maps. Its Scala type is:
Map[String, (LabelType,Map[Int, Double])]
The first String key is a unique identifier for each sample. The value is a tuple containing the label (-1 or 1) and a nested map that is the sparse representation of the sample's non-zero elements.
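To make the shape concrete, here is a small illustrative instance. The sample IDs and feature values are made up, and LabelType is assumed to be an alias for Int, since its definition is not shown here.

```scala
object SampleData {
  // Assumption: LabelType is an alias for Int (not defined in the question).
  type LabelType = Int

  // Hypothetical sample IDs and feature values, for illustration only.
  // Each entry maps a sample ID to (label, featureIndex -> nonZeroValue).
  val dataMap: Map[String, (LabelType, Map[Int, Double])] = Map(
    "sample_a" -> ((1, Map(0 -> 0.5, 3 -> 2.0))),
    "sample_b" -> ((-1, Map(1 -> 1.5)))
  )
}
```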
I would like to load this data into Spark (using MLUtils) and then train and test some machine learning algorithms.
It's easy to write this data into a file with LibSVM's sparse encoding, and then load it in Spark:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.MLUtils

writeMapToLibSVMFile(data_map, "libsvm_data.txt") // Implemented elsewhere
val conf = new SparkConf().setAppName("DecisionTree").setMaster("local[4]")
val sc = new SparkContext(conf)
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "libsvm_data.txt")
// Split the data into training and test sets
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
I know it should be just as easy to load the data variable directly from data_map, but I don't know how.
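For reference, a direct conversion might look roughly like the sketch below. It is untested and rests on several assumptions: LabelType is taken to be Int, the Int keys are taken to be 0-based feature indices, and numFeatures (the total feature dimension) is a hypothetical parameter that has to be supplied by the caller. The names MapToRDD, toLabeledPoints and toRDD are made up for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object MapToRDD {
  // Assumption: LabelType is an alias for Int (not defined in the question).
  type LabelType = Int

  // Turn each (label, sparse features) entry into a LabeledPoint.
  // Assumption: the Int keys are 0-based indices, all smaller than numFeatures.
  def toLabeledPoints(
      dataMap: Map[String, (LabelType, Map[Int, Double])],
      numFeatures: Int): Seq[LabeledPoint] =
    dataMap.values.toSeq.map { case (label, features) =>
      LabeledPoint(label.toDouble, Vectors.sparse(numFeatures, features.toSeq))
    }

  // Distribute the converted points as an RDD, skipping the intermediate file.
  def toRDD(
      sc: SparkContext,
      dataMap: Map[String, (LabelType, Map[Int, Double])],
      numFeatures: Int): RDD[LabeledPoint] =
    sc.parallelize(toLabeledPoints(dataMap, numFeatures))
}
```

With something like this, val data = MapToRDD.toRDD(sc, data_map, numFeatures) would take the place of the loadLibSVMFile call. One difference to keep in mind: loadLibSVMFile shifts the file's 1-based indices to 0-based, so if the keys in data_map are 1-based, the sketch would need to subtract 1 from each index.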
Any help is appreciated!