How to sum the values of one column of a DataFrame in Spark/Scala
I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.

I want to sum the values of each column, for instance the total number of steps on "steps" column.

As far as I can see, I want to use these kinds of functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

But I can't understand how to use the function sum.

When I write the following:

val df = CSV.load(args(0))
val sumSteps = df.sum("steps") 

the function sum cannot be resolved.

Am I using the function sum wrongly? Do I need to use the function map first? And if yes, how?

A simple example would be very helpful! I started writing Scala recently.

Allelomorph asked 4/5, 2016 at 15:25 Comment(0)
You must first import the functions:

import org.apache.spark.sql.functions._

Then you can use them like this:

val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)

You can also cast the result if needed:

val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
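
Note that sum returns null when the DataFrame is empty (or the column contains only nulls), in which case getLong(0) throws. A small sketch of one way to guard against that with coalesce:

val safeSum: Long = df.agg(coalesce(sum("steps"), lit(0)).cast("long")).first.getLong(0)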

Edit:

For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:

val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first

Edit2:

For dynamically applying the aggregations, the following options are available (see the sketch after this list for reading the results back):

  • Applying to all numeric columns at once:
    df.groupBy().sum()
  • Applying to a list of numeric column names:
    val columnNames = List("col1", "col2")
    df.groupBy().sum(columnNames: _*)
  • Applying to a list of numeric column names with aliases and/or casts:
    val cols = List("col1", "col2")
    val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
    df.groupBy().agg(sums.head, sums.tail: _*).show()
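
To read the dynamically computed sums back into plain Scala values, the single aggregated Row can be unpacked by position. A minimal sketch, building on the cols and sums defined in the last option above:

val sumRow = df.groupBy().agg(sums.head, sums.tail: _*).first
// Pair each column name with its summed value (Double because of the cast above)
val totals = cols.zipWithIndex.map { case (c, i) => c -> sumRow.getDouble(i) }.toMap
// totals("col1") now holds the sum of "col1"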
Kirtle answered 4/5, 2016 at 16:8 Comment(0)
If you want to sum all values of one column, it can be more efficient to use the DataFrame's internal RDD and reduce.

import sqlContext.implicits._            // for .toDF
import org.apache.spark.sql.functions._  // for col

val df = sc.parallelize(Array(10, 2, 3, 4)).toDF("steps")
// Unwrap each Row to its single Int element and sum with reduce
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_ + _)

// res1: Int = 19
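
Since Spark 1.6, the typed Dataset API offers an equivalent without the asInstanceOf cast; a sketch, assuming the same implicits are in scope:

// Read the single column as a Dataset[Int] and reduce it directly
df.select("steps").as[Int].reduce(_ + _)
// res2: Int = 19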
Brambling answered 4/5, 2016 at 16:40 Comment(7)
Nice option! Is it still more efficient if he wants the sum of many columns? In a DataFrame I know it would be like df.agg(sum("col1"), sum("col2"), ...)Kirtle
@DanieldePaula I know, but he said one columnBrambling
Oh, I read "I want to sum the values of each column (...)" and I thought he meant many columns. Anyway, my question was more out of curiosity, to help improve our answers.Kirtle
@DanieldePaula Indeed your answer is the correct one; mine is just an alternative (for only one column), so I will vote for yours.Brambling
I like your alternative! If I may repeat, do you know if it's still efficient for multiple columns at a time?Kirtle
@DanieldePaula My answer works only for one column: if you notice, I select column steps, then take the first element of every Row (which is the only element), and then sum all those values using reduce. Therefore, your answer would be the correct one in the case of multiple columns, and it's more intuitive than mine.Brambling
I accepted the second answer because I wanted the sum of the values of one column. However, I will later need the mean and other statistical methods, so I think I will use something similar in syntax based on answer 1.Allelomorph
Simply apply the aggregation function sum to your column (this snippet uses the PySpark API):

df.groupBy().sum('steps').show()

See the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

Also check out this link: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
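
Since the question asks for Scala, the equivalent call there, with the total extracted from the single result Row, might look like this sketch (assuming an integer "steps" column, whose sum comes back as a Long):

val total = df.groupBy().sum("steps").first.getLong(0)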

Batha answered 21/8, 2017 at 9:58 Comment(0)
Not sure this was around when this question was asked, but:

df.describe("columnName").show()

gives count, mean, stddev, min, and max stats on a column. It returns stats for all columns if you call describe() with no arguments.
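
describe returns its statistics as strings, so pulling one value back out takes a small conversion; a sketch, assuming a numeric "steps" column:

val meanSteps = df.describe("steps")
  .filter("summary = 'mean'")
  .select("steps")
  .first.getString(0).toDouble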

Nahtanha answered 26/10, 2018 at 17:19 Comment(0)
Using a Spark SQL query, just in case it helps anyone!

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._ // needed for .toDF and the Long encoder below

// Name the column "steps" so the SQL can reference it
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")

df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps")
  .map(row => row.getAs[Long]("stepsSum")).collect()(0)
println("steps sum = " + sum) // prints 28
Fanciful answered 1/5, 2019 at 10:54 Comment(0)
