import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
import findspark
from pyspark.sql.functions import countDistinct
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("users mobile related information analysis") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.executor.memory", "3g") \
    .config("spark.driver.maxResultSize", "1g") \
    .config("spark.executor.pyspark.memory", "3g") \
    .enableHiveSupport() \
    .getOrCreate()
handset_info = ora_tmp.select('some_value','some_value','some_value','some_value','some_value','some_value','some_value')
I configured Spark with 3 GB of executor memory and 3 GB of executor PySpark memory. My database has more than 70 million rows. When I call the
handset_info.show()
method, it shows the top 20 rows within 2-5 seconds. But when I run the following code
mobile_info_df = handset_info.limit(30)
mobile_info_df.show()
to show the top 30 rows, it takes far too long (3-4 hours). Is it normal for it to take that much time? Is there a problem with my configuration? My laptop's configuration is:

- Core i7 (4 cores) with 8 GB RAM
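
To make the comparison concrete, here is a minimal, self-contained sketch of the calls I am timing. It uses spark.range() as a stand-in for my real Hive-backed table (handset_info and the column names above are placeholders), so it will not reproduce the 3-4 hour gap on its own; it just shows the two code paths side by side:

import time

from pyspark.sql import SparkSession

# A plain local session just for this sketch; the real job uses the
# configuration shown at the top of the question.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("show vs limit timing sketch")
    .getOrCreate()
)

# Stand-in for the 70-million-row table read in the real job.
df = spark.range(0, 70_000_000).toDF("some_value")

start = time.time()
df.show()  # top 20 rows; this is the call that takes 2-5 seconds for me
print(f"show(): {time.time() - start:.2f} s")

start = time.time()
df.limit(30).show()  # top 30 rows; this is the call that takes hours for me
print(f"limit(30).show(): {time.time() - start:.2f} s")

On this synthetic data both calls finish quickly; the huge gap only appears against my real table, which is why I suspect either my configuration or the way limit() is planned.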
limit() works that way? It strikes me as rather wasteful... – Miscalculate