bigdata Questions
5
Solved
Just like the question says. Is there a faster way to do what is done below when the vector size is very large (> 10M entries) using base R?
The code below works, but when the vector size ...
Disreputable asked 25/10 at 13:16
1
I would like to fully understand the meaning of the information about min/med/max.
for example:
scan time total(min, med, max)
34m(3.1s, 10.8s, 15.1s)
Does this mean that, of all cores, the min scan time is ...
Gasometer asked 23/11, 2019 at 19:52
2
Solved
There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is:
The Spark Driver node (sp...
Goatsbeard asked 8/9, 2016 at 0:57
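For reference, here is a minimal PySpark sketch of where these pieces are configured when a session is created; the values are placeholders for illustration, not tuning advice, and the app name is made up:

    from pyspark.sql import SparkSession

    # The driver runs this script and coordinates the job; the executor settings
    # below control the worker-side processes that actually run tasks.
    spark = (
        SparkSession.builder
        .appName("cluster-topology-example")
        .config("spark.driver.memory", "4g")      # memory for the driver JVM
        .config("spark.executor.instances", "4")  # number of executor processes
        .config("spark.executor.cores", "2")      # concurrent tasks per executor
        .config("spark.executor.memory", "8g")    # memory per executor
        .getOrCreate()
    )

    # Each partition of this DataFrame becomes one task, and tasks are scheduled
    # onto the executor cores configured above.
    df = spark.range(1_000_000).repartition(8)
    print(df.rdd.getNumPartitions())  # 8 partitions -> 8 tasks for this stage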
3
I'm looking for suggestions for a strategy for fitting generalized linear mixed-effects models to a relatively large data set.
Consider I have data on 8 million US basketball passes on about 300 tea...
Chalaza asked 23/10, 2017 at 15:11
2
Solved
I'm trying to convert a 20GB JSON gzip file to parquet using AWS Glue.
I've set up a job using PySpark with the code below.
I got this log WARN message:
LOG.WARN: Loading one large unsplittable file...
Viscounty asked 21/1, 2022 at 17:24
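A minimal PySpark sketch of the overall shape of such a job, assuming the goal is to read a single gzipped JSON file and write parquet; since a .json.gz file is unsplittable, the read lands on one task, so a repartition is added before the write. The S3 paths and partition count are placeholders, and this uses plain Spark rather than Glue's DynamicFrame API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-gz-to-parquet").getOrCreate()

    # A gzipped JSON file cannot be split, so the initial read is handled by a
    # single task; repartitioning afterwards spreads the data across the cluster
    # before the parallel parquet write.
    df = spark.read.json("s3://my-bucket/input/large-file.json.gz")
    df = df.repartition(200)
    df.write.mode("overwrite").parquet("s3://my-bucket/output/parquet/")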
1
How does spark decide how many times to replicate a cached partition?
The storage level in the storage tab on the spark UI says “Disk Serialized 1x Replicated”, but it looks like partitions get re...
Lugansk asked 9/4, 2019 at 21:44
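For comparison, the built-in storage levels ending in _2 request two replicas of each cached block; a short sketch of asking for replication explicitly (the DataFrame here is a placeholder):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("replication-example").getOrCreate()
    df = spark.range(1_000_000)  # placeholder DataFrame

    # Default cache()/persist() levels keep one replica per block
    # ("1x Replicated" in the UI); levels ending in _2 request two.
    df.persist(StorageLevel.MEMORY_AND_DISK_2)
    df.count()  # materialize the cache
    print(df.storageLevel)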
3
I am trying to join two dataframes together as follows:
df3 = pd.merge(df1,df2, how='inner', on='key')
where df1 and df2 are large datasets with millions of rows. Basically how do I join them with...
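One common pattern when both frames are large is to stream the bigger one in chunks and merge piece by piece; a hedged sketch, assuming df2 fits in memory, with the file names and the 'key' column as placeholders:

    import pandas as pd

    # Keep the smaller frame in memory and stream the larger one from disk.
    df2 = pd.read_csv("df2.csv")

    pieces = []
    for chunk in pd.read_csv("df1.csv", chunksize=1_000_000):
        pieces.append(chunk.merge(df2, how="inner", on="key"))

    df3 = pd.concat(pieces, ignore_index=True)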
5
Solved
I am trying to understand if there is a real difference between a data lake and Big Data. If you check the concepts, both are like a big repository which saves the information until it becomes nece...
0
I'm trying to achieve performant record writes from pandas (or ideally Polars if possible) in a Python environment to our Apache Iceberg deployment (with hive metastore) directly, or via Trino quer...
Plutonian asked 6/7, 2023 at 14:34
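If direct writes from Python are acceptable, one possible route is PyIceberg's Arrow-based append; this is a hedged sketch only, since the catalog name, metastore URI and table identifier below are placeholders and the exact properties depend on the deployment:

    import pandas as pd
    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    # Placeholder catalog configuration for a Hive metastore.
    catalog = load_catalog(
        "hive_catalog",
        **{"type": "hive", "uri": "thrift://metastore-host:9083"},
    )
    table = catalog.load_table("analytics.events")  # placeholder identifier

    # The DataFrame's schema must match the Iceberg table's schema.
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    table.append(pa.Table.from_pandas(df))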
3
Solved
I need to break a large file (14 gigabytes) into smaller files. The format of this file is txt, the delimiter is ";", and I know it has 70 columns (string, double). I would like to read 1 million rows and save ...
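A minimal pandas sketch of the chunked approach, assuming the file has no header row; the file name and output pattern are placeholders, and the delimiter is ';' as described:

    import pandas as pd

    # Stream the big file in 1-million-row chunks and write each chunk
    # to its own file.
    reader = pd.read_csv("big_file.txt", sep=";", header=None, chunksize=1_000_000)

    for i, chunk in enumerate(reader):
        chunk.to_csv(f"part_{i:04d}.txt", sep=";", index=False, header=False)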
5
Solved
I am trying to leverage spark partitioning. I was trying to do something like
data.write.partitionBy("key").parquet("/location")
The issue here is that each partition creates a huge number of parquet file...
Rewarding asked 28/6, 2017 at 16:49
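A commonly suggested workaround is to repartition on the same column before writing, so each output directory is written by roughly one task; a sketch using the names from the question:

    # 'data', 'key' and '/location' are the DataFrame, column and path
    # from the question above.
    (
        data
        .repartition("key")
        .write
        .partitionBy("key")
        .parquet("/location")
    )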
3
I'm using the simple command: SELECT DISTINCT * FROM first_working_table; in HIVE 0.11, and I'm receiving the following error message:
FAILED: SemanticException TOK_ALLCOLREF is not supported in...
2
I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such.
The ...
Crackup asked 7/5, 2019 at 12:41
4
As the title says, I have one column (a Series) in pandas, and each row of it is a list like [0,1,2,3,4,5]. Each list has 6 numbers. I want to change this column into 6 columns, for example, the [0,1,2,3,...
Abbreviate asked 21/3, 2017 at 7:3
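A minimal pandas sketch of one way to do this; the frame and the column name 'col' are placeholders:

    import pandas as pd

    df = pd.DataFrame({"col": [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]})

    # Build a 6-column frame from the lists, then attach it to the original.
    expanded = pd.DataFrame(df["col"].tolist(), index=df.index)
    expanded.columns = [f"col_{i}" for i in range(6)]
    df = df.drop(columns="col").join(expanded)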
1
I am working with big data and I have a 70GB JSON file.
I am using the jsonlite library to load the file into memory.
I have tried an AWS EC2 x1.16xlarge machine (976 GB RAM) to perform this load but ...
11
Solved
I have trouble querying a table of > 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory.
This wo...
Dough asked 7/8, 2013 at 15:50
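If a Python route is acceptable, one way to avoid pulling everything into memory at once is a chunked read; a hedged sketch where the connection string, table name and the process() helper are placeholders:

    import pandas as pd
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
        "DATABASE=mydb;UID=user;PWD=secret"
    )

    # read_sql with chunksize yields DataFrames of at most 50,000 rows each.
    for chunk in pd.read_sql("SELECT * FROM big_table", conn, chunksize=50_000):
        process(chunk)  # placeholder for whatever is done with each chunk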
3
I want to use jq on a 50GB file. Needless to say, the machine's memory can't handle it. It's running out of memory.
I tried several options including --stream but it didn't help. Can someone tell me ...
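If dropping from jq to Python is an option, a streaming parser keeps memory use roughly constant; a hedged sketch that assumes the 50GB file is a top-level JSON array of objects, with the file name and field name as placeholders:

    import ijson

    # ijson parses the file incrementally instead of loading it whole.
    with open("big.json", "rb") as f:
        for record in ijson.items(f, "item"):  # one array element at a time
            print(record.get("id"))  # placeholder per-record handling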
5
Solved
I have a table in pg like so:
CREATE TABLE t (
a BIGSERIAL NOT NULL, -- 8 b
b SMALLINT, -- 2 b
c SMALLINT, -- 2 b
d REAL, -- 4 b
e REAL, -- 4 b
f REAL, -- 4 b
g INTEGER, -- 4 b
h REAL, -- ...
Cheston asked 3/6, 2010 at 13:44
7
Solved
I am working on a use case where I have to transfer data from RDBMS to HDFS. We have done the benchmarking of this case using sqoop and found out that we are able to transfer around 20GB data in 6-...
Marcellusmarcelo asked 10/5, 2016 at 8:41
1
I'm working with a dataset stored in S3 bucket (parquet files) consisting of a total of ~165 million records (with ~30 columns). Now, the requirement is to first groupby a certain ID column then ge...
Responsive asked 15/12, 2021 at 17:30
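A hedged PySpark sketch of the read-then-groupby shape, since the exact aggregation is cut off above; the bucket path, column names and the aggregate are placeholders:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("s3-groupby").getOrCreate()

    df = spark.read.parquet("s3a://my-bucket/dataset/")
    result = df.groupBy("id").agg(F.max("event_time").alias("latest_event_time"))
    result.write.mode("overwrite").parquet("s3a://my-bucket/output/")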
5
Solved
I tried to load the data into my table 'users' in LOCAL mode and I am using Cloudera on my VirtualBox. I have placed my file inside the /home/cloudera/Desktop/Hive/ directory but I am getting a...
3
Solved
As I read more about the Lake House architectural pattern and follow the demos from Databricks, I hardly see any discussion around Dimensional Modelling like in a traditional data warehouse (Kim...
Fretwell asked 15/11, 2021 at 22:40
3
Solved
I am using Standard SQL. Even though it's a basic query, it is still throwing errors. Any suggestions, please?
SELECT
fullVisitorId,
CONCAT(CAST(fullVisitorId AS string),CAST(visitId AS string)) AS ses...
Sacramentarian asked 1/9, 2017 at 17:37
4
I have a pandas.DataFrame with 3.8 Million rows and one column, and I'm trying to group them by index.
The index is the customer ID. I want to group the qty_liter by the index:
df = df.groupby(df.i...
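A minimal sketch of grouping by the index and summing qty_liter; the tiny frame stands in for the 3.8-million-row one:

    import pandas as pd

    df = pd.DataFrame(
        {"qty_liter": [1.0, 2.5, 3.0]},
        index=pd.Index(["c1", "c1", "c2"], name="customer_id"),
    )

    # level=0 groups by the (customer ID) index.
    totals = df.groupby(level=0)["qty_liter"].sum()
    print(totals)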
8
Solved
I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, any way I try to do this is taking days.
I tried putting them in a table and doing ...
Canarese asked 28/11, 2011 at 2:29
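A hedged sketch of the set-based approach via Python: load the IDs into a temporary table and delete with a single join, rather than millions of per-row deletes; connection details, table and column names are placeholders:

    import psycopg2
    from psycopg2.extras import execute_values

    ids_to_delete = [1, 2, 3]  # placeholder for the real list of ~2M IDs

    conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE ids_tmp (id BIGINT PRIMARY KEY)")
        execute_values(cur, "INSERT INTO ids_tmp (id) VALUES %s",
                       [(i,) for i in ids_to_delete])
        cur.execute("DELETE FROM big_table t USING ids_tmp d WHERE t.id = d.id")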