bigdata Questions
5
Solved
Just like the question says. Is there a faster way to do what is done below when the vector size is very large (> 10M entries) using base R?
The code below works, but when the vector size ...
Disreputable asked 25/10 at 13:16
1
I would like to fully understand the meaning of the information about min/med/max.
for example:
scan time total(min, med, max)
34m(3.1s, 10.8s, 15.1s)
Does this mean that, of all cores, the min scan time is ...
Gasometer asked 23/11, 2019 at 19:52
2
Solved
There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is:
The Spark Driver node (sp...
Goatsbeard asked 8/9, 2016 at 0:57
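For reference, here is a minimal PySpark sketch of where these pieces are configured when a session is created; the values are placeholders for illustration, not tuning advice, and the app name is made up:

    from pyspark.sql import SparkSession

    # The driver runs this script and coordinates the job; the executor settings
    # below control the worker-side processes that actually run tasks.
    spark = (
        SparkSession.builder
        .appName("cluster-topology-example")
        .config("spark.driver.memory", "4g")      # memory for the driver JVM
        .config("spark.executor.instances", "4")  # number of executor processes
        .config("spark.executor.cores", "2")      # concurrent tasks per executor
        .config("spark.executor.memory", "8g")    # memory per executor
        .getOrCreate()
    )

    # Each partition of this DataFrame becomes one task, and tasks are scheduled
    # onto the executor cores configured above.
    df = spark.range(1_000_000).repartition(8)
    print(df.rdd.getNumPartitions())  # 8 partitions -> 8 tasks for this stage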
3
I'm looking for suggestions for a strategy for fitting generalized linear mixed-effects models to a relatively large data set.
Consider I have data on 8 million US basketball passes on about 300 tea...
Chalaza asked 23/10, 2017 at 15:11
2
Solved
I'm trying to convert a 20GB JSON gzip file to parquet using AWS Glue.
I've set up a job using PySpark with the code below.
I got this log WARN message:
LOG.WARN: Loading one large unsplittable file...
Viscounty asked 21/1, 2022 at 17:24
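A minimal PySpark sketch of the overall shape of such a job, assuming the goal is to read a single gzipped JSON file and write parquet; since a .json.gz file is unsplittable, the read lands on one task, so a repartition is added before the write. The S3 paths and partition count are placeholders, and this uses plain Spark rather than Glue's DynamicFrame API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-gz-to-parquet").getOrCreate()

    # A gzipped JSON file cannot be split, so the initial read is handled by a
    # single task; repartitioning afterwards spreads the data across the cluster
    # before the parallel parquet write.
    df = spark.read.json("s3://my-bucket/input/large-file.json.gz")
    df = df.repartition(200)
    df.write.mode("overwrite").parquet("s3://my-bucket/output/parquet/")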
1
How does spark decide how many times to replicate a cached partition?
The storage level in the storage tab on the spark UI says “Disk Serialized 1x Replicated”, but it looks like partitions get re...
Lugansk asked 9/4, 2019 at 21:44
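For comparison, the built-in storage levels ending in _2 request two replicas of each cached block; a short sketch of asking for replication explicitly (the DataFrame here is a placeholder):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("replication-example").getOrCreate()
    df = spark.range(1_000_000)  # placeholder DataFrame

    # Default cache()/persist() levels keep one replica per block
    # ("1x Replicated" in the UI); levels ending in _2 request two.
    df.persist(StorageLevel.MEMORY_AND_DISK_2)
    df.count()  # materialize the cache
    print(df.storageLevel)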
3
I am trying to join two dataframes together as follows:
df3 = pd.merge(df1,df2, how='inner', on='key')
where df1 and df2 are large datasets with millions of rows. Basically how do I join them with...
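One common pattern when both frames are large is to stream the bigger one in chunks and merge piece by piece; a hedged sketch, assuming df2 fits in memory, with the file names and the 'key' column as placeholders:

    import pandas as pd

    # Keep the smaller frame in memory and stream the larger one from disk.
    df2 = pd.read_csv("df2.csv")

    pieces = []
    for chunk in pd.read_csv("df1.csv", chunksize=1_000_000):
        pieces.append(chunk.merge(df2, how="inner", on="key"))

    df3 = pd.concat(pieces, ignore_index=True)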
5
Solved
I am trying to understand if there is a real difference between a data lake and Big Data. If you check the concepts, both are like a big repository which saves the information until it becomes nece...
0
I'm trying to achieve performant record writes from pandas (or ideally Polars if possible) in a Python environment to our Apache Iceberg deployment (with hive metastore) directly, or via Trino quer...
Plutonian asked 6/7, 2023 at 14:34
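If direct writes from Python are acceptable, one possible route is PyIceberg's Arrow-based append; this is a hedged sketch only, since the catalog name, metastore URI and table identifier below are placeholders and the exact properties depend on the deployment:

    import pandas as pd
    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    # Placeholder catalog configuration for a Hive metastore.
    catalog = load_catalog(
        "hive_catalog",
        **{"type": "hive", "uri": "thrift://metastore-host:9083"},
    )
    table = catalog.load_table("analytics.events")  # placeholder identifier

    # The DataFrame's schema must match the Iceberg table's schema.
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    table.append(pa.Table.from_pandas(df))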
3
Solved
I need to break a large file (14 gigabytes) into smaller files. The format of this file is txt, the delimiter is ";", and I know it has 70 columns (string, double). I would like to read 1 million rows and save ...
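A minimal pandas sketch of the chunked approach, assuming the file has no header row; the file name and output pattern are placeholders, and the delimiter is ';' as described:

    import pandas as pd

    # Stream the big file in 1-million-row chunks and write each chunk
    # to its own file.
    reader = pd.read_csv("big_file.txt", sep=";", header=None, chunksize=1_000_000)

    for i, chunk in enumerate(reader):
        chunk.to_csv(f"part_{i:04d}.txt", sep=";", index=False, header=False)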
5
Solved
I am trying to leverage spark partitioning. I was trying to do something like
data.write.partitionBy("key").parquet("/location")
The issue here is that each partition creates a huge number of parquet file...
Rewarding asked 28/6, 2017 at 16:49
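A commonly suggested workaround is to repartition on the same column before writing, so each output directory is written by roughly one task; a sketch using the names from the question:

    # 'data', 'key' and '/location' are the DataFrame, column and path
    # from the question above.
    (
        data
        .repartition("key")
        .write
        .partitionBy("key")
        .parquet("/location")
    )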
3
I'm using the simple command: SELECT DISTINCT * FROM first_working_table; in HIVE 0.11, and I'm receiving the following error message:
FAILED: SemanticException TOK_ALLCOLREF is not supported in...
2
I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such.
The ...
Crackup asked 7/5, 2019 at 12:41
4
As the title says, I have one column (a Series) in pandas, and each row of it is a list like [0,1,2,3,4,5]. Each list has 6 numbers. I want to change this column into 6 columns, for example, the [0,1,2,3,...
Abbreviate asked 21/3, 2017 at 7:3
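A minimal pandas sketch of one way to do this; the frame and the column name 'col' are placeholders:

    import pandas as pd

    df = pd.DataFrame({"col": [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]})

    # Build a 6-column frame from the lists, then attach it to the original.
    expanded = pd.DataFrame(df["col"].tolist(), index=df.index)
    expanded.columns = [f"col_{i}" for i in range(6)]
    df = df.drop(columns="col").join(expanded)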
1
I am working with big data and I have a 70GB JSON file.
I am using the jsonlite library to load the file into memory.
I have tried an AWS EC2 x1.16xlarge machine (976 GB RAM) to perform this load but ...
11
Solved
I have trouble querying a table of > 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory.
This wo...
Dough asked 7/8, 2013 at 15:50
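If a Python route is acceptable, one way to avoid pulling everything into memory at once is a chunked read; a hedged sketch where the connection string, table name and the process() helper are placeholders:

    import pandas as pd
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
        "DATABASE=mydb;UID=user;PWD=secret"
    )

    # read_sql with chunksize yields DataFrames of at most 50,000 rows each.
    for chunk in pd.read_sql("SELECT * FROM big_table", conn, chunksize=50_000):
        process(chunk)  # placeholder for whatever is done with each chunk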
3
I want to use jq on a 50GB file. Needless to say, the machine's memory can't handle it. It's running out of memory.
I tried several options including --stream but it didn't help. Can someone tell me ...
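If dropping from jq to Python is an option, a streaming parser keeps memory use roughly constant; a hedged sketch that assumes the 50GB file is a top-level JSON array of objects, with the file name and field name as placeholders:

    import ijson

    # ijson parses the file incrementally instead of loading it whole.
    with open("big.json", "rb") as f:
        for record in ijson.items(f, "item"):  # one array element at a time
            print(record.get("id"))  # placeholder per-record handling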
5
Solved
I have a table in pg like so:
CREATE TABLE t (
a BIGSERIAL NOT NULL, -- 8 b
b SMALLINT, -- 2 b
c SMALLINT, -- 2 b
d REAL, -- 4 b
e REAL, -- 4 b
f REAL, -- 4 b
g INTEGER, -- 4 b
h REAL, -- ...
Cheston asked 3/6, 2010 at 13:44
7
Solved
I am working on a use case where I have to transfer data from RDBMS to HDFS. We have done the benchmarking of this case using sqoop and found out that we are able to transfer around 20GB data in 6-...
Marcellusmarcelo asked 10/5, 2016 at 8:41
1
I'm working with a dataset stored in S3 bucket (parquet files) consisting of a total of ~165 million records (with ~30 columns). Now, the requirement is to first groupby a certain ID column then ge...
Responsive asked 15/12, 2021 at 17:30
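A hedged PySpark sketch of the read-then-groupby shape, since the exact aggregation is cut off above; the bucket path, column names and the aggregate are placeholders:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("s3-groupby").getOrCreate()

    df = spark.read.parquet("s3a://my-bucket/dataset/")
    result = df.groupBy("id").agg(F.max("event_time").alias("latest_event_time"))
    result.write.mode("overwrite").parquet("s3a://my-bucket/output/")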
5
Solved
I tried to load the data into my table 'users' in LOCAL mode and I am using Cloudera on my VirtualBox. I have placed my file inside the /home/cloudera/Desktop/Hive/ directory but I am getting a...
3
Solved
As I read more about the Lake House architectural pattern and follow the demos from Databricks, I hardly see any discussion around Dimensional Modelling like in a traditional data warehouse (Kim...
Fretwell asked 15/11, 2021 at 22:40
3
Solved
I am using Standard SQL. Even though it's a basic query, it is still throwing errors. Any suggestions, please?
SELECT
fullVisitorId,
CONCAT(CAST(fullVisitorId AS string),CAST(visitId AS string)) AS ses...
Sacramentarian asked 1/9, 2017 at 17:37
4
I have a pandas.DataFrame with 3.8 Million rows and one column, and I'm trying to group them by index.
The index is the customer ID. I want to group the qty_liter by the index:
df = df.groupby(df.i...
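A minimal sketch of grouping by the index and summing qty_liter; the tiny frame stands in for the 3.8-million-row one:

    import pandas as pd

    df = pd.DataFrame(
        {"qty_liter": [1.0, 2.5, 3.0]},
        index=pd.Index(["c1", "c1", "c2"], name="customer_id"),
    )

    # level=0 groups by the (customer ID) index.
    totals = df.groupby(level=0)["qty_liter"].sum()
    print(totals)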
8
Solved
I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, any way I try to do this is taking days.
I tried putting them in a table and doing ...
Canarese asked 28/11, 2011 at 2:29
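A hedged sketch of the set-based approach via Python: load the IDs into a temporary table and delete with a single join, rather than millions of per-row deletes; connection details, table and column names are placeholders:

    import psycopg2
    from psycopg2.extras import execute_values

    ids_to_delete = [1, 2, 3]  # placeholder for the real list of ~2M IDs

    conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE ids_tmp (id BIGINT PRIMARY KEY)")
        execute_values(cur, "INSERT INTO ids_tmp (id) VALUES %s",
                       [(i,) for i in ids_to_delete])
        cur.execute("DELETE FROM big_table t USING ids_tmp d WHERE t.id = d.id")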