hadoop-partitioning Questions

5

I am running this command -- sudo -u hdfs hadoop fs -du -h /user | sort -nr and the output is not sorted in terms of gigs, Terabytes,gb I found this command - hdfs dfs -du -s /foo/bar/*tobe...
Coagulase asked 28/6, 2016 at 21:34

1

Solved

I am exploring windowing functions in Hive and I am able to understand the functionalities of all the UDFs. Although, I am not able to understand the partition by and order by that we use with the ...
Anselmi asked 29/4, 2019 at 18:34

2

Solved

As everyone knows partitioners in Spark have a huge performance impact on any "wide" operations, so it's usually customized in operations. I was experimenting with the following code: val rdd1 = ...
Sammie asked 30/4, 2015 at 20:49

4

Solved

I am trying to create a table in Hive CREATE TABLE BUCKET_TABLE AS SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll CLUSTERED BY (key) INTO 1000 BUCKETS; This sy...
Sabinasabine asked 22/7, 2014 at 20:41

2

A similar question has been asked here, but it does not address my question properly. I am having nearly 100 DataFrames, with each having atleast 200,000 rows and I need to join them, by doing a fu...
Scofield asked 14/3, 2019 at 13:51

0

Background: I am working with clinical data with a lot of different .csv/.txt files. All these files are patientID based, but with different fields. I am importing these files into DataFrames, whic...
Hefty asked 22/11, 2018 at 13:24

1

Solved

I have a created two dataframes in pyspark from my hive table as: data1 = spark.sql(""" SELECT ID, MODEL_NUMBER, MODEL_YEAR ,COUNTRY_CODE from MODEL_TABLE1 where COUNTRY_CODE in ('IND','CHN','US...
Chequer asked 4/10, 2018 at 8:44

1

I created a partitioned table: create table t1 ( amount double) partitioned by ( events_partition_key string) stored as paquet; added some data to tmp_table, where column 'events_partition_key' ...
Steatopygia asked 25/2, 2018 at 12:52

1

Solved

I am using Spark to write out data into partitions. Given a dataset with two columns (foo, bar), if I do df.write.mode("overwrite").format("csv").partitionBy("foo").save("/tmp/output"), I get an ou...
Amor asked 10/1, 2018 at 14:54

3

I am working on a hadoop project and after many visit to various blogs and reading the documentation, I realized I need to use secondary sort feature provided by hadoop framework. My input format i...
Similarity asked 4/8, 2016 at 16:54

5

Solved

Can any one explain me how secondary sorting works in hadoop ? Why must one use GroupingComparator and how does it work in hadoop ? I was going through the link given below and got doubt on how gr...
Astronomer asked 23/8, 2013 at 6:14

1

Solved

i have an external partitioned table named employee with partition(year,month,day), everyday a new file come and seat at the particular day location call for today's date it will be at 2016/10/13. ...
Inequality asked 14/10, 2016 at 13:7

2

How to recover partitions in easy fashion. Here is the scenario : Have 'n' partitions on existing external table 't' Dropped table 't' Recreated table 't' // Note : same table but with excl...
Carriecarrier asked 26/5, 2016 at 6:15

5

Solved

Does the Hadoop split the data based on the number of mappers set in the program? That is, having a data set of size 500MB, if the number of mappers is 200 (assuming that the Hadoop cluster allows ...
Gastronome asked 3/7, 2013 at 22:27

4

Solved

I would like to know why grouping comparator is used in secondary sort of mapreduce. According to the definitive guide example of secondary sorting We want the sort order for keys to be by year (...
Tytybald asked 6/2, 2013 at 11:54

2

Solved

Using hadoop multinode setup (1 mater , 1 salve) After starting up start-mapred.sh on master , i found below error in TT logs (Slave an) org.apache.hadoop.mapred.TaskTracker: Failed to get syst...
Overplus asked 29/7, 2013 at 5:27

1

Solved

I am trying to process XML files from hadoop, i got following error on invoking word-count job on XML files . 13/07/25 12:39:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000008...

1

Solved

I was able to successfully change the wordcount program in hadoop to suit my requirement. However, I have another situation where in I use the same key for 3 values. Let's say my input file is as b...
Execration asked 20/6, 2013 at 16:1

1

I want help understanding the algorithm. I ve pasted the algorithm explanation first and then my doubts. Algorithm:( For calculating the overlap between record pairs) Given a user defined paramet...
1

© 2022 - 2024 — McMap. All rights reserved.