data-cleaning Questions

3

Solved

I have a dummy variable like so:

df <- data.frame(year = seq(1990, 1997, 1), x = c(1, 0, 0, 0, 1, 1, 0, 0))

year x
1990 1
1991 0
1992 0
1993 0
1994 1
1995 1
1996 0
1997 0

I want to create a d...
Prevailing asked 16/12, 2023 at 18:36

4

Solved

In Python, I have a pandas DataFrame similar to the following:

Item  | shop1 | shop2 | shop3 | Category
----------------------------------------
Shoes |  45   |  50   |  53   | Clothes
TV    | 200   | 300   | 250   | Tec...
Vtarj asked 2/4, 2017 at 20:3
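The teaser is cut off, but the shape shown usually calls for melting the shop columns into long format. A minimal sketch, reconstructing a hypothetical version of the frame from the excerpt:

```python
import pandas as pd

# Hypothetical reconstruction of the frame shown in the excerpt.
df = pd.DataFrame({
    "Item": ["Shoes", "TV"],
    "shop1": [45, 200],
    "shop2": [50, 300],
    "shop3": [53, 250],
    "Category": ["Clothes", "Technology"],
})

# Melt the shop columns into (shop, price) pairs, keeping Item/Category as ids.
long_df = df.melt(id_vars=["Item", "Category"],
                  value_vars=["shop1", "shop2", "shop3"],
                  var_name="shop", value_name="price")
```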

2

Solved

I have a data frame data_df with multiple columns, one of which is c, which holds country names. How do I filter out the rows where c == None? My first attempt was to do this: countries_df = data_df...
Unawares asked 8/10, 2014 at 5:8
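A `c == None` comparison does not match pandas missing values; the usual filter is `notna()`. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical frame with a country column containing missing values.
data_df = pd.DataFrame({"c": ["France", None, "Japan", None],
                        "v": [1, 2, 3, 4]})

# Keep only rows where the country is present.
countries_df = data_df[data_df["c"].notna()]
```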

2

Solved

We are trying to process a tsv file. The fools who made it allowed newlines in some columns, which cause issues now. Luckily these column values with the newlines in them are always contained in do...
Prism asked 22/8, 2022 at 16:34
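Since the embedded newlines sit inside double-quoted fields, any quote-aware parser reassembles them. A minimal sketch with Python's csv module and a hypothetical two-line field:

```python
import csv
import io

# Hypothetical TSV where a quoted field spans two physical lines.
raw = 'id\tnote\n1\t"first line\nsecond line"\n2\tplain\n'

# csv handles newlines inside quoted fields transparently.
rows = list(csv.reader(io.StringIO(raw), delimiter="\t", quotechar='"'))
```

pandas' read_csv with sep="\t" applies the same quoting rules, so either route works.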

3

Solved

Question: How can you use R to remove all special characters from a dataframe, quickly and efficiently? Progress: This SO post details how to remove special characters. I can apply the gsub fu...
Memphis asked 17/4, 2018 at 20:18
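The question is about R's gsub, but the same column-wise regex substitution can be sketched in pandas for comparison (hypothetical data; the character class is an assumption about what counts as "special"):

```python
import pandas as pd

# Hypothetical frame with stray punctuation; keep only letters, digits, spaces.
df = pd.DataFrame({"a": ["he!llo", "wo#rld"], "b": ["12.3%", "ok"]})

# Apply the same regex replacement to every column.
cleaned = df.apply(lambda col: col.str.replace(r"[^A-Za-z0-9 ]", "", regex=True))
```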

8

I have a database where old code likes to insert '0000-00-00' in Date and DateTime columns instead of a real date. So I have the following two questions: Is there anything that I could do on the ...
Gigantean asked 8/10, 2010 at 15:12
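One hedged option on the application side (it does not repair the database itself) is to coerce the '0000-00-00' sentinel to NaT when the data reaches pandas:

```python
import pandas as pd

# Hypothetical column containing the zero-date sentinel.
s = pd.Series(["2010-10-08", "0000-00-00", "2011-01-02"])

# errors="coerce" turns unparseable sentinels into NaT instead of raising.
dates = pd.to_datetime(s, errors="coerce")
```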

6

Solved

I am trying to pivot a table that has headings and sub-headings, so that the headings go into a column "date", and the subheadings are two columns instead of repeating. Here is an example...
Unwilled asked 5/1, 2022 at 20:25

3

Solved

I have a dataset with people's complete age as strings (e.g., "10 years 8 months 23 days") in R, and I need to transform it into a numeric variable that makes sense. I'm thinking about converti...
Vane asked 1/12, 2021 at 20:59
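The question is R, but the parsing logic is the same in any language: pull out each component and combine them into fractional years. A hypothetical sketch (the 365.25-day year is an assumption):

```python
import re

# Hypothetical parser: "10 years 8 months 23 days" -> age in years as a float.
def age_in_years(text):
    # Pull each component if present; a missing part defaults to 0.
    def part(unit):
        m = re.search(r"(\d+)\s*" + unit, text)
        return int(m.group(1)) if m else 0
    y, mo, d = part("year"), part("month"), part("day")
    return y + mo / 12 + d / 365.25

age = age_in_years("10 years 8 months 23 days")
```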

2

Solved

I want to combine data frames in long format with different length because of the time variable (imbalanced panel data): set.seed(63) #function to create a data frame that includes id, time and x f...
Catercornered asked 26/11, 2021 at 18:49
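The question uses R, but the idea carries over directly: in long format, panels with different numbers of periods per id can simply be row-bound. A pandas sketch with hypothetical data:

```python
import pandas as pd

# Two hypothetical panels with different numbers of time periods per id.
a = pd.DataFrame({"id": 1, "time": [1, 2, 3], "x": [0.1, 0.2, 0.3]})
b = pd.DataFrame({"id": 2, "time": [1, 2], "x": [0.4, 0.5]})

# Row-bind them; unequal lengths are fine in long format (imbalanced panel).
panel = pd.concat([a, b], ignore_index=True).sort_values(["id", "time"])
```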

7

Solved

A list with attributes of persons is loaded into pandas dataframe df2. For cleanup I want to replace the value zero (0 or '0') by np.nan.

df2.dtypes
ID        object
Name      object
Weight    float64
Height    float64
Bo...
Svetlana asked 31/7, 2017 at 13:1
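`replace` accepts a list of targets, so the numeric 0 and the string '0' can be swapped for NaN in one pass. A minimal sketch with hypothetical columns:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame: numeric zeros and string zeros both mean missing.
df2 = pd.DataFrame({"Name": ["A", "0", "C"],
                    "Weight": [80.0, 0.0, 72.5]})

# Replace both the numeric 0 and the string '0' with NaN.
df2 = df2.replace([0, "0"], np.nan)
```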

3

Solved

I am doing a data cleaning exercise on python and the text that I am cleaning contains Italian words which I would like to remove. I have been searching online whether I would be able to do this on...
Mikkimiko asked 22/12, 2016 at 19:0

3

Solved

I'm still relatively new to Pyspark. I use version 2.1.0. I'm trying to clean some data on a much larger data set. I've successfully used several techniques such as "dropDuplicates" along with subs...
Torrance asked 8/4, 2017 at 15:52

5

Solved

I have a dataframe that contains columns named id, country_name, location and total_deaths. While doing the data cleaning process, I came across a value in a row that has '\r' attached. Once I com...
Stoned asked 11/5, 2016 at 11:13
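A stray carriage return is just a literal substring, so a plain (non-regex) replacement removes it. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical column where some values carry a trailing carriage return.
df = pd.DataFrame({"country_name": ["Kenya\r", "Peru", "Chile\r"]})

# Remove the stray \r; str.strip() would also drop other edge whitespace.
df["country_name"] = df["country_name"].str.replace("\r", "", regex=False)
```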

7

Solved

I have a large DataFrame that I need to clean, as a sample please look at this dataframe: import pandas as pd cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'], 'P...
Pricefixing asked 19/3, 2021 at 13:50

1

Solved

I am working on a pyspark dataframe as shown below:

+-------+--------------------------------------------------+
|     id|                                             words|
+-------+--------------------------------------------------+
|1475569|[...
Lhasa asked 25/2, 2021 at 11:52

2

I have a CSV datafile called test_20171122 Often, datasets that I work with were originally in Accounting or Currency format in Excel and later converted to a CSV file. I am looking into the o...
Performance asked 22/11, 2017 at 22:48
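Excel's Accounting/Currency formats typically export as strings like "$1,234.56". One hedged option after loading the CSV is to strip the symbol and thousands separators, then cast:

```python
import pandas as pd

# Hypothetical Excel-exported Accounting-format values.
s = pd.Series(["$1,234.56", "$78.90", "$1,000,000.00"])

# Strip the currency symbol and thousands separators, then cast to float.
numeric = s.str.replace(r"[$,]", "", regex=True).astype(float)
```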

5

Solved

I extracted tweets from twitter using the twitteR package and saved them into a text file. I have carried out the following on the corpus xx<-tm_map(xx,removeNumbers, lazy=TRUE, 'mc.cores=1'...
Vibrato asked 10/7, 2015 at 19:4

3

Solved

I am dealing with pandas DataFrames like this:

   id    x
0   1   10
1   1   20
2   2  100
3   2  200
4   1  NaN
5   2  NaN
6   1  300
7   1  NaN

I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row w...
Autonomy asked 2/5, 2013 at 18:51
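One way to fill each gap from the previous non-missing value of the same id is a grouped forward fill. A sketch reconstructing the frame from the excerpt:

```python
import numpy as np
import pandas as pd

# The frame from the excerpt: NaNs should be filled from the same id only.
df = pd.DataFrame({"id": [1, 1, 2, 2, 1, 2, 1, 1],
                   "x": [10, 20, 100, 200, np.nan, np.nan, 300, np.nan]})

# Forward-fill within each id group, never across groups.
df["x"] = df.groupby("id")["x"].ffill()
```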

3

Solved

I have a dataframe, df, that has some columns of type float64, while the others are of object. Due to the mixed nature, I cannot use df.fillna('unknown') #getting error "ValueError: could not con...
Pauly asked 12/2, 2014 at 6:9
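One hedged workaround is to fill only the object-typed columns, leaving the float64 columns untouched. A minimal sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed frame: fill text columns with 'unknown', leave floats alone.
df = pd.DataFrame({"name": ["a", np.nan], "score": [1.5, np.nan]})

# Select only the object columns and fill those.
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].fillna("unknown")
```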

5

Solved

I am working on a research project, and one of the tables is entered in a way that is not quite suitable for analysis yet, so I am trying to reorganize it. Currently, each row is a test-taker, and ...
Sprit asked 18/9, 2019 at 15:37

2

Solved

In my dataframe, I get a '2' written over my index column's name. When I check the column names it doesn't show up there, but df.columns gives this as output. I don't know how to remove that '...
Gressorial asked 31/8, 2019 at 12:45
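A label printed above the index/columns is usually the columns-axis name, which can simply be cleared. A sketch that simulates the stray label on hypothetical data:

```python
import pandas as pd

# Hypothetical frame whose columns axis carries a leftover name like '2'.
df = pd.DataFrame({"a": [1], "b": [2]})
df.columns.name = "2"   # simulate the stray label

# Clear the columns-axis name; df.rename_axis(columns=None) is an alternative.
df.columns.name = None
```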

1

I am cleaning a column in my data frame, Sumcription, and am trying to do 3 things: tokenize, lemmatize, and remove stop words.

import spacy
nlp = spacy.load('en_core_web_sm', parser=False, entity=F...
Conjunctiva asked 23/4, 2019 at 18:6

5

Solved

I want to remove the hashtag symbol ('#') and the underscores that separate words ('_'). Example: "this tweet is example #key1_key2_key3"; the result I want: "this tweet is example key1 key2 key3" ...
Pistole asked 8/2, 2018 at 8:45
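Since '#' should vanish and '_' should become a space, two plain string replacements are enough. A minimal sketch on the example from the excerpt:

```python
# Drop '#' entirely and turn '_' separators into spaces.
tweet = "this tweet is example #key1_key2_key3"
clean = tweet.replace("#", "").replace("_", " ")
```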

5

Solved

I want to sum up rows in a dataframe which have the same row key. The purpose is to shrink the data set size. For example, if the data frame looks like this:

Fruit   Count
Apple   10...
Gearldinegearshift asked 5/2, 2019 at 3:1
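Collapsing duplicate keys is a groupby-and-sum. A minimal sketch with hypothetical counts:

```python
import pandas as pd

# Hypothetical frame with repeated keys.
df = pd.DataFrame({"Fruit": ["Apple", "Orange", "Apple"],
                   "Count": [10, 5, 4]})

# Collapse duplicate keys by summing their counts.
totals = df.groupby("Fruit", as_index=False)["Count"].sum()
```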

3

Solved

I'm trying to count totals for goals, primary assists, and secondary assists for each player. My problem is that I can't get my head around the logic to do that, as the data I want to summarize by ...
Yolande asked 18/7, 2018 at 15:10
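The excerpt doesn't show the table, but assuming each goal event names the scorer and up to two assisting players in separate columns, per-player totals can be built by counting each role column and aligning the results. A hypothetical sketch:

```python
import pandas as pd

# Hypothetical event table: each row names the scorer and up to two assists.
events = pd.DataFrame({
    "goal":    ["Ann", "Ben", "Ann"],
    "assist1": ["Ben", "Ann", None],
    "assist2": [None, "Cid", None],
})

# Count how often each player appears in each role, then align into one table.
totals = pd.DataFrame({col: events[col].value_counts()
                       for col in events.columns}).fillna(0).astype(int)
```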

© 2022 - 2025 — McMap. All rights reserved.