data-cleaning Questions

3

Solved

I have a dummy variable like so:

df <- data.frame(year = seq(1990, 1997, 1), x = c(1, 0, 0, 0, 1, 1, 0, 0))

year x
1990 1
1991 0
1992 0
1993 0
1994 1
1995 1
1996 0
1997 0

I want to create a d...
Prevailing asked 16/12, 2023 at 18:36

4

Solved

In Python, I have a pandas DataFrame similar to the following:

Item  | shop1 | shop2 | shop3 | Category
----------------------------------------
Shoes |  45   |  50   |  53   | Clothes
TV    | 200   | 300   | 250   | Tec...
Vtarj asked 2/4, 2017 at 20:3
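The teaser is cut off, but the shape shown usually calls for melting the shop columns into long format. A minimal sketch, reconstructing a hypothetical version of the frame from the excerpt:

```python
import pandas as pd

# Hypothetical reconstruction of the frame shown in the excerpt.
df = pd.DataFrame({
    "Item": ["Shoes", "TV"],
    "shop1": [45, 200],
    "shop2": [50, 300],
    "shop3": [53, 250],
    "Category": ["Clothes", "Technology"],
})

# Melt the shop columns into (shop, price) pairs, keeping Item/Category as ids.
long_df = df.melt(id_vars=["Item", "Category"],
                  value_vars=["shop1", "shop2", "shop3"],
                  var_name="shop", value_name="price")
```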

2

Solved

I have a data frame data_df with multiple columns, one of which is c, which holds country names. How do I filter out the rows where c == None? My first attempt was to do this: countries_df = data_df...
Unawares asked 8/10, 2014 at 5:8
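A `c == None` comparison does not match pandas missing values; the usual filter is `notna()`. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical frame with a country column containing missing values.
data_df = pd.DataFrame({"c": ["France", None, "Japan", None],
                        "v": [1, 2, 3, 4]})

# Keep only rows where the country is present.
countries_df = data_df[data_df["c"].notna()]
```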

2

Solved

We are trying to process a tsv file. The fools who made it allowed newlines in some columns, which cause issues now. Luckily these column values with the newlines in them are always contained in do...
Prism asked 22/8, 2022 at 16:34
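Since the embedded newlines sit inside double-quoted fields, any quote-aware parser reassembles them. A minimal sketch with Python's csv module and a hypothetical two-line field:

```python
import csv
import io

# Hypothetical TSV where a quoted field spans two physical lines.
raw = 'id\tnote\n1\t"first line\nsecond line"\n2\tplain\n'

# csv handles newlines inside quoted fields transparently.
rows = list(csv.reader(io.StringIO(raw), delimiter="\t", quotechar='"'))
```

pandas' read_csv with sep="\t" applies the same quoting rules, so either route works.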

3

Solved

Question: How can you use R to remove all special characters from a dataframe, quickly and efficiently? Progress: This SO post details how to remove special characters. I can apply the gsub fu...
Memphis asked 17/4, 2018 at 20:18
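The question is about R's gsub, but the same column-wise regex substitution can be sketched in pandas for comparison (hypothetical data; the character class is an assumption about what counts as "special"):

```python
import pandas as pd

# Hypothetical frame with stray punctuation; keep only letters, digits, spaces.
df = pd.DataFrame({"a": ["he!llo", "wo#rld"], "b": ["12.3%", "ok"]})

# Apply the same regex replacement to every column.
cleaned = df.apply(lambda col: col.str.replace(r"[^A-Za-z0-9 ]", "", regex=True))
```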

8

I have a database where old code likes to insert '0000-00-00' in Date and DateTime columns instead of a real date. So I have the following two questions: Is there anything that I could do on the ...
Gigantean asked 8/10, 2010 at 15:12
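One hedged option on the application side (it does not repair the database itself) is to coerce the '0000-00-00' sentinel to NaT when the data reaches pandas:

```python
import pandas as pd

# Hypothetical column containing the zero-date sentinel.
s = pd.Series(["2010-10-08", "0000-00-00", "2011-01-02"])

# errors="coerce" turns unparseable sentinels into NaT instead of raising.
dates = pd.to_datetime(s, errors="coerce")
```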

6

Solved

I am trying to pivot a table that has headings and sub-headings, so that the headings go into a column "date", and the subheadings are two columns instead of repeating. Here is an example...
Unwilled asked 5/1, 2022 at 20:25

3

Solved

I have a dataset with people's complete age as strings (e.g., "10 years 8 months 23 days") in R, and I need to transform it into a numeric variable that makes sense. I'm thinking about converti...
Vane asked 1/12, 2021 at 20:59
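The question is R, but the parsing logic is the same in any language: pull out each component and combine them into fractional years. A hypothetical sketch (the 365.25-day year is an assumption):

```python
import re

# Hypothetical parser: "10 years 8 months 23 days" -> age in years as a float.
def age_in_years(text):
    # Pull each component if present; a missing part defaults to 0.
    def part(unit):
        m = re.search(r"(\d+)\s*" + unit, text)
        return int(m.group(1)) if m else 0
    y, mo, d = part("year"), part("month"), part("day")
    return y + mo / 12 + d / 365.25

age = age_in_years("10 years 8 months 23 days")
```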

2

Solved

I want to combine data frames in long format with different length because of the time variable (imbalanced panel data): set.seed(63) #function to create a data frame that includes id, time and x f...
Catercornered asked 26/11, 2021 at 18:49
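The question uses R, but the idea carries over directly: in long format, panels with different numbers of periods per id can simply be row-bound. A pandas sketch with hypothetical data:

```python
import pandas as pd

# Two hypothetical panels with different numbers of time periods per id.
a = pd.DataFrame({"id": 1, "time": [1, 2, 3], "x": [0.1, 0.2, 0.3]})
b = pd.DataFrame({"id": 2, "time": [1, 2], "x": [0.4, 0.5]})

# Row-bind them; unequal lengths are fine in long format (imbalanced panel).
panel = pd.concat([a, b], ignore_index=True).sort_values(["id", "time"])
```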

7

Solved

A list with attributes of persons is loaded into pandas dataframe df2. For cleanup I want to replace the value zero (0 or '0') by np.nan.

df2.dtypes
ID        object
Name      object
Weight    float64
Height    float64
Bo...
Svetlana asked 31/7, 2017 at 13:1
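`replace` accepts a list of targets, so the numeric 0 and the string '0' can be swapped for NaN in one pass. A minimal sketch with hypothetical columns:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame: numeric zeros and string zeros both mean missing.
df2 = pd.DataFrame({"Name": ["A", "0", "C"],
                    "Weight": [80.0, 0.0, 72.5]})

# Replace both the numeric 0 and the string '0' with NaN.
df2 = df2.replace([0, "0"], np.nan)
```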

3

Solved

I am doing a data cleaning exercise on python and the text that I am cleaning contains Italian words which I would like to remove. I have been searching online whether I would be able to do this on...
Mikkimiko asked 22/12, 2016 at 19:0

3

Solved

I'm still relatively new to Pyspark. I use version 2.1.0. I'm trying to clean some data on a much larger data set. I've successfully used several techniques such as "dropDuplicates" along with subs...
Torrance asked 8/4, 2017 at 15:52

5

Solved

I have a dataframe that contains columns named id, country_name, location and total_deaths. While doing the data cleaning process, I came across a value in a row that has '\r' attached. Once I com...
Stoned asked 11/5, 2016 at 11:13
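A stray carriage return is just a literal substring, so a plain (non-regex) replacement removes it. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical column where some values carry a trailing carriage return.
df = pd.DataFrame({"country_name": ["Kenya\r", "Peru", "Chile\r"]})

# Remove the stray \r; str.strip() would also drop other edge whitespace.
df["country_name"] = df["country_name"].str.replace("\r", "", regex=False)
```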

7

Solved

I have a large DataFrame that I need to clean, as a sample please look at this dataframe: import pandas as pd cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'], 'P...
Pricefixing asked 19/3, 2021 at 13:50

1

Solved

I am working on a pyspark dataframe as shown below:

+-------+--------------------------------------------------+
|     id|                                             words|
+-------+--------------------------------------------------+
|1475569|[...
Lhasa asked 25/2, 2021 at 11:52

2

I have a CSV datafile called test_20171122 Often, datasets that I work with were originally in Accounting or Currency format in Excel and later converted to a CSV file. I am looking into the o...
Performance asked 22/11, 2017 at 22:48
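Excel's Accounting/Currency formats typically export as strings like "$1,234.56". One hedged option after loading the CSV is to strip the symbol and thousands separators, then cast:

```python
import pandas as pd

# Hypothetical Excel-exported Accounting-format values.
s = pd.Series(["$1,234.56", "$78.90", "$1,000,000.00"])

# Strip the currency symbol and thousands separators, then cast to float.
numeric = s.str.replace(r"[$,]", "", regex=True).astype(float)
```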

5

Solved

I extracted tweets from twitter using the twitteR package and saved them into a text file. I have carried out the following on the corpus xx<-tm_map(xx,removeNumbers, lazy=TRUE, 'mc.cores=1'...
Vibrato asked 10/7, 2015 at 19:4

3

Solved

I am dealing with pandas DataFrames like this:

   id    x
0   1   10
1   1   20
2   2  100
3   2  200
4   1  NaN
5   2  NaN
6   1  300
7   1  NaN

I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row w...
Autonomy asked 2/5, 2013 at 18:51
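One way to fill each gap from the previous non-missing value of the same id is a grouped forward fill. A sketch reconstructing the frame from the excerpt:

```python
import numpy as np
import pandas as pd

# The frame from the excerpt: NaNs should be filled from the same id only.
df = pd.DataFrame({"id": [1, 1, 2, 2, 1, 2, 1, 1],
                   "x": [10, 20, 100, 200, np.nan, np.nan, 300, np.nan]})

# Forward-fill within each id group, never across groups.
df["x"] = df.groupby("id")["x"].ffill()
```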

3

Solved

I have a dataframe, df, that has some columns of type float64, while the others are of object. Due to the mixed nature, I cannot use df.fillna('unknown') #getting error "ValueError: could not con...
Pauly asked 12/2, 2014 at 6:9
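One hedged workaround is to fill only the object-typed columns, leaving the float64 columns untouched. A minimal sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed frame: fill text columns with 'unknown', leave floats alone.
df = pd.DataFrame({"name": ["a", np.nan], "score": [1.5, np.nan]})

# Select only the object columns and fill those.
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].fillna("unknown")
```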

5

Solved

I am working on a research project, and one of the tables is entered in a way that is not quite suitable for analysis yet, so I am trying to reorganize it. Currently, each row is a test-taker, and ...
Sprit asked 18/9, 2019 at 15:37

2

Solved

In my dataframe, I get a '2' written over my index column's name. When I check the column names it doesn't show up there, but df.columns gives this as output. I don't know how to remove that '...
Gressorial asked 31/8, 2019 at 12:45
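A label printed above the index/columns is usually the columns-axis name, which can simply be cleared. A sketch that simulates the stray label on hypothetical data:

```python
import pandas as pd

# Hypothetical frame whose columns axis carries a leftover name like '2'.
df = pd.DataFrame({"a": [1], "b": [2]})
df.columns.name = "2"   # simulate the stray label

# Clear the columns-axis name; df.rename_axis(columns=None) is an alternative.
df.columns.name = None
```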

1

I am cleaning a column in my data frame, Sumcription, and am trying to do 3 things: tokenize, lemmatize, and remove stop words.

import spacy
nlp = spacy.load('en_core_web_sm', parser=False, entity=F...
Conjunctiva asked 23/4, 2019 at 18:6

5

Solved

I want to remove the hashtag symbol ('#') and the underscores that separate words ('_'). Example: "this tweet is example #key1_key2_key3"; the result I want: "this tweet is example key1 key2 key3" ...
Pistole asked 8/2, 2018 at 8:45
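Since '#' should vanish and '_' should become a space, two plain string replacements are enough. A minimal sketch on the example from the excerpt:

```python
# Drop '#' entirely and turn '_' separators into spaces.
tweet = "this tweet is example #key1_key2_key3"
clean = tweet.replace("#", "").replace("_", " ")
```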

5

Solved

I want to sum up rows in a dataframe which have the same row key. The purpose is to shrink the data set size. For example, if the data frame looks like this:

Fruit   Count
Apple   10...
Gearldinegearshift asked 5/2, 2019 at 3:1
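Collapsing duplicate keys is a groupby-and-sum. A minimal sketch with hypothetical counts:

```python
import pandas as pd

# Hypothetical frame with repeated keys.
df = pd.DataFrame({"Fruit": ["Apple", "Orange", "Apple"],
                   "Count": [10, 5, 4]})

# Collapse duplicate keys by summing their counts.
totals = df.groupby("Fruit", as_index=False)["Count"].sum()
```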

3

Solved

I'm trying to count totals for goals, primary assists, and secondary assists for each player. My problem is that I can't get my head around the logic to do that, as the data I want to summarize by ...
Yolande asked 18/7, 2018 at 15:10
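The excerpt doesn't show the table, but assuming each goal event names the scorer and up to two assisting players in separate columns, per-player totals can be built by counting each role column and aligning the results. A hypothetical sketch:

```python
import pandas as pd

# Hypothetical event table: each row names the scorer and up to two assists.
events = pd.DataFrame({
    "goal":    ["Ann", "Ben", "Ann"],
    "assist1": ["Ben", "Ann", None],
    "assist2": [None, "Cid", None],
})

# Count how often each player appears in each role, then align into one table.
totals = pd.DataFrame({col: events[col].value_counts()
                       for col in events.columns}).fillna(0).astype(int)
```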

© 2022 - 2025 — McMap. All rights reserved.