How can I trim leading and trailing white space?
B

15

421

I am having some trouble with leading and trailing white space in a data.frame.

For example, I look at a specific row in a data.frame based on a certain condition:

> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)] 



[1] codeHelper     country        dummyLI    dummyLMI       dummyUMI       

[6] dummyHInonOECD dummyHIOECD    dummyOECD      

<0 rows> (or 0-length row.names)

I was wondering why I didn't get the expected output since the country Austria obviously existed in my data.frame. After looking through my code history and trying to figure out what went wrong I tried:

> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
   codeHelper  country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18        AUT Austria        0        0        0              0           1
   dummyOECD
18         1

All I have changed in the command is an additional white space after Austria.

Further annoying problems obviously arise. For example, when I like to merge two frames based on the country column. One data.frame uses "Austria " while the other frame has "Austria". The matching doesn't work.

  1. Is there a nice way to 'show' the white space on my screen so that I am aware of the problem?
  2. And can I remove the leading and trailing white space in R?

So far I have used a simple Perl script to remove the white space, but it would be nice if I could do it inside R.

Benedikt answered 14/2, 2010 at 12:44 Comment(2)
I just saw that sub() uses the Perl notation as well. Sorry about that. I am going to try to use the function. But for my first question I don't have a solution yet.Benedikt
As hadley pointed out, the regex "^\\s+|\\s+$" will identify leading and trailing whitespace, so x <- gsub("^\\s+|\\s+$", "", x) trims both. Many of R's read functions also have the option strip.white = FALSE.Tanhya
Y
489

Probably the best way is to handle the trailing white spaces when you read your data file. If you use read.csv or read.table you can set the parameter strip.white=TRUE.
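As a rough illustration of the read-time approach, the sketch below reads the same data with and without strip.white. The inline text passed via `text =` stands in for a real file, and the column names and values are made up for the example.

```r
# Hypothetical inline data standing in for a CSV file; note the stray spaces.
csv_text <- "code,country\nAUT, Austria \nESP,Spain  \n"

raw   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE, strip.white = TRUE)

# Quoted printing makes the difference visible
print(raw$country, quote = TRUE)
print(clean$country, quote = TRUE)
```

Note that the fields here are unquoted; padding inside quoted fields may need to be trimmed after reading.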

If you want to clean strings afterwards you could use one of these functions:

# Returns string without leading white space
trim.leading <- function (x)  sub("^\\s+", "", x)

# Returns string without trailing white space
trim.trailing <- function (x) sub("\\s+$", "", x)

# Returns string without leading or trailing white space
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

To use one of these functions on myDummy$country:

 myDummy$country <- trim(myDummy$country)

To 'show' the white space you could use:

 paste(myDummy$country)

which will show you the strings surrounded by quotation marks (") making white spaces easier to spot.
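Another way to spot the problem, sketched here with a made-up vector rather than the real myDummy$country, is to flag values that start or end with white space and to compare character counts.

```r
# Made-up example vector; the second and third entries are padded.
country <- c("Austria", "Austria ", " Spain")

# TRUE where a value starts or ends with white space
has_padding <- grepl("^\\s|\\s$", country)

# Show only the offending values, quoted so the padding is visible
print(country[has_padding], quote = TRUE)

# Character counts are another quick check
nchar(country)
```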

Yasminyasmine answered 14/2, 2010 at 13:13 Comment(8)
As hadley pointed out, the regex "^\\s+|\\s+$" will identify leading and trailing whitespace, so x <- gsub("^\\s+|\\s+$", "", x) trims both. Many of R's read functions also have the option strip.white = FALSE.Tanhya
@Jay: Thanks for the hint. I changed the regexps in my answer to use the shorter "\\s" instead of "[ \t]".Yasminyasmine
See also str_trim in the stringr package.Babita
is there a trim param in read.spss? I tried trim_values = TRUE and trim.factor.names = TRUE but to no avail...Bradfield
FYI: I trimmed all trailing spaces of the entire dataframe using apply: df_trimmed <- as.data.frame(apply(df,2,function (x) sub("\\s+$", "", x)))Bradfield
Unfortunately, strip.white=TRUE only works on non-quoted strings.Sweptback
There is a much easier way to trim whitespace in R 3.2.0. See the next answer!Arundell
Also need to include stringsAsFactors = FALSE when using read.csv, as this won't work on factors. trimws() detailed below will work regardless, but by silently converting factor to character. Both useful answers though, thanks!Dispatcher
J
609

As of R 3.2.0 a new function was introduced for removing leading/trailing white spaces:

trimws()

See: Remove Leading/Trailing Whitespace
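A short usage sketch, assuming R 3.2.0 or later; the input vector is illustrative only. The `which` argument selects the side to trim.

```r
x <- c("  both  ", "left only", "  right")   # illustrative input

trimws(x)                   # default: which = "both"
trimws(x, which = "left")   # leading white space only
trimws(x, which = "right")  # trailing white space only

# Applied to the column from the question (assumes a character column):
# myDummy$country <- trimws(myDummy$country)
```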

Jaddan answered 13/5, 2015 at 9:26 Comment(6)
It depends on the definition of a best answer. This answer is nice to know of (+1), but in a quick test it wasn't as fast as some of the alternatives out there.Archdeaconry
doesn't seem to work for multi-line strings, despite \n being in the covered character class. trimws("SELECT\n blah\n FROM foo;") still contains newlines.Gestalt
@Gestalt That is the expected behaviour. In the string you pass to trimws there are no leading or trailing white spaces. If you want to remove leading and trailing white spaces from each of the lines in the string, you will first have to split it up. Like this: trimws(strsplit("SELECT\n blah\n FROM foo;", "\n")[[1]])Jaddan
Although a built-in function for recent versions of R, it does 'just' do a PERL style regex under the hood. I might have expected some fast custom C code to do this. Maybe the trimws regex is fast enough. stringr::str_trim (based on stringi) is also interesting in that it uses a completely independent internationalized string library. You'd think whitespace would be immune from problems with internationalization, but I wonder. I've never seen a comparison of results of native vs stringr/stringi or any benchmarks.Marji
For some reason I could not figure out, trimws() did not remove my leading white spaces, while Bryan's trim.strings() below (only 1 vote, mine!) did...Smithsonite
@JackWasey I've added a benchmark - the example might be somewhat simple, but it should give an idea about the performanceTwofaced
C
96

To manipulate the white space, use str_trim() in the stringr package. The package has a manual dated Feb 15, 2013 and is on CRAN. The function also handles string vectors.

install.packages("stringr", dependencies=TRUE)
require(stringr)
example(str_trim)
d4$clean2<-str_trim(d4$V2)

(Credit goes to commenter: R. Cotton)

Chromic answered 21/2, 2013 at 16:30 Comment(3)
This solution removed some mutant whitespace that trimws() was unable to remove.Vicereine
@RichardTelford could you provide an example? Because that might be considered a bug in trimws.Jaddan
Thanks for the require(stringr) their documentation or examples did not have this required line of code!Mono
C
28

A simple function to remove leading and trailing whitespace:

trim <- function( x ) {
  gsub("(^[[:space:]]+|[[:space:]]+$)", "", x)
}

Usage:

> text = "   foo bar  baz 3 "
> trim(text)
[1] "foo bar  baz 3"
Cartel answered 19/2, 2014 at 13:37 Comment(0)
R
13

Ad 1) To see white spaces you could directly call print.data.frame with modified arguments:

print(head(iris), quote=TRUE)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
# 1        "5.1"       "3.5"        "1.4"       "0.2" "setosa"
# 2        "4.9"       "3.0"        "1.4"       "0.2" "setosa"
# 3        "4.7"       "3.2"        "1.3"       "0.2" "setosa"
# 4        "4.6"       "3.1"        "1.5"       "0.2" "setosa"
# 5        "5.0"       "3.6"        "1.4"       "0.2" "setosa"
# 6        "5.4"       "3.9"        "1.7"       "0.4" "setosa"

See also ?print.data.frame for other options.

Roveover answered 15/2, 2010 at 10:0 Comment(0)
E
11

Use grep or grepl to find observations with white spaces and sub to get rid of them.

names<-c("Ganga Din\t", "Shyam Lal", "Bulbul ")
grep("[[:space:]]+$", names)
[1] 1 3
grepl("[[:space:]]+$", names)
[1]  TRUE FALSE  TRUE
sub("[[:space:]]+$", "", names)
[1] "Ganga Din" "Shyam Lal" "Bulbul"
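One detail worth illustrating when the leading and trailing patterns are combined with `|`: sub replaces only the first match, so gsub is needed to trim both ends. A minimal sketch with a made-up string:

```r
x <- "  padded on both sides  "   # illustrative input

sub_result  <- sub("^\\s+|\\s+$", "", x)    # replaces only the first match (the leading run)
gsub_result <- gsub("^\\s+|\\s+$", "", x)   # replaces every match (leading and trailing)

print(c(sub_result, gsub_result), quote = TRUE)
```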
Estuarine answered 14/2, 2010 at 14:13 Comment(3)
Or, a little more succinctly, "^\\s+|\\s+$"Appendix
Just wanted to point out, that one will have to use gsub instead of sub with hadley's regexp. With sub it will strip trailing whitespace only if there is no leading whitespace...Yasminyasmine
Didn't know you could use \s etc. with perl=FALSE. The docs say that POSIX syntax is used in that case, but the syntax accepted is actually a superset defined by the TRE regex library laurikari.net/tre/documentation/regex-syntaxEstuarine
T
11

Removing leading and trailing blanks might be achieved through the trim() function from the gdata package as well:

require(gdata)
example(trim)

Usage example:

> trim("   Remove leading and trailing blanks    ")
[1] "Remove leading and trailing blanks"

I would have preferred to add this as a comment on user56's answer, but I am not yet able to comment, so I am posting it as a separate answer.

Tub answered 15/1, 2015 at 0:29 Comment(0)
C
6

Another related problem occurs if you have multiple spaces in between inputs:

> a <- "  a string         with lots   of starting, inter   mediate and trailing   whitespace     "

You can then easily split this string into "real" tokens by passing a regular expression as the split argument:

> strsplit(a, split=" +")
[[1]]
 [1] ""           "a"          "string"     "with"       "lots"
 [6] "of"         "starting,"  "inter"      "mediate"    "and"
[11] "trailing"   "whitespace"

Note that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
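To avoid the empty first element, one option (a sketch, assuming R 3.2.0 or later for trimws) is to trim the string before splitting it. The input is a shortened, made-up variant of the string above.

```r
a <- "  a string   with   extra   whitespace   "   # illustrative input

# Trim first, then split on runs of white space
tokens <- strsplit(trimws(a), "\\s+")[[1]]
tokens
```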

Cordellcorder answered 13/8, 2015 at 11:13 Comment(0)
Y
6

Another option is to use the stri_trim function from the stringi package which defaults to removing leading and trailing whitespace:

> library(stringi)
> x <- c("  leading space","trailing space   ")
> stri_trim(x)
[1] "leading space"  "trailing space"

For only removing leading whitespace, use stri_trim_left. For only removing trailing whitespace, use stri_trim_right. When you want to remove other leading or trailing characters, you have to specify that with pattern =.

See also ?stri_trim for more info.

Yuletide answered 14/1, 2016 at 16:48 Comment(0)
S
5

I created a trim.strings() function to trim leading and/or trailing whitespace as:

# Arguments:    x - character vector
#            side - side(s) on which to remove whitespace 
#                   default : "both"
#                   possible values: c("both", "leading", "trailing")

trim.strings <- function(x, side = "both") {
  # Fall back to "both" for any unrecognised value of side
  if (is.na(match(side, c("both", "leading", "trailing")))) {
    side <- "both"
  }
  if (side == "leading") {
    sub("^\\s+", "", x)
  } else if (side == "trailing") {
    sub("\\s+$", "", x)
  } else {
    gsub("^\\s+|\\s+$", "", x)
  }
}

For illustration,

a <- c("   ABC123 456    ", " ABC123DEF          ")

# returns string without leading and trailing whitespace
trim.strings(a)
# [1] "ABC123 456" "ABC123DEF" 

# returns string without leading whitespace
trim.strings(a, side = "leading")
# [1] "ABC123 456    "      "ABC123DEF          "

# returns string without trailing whitespace
trim.strings(a, side = "trailing")
# [1] "   ABC123 456" " ABC123DEF"   
Samson answered 4/5, 2016 at 10:27 Comment(0)
A
4

Use dplyr/tidyverse mutate_all with str_trim to trim the entire data frame:

myDummy %>%
  mutate_all(str_trim)

A reproducible example:

library(tidyverse)
set.seed(335)
df <- mtcars %>%
        rownames_to_column("car") %>%
        mutate(car = ifelse(runif(nrow(mtcars)) > 0.4, car, paste0(car, " "))) %>%
        select(car, mpg)

print(head(df), quote = T)
#>                    car    mpg
#> 1         "Mazda RX4 " "21.0"
#> 2      "Mazda RX4 Wag" "21.0"
#> 3        "Datsun 710 " "22.8"
#> 4    "Hornet 4 Drive " "21.4"
#> 5 "Hornet Sportabout " "18.7"
#> 6           "Valiant " "18.1"

df_trim <- df %>%
  mutate_all(str_trim)

print(head(df_trim), quote = T)  
#>                   car    mpg
#> 1         "Mazda RX4"   "21"
#> 2     "Mazda RX4 Wag"   "21"
#> 3        "Datsun 710" "22.8"
#> 4    "Hornet 4 Drive" "21.4"
#> 5 "Hornet Sportabout" "18.7"
#> 6           "Valiant" "18.1"

Created on 2021-05-07 by the reprex package (v0.3.0)

Andersonandert answered 7/5, 2021 at 12:8 Comment(1)
Removes trailing \r\n in all columns, unlike any other solutions I've seen that claim to work at the data frame level. I can get rid of tons of untidy and inelegant per-column trims.Appenzell
N
2

The best method is trimws().

The following code will apply this function to the entire dataframe.

mydataframe<- data.frame(lapply(mydataframe, trimws),stringsAsFactors = FALSE)
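Because lapply(mydataframe, trimws) turns every column into character, a more conservative sketch is to trim only the columns that are already character. The data frame below is made up for illustration.

```r
# Made-up data frame with one padded character column and one numeric column
df <- data.frame(country = c("Austria ", " Spain"),
                 value   = c(1.5, 2),
                 stringsAsFactors = FALSE)

# Trim only the character columns; other column types are left untouched
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], trimws)

str(df)
```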
Noctule answered 25/9, 2017 at 8:55 Comment(1)
or df[] <- lapply(df, trimws) to be more compact. But it will in both cases coerce columns to character. df[sapply(df,is.character)] <- lapply(df[sapply(df,is.character)], trimws) to be safe.Sanguine
T
1

Benchmarking of the main approaches in this thread. This is not capturing all weird cases, but so far we are still lacking the example where str_trim removes whitespace and trimws doesn't (see Richard Telford's comment to this answer). Doesn't seem to matter - the gsub option seems to be fastest :)

x <- c(" lead", "trail ", " both ", " both and middle ", " _special")
## gsub function from https://mcmap.net/q/86023/-how-can-i-trim-leading-and-trailing-white-space 
## this is NOT the function from user Bernhard Kausler, which uses 
## a much less concise regex 
gsub_trim <- function (x) gsub("^\\s+|\\s+$", "", x)

res <- microbenchmark::microbenchmark(
  gsub = gsub_trim(x),
  ## https://mcmap.net/q/86023/-how-can-i-trim-leading-and-trailing-white-space
  trimws = trimws(x),
  ## https://mcmap.net/q/86023/-how-can-i-trim-leading-and-trailing-white-space
  str_trim = stringr::str_trim(x),
  times = 10^5
)
res
#> Unit: microseconds
#>      expr    min     lq      mean median       uq       max neval cld
#>      gsub 20.201 22.788  31.43943 24.654  28.4115  5303.741 1e+05 a  
#>    trimws 38.204 41.980  61.92218 44.420  51.1810 40363.860 1e+05  b 
#>  str_trim 88.672 92.347 116.59186 94.542 105.2800 13618.673 1e+05   c
ggplot2::autoplot(res)

sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  stringr_1.4.0  

Twofaced answered 14/2, 2010 at 12:44 Comment(0)
P
1
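myDummy$country <- as.character(myDummy$country)  # assigning a new value to a factor would give NA
myDummy$country[myDummy$country == "Austria "] <- "Austria"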

After this, you'll need to force R not to recognize "Austria " as a level. Let's pretend you also have "USA" and "Spain" as levels:

myDummy$country = factor(myDummy$country, levels=c("Austria", "USA", "Spain"))

It is a little less intimidating than the highest voted response, but it should still work.
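If the column is a factor and listing every level by hand is impractical, one alternative sketch is to trim the level labels themselves; levels that become identical are merged. The factor below is made up for illustration.

```r
# Made-up factor in which "Austria " and "Austria" are distinct levels
country <- factor(c("Austria ", "Austria", "Spain", "USA "))

# Trimming the labels merges levels that become identical
levels(country) <- trimws(levels(country))

levels(country)
table(country)
```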

Phore answered 15/6, 2017 at 14:56 Comment(1)
I don't think this is a good idea, since we don't know how many countries/levels the df actually has. Additionally, R would encode the first element of Dummy$Country as "Austria", even if it were "Spain".Shorthanded
N
0

I tried trim(). It works well with white spaces as well as the '\n'.

x = '\n              Harden, J.\n              '

trim(x)
Nether answered 16/9, 2018 at 7:46 Comment(2)
From which package? This function doesn't exist by default.Paramilitary
Just not a useful answer without providing the package name, @NetherAshlaring
