How to replace whitespaces with underscore?
Asked Answered
I

14

312

I want to replace whitespace with underscore in a string to create nice URLs. So that for example:

"This should be connected" 

Should become

"This_should_be_connected" 

I am using Python with Django. Can this be solved using regular expressions?

Iterate answered 17/6, 2009 at 14:41 Comment(1)
How can this this be achieved in django template. Is there any way to remove white spaces. Is there any built in tag/filter to do this? Note: slugify doesn't gives the desired output.Gitagitel
E
526

You don't need regular expressions. Python has a built-in string method that does what you need:

mystring.replace(" ", "_")
Epigraph answered 17/6, 2009 at 14:44 Comment(8)
This doesn't work with other whitespace characters, such as \t or a non-breaking space.Impala
Yes you are correct, but for the purpose of the question asked, it doesn't seem necessary to take those other spaces into account.Epigraph
do I need to import anything for this to work? I get the following error: AttributeError: 'builtin_function_or_method' object has no attribute 'replace'Scatology
Probably the variable that you called replace on, was not a string type.Stithy
This answer could be confusing, better write it as mystring = mystring.replace(" ", "_") as it doesn't directly alter the string but rather returns a changed version.Neely
Thanks, This is working. Would be better if written like: mystring = mystring.replace(" ", "_")Tennilletennis
does not work with non-breaking spaces, use re.sub(r"\s+", '', content) insteadWinny
Whitespace >>>> literal space character. The question title explicitly states "whitespace"; the question content then reiterates that. As @Winny suggests, re.sub(r'\s+', '_', content) is the canonical solution. And yet again the best answer is a comment with two upvotes. </facepalm>Voiceless
P
113

Replacing spaces is fine, but I might suggest going a little further to handle other URL-hostile characters like question marks, apostrophes, exclamation points, etc.

Also note that the general consensus among SEO experts is that dashes are preferred to underscores in URLs.

import re

def urlify(s):

    # Remove all non-word characters (everything except numbers and letters)
    s = re.sub(r"[^\w\s]", '', s)

    # Replace all runs of whitespace with a single dash
    s = re.sub(r"\s+", '-', s)

    return s

# Prints: I-cant-get-no-satisfaction"
print(urlify("I can't get no satisfaction!"))
Pagoda answered 17/6, 2009 at 15:3 Comment(7)
This is interesting. I will definitely use this advice.Iterate
Remember to urllib.quote() the output of your urlify() - what if s contains something non-ascii?Eisk
This is nice - but the first RE with \W will also remove whitespace with the result that the subsequent RE has nothing to replace... If you want to replace your other characters with '-' between tokens have the first RE replace with a single space as indicated - i.e. s = re.sub(r"\W", '&nbsp', s) (this may be a shonky formatting issue on StackOverflow: meta.stackexchange.com/questions/105507/…)Browse
@TimTheEnchanter - good catch. Fixed. What is the airspeed velocity of an unladen swallow?Pagoda
@Triptych What do you mean? African or European swallow?Browse
Another slight problem with this is you remove any preexisting hyphens in the url, so that if the user had attempted to clean the url string before uploading to be this-is-clean, it would be stripped to thisisclean. So s = re.sub(r'[^\w\s-]', '', s). Can go one step further and remove leading and trailing whitespace so that the filename doesn't end or start with a hyphen with s = re.sub(r'[^\w\s-]', '', s).strip()Cartesian
re.sub('[%s\s]+' % '-', '-', s) this works for utf-8 charsUvular
N
56

This takes into account blank characters other than space and I think it's faster than using re module:

url = "_".join( title.split() )
Nicobarese answered 29/4, 2012 at 23:18 Comment(6)
More importantly it will work for any whitespace character or group of whitespace characters.Frayne
This solution does not handle all whitespace characters. (e.g. \x8f)Prophylactic
Good catch, @Lokal_Profil! The documentation doesn't specify which whitespace chars are taken into account.Nicobarese
This solution will also not preserve repeat delimiters, as split() does not return empty items when using the default "split on whitespace" behavior. That is, if the input is "hello,(6 spaces here)world", this will result in "hello,_world" as output, rather than "hello,______world".Cambric
regexp > split/join > replaceOverrule
This is especially helpful, if you want to replace any number of whitespace characters with 1 character. Like in "reducing all whitespace to 1 space" szenarios. Very handy to remove line breaks, tabs and multi-` ` etc. from a string.Deuteronomy
H
45

Django has a 'slugify' function which does this, as well as other URL-friendly optimisations. It's hidden away in the defaultfilters module.

>>> from django.template.defaultfilters import slugify
>>> slugify("This should be connected")

this-should-be-connected

This isn't exactly the output you asked for, but IMO it's better for use in URLs.

Halyard answered 17/6, 2009 at 15:15 Comment(6)
That is an interesting option, but is this a matter of taste or what are the benefits of using hyphens instead of underscores. I just noticed that Stackoverflow uses hyphens like you suggest. But digg.com for example uses underscores.Iterate
This happens to be the preferred option (AFAIK). Take your string, slugify it, store it in a SlugField, and make use of it in your model's get_absolute_url(). You can find examples on the net easily.Byelorussian
@Lulu people use dashes because, for a long time, search engines treated dashes as word separators and so you'd get an easier time coming up in multi-word searches.Theodore
@Daniel Roseman can I use this with dynamically variable. as I am getting dynamic websites as string in a veriableKiller
This is the right answer. You need to sanitize your URLs.Bartram
This doesn't work with utf-8 characters, I tested it with Arabic and it returned "" empty stringUvular
J
29

Using the re module:

import re
re.sub('\s+', '_', "This should be connected") # This_should_be_connected
re.sub('\s+', '_', 'And     so\tshould this')  # And_so_should_this

Unless you have multiple spaces or other whitespace possibilities as above, you may just wish to use string.replace as others have suggested.

Joceline answered 17/6, 2009 at 14:45 Comment(2)
Thank you, this was exactly what I was asking for. But I agree, the "string.replace" seems more suitable for my task.Iterate
PEP8 replace this '\s+' with this r'\s+'. Info: flake8rules.com/rules/W605.htmlJugum
T
11

use string's replace method:

"this should be connected".replace(" ", "_")

"this_should_be_disconnected".replace("_", " ")

Trstram answered 17/6, 2009 at 14:45 Comment(0)
D
9

You can try this instead:

mystring.replace(r' ','-')
Defy answered 6/5, 2018 at 15:28 Comment(2)
this should have been the correct answer not rogeriopvl response...how could he/she get 489 up vote when it does not even work !Sorghum
except it's not using underscoreRoybal
P
7

Python has a built in method on strings called replace which is used as so:

string.replace(old, new)

So you would use:

string.replace(" ", "_")

I had this problem a while ago and I wrote code to replace characters in a string. I have to start remembering to check the python documentation because they've got built in functions for everything.

Protomartyr answered 18/6, 2009 at 12:34 Comment(0)
U
7

Surprisingly this library not mentioned yet

python package named python-slugify, which does a pretty good job of slugifying:

pip install python-slugify

Works like this:

from slugify import slugify

txt = "This is a test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")

txt = "This -- is a ## test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")

txt = 'C\'est déjà l\'été.'
r = slugify(txt)
self.assertEquals(r, "cest-deja-lete")

txt = 'Nín hǎo. Wǒ shì zhōng guó rén'
r = slugify(txt)
self.assertEquals(r, "nin-hao-wo-shi-zhong-guo-ren")

txt = 'Компьютер'
r = slugify(txt)
self.assertEquals(r, "kompiuter")

txt = 'jaja---lol-méméméoo--a'
r = slugify(txt)
self.assertEquals(r, "jaja-lol-mememeoo-a") 
Untouched answered 28/9, 2015 at 16:1 Comment(0)
D
5

I'm using the following piece of code for my friendly urls:

from unicodedata import normalize
from re import sub

def slugify(title):
    name = normalize('NFKD', title).encode('ascii', 'ignore').replace(' ', '-').lower()
    #remove `other` characters
    name = sub('[^a-zA-Z0-9_-]', '', name)
    #nomalize dashes
    name = sub('-+', '-', name)

    return name

It works fine with unicode characters as well.

Decrescendo answered 17/6, 2009 at 15:36 Comment(1)
Could you explain where this differs from the built-in Django slugify function?Oakley
I
4
mystring.replace (" ", "_")

if you assign this value to any variable, it will work

s = mystring.replace (" ", "_")

by default mystring wont have this

Irvine answered 31/7, 2016 at 16:51 Comment(0)
M
2

OP is using python, but in javascript (something to be careful of since the syntaxes are similar.

// only replaces the first instance of ' ' with '_'
"one two three".replace(' ', '_'); 
=> "one_two three"

// replaces all instances of ' ' with '_'
"one two three".replace(/\s/g, '_');
=> "one_two_three"
Mostly answered 4/7, 2010 at 23:34 Comment(0)
L
2
x = re.sub("\s", "_", txt)
Lotze answered 28/11, 2021 at 6:25 Comment(4)
There are already 2 answers suggesting the match pattern \s+ for the replacement. Can you explain what are the differences/benefits of using your pattern (without the +)?Johnsiejohnson
When you use \s,it will matches only for a single whitespace character.But when you use \s+,it will match one or more occurrence of white spaces. Example: text = "hi all" re.sub("\s", "_", text) re.sub("\s+", "_", text) Output for \s : hi__all Output for \s+ : hi_all With this example \s will mach only for one space.So when two white spaces is given consecutive on input,it will replace _ for each of them.On other hand \s+ will match for one space.So it will replace one _ for each consecutive white spaces.Lotze
@AllenJose, all of that belongs in your answer, not in a comment. Comments can be deleted at any time for any reason. Please read How to Answer.Crosslet
@AllenJose My comment was more to encourage you to edit your answer with that information. I personally know that difference but actually am not sure what is the benefit of your way. It seems to make more sense to replace multiple consecutive spaces with a single underscore. Again, this should be explained in your answer to help other readers make the distinction according to their needsJohnsiejohnson
T
-4
perl -e 'map { $on=$_; s/ /_/; rename($on, $_) or warn $!; } <*>;'

Match et replace space > underscore of all files in current directory

Textbook answered 19/6, 2009 at 7:30 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.