Java code/library for generating slugs (for use in pretty URLs)
Y

4

46

Web frameworks such as Rails and Django have built-in support for "slugs", which are used to generate readable and SEO-friendly URLs.

A slug string typically consists only of the characters a-z, 0-9 and - and can hence be written without URL-escaping (think "foo%20bar").

I'm looking for a Java slug function that, given any valid Unicode string, returns a slug representation (a-z, 0-9 and -).

A trivial slug function would be something along the lines of:

return input.toLowerCase().replaceAll("[^a-z0-9-]", "");

However, this implementation does not handle internationalization or accents (for example, mapping ë to e); such characters are simply dropped. One way around this would be to enumerate all special cases, but that is not very elegant. I'm looking for something better thought out and more general.
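To make the limitation concrete, here is a small sketch of what the trivial function does to accented input. The class name TrivialSlugDemo and the sample string are made up for illustration, and Locale.ROOT is added only so the result does not depend on the machine's default locale.

```java
import java.util.Locale;

final class TrivialSlugDemo {
    // The trivial approach: lowercase, then drop everything that is not
    // a-z, 0-9 or '-'.
    static String trivialSlug(String input) {
        return input.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9-]", "");
    }

    public static void main(String[] args) {
        // "Creme Brulee" with its accents, written as Unicode escapes so the
        // example does not depend on the source file's encoding.
        String title = "Cr\u00E8me Br\u00FBl\u00E9e";
        // Accented letters and the space are deleted, not transliterated.
        System.out.println(trivialSlug(title)); // prints "crmebrle"
    }
}
```

The accented letters vanish instead of becoming e and u, and the word boundary is lost as well.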

My question:

  • What is the most general/practical way to generate Django/Rails type slugs in Java?
Yezd answered 1/11, 2009 at 13:40 Comment(0)
T
58

Normalize your string using canonical decomposition:

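  import java.text.Normalizer;
  import java.text.Normalizer.Form;
  import java.util.Locale;
  import java.util.regex.Pattern;

  // Anything that is not a word character (a-z, A-Z, 0-9, _) or a hyphen.
  private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
  private static final Pattern WHITESPACE = Pattern.compile("[\\s]");

  public static String toSlug(String input) {
    String nowhitespace = WHITESPACE.matcher(input).replaceAll("-");
    // NFD splits accented letters into a base letter plus combining marks...
    String normalized = Normalizer.normalize(nowhitespace, Form.NFD);
    // ...which are then stripped along with every other non-word character.
    String slug = NONLATIN.matcher(normalized).replaceAll("");
    return slug.toLowerCase(Locale.ENGLISH);
  }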

This is still a fairly naive process, though. It isn't going to do anything for the sharp s (ß, used in German), or for any non-Latin-based script (Greek, Cyrillic, CJK, etc.).
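One possible way to cover letters such as ß that have no canonical decomposition is to substitute them before normalizing. The following is only a sketch under that assumption: the class name SpecialCaseSlug and the replacement table are hypothetical and deliberately incomplete, and non-Latin scripts would still need a real transliteration step.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class SpecialCaseSlug {
    // Hypothetical, incomplete table of letters that NFD leaves untouched.
    private static final Map<String, String> SPECIAL = new LinkedHashMap<>();
    static {
        SPECIAL.put("\u00DF", "ss"); // sharp s
        SPECIAL.put("\u00E6", "ae"); // ae ligature
        SPECIAL.put("\u00C6", "AE");
        SPECIAL.put("\u00F8", "o");  // o with stroke
        SPECIAL.put("\u00D8", "O");
        SPECIAL.put("\u0142", "l");  // l with stroke
        SPECIAL.put("\u0141", "L");
    }

    static String toSlug(String input) {
        String s = input;
        // Substitute the special letters first, while they are still present.
        for (Map.Entry<String, String> e : SPECIAL.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        s = s.replaceAll("\\s", "-");
        // Decompose accented letters, then strip marks and other non-word characters.
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        s = s.replaceAll("[^\\w-]", "");
        return s.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // "Strasse nach Kobenhavn" with sharp s and o-with-stroke as escapes.
        System.out.println(toSlug("Stra\u00DFe nach K\u00F8benhavn")); // prints "strasse-nach-kobenhavn"
    }
}
```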

Be careful when changing the case of a string. Upper- and lower-case forms depend on the alphabet. In Turkish, the uppercase form of U+0069 (i) is U+0130 (İ), not U+0049 (I), and the lowercase form of U+0049 (I) is U+0131 (ı), so you risk introducing a non-Latin-1 character back into your string if you call String.toLowerCase() without an explicit locale while the default locale is Turkish.
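The locale dependence is easy to reproduce. This small sketch (the class name TurkishCaseDemo is made up) compares case mapping under a Turkish locale with an explicit English one:

```java
import java.util.Locale;

final class TurkishCaseDemo {
    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Under Turkish rules, 'I' lowercases to dotless i (U+0131).
        System.out.println("TITLE".toLowerCase(turkish).equals("t\u0131tle")); // prints "true"

        // And 'i' uppercases to capital I with dot above (U+0130).
        System.out.println("i".toUpperCase(turkish).equals("\u0130")); // prints "true"

        // With an explicit non-Turkish locale the result stays within a-z.
        System.out.println("TITLE".toLowerCase(Locale.ENGLISH)); // prints "title"
    }
}
```

Passing a fixed locale such as Locale.ENGLISH or Locale.ROOT keeps the slug independent of the server's default locale.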

Tega answered 1/11, 2009 at 14:8 Comment(3)
Looks promising, but the normalization does not appear to work: "fóòbâr" gets translated into "fbr" instead of the expected "foobar". Do you know why? - Yezd
Strange - when I put the string "f\u00F3\u00F2b\u00e2r" through the method, I get "foobar". You are perhaps making an encoding error in your source or data file; see illegalargumentexception.blogspot.com/2009/05/… - Tega
McDowell: You're absolutely right - it was an encoding error. Thanks for an excellent answer! - Yezd
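For anyone hitting the same symptom, a check along these lines can rule out the source-file encoding. It is only a sketch: the class name NfdCheck and the helper stripAccents are hypothetical, and the regex is the same one used in the answer.

```java
import java.text.Normalizer;

final class NfdCheck {
    // Decomposes the input, then removes everything except word characters and '-'.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\w-]", "");
    }

    public static void main(String[] args) {
        // Unicode escapes do not depend on how the compiler decodes the file.
        String escaped = "f\u00F3\u00F2b\u00E2r";
        System.out.println(stripAccents(escaped)); // prints "foobar"

        // NFD turns one precomposed code point into a base letter plus a combining mark.
        String decomposed = Normalizer.normalize("\u00F3", Normalizer.Form.NFD);
        System.out.println(decomposed.length()); // prints "2"
    }
}
```

If the escaped string works but the same text typed literally does not, the file is probably being compiled with the wrong charset; passing -encoding UTF-8 to javac is one way to make that explicit.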
R
18

http://search.maven.org/#search|ga|1|slugify

And here's the GitHub repository to take a look at the code and its usage:

https://github.com/slugify/slugify

Ruebenrueda answered 17/7, 2012 at 17:42 Comment(1)
Best and expandable solution so far. - Synchroflash
M
12

McDowell's proposal almost works, but for input like "Hello World !!" it returns hello-world- (note the trailing -) instead of hello-world.

A fixed version could be:

import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
private static final Pattern WHITESPACE = Pattern.compile("[\\s]");
// Matches a single hyphen at the start or at the end of the string.
private static final Pattern EDGESDHASHES = Pattern.compile("(^-|-$)");

public static String toSlug(String input) {
    String nowhitespace = WHITESPACE.matcher(input).replaceAll("-");
    String normalized = Normalizer.normalize(nowhitespace, Normalizer.Form.NFD);
    String slug = NONLATIN.matcher(normalized).replaceAll("");
    // Trim one leading and one trailing hyphen left behind by stripped characters.
    slug = EDGESDHASHES.matcher(slug).replaceAll("");
    return slug.toLowerCase(Locale.ENGLISH);
}
Molton answered 31/5, 2016 at 17:40 Comment(0)
A
9

I've extended the answer by @McDowell to convert punctuation to hyphens, collapse runs of hyphens into one, and strip leading and trailing hyphens.

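  import java.text.Normalizer;
  import java.text.Normalizer.Form;
  import java.util.Locale;
  import java.util.regex.Pattern;

  private static final Pattern NONLATIN = Pattern.compile("[^\\w_-]");
  // Whitespace or ASCII punctuation, except the hyphen itself.
  private static final Pattern SEPARATORS = Pattern.compile("[\\s\\p{Punct}&&[^-]]");

  public static String makeSlug(String input) {
    String noseparators = SEPARATORS.matcher(input).replaceAll("-");
    String normalized = Normalizer.normalize(noseparators, Form.NFD);
    String slug = NONLATIN.matcher(normalized).replaceAll("");
    // Collapse runs of hyphens, then strip a leading or trailing one.
    return slug.toLowerCase(Locale.ENGLISH).replaceAll("-{2,}","-").replaceAll("^-|-$","");
  }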
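
A quick way to sanity-check this behaviour is to run a few inputs through it. The sketch below copies the two patterns from this answer into a hypothetical class SlugCheck; the sample inputs are made up.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

final class SlugCheck {
    private static final Pattern NONLATIN = Pattern.compile("[^\\w_-]");
    private static final Pattern SEPARATORS = Pattern.compile("[\\s\\p{Punct}&&[^-]]");

    static String makeSlug(String input) {
        String noseparators = SEPARATORS.matcher(input).replaceAll("-");
        String normalized = Normalizer.normalize(noseparators, Normalizer.Form.NFD);
        String slug = NONLATIN.matcher(normalized).replaceAll("");
        return slug.toLowerCase(Locale.ENGLISH)
                   .replaceAll("-{2,}", "-")
                   .replaceAll("^-|-$", "");
    }

    public static void main(String[] args) {
        // Trailing punctuation no longer leaves hyphens behind.
        System.out.println(makeSlug("Hello World !!"));          // prints "hello-world"
        // Leading/trailing separators and accents are handled together.
        System.out.println(makeSlug("  --Caf\u00E9 & Bar--  ")); // prints "cafe-bar"
        // '_' is part of \p{Punct}, so underscores become hyphens too.
        System.out.println(makeSlug("foo_bar"));                 // prints "foo-bar"
    }
}
```

Note the last case: because the underscore counts as punctuation, it never reaches the NONLATIN step, so the _ in that pattern has no effect.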
Aribold answered 20/11, 2015 at 16:21 Comment(0)
