There are many different invisible characters. Many of them have the property WSpace=Y
("whitespace") in Unicode. But some special characters are not considered "whitespace" and still have no visible representation. The excellent Wikipedia articles about space (punctuation) and whitespace characters should give you an idea.
<rant>Unicode sucks in this regard: introducing lots of exotic characters that mainly serve to confuse people.</rant>
The standard SQL trim()
function by default only trims the basic Latin space character (Unicode: U+0020 / ASCII 32). Same with the rtrim()
and ltrim()
variants. Your call also only targets that particular character.
Use regular expressions with regexp_replace()
instead.
Trailing
To remove all trailing white space (but not white space inside the string):
SELECT regexp_replace(eventdate, '\s+$', '') FROM eventdates;
The regular expression explained:
\s
... regular expression class shorthand for [[:space:]]
- which is the set of white-space characters - see limitations below
+
... 1 or more consecutive matches
$
... end of string
Demo:
SELECT regexp_replace('inner white ', '\s+$', '') || '|'
Returns:
inner white|
Yes, that's a single backslash (\
). Details in this related answer:
Leading
To remove all leading white space (but not white space inside the string):
regexp_replace(eventdate, '^\s+', '')
^
.. start of string
Both
To remove both, you can chain above function calls:
regexp_replace(regexp_replace(eventdate, '^\s+', ''), '\s+$', '')
Or you can combine both in a single call with two branches.
Add 'g'
as 4th parameter to replace all matches, not just the first:
regexp_replace(eventdate, '^\s+|\s+$', '', 'g')
But that should typically be faster with substring()
:
substring(eventdate, '\S(?:.*\S)*')
\S
... everything but white space
(?:
re
)
... non-capturing set of parentheses
.*
... any string of 0-n characters
Or one of these:
substring(eventdate, '^\s*(.*\S)')
substring(eventdate, '(\S.*\S)') -- only works for 2+ printing characters
(
re
)
... Capturing set of parentheses
Effectively takes the first non-whitespace character and everything up to the last non-whitespace character if available.
Whitespace?
There are a few more related characters which are not classified as "whitespace" in Unicode - so not contained in the character class [[:space:]]
.
These print as invisible glyphs in pgAdmin for me: "mongolian vowel", "zero width space", "zero width non-joiner", "zero width joiner":
SELECT E'\u180e', E'\u200B', E'\u200C', E'\u200D';
'' | '' | '' | ''
Two more, printing as visible glyphs in pgAdmin, but invisible in my browser: "word joiner", "zero width non-breaking space":
SELECT E'\u2060', E'\uFEFF';
'' | ''
Ultimately, whether characters are rendered invisible or not also depends on the font used for display.
To remove all of these as well, replace '\s'
with '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]'
or '[\s]'
(note trailing invisible characters!).
Example, instead of:
regexp_replace(eventdate, '\s+$', '')
use:
regexp_replace(eventdate, '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]+$', '')
or:
regexp_replace(eventdate, '[\s]+$', '') -- note invisible characters
Limitations
There is also the Posix character class [[:graph:]]
supposed to represent "visible characters". Example:
substring(eventdate, '([[:graph:]].*[[:graph:]])')
It works reliably for ASCII characters in every setup (where it boils down to [\x21-\x7E]
), but beyond that you currently (incl. pg 10) depend on information provided by the underlying OS (to define ctype
) and possibly locale settings.
Strictly speaking, that's the case for every reference to a character class, but there seems to be more disagreement with the less commonly used ones like graph. But you may have to add more characters to the character class [[:space:]]
(shorthand \s
) to catch all whitespace characters. Like: \u2007
, \u202f
and \u00a0
seem to also be missing for @XiCoN JFS.
The manual:
Within a bracket expression, the name of a character class enclosed in
[:
and :]
stands for the list of all characters belonging to that
class. Standard character class names are: alnum
, alpha
, blank
, cntrl
,
digit
, graph
, lower
, print
, punct
, space
, upper
, xdigit
.
These stand for the character classes defined in ctype.
A locale can provide others.
Bold emphasis mine.
Also note this limitation that was fixed with Postgres 10:
Fix regular expressions' character class handling for large character
codes, particularly Unicode characters above U+7FF
(Tom Lane)
Previously, such characters were never recognized as belonging to
locale-dependent character classes such as [[:alpha:]]
.
space
character and not some other non-visible whitespace character(s)? – Elston