Remove all punctuation AND the values after it at end of string in R

Asked 13/9, 2024 at 3:20 Answered 13/9, 2024 at 7:26

I have an ID variable that comes from 35 different hospitals, so it is formatted in many different ways, and sometimes the same root ID number carries a secondary line number, e.g. -1, /a, _1, etc.

I want to remove the punctuation, and whatever comes after that punctuation, leaving just the root ID number.

I have currently managed to write out individual lines of code for each different iteration, but I was wondering if there was a more elegant way so that next year when the data comes in I don't need to check for different arrangements?

On someone else's question I managed to find a way to remove the brackets and all the text within the brackets, but I can't seem to figure out how to manipulate it for my purposes

df$patid<- gsub("\\s*\\([^\\)]+\\)","",df$patid)

I tried these two codes without success

df$patid<- gsub("\\[:punct:]s*$","", df$patid)
df$patid<- gsub("\\[:alnum:]s*$","", df$patid)

I also tried the clean function, which removed all the punctuation, but kept the numbers/characters after them, so that wasn't it.

example of my current code (not all possible iterations) - These do work

df$patid<- gsub("\\-1$", "", df$patid)
df$patid<- gsub("\\-2$", "", df$patid)
df$patid<- gsub("\\-3$", "", df$patid)
df$patid<- gsub("\\-a$", "", df$patid)
df$patid<- gsub("\\-A$", "", df$patid)
df$patid<- gsub("\\-b$", "", df$patid)
df$patid<- gsub("\\-B$", "", df$patid)
df$patid<- gsub("\\b", "", df$patid)
df$patid<- gsub("\\/dd", "", df$patid)

Am not tied to gsub, am open to different methods.

Example of ID numbers

patid<- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")

Apologies if this has been answered somewhere already

Enigmatic asked 13/9, 2024 at 3:20 Comment(6)

Just wonder if your real intention was actually removing a punctuation other than - + numbers at the end of the string. Was it? I see MB-13-169454 in your input, and Tim's solution returns this "code" unchanged, is it expected? – Prussiate 13/9, 2024 at 7:35

Tim's solution works exactly as I had hoped. I put the MB-13-169454 as an example of an ID that I didn't want manipulated, as this would be a root patient ID, rather than a duplicate – Enigmatic 13/9, 2024 at 8:18

So, if there is 8253015N/21, you want to keep it as is or return 8253015N? – Prussiate 13/9, 2024 at 9:8

Ideally return it as 8253015N – Enigmatic 13/9, 2024 at 9:10

Then how do we tell MB-13-169454 from 8253015N/21? A number of alphanumerics at the end of the string? – Prussiate 13/9, 2024 at 9:11

I don't know sorry - it's a really messy database – Enigmatic 16/9, 2024 at 0:42

A literal regex for what you described would be:

[[:punct:]][^[:punct:]]*$

This matches the last punctuation character in the string, together with everything that follows it up to the end of the string.

patid <- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")
output <- sub("[[:punct:]][^[:punct:]]*$", "", patid)
output

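 [1] "MB-13"        "MB-13"        "MB-13-212235" "MB-13-212235" "MB-13"       
 [6] "570548260"    "570548260"    "1458629P"     "1139093D"     "8253015N"    
[11] "8253015N"     "M255858"      "M255858"      "8494392Q"     "9296741B"    
[16] "04152341421"  "04152341421"  "04152640475"  "04152821164"  "G140381883"  
[21] "G140381883"   "G140880774"   "G140880774"

Note that the match starts at the last punctuation character, so an ID such as MB-13-169454 is shortened to MB-13. The earlier form of the pattern discussed in the comments, [[:punct:]][^[[:punct:]]]*$, removes only a punctuation character plus one following character, and therefore leaves such IDs unchanged.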

Helbonna answered 13/9, 2024 at 3:34 Comment(2)

Thank you! I knew it was a regex thing, I just couldn't see it. Can I ask why did you use sub instead of gsub? – Enigmatic 13/9, 2024 at 3:56

@Enigmatic Well gsub is short for "global sub," which would apply the pattern repeatedly. But the pattern I used can only ever match at most once, so we can use single sub() instead. It is a matter of syntax, but also possibly efficiency. – Helbonna 13/9, 2024 at 4:8
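To illustrate the sub() versus gsub() point above, here is a minimal sketch in base R. The vector `ids` is a small made-up sample rather than the real data. It suggests that the two functions agree when the pattern is anchored at the end of the string, and differ only when the pattern can match more than once.

```r
# Small made-up sample of IDs (assumption: not the real data set)
ids <- c("570548260-2", "8253015N/3", "G140381883_1")

# End-anchored pattern: it can match at most once per string,
# so sub() and gsub() return the same result
anchored_sub  <- sub("[[:punct:]][^[:punct:]]*$", "", ids)
anchored_gsub <- gsub("[[:punct:]][^[:punct:]]*$", "", ids)
identical(anchored_sub, anchored_gsub)

# Unanchored pattern: sub() replaces only the first match,
# gsub() replaces every match
first_only <- sub("[[:punct:]]", "", "MB-13-169454")
all_hits   <- gsub("[[:punct:]]", "", "MB-13-169454")
```

Here `first_only` drops only the first hyphen, while `all_hits` drops both.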

What you ask for is to remove any punctuation and then one or two alphanumeric characters at the end of the string.

gsub("[[:punct:]][[:alnum:]]{1,2}$", "", x)

See the R demo. The TRE-compliant pattern [[:punct:]][[:alnum:]]{1,2}$ matches a punctuation character ([[:punct:]]), then one or two alphanumerics ([[:alnum:]]{1,2}), and then asserts that the end of the string ($) follows immediately. See the regex demo.
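As a rough check of how the {1,2} limit behaves, the sketch below runs the pattern on a few IDs from the question plus the 8253015N/21 case raised in the comments. The variable names are only illustrative.

```r
# A few IDs from the question, plus the two-character suffix from the comments
x <- c("MB-13-169454", "MB-13-212235.1", "8253015N/2",
       "8253015N/21", "04152341421/A", "G140381883_1")

# Remove a punctuation character followed by one or two alphanumerics
# at the very end of the string
short_suffix <- gsub("[[:punct:]][[:alnum:]]{1,2}$", "", x)
short_suffix
```

Because the suffix is limited to two characters, MB-13-169454 is left alone while 8253015N/21 is reduced to 8253015N. The trade-off is that a suffix of three or more characters would not be removed, and a root ID that itself ends in punctuation plus one or two characters would be truncated.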

To remove any punctuation AND the text after it at end of string, you can use

gsub("[\\p{S}\\p{P}]+[^\\p{S}\\p{P}]*$", "", x, perl=TRUE)

NOTE: You can also use the same pattern with stringr::str_replace_all function. Also, you must use perl=TRUE in gsub to make this pattern work since it is PCRE compliant, not TRE-compliant.

See the regex demo.

Details:

[\p{S}\p{P}]+ - one or more symbol (\p{S}, e.g. math symbols such as +) or punctuation (\p{P}) characters. Note that the default engine uses a POSIX-compliant version of [:punct:] that includes both of these Unicode category classes, whereas the ICU regex engine used in the stringr regex functions is not POSIX compliant and behaves differently, which is why I am suggesting this pattern.
[^\p{S}\p{P}]* - zero or more characters other than symbol or punctuation characters
$ - end of string.
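The role of the + quantifier in the breakdown above can be seen with a small sketch. The IDs here are invented for illustration, and `without_plus` uses a variant of the pattern that is not part of the answer, shown only for comparison.

```r
# Made-up IDs for illustration only
x <- c("ID--1", "ID+1", "ID_1", "ID/ab")

# With `+`, the whole run of separators before the suffix is removed
with_plus <- gsub("[\\p{S}\\p{P}]+[^\\p{S}\\p{P}]*$", "", x, perl = TRUE)

# Comparison variant without `+`: only the last separator is removed,
# so "ID--1" keeps one "-"
without_plus <- gsub("[\\p{S}\\p{P}][^\\p{S}\\p{P}]*$", "", x, perl = TRUE)
```

The "ID+1" case also shows why \p{S} is included alongside \p{P}: the plus sign is classed as a symbol rather than as punctuation.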

See the R demo online:

patid <- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")

gsub("[\\p{S}\\p{P}]+[^\\p{S}\\p{P}]*$", "", patid, perl=TRUE)

Output:

 [1] "MB-13"        "MB-13"        "MB-13-212235" "MB-13-212235" "MB-13"       
 [6] "570548260"    "570548260"    "1458629P"     "1139093D"     "8253015N"    
[11] "8253015N"     "M255858"      "M255858"      "8494392Q"     "9296741B"    
[16] "04152341421"  "04152341421"  "04152640475"  "04152821164"  "G140381883"  
[21] "G140381883"   "G140880774"   "G140880774"

Additional info that you may be confused about:

R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?

Terresaterrestrial answered 13/9, 2024 at 7:26 Comment(13)

Hi Wiktor I can see how your code works, but it isn't exactly what I was after. But thanks anyway – Enigmatic 13/9, 2024 at 8:18

@Enigmatic This solution does what you asked for in the title. I added what you actually intended to ask for at the top of the answer. It looks like you simply wanted to remove a punctuation and then a single alphanumeric char at the end of the string. – Prussiate 13/9, 2024 at 8:32

No, some times in my dataset there were several characters after the punctuation, but they all appear at the end of the string. I'm not sure why you're arguing with me about what it is I want as a solution for my problem... – Enigmatic 13/9, 2024 at 8:35

@Enigmatic I am not arguing, I just want to improve the SO post by clarifying the real requirements so that those who come here can use the solution to solve their similar problems. If you have such examples with multiple characters, I am not sure Tim's solution will cover them, can you please share at least one example string and the expected result? – Prussiate 13/9, 2024 at 8:41

@Enigmatic May I intervene... Tim's [[:punct:]][^[[:punct:]]]*$ looked to me confusing, indeed [^[[:punct:]]]* would match a character that is not a [ (redundant) nor a punctuation followed by ]* any amount of characters that are not a closing bracket. The construct more looks like a typo and will just match one non-punctuation character at the end, see this demo. If this regex from Tim is what you want, Wiktor's regex is basically doing the same without typo (target one alnum at the end of the string). Both do not match multiple at the end. – Gripe 13/9, 2024 at 9:52

Thanks @bobblebubble - so how would I have multiples at the end? – Enigmatic 13/9, 2024 at 10:4

@Enigmatic To make it clearer, see this demo - Tim's "fixed" pattern would imho rather be [[:punct:]][^[:punct:]]$, which is very similar to Wiktor's answer here. If you wanted to target multiple non-punctuations, Tim's pattern would be e.g. [[:punct:]][^[:punct:]]*$ (the \n in the demo is just for the multiline showcase) or Wiktor's [[:punct:]][[:alnum:]]*$ – Gripe 13/9, 2024 at 10:8

@bobblebubble I assumed (perhaps wrongly) in my answer that maybe the OP would sometimes have data where e.g. 2 digits followed the final punctuation character. So I tried to give a more general pattern. – Helbonna 13/9, 2024 at 10:12

@TimBiegeleisen Yes, however [^[[:punct:]]]* looks to me like a typo where you used too many brackets, so it would match e.g. MB-13-212235.2]]]]] but does not match 570548260-12 (multiple). Though it causes a visual illusion to be supposed to match multiple non-punctuations at the end because of that little slip. 🙃 – Gripe 13/9, 2024 at 10:16
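The effect of the extra brackets described in the comment above can be reproduced in base R with a short sketch. The test strings are the ones mentioned in the comment plus one ID from the question; the variable names are only illustrative.

```r
typo  <- "[[:punct:]][^[[:punct:]]]*$"  # pattern with the extra brackets
fixed <- "[[:punct:]][^[:punct:]]*$"    # pattern without them

x <- c("MB-13-212235.2]]]]]", "570548260-12", "570548260-2")

# The typo pattern reads as: one punctuation character, one character that is
# neither "[" nor punctuation, then any number of literal "]" characters
typo_result  <- sub(typo, "", x)

# The fixed pattern reads as: one punctuation character, then any number of
# non-punctuation characters
fixed_result <- sub(fixed, "", x)
```

With the typo pattern the two-character suffix in 570548260-12 is not removed, whereas the fixed pattern strips it.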

@Enigmatic I had previously overlooked that Wiktor's second pattern does what seems to be desired: target multiple non-punctuations following one or more punctuations. So this would work as desired if multiple characters should be removed. However, always good to have different views, options and opinions! – Gripe 13/9, 2024 at 10:27

Thanks everyone for working together to get me an answer! – Enigmatic 16/9, 2024 at 0:41

@Enigmatic Just want to mention that this solution is fully Unicode friendly. From my former experience, you should use Unicode category classes with PCRE/ICU regex engine if you plan to support languages like Japanese, with non-ASCII letters, rather than POSIX character classes with the default base R TRE regex engine. – Prussiate 19/9, 2024 at 8:38

@WiktorStribiżew thanks for this, I'll keep it in mind if I use international data sources! – Enigmatic 20/9, 2024 at 1:56
