Remove all punctuation AND the values after it at end of string in R

I have an ID variable that comes from 35 different hospitals, so it is formatted in many different ways, and sometimes the same root ID number has a secondary line number appended, e.g. -1, /a, _1 etc.

I want to remove the punctuation, and whatever comes after that punctuation, leaving just the root ID number.

I have currently managed to write out individual lines of code for each different iteration, but I was wondering if there was a more elegant way so that next year when the data comes in I don't need to check for different arrangements?

On someone else's question I managed to find a way to remove brackets and all the text within them, but I can't seem to figure out how to adapt it for my purposes:

df$patid<- gsub("\\s*\\([^\\)]+\\)","",df$patid)

I tried these two codes without success

df$patid<- gsub("\\[:punct:]s*$","", df$patid)
df$patid<- gsub("\\[:alnum:]s*$","", df$patid)

I also tried the clean function, which removed all the punctuation, but kept the numbers/characters after them, so that wasn't it.

example of my current code (not all possible iterations) - These do work

df$patid<- gsub("\\-1$", "", df$patid)
df$patid<- gsub("\\-2$", "", df$patid)
df$patid<- gsub("\\-3$", "", df$patid)
df$patid<- gsub("\\-a$", "", df$patid)
df$patid<- gsub("\\-A$", "", df$patid)
df$patid<- gsub("\\-b$", "", df$patid)
df$patid<- gsub("\\-B$", "", df$patid)
df$patid<- gsub("\\b", "", df$patid)
df$patid<- gsub("\\/dd", "", df$patid)

Am not tied to gsub, am open to different methods.

Example of ID numbers

patid<- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")

Apologies if this has been answered somewhere already

Enigmatic answered 13/9, 2024 at 3:20 Comment(6)
Just wonder if your real intention was actually removing a punctuation other than - + numbers at the end of the string. Was it? I see MB-13-169454 in your input, and Tim's solution returns this "code" unchanged, is it expected? - Prussiate
Tim's solution works exactly as I had hoped. I put the MB-13-169454 as an example of an ID that I didn't want manipulated, as this would be a root patient ID, rather than a duplicate - Enigmatic
So, if there is 8253015N/21, you want to keep it as is or return 8253015N? - Prussiate
Ideally return it as 8253015N - Enigmatic
Then how do we tell MB-13-169454 from 8253015N/21? A number of alphanumerics at the end of the string? - Prussiate
I don't know sorry - it's a really messy database - Enigmatic

A literal regex for what you described would be:

[[:punct:]][^[:punct:]]*$

This matches the last punctuation character in the string, together with any non-punctuation characters that follow it, up to the end of the string.

patid <- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")
output <- sub("[[:punct:]][^[:punct:]]*$", "", patid)
output

 [1] "MB-13"        "MB-13"        "MB-13-212235" "MB-13-212235" "MB-13"       
 [6] "570548260"    "570548260"    "1458629P"     "1139093D"     "8253015N"    
[11] "8253015N"     "M255858"      "M255858"      "8494392Q"     "9296741B"    
[16] "04152341421"  "04152341421"  "04152640475"  "04152821164"  "G140381883"  
[21] "G140381883"   "G140880774"   "G140880774"  

Note that an ID whose root itself contains punctuation and has no suffix, such as MB-13-169454, loses its last segment (-169454) with this pattern.
Helbonna answered 13/9, 2024 at 3:34 Comment(2)
Thank you! I knew it was a regex thing, I just couldn't see it. Can I ask why did you use sub instead of gsub? - Enigmatic
@Enigmatic Well gsub is short for "global sub," which would apply the pattern repeatedly. But the pattern I used can only ever match at most once, so we can use single sub() instead. It is a matter of syntax, but also possibly efficiency. - Helbonna
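To illustrate the sub versus gsub point, here is a minimal sketch using three IDs from the question. The variable names (ids, pattern, first_only, all_matches) are made up for the illustration. Because the pattern is anchored with $, it can match at most once per string, so both functions should return the same result; an unanchored pattern is included for contrast.

```r
# Three IDs taken from the question's sample data
ids <- c("MB-13-212235.1", "570548260-2", "G140381883_1")

# Anchored at the end of the string, so at most one match per ID
pattern <- "[[:punct:]][^[:punct:]]*$"

first_only  <- sub(pattern, "", ids)   # replaces the first match only
all_matches <- gsub(pattern, "", ids)  # replaces every match

# Same result either way, because there is only ever one match
identical(first_only, all_matches)

# Contrast: an unanchored pattern, where the two functions differ
sub("[[:punct:]]", "", "MB-13-212235.1")   # removes only the first "-"
gsub("[[:punct:]]", "", "MB-13-212235.1")  # removes every punctuation character
```

With an end-anchored pattern the choice between the two is mostly about signalling intent: sub() makes it explicit that a single replacement is expected.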

What you ask for is to remove any punctuation and then one or two alphanumeric characters at the end of the string.

gsub("[[:punct:]][[:alnum:]]{1,2}$", "", x)

The TRE-compliant pattern [[:punct:]][[:alnum:]]{1,2}$ matches a punctuation character ([[:punct:]]), then one or two alphanumeric characters ([[:alnum:]]{1,2}), and then requires the end of the string ($) immediately after them.
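As a rough sketch of how this bounded pattern behaves, the snippet below applies it to a handful of IDs. Most come from the question, "8253015N/21" comes from the comments, and "570548260-123" is a made-up ID added only to show the upper bound of the {1,2} quantifier. The names ids, bounded and cleaned are assumptions for the illustration.

```r
ids <- c("MB-13-169454",    # root ID, no suffix: expected to stay as is
         "MB-13-212235.1",
         "8253015N/2",
         "8253015N/21",     # two-character suffix (example from the comments)
         "04152341421/A",
         "G140381883_1",
         "570548260-123")   # hypothetical ID with a three-character suffix

# One punctuation character, then one or two alphanumerics, then end of string
bounded <- "[[:punct:]][[:alnum:]]{1,2}$"

cleaned <- sub(bounded, "", ids)
data.frame(original = ids, root = cleaned)
```

Under these assumptions the first ID is left untouched because six characters follow its last punctuation mark, and the hypothetical three-character suffix is also left in place. Suffixes longer than two characters would need a larger upper bound, at the cost of getting closer to segments that belong to the root ID.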

To remove any punctuation AND the text after it at end of string, you can use

gsub("[\\p{S}\\p{P}]+[^\\p{S}\\p{P}]*$", "", x, perl=TRUE)

NOTE: You can also use the same pattern with stringr::str_replace_all function. Also, you must use perl=TRUE in gsub to make this pattern work since it is PCRE compliant, not TRE-compliant.

Details:

  • [\p{S}\p{P}]+ - one or more symbol (\p{S}) or punctuation (\p{P}) characters (note that the default engine uses a POSIX-compliant version of [:punct:] that covers both of these Unicode categories, whereas the ICU regex engine used in the stringr regex functions is not POSIX compliant and behaves differently, which is why I am suggesting this pattern)
  • [^\p{S}\p{P}]* - zero or more characters other than symbols or punctuation
  • $ - end of string.
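To make the difference between the classes in the list above more concrete, here is a small sketch that tests a few ASCII characters against [[:punct:]] (default engine), \p{P} and \p{S} (perl = TRUE). The character selection and the variable names are arbitrary choices for the illustration, and the POSIX result can in principle depend on the locale.

```r
chars <- c("-", "/", "_", ".", "+", "$", "~")

is_punct <- grepl("[[:punct:]]", chars)            # POSIX class, default TRE engine
is_P     <- grepl("\\p{P}", chars, perl = TRUE)    # Unicode punctuation category
is_S     <- grepl("\\p{S}", chars, perl = TRUE)    # Unicode symbol category

data.frame(char = chars, punct = is_punct, p_P = is_P, p_S = is_S)
```

The separators that appear in the sample IDs (-, /, _ and .) fall under \p{P}, while characters such as +, $ and ~ are classed as symbols, so \p{P} alone would miss them. That is the reason the pattern combines both categories.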

See the R demo online:

patid <- c("MB-13-169454", "MB-13-179455", "MB-13-212235.1", "MB-13-212235.2", "MB-13-224683", "570548260-2", "570548260-3", "1458629P-2", "1139093D-2", "8253015N/2", "8253015N/3", "M255858/1", "M255858/2", "8494392Q/2", "9296741B/2", "04152341421/A", "04152341421/B", "04152640475/B", "04152821164/A", "G140381883_1", "G140381883_2", "G140880774_1", "G140880774_2")

gsub("[\\p{S}\\p{P}]+[^\\p{S}\\p{P}]*$", "", patid, perl=TRUE)

Output:

 [1] "MB-13"        "MB-13"        "MB-13-212235" "MB-13-212235" "MB-13"       
 [6] "570548260"    "570548260"    "1458629P"     "1139093D"     "8253015N"    
[11] "8253015N"     "M255858"      "M255858"      "8494392Q"     "9296741B"    
[16] "04152341421"  "04152341421"  "04152640475"  "04152821164"  "G140381883"  
[21] "G140381883"   "G140880774"   "G140880774"  

Terresaterrestrial answered 13/9, 2024 at 7:26 Comment(13)
Hi Wiktor I can see how your code works, but it isn't exactly what I was after. But thanks anyway - Enigmatic
@Enigmatic This solution does what you asked for in the title. I added what you actually intended to ask for at the top of the answer. It looks like you simply wanted to remove a punctuation and then a single alphanumeric char at the end of the string. - Prussiate
No, sometimes in my dataset there were several characters after the punctuation, but they all appear at the end of the string. I'm not sure why you're arguing with me about what it is I want as a solution for my problem... - Enigmatic
@Enigmatic I am not arguing, I just want to improve the SO post by clarifying the real requirements so that those who come here can use the solution to solve their similar problems. If you have such examples with multiple characters, I am not sure Tim's solution will cover them, can you please share at least one example string and the expected result? - Prussiate
@Enigmatic May I intervene... Tim's [[:punct:]][^[[:punct:]]]*$ looked to me confusing, indeed [^[[:punct:]]]* would match a character that is not a [ (redundant) nor a punctuation followed by ]* any amount of characters that are not a closing bracket. The construct more looks like a typo and will just match one non-punctuation character at the end, see this demo. If this regex from Tim is what you want, Wiktor's regex is basically doing the same without typo (target one alnum at the end of the string). Both do not match multiple at the end. - Gripe
Thanks @bobblebubble - so how would I have multiples at the end? - Enigmatic
@Enigmatic To make it more clear, see this demo - Tim's "fixed" pattern would imho rather be [[:punct:]][^[:punct:]]$ being very similar to the answer of Wiktor here. If you wanted to target multiple non-punctuations, Tim's pattern would be e.g. [[:punct:]][^[:punct:]]*$ (the \n in demo is just for multiline showcase) or Wiktor's [[:punct:]][[:alnum:]]*$ - Gripe
@bobblebubble I assumed (perhaps wrongly) in my answer that maybe the OP would sometimes have data where e.g. 2 digits followed the final punctuation character. So I tried to give a more general pattern. - Helbonna
@TimBiegeleisen Yes, however [^[[:punct:]]]* looks to me like a typo where you used too many brackets, so it would match e.g. MB-13-212235.2]]]]] but does not match 570548260-12 (multiple). Though it causes a visual illusion to be supposed to match multiple non-punctuations at the end because of that little slip. 🙃 - Gripe
@Enigmatic I have previously overlooked that Wiktor's second pattern does what seems to be desired: Target multiple non-punctuations following one or more punctuations. So this would work as desired if multiple should be removed. However, always good to have different views, options and opinions! - Gripe
Thanks everyone for working together to get me an answer! - Enigmatic
@Enigmatic Just want to mention that this solution is fully Unicode friendly. From my former experience, you should use Unicode category classes with PCRE/ICU regex engine if you plan to support languages like Japanese, with non-ASCII letters, rather than POSIX character classes with the default base R TRE regex engine. - Prussiate
@WiktorStribiżew thanks for this, I'll keep it in mind if I use international data sources! - Enigmatic
