Isolate alphabetical strings within a larger string

Asked 15/3, 2017 at 20:24 Answered 15/3, 2017 at 20:53

Is there a way to isolate parts of a string that are in alphabetical order?

In other words, if you have a string like this: hjubcdepyvb

Could you pull out just the portion in alphabetical order: bcde?

I have thought about using the is.unsorted() function, but I'm not sure how to apply this to only a portion of a string.

Forwarder answered 15/3, 2017 at 20:24

Here's one way by converting to ASCII and back:

input <- "hjubcdepyvb"
spl_asc <- as.integer(charToRaw(input))       # Convert to ASCII
d1 <- diff(spl_asc) == 1                      # Find sequences
filt <- spl_asc[c(FALSE, d1) | c(d1, FALSE)]  # Only keep sequences (incl start and end)
rawToChar(as.raw(filt))                       # Convert back to character

#[1] "bcde"

Note that this concatenates every run of consecutive letters it finds into a single string. For example, if input is "abcxasdicfgaqwe", the output is "abcfg".

If you want each run of consecutive letters as a separate string instead, you could do the following:

input <- "abcxasdicfgaqwe"
spl_asc <- as.integer(charToRaw(input))
d1 <- diff(spl_asc) == 1
r <- rle(c(FALSE, d1) | c(d1, FALSE))                   # Find boundaries
cm <- cumsum(c(1, r$lengths))                           # Map these to string positions
substring(input, cm[-length(cm)], cm[-1] - 1)[r$values] # Extract matching strings

#[1] "abc" "fg"
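
One caveat with the rle() approach: two runs that sit directly next to each other (for example "abcabc") produce one unbroken stretch of TRUE values, so they would come back as a single string. A minimal sketch of an alternative, assuming that is not what you want, is to start a new group at every position where a character is not the successor of the one before it. The function name split_runs is hypothetical, and like the code above it compares raw code points, so it does not check that the characters are actually letters.

```r
# Hypothetical helper: return each run of consecutive code points separately
split_runs <- function(s) {
  codes <- utf8ToInt(s)                          # characters -> integer code points
  # A new group starts wherever the step from the previous character is not +1
  grp <- cumsum(c(TRUE, diff(codes) != 1))
  runs <- vapply(split(codes, grp), intToUtf8, character(1), USE.NAMES = FALSE)
  runs[nchar(runs) > 1]                          # drop isolated characters
}

split_runs("abcxasdicfgaqwe")
#[1] "abc" "fg"

split_runs("abcabc")
#[1] "abc" "abc"
```

Because the grouping is done on positions rather than on a TRUE/FALSE mask, adjacent runs stay separate.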

Finally, here is a way to do it with a regular expression:

input <- c("abcxasdicfgaqwe", "xufasiuxaboqdasdij", "abcikmcapnoploDEFgnm",
           "acfhgik")
(rg <- paste0("(", paste0(c(letters[-26], LETTERS[-26]),
                           "(?=", c(letters[-1], LETTERS[-1]), ")", collapse = "|"), ")+."))

#[1] "(a(?=b)|b(?=c)|c(?=d)|d(?=e)|e(?=f)|f(?=g)|g(?=h)|h(?=i)|i(?=j)|j(?=k)|
#k(?=l)|l(?=m)|m(?=n)|n(?=o)|o(?=p)|p(?=q)|q(?=r)|r(?=s)|s(?=t)|t(?=u)|u(?=v)|
#v(?=w)|w(?=x)|x(?=y)|y(?=z)|A(?=B)|B(?=C)|C(?=D)|D(?=E)|E(?=F)|F(?=G)|G(?=H)|
#H(?=I)|I(?=J)|J(?=K)|K(?=L)|L(?=M)|M(?=N)|N(?=O)|O(?=P)|P(?=Q)|Q(?=R)|R(?=S)|
#S(?=T)|T(?=U)|U(?=V)|V(?=W)|W(?=X)|X(?=Y)|Y(?=Z))+."

regmatches(input, gregexpr(rg, input, perl = TRUE))
#[[1]]
#[1] "abc" "fg" 
#
#[[2]]
#[1] "ab" "ij"
#
#[[3]]
#[1] "abc" "nop" "DEF"
#
#[[4]]
#character(0)

This regular expression matches runs of consecutive upper-case or consecutive lower-case letters (but not runs that mix the two cases). As demonstrated, it works on character vectors and returns a list with one vector of matches per input string. If an input has no match, its entry is character(0).
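
To see why the pattern ends in a trailing dot, it may help to build the same expression over a five-letter alphabet. This is only an illustrative sketch: the names abc, rg_small and x are made up for the example. Each alternative consumes a letter only when the next character is its successor, so the repeated group stops one character short of the end of the run, and the final . picks up that last letter.

```r
# Same construction as above, restricted to a..e for readability
abc <- letters[1:5]
rg_small <- paste0("(", paste0(abc[-5], "(?=", abc[-1], ")", collapse = "|"), ")+.")
rg_small
#[1] "(a(?=b)|b(?=c)|c(?=d)|d(?=e))+."

x <- "xabcdyde"
regmatches(x, gregexpr(rg_small, x, perl = TRUE))[[1]]
#[1] "abcd" "de"
```

In "abcd" the group matches a, b and c (each is followed by its successor), and the dot then consumes d.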

Stadium answered 15/3, 2017 at 20:33

Using factor integer conversion:

input <- "hjubcdepyvb"
d1 <- diff(as.integer(factor(unlist(strsplit(input, "")), levels = letters))) == 1
filt <- c(FALSE, d1) | c(d1, FALSE)
paste(unlist(strsplit(input, ""))[filt], collapse = "")
# [1] "bcde"
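
Since the factor levels are letters, any character outside a-z becomes NA, and that NA can travel through diff() into the logical index, where paste() would render it as the literal text NA. A minimal sketch of one way to guard against this, assuming non-letters should simply break a run, is below. The function name consecutive_lower is hypothetical, and match() is used in place of the factor conversion.

```r
# Hypothetical helper: keep only lower-case letters that belong to a consecutive run
consecutive_lower <- function(s) {
  chars <- strsplit(s, "")[[1]]
  pos <- match(chars, letters)      # alphabet position; NA for anything not in a-z
  d1 <- diff(pos) == 1
  d1[is.na(d1)] <- FALSE            # a non-letter on either side breaks the run
  keep <- c(FALSE, d1) | c(d1, FALSE)
  paste(chars[keep], collapse = "")
}

consecutive_lower("hjubcdepyvb")
#[1] "bcde"

consecutive_lower("abXcd")
#[1] "abcd"
```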

Pompey answered 15/3, 2017 at 20:53

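myf = function(x){
    x = unlist(strsplit(x, ""))       # split into single characters
    ind = charmatch(x, letters)       # alphabet position of each lower-case letter
    d = c(0, diff(ind))
    d[d !=1] = 0                      # 1 where a letter directly follows its predecessor
    # Also flag the first letter of each run (a 0 immediately followed by a 1)
    d = d + c(sapply(1:(length(d)-1), function(i) {
        ifelse(d[i] == 0 & d[i+1] == 1, 1, 0)
    }
    ), 0)
    # Group the flagged positions by run and paste each group back together
    d = split(seq_along(d)[d!=0], with(rle(d), rep(seq_along(values), lengths))[d!=0])
    return(sapply(d, function(a) paste(x[a], collapse = "")))
}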

myf(x = "hjubcdepyvblltpqrs")
#     2      4 
#"bcde" "pqrs"

Deherrera answered 15/3, 2017 at 20:48
