Formula to eliminate all but alpha characters
Asked Answered
K

8

10

I need to scrub a column of names in Excel to eliminate all non-Alpha characters including periods, commas, spaces, hyphens and apostrophes.

EXAMPLE: Change O'Malley-Smith, Tom, Jr. to OMALLEYSMITHTOMJR

The client requires this to be an Excel function, otherwise I'd make it easy with a quick Java program similar to replaceAll("[^a-zA-Z]", "").toUpperCase(). I cannot seem to find anything that looks like an off-the-shelf function to do this outside of a whole mess of SUBSTITUTE functions - which only seem to be available one-per-cell.

I'm not terribly fluent with developing custom macros if that's what I need.

Kowalski answered 19/3, 2015 at 16:23 Comment(1)
Fyi To include only characters contained in list c.f. Remove special characters from rangeZilpah
W
12

I had a similar need sometime ago and found something that worked great.

Press Alt+F11 to open the Visual Basic editor. Insert a new Module and paste the following code.

Function CleanCode(Rng As Range)
    Dim strTemp As String
    Dim n As Long

    For n = 1 To Len(Rng)
        Select Case Asc(Mid(UCase(Rng), n, 1))
            Case 48 To 57, 65 To 90
                strTemp = strTemp & Mid(UCase(Rng), n, 1)
        End Select
    Next
    CleanCode = strTemp
End Function

CleanCode now is new function and you can use it as a formula.

So next to the cell with the string you want to manipulate just copy =CleanCode(yourcell)

Whichsoever answered 19/3, 2015 at 16:39 Comment(0)
T
12

Indeed a mess of SUBSTITUTEs but within a single cell is possible, eg:

=UPPER(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1," ",""),",",""),"'",""),".",""),"-",""))   

Of course may need to be 'extended' to cover other non-alpha characters.

Tremblay answered 19/3, 2015 at 16:31 Comment(4)
Clever formulaic approach.Fingered
kind of you, but formulas aren't really my thing. Whereas I don't think Barry has any peers, he is simply the best fullstop.Fingered
Great solution for removing limited unique non-numeric strings. Worked for me when I had to convert keys such as "14D", "8M", "9hr", and "23min" to numeric values, by simply using "D", "M", "hr", "min" as the subs.Oblivion
Excellent solution for eliminating few characters. ThanksLamellibranch
W
12

I had a similar need sometime ago and found something that worked great.

Press Alt+F11 to open the Visual Basic editor. Insert a new Module and paste the following code.

Function CleanCode(Rng As Range)
    Dim strTemp As String
    Dim n As Long

    For n = 1 To Len(Rng)
        Select Case Asc(Mid(UCase(Rng), n, 1))
            Case 48 To 57, 65 To 90
                strTemp = strTemp & Mid(UCase(Rng), n, 1)
        End Select
    Next
    CleanCode = strTemp
End Function

CleanCode now is new function and you can use it as a formula.

So next to the cell with the string you want to manipulate just copy =CleanCode(yourcell)

Whichsoever answered 19/3, 2015 at 16:39 Comment(0)
W
8

If you wanted to go down the VBA route - you couldn't use a User-Defined Function (UDF) to change the value of the cell you're entering the function into - but you could use a simple macro and take advantage of Microsoft's VBScript RegEx Engine:

Sub SO()

Dim searchRange     As Excel.Range
Dim cell            As Variant
Dim RegEx           As Object

Set RegEx = CreateObject("VBScript.RegExp")

With RegEx
    .Pattern = "[^a-zA-Z]"
    .Global = True
    .MultiLine = True
End With

Set searchRange = ActiveSheet.Range("A1:D5") '// Change as required

    For Each cell In searchRange.Cells
        If RegEx.test(cell) Then cell.Value = RegEx.Replace(cell.Value, vbNullString)
        cell.Value = UCase(cell.Value)
    Next cell

Set searchRange = Nothing
Set RegEx = Nothing

End Sub
Woodworth answered 19/3, 2015 at 16:44 Comment(1)
Regexp is most efficient VBA method. Though would use this in a variant array rather than rangeFingered
L
4

This can be done with a single formula in Excel 2016 onwards.

While pnuts' solution lists explicit characters to strip, this solution lists explicitly valid characters.

Assume your dirty data is in column A. Assume you want your clean data in column B.

Use the very last formula below, in cell B1. To enter the formula into cell B1, do the following:

  • Click on cell B1
  • Click into the formula bar
  • Paste the formula
  • Press CTRL+Shift+Enter <-- important step

Copy cell B1 and paste it down column B as far as you need.

First, here is a short example to explain what's going on:

=TEXTJOIN("",TRUE,

IFs(
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "t", "t",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "e", "e",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "s", "s",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "T", "T",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "E", "E",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "S", "S",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = "2", "2",
    MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1) = " ", " ",
    true, ""
  )

)

In this case I have specified the following characters as valid: t, e, s, T, E, S, 2 and the space character.

Obviously you need to extend the formula out to the full alphabet, listing each upper and lowercase character separately. Likewise, extend it out to include numbers if you want to keep those. Note that numbers are coded as strings.

How this works:

In a nutshell we are splitting the source string into an array of separate characters, then for each character, checking whether it is in our set of valid characters and including it in the result if it is, otherwise replacing it with an empty string if it is not.

The 'IFS' function goes through its arguments a pair at a time. If the first argument evaluates as true, then the second argument is returned. If not, it proceeds with the next pair of arguments - this is why you see the letters listed twice on each row. The last pair of values in the IFS function is the set 'true' and the empty string. This is saying if we get to the end of the set of valid values (ie and haven't matched a valid value) then return the empty string.

More background on why this works:

This is a variation on a solution given at ExcelJet. In that solution the TEXTJOIN function is used (to concatenate the values of an array) with the INDIRECT function (which splits the string into an array) together with a mathematical operator (the plus symbol) to force the evaluation of a calculation between every character in the string with a numerical value. Numerical characters in the string will return numerical values while other characters will return an error. That solution uses the function ISERR to check for an error to decide whether or not to include a given character in the final output. A similar article exists there to work the other way around - to exclude the numbers and keep the letters.

The problem I wanted solved is for the coder to decide which values are valid and which are not. I went through trying to incorporate VLOOKUP and INDEX functions with the INDIRECT function but they will only work on the first character in the string. The trick is that not all functions will act on the output of INDIRECT in such a way as to evaluate every element in the array (ie. every character in the string). The secret was that ExcelJet used a mathematical operator. If you check Microsoft's full function reference, IFS is categorised as a "logic" function. I suspect logic functions can be used with INDIRECT in this way.

(Note: I also attempted to use AND, and OR in various combinations. However, the INDIRECT evaluates all characters in the string. So for example, using the CODE function to gain each character's ASCII value and asserting all characters must have values between 65 - 90 (uppercase) or 97 - 122 (lowercase letters) will work only if all characters in the string are uppercase, or all are lowercase, but not if there is a mixture.)

I don't know how the performance of this solution compares with the earlier suggestion using SUBSTITUTE. If you only want to strip out a few characters, I recommend the SUBSTITUTE solution. If you want to specify explicitly the valid characters to keep (which was the original question), use this one.

Finally, here is the exact answer you need, including the conversion to uppercase which you didn't note in the question, but displayed in your example. (For anyone else not wanting the uppercase conversion, remove the instances of 'UPPER' from this example, then add the alphabet again to the list, in lowercase, and be sure to leave the 'true'/empty string pair as the last entry in the list.)

=TEXTJOIN("",TRUE,
IFs(
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "A", "A",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "B", "B",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "C", "C",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "D", "D",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "E", "E",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "F", "F",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "G", "G",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "H", "H",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "I", "I",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "J", "J",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "K", "K",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "L", "L",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "M", "M",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "N", "N",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "O", "O",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "P", "P",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "Q", "Q",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "R", "R",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "S", "S",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "T", "T",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "U", "U",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "V", "V",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "W", "W",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "X", "X",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "Y", "Y",
    upper(MID(a1,ROW(INDIRECT(CONCATENATE("1:",LEN(a1)))),1)) = "Z", "Z",
    true, ""
)
)

With the original question being "eliminate all but alpha characters" - this answer does the trick in a formula without the need for VBA.

Leakey answered 15/5, 2018 at 22:54 Comment(0)
F
3

further to youcantryreachingme's answer above, and consolidating it with the advice from https://exceljet.net/formula/strip-numeric-characters-from-cell using SEQUENCE & LET functions (available in the more recent editions of Excel), then further experimenting, I've been able to get the function down to...

=LET(character,MID(a1,SEQUENCE(LEN(a1)),1),TEXTJOIN("",TRUE,
 IFS(
     character=" "," ",
     CODE(UPPER(character))>90,"",
     CODE(UPPER(character))<65,"",
     TRUE,character)
 ))

As IFS seems to only execute the 1st statement matched, it is important to place any exceptions you want to include that are outside the range excluded (>90, <65), before that range is declared. It seems more intuitive to use an 'AND' statement to combine the range limits, but this doesn't work for the reasons stated by youcantryreachingme above.

Flirtatious answered 8/11, 2021 at 19:26 Comment(0)
J
2

Another VBA solution

Sub RemoveCrap()
    Dim varRange As Range
    Dim varWorkRange As Range

    Set varWorkRange = Range("A1:A10")

    For Each varRange In varWorkRange
        varVal = ""
        For i = 1 To Len(varRange.Value)
            varTemp = Mid(varRange.Value, i, 1)
            If Not (varTemp Like "[a-z]" Or varTemp Like "[A-Z]") Then
                varStr = ""
            Else
                varStr = UCase(varTemp)
            End If
            varVal = varVal & varStr
        Next i
        varRange.Value = varVal
    Next
End Sub
Juarez answered 19/3, 2015 at 16:49 Comment(1)
@user3415869 - The concept will work but not the code because Google Sheets uses Google Apps Script whereas MS Excel uses VBAJuarez
Z
2

Can't respond directly to aurezio as just joined and not enough reputation. I really like this solution and found it very efficient and short/simple for my requirements -- which is to clean up text fields to output a filtered character set to columns for filename and web coding output needs.

Although not directly answering the question, this page was the closest I found to the solution I needed so I wanted to give others more context and expand on aurezio's solution for a more generic character selection - in case anyone is interested.

Row 7 relates to the character number or number series to allow. Use excel function CODE() to identify a char's code and CHAR() to post the char code in.

My criteria (as shown below) was to filter all but 45 (dash), 48 To 57 (numeric series), 65 To 90 (alphanumeric uppercase series), 97 To 122 (alphanumeric lowercase series). Also, not requiring uppercase means UCase() can be removed.

'based off aurezio's solution
Function CleanCode(Rng As Range)
    Dim strTemp As String
    Dim n As Long

    For n = 1 To Len(Rng)
        Select Case Asc(Mid(Rng, n, 1))
            Case 45, 48 To 57, 65 To 90, 97 To 122
                strTemp = strTemp & Mid(Rng, n, 1)
        End Select
    Next
    CleanCode = strTemp
End Function

Eventually I'd like to improve it to make it dynamic eventually and allow input arguments of the filter(s) required. eg ClearCode(Range,"45", "48-57", "65-90", "97-122")

Zeeba answered 20/8, 2018 at 20:12 Comment(0)
G
1

My preference in a simple set of conditions is more basic than Phil's, above, in that it directly makes the tests rather than listing out an IFS, which has its own issues in layout. I'll use LET for clarity, though I don't think it was available at the time of the question, since it can be removed with simple substitution:

=LET(Character,  MID(A1,SEQUENCE(LEN(A1)),1),  CharCode,  CODE(Character),

      TEXTJOIN("",TRUE,  IF( (CharCode>=65) * (CharCode<=90) + ((CharCode>=97) * (CharCode<=122)), UPPER(Character), "" )  )
      )

Basically, the innermost tests are the AND's in the middle, first testing whether the character is in the set of uppercase letters, the second testing whether it is in the set of lowercase letters, the two collections that are acceptable. An array or TRUE/FALSE is generated for each and since there is a multiplication, Excel converts from literal TRUE/FALSE to 1/0 for each. (If there were only a single array and no required multiplication, that would not happen, so one would need a "*1" to coerce the conversion.)

The two arrays are then added which preserves all successful results from the individual arrays. So, if the character passes either set of the tests, it makes it into the final array as a 1 (TRUE). The IF uses that array to keep a character and subject it to UPPER, or to not keep it at all and place "" into the output array instead. Two concerns:

  1. UPPER can be used, simplifying things, since the only results of interest are letter characters. If a wider set of characters were acceptable, it might not suit. But for most characters in the set of characters likely used by the poster, it would simply return the character. The return would be text, so numerals would not be numbers, but since the desired output is a string, that would not matter.

  2. What to place in the output array needs to work with TEXTJOIN so that function is all that is needed. If that cannot be done, then more complexity can't be avoided. But it is willing to ignore a certain type of input when creating the string, and that is blanks. So if the output to it has blanks for unacceptable characters, it can be told to ignore them, dropping them out as specifically desired. That makes "" the natural choice, but if one wishes to, or is willing to, use a different character and path afterward in the formula, it is certainly possible. Just not the easiest, least complex way.

Then TEXTJOIN simply combines the successful characters using no delimiter, as desired.

The most basic level of the formula is in the setup of the LET's variables. One uses MID to break apart the input strings to single characters using SEQUENCE to look at each one in turn. SEQUENCE might not have been available when posted either, but there were "old-school" methods to do the same in the MID function. So fair to the question, but in any case, for anyone with a current version 2022-onwards, completely cut-and-pastable.

Other approaches could certainly include all the answers given already, as well as a careful IFS setup (which would perhaps be the most flexible of all the non-VBA solutions, as well as a SWITCH approach. If one were concerned with more, especially far more, characters than just the lower 256 code characters, one should remember that there are numerals spotted all throughout the Unicode system. If for no other reason, one could end up with a very Balkanized list. After a point, formulas might become nightmarish. But UDF's would still be very straightforward-ish.

Grapeshot answered 6/11, 2022 at 5:5 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.