Removing all fraction symbols like “¼” and “½” from a string

2

64

I need to modify strings similar to "¼ cups of sugar" to "cups of sugar", meaning replacing all fraction symbols with "".

I have referred to this post and managed to remove ¼ using this line:

itemName = itemName.replaceAll("\u00BC", "");

but how do I replace every possible fraction symbol out there?

Diarrhea answered 12/4, 2017 at 2:41 Comment(8)
What about removing every non-alphanumeric character except spaces, using itemName.replaceAll("[^A-Za-z0-9 ]", "")?Hoover
Java is not AndroidArchaeopteryx
@Archaeopteryx got it. tag removed.Diarrhea
Perhaps I spend too long on cooking.se but I wonder why you're doing this (as opposed to replacing "¼ cups of sugar" with " 1/4 cups of sugar").Britishism
May I ask why you would want to completely remove things that will change the semantic meaning of the string? I'm curious.Carden
@ChrisH and Matti - I'm building an app for recipes and shopping lists - and I'm using an API which returns a JSON with ingredients combined with their quantity needed. I am still keeping the original string, but giving the user an option to see items grouped by their 'clean names' (so they only see one item) instead of seeing 5 rows of different quantities of garlic. Did I explain that right? Sorry, I'm a total novice.Diarrhea
@Diarrhea that sounds reasonable, if tricky to get just right (I could imagine a recipe calling for "1 cup of sugar" as well as "sugar (for dusting)"), so the grouping could be a challenge. Good luckBritishism
If it's for a cooking app I'd suggest just hard coding the replacements for a limited number of fractions, maybe 1/2 to 1/10. I've never seen a recipe which called for 1/1076...Voe
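A possible way to follow the suggestions in the comments above (rewriting "¼" as "1/4" rather than deleting it) is Unicode compatibility decomposition, which expands each precomposed fraction into digits around U+2044 FRACTION SLASH. This is only a sketch: the class and method names are made up for illustration, and NFKD also rewrites other compatibility characters (ligatures, superscripts and so on), which may or may not be acceptable for recipe text.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the original question.
class FractionSpeller {
    // Expands precomposed fractions such as U+00BC into "1/4".
    static String spellOutFractions(String s) {
        // NFKD turns U+00BC into '1', U+2044 (fraction slash), '4'.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        // Swap the fraction slash for an ordinary ASCII slash.
        return decomposed.replace('\u2044', '/');
    }

    public static void main(String[] args) {
        System.out.println(spellOutFractions("\u00BC cups of sugar")); // 1/4 cups of sugar
    }
}
```

One caveat with this approach: a mixed number written as a digit directly followed by a fraction symbol (for example "1¼") would come out as "11/4", so a separator may need to be inserted before normalizing.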
97

Fraction symbols like ¼ and ½ belong to Unicode Category Number, Other [No]. If you are ok with eliminating all 676 characters in that group, you can use the following regular expression:

itemName = itemName.replaceAll("\\p{No}+", "");

If not, you can always list them explicitly:

// As characters (requires UTF-8 source file encoding)
itemName = itemName.replaceAll("[¼½¾⅐⅑⅒⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞↉]+", "");

// As ranges using unicode escapes
itemName = itemName.replaceAll("[\u00BC-\u00BE\u2150-\u215E\u2189]+", "");
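To make the trade-off between the two approaches concrete, here is a minimal sketch. The class and method names are hypothetical, and the trim() call is an addition (not in the snippets above) that drops the space left behind when a leading fraction is removed. The sample string is "½ cup flour, 20 cm² tin", written with Unicode escapes so that it does not depend on the source file encoding.

```java
// Hypothetical helper class; names are illustrative only.
class FractionStripper {
    // Removes every character in Unicode category No
    // (fractions, but also superscripts, circled digits, ...).
    static String stripOtherNumbers(String s) {
        return s.replaceAll("\\p{No}+", "").trim();
    }

    // Removes only the precomposed fractions listed explicitly.
    static String stripFractions(String s) {
        return s.replaceAll("[\u00BC-\u00BE\u2150-\u215E\u2189]+", "").trim();
    }

    public static void main(String[] args) {
        // One-half symbol at the start, superscript two after "cm".
        String item = "\u00BD cup flour, 20 cm\u00B2 tin";
        System.out.println(stripOtherNumbers(item)); // cup flour, 20 cm tin
        System.out.println(stripFractions(item));    // keeps the superscript two
    }
}
```

The category-based version also removes the superscript in "cm²", because superscript digits belong to the same Unicode category; the explicit list leaves it alone. Ordinary digits such as "20" are in a different category and survive both.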
Davila answered 12/4, 2017 at 2:49 Comment(7)
Note that fonts may render any sequence like 23/12 as fractions, thus enabling any fraction to be shown like that, not just the pre-composed ones. If that happens you may need to remove a lot more than just a list of characters.Counterstamp
Why the + in the regexes? Can't you simply leave it out, or does it do anything for efficiency?Dutton
@Dutton In this case the + quantifier makes the character class ([...]) match one or more consecutive characters, so a run of fraction symbols is removed in a single match. See this answer for more details: https://mcmap.net/q/303176/-what-is-the-meaning-of-in-a-regex Bonnibelle
@Dutton Yes, they aren't necessary, and yes, they should improve efficiency. One should probably not draw conclusions from it, but if you add a + at the end of the expression in this regex101 sample, the execution time goes down from 1 to 0 ms and the number of steps falls from 32 to 14. On an input without any repeats it only adds one step.Anticatalyst
@Anticatalyst I would refute that conclusion with regex101.com/r/9Md35x/1: the change seems marginal, and I would attribute it to the JavaScript implementation and maybe flow prediction.Dutton
@Dutton heh? Testing it on my side, it seems to behave marginally better with +, going down from 148305 steps to 139377 and from ~375 ms to ~350 ms. Thanks for taking the time to make a good data set in any case! You're right that it probably depends on regex engine specifics.Anticatalyst
I tested it with a larger sample and it's a 3% increase, but I would expect it to be dependent on the language and the code. JavaScript is a slow scripting language, so the prediction that another one might come as well could boost it to a larger % than for C or Java. Would be interesting to test though.Dutton
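The functional side of this discussion can be checked with a small sketch (hypothetical class and method names, and only the ¼ to ¾ range for brevity). It makes no claim about timing; it only shows that the + changes how many matches are made, not the cleaned result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class PlusQuantifierDemo {
    static final Pattern SINGLE = Pattern.compile("[\u00BC-\u00BE]");
    static final Pattern RUN = Pattern.compile("[\u00BC-\u00BE]+");

    // Counts how many separate matches the pattern finds in the input.
    static int countMatches(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Three fraction symbols in a row, then " cups".
        String item = "\u00BC\u00BD\u00BE cups";
        System.out.println(countMatches(SINGLE, item)); // 3
        System.out.println(countMatches(RUN, item));    // 1
        // Both patterns produce the same cleaned string.
        System.out.println(SINGLE.matcher(item).replaceAll("")
                .equals(RUN.matcher(item).replaceAll(""))); // true
    }
}
```

With the +, a run of adjacent symbols is consumed as one match and replaced once; without it, each symbol is matched and replaced separately. Whether that difference is measurable depends on the engine and the input, as the comments above note.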
2

You can use the regex below to replace the fractions ¼, ½ and ¾ (the range \xbc to \xbe) with an empty string.

str = str.replaceAll("(([\\xbc-\\xbe])?)", "");
Dissonancy answered 12/4, 2017 at 2:55 Comment(2)
Why the additional capturing groups () and the optional ? match?Regality
You know, just in case, you wanted to replace "" with ""Dutton
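The point made in these two comments can be illustrated with a short sketch (hypothetical class and method names). Replacing with a visible marker instead of "" shows where each pattern matches: because of the optional ?, the answer's pattern also matches the empty string at every position that does not hold a fraction.

```java
// Hypothetical demo class; names are illustrative only.
class OptionalGroupDemo {
    static final String ORIGINAL = "(([\\xbc-\\xbe])?)";
    static final String SIMPLIFIED = "[\\xbc-\\xbe]";

    // Replaces every match with "X" so the match positions become visible.
    static String mark(String regex, String s) {
        return s.replaceAll(regex, "X");
    }

    public static void main(String[] args) {
        String item = "a\u00BCb"; // 'a', one-quarter symbol, 'b'
        System.out.println(mark(ORIGINAL, item));   // XaXXbX
        System.out.println(mark(SIMPLIFIED, item)); // aXb
        // With an empty replacement the empty matches are harmless,
        // so both patterns clean the string to "ab".
        System.out.println(item.replaceAll(ORIGINAL, "")
                .equals(item.replaceAll(SIMPLIFIED, ""))); // true
    }
}
```

So the extra groups and the ? do not change the cleaned output when the replacement is "", but they add empty matches that do no useful work; the plain character class expresses the same intent more directly.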
