RegEx to remove repeated start of line using TextWrangler

Asked 14/8, 2014 at 0:40 Answered 4/8, 2015 at 7:20

Trying to turn

a: 1, 2, 3
a: a, b, v
b: 5, 6, 7
b: 10, 1543, 1345
b: e, fe, sdf
cd: asdf, asdfas dfasdfa,asdfasdfa,afdsfa sdf
e1: asdfas, dafasd, adsf, asdfasd
e1: 1, 3, 2
e1: 9, 8, 7, 6

into

a: 1, 2, 3
   a, b, v
b: 5, 6, 7
   10, 1543, 1345
   e, fe, sdf
cd: asdf, asdfas dfasdfa,asdfasdfa,afdsfa sdf
e1: asdfas, dafasd, adsf, asdfasd
    1, 3, 2
    9, 8, 7, 6

So, the lines are sorted. If consecutive lines start with the same sequence of characters up to / including some separator (here the colon (and the blank following it)), only the first instance should be preserved - as should be the remainder of all lines. There could be up to about a dozen (and a half) lines starting with the identical sequence of characters. The input holds about 4,500 lines…
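To pin down the intended transformation outside any editor, here is a minimal Python sketch. The function name `collapse_prefixes` and the `sep` parameter are made up for illustration; the sketch assumes the separator is always a colon followed by one blank, and that lines without a separator are passed through unchanged.

```python
def collapse_prefixes(lines, sep=": "):
    """Blank out a prefix that repeats the prefix of the previous line."""
    result = []
    previous = None
    for line in lines:
        prefix, found, rest = line.partition(sep)
        if found and prefix == previous:
            # Same prefix as the line before: pad with spaces to keep alignment.
            result.append(" " * len(prefix + sep) + rest)
        else:
            result.append(line)
        # Lines without a separator reset the comparison.
        previous = prefix if found else None
    return result


if __name__ == "__main__":
    sample = ["a: 1, 2, 3", "a: a, b, v", "b: 5, 6, 7", "e1: x", "e1: 1, 3, 2"]
    print("\n".join(collapse_prefixes(sample)))
```

This is only a reference for the expected output, not a TextWrangler solution.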

Tried in TextWrangler.

Whilst the search pattern

^([[:alnum:]]+): (.+)\r((\1:) (.+)\r)*

matches correctly, neither the replacement

\1:\t\2\r\t\3\r

nor

\1:\t\2\r\t\4\r

gets me anywhere close to what I'm looking for.

The search pattern

^(.+): (.+)\r((?<=\1:) (.+)\r)*

is rejected because the lookbehind is not fixed length. I'm not sure it's heading in the right direction anyway, though.

Looking at "How to merge lines that start with the same items in a text file", I wonder whether there is an elegant (say: one search pattern, one replacement, run once) solution at all.

On the other hand, I might just not be able to come up with the right question to search the net for. If you know better, please point me in the right direction.

Keeping the remainder of the rows aligned is, of course, icing on the cake…

Thank you for your time.

Retool asked 14/8, 2014 at 0:40 Comment(5)

Is there a "sensible" limit that can be placed on the maximum number of consecutive lines with the same prefix? If so, what is it? – Inquisitive 2/8, 2015 at 17:11

@Inquisitive Originally I said something like "a dozen (and a half)" - so anything beyond should be fair. – Retool 2/8, 2015 at 18:56

My pcre try: (?<=(\w\w:)|(\w:))\h(.*\R?)\1?\2? replace with \t\3. See test at regex101. Max prefix length is 2, further max length can be added. Unclear if you want to have a tab for each line or replace each character of the prefix with one space. – Lannylanolin 3/8, 2015 at 7:34

@Jonny5 Adjusted like (?<=(\w\w:)|(\w:))\s(.*\n?)\1?\2? it does the job as wanted. You might want to turn that into an answer. If you do, please, expand on the mechanics. – Retool 3/8, 2015 at 21:19

@sln As far as I can tell, \G is not available. You might want to double check: TextWrangler User Manual. – Retool 3/8, 2015 at 21:58

As a workaround for variable length lookbehind: PCRE allows alternatives of variable length

PCRE is not fully Perl-compatible when it comes to lookbehind. While Perl requires alternatives inside lookbehind to have the same length, PCRE allows alternatives of variable length.

An idea that requires adding one alternative (one more pipe) for each possible prefix length, up to the maximum:

(?<=(\w\w:)|(\w:)) (.*\n?)\1?\2?

Replace with \t\3. See test at regex101. Capturing inside the lookbehind is important so that the prefix is not consumed and no match is skipped. The same pattern with a variable-length lookbehind, e.g. in .NET: (?<=(\w+:)) (.*\n?)\1?

(?<=(\w\w:)|(\w:)) first two capture groups inside lookbehind for capturing prefix: Two or one word characters followed by a colon. \w is a shorthand for [A-Za-z0-9_]
(.*\n?) third capture group for stuff between prefixes. Optional newline to get the last match.
\1?\2? optionally consumes the same prefix if the following line starts with it, so the replacement drops it. Only one of the two groups can be set: \1 xor \2. The space after the colon is always matched, regardless of the prefix.

Summary: The space after each prefix is converted to a tab. The prefix of the following line is removed only if it matches the current one.
To match and replace multiple spaces and tabs: (?<=(\w\w:)|(\w:))[ \t]+(.*\n?)\1?\2?
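For readers who want to try the mechanics in a script, here is a hedged Python sketch of the same idea. Python's `re` module rejects a lookbehind whose alternatives differ in width, so the sketch swaps in two separate fixed-width lookbehinds joined by an alternation outside the assertions. That is a different construct from the PCRE pattern above, and whether TextWrangler accepts it is untested. The function name `strip_repeated_prefixes` is made up for illustration.

```python
import re

# Two fixed-width lookbehinds instead of one lookbehind with alternatives of
# different widths (Python's re only accepts fixed-width lookbehinds).
# Group 1: two-character prefix, group 2: one-character prefix,
# group 3: rest of the line including the optional newline.
PATTERN = re.compile(r"(?:(?<=(\w\w:))|(?<=(\w:))) (.*\n?)\1?\2?")


def strip_repeated_prefixes(text):
    # The trailing \1?\2? swallows the same prefix at the start of the next
    # line; the replacement keeps only a tab plus the rest of the line.
    return PATTERN.sub(r"\t\3", text)


if __name__ == "__main__":
    sample = "a: 1\na: 2\nb: 3\ncd: 4\ncd: 5\n"
    print(strip_repeated_prefixes(sample))
```

As with the original pattern, the maximum prefix length is 2 here; each additional length needs one more lookbehind branch and one more optional backreference.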

Lannylanolin answered 4/8, 2015 at 4:22 Comment(3)

Interesting approach to replace the start of the following line! - The tabulator is sure sufficient for the alignment. But would it be possible to use spaces for alignment - and still do it in one go? – Retool 4/8, 2015 at 8:32

@Retool If \G (end of previous match) and \K (reset beginning of match) are supported, see test at regex101: (?<=(\w\w:)|(\w:)).*\n(?=\1|\2)\K\w|\G(?!^)[\w:] and replace with a space. – Lannylanolin 4/8, 2015 at 8:58

Tried earlier to find \G mentioned in the current TextWrangler User Manual - with no success. - Testing the regex in TextWrangler, it just does nothing at all. -- Still appreciate the attempt. – Retool 4/8, 2015 at 9:19

The problem with the substitution is the uncertain number of matches. When you limit that number e.g. to 12, you could use a regex like this:

^([^:]+): ([^\n]+[\n]*)(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?(\1: ([^\n]+[\n]*))?

with this replacement:

\n\1:\t\2\t\4\t\6\t\8\t\10\t\12\t\14\t\16\t\18\t\20\t\22\t\24

Explanation: it contains basically just two sub-regexes

^([^:]+): ([^\n]+[\n]*) = matches on the first line of a group
(\1: ([^\n]+[\n]*))? = optional matches on consecutive lines, belonging to the same group. You have to copy this regex as often as needed to match all lines (i.e. in this case 12x). The ? (= optional) match won't give you an error if there aren't enough matches for all substitutions.
the \n at the beginning of the substitution is needed for a formatting issue
the result will contain a few empty lines, but I'm sure, you can solve that... ;-)
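The copy-and-paste step can be generated rather than typed. The following Python sketch builds the pattern from the two sub-regexes with an assumed limit of 12 follow-up lines (`MAX_EXTRA` is a made-up name). Unlike the replacement string above, it uses a replacement function to emit the continuation lines directly, which is only possible in a scripting context and not in TextWrangler's replace field.

```python
import re

MAX_EXTRA = 12  # assumed upper bound on follow-up lines per group

# First line of a group: prefix, colon, blank, rest of the line.
FIRST = r"^([^:\n]+): ([^\n]*\n?)"
# One optional follow-up line that repeats the prefix via backreference \1.
EXTRA = r"(?:\1: ([^\n]*\n?))?"
PATTERN = re.compile(FIRST + EXTRA * MAX_EXTRA, re.MULTILINE)


def _collapse(match):
    prefix, first, *rest = match.groups()
    out = prefix + ":\t" + first
    for line in rest:
        # Optional groups that did not take part in the match are None.
        if line is not None:
            out += "\t" + line
    return out


def collapse_prefixes(text):
    return PATTERN.sub(_collapse, text)


if __name__ == "__main__":
    print(collapse_prefixes("a: 1\na: 2\nb: 3\n"))
```

A group with more than `MAX_EXTRA + 1` lines starts a new match, so its prefix is printed again after every 13 lines.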

DEMO 1

However, since I'm not a fan of over-sized regexes - and for the case that you have a bigger number of potential matches - I would prefer a solution like this:

combine all lines, belonging to the same group (as you already mentioned: How to merge lines that start with the same items in a text file). Within these steps, you can replace the group item by something unique (e.g. :@:).
replace this unique item with \n\t
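The two steps can be sketched in Python as follows. The marker `:@:` is taken from the text above and is assumed not to occur in the input; the function name `merge_then_split` and the merge pattern are illustrative guesses, not the patterns used in the demo.

```python
import re

MARKER = ":@:"  # assumed not to occur anywhere in the input

# A line followed by another line that starts with the same "prefix: ".
MERGE = re.compile(r"^([^:\n]+: )(.*)\n\1", re.MULTILINE)


def merge_then_split(text):
    # Step 1: join lines of the same group, replacing the line break and the
    # repeated prefix with the marker. Repeat until nothing changes, because
    # one pass only joins each line with its immediate successor.
    while True:
        text, count = MERGE.subn(lambda m: m.group(1) + m.group(2) + MARKER, text)
        if count == 0:
            break
    # Step 2: turn every marker into a line break plus a tab.
    return text.replace(MARKER, "\n\t")


if __name__ == "__main__":
    print(merge_then_split("a: 1\na: 2\na: 3\nb: 4\n"))
```

In an editor, step 1 corresponds to running "Replace All" repeatedly until no more matches are found.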

DEMO 2

Uribe answered 3/8, 2015 at 2:17 Comment(1)

Yeah - that's getting pretty close. - But either one of your solutions requires two steps. – Retool 3/8, 2015 at 19:20

The awk one-liner below will do what you want

awk -F: 'NR==1 {print $0} NR != 1 {if ($1 != prev) print $0; else {for (i=0; i<=length($1); ++i) printf " "; print $2;}} {prev=$1}' < input_file.txt

(put the original text into input_file.txt)

I believe it is possible to write nicer code, but it is time to go to bed.

Guilbert answered 2/8, 2015 at 3:34 Comment(3)

It does the job - but unfortunately outside TextWrangler. (And it doesn't even need a RegEx... Should that be considered a hint? ;-) ) – Retool 2/8, 2015 at 19:25

sorry, did not pay enough attention. but imho you can't do it purely with regexp since the decision sequence depends on the knowledge of what the beginning of the previous line was. – Guilbert 3/8, 2015 at 9:4

It's still a neat solution - good to know if nothing became available inside TextWrangler. – Retool 3/8, 2015 at 18:43

I tried your sample in Bare Bones Software Inc.'s TextWrangler and I came up with a two pass solution which is limited to n consecutive lines, and it uses a tab instead of trying to magically match the length of the prefix. Also note that the last line of the file should be an empty line (add a newline after , 6 in your example)

For our purposes I'm showing you where n=4:

Find: ^([[:alnum:]]+\:)(.+\r)(?:\1(.+\r))?(?:\1(.+\r))?\1(.+)\r
Replace: \1\2\t\3\t\4\t\5\r

You can increase n by one by duplicating a (?:\1(.+\r))? in Find and adding a tab plus the next backreference before the final \r in Replace, where the next backreference is one higher than the last number before that \r (e.g. \t\6 after \5).

Replacing all with this, you can follow it up with:

Find: ^\t+
Replace: \t

To mostly get the result you want.

Arnold answered 4/8, 2015 at 7:20 Comment(1)

Yes - this one is getting very close as well. It correctly addresses the main concern (the repeated start of line). With the trailing \r made optional in the to be found expression, it would not be necessary to add a newline to the input. Output looks a bit cleaner when replacing ^\t+ (which has an additional trailing blank) in the second pass. It still leaves a new tab at the end of each group, for which replacements were made. (Which could, of course, be fixed in a third pass.) – Retool 4/8, 2015 at 11:18

Since you would like to replace every instance except the first one, I assume you need a regex that matches everything but the first so you can replace them. As you know, a regular expression cannot modify or alter the original string; it only returns a specific match, which can then be used to specify the parts of the string to modify.

The best regex I could come up with is /(\b[a-zA-Z0-9]+: )[^\n]+(?:\n|$)(?!\1)/g.

This will capture every unique instance of xx: and match the last instances of it. Only issue with this is that it'll still match the last instance even if it's the only instance.

My conclusion is that I don't believe you can do this all with regex. I may be wrong, if someone can find an online regex debugger that supports lookbehind AND backreferencing, let me know and I'll see if I can write an expression to work. I could not personally find any regex debuggers that accept backreferencing and lookbehind. In my example I use lookahead instead so it checks if there are any instances of it ahead, if so ignore the current match (so it selects only the last instance).

If you really wanted to find a way to automate this to make it work, use /(\b[a-zA-Z0-9]+: )/g to match every instance of xx:, store them all in an array and if there is a duplicate, run the original regex on that specific one to continue trimming it down until there are no more duplicates. Again you may be able to use it to store all unique instances and utilize that somehow.

Hope this helps or clarifies your problem, apologies if it doesn't.

Plumbiferous answered 2/8, 2015 at 3:13 Comment(1)

I'd thought RegEx101 does support backreferences as well as lookbehind. (Please remember, though, TextWrangler does not support variable length lookbehind.) Checking your suggestion in RegEx101 - it confirms your description. And basically the reason for my question: How to get the job done with a TextWrangler compatible RegEx? – Retool 2/8, 2015 at 19:30

I don't have TextWrangler to test with, but I tested this in another regex tool and it works well. Please try:

(?<=(?:(?:.+\n)|^)(\w+?:).+\n)\1(?=\s)

Midwife answered 14/8, 2014 at 2:19 Comment(3)

Unfortunately, this does raise the "The search cannot proceed, because of a syntax error in the Grep pattern: lookbehind assertion is not fixed length" as well. – Retool 14/8, 2014 at 7:54

Then: Would this catch more than one occurrence? Finally: Sorry, but my main issue is not finding the respective groups of lines but replacing the repetition at the beginning of the 2+ lines of each group. – Retool 14/8, 2014 at 8:6

As the asker said, the original search pattern is fine; the problem is replacing an unknown number of instances per match. – Seagoing 1/8, 2015 at 6:48