Extract All Unique Lines
L

4

12

I have text files with repeated exact lines of text, but I only want one of each. Imagine this text file:

AAAAA
AAAAA
AAAAA
BB
BBBBB
BBBBB
CCC
CCC
CCC

I would only need the following four lines from it:

AAAAA
BB
BBBBB
CCC

I'm using a text editor (EmEditor or Notepad++) that supports RegEx, not a programming language, so I need a purely regular-expression solution.

Any help?

EDIT: I checked the other thread that hsz mentioned, and I'd like to make it clear that this one is not the same. Although both need to remove duplicate lines, the way to achieve it is different. I need pure RegEx, but the best answer in the other thread relies on a specific Notepad++ plug-in (which doesn't even ship with it any more), so it's not even a regex solution. The second answer there is a regex and it does work in Notepad++, but not in EmEditor at all, which I also need. So I don't think my question duplicates that one, although that link is useful, and so I thank hsz for it.

Lais answered 14/7, 2014 at 10:46 Comment(4)
Possible duplicate of Removing duplicate rows in Notepad++ - Overtrump
Are repeated lines grouped together? That is, can the file be AAAA BBBB AAAA BBBB so that you want to make it AAAA BBBB? - Annabelleannabergite
Answer to Gelbukh: The lines must be in the exact same order as they were originally. - Lais
Possible duplicate of find duplicate lines and remove using regular expression with replace feature - Inchoation
M
14

Two nearly identical options:

Match All Lines That Are Not Repeated

(?sm)(^[^\r\n]+$)(?!.*^\1$)

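This matches each line that does not appear again later in the file (for duplicated lines, that is the last occurrence). To extract those lines in a text editor, though, you really want to remove the other ones.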
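As a rough illustration outside the editors, here is a minimal Python sketch of the first pattern. It assumes Python's `re` engine and LF line endings; the function name `lines_not_repeated_later` and the `SAMPLE` constant are made up for this example, and EmEditor or Notepad++ may behave differently in edge cases.

```python
import re

# First pattern from the answer, copied verbatim.
NOT_REPEATED_LATER = re.compile(r"(?sm)(^[^\r\n]+$)(?!.*^\1$)")

# Sample text from the question, assuming LF line endings.
SAMPLE = "AAAAA\nAAAAA\nAAAAA\nBB\nBBBBB\nBBBBB\nCCC\nCCC\nCCC\n"


def lines_not_repeated_later(text):
    """Return every non-empty line that has no identical copy further down."""
    # With a single capturing group, findall returns Group 1 of each match.
    return NOT_REPEATED_LATER.findall(text)


if __name__ == "__main__":
    print(lines_not_repeated_later(SAMPLE))
```

On the sample from the question this returns the four wanted lines. Because the last occurrence is the one kept, input such as A, B, A comes back as B, A.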

Replace All Repeated Lines

This will work better in Notepad++:

Search: (?sm)(^[^\r\n]*)[\r\n](?=.*^\1)

Replace: empty string

  • (?s) activates DOTALL mode, allowing the dot to match across lines
  • (?m) turns on multi-line mode, allowing ^ and $ to match on each line
  • (^[^\r\n]*) captures a line to Group 1, i.e.:
  • The ^ anchor asserts that we are at the beginning of a line
  • [^\r\n]* matches any chars that are not newline chars
  • [\r\n] matches a newline char
  • The lookahead (?=.*^\1) asserts that we can match any number of characters .*, then...
  • ^\1 the same text as Group 1 at the start of a later line
  • In the first option, the negative lookahead (?!.*^\1$) asserts the opposite: no later line is identical to Group 1
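The search-and-replace above can be tried out with a short Python sketch. It assumes Python's `re` engine and LF line endings, so it only approximates what the editors do. `drop_repeated` and `STRICT_PATTERN` are names invented here; the stricter variant is my own assumption, not part of the answer. One thing the sketch makes visible: because the lookahead has no `$` after `\1`, a line that is merely a prefix of a later line (BB before BBBBB) is treated as repeated. The same missing `$` is what removes empty lines, since an empty Group 1 matches at the start of any later line.

```python
import re

# Search pattern from the answer, copied verbatim.
ANSWER_PATTERN = re.compile(r"(?sm)(^[^\r\n]*)[\r\n](?=.*^\1)")

# Hypothetical stricter variant (NOT from the answer): the added $ makes the
# lookahead require a whole identical line rather than a matching prefix.
STRICT_PATTERN = re.compile(r"(?sm)(^[^\r\n]*)[\r\n](?=.*^\1$)")

# Sample text from the question, assuming LF line endings.
SAMPLE = "AAAAA\nAAAAA\nAAAAA\nBB\nBBBBB\nBBBBB\nCCC\nCCC\nCCC\n"


def drop_repeated(text, pattern=ANSWER_PATTERN):
    """Delete every line (with its newline) that the lookahead finds again later."""
    return pattern.sub("", text)


if __name__ == "__main__":
    print(repr(drop_repeated(SAMPLE)))                  # BB is dropped as a prefix of BBBBB
    print(repr(drop_repeated(SAMPLE, STRICT_PATTERN)))  # BB is kept
    print(repr(drop_repeated("A\n\nB\n")))              # the empty line is removed
```

If prefix lines matter in your files, the anchored variant may be the safer choice, at the cost of only removing an empty line when another empty line follows later.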
Merodach answered 14/7, 2014 at 11:0 Comment(4)
Added an option, Replace All Repeated Lines, that will work better in a text editor since you want to "extract" the lines. - Merodach
Thank you very much. Your second RegEx (Replace All Repeated Lines) is what I need. The first one does the opposite (but might be useful, so let it be). It works equally well in both EmEditor and Notepad++ as I need; however, it does not remove the empty lines. :( I already tried adding '|^\n$' to the end, but it does nothing. If you could just help me with that, this would be the best answer. :) - Lais
Please see revised answer. If this works for you, please consider accepting the answer by clicking the checkmark on the left, as this is how the rep system works on the site. Thanks! - Merodach
Perfect! Works well in both editors, exactly what I needed. I'm voting this for the best answer (hope the system accepts it. Last time it didn't because I'm new here). One simple last request: please switch the order of your answers, since the second is what the thread is all about. I fear some people might not vote you up because of that. ;-) - Lais
T
4

You can use the following regular expression to remove both repeated and empty lines.

Find: ^(.*)(\r?\n\1)+$
Replace: \1
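
A minimal Python sketch of this find/replace pair, for experimenting outside the editors. It assumes Python's `re` engine, with `re.MULTILINE` standing in for the editors' per-line `^` and `$`; the name `collapse_adjacent` is invented for this example. In this sketch only runs of adjacent identical lines are collapsed, so duplicates separated by other lines are left alone.

```python
import re

# Find pattern from the answer; re.MULTILINE is an assumption that mimics
# how the editors anchor ^ and $ on every line.
ADJACENT_DUPLICATES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)

# Sample text from the question, assuming LF line endings.
SAMPLE = "AAAAA\nAAAAA\nAAAAA\nBB\nBBBBB\nBBBBB\nCCC\nCCC\nCCC\n"


def collapse_adjacent(text):
    """Replace each run of identical adjacent lines with a single copy."""
    return ADJACENT_DUPLICATES.sub(r"\1", text)


if __name__ == "__main__":
    print(repr(collapse_adjacent(SAMPLE)))
    print(repr(collapse_adjacent("A\nB\nA\n")))  # non-adjacent duplicates stay
```

The trailing `$` is what keeps BB from being merged into the following BBBBB line.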
Taskwork answered 14/7, 2014 at 11:56 Comment(2)
Thank you. Good solution but only works on Notepad++, as it is. I removed the question mark '?' to make it work on EmEditor, but still it only removes a few lines. I think this might be a bug of EmEditor (the program itself) not a fault of your code, so I consider this answer correct. However since I had to choose only one as the best, I chose the one from zx81, because his answer is detailed, it doesn't require any replacement (more practical) and also removes any empty line that might be in the original file (something I also needed), and of course, it works as is in both editors. - Lais
In VS Code use replace: $1 and then "replace all". - Intellectualize
A
0

Provided that identical lines are grouped together (that is, AAAA AAAA BBBB BBBB and not AAAA BBBB AAAA BBBB), the following works, in Perl notation:

s/(^.*$)(\r?\n\1$)*/$1/gm;

which means replace every match of /(^.*$)(\r?\n\1$)*/ with $1, globally and in multiline mode (^ and $ match at internal \n).

This expression means that any complete line followed by any number of equal lines is substituted by a single occurrence.
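
For readers without Perl at hand, the substitution can be approximated in Python. This is a sketch under the assumption that Perl's /m maps to `re.MULTILINE` and /g to the default behaviour of `re.sub`; the name `squeeze_runs` is invented here. It also shows the grouping proviso: interleaved duplicates are left untouched.

```python
import re

# Python spelling of the Perl substitution s/(^.*$)(\r?\n\1$)*/$1/gm;
# /m is assumed to correspond to re.MULTILINE, /g to re.sub replacing all matches.
GROUPED_RUN = re.compile(r"(^.*$)(\r?\n\1$)*", re.MULTILINE)


def squeeze_runs(text):
    """Replace a line plus any identical lines directly after it with one copy."""
    return GROUPED_RUN.sub(r"\1", text)


if __name__ == "__main__":
    print(repr(squeeze_runs("AAAA\nAAAA\nBBBB\nBBBB")))  # grouped: collapsed
    print(repr(squeeze_runs("AAAA\nBBBB\nAAAA\nBBBB")))  # interleaved: unchanged
```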

See help on your particular editor for how to apply such a regex.

Annabelleannabergite answered 14/7, 2014 at 11:5 Comment(1)
Thanks, but this is not for a simple text editor as I requested. I've tried it without the final parts, but it still doesn't work. - Lais
A
-1

I don't know whether it will work in Notepad++ or EmEditor, but it works fine in PHP/JavaScript/Python with substitution.

^(.+)(\n(\1))*$

Here is a demo.

Simply paste your text there and get the final result from the link I shared.

Angulo answered 14/7, 2014 at 11:8 Comment(1)
Thanks for the link, the debugger is useful. However, the regex needs to replace any char, not just letters, and so it didn't do what I actually needed. So I replaced the \w by . but now it clears everything in both EmEditor and Notepad++, although it "works" fine on the debugger... Maybe it's using a different regex standard... - Lais
