How to merge lines that start with the same items in a text file
S

3

1

I have a text file containing some thousand lines as follows:

File:

abc: bla1 bla1 bla1... 
cde: bla bla bla... 
ghk: bla1 bla1 bla1... 
lmn: bla bla bla...
abc: bla2 bla2 bla2... 
bcd: bla bla bla... 
ghk: bla2 bla2 bla2... 
xyz: bla bla bla...

I want to merge all lines that start with the same item (such as lines 1 and 5, or lines 3 and 7), so that I get a new text file like this:

New File:

abc: bla1 bla1 bla1... * abc: bla2 bla2 bla2... 
cde: bla bla bla... 
ghk: bla1 bla1 bla1... * ghk: bla2 bla2 bla2...
lmn: bla bla bla...
bcd: bla bla bla...   
xyz: bla bla bla...

Can this be solved using regex and/or grep, and if so, how?

I'm quite familiar with grep because I use TextWrangler, but I'm also OK with other text editors.

Help much appreciated.

Slusher answered 11/8, 2014 at 18:12 Comment(2)
I don't think there is an elegant solution for this. Try a Perl approach. First pass: populate a hash whose keys are the start items and whose values are arrays of the line numbers containing each start item. Duplicate the file. Second pass: merge lines from one copy into the other based on the hash. Third pass: delete the now-redundant lines based on the hash. - Hugh
Does order matter? If not, sort first. Then each 'xyz' line will be followed directly by any other 'xyz' line, and you can use a regex that merges those lines into one. - Noise
N
2

If order doesn't matter, I suggest first sorting the text. That will place

abc: ...
abc: ...

next to one another. Then run this search and replace repeatedly until it finds no more matches:

Search:
  ^(\w+): (.*)\n\1: 
Replace:
  \1: \2 

Result:
   abc: bla1 bla1 bla1... bla2 bla2 bla2...
   bcd: bla bla bla...
   cde: bla bla bla...
   ghk: bla1 bla1 bla1... bla2 bla2 bla2...
   lmn: bla bla bla...
   xyz: bla bla bla...
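
If a command line is available, the same sort-then-merge idea can be done in a single pass instead of repeated regex replacements. The following is only a sketch and not part of the original answer: it assumes the key is the first whitespace-separated field (for example `abc:`), and the function name `merge_sorted` and the sample input are made up for illustration. Like the result above, it joins the merged text with a single space.

```shell
# merge_sorted: hypothetical helper; reads lines on stdin, writes merged lines.
# Assumption: the key is the first whitespace-separated field, e.g. "abc:".
merge_sorted() {
  # -s asks for a stable sort (not required by POSIX, available in GNU sort),
  # so lines with the same key keep their original relative order.
  sort -s -k1,1 | awk '
    NR > 1 && $1 == key {                     # same key as the previous line
      rest = $0
      sub(/^[ \t]*[^ \t]+[ \t]+/, "", rest)   # drop the repeated key
      line = line " " rest
      next
    }
    NR > 1 { print line }                     # key changed: flush previous group
    { key = $1; line = $0 }                   # start a new group
    END { if (NR > 0) print line }            # flush the last group
  '
}

# Sample input, invented for illustration
printf '%s\n' 'abc: bla1' 'cde: bla' 'abc: bla2' | merge_sorted
```

For real use, something like `merge_sorted < file > newfile` should do.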

If order DOES matter, run this search and replace repeatedly until it no longer matches:

Search:
  ^(\w+): (.*)\n((?:(?!\1).*\n)+)\1: (.*\n)
Replace:
  \1: \2 \4\3

Result (1st pass):
  abc: bla1 bla1 bla1... bla2 bla2 bla2...
  cde: bla bla bla...
  ghk: bla1 bla1 bla1...
  lmn: bla bla bla...
  bcd: bla bla bla...
  ghk: bla2 bla2 bla2...
  xyz: bla bla bla...

Result (2nd pass):
  abc: bla1 bla1 bla1... bla2 bla2 bla2...
  cde: bla bla bla...
  ghk: bla1 bla1 bla1... bla2 bla2 bla2...
  lmn: bla bla bla...
  bcd: bla bla bla...
  xyz: bla bla bla...
Noise answered 11/8, 2014 at 21:47 Comment(1)
This is absolutely what I'm looking for. Thank you so much. - Slusher
B
3

With GNU bash (version 4 or later, which is needed for associative arrays), if the order does not matter:

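declare -A A      # associative array (bash 4+): first word -> merged rest
# fill array
while read -r I L; do
  [ -n "$I" ] || continue                  # skip blank lines
  if [ -n "${A[$I]}" ]; then
    A[$I]+=" * $L"                         # key seen before: append with separator
  else
    A[$I]=" $L"                            # first occurrence of this key
  fi
done < filename
# print array (iteration order is unspecified)
for J in "${!A[@]}"; do echo "$J${A[$J]}"; done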

Output:

xyz: bla bla bla...
lmn: bla bla bla...
abc: bla1 bla1 bla1... * bla2 bla2 bla2...
ghk: bla1 bla1 bla1... * bla2 bla2 bla2...
bcd: bla bla bla...
cde: bla bla bla...
Baucis answered 11/8, 2014 at 19:50 Comment(0)
G
0

If you can use awk, this should work:

awk '{a[$1]=a[$1]?a[$1]"* "$0:$0} END {for (i in a) print a[i]}' file

Output:

ghk: bla1 bla1 bla1... * ghk: bla2 bla2 bla2...
lmn: bla bla bla...
cde: bla bla bla...
xyz: bla bla bla...
bcd: bla bla bla...
abc: bla1 bla1 bla1... * abc: bla2 bla2 bla2...
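
Note that `for (i in a)` visits the keys in an unspecified order, which is why the output above is shuffled. If the original order should be kept, one possible variation (a sketch, not the answerer's code) is to record each key in a second array the first time it is seen. The names `merge_keep_order`, `merged` and `order` are made up for illustration, and the key is again assumed to be the first whitespace-separated field.

```shell
# merge_keep_order: hypothetical helper; reads stdin, writes merged lines
# in the order in which each key first appears.
merge_keep_order() {
  awk '
    NF == 0 { next }                                  # skip blank lines
    !($1 in merged) {                                 # first time this key is seen
      order[++n] = $1
      merged[$1] = $0
      next
    }
    { merged[$1] = merged[$1] " * " $0 }              # later occurrence: append
    END { for (i = 1; i <= n; i++) print merged[order[i]] }
  '
}

# Sample input, invented for illustration
printf '%s\n' 'abc: bla1' 'cde: bla' 'abc: bla2' | merge_keep_order
```

On the sample input this prints `abc: bla1 * abc: bla2` followed by `cde: bla`, which is the layout asked for in the question.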

Gerhardine answered 12/8, 2014 at 5:53 Comment(0)
