unix tr find and replace

4

16

This is the command I'm using on a standard web page I wget from a web site.

tr '<' '\n<' < index.html

However, it gives me the newlines but does not put the left bracket back in, e.g.

echo "<hello><world>" | tr '<' '\n<' | cat -e

returns

$
hello>$
world>$

instead of

$
<hello>$
<world>$

What's wrong?

Sibeal answered 1/12, 2011 at 23:19 Comment(0)

34

That's because tr only does character-for-character substitution (or deletion).
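A quick way to see the positional mapping, as a rough sketch (the `abc` input is only an illustration): each character of the first set is replaced by the character at the same position in the second set, so with a one-character first set only the first character of the second set is ever used.

```shell
# 'a' maps to 'x' and 'b' maps to 'y'; 'c' is not in the first set and passes through.
printf 'abc\n' | tr 'ab' 'xy'

# The first set is just '<', so it maps to '\n' alone; the extra '<' in the
# second set has no partner in the first set and is never written.
printf '<hello><world>\n' | tr '<' '\n<'
```

The second command reproduces the output from the question: every `<` becomes a newline and nothing else is added.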

Try sed instead.

echo '<hello><world>' | sed -e 's/</\n&/g'

Or awk.

echo '<hello><world>' | awk '{gsub(/</,"\n<",$0)}1'

Or perl.

echo '<hello><world>' | perl -pe 's/</\n</g'

Or ruby.

echo '<hello><world>' | ruby -pe '$_.gsub!(/</,"\n<")'

Or python.

echo '<hello><world>' \
| python3 -c 'import sys; sys.stdout.write(sys.stdin.read().replace("<", "\n<"))'

Heartily answered 1/12, 2011 at 23:23 Comment(7)

I tried that but I get n<hello>n<world>. I don't know what the sed newline character is – Sibeal 1/12, 2011 at 23:26

@Sibeal This works for me but try: echo -e '<hello><world>' | sed -e 's/</\n&/g' – Revisory 1/12, 2011 at 23:29

@Sibeal \n is a GNU sed extension. What system are you on? – Heartily 1/12, 2011 at 23:36

@Heartily SunOS (afs system on my campus) – Sibeal 1/12, 2011 at 23:43

On SunOS you will have to put the newline in manually. In the substitution field, hit Enter and continue with your replacement text. For a tab you will have to manually hit spaces (8 times) or whatever the default tab limit is on your machine. – Chilpancingo 1/12, 2011 at 23:48

@Jaypal A string of 8 spaces does not equal a tab; you need a literal tab character. The 8-space thing is about tab stops, not tabs. – Nickeliferous 4/12, 2011 at 7:27

Use perl when you are on an unspecified Unix machine. Using sed or tr on those machines can reveal they don't support expected features. – Underprop 29/3, 2019 at 9:43
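For a sed that does not understand `\n` in the replacement, as discussed in the comments above, a portable form can be sketched as follows. It relies on the POSIX rule that a newline in the replacement must be preceded by a backslash. The function name `break_before_tags` is made up for this illustration.

```shell
# Hypothetical helper: insert a newline before every '<' without relying on
# the GNU '\n' extension. The backslash at the end of the first line escapes
# the literal newline that follows it inside the replacement text.
break_before_tags() {
  sed 's/</\
</g'
}

printf '<hello><world>\n' | break_before_tags
```

The second line of the sed script must start at column one; any indentation there would become part of the replacement.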

3

If you have GNU grep, this may work for you:

grep -Po '<.*?>[^<]*' index.html

which should pass through all of the HTML, but each tag should start at the beginning of the line with possible non-tag text following on the same line.

If you want nothing but tags:

grep -Po '<.*?>' index.html

You should know, however, that it's not a good idea to parse HTML with regexes.

Nonexistence answered 4/12, 2011 at 6:30 Comment(0)

3

The order of where you put your newline is important. Also you can escape the "<".

tr '<' '<\n' < index.html

works as well.

Rubenstein answered 3/10, 2013 at 21:27 Comment(0)

2

Does this work for you?

awk -F"><" -v OFS=">\n<" '{print $1,$2}'

[jaypal:~/Temp] echo "<hello><world>" | awk -F"><" -v OFS=">\n<" '{$1=$1}1';
<hello>
<world>

You can put a regex pattern (/.../, matching the lines you want this to apply to) in front of the awk { } action.

Chilpancingo answered 1/12, 2011 at 23:38 Comment(2)

'{$1=$1}1' is shorter and will work if there is more than >< on a line. – Heartily 2/12, 2011 at 0:10

This would replace fewer of the < characters than in the question. – Nickeliferous 4/12, 2011 at 7:29
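The two comments above can be checked with a small sketch (the function name `split_adjacent_tags` and the sample inputs are assumptions for illustration). Assigning `$1=$1` makes awk rebuild the whole record with `OFS`, so every `><` pair on the line is split, not just the first one. A `<` that is not directly preceded by `>` is left alone, which is why this replaces fewer `<` characters than the command in the question.

```shell
# Hypothetical helper: split a line wherever one tag directly follows another.
# -F'><' treats "><" as the field separator; $1=$1 forces awk to rejoin the
# fields with OFS (">", newline, "<"); the trailing 1 prints the record.
split_adjacent_tags() {
  awk -F'><' -v OFS='>\n<' '{$1=$1}1'
}

# Three adjacent tags: both "><" boundaries are split.
printf '<a><b><c>\n' | split_adjacent_tags

# Text before a tag: there is no "><" here, so the line is unchanged.
printf 'text<a>\n' | split_adjacent_tags
```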
