Code golf: "Color highlighting" of repeated text

Asked 3/7, 2010 at 12:58 Answered 3/7, 2010 at 13:29

string language-agnostic code-golf rosetta-stone

(Thanks to greg0ire below for helping with key concepts)

The challenge: Build a program that finds all substrings and "tags" them with color attributes (effectively highlighting them in XML).

The rules:

This should only be done for substrings of length 2 or more.
Substrings are just strings of consecutive characters, which may include non-alphabetic characters. Note that spaces and other punctuation do not delimit substrings.
Character casing cannot be ignored.
The "highlight" should be done by tagging the substring in XML. Your tagging should be of the form <TAG#>theSubstring</TAG#> where # is a positive number unique to that substring and identical substrings.
The priority of the algorithm is to find the longest substring, not how many times it matches within the text.

Note: The order of the tagging shown in the example below is not important. Its just used by the OP for clarity.

An example input:

LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.

A partially correct output (OP may NOT have completely replaced perfectly in this example)

<TAG1>LoremIpsum</TAG1>issimply<TAG2>dummytext</TAG2>of<TAG5>the</TAG5><TAG3>print</TAG3>ingand<TAG4>type</TAG4>setting<TAG6>industry</TAG6>.<TAG1>LoremIpsum</TAG1>hasbeen<TAG5>the</TAG5><TAG6>industry</TAG6>'sstandard<TAG2>dummytext</TAG2>eversince<TAG5>the</TAG5>1500s,whenanunknown<TAG3>print</TAG3>ertookagalleyof<TAG4>type</TAG4>andscrambledittomakea<TAG4>type</TAG4>specimenbook.

Your code should be able to handle edge cases, such as the following:

Example Input 2:

hello!TAG!</hello.TAG.</

Example Output 2:

<TAG1>hello</TAG1>!<TAG2>TAG</TAG2>!<TAG3></</TAG3><TAG1>hello</TAG1>.<TAG2>TAG</TAG2>.<TAG3></</TAG3>

The winner:

Most elegant solution wins (judged by others comments, upvotes)
Bonus points/consideration for solutions utilizing shell scripting

Minor clarifications:

Input can be hard coded or read from a file
The criteria remains "elegance", which admittedly IS slightly vague, but it also encapsulates simple character/line counts as well. Comments by others and/or upvotes are also indicative of how the SO community views the challenge

Diallage answered 3/7, 2010 at 12:58 Comment(4)

+1 Very interesting question. Are you going to look for the largest repeated pieces of text, or for the most repeated pieces of text? How are you going to name a "piece of text" (to me, it is a group of consecutive words)? An "expression"? – Booklover 3/7, 2010 at 13:19

Hi greg - thanks for askiing... i actually need it for ANY string of bytes actually and the "largest" such string... :) – Diallage 3/7, 2010 at 13:59

Hi RubiCon10. If you accept answers in all languages you might want to add the tags language-agnostic and rosetta-stone. Also: most elegant solution is not really code-golfing in my definition ;-) – – Kulp 6/7, 2010 at 20:18

Testing for eloquence is 1. not [code-golf] 2. highly subjective and 3. not covered by the special remit that has been made for code-golf. In general, and despite the exception for [code-golf], Stack Overflow is not a place for contests. – Takahashi 7/7, 2010 at 0:50

Perl 206, 189, 188, 199, 157 chars

excluding original string and last print.

 #New algorithm that produces correct ouputs for many cases



    push@z,q/LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook/;

    push@z,q/oktooktobookokm/;
    push@z,q!dino1</dino2</!;
    push@z,q!dino1TAG2dino3TAG!;

    ## loop for tests doesn't count
    for $z(@z) {
    print "input : $z\n";
    $i=0;@r=();
    #### begin count
    $c=127;$l=length($_=$z);while($l>1){if(/(.{$l}).*\1/){push@r,$1;++$c;s/$1/chr($c)/eg}else{$l--}}$z=$_;map{++$i;$x=chr(127+$i);$z=~s:$x:<TAG$i>$_</TAG$i>:g}@r
    #### end count 157 chars
    ## output doesn't count
    ;print "output : $z\n","="x80,"\n"
    }

__END__
perl tags2.pl
input : LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe15
00s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook

output : <TAG1>LoremIpsum</TAG1>i<TAG11>ss</TAG11><TAG12>im</TAG12>ply<TAG2>dummytext</TAG2><TAG6>oft</TAG6><TAG13>he</TAG13><TAG4>p
rint</TAG4><TAG7>ing</TAG7><TAG8>and</TAG8><TAG5>types</TAG5>e<TAG14>tt</TAG14><TAG7>ing</TAG7><TAG3>industry</TAG3>.<TAG1>LoremIpsu
m</TAG1>hasbe<TAG15>en</TAG15><TAG9>the</TAG9><TAG3>industry</TAG3>'<TAG11>ss</TAG11>t<TAG8>and</TAG8>ard<TAG2>dummytext</TAG2>ev<TA
G16>er</TAG16>since<TAG9>the</TAG9>1500s,w<TAG13>he</TAG13>nanunknown<TAG4>print</TAG4><TAG16>er</TAG16>t<TAG10>ook</TAG10>agal<TAG1
7>le</TAG17>y<TAG6>oft</TAG6>y<TAG18>pe</TAG18><TAG8>and</TAG8>scramb<TAG17>le</TAG17>di<TAG14>tt</TAG14>omakea<TAG5>types</TAG5><TA
G18>pe</TAG18>c<TAG12>im</TAG12><TAG15>en</TAG15>b<TAG10>ook</TAG10>
================================================================================
input : oktooktobookokm
output : <TAG1>okto</TAG1><TAG1>okto</TAG1>bo<TAG2>ok</TAG2><TAG2>ok</TAG2>m
================================================================================
input : dino1</dino2</
output : <TAG1>dino</TAG1>1<TAG2></</TAG2><TAG1>dino</TAG1>2<TAG2></</TAG2>
================================================================================
input : dino1TAG2dino3TAG
output : <TAG1>dino</TAG1>1<TAG2>TAG</TAG2>2<TAG1>dino</TAG1>3<TAG2>TAG</TAG2>
================================================================================

Corposant answered 3/7, 2010 at 12:58 Comment(2)

This does not cover strings of length 2, only 3 and above – Evadne 7/7, 2010 at 6:1

Tagged "ok" instead of "ook" in Loremipsum (see last word) – Nicolette 10/7, 2010 at 7:16

Python, 236 206 chars

s="LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook."
### ------------------------------------------------------------
import re
c=o=127
l={}
i=len(s)/2
while i>1:
    r=re.search('(.{%d}).*\\1'%i,s)
    if r:f=r.group(1);c+=1;l[c-o]=f;s=s.replace(f,chr(c))
    else:i-=1
for i in l:s=re.sub(chr(i+o),'<TAG%d>%s</TAG%d>'%(i,l[i],i),s)
### ------------------------------------------------------------
print s

And the result of running this on the example input, it picks the following words ('LoremIpsum', 'dummytext', 'industry', 'print', 'types', 'oft', 'ing', 'and', 'the', 'ook', 'ss', 'im', 'he', 'tt', 'en', 'er', 'le', 'pe') and the result is:

<TAG1>LoremIpsum</TAG1>i<TAG11>ss</TAG11><TAG12>im</TAG12>ply<TAG2>dummytext</TAG2><TAG6>oft</TAG6><TAG13>he</TAG13><TAG4>print</TAG4><TAG7>ing</TAG7><TAG8>and</TAG8><TAG5>types</TAG5>e<TAG14>tt</TAG14><TAG7>ing</TAG7><TAG3>industry</TAG3>.<TAG1>LoremIpsum</TAG1>hasbe<TAG15>en</TAG15><TAG9>the</TAG9><TAG3>industry</TAG3>'<TAG11>ss</TAG11>t<TAG8>and</TAG8>ard<TAG2>dummytext</TAG2>ev<TAG16>er</TAG16>since<TAG9>the</TAG9>1500s,w<TAG13>he</TAG13>nanunknown<TAG4>print</TAG4><TAG16>er</TAG16>t<TAG10>ook</TAG10>agal<TAG17>le</TAG17>y<TAG6>oft</TAG6>y<TAG18>pe</TAG18><TAG8>and</TAG8>scramb<TAG17>le</TAG17>di<TAG14>tt</TAG14>omakea<TAG5>types</TAG5><TAG18>pe</TAG18>c<TAG12>im</TAG12><TAG15>en</TAG15>b<TAG10>ook</TAG10>.

Which is more readable on this wiki highlighted like this:

LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.

PS. Somebody complained so I added input and output statements. To the confused I apologize - it seemed obvious to me. Apparently not, so I added prefix/trailer statements, which are not required by the problem spec and should not be counted to the code length.

Evadne answered 3/7, 2010 at 12:58 Comment(7)

What does the range(32) mean here? Does it mean it can handle only upto a length of 32 characters (sorry can't read python... yet!) – Diallage 7/7, 2010 at 6:4

-1: Doesn't follow specs, s is undefined, and no output even when s is defined. – Buckish 7/7, 2010 at 7:32

@trinithis: the spec does not request I/O to be done, hence my code takes string s as input and it's output is returned in s too. You may call this cutting corners if you want but you cannot say it does NOT follow spec! See the Perl solution, it also does not count input/output – Evadne 7/7, 2010 at 7:41

What I meant about specs is that you are using <T> over <TAG> – Buckish 7/7, 2010 at 8:3

@trinithis: I understood "tagging should be of the form <TAG#>" to mean that it should be "something of the kind of", "a-la" - and I chose shorter tags. Anyway changed the tags now to appease you. – Evadne 7/7, 2010 at 8:52

You didn't tag "im" in sIMply and specIMen – Lingcod 7/7, 2010 at 10:10

@M42: you were right, thanks for pointing out - there was a bug. i fixed it, as well as the results. – Evadne 7/7, 2010 at 11:8

Haskell: 343/344 403 420 445 485 characters

Character count is 343 while using an exponential algorithm, whereas it is 344 when using a quadratic algorithm.

The code posted is the quadratic one. For the exponential algorithm, change the one occurrence of inits=<<tails to subsequences in the code.

import Data.List
l=length
e=map$either id id
(&)=stripPrefix
y@(_:w)!x=l x>1&&maybe(w!x)(isInfixOf x)(x&y)
_!_=0<0
t@(x,i)?s@(y:z)=maybe(y:t?z)(((map Right$'<':v++e x++"</"++v)++).(t?))$x&s where v="TAG"++i++">"
_?_=[]
r s=e$foldr(?)s$zip(sortBy(\a b->compare(l a)$l b)$filter(s!)$inits=<<tails s)$map show[1..]
main=getLine>>=putStr.r.map Left

Input 1:

LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.

Output 1:

<TAG338>LoremIpsum</TAG338>i<TAG72>ss</TAG72><TAG122>im</TAG122>ply<TAG336>dummytext</TAG336><TAG188>oft</TAG188><TAG91>he</TAG91><TAG275>print</TAG275><TAG153>ing</TAG153><TAG191>and</TAG191><TAG276>types</TAG276><TAG88>et</TAG88><TAG214>ting</TAG214><TAG328>industry</TAG328>.<TAG338>LoremIpsum</TAG338>hasbe<TAG123>en</TAG123><TAG183>the</TAG183><TAG328>industry</TAG328>'s<TAG73>st</TAG73><TAG191>and</TAG191>ard<TAG336>dummytext</TAG336>ev<TAG99>er</TAG99>s<TAG96>in</TAG96>ce<TAG183>the</TAG183>1500s,wh<TAG123>en</TAG123><TAG111>an</TAG111>unknown<TAG275>print</TAG275><TAG99>er</TAG99>t<TAG195>ook</TAG195>a<TAG103>ga</TAG103>l<TAG113>le</TAG113>y<TAG105>of</TAG105><TAG241>type</TAG241><TAG191>and</TAG191>scramb<TAG113>le</TAG113>dit<TAG115>to</TAG115>mak<TAG116>ea</TAG116><TAG276>types</TAG276><TAG121>pe</TAG121>c<TAG122>im</TAG122><TAG123>en</TAG123>b<TAG195>ook</TAG195>.

Input 2:

hello!TAG!</hello.TAG.</

Output 2:

<TAG28>hello</TAG28>!<TAG22>TAG</TAG22>!<TAG14></</TAG14><TAG28>hello</TAG28>.<TAG22>TAG</TAG22>.<TAG14></</TAG14>

Buckish answered 3/7, 2010 at 12:58 Comment(2)

you should be able to tighten your algorithm so that it can run against the problem example in reasonable time - see there is at least one other solution that does not rely on regex magic – Evadne 7/7, 2010 at 23:25

Come to think of it, using inits=<<tails x is only one more character than subsequences x, yet runs in polynomial time. Worth it or not? (Think code-golf!) – Buckish 8/7, 2010 at 19:32

Many thanks to Dennis Williamson who helped me arrive at this approach by answering a few related questions I had on shell scripting - here and here.

Known issues with the below:

It works only for ascii files and not binary ones
It works only if there are no newlines in the file
It takes exponentially longer as the file gets longer
It breaks easily on files more than a few kb long (runs out of tmp disk space)

As you can see, its a huge brute-force method - not a smart algorithm at all. I've recorded the time taken for a few sample files.

bytes  time(s)
204 1.281
407 24.916
610 269.302

The point of even doing this as I did below was more about a "thought-challenge" for me - to do this in a shell environment and in a manner as "complete" as possible. Nothing more. Of course, as the results show, its grossly inefficient, so its completely unsuited for a real world application.

filesize=`stat -c %s $1`
while [ $filesize -gt 1 ]
do
        filesize=`expr $filesize - 1`
        array=( "${array[@]}" $(cat $1 | sed -n ":a;/^.\{$filesize\}$/{p;b};s/.\{$filesize\}/&\n/;P;s/.//;s/\n//;ba" | sort | uniq -c | grep -v '      1' | cut -c9-) )
done

sample=$(<$1)
tag=0;
for entry in ${array[@]};
        do
        test="<[^>/]*>[^>]*$entry[^<]*</";
        if [[ ! $sample =~ $test ]];
                then ((tag++));
                sample=${sample//${entry}/<T$tag>$entry</T$tag>};
        fi;
        done;
echo $sample

Usage would be as:

sh tagwords4 sample2.txt

Diallage answered 3/7, 2010 at 12:58 Comment(0)

Mathematica - 262 Chars

Not pure functional / Not short / Not nice / Lots of side effects /

b = "LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.\
     LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,\
     whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimen\
     book."

i = 0
a = c = "@"
v = StringFreeQ@## &
w = StringReplace@## &
t = x__ ~~ y__ ~~ __ ~~ x__ ~~ y__ /; v[x <> y, c]
NestWhile[
  w[#, (a = SortBy[StringCases[#, t -> x <> y,Overlaps -> True], -StringLength@# &][[1]]) -> c] &,
  b,
  (z = k@++i; b = w[b, a -> "<TAG" <> z <> ">" <> a <> "</TAG" <> z <> ">"] /. k -> IntegerString; True) && ! v[#, t] &]

Unalienable answered 3/7, 2010 at 12:58 Comment(0)

Python, 236 281 chars, no REGEX :)

Makes a set M of all 2+ character strings then iterates through them to assign them in greedy-length order

s="LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook."
#s="abcd1TAGabcd2TAG"

### ----
L,C,R=len,chr,range
M,l,T,t=set(),L(s),[],0
[[M.add(s[A:B])for B in R(A+2,l)]for A in R(l)]
while 1:
 m,t=sorted([(L(m),m)if s.count(m)>1 else(0,"")for m in M])[-1][1],t+1
 if m=="":break
 T+=[(t,m)]
 s=s.replace(m,C(t))
for(t,m)in T:
 s=s.replace(C(t),"<TAG%d>%s</TAG%d>"%(t,m,t))
### ----

print s

Outputs, as expected:

<TAG1>LoremIpsum</TAG1>i<TAG11>ss</TAG11><TAG15>im</TAG15>ply<TAG2>dummytext</TAG2><TAG13>of</TAG13><TAG6>the</TAG6><TAG5>print</TAG5><TAG8>ing</TAG8><TAG9>and</TAG9><TAG4>types</TAG4>e<TAG10>tt</TAG10><TAG8>ing</TAG8><TAG3>industry</TAG3>.<TAG1>LoremIpsum</TAG1>hasbe<TAG17>en</TAG17><TAG6>the</TAG6><TAG3>industry</TAG3>'<TAG11>ss</TAG11>t<TAG9>and</TAG9>ard<TAG2>dummytext</TAG2>ev<TAG16>er</TAG16>since<TAG6>the</TAG6>1500s,wh<TAG17>en</TAG17>anunknown<TAG5>print</TAG5><TAG16>er</TAG16>t<TAG7>ook</TAG7>agal<TAG14>le</TAG14>y<TAG13>of</TAG13>ty<TAG12>pe</TAG12><TAG9>and</TAG9>scramb<TAG14>le</TAG14>di<TAG10>tt</TAG10>omakea<TAG4>types</TAG4><TAG12>pe</TAG12>c<TAG15>im</TAG15><TAG17>en</TAG17>b<TAG7>ook</TAG7>.

Yard answered 3/7, 2010 at 12:58 Comment(2)

Code breaks with abcd1TAGabcd2TAG – Buckish 7/7, 2010 at 23:15

errr, problem: see how in <TAG4>ty<TAG12>pe</TAG12>s</TAG4> tags overlap? – Evadne 7/7, 2010 at 23:22

I think you can use back references to do this. See this post : Regular Expression to detect repetition within a string I've done many attempts and for the moment I have this expression: #([a-zA-Z ]+).*\1#, but I think it finds the first repeated string, not the largest... This was before I knew you didn't care about words... What you should do is:

find the largest sequence of characters repeated in the text
remove it from the text where it appears
iterate until you find nothing repeated
use tomit's method to colorize the strings you have memorized

the step is described on this page: http://en.wikipedia.org/wiki/Longest_common_substring_problem And here is a php implementation : http://www.davidtavarez.com/archives/longer-common-substring-problem-php-implementation/ (you'll have to fix it, it contains html entities, and the comment says it returns an integer but we don't know what it represents...), if it still does not work, you can try to implement wikipedia's pseudo-code.

Booklover answered 3/7, 2010 at 13:29 Comment(3)

Do you think its worthwhile to post this as a code golf challenge?? – Diallage 5/7, 2010 at 11:13

tomit had posted an intersting answer regarding the search and replace function to colorize the string, but seems to have withdrawed it, probably because he thought this was off-board. Never mind, it is no very difficult to do with str_replace. For the code golf, you might try, or you may even offer a bounty for this question... – Booklover 5/7, 2010 at 12:31

^ str_replace yields incorrect output for some inputs (at least if done naively). – Buckish 6/7, 2010 at 22:22

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++