Finding unique elements in an string array in C

Asked 10/5, 2010 at 16:40 Answered 15/5, 2010 at 16:37

C bothers me with its handling of strings. I have a pseudocode like this in my mind:

char *data[20]; 

char *tmp; int i,j;

for(i=0;i<20;i++) {
  tmp = data[i]; 
  for(j=1;j<20;j++) 
  {
    if(strcmp(tmp,data[j]))
      //then except the uniqueness, store them in elsewhere
  }
}

But when i coded this the results were bad.(I handled all the memory stuff,little things etc.) The problem is in the second loop obviously :D. But i cannot think any solution. How do i find unique strings in an array.

Example input : abc def abe abc def deg entered unique ones : abc def abe deg should be found.

Perfunctory answered 10/5, 2010 at 16:40 Comment(4)

Sorting the array first will get you a long ways. Then just iterate over the strings, and if the current string differs from the previous string, it is unique and you can store it elsewhere. – Riposte 10/5, 2010 at 16:44

the problem is i need the exact locations. You know like in this : input : abc def abe abc def deg entered unique ones : abc def abe deg if i sorted the array i will get unique ones like that : abc abe def deg This isn't what i want i need the locations aswell. – Perfunctory 10/5, 2010 at 16:47

Then create an array of pointers or an array of array indices into the initial array that you sort, instead of sorting the initial array. – Riposte 10/5, 2010 at 16:48

He could also try building a hash table, although for only 20 items or so that would definitely be overkill. – Whilst 10/5, 2010 at 20:24

You could use qsort to force the duplicates next to each other. Once sorted, you only need to compare adjacent entries to find duplicates. The result is O(N log N) rather than (I think) O(N^2).

Here is the 15 minute lunchtime version with no error checking:

  typedef struct {
     int origpos;
     char *value;
  } SORT;

  int qcmp(const void *x, const void *y) {
     int res = strcmp( ((SORT*)x)->value, ((SORT*)y)->value );
     if ( res != 0 )
        return res;
     else
        // they are equal - use original position as tie breaker
        return ( ((SORT*)x)->origpos - ((SORT*)y)->origpos );
  }

  int main( int argc, char* argv[] )
  {
     SORT *sorted;
     char **orig;
     int i;
     int num = argc - 1;

     orig = malloc( sizeof( char* ) * ( num ));
     sorted = malloc( sizeof( SORT ) * ( num ));

     for ( i = 0; i < num; i++ ) {
        orig[i] = argv[i + 1];
        sorted[i].value = argv[i + 1];
        sorted[i].origpos = i;
        }

     qsort( sorted, num, sizeof( SORT ), qcmp );

     // remove the dups (sorting left relative position same for dups)
     for ( i = 0; i < num - 1; i++ ) {
        if ( !strcmp( sorted[i].value, sorted[i+1].value ))
           // clear the duplicate entry however you see fit
           orig[sorted[i+1].origpos] = NULL;  // or free it if dynamic mem
        }

     // print them without dups in original order
     for ( i = 0; i < num; i++ )
        if ( orig[i] )
           printf( "%s ", orig[i] );

     free( orig );
     free( sorted );
  }

Kenzi answered 10/5, 2010 at 16:44 Comment(4)

i know this. I don't want a sorted array and doing the job. I need these with locations you know. You know like in this : input : abc def abe abc def deg entered unique ones : abc def abe deg if i sorted the array i will get unique ones like that : abc abe def deg This isn't what i want i need the locations aswell. – Perfunctory 10/5, 2010 at 16:51

I don't think Mark did know, actually, since you didn't mention that in your question. – Riposte 10/5, 2010 at 16:54

This is why i'm asking this :). I already know sorting and checking adjacent elements. But that's not solving my problem. – Perfunctory 10/5, 2010 at 16:58

Sorting the array of indices as WhirlWind suggests should solve that. It would leave the original order intact. – Kenzi 10/5, 2010 at 16:59

char *data[20];
int i, j, n, unique[20];

n = 0;
for (i = 0; i < 20; ++i)
{
    for (j = 0; j < n; ++j)
    {
        if (!strcmp(data[i], data[unique[j]]))
           break;
    }

    if (j == n)
        unique[n++] = i;
}

The indexes of the first occurrence of each unique string should be in unique[0..n-1] if I did that right.

Playtime answered 10/5, 2010 at 16:57 Comment(1)

that seems really interesting, I will try this. – Perfunctory 10/5, 2010 at 17:0

Why are you starting second loop from 1?

You should start it from i+1. i.e.

for(j=i+1;j<20;j++)

Like if the list is

abc
def
abc
abc
lop

then

when i==4

tmp="lop"

but then the second loop starts which is from 1 to 19. This means it will get a value of 4 too at one stage, and then

data[4], which is "lop", will be same as tmp. So although "lop" is unique but it will be flagged as repeated.

Hope it was helpful.

Biopsy answered 10/5, 2010 at 17:10 Comment(3)

That is definitely not the main problem. Still O(n^2) – Bitolj 10/5, 2010 at 17:17

That really depends on your definition of "main problem". This answer has identified a correctness problem, which is more severe than a performance problem. – Auto 11/5, 2010 at 1:2

@Auto and @ Terry: Actually, in his question I did not find anything related to performance. His quote "But when i coded this the results were bad.(I handled all the memory stuff,little things etc.) The problem is in the second loop obviously :D. But i cannot think any solution. How do i find unique strings in an array." So I only focused on that why his code is not working. Later, from other answers and comments I realised that discussion has taken a different shape. – Biopsy 11/5, 2010 at 1:50

Think a bit more about your problem -- what you really want to do is look at the PREVIOUS strings to see if you've already seen it. So, for each string n, compare it to strings 0 through n-1.

print element 0 (it is unique)
for i = 1 to n
  unique = 1
  for j = 0 to i-1 (compare this element to the ones preceding it)
    if element[i] == element[j]
       unique = 0
       break from loop
  if unique, print element i

Crapulous answered 15/5, 2010 at 16:37 Comment(0)

Might it be that your test is if (strcmp (this, that)) which will succeed if the two are different? !strcmp is probably what you want there.

Hierolatry answered 10/5, 2010 at 16:44 Comment(0)

Recommended topics

Hot tags