I'd do it this way:
- Parse the original file and write every entry into a new file, using fixed-length data blocks. Say your longest string is 10 bytes long; take 10 + x as the block length, where x is room for the extra info you want to store alongside each entry. Entry number i then starts at byte position i*(10+x). You also need the number of entries to size the file (file size = noOfEntries*blockLength; use a RandomAccessFile and setLength to set that file length). A sketch of this step follows after the list.
- Now use the quicksort algorithm to sort the entries in the file (my idea is to end up with a sorted file, which makes the final step far easier and faster). Hashing would theoretically work too, but then you'd have to rearrange duplicate entries so that all duplicates end up grouped together, so it's not really an option here. The second sketch below shows sorting the records in place.
- Parse the file with the now sorted entries. Remember the position of the first occurrence of an entry and increment a duplicate counter until a new entry starts. Then take that first entry, add the additional info you want to have there, and write it to a new "final result" file. Continue this way with all remaining entries in the sorted file. The third sketch below shows this pass.
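
Here is a minimal sketch of the first step, assuming the entries are lines of a text file, the longest entry is 10 bytes, and 4 extra bytes are reserved per record; the file names and constants are placeholders, not anything prescribed by the problem:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;

    public class FixedLengthWriter {
        static final int ENTRY_LEN = 10;             // assumed longest entry
        static final int EXTRA = 4;                  // room for the extra info (e.g. a count)
        static final int RECORD_LEN = ENTRY_LEN + EXTRA;

        public static void main(String[] args) throws IOException {
            long count = 0;
            try (BufferedReader in = new BufferedReader(new FileReader("entries.txt"));
                 RandomAccessFile out = new RandomAccessFile("records.dat", "rw")) {
                String line;
                while ((line = in.readLine()) != null) {
                    byte[] record = new byte[RECORD_LEN];   // zero-padded fixed-length block
                    byte[] data = line.getBytes(StandardCharsets.UTF_8);
                    System.arraycopy(data, 0, record, 0, Math.min(data.length, ENTRY_LEN));
                    out.seek(count * RECORD_LEN);           // record i starts at i * RECORD_LEN
                    out.write(record);
                    count++;
                }
                out.setLength(count * RECORD_LEN);          // final size = noOfEntries * blockLength
            }
        }
    }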
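For the sorting step, one possible approach (an assumption, not the only way to do it) is an in-place quicksort that seeks, reads and swaps whole records directly in the record file, comparing the raw record bytes:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Arrays;

    public class FileQuickSort {
        static final int RECORD_LEN = 14;   // must match the writer's block length

        static byte[] read(RandomAccessFile f, long i) throws IOException {
            byte[] buf = new byte[RECORD_LEN];
            f.seek(i * RECORD_LEN);
            f.readFully(buf);
            return buf;
        }

        static void write(RandomAccessFile f, long i, byte[] rec) throws IOException {
            f.seek(i * RECORD_LEN);
            f.write(rec);
        }

        static void quicksort(RandomAccessFile f, long lo, long hi) throws IOException {
            if (lo >= hi) return;
            byte[] pivot = read(f, hi);                     // last record as pivot
            long store = lo;
            for (long i = lo; i < hi; i++) {
                byte[] cur = read(f, i);
                if (Arrays.compareUnsigned(cur, pivot) < 0) {
                    byte[] tmp = read(f, store);            // swap records i and store
                    write(f, store, cur);
                    write(f, i, tmp);
                    store++;
                }
            }
            byte[] tmp = read(f, store);                    // move pivot into its final place
            write(f, store, pivot);
            write(f, hi, tmp);
            quicksort(f, lo, store - 1);
            quicksort(f, store + 1, hi);
        }

        public static void main(String[] args) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile("records.dat", "rw")) {
                long n = f.length() / RECORD_LEN;
                quicksort(f, 0, n - 1);
            }
        }
    }

Doing one seek per record access is slow; buffering several records per read, or sorting chunks in memory and merging them, would be an obvious improvement.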
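And a sketch of the final pass, assuming the "extra info" you want is simply the duplicate count and the result is written as "entry&lt;TAB&gt;count" lines (the output format is my assumption):

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class DuplicateCounter {
        static final int ENTRY_LEN = 10;
        static final int RECORD_LEN = 14;

        public static void main(String[] args) throws IOException {
            try (RandomAccessFile in = new RandomAccessFile("records.dat", "r");
                 PrintWriter out = new PrintWriter("result.txt", "UTF-8")) {
                long n = in.length() / RECORD_LEN;
                byte[] current = null;
                long count = 0;
                for (long i = 0; i < n; i++) {
                    byte[] rec = new byte[RECORD_LEN];
                    in.seek(i * RECORD_LEN);
                    in.readFully(rec);
                    byte[] entry = Arrays.copyOf(rec, ENTRY_LEN);
                    if (current != null && Arrays.equals(entry, current)) {
                        count++;                            // same entry as before, keep counting
                    } else {
                        if (current != null) emit(out, current, count);
                        current = entry;                    // a new entry starts here
                        count = 1;
                    }
                }
                if (current != null) emit(out, current, count);
            }
        }

        static void emit(PrintWriter out, byte[] entry, long count) {
            String text = new String(entry, StandardCharsets.UTF_8).trim();  // strip zero padding
            out.println(text + "\t" + count);
        }
    }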
Conclusions: I think this should be reasonably fast and use a reasonable amount of resources. However, it depends on the data you have. If you have a very large number of duplicates, quicksort performance will degrade. Also, if your longest data entry is much longer than the average, fixed-length blocks will waste file space.