Sorted TStringList, how does the sorting work?
Asked Answered
A

6

5

I'm simply curious as lately I have been seeing the use of Hashmaps in Java and wonder if Delphi's Sorted String list is similar at all.

Does the TStringList object generate a Hash to use as an index for each item? And how does the search string get checked against the list of strings via the Find function?

I make use of Sorted TStringLists very often and I would just like to understand what is going on a little bit more.

Please assume I don't know how a hash map works, because I don't :)

Thanks

Arita answered 25/2, 2011 at 1:58 Comment(1)
The Sort algorithm used is a QuickSort.Census
N
5

I'm interpreting this question, quite generally, as a request for an overview of lists and dictionaries.

  • A list, as almost everyone knows, is a container that is indexed by contiguous integers.
  • A hash map, dictionary or associative array is a container whose index can be of any type. Very commonly, a dictionary is indexed with strings.

For sake of argument let us call our lists L and our dictionaries D.

Lists have true random access. An item can be looked-up in constant time if you know its index. This is not the case for dictionaries and they usually resort to hash-based algorithms to achieve efficient random access.

A sorted list can perform binary search when you attempt to find a value. Finding a value, V, is the act of obtaining the index, I, such that L[I]=V. Binary search is very efficient. If the list is not sorted then it must perform linear search which is much less efficient. A sorted list can use insertion sort to maintain the order of the list – when a new item is added, it is inserted at the correct location.

You can think of a dictionary as a list of <Key,Value> pairs. You can iterate over all pairs, but more commonly you use index notation to look-up a value for a given key: D[Key]. Note that this is not the same operation as finding a value in a list – it is the analogue of reading L[I] when you know the index I.

In older versions of Delphi it was common to coax dictionary behaviour out of string lists. The performance was terrible. There was little flexibility in the contents.

With modern Delphi, there is TDictionary, a generic class that can hold anything. The implementation uses a hash and although I have not personally tested its performance I understand it to be respectable.

There are commonly used algorithms that optimally use all of these containers: unsorted lists, sorted lists, dictionaries. You just need to use the right one for the problem at hand.

Nonobjective answered 25/2, 2011 at 7:39 Comment(2)
I've never experienced TStringList performance as 'terrible' but that may be because I hardly ever put more than a few hunder thousands of items in it. Nice explanation, although it doesn't mention the quicksort anywhere, which is in fact the sorting a TStringList uses.Rembrandt
@Golez When the list is Sorted it uses insertion sort. When you call Sort() on an unsorted list it uses quicksort. The performance of a string list is excellent when used as a list, but poor when coerced to behave like a dictionary.Nonobjective
K
4

TStringList holds the strings in an array.

If you call Sort on an otherwise unsorted (Sorted property = false) string list then a QuickSort is performed to sort the items.

The same happens if you set Sorted to true.

If you call Find (or IndexOf which calls find) on an unsorted string list (Sorted property = false, even if you explicitly called Sort the list is considered unsorted if the Sorted property isn't true) then a linear search is performed comparing all strings from the start till a match is found.

If you call Find on a sorted string list (Sorted property = true) then a binary search is performed (see http://en.wikipedia.org/wiki/Binary_search for details).

If you add a string to a sorted string list, a binary search is performed to determine the correct insertion position, all following elements in the array are shifted by one and the new string is inserted.

Because of this insertion performance gets a lot worse the larger the string list is. If you want to insert a large number of entries into a sorted string list, it's usually better to turn sorting off, insert the strings, then set Sorted back to true which performs a quick sort.

The disadvantage of that approach is that you will not be able to prevent the insertion of duplicates.

EDIT: If you want a hash map, use TDictionary from unit Generics.Collections

Kalsomine answered 25/2, 2011 at 3:18 Comment(5)
+1 for the 'Sorting off while inserting many items in large lists'. People tend to forget that and complain about stringlist performance when inserting thousands of items.Rembrandt
You can get this even faster by allocating enough space by setting the Capacity property. This will prevent the stringlist from having to reallocate the whole array when a new item doesn't fit. Stringlist does increase capacity in large chunks, but if you know the number of items, it is best to set capacity yourself.Rembrandt
The problem with leaving insertion on whilst adding is in fact the memory copy each time you insert. Adding to the end of a list is basically free. Inserting in the middle costs. The time complexity of the alternative sorts comes out the same.Nonobjective
Adding to the end of a list is basically free, but not for TStringList. That's because the data isn't in a list, but in an array meaning that adding an item needs reallocation of the whole array. TStringList tries to work around that by reallocating for large chunks at a time when the limit of the currently allocated array is reached. That solved the speed problem, but still left large holes of unallocated memory, which caused problems with the previous memory manager (later Delphis have a new memory manager that is somewhat smarter).Rembrandt
Goleztrol: it is not just the reallocation, but mainly the fact that if you insert item i in an array of n items, you need to move n-i items. That makes insertion an O(n) operation. A simple solution is to make the main array an array of arrays. I did this in the past, and then insertion is nearly constant time ( O(total n/ avg nr items in subarray)) till the toplevel array gets close to hundred thousands . Then you would need additional level of indirection.Draughty
C
1

You could look at the source code, since that comes with Delphi. Ctrl-Click on the "sort" call in your code.

It's a simple alphabetical sort in non-Unicode Delphi, and a slightly more complex Unicode one in later versions. You can supply your own comparison for custom sort orders. Unfortunately I don't have a recent version of Delphi so can't confirm, but I expect that under the hood there's a proper Unicode-aware and locale-aware string comparison routine. Unicode sorting/string comparison is not trivial and a little web searching will point out some of the pitfalls.

Supplying your own comparison routine is often done when you have delimited text in the strings or objects attached to them (the Objects property). In those cases you often wat to sort by a property of the object or something other than the first field in the string. Or it might be as simple as wanting a numerical sort on the strings (so "2" comes after "1" rather than after "19")

Churchyard answered 25/2, 2011 at 2:28 Comment(4)
Thanks, yea I had thought to look at the source however I assumed (poorly) that it would be slightly more complex than iterating through the list doing a compare.Arita
@CatalystNZ, it only does the linear search if you have not set Sorted to true. With Sorted set to true a binary search is performed which is pretty quick.Kalsomine
The default TStringList.Sort uses the QuickSort algorithm, there is no hashing or anything involved. If add a string to an already sorted list, TStringList uses the binary searching Find method to locate the correct postion, then uses System.Move to "open up" a gap to add the new item. In a long list the memory move is probably the slowest part, unless you are using a slow custom Compare method.Semen
P.S. There is also a simple THashedStringList in the IniFiles unitSemen
U
1

There is also a THashedStringList, which could be an option (especially in older Delphi versions).

Uranium answered 25/2, 2011 at 8:22 Comment(0)
I
0

BTW, the Unicode sort routines for TStringList are quite slow. If you override the TStringList.CompareStrings method then if the strings only contain Ansi characters (which if you use English exclusively they will), you can use customised Ansi string comparisons. I use my own customised TStringList class that does this and it is 4 times faster than the TStringList class for a sorted list for both reading and writing strings from/to the list.

Interjacent answered 25/2, 2011 at 13:53 Comment(0)
M
0

Delphi's dictionary type (in generics-enabled versions of Delphi) is the closest thing to a hashmap, that ships with Delphi. THashedStringList makes lookups faster than they would be in a sorted string list. you can do lookups using a binary search in a sorted stringlist, so it's faster than brute force searches, but not as fast as a hash.

The general theory of a hash is that it is unordered, but very fast on lookup and insertion. A sorted list is reasonably fast on insertion until the size of the list gets large, although it's not as efficient as a dictionary for insertion.

The big benefit of a list is that it is ordered but a hash-lookup dictionary is not.

Marlie answered 27/2, 2011 at 2:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.