Basically I have about 1,000,000 strings, for each request I have to check if a String belongs to the list or not.
I'm worried about the performance, so what's the best method? ArrayList
? Hash?
Basically I have about 1,000,000 strings, for each request I have to check if a String belongs to the list or not.
I'm worried about the performance, so what's the best method? ArrayList
? Hash?
Your best bet is to use a HashSet
and check if a string exists in the set via the contains()
method. HashSets are built for fast access via the use of Object methods hashCode()
and equals()
. The Javadoc for HashSet
states:
This class offers constant time performance for the basic operations (add, remove, contains and size),
HashSet stores objects in hash buckets which is to say that the value returned by the hashCode
method will determine which bucket an object is stored in. This way, the amount of equality checks the HashSet
has to perform via the equals()
method is reduced to just the other Objects in the same hash bucket.
To use HashSets and HashMaps effectively, you must conform to the equals
and hashCode
contract outlined in the javadoc. In the case of java.lang.String
these methods have already been implemented to do this.
In general, a HashSet will give you better performance, since it does not have to look through each element and compare, like an ArrayList does, but typically compares at most a few elements, where the hashcodes are equal.
However, for 1M strings, the performance of hashSet may still not be optimal. A lot of cache misses will slow down searching the set. If all strings are equally likely, then this is unavoidable. However, if some strings are more often requested than others, then you can place the common strings into a small hashSet, and check that first, before checking the larger set. The small hashset should be sized to fit in cache (e.g. a few hundred K at most). Hits to the small hashset will then be very fast, while hits to the larger hashset proceed at speed limited by the memory bandwidth.
Before going further, please consider this: Why are you worried about performance? How often is this check called?
As for possible solutions:
If the list is already sorted, then you can use java.util.Collections.binarySearch
which offers the same performance characteristics as a java.util.TreeSet
.
Otherwise you can use a java.util.HashSet
that as a performance characteristic of O(1). Note that calculating the hash code for a string that doesn't have one calculated yet is an O(m) operation with m=string.length()
. Also keep in mind that hashtables only work well until they reach a given load factor, i.e. hashtables will use more memory than plain lists. The default load factor used by HashSet is .75, meaning that internally a HashSet for 1e6 objects will use an array with 1.3e6 entries.
If the HashSet does not work for you (e.g. because there are lots of hash-collisions, because memory is tight or because there are lots of insertions), than consider using a Trie. Lookup in a Trie has a worst-case complexity of O(m) where m=string.length()
. A Trie has also some extra-benefits that might be useful for you: e.g., it can give you the closest fit for a search string. But keep in mind that the best code is no code, so only roll your own Trie implementiation if the benefits outweight the costs.
Consider using a database if you want more complex queries, e.g. match for a substring or a regular expression.
I'd use a Set
, in most cases HashSet
is fine.
With such a huge number of Strings, I immediately think of a Trie. It works better with a more limited set of characters (such as letters) and/or when the start of many string overlap.
Having run the exercise here are my results.
private static final int TEST_CYCLES = 4000;
private static final long RAND_ELEMENT_COUNT = 1000000l;
private static final int RAND_STR_LEN = 20;
//Mean time
/*
Array list:18.55425
Array list not contains:17.113
Hash set:5.0E-4
Hash set not contains:7.5E-4
*/
I believe the numbers speak for themselves. The lookup time of the hash set is way, wayyyy faster.
Perhaps it isn't required for your case but I think it's useful to mention that there are some space-efficient probabilistic algorithms. For example Bloom filter.
If you are having such a large amount of strings, the best opportunity for you is to use a database. Look for MySQL.
contains()
in a hash. –
Brocket Not only for String, you can use Set for any case you need unique items.
If the type of items is primitive or wrapper, you may not care. But if it is a class, you must override two methods:
Sometimes you want to check if an object is in the list/set and at the same time you want the list/set to be ordered. If you are looking to also retrieve objects easily without using an enumeration or iterator, you may consider using both an ArrayList<String>
and HashMap<String, Integer>
. The list is backed by the map.
Example from some work I recently did:
public class NodeKey<K> implements Serializable, Cloneable{
private static final long serialVersionUID = -634779076519943311L;
private NodeKey<K> parent;
private List<K> children = new ArrayList<K>();
private Map<K, Integer> childrenToListMap = new HashMap<K, Integer>();
public NodeKey() {}
public NodeKey(Collection<? extends K> c){
List<K> childHierarchy = new ArrayList<K>(c);
K childLevel0 = childHierarchy.remove(0);
if(!childrenToListMap.containsKey(childLevel0)){
children.add(childLevel0);
childrenToListMap.put(childLevel0, children.size()-1);
}
...
In this case, parameter K
would be a String
for you. The map (childrenToMapList
) stores Strings
inserted into the list (children
) as the key, and the map values are the index position in the list.
The reason for the list and the map is so that you can retrieve indexed values of the list, without having to do an iteration over a HashSet<String>
.
© 2022 - 2024 — McMap. All rights reserved.