How to access the reference values of a HashSet<TValue> without enumeration?
Asked Answered
M

2

7

I have this scenario in which memory conservation is paramount. I am trying to read in > 1 GB of Peptide sequences into memory and group peptide instances together that share the same sequence. I am storing the Peptide objects in a Hash so I can quickly check for duplication, but found out that you cannot access the objects in the Set, even after knowing that the Set contains that object.

Memory is really important and I don't want to duplicate data if at all possible. (Otherwise I would of designed my data structure as: peptides = Dictionary<string, Peptide> but that would duplicate the string in both the dictionary and Peptide class). Below is the code to show you what I would like to accomplish:

public SomeClass {

       // Main Storage of all the Peptide instances, class provided below
       private HashSet<Peptide> peptides = new HashSet<Peptide>();

       public void SomeMethod(IEnumerable<string> files) {
            foreach(string file in files) {
                 using(PeptideReader reader = new PeptideReader(file)) {
                     foreach(DataLine line in reader.ReadNextLine()) {
                         Peptide testPep = new Peptide(line.Sequence);
                         if(peptides.Contains(testPep)) {

                            // ** Problem Is Here **
                            // I want to get the Peptide object that is in HashSet
                            // so I can add the DataLine to it, I don't want use the
                            // testPep object (even though they are considered "equal")
                            peptides[testPep].Add(line); // I know this doesn't work

                            testPep.Add(line) // THIS IS NO GOOD, since it won't be saved in the HashSet which i use in other methods.

                         } else {
                            // The HashSet doesn't contain this peptide, so we can just add it
                            testPep.Add(line);
                            peptides.Add(testPep);
                         }
                     }   
                 }
            }
       }
}

public Peptide : IEquatable<Peptide> {
     public string Sequence {get;private set;}
     private int hCode = 0;

     public PsmList PSMs {get;set;}

     public Peptide(string sequence) {
         Sequence = sequence.Replace('I', 'L');
         hCode = Sequence.GetHashCode();             
     }

     public void Add(DataLine data) {
         if(PSMs == null) {
             PSMs = new PsmList();
         } 
         PSMs.Add(data);
     }

     public override int GethashCode() {
         return hCode;
     }

     public bool Equals(Peptide other) {
         return Sequence.Equals(other.Sequence);
     }
}

public PSMlist : List<DataLine> { // and some other stuff that is not important }

Why does HashSet not let me get the object reference that is contained in the HashSet? I know people will try to say that if HashSet.Contains() returns true, your objects are equivalent. They may be equivalent in terms of values, but I need the references to be the same since I am storing additional information in the Peptide class.

The only solution I came up with is Dictionary<Peptide, Peptide> in which both the key and value point to the same reference. But this seems tacky. Is there another data structure to accomplish this?

Multipara answered 3/9, 2011 at 0:44 Comment(3)
Dead interesting, but also a duplicate of Why can't I retrieve an item from a HashSet without enumeration? +1 nonetheless for a well presented question.Leannleanna
Extending KeyedCollection<TKey, TItem> is a good choice when the key is also a property of the value.Connected
I'll add something to Skeet response: if you want to reimplement HashSet, you can take it from the mono project and modify it :-) You can even try to look at C5 (itu.dk/research/c5) and see if there is anything that is useful.Trisomic
T
9

Basically you could reimplement HashSet<T> yourself, but that's about the only solution I'm aware of. The Dictionary<Peptide, Peptide> or Dictionary<string, Peptide> solution is probably not that inefficient though - if you're only wasting a single reference per entry, I would imagine that would be relatively insignificant.

In fact, if you remove the hCode member from Peptide, that will safe you 4 bytes per object which is the same size as a reference in x86 anyway... there's no point in caching the hash as far as I can tell, as you'll only compute the hash of each object once, at least in the code you've shown.

If you're really desperate for memory, I suspect you could store the sequence considerably more efficiently than as a string. If you give us more information about what the sequence contains, we may be able to make some suggestions there.

I don't know that there's any particularly strong reason why HashSet doesn't permit this, other than that it's a relatively rare requirement - but it's something I've seen requested in Java as well...

Tetchy answered 3/9, 2011 at 0:48 Comment(2)
Thanks for the quick response. I use the HashSet in many other methods after I populate it, so I thought I would save time by storing the HashCode since it will never change. That can easily be modified. I thought about encoding the sequence somehow (there are only 20 possible characters) but felt the overhead of encoding/decoding would hurt the duty cycle. But I guess you cannot always have your cake and eat it too.Multipara
@Moop: The hash code will only be requested once by the HashSet. If you were putting the same object in many hash sets, that might affect things. As for the sequencing - just storing a byte array instead of a string would save you quite a bit. No need for anything complicated.Tetchy
B
1

Use a Dictionary<string, Peptide>.

Brail answered 3/9, 2011 at 0:48 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.