RDKit: how to check molecules for exact match?

Asked 13/2, 2020 at 15:45 Answered 13/2, 2020 at 17:7

I'm using RDKit and trying to check molecules for exact match. After using Chem.MolFromSmiles() the expression m == p apparently doesn't lead to the desired result. Of course, I can check whether p is a substructure of m and whether m is a substructure of p. But to me this looks too complicated. I couldn't find or overlooked a code example for exact match in the RDKit-documentation. How do I do this correctly? Thank you for hints.

Code:

from rdkit import Chem

myPattern = 'c1ccc2c(c1)c3ccccc3[nH]2'          # Carbazole
myMolecule = 'C1=CC=C2C(=C1)C3=CC=CC=C3N2'      # Carbazole

m = Chem.MolFromSmiles(myMolecule)
p = Chem.MolFromSmiles(myPattern)

print(m == p)                    # returns False, first (unsuccessful) attempt to check for identity

print(m.HasSubstructMatch(p))    # returns True
print(p.HasSubstructMatch(m))    # returns True
print(m.HasSubstructMatch(p) and p.HasSubstructMatch(m))    # returns True, so are the molecules identical?

Desiderative answered 13/2, 2020 at 15:45 Comment(3)

Have you tried any of the fingerprinting options in rdkit? – Babara 13/2, 2020 at 15:50

Surprised to see that such a function doesn't exist yet! – Floris 14/9, 2020 at 12:36

I think that your test (m.HasSubstructMatch(p) and p.HasSubstructMatch(m)) is a good test for equality. – Seesaw 2/8, 2021 at 2:58

To check if two different SMILES represent the same molecule you can canonicalize the SMILES.

from rdkit import Chem

myPattern = 'c1ccc2c(c1)c3ccccc3[nH]2'
myMolecule = 'C1=CC=C2C(=C1)C3=CC=CC=C3N2'

a = Chem.CanonSmiles(myPattern)
b = Chem.CanonSmiles(myMolecule)

print(a)
'c1ccc2c(c1)[nH]c1ccccc12'

print(b)
'c1ccc2c(c1)[nH]c1ccccc12'

print(a==b)
True

Kvass answered 13/2, 2020 at 17:7 Comment(3)

Thank you for your suggestion. Ok, I'm not sure but my guess would be that creating two canonical SMILES is probably less computationally intensive than searching twice for substructure?! – Desiderative 13/2, 2020 at 18:58

@Desiderative I made a quick test with 5k SMILES, and your substructure solution was a bit faster. – Kvass 14/2, 2020 at 16:2

While this is one way, going from rdkit molecules to canonical SMILES is probably overkill. – Seesaw 2/8, 2021 at 3:1

My RDKit knowledge isn't great and their documentation is famously terrible but I have done this kind of thing myself. A (perhaps over-engineered) method would be to generate a graph with networkx and just compare the nodes and edges.

This is surprisingly simple, using rdkit to read the file/smiles string then just generate the topology on the fly. If you generate an rdkit_mol object from a smiles string as you have above, you would then do:

import networkx as nx


def topology_from_rdkit(rdkit_molecule):

    topology = nx.Graph()
    for atom in rdkit_molecule.GetAtoms():
        # Add the atoms as nodes
        topology.add_node(atom.GetIdx())

        # Add the bonds as edges
        for bonded in atom.GetNeighbors():
            topology.add_edge(atom.GetIdx(), bonded.GetIdx())

    return topology


def is_isomorphic(topology1, topology2):
    return nx.is_isomorphic(topology1, topology2)

Babara answered 13/2, 2020 at 16:1 Comment(4)

thank you for your suggestion. However, yet another extra package seems to be an overkill to me... – Desiderative 13/2, 2020 at 18:49

This is not reasonable. – Seesaw 2/8, 2021 at 3:1

Love the creative solution. I ended up using this script because I was only interested in the connectivity. One thing to be careful of - this snippet checks matching graph topology but if you substitute an N for an S, for example, it will still say the molecules are the same. – Griqua 9/5, 2022 at 19:57

Nice, this makes sense! Just to follow up on the other comments, nx is how "edit distance" is calculated. The first function there is

adjacency: npt.ArrayLike[np.float64] = Chem.GetAdjacencyMatrix(mol,useBO=True);  topology: nx.Graph = nx.from_numpy_array(adjacency)

—this preserves bond order but has Kekulisation issues. About different elements, the diagonal is set to the atomic Zahl in edit distance – Mantling 14/5 at 12:33

Recommended topics

Hot tags