RDKit: how to check molecules for exact match?
Asked Answered
D

2

12

I'm using RDKit and trying to check molecules for exact match. After using Chem.MolFromSmiles() the expression m == p apparently doesn't lead to the desired result. Of course, I can check whether p is a substructure of m and whether m is a substructure of p. But to me this looks too complicated. I couldn't find or overlooked a code example for exact match in the RDKit-documentation. How do I do this correctly? Thank you for hints.

Code:

from rdkit import Chem

myPattern = 'c1ccc2c(c1)c3ccccc3[nH]2'          # Carbazole
myMolecule = 'C1=CC=C2C(=C1)C3=CC=CC=C3N2'      # Carbazole

m = Chem.MolFromSmiles(myMolecule)
p = Chem.MolFromSmiles(myPattern)

print(m == p)                    # returns False, first (unsuccessful) attempt to check for identity

print(m.HasSubstructMatch(p))    # returns True
print(p.HasSubstructMatch(m))    # returns True
print(m.HasSubstructMatch(p) and p.HasSubstructMatch(m))    # returns True, so are the molecules identical?
Desiderative answered 13/2, 2020 at 15:45 Comment(3)
Have you tried any of the fingerprinting options in rdkit?Babara
Surprised to see that such a function doesn't exist yet!Floris
I think that your test (m.HasSubstructMatch(p) and p.HasSubstructMatch(m)) is a good test for equality.Seesaw
K
16

To check if two different SMILES represent the same molecule you can canonicalize the SMILES.

from rdkit import Chem

myPattern = 'c1ccc2c(c1)c3ccccc3[nH]2'
myMolecule = 'C1=CC=C2C(=C1)C3=CC=CC=C3N2'

a = Chem.CanonSmiles(myPattern)
b = Chem.CanonSmiles(myMolecule)

print(a)
'c1ccc2c(c1)[nH]c1ccccc12'

print(b)
'c1ccc2c(c1)[nH]c1ccccc12'

print(a==b)
True
Kvass answered 13/2, 2020 at 17:7 Comment(3)
Thank you for your suggestion. Ok, I'm not sure but my guess would be that creating two canonical SMILES is probably less computationally intensive than searching twice for substructure?!Desiderative
@Desiderative I made a quick test with 5k SMILES, and your substructure solution was a bit faster.Kvass
While this is one way, going from rdkit molecules to canonical SMILES is probably overkill.Seesaw
B
8

My RDKit knowledge isn't great and their documentation is famously terrible but I have done this kind of thing myself. A (perhaps over-engineered) method would be to generate a graph with networkx and just compare the nodes and edges.

This is surprisingly simple, using rdkit to read the file/smiles string then just generate the topology on the fly. If you generate an rdkit_mol object from a smiles string as you have above, you would then do:

import networkx as nx


def topology_from_rdkit(rdkit_molecule):

    topology = nx.Graph()
    for atom in rdkit_molecule.GetAtoms():
        # Add the atoms as nodes
        topology.add_node(atom.GetIdx())

        # Add the bonds as edges
        for bonded in atom.GetNeighbors():
            topology.add_edge(atom.GetIdx(), bonded.GetIdx())

    return topology


def is_isomorphic(topology1, topology2):
    return nx.is_isomorphic(topology1, topology2)
Babara answered 13/2, 2020 at 16:1 Comment(4)
thank you for your suggestion. However, yet another extra package seems to be an overkill to me...Desiderative
This is not reasonable.Seesaw
Love the creative solution. I ended up using this script because I was only interested in the connectivity. One thing to be careful of - this snippet checks matching graph topology but if you substitute an N for an S, for example, it will still say the molecules are the same.Griqua
Nice, this makes sense! Just to follow up on the other comments, nx is how "edit distance" is calculated. The first function there is adjacency: npt.ArrayLike[np.float64] = Chem.GetAdjacencyMatrix(mol,useBO=True); topology: nx.Graph = nx.from_numpy_array(adjacency) —this preserves bond order but has Kekulisation issues. About different elements, the diagonal is set to the atomic Zahl in edit distanceMantling

© 2022 - 2024 — McMap. All rights reserved.