NetworkX: Find longest path in DAG returning all ties for max
Asked Answered
T

3

2

I'm having trouble figuring out how to update the networkx dag_find_longest_path() algorithm to return "N" for ties instead of returning the first max edge found, or returning a list of all edges that are tied for max weight.

I first created a DAG from pandas dataframe that contains an edgelist like the following subset:

edge1        edge2          weight
115252161:T  115252162:A     1.0
115252162:A  115252163:G     1.0
115252163:G  115252164:C     3.0
115252164:C  115252165:A     5.5
115252165:A  115252166:C     5.5
115252162:T  115252163:G     1.0
115252166:C  115252167:A     7.5
115252167:A  115252168:A     7.5
115252168:A  115252169:A     6.5
115252165:A  115252166:G     0.5

Then I use the following code to topologically sort the graph and then find the longest path according to the weights of the edges:

 G = nx.from_pandas_edgelist(edge_df, source="edge1", 
                                target="edge2", 
                                edge_attr=['weight'], 
                                create_using=nx.OrderedDiGraph())

longest_path = pd.DataFrame(nx.dag_longest_path(G))

This works great, except when there are ties for max weighted edge, it returns the first max edge found, and instead I would like it to just return an "N" representing "Null". So currently, the output is:

115252161   T
115252162   A
115252163   G
115252164   C
115252165   A
115252166   C

But what I actually need is:

115252161   T
115252162   N (or [A,T] )
115252163   G
115252164   C
115252165   A
115252166   C

The algorithm for finding the longest path is:

def dag_longest_path(G):

    dist = {}  # stores [node, distance] pair
    for node in nx.topological_sort(G):
        # pairs of dist,node for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    node, (length, _) = max(dist.items(), key=lambda x: x[1])
    path = []
    while length > 0:
        path.append(node)
        length, node = dist[node]
    return list(reversed(path))

Copy-pastable definition of G.

import pandas as pd
import networkx as nx
import numpy as np

edge_df = pd.read_csv(
    pd.compat.StringIO(
        """edge1        edge2          weight
115252161:T  115252162:A     1.0
115252162:A  115252163:G     1.0
115252163:G  115252164:C     3.0
115252164:C  115252165:A     5.5
115252165:A  115252166:C     5.5
115252162:T  115252163:G     1.0
115252166:C  115252167:A     7.5
115252167:A  115252168:A     7.5
115252168:A  115252169:A     6.5
115252165:A  115252166:G     0.5"""
    ),
    sep=r" +",
)


G = nx.from_pandas_edgelist(
    edge_df,
    source="edge1",
    target="edge2",
    edge_attr=["weight"],
    create_using=nx.OrderedDiGraph(),
)

longest_path = pd.DataFrame(nx.dag_longest_path(G))
Trichinosis answered 9/12, 2018 at 23:16 Comment(0)
T
1

I ended up just modeling the behavior in a defaultdict counter object.

from collections import defaultdict, Counter

I modified my edgelist to a tuple of (position, nucleotide, weight):

test = [(112,"A",23.0), (113, "T", 27), (112, "T", 12.0), (113, "A", 27), (112,"A", 1.0)]

Then used defaultdict(counter) to quickly sum occurences of each nucleotide at each position:

nucs = defaultdict(Counter)

for key, nuc, weight in test:
    nucs[key][nuc] += weight

And then looped through the dictionary to pull out all nucleotides that equal the max value:

for key, nuc in nucs.items():
    seq_list = []
    max_nuc = []
    max_val = max(nuc.values())
    for x, y in nuc.items():
        if y == max_val:
            max_nuc.append(x)

    if len(max_nuc) != 1:
        max_nuc = "N"
    else:
        max_nuc = ''.join(max_nuc)

    seq_list.append(max_nuc)
    sequence = ''.join(seq_list)

This returns the final sequence for the nucleotide with the max value found, and returns N in the position of a tie:

TNGCACAAATGCTGAAAGCTGTACCATANCTGTCTGGTCTTGGCTGAGGTTTCAATGAATGGAATCCCGTAACTCTTGGCCAGTTCGTGGGCTTGTTTTGTATCAACTGTCCTTGTTGGCAAATCACACTTGTTTCCCACTAGCACCAT

However, the question bothered me, so I ended up using node attributes in networkx as a means to flag each node as being a tie or not. Now, when a node is returned in the longest path, I can then check the "tie" attribute and replace the node name with "N" if it has been flagged:

def get_path(self, edge_df):

    G = nx.from_pandas_edgelist(
        edge_df,
        source="edge1",
        target="edge2",
        edge_attr=["weight"],
        create_using=nx.OrderedDiGraph()
    )

    # flag all nodes as not having a tie
    nx.set_node_attributes(G, "tie", False)

    # check nodes for ties
    for node in G.nodes:

        # create list of all edges for each node
        edges = G.in_edges(node, data=True)

        # if there are multiple edges
        if len(edges) > 1:

            # find max weight
            max_weight = max([edge[2]['weight'] for edge in edges])

            tie_check = []
            for edge in edges:
                # pull out all edges that match max weight
                if edge[2]["weight"] == max_weight:
                    tie_check.append(edge)

            # check if there are ties       
            if len(tie_check) > 1:
                for x in tie_check:

                    # flag node as being a tie
                    G.node[x[0]]["tie"] = True

    # return longest path
    longest_path = nx.dag_longest_path(G)     

    return longest_path
Trichinosis answered 11/12, 2018 at 22:34 Comment(2)
This sounds like a good solution. Am I correct in assuming that an edge exists between every pair of consecutive positions? If so, you may consider adding it to the question description. Note that in your original example, there is no edge between 115252161:T and 115252162:T.Edmundedmunda
If 115252161:T is a sequencing error, it's possible to see this error at this position as a node that does not connect to any following nodes.Trichinosis
K
1

This line inside the function seems to discard the paths you want; because max only returns one result:

node, (length, _) = max(dist.items(), key=lambda x: x[1])

I would keep the max value and then search based on it all the items. Then reuse the code to find the desired paths. An example would be like this:

def dag_longest_path(G):
    dist = {}  # stores [node, distance] pair
    for node in nx.topological_sort(G):
        # pairs of dist,node for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    # store max value inside val variable
    node, (length, val) = max(dist.items(), key=lambda x: x[1])
    # find all dictionary items that have the maximum value
    nodes = [(item[0], item[1][0]) for item in dist.items() if item[1][1] == val]
    paths = []
    # iterate over the different nodes and append the paths to a list
    for node, length in nodes:
        path = []
        while length > 0:
            path.append(node)
            length, node = dist[node]
        paths.append(list(reversed(path)))
    return paths

PS. I haven't tested this code to know if it runs correctly.

Kidder answered 11/12, 2018 at 3:27 Comment(3)
I had a similar idea, but this gives a single path (same as OPs original). Give me a minute, I will update the question with a copy-pastable definition of G.Edmundedmunda
I think that topsort must be adjusted. If you print the distances after they are defined, you can see that '115252162:T' only occurs in reference to itself. It might as well be a part of a disjoint component, so my thinking is that distances as defined should not be sufficient to recover the OPs desired output.Edmundedmunda
you are correct. I totally removed the topsort from the picture when thinking a simple approach. There must be a different approach to the creation of the dictionary (topsort doesn't help).Kidder
E
1

Judging by your example, each node is determined by position ID (number before :) and two nodes with different bases attached are identical for the purposes of computing the path length. If this is correct, there is no need to modify the algorithm and you could get your result by manipulating vertex labels.

Basically, drop everything after semicolon in the edge_df, compute the longest path and append the base labels from your original data.

edge_df_pos = pd.DataFrame(
    {
        "edge1": edge_df.edge1.str.partition(":")[0],
        "edge2": edge_df.edge2.str.partition(":")[0],
        "weight": edge_df.weight,
    }
)

vert_labels = dict()
for col in ("edge1", "edge2"):
    verts, lbls = edge_df[col].str.partition(":")[[0, 2]].values.T
    for vert, lbl in zip(verts, lbls):
        vert_labels.setdefault(vert, set()).add(lbl)


G_pos = nx.from_pandas_edgelist(
    edge_df_pos,
    source="edge1",
    target="edge2",
    edge_attr=["weight"],
    create_using=nx.OrderedDiGraph(),
)

longest_path_pos = nx.dag_longest_path(G_pos)

longest_path_df = pd.DataFrame([[node, vert_labels[node]] for node in longest_path_pos])

Result

# 0       1
# 0  115252161     {T}
# 1  115252162  {A, T}
# 2  115252163     {G}
# 3  115252164     {C}
# 4  115252165     {A}
# 5  115252166  {G, C}
# 6  115252167     {A}
# 7  115252168     {A}
# 8  115252169     {A}

If my interpretation is not correct, I doubt that there is a simple extension of the algorithm based on topological sort. The problem is that a graph can admit multiple topological sorts. If you print dist as defined in the dag_longest_path in your example, you'll get something like this:

{'115252161:T': (0, '115252161:T'),
 '115252162:A': (1, '115252161:T'),
 '115252162:T': (0, '115252162:T'),
 '115252163:G': (2, '115252162:A'),
 '115252164:C': (3, '115252163:G'),
 '115252165:A': (4, '115252164:C'),
 '115252166:C': (5, '115252165:A'),
 '115252166:G': (5, '115252165:A'),
 '115252167:A': (6, '115252166:C'),
 '115252168:A': (7, '115252167:A'),
 '115252169:A': (8, '115252168:A')}

Note that '115252162:T' occurs in the third line and nowhere else. Hence, dist cannot discriminate between your example and another example where '115252162:T' occurs as a disjoint component. So it should not be possible to recover any paths through '115252162:T' just using data in dist.

Edmundedmunda answered 11/12, 2018 at 4:28 Comment(3)
I am sending so much gratitude your way today. I'm new to networkx, so this was really driving me nuts. Works perfectly and it's fast!Trichinosis
Ah, so close. I only want to return all possibilities when the weights are tied (so at position 115252162, A and T have a weight of 1). However, at position 115252166, C has a weight of 8.5 and G only has a weight of 0.5, so this should return only G.Trichinosis
It won't let me edit my comment.. so above, sorry, it should return C at position 115252166.Trichinosis
T
1

I ended up just modeling the behavior in a defaultdict counter object.

from collections import defaultdict, Counter

I modified my edgelist to a tuple of (position, nucleotide, weight):

test = [(112,"A",23.0), (113, "T", 27), (112, "T", 12.0), (113, "A", 27), (112,"A", 1.0)]

Then used defaultdict(counter) to quickly sum occurences of each nucleotide at each position:

nucs = defaultdict(Counter)

for key, nuc, weight in test:
    nucs[key][nuc] += weight

And then looped through the dictionary to pull out all nucleotides that equal the max value:

for key, nuc in nucs.items():
    seq_list = []
    max_nuc = []
    max_val = max(nuc.values())
    for x, y in nuc.items():
        if y == max_val:
            max_nuc.append(x)

    if len(max_nuc) != 1:
        max_nuc = "N"
    else:
        max_nuc = ''.join(max_nuc)

    seq_list.append(max_nuc)
    sequence = ''.join(seq_list)

This returns the final sequence for the nucleotide with the max value found, and returns N in the position of a tie:

TNGCACAAATGCTGAAAGCTGTACCATANCTGTCTGGTCTTGGCTGAGGTTTCAATGAATGGAATCCCGTAACTCTTGGCCAGTTCGTGGGCTTGTTTTGTATCAACTGTCCTTGTTGGCAAATCACACTTGTTTCCCACTAGCACCAT

However, the question bothered me, so I ended up using node attributes in networkx as a means to flag each node as being a tie or not. Now, when a node is returned in the longest path, I can then check the "tie" attribute and replace the node name with "N" if it has been flagged:

def get_path(self, edge_df):

    G = nx.from_pandas_edgelist(
        edge_df,
        source="edge1",
        target="edge2",
        edge_attr=["weight"],
        create_using=nx.OrderedDiGraph()
    )

    # flag all nodes as not having a tie
    nx.set_node_attributes(G, "tie", False)

    # check nodes for ties
    for node in G.nodes:

        # create list of all edges for each node
        edges = G.in_edges(node, data=True)

        # if there are multiple edges
        if len(edges) > 1:

            # find max weight
            max_weight = max([edge[2]['weight'] for edge in edges])

            tie_check = []
            for edge in edges:
                # pull out all edges that match max weight
                if edge[2]["weight"] == max_weight:
                    tie_check.append(edge)

            # check if there are ties       
            if len(tie_check) > 1:
                for x in tie_check:

                    # flag node as being a tie
                    G.node[x[0]]["tie"] = True

    # return longest path
    longest_path = nx.dag_longest_path(G)     

    return longest_path
Trichinosis answered 11/12, 2018 at 22:34 Comment(2)
This sounds like a good solution. Am I correct in assuming that an edge exists between every pair of consecutive positions? If so, you may consider adding it to the question description. Note that in your original example, there is no edge between 115252161:T and 115252162:T.Edmundedmunda
If 115252161:T is a sequencing error, it's possible to see this error at this position as a node that does not connect to any following nodes.Trichinosis

© 2022 - 2024 — McMap. All rights reserved.