Read data from a pandas DataFrame and create a tree using anytree in python
G

4

13

Is there a way to read data from a pandas DataFrame and construct a tree using anytree?

Parent Child
A      A1
A      A2
A2     A21

I can do it with static values as follows. However, I want to automate this by reading the data from a pandas DataFrame with anytree.

>>> from anytree import Node, RenderTree
>>> A = Node("A")
>>> A1 = Node("A1", parent=A)
>>> A2 = Node("A2", parent=A)
>>> A21 = Node("A21", parent=A2)
>>> for pre, _, node in RenderTree(A):
...     print("%s%s" % (pre, node.name))

Output is

A
├── A1
└── A2
    └── A21

This question, and especially the answer, has been adapted (copied, really) from:

Read data from a file and create a tree using anytree in python

Many thanks to @Fabien N

Georgeanngeorgeanna answered 27/9, 2020 at 4:24 Comment(0)
S
9

Create each node the first time its name is seen and store a reference to it in a dictionary, nodes, for later use. Then set the child's parent, re-assigning it when necessary. The roots of the forest are the Parent values that never appear in the Child column: a root is not a child of any node, so it cannot show up there.

import pandas as pd
from anytree import Node, RenderTree

def add_nodes(nodes, parent, child):
    if parent not in nodes:
        nodes[parent] = Node(parent)  
    if child not in nodes:
        nodes[child] = Node(child)
    nodes[child].parent = nodes[parent]

data = pd.DataFrame(columns=["Parent","Child"], data=[["A","A1"],["A","A2"],["A2","A21"],["B","B1"]])
nodes = {}  # store references to created nodes 
# data.apply(lambda x: add_nodes(nodes, x["Parent"], x["Child"]), axis=1)  # 1-liner
for parent, child in zip(data["Parent"],data["Child"]):
    add_nodes(nodes, parent, child)

roots = list(data[~data["Parent"].isin(data["Child"])]["Parent"].unique())
for root in roots:         # you can skip this for roots[0], if there is no forest and just 1 tree
    for pre, _, node in RenderTree(nodes[root]):
        print("%s%s" % (pre, node.name))

Result:

A
├── A1
└── A2
    └── A21
B
└── B1

Update: printing a specific root:

root = 'A' # change according to usecase
for pre, _, node in RenderTree(nodes[root]):
    print("%s%s" % (pre, node.name))
Shemikashemite answered 27/9, 2020 at 5:26 Comment(6)
Excellent @Suparshva! Because there is a large forest, how do we modify the code to print only one tree based on user input of the parent? By the way, I will mark this as the answer for sure as soon as I finish successfully adopting it for my specific case. - Georgeanngeorgeanna
@Georgeanngeorgeanna You can drop the outer for root in roots: loop and assign the root variable directly with the root key you are looking for. I have added an example. - Coachwhip
Thank you @Suparshva. This Answer is perfect and exactly what the doctor ordered. - Georgeanngeorgeanna
Welcome, glad it helped! - Coachwhip
Might be one of the most on-the-spot answers I've ever seen from SO. Addressed something that I've been trying to implement for the last 16 hours straight. - Bind
Hey @Shemikashemite can you help me with this one? #74622729 - Mulloy
G
2

There is a Python library, bigtree, that does this out of the box: it integrates with Python lists, dictionaries, and pandas DataFrames.

# Setup
import pandas as pd
 
data = pd.DataFrame(
   [
      ["A", "A1"],
      ["A", "A2"],
      ["A2", "A21"],
   ],
   columns=["Parent", "Child"],
)
 
# Using bigtree
from bigtree import dataframe_to_tree_by_relation
 
root = dataframe_to_tree_by_relation(data, child_col="Child", parent_col="Parent")
root.show()

This will result in the output

A
├── A1
└── A2
    └── A21

Disclaimer: I'm the author of bigtree :)

Gast answered 15/1 at 9:11 Comment(0)
G
1

Please refer to @Fabian N 's answer at Read data from a file and create a tree using anytree in python for details.

Below is an adaptation of his answer, which reads from an external file, so that it works with a pandas DataFrame:

    from anytree import Node, RenderTree

    # df is a pandas DataFrame with 'Parent' and 'child' columns. Root rows are
    # marked by having the same value in both columns: I modified the DataFrame
    # by concatenating a dataframe of all the roots in my data, copied into both
    # the parent and child columns. This can be skipped by statically setting
    # the roots, as long as the assumption highlighted by @Fabien in the above
    # quoted answer still holds (the entries are in such an order that a parent
    # node was always introduced as a child of another node beforehand).
    for index, row in df.iterrows():
        if row['child'] == row['Parent']:
            # Root row: start a new tree with a fresh node lookup.
            root = Node(row['Parent'])
            nodes = {root.name: root}
        else:
            # Regular row: attach the child to its already-created parent.
            name = row['child'].strip()
            nodes[name] = Node(name, parent=nodes[row['Parent']])

    # Renders the tree of the last root encountered.
    for pre, _, node in RenderTree(root):
        print("%s%s" % (pre, node.name))

If there is a better way to achieve the above, kindly post an answer and I will accept it as the solution.

Many thanks @Fabian N.

Georgeanngeorgeanna answered 27/9, 2020 at 4:44 Comment(1)
Refer to accepted answer from @Suparshva. It is much better than my proposal.Georgeanngeorgeanna
K
1

If you need to attach data to the node (not just have the parent-child keys) and if for whatever reason you don't want to scan the whole dataframe to find the roots, the following modification to @Suparshva 's answer works. Saving and checking prospective roots in a set is fast. The attached data (val in this example) can of course be substituted by anything, including the whole dataframe row. No assumption on the order of the inputs is made.

import pandas as pd
from anytree import Node, RenderTree


def print_tree(nodes: dict, roots: set) -> None:
    for root in roots:
        print()
        for pre, _, node in RenderTree(nodes[root]):
            print(f'{pre}{node.name} ({node.val})')


def add_nodes(nodes: dict, roots: set, parent: str, child: str, val: int) -> None:
    if parent not in nodes:
        nodes[parent] = Node(parent, val=None)
        roots.add(parent)
    if child not in nodes:
        nodes[child] = Node(child, val=val)
    else:
        nodes[child].val = val
    nodes[child].parent = nodes[parent]
    if child in roots:
        roots.remove(child)


def create_tree(df: pd.DataFrame) -> None:
    nodes = {}
    roots = set()
    for row in df.itertuples(index=False, name='df_row'):
        if row.c_key is not None:
            add_nodes(nodes, roots, row.p_key, row.c_key, row.val)
    print_tree(nodes, roots)

# Sample DataFrame
data = {'p_key': ['R', 'A', 'A', 'B', 'B', 'C', 'G', 'Y'],
        'c_key': ['X', 'B', 'C', 'D', 'E', 'F', 'H', 'A'],
        'val': [0, 1, 2, 3, 4, 5, 6, 55]}

df = pd.DataFrame(data)
create_tree(df)

Produces (the order of the trees may vary between runs, because roots is a set):


R (None)
└── X (0)

Y (None)
└── A (55)
    ├── B (1)
    │   ├── D (3)
    │   └── E (4)
    └── C (2)
        └── F (5)

G (None)
└── H (6)
Kresic answered 3/10, 2023 at 21:50 Comment(0)
