Fastest way to perform complex search on pandas dataframe
Asked Answered
T

2

15

I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.

Before:

flightTo  flightFrom  toNum  fromNum  toCode  fromCode
   ABC       DEF       123     456     8000    8000
   DEF       XYZ       456     893     9999    9999
   AAA       BBB       473     917     5555    5555
   BBB       CCC       917     341     5555    5555

After search/sort:

flightTo  flightFrom  toNum  fromNum  toCode  fromCode
   ABC       XYZ       123     893     8000    9999
   AAA       CCC       473     341     5555    5555

In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.

possible edge case:

What if the data is switched up and our end connections are in the same column?

flight1  flight2      1Num    2Num     1Code   2Code
   ABC       DEF       123     456     8000    8000
   XYZ       DEF       893     456     9999    9999

After search/sort:

flight1  flight2      1Num    2Num     1Code   2Code
   ABC       XYZ       123     893     8000    9999

This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ

Threaten answered 28/5, 2019 at 14:7 Comment(5)
Are the connecting flights always adjacent in the data frame?Muggy
np.where(condition)Solvolysis
how about df['flightFrom'].shift() != df['fightTo']?Holloweyed
@Muggy the information can be completely random in the DataFrameThreaten
@Holloweyed check the values in fromNum, fromCode expected output, that's what makes this question complex imo.Dumdum
S
16

This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC

import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other , 
# in a net work graph , they are linked 
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict , 
# since every components connected to each other , 
# then we just need to pick of of them as key , then map with others

d={k: v for d in L for k, v in d.items()}

# create the dict for groupby , since we need _from as first item and _to as last item 
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output 

Out[22]: 
         flightTo flightFrom  toNum  fromNum  toCode  fromCode
flightTo                                                      
0             ABC        XYZ    123      893    8000      9999
1             AAA        CCC    473      341    5555      5555

Installation networkx

  • Pip: pip install networkx
  • Anaconda: conda install -c anaconda networkx
Stadiometer answered 28/5, 2019 at 14:19 Comment(8)
great answer! Looked into networkx couple times, will do more now!Dumdum
@Dumdum this is more like linked key , and network edge pickStadiometer
This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)Dumdum
Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodologyDogberry
@WeNYoBen I updated the question with a potential edge case.Threaten
@Threaten I feel like the data is corrupt , since from to from should not consider within the one netStadiometer
@WeNYoBen you are correct. However, if we remove any indication of direction ('from' , 'to') then suddenly it becomes important.Threaten
@MaxB, I can only say split you dataframe with two , one is the normal network , another is your edge situation , using df1=df[df.ID.duplicated(keep=False)];df2=df2.drop(df1.index); df1.groupby('flightFrom').agg(....), df2 follow above steps Stadiometer
P
6

Here's a NumPy solution, which might be convenient in the case performance is relevant:

def remove_middle_dest(df):
    x = df.to_numpy()
    # obtain a flat numpy array from both columns
    b = x[:,0:2].ravel()
    _, ix, inv = np.unique(b, return_index=True, return_inverse=True)
    # Index of duplicate values in b
    ixs_drop = np.setdiff1d(np.arange(len(b)), ix) 
    # Indices to be used to replace the content in the columns
    replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0) 
    # Col index of where duplicate value is, 0 or 1
    col = (ixs_drop % 2) ^ 1
    # 2d array to index and replace values in the df
    # index to obtain values with which to replace
    keep_cols = np.broadcast_to([3,5],(len(col),2))
    ixs = np.concatenate([col[:,None], keep_cols], 1)
    # translate indices to row indices
    rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
    c = np.empty((len(col), 5), dtype=x.dtype)
    c[:,::2] = x[rows_drop[:,None], ixs]
    c[:,1::2] = x[rows_replace[:,None], [2,4]]
    # update dataframe and drop rows
    df.iloc[rows_replace, 1:] = c
    return df.drop(rows_drop)

Which fo the proposed dataframe yields the expected output:

print(df)
    flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        DEF    123      456    8000      8000
1      DEF        XYZ    456      893    9999      9999
2      AAA        BBB    473      917    5555      5555
3      BBB        CCC    917      341    5555      5555

remove_middle_dest(df)

    flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        XYZ    123      893    8000      9999
2      AAA        CCC    473      341    5555      5555

This approach does not assume any particular order in terms of the rows where the duplicate is, and the same applies to the columns (to cover the edge case described in the question). If we use for instance the following dataframe:

    flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        DEF    123      456    8000      8000
1      XYZ        DEF    893      456    9999      9999
2      AAA        BBB    473      917    5555      5555
3      BBB        CCC    917      341    5555      5555

remove_middle_dest(df)

     flightTo flightFrom  toNum  fromNum  toCode  fromCode
0      ABC        XYZ    123      456    8000      9999
2      AAA        CCC    473      341    5555      5555
Pregnable answered 28/5, 2019 at 14:32 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.