Creating New Columns in Power BI with a Python Script
Asked Answered
K

1

6

I am trying to run a python script so that I can create a household count based on the residential address column and residential city column. Both columns just contain strings.

The script that I have tried can be seen below:

dataset['id'] =dataset.groupby(['RESIDENTIAL_ADDRESS1','RESIDENTIAL_CITY']).ngroup()
dataset['household_count'] = dataset.groupby(['id'])['id'].transform('count')

Yet, it gives me this error after 20,000 rows:

DataSource.Error: ADO.NET: A problem occurred while processing your Python script. Here are the technical details: [DataFormat.Error] We couldn't convert to Number. Details:DataSourceKind=Python DataSourcePath=Python Message=A problem occurred while processing your Python script. Here are the technical details: [DataFormat.Error] We couldn't convert to Number. ErrorCode=-2147467259.

Is there any way to fix this? This code works in python every single time and the error code make absolutely no sense in Power BI and I would greatly appreciate any advice on how to do this with DAX.

Kaceykachina answered 3/9, 2019 at 23:34 Comment(11)
I would like to isolate the problem, can you take out the second line and see if you still get the problem?Shizukoshizuoka
@Shizukoshizuoka I still get this problem and with other code tooKaceykachina
@Shizukoshizuoka can you identify why this would happen? I have tried this with different python environments too.Kaceykachina
I want to know which line is causing the problem, please comment line 2 out and see if you get problem.Shizukoshizuoka
@Shizukoshizuoka I did that like I said it still does not workKaceykachina
Which line gives the problem?Shizukoshizuoka
@Shizukoshizuoka the first line is necessary for the second. The first line still gives the same error.Kaceykachina
Can you show your raw data and the outcome you have, please add to question?Shizukoshizuoka
@Shizukoshizuoka How? Per Stack overflow I can't attach links otherwise I'll get downvoted and the data set is 7.8 million rows. There are no blanks. It is all just strings and yet it is talking about numbers. Something is wrong here.Kaceykachina
I was thinking to solve this with DAX so you do not get the error anymore, to be able to do so I need some raw data (10 rows), your data structure and your output. I would like to mention that I am trying to help you..Shizukoshizuoka
@Data Science Acolyte Regarding the request from Aldert, you don't need to share your entire raw dataset. Normally a descriptio of your dataset would go a long way, like: One or more tables? Column names? Data types?Omdurman
O
2

I have not been able to reproduce your error, but I strongly suspect the source of the error to be the datatypes. In the Power Query Editor, try transforming your grouping variables to text. The fact that your query fails for a dataset larger than 20000 rows should have absolutely nothing to do with the problem. Unless, of course, the data content somehow changes after row 20000.

If you could describe your datasource and show the applied steps in the Power Query Editor that would be of great help for anyone trying to assist you. You could also try to apply your code one step at a time, meaning making one table using dataset['id'] =dataset.groupby(['RESIDENTIAL_ADDRESS1','RESIDENTIAL_CITY']).ngroup() and yet another table using dataset['household_count'] = dataset.groupby(['id'])['id'].transform('count')

I might as well show you how to do just that, and maybe at the same time cement my suspicion that the error lies in the datatypes and hopefully rule out other error sources.


I'm using numpy along with a few random city and street names to build a dataset that I hope represents the structure and datatypes of your real world dataset:

Snippet 1:

import numpy as np
import pandas as pd

np.random.seed(123)
strt=['Broadway', 'Bowery', 'Houston Street', 'Canal Street', 'Madison', 'Maiden Lane']
city=['New York', 'Chicago', 'Baltimore', 'Victory Boulevard', 'Love Lane', 'Utopia Parkway']

RESIDENTIAL_CITY=np.random.choice(strt,21000).tolist()
RESIDENTIAL_ADDRESS1=np.random.choice(strt,21000).tolist()
sample_dataset=pd.DataFrame({'RESIDENTIAL_CITY':RESIDENTIAL_CITY,
                      'RESIDENTIAL_ADDRESS1':RESIDENTIAL_ADDRESS1})

Copy that snippet, go to PowerBI Desktop > Power Query Editor > Transform > Run Python Script and run it to get this:

enter image description here

Then do the same thing with this snippet:

dataset['id'] =dataset.groupby(['RESIDENTIAL_ADDRESS1','RESIDENTIAL_CITY']).ngroup()

Now you should have this:

enter image description here

So far, your last step is called Changed Type 2. Right above is a step called dataset. If you click that you'll see that the datatype of ID there is a string ABC and that it changes to number 123 in the next step. With my settings, Power BI inserts the step Changed Type 2 automatically. Maybe that is not the case with you? It cerainly can be a potential error source.

Next, insert your last line as a step of it's own:

dataset['household_count'] = dataset.groupby(['id'])['id'].transform('count')

Now you should have the dataset like below, along with the same steps under Applied Steps:

enter image description here

With this setup, everything seems to be working fine. So, what do we know for sure by now?

  1. The size of the dataset is not the problem
  2. Your code itself is not the problem
  3. Python should handle this perfectly in Power BI

And what do we suspect?

  1. Your data is the problem - either missing values or wrong type

I hope this helps you out somehow. If not, then don't hesitate to let me know.

Omdurman answered 18/10, 2019 at 12:30 Comment(1)
@Data Science Acolyte How did this work out for you?Omdurman

© 2022 - 2024 — McMap. All rights reserved.