data-science Questions
1
For a particular prediction problem, I observed that a certain variable ranks high in the XGBoost feature importance that gets generated (on the basis of Gain) while it ranks quite low in the SHAP ...
Gangue asked 15/6, 2022 at 6:0
2
Solved
Suppose I have a pipeline for my data which does preprocessing and has an estimator at the end. Now if I want to just change the estimator/model at the last step of the pipeline, how do I do it wit...
Supplant asked 20/11, 2017 at 5:52
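One common approach to this, sketched here under assumptions and not taken from the accepted answer: scikit-learn's `Pipeline.set_params` can replace a named step while leaving the preprocessing steps untouched. The step names `scaler` and `model`, the two estimators, and the toy data are all hypothetical choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data, assumed purely for illustration.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Hypothetical pipeline: one preprocessing step followed by an estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)

# Replace only the final step by name; the "scaler" step is kept as defined.
pipe.set_params(model=DecisionTreeClassifier(random_state=0))
pipe.fit(X, y)  # refit so the new estimator is trained

final_step_name = type(pipe.steps[-1][1]).__name__
predictions = pipe.predict(X)
```

Note that the pipeline has to be refitted after the swap, because the new estimator starts out untrained.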
3
Solved
Question:
How can you use R to remove all special characters from a dataframe, quickly and efficiently?
Progress:
This SO post details how to remove special characters. I can apply the gsub fu...
Memphis asked 17/4, 2018 at 20:18
2
Solved
I have a dataframe with over 280 features.
I ran a correlation map to detect groups of features that are highly correlated:
Now, I want to divide the features into groups, such that each group will be...
Remiss asked 19/10, 2020 at 9:34
1
Getting this error: AttributeError: 'GPT2Tokenizer' object has no attribute 'train_new_from_iterator'
The code is very similar to the Hugging Face documentation. I changed the input and that's it (shouldn't affe...
Harlen asked 22/4, 2022 at 20:43
1
I'm trying to deploy a SageMaker endpoint and it gets stuck in the "Creating" stage indefinitely. Below are my Dockerfile and training / serving script. The model trains without any issue. Onl...
Boabdil asked 12/1, 2021 at 4:57
5
What is the difference between the two? It seems that both create new columns, whose number equals the number of unique categories in the feature. Then they assign 0 and 1 to data points...
Puduns asked 22/5, 2018 at 17:25
3
Solved
I am trying to do a hyperparameter search using scikit-learn's GridSearchCV on XGBoost. During the grid search I'd like it to stop early, since that reduces search time drastically and (expecting to) ha...
Wrens asked 24/3, 2017 at 7:15
6
Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with too...
Dreamy asked 19/10, 2018 at 9:33
3
Solved
I'm confused about using cross_val_predict on a test data set.
I created a simple Random Forest model and used cross_val_predict to make predictions:
from sklearn.ensemble import RandomForestClassi...
Glyptography asked 10/1, 2017 at 2:28
6
Solved
How to load a model from an HDF5 file in Keras?
What I tried:
model = Sequential()
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(LeakyReLU(alpha=0.3))
model.add(BatchNormalization...
Unmistakable asked 29/1, 2016 at 0:3
3
I plotted data on a barplot using the seaborn library. But on top of the bars, I can see some black lines. Can someone explain what they mean?
Note : the last bar does not have this line as ...
Shuffleboard asked 13/10, 2019 at 9:58
4
Solved
I am using SVC classifier with Linear kernel to train my model.
Train data: 42000 records
model = SVC(probability=True)
model.fit(self.features_train, self.labels_train)
y_pred = model.predict(...
Woolpack asked 27/12, 2018 at 5:43
3
Solved
I am getting the error:
"ValueError: Expected 2D array, got 1D array instead: array=[ 45000.
50000. 60000. 80000. 110000. 150000. 200000. 300000.
500000. 1000000.]. Reshape your data either...
Lessen asked 24/10, 2018 at 5:52
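As a hedged illustration of what this error message asks for: scikit-learn estimators expect a 2D array of shape `(n_samples, n_features)`, so a 1D array holding a single feature is usually reshaped with `reshape(-1, 1)`. The variable name `salaries` is an assumption; the values are the first five from the error message quoted above.

```python
import numpy as np

# 1D array of shape (5,): this is the shape the error complains about.
salaries = np.array([45000., 50000., 60000., 80000., 110000.])

# One feature, many samples: a column vector of shape (n_samples, 1).
single_feature = salaries.reshape(-1, 1)

# One sample, many features: a row vector of shape (1, n_features).
single_sample = salaries.reshape(1, -1)

print(salaries.shape, single_feature.shape, single_sample.shape)
```

Which of the two reshapes applies depends on whether the array holds one feature across many rows or one row across many features.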
1
Solved
I developed a custom dataset by using the PyTorch dataset class. The code looks like this:
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, root_path, transform=None):
        self.path = ...
Refrigeration asked 26/1, 2022 at 15:10
1
Solved
I am trying to perform a calculation involving a geometric progression (split). Is there an effective/efficient way of doing it? The data set has millions of rows.
I need the column "Traded_qua...
Moen asked 22/1, 2022 at 7:31
2
Solved
n_level = range(1, steps + 2)
steps is user input, using multi-index dataframe
df = {'crest': [754, 755, 762, 785], 'trough': [752, 725, 759, 765], 'L1T': [761, 761, 761, 761], 'L2T': [772, 772, ...
Haven asked 6/1, 2022 at 0:55
8
Solved
I am receiving the error:
ValueError: Wrong number of items passed 3, placement implies 1, and I am struggling to figure out where, and how I may begin addressing the problem.
I don't really under...
Beaux asked 4/4, 2017 at 1:35
1
I am trying to finetune a pre-trained GPT2-model. When applying the respective tokenizer, I originally got the error message:
Using pad_token, but it is not set yet.
Thus, I changed my code to:
G...
Southeaster asked 22/6, 2021 at 13:19
3
I am trying to use the pandera library (I am very new to this) for pandas dataframe validation.
What I want to do is ignore the rows which are not valid as per the schema.
How can I do that?
for e...
Optimism asked 12/11, 2021 at 19:11
4
I am not able to import the category_encoders module in a Jupyter notebook in a Python 3 virtual environment.
Error
---------------------------------------------------------------------------
ModuleNot...
Nicholnichola asked 19/1, 2019 at 9:29
2
I have a temp DF that has the following data in it
Quarter
2016Q3 146660510.0
2016Q4 123641451.0
2017Q1 125905843.0
2017Q2 129656327.0
2017Q3 126586708.0
2017Q4 116804168.0
2018Q1 118167263.0
2018Q...
Reuter asked 28/1, 2021 at 15:0
3
Solved
I'm an avid R user and am learning Python along the way. One piece of example code that I can easily run in R is perplexing me in Python.
Here's the original data (constructed within R):
library(ti...
Alms asked 26/2, 2019 at 18:31
3
Solved
I am doing a data cleaning exercise in Python and the text that I am cleaning contains Italian words which I would like to remove. I have been searching online whether I would be able to do this on...
Mikkimiko asked 22/12, 2016 at 19:0
1
Solved
I am trying to work with Featuretools to develop an automated feature engineering workflow for the customer churn dataset. The end outcome is a function that takes in a dataset and label times for ...
Medlock asked 12/9, 2021 at 4:40