SageMaker Endpoint stuck at "Creating"

I'm trying to deploy a SageMaker endpoint, and it gets stuck in the "Creating" stage indefinitely. Below are my Dockerfile and training/serving script. The model trains without any issue; only the endpoint deployment gets stuck at "Creating".

Below is the folder structure

|_code
   |_train_serve.py
|_Dockerfile

Below is the Dockerfile

Dockerfile

# ##########################################################

# Adapt your container (to work with SageMaker)
# # https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html
# # https://hub.docker.com/r/huanjason/scikit-learn/dockerfile

ARG REGION=us-east-1

FROM python:3.7

RUN apt-get update && apt-get -y install gcc

RUN pip3 install \
        # numpy==1.16.2 \
        numpy \
        # scikit-learn==0.20.2 \
        scikit-learn \
        pandas \
        # scipy==1.2.1 \
        scipy \
        mlflow

RUN rm -rf /root/.cache

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE

# Install the sagemaker-training toolkit so the container can run the entry-point script for SageMaker training jobs
RUN pip3 install sagemaker-training

ENV PATH="/opt/ml/code:${PATH}"

# Copies the training code inside the container
COPY code /opt/ml/code

# Defines train_serve.py as script entrypoint
ENV SAGEMAKER_PROGRAM train_serve.py

Below is the script used for training and serving the model

train_serve.py

import os
import ast
import warnings
import sys
import json
import argparse
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures
from urllib.parse import urlparse
import logging
import pickle

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    # Data, model, and output directories
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='kc_house_data_train.csv')
    parser.add_argument('--test-file', type=str, default='kc_house_data_test.csv')
    parser.add_argument('--features', type=str)  # we ask user to explicitly name features
    parser.add_argument('--target', type=str) # we ask user to explicitly name the target

    args, _ = parser.parse_known_args()

    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # Reading training and testing datasets
    logging.info('reading training and testing datasets')
    logging.info(f"{args.train} {args.train_file} {args.test} {args.test_file}")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))
    
    logging.info(args.features.split(','))
    logging.info(args.target)
    train_x = np.array(train_df[args.features.split(',')]).reshape(-1,1)
    test_x = np.array(test_df[args.features.split(',')]).reshape(-1,1)
    train_y = np.array(train_df[args.target]).reshape(-1,1)
    test_y = np.array(test_df[args.target]).reshape(-1,1)  

    reg = linear_model.LinearRegression()

    reg.fit(train_x, train_y)
    predicted_price = reg.predict(test_x)
    (rmse, mae, r2) = eval_metrics(test_y, predicted_price)

    logging.info(f"        Linear model: (features={args.features}, target={args.target})")
    logging.info(f"            RMSE: {rmse}")
    logging.info(f"            MAE: {mae}")
    logging.info(f"            R2: {r2}")

    model_path = os.path.join(args.model_dir, "model.pkl")
    logging.info(f"saving to {model_path}")          
    logging.info(args.model_dir)
    with open(model_path, 'wb') as path:
        pickle.dump(reg, path)


def model_fn(model_dir):
    with open(os.path.join(model_dir, "model.pkl"), "rb") as input_model:
        model = pickle.load(input_model)
    return model
    
def predict_fn(input_object, model):
    _return = model.predict(input_object)
    return _return
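
For reference, training and deployment are driven with the SageMaker Python SDK roughly as in the sketch below; the image URI, role ARN, S3 paths, hyperparameter values, and instance types here are placeholders rather than my actual values.

from sagemaker.estimator import Estimator

# Placeholders: substitute your own account, region, image, role, and bucket.
image_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-sklearn:latest"
role = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    hyperparameters={"features": "sqft_living", "target": "price"},
)

# The train/test channels map to SM_CHANNEL_TRAIN / SM_CHANNEL_TEST in train_serve.py.
estimator.fit({
    "train": "s3://my-bucket/kc-house/train/",  # contains kc_house_data_train.csv
    "test": "s3://my-bucket/kc-house/test/",    # contains kc_house_data_test.csv
})

# This is the step that hangs: the endpoint stays in "Creating".
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")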
Boabdil asked 12/1, 2021 at 4:57

Comments:

Benildis: Is there any error logged in CloudWatch? How long does it stay stuck? Could this be the same issue? forums.aws.amazon.com/thread.jspa?threadID=320543

Boabdil: It was stuck for ~5 hours without any progress. Later, I deleted the endpoint configuration and model to make it fail. Nothing showed up in CloudWatch.

Trinh: @Boabdil were you able to figure out the solution?

One way of investigating this is to try using the same model from the AWS console as part of a Batch Transform job, since that flow seems to give better error messages and diagnostics than inference endpoint creation does.
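
If you prefer to script it, a rough programmatic equivalent of that console flow with the SageMaker Python SDK looks like the sketch below; the image URI, model artifact path, role, and S3 locations are placeholders.

from sagemaker.model import Model

# Placeholders: point these at your own image, model artifact, role, and bucket.
model = Model(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-sklearn:latest",
    model_data="s3://my-bucket/output/model.tar.gz",
    role="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
)

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-bucket/batch-output/",
)

# A broken model/role/container usually fails here quickly with a readable
# error, instead of an endpoint hanging in "Creating" for hours.
transformer.transform(data="s3://my-bucket/batch-input/", content_type="text/csv")
transformer.wait()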

In my case, this made me realise that the IAM role associated with the model at creation time no longer existed. I'd overlooked this because the roles were CDK-managed and had been removed at some point, while the models were created dynamically via Step Functions pipelines.

Anyway, deploying with a non-existent role would lead to the SageMaker endpoint remaining in the "Creating" state for a few hours, before failing with "Request to service failed. If failure persists after retry, contact customer support", and there would be no CloudWatch logs. Re-creating the model with a valid role fixed the issue.
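
A quick way to check for this same failure mode without the console is to compare the execution role recorded on the model against IAM. A minimal boto3 sketch, assuming you know the model name, is below; "my-model" and "my-endpoint" are placeholders.

import boto3

sm = boto3.client("sagemaker")
iam = boto3.client("iam")

model_name = "my-model"  # placeholder: the SageMaker model behind the endpoint

# The model carries the execution role that SageMaker tries to assume.
role_arn = sm.describe_model(ModelName=model_name)["ExecutionRoleArn"]
role_name = role_arn.split("/")[-1]

try:
    iam.get_role(RoleName=role_name)
    print(f"Execution role {role_name} exists")
except iam.exceptions.NoSuchEntityException:
    print(f"Execution role {role_name} on model {model_name} no longer exists")

# Once the endpoint eventually fails, DescribeEndpoint exposes a FailureReason:
# print(sm.describe_endpoint(EndpointName="my-endpoint").get("FailureReason"))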

Apologies if the above does not apply to the OP, who reports the same problem with a different setup that I'm not familiar with. I'm just sharing how I resolved a similar problem that brought me to this page, in case it helps anyone in future.

Niemann answered 4/4, 2022 at 8:18
