Sentence Embeddings with Hugging Face Transformers, Sentence Transformers and Amazon SageMaker - Custom Inference for creating document embeddings with Hugging Face's Transformers
Welcome to this getting started guide. We will use the Hugging Face Inference DLCs and the Amazon SageMaker Python SDK to create a real-time inference endpoint running a Sentence Transformers model for document embeddings. Currently, the SageMaker Hugging Face Inference Toolkit supports the pipeline feature from Transformers for zero-code deployment. This means you can run compatible Hugging Face Transformers models without providing pre- and post-processing code. All we need to do is provide the environment variables HF_TASK and HF_MODEL_ID when creating our endpoint, and the Inference Toolkit takes care of the rest. This is a great feature if you are working with existing pipelines.
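As a refresher, a zero-code deployment only needs those two environment variables. A minimal sketch (the model id and DLC versions are placeholders, and role is the IAM role we set up in the Permissions section below):

```python
from sagemaker.huggingface.model import HuggingFaceModel

# zero-code deployment: the Inference Toolkit builds a Transformers pipeline
# from HF_MODEL_ID and HF_TASK, so no inference.py is required
hub_model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",  # placeholder model id
        "HF_TASK": "text-classification",  # placeholder task
    },
    role=role,                    # IAM role, created in the Permissions section below
    transformers_version="4.26",  # assumed DLC versions; pick ones available in your region
    pytorch_version="1.13",
    py_version="py39",
)
```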
If you want to run other tasks, such as creating document embeddings, you can provide the pre- and post-processing code yourself via an inference.py script. The Hugging Face Inference Toolkit allows the user to override the default methods of the HuggingFaceHandlerService.
The custom module can override the following methods:
- model_fn(model_dir) overrides the default method for loading a model. The return value model will be used in predict_fn for predictions. model_dir is the path to your unzipped model.tar.gz.
- input_fn(input_data, content_type) overrides the default method for pre-processing. The return value data will be used in predict_fn for predictions. The inputs are: input_data is the raw body of your request, and content_type is the content type from the request header.
- predict_fn(processed_data, model) overrides the default method for predictions. The return value predictions will be used in output_fn. model is the value returned from the model_fn method, and processed_data is the value returned from the input_fn method.
- output_fn(prediction, accept) overrides the default method for post-processing. The return value result will be the response to your request (e.g. JSON). The inputs are: prediction is the result from predict_fn, and accept is the accept type from the HTTP request, e.g. application/json.
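Putting the four hooks together, a minimal inference.py skeleton (names are from the list above; the bodies are illustrative only) looks like this:

```python
# inference.py -- skeleton of the four overridable handler methods (illustrative only)

def model_fn(model_dir):
    # load and return your model from the unzipped model.tar.gz
    ...

def input_fn(input_data, content_type):
    # deserialize the raw request body based on its content type
    ...

def predict_fn(processed_data, model):
    # run the actual prediction and return it
    ...

def output_fn(prediction, accept):
    # serialize the prediction into the requested accept type, e.g. application/json
    ...
```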
In this example, we are going to use Sentence Transformers to create sentence embeddings by applying a mean pooling layer on the raw token representations.
NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.
Development Environment and Permissions
Installation
Install git and git-lfs
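The install cells are not shown here; on a Debian-based notebook environment they might look like the following (the SDK version pin and the apt package manager are assumptions, so adjust for your OS):

```python
# install/upgrade the SageMaker Python SDK (version pin is an assumption)
!pip install "sagemaker>=2.100.0" --upgrade

# install git-lfs for downloading large model files (apt shown here)
!sudo apt-get update -y && sudo apt-get install -y git-lfs
```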
Permissions
If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it here.
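A common pattern for resolving the role in both environments is sketched below; the role name sagemaker_execution_role is a placeholder for your own IAM role:

```python
import sagemaker
import boto3

sess = sagemaker.Session()

try:
    # works inside SageMaker Studio / Notebook Instances
    role = sagemaker.get_execution_role()
except ValueError:
    # local environment: look the role up by name (placeholder role name)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
```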
Create a custom inference.py script
To use the custom inference script, you need to create an inference.py script. In our example, we are going to override the model_fn to load our sentence transformer correctly and the predict_fn to apply mean pooling.
We are going to use the sentence-transformers/all-MiniLM-L6-v2 model. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
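The script itself is not reproduced here, but a version consistent with the description (mean pooling over token embeddings, weighted by the attention mask) could look like this:

```python
# inference.py -- a sketch consistent with the description above, assuming the
# model artifacts from model.tar.gz are unpacked into model_dir
from transformers import AutoTokenizer, AutoModel
import torch

def mean_pooling(model_output, attention_mask):
    # average the token embeddings, weighted by the attention mask
    token_embeddings = model_output[0]  # last hidden state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )

def model_fn(model_dir):
    # load model and tokenizer from the unzipped model.tar.gz
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    sentences = data.pop("inputs", data)
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        model_output = model(**encoded_input)
    # apply mean pooling to get one fixed-size vector per input sentence
    sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
    return {"vectors": sentence_embeddings.tolist()}
```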
Create model.tar.gz with inference script and model
To use our inference.py we need to bundle it into a model.tar.gz archive with all our model artifacts, e.g. pytorch_model.bin. The inference.py script will be placed into a code/ folder. We will use git and git-lfs to easily download our model from hf.co/models and upload it to Amazon S3 so we can use it when creating our SageMaker endpoint.
1. Download the model from hf.co/models with git clone.
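The cell that performs the download is omitted above; it presumably runs something like the following, and the log output below matches such a clone:

```python
# initialize git-lfs so the large model files are fetched correctly
!git lfs install

# clone the model repository from hf.co/models
!git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
```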
Updated git hooks.
Git LFS initialized.
Cloning into 'all-MiniLM-L6-v2'...
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 25 (delta 3), reused 0 (delta 0)
Unpacking objects: 100% (25/25), 308.60 KiB | 454.00 KiB/s, done.
2. Copy inference.py into the code/ directory of the model directory.
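Assuming inference.py was saved in the current working directory, the copy step might be:

```python
# create the code/ directory inside the model folder and copy the script into it
# (assumes inference.py is in the current working directory)
!mkdir -p all-MiniLM-L6-v2/code
!cp inference.py all-MiniLM-L6-v2/code/inference.py
```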
3. Create a model.tar.gz archive with all the model artifacts and the inference.py script.
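One way to build the archive is to run tar from inside the model directory, so that all paths are relative to the archive root:

```python
# package every file in the model directory, including code/inference.py,
# at the top level of the archive
!cd all-MiniLM-L6-v2 && tar zcvf model.tar.gz *
```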
4. Upload the model.tar.gz to Amazon S3.
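A sketch using the SageMaker SDK's S3Uploader; the bucket comes from the session created in the Permissions section, and the prefix is a placeholder:

```python
from sagemaker.s3 import S3Uploader

# upload the archive to the session's default bucket (prefix is a placeholder)
s3_model_uri = S3Uploader.upload(
    local_path="all-MiniLM-L6-v2/model.tar.gz",
    desired_s3_uri=f"s3://{sess.default_bucket()}/custom_inference/all-MiniLM-L6-v2",
)
print(f"model uploaded to: {s3_model_uri}")
```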
Create custom HuggingFaceModel
After we have created and uploaded our model.tar.gz archive to Amazon S3, we can create a custom HuggingFaceModel class. This class will be used to create and deploy our SageMaker endpoint.
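A sketch of that step, reusing s3_model_uri and role from the steps above; the DLC versions and instance type are assumptions, so choose ones available in your region:

```python
from sagemaker.huggingface.model import HuggingFaceModel

# create the Hugging Face Model class pointing at our custom model.tar.gz
huggingface_model = HuggingFaceModel(
    model_data=s3_model_uri,      # S3 path to the archive containing code/inference.py
    role=role,                    # IAM role with SageMaker permissions
    transformers_version="4.26",  # assumed DLC versions
    pytorch_version="1.13",
    py_version="py39",
)

# deploy the model as a real-time endpoint (instance type is an assumption)
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```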
Request Inference Endpoint using the HuggingFacePredictor
The .deploy() call returns a HuggingFacePredictor object, which can be used to request inference.
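With the key names from our predict_fn sketch above, a request could look like this:

```python
data = {
    "inputs": [
        "the mesmerizing performances of the leads keep the film grounded.",
        "sentence embeddings are useful for clustering and semantic search.",
    ]
}

res = predictor.predict(data=data)
# each input sentence maps to one 384-dimensional vector
print(len(res["vectors"]), len(res["vectors"][0]))
```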
Delete model and endpoint
To clean up, we can delete the model and endpoint.
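Both can be removed through the predictor:

```python
# delete the SageMaker model and the endpoint to stop incurring costs
predictor.delete_model()
predictor.delete_endpoint()
```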