How to deploy Embedding Models to Amazon SageMaker using new Hugging Face Embedding DLC
This example shows how to deploy open Embedding Models, such as Snowflake/snowflake-arctic-embed-l, BAAI/bge-large-en-v1.5 or sentence-transformers/all-MiniLM-L6-v2, to Amazon SageMaker for inference using the new Hugging Face Embedding Inference Container. We will deploy Snowflake/snowflake-arctic-embed-m, one of the best open Embedding Models for retrieval and ranking on the MTEB Leaderboard.
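The example covers:
1. Setup development environment
2. Retrieve the new Hugging Face Embedding Container
3. Deploy Snowflake Arctic to Amazon SageMaker
4. Run and evaluate Inference performance
5. Delete model and endpoint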
What is Hugging Face Embedding DLC?
The Hugging Face Embedding DLC is a new purpose-built Inference Container for deploying Embedding Models in a secure and managed environment. The DLC is powered by Text Embedding Inference (TEI), a fast and memory-efficient solution for deploying and serving Embedding Models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. TEI implements many features, such as:
No model graph compilation step
Small docker images and fast boot times
Token based dynamic batching
Optimized transformers code for inference using Flash Attention, Candle and cuBLASLt
Safetensors weight loading
Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
TEI supports the following model architectures:
BERT/CamemBERT, e.g. BAAI/bge-large-en-v1.5 or Snowflake/snowflake-arctic-embed-m
XLM-RoBERTa, e.g. sentence-transformers/paraphrase-xlm-r-multilingual-v1
NomicBert, e.g. nomic-ai/nomic-embed-text-v1.5
JinaBert, e.g. jinaai/jina-embeddings-v2-base-en
Let's get started!
1. Setup development environment
We are going to use the sagemaker Python SDK to deploy Snowflake Arctic to Amazon SageMaker. We need to make sure we have an AWS account configured and the sagemaker Python SDK installed.
If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. The AWS documentation on SageMaker roles describes these permissions.
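A minimal sketch of this setup, assuming the sagemaker and boto3 packages are installed (for example with pip install sagemaker) and AWS credentials are configured. The fallback role name sagemaker_execution_role is a placeholder, not something the text specifies; replace it with a role that exists in your account.

```python
def get_session_and_role(fallback_role_name="sagemaker_execution_role"):
    """Return (session, role_arn) for the deployment steps that follow."""
    # Imported lazily so the function can be defined without the SDKs installed.
    import boto3
    import sagemaker

    sess = sagemaker.Session()
    try:
        # Works when running inside a SageMaker notebook or Studio.
        role = sagemaker.get_execution_role()
    except ValueError:
        # Local environment: look the role up by name (placeholder name).
        iam = boto3.client("iam")
        role = iam.get_role(RoleName=fallback_role_name)["Role"]["Arn"]
    return sess, role


# Usage (requires AWS credentials):
# sess, role = get_session_and_role()
# print(f"sagemaker role arn: {role}")
# print(f"sagemaker session region: {sess.boto_region_name}")
```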
2. Retrieve the new Hugging Face Embedding Container
Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with an image_uri pointing to the image. To retrieve the new Hugging Face Embedding Container in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method allows us to retrieve the URI for the desired Hugging Face Embedding Container. Note that TEI ships two different images, one for CPU and one for GPU, so we create a helper function to retrieve the correct image URI based on the instance type.
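The helper could look roughly like the sketch below. It assumes that GPU instance types start with ml.g or ml.p, that the SDK exposes the TEI images under the backend keys huggingface-tei and huggingface-tei-cpu, and that version 1.2.3 is available in your region; tei_backend_key is a name introduced here for illustration, and the version should be checked against your installed sagemaker SDK.

```python
def tei_backend_key(instance_type: str) -> str:
    """Pick the TEI image family (GPU or CPU) for a SageMaker instance type."""
    # Assumption: GPU instance families start with "ml.g" or "ml.p".
    is_gpu = instance_type.startswith(("ml.g", "ml.p"))
    return "huggingface-tei" if is_gpu else "huggingface-tei-cpu"


def get_image_uri(instance_type: str, version: str = "1.2.3") -> str:
    """Resolve the container URI; needs the sagemaker SDK and a configured region."""
    # Imported lazily so the pure helper above works without the SDK.
    from sagemaker.huggingface import get_huggingface_llm_image_uri

    # The version default is an assumption, not a value stated in the text.
    return get_huggingface_llm_image_uri(tei_backend_key(instance_type), version=version)


# Usage (requires the sagemaker SDK and AWS region configuration):
# image_uri = get_image_uri("ml.c6i.2xlarge")
```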
3. Deploy Snowflake Arctic to Amazon SageMaker
To deploy Snowflake/snowflake-arctic-embed-m to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration, including the HF_MODEL_ID, instance_type, etc. We will use an ml.c6i.2xlarge instance type, which has 4 Intel Ice-Lake vCPUs, 8GB of memory and costs around $0.204 per hour.
After we have created the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.c6i.2xlarge instance type.
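As a sketch of these two steps, assuming a role ARN and an image URI obtained as described earlier: build_env and deploy_embedding_model are names introduced here for illustration, and a single instance (initial_instance_count=1) is an assumption.

```python
MODEL_ID = "Snowflake/snowflake-arctic-embed-m"
INSTANCE_TYPE = "ml.c6i.2xlarge"


def build_env(model_id: str = MODEL_ID) -> dict:
    """Container environment: HF_MODEL_ID tells the container which model to load."""
    return {"HF_MODEL_ID": model_id}


def deploy_embedding_model(role, image_uri, instance_type=INSTANCE_TYPE):
    """Create the HuggingFaceModel and deploy it; returns a predictor."""
    # Imported lazily so build_env can be used without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(role=role, image_uri=image_uri, env=build_env())
    # Assumption: one instance is enough for this walkthrough.
    return model.deploy(initial_instance_count=1, instance_type=instance_type)


# Usage (creates billable AWS resources):
# predictor = deploy_embedding_model(role, image_uri)
```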
SageMaker will now create our endpoint and deploy the model to it. This can take ~5 minutes.
4. Run and evaluate Inference performance
After our endpoint is deployed, we can run inference on it. We will use the predict method from the predictor to run inference on our endpoint.
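A minimal sketch of such a call, assuming the endpoint accepts a JSON payload of the form {"inputs": text} and returns a list with one embedding per input. The embed helper is a name introduced here; predictor stands for the object returned by deploy.

```python
def embed(predictor, text: str):
    """Send one text to the endpoint and return its embedding vector."""
    # Assumption: the container accepts {"inputs": ...} and answers with
    # a list of embeddings, one per input.
    response = predictor.predict({"inputs": text})
    return response[0]


# Usage against a deployed endpoint:
# vector = embed(predictor, "the mesmerizing performances of the leads keep the film grounded")
# print(f"length of embedding: {len(vector)}")
```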
Awesome, we can now generate embeddings with our model. Let's test its performance.
We will send 3,900 requests to our endpoint using threading with 10 concurrent threads, and measure the average latency and throughput of the endpoint. Each request carries an input of 256 tokens, for a total of ~1 million tokens (3,900 × 256 = 998,400). We chose 256 tokens as the input length to strike a balance between shorter and longer inputs.
Note: When running the load test, the requests are sent from Europe and the endpoint is deployed in us-east-1. This adds network overhead to the measurements.
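One way to sketch such a load test is with a thread pool from the Python standard library. This is an illustration under assumptions, not necessarily the harness used for the numbers reported here: run_load_test is a hypothetical helper, and send is any callable that issues one request, such as predictor.predict.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(send, payload, total_requests=3900, concurrency=10):
    """Call send(payload) total_requests times across `concurrency` threads."""

    def timed_call(_):
        # Measure the client-side latency of a single request.
        start = time.perf_counter()
        send(payload)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    elapsed = time.perf_counter() - wall_start

    return {
        "requests": len(latencies),
        "elapsed_s": elapsed,
        "avg_latency_s": sum(latencies) / len(latencies),
        "requests_per_s": len(latencies) / elapsed,
    }


# Usage against a deployed endpoint, with a ~256-token input text:
# stats = run_load_test(predictor.predict, {"inputs": text_256_tokens})
```

Latency measured this way includes the network round trip, so it will be higher than the endpoint latency reported by CloudWatch.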
Sending 3,900 requests, or embedding ~1 million tokens, took around 841 seconds. This means we can run ~5 requests per second (3,900 / 841 ≈ 4.6). Keep in mind that this includes the network latency from Europe to us-east-1. When we inspect the latency of the endpoint through CloudWatch, we can see that the latency for our Embedding model is 2s at 10 concurrent requests. This is very impressive for a small and old CPU instance, which costs ~$150 per month. You can deploy the model to a GPU instance to get faster inference times.
Note: We ran the same test on an ml.g5.xlarge with 1x NVIDIA A10G GPU. Embedding 1 million tokens took around 30 seconds. This means we can run ~130 requests per second. The latency for the endpoint is 4ms at 10 concurrent requests. The ml.g5.xlarge costs around $1.408 per hour on Amazon SageMaker.
GPU instances are much faster than CPU instances, but they are also more expensive per hour. If you want to bulk process embeddings, you can use a GPU instance. If you want to run a small endpoint with low costs, you can use a CPU instance. We plan to work on a dedicated benchmark for the Hugging Face Embedding DLC in the future.
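The trade-off can be made concrete with a little arithmetic on the figures reported above. The sketch below assumes 730 hours per month and an endpoint that is fully busy for the duration of the run; summarize is a name introduced here for illustration.

```python
REQUESTS = 3900
TOKENS_PER_REQUEST = 256
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def summarize(elapsed_s: float, price_per_hour: float) -> dict:
    """Derive throughput and cost figures from one load-test run."""
    tokens = REQUESTS * TOKENS_PER_REQUEST  # 998,400 tokens
    run_cost = elapsed_s / 3600 * price_per_hour
    return {
        "requests_per_s": REQUESTS / elapsed_s,
        "monthly_cost_usd": price_per_hour * HOURS_PER_MONTH,
        "usd_per_million_tokens": run_cost / tokens * 1_000_000,
    }


cpu = summarize(elapsed_s=841, price_per_hour=0.204)  # ml.c6i.2xlarge run
gpu = summarize(elapsed_s=30, price_per_hour=1.408)   # ml.g5.xlarge run

for name, stats in (("cpu", cpu), ("gpu", gpu)):
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Under these assumptions, the GPU instance is cheaper per embedded token when it is kept busy, while the CPU instance is cheaper to keep running around the clock, which is consistent with the guidance above.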
5. Delete model and endpoint
To clean up, we can delete the model and the endpoint.
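A short sketch of the cleanup, assuming predictor is the object returned by deploy. The cleanup wrapper is a name introduced here for illustration.

```python
def cleanup(predictor):
    """Delete the SageMaker model, then the endpoint, to stop incurring costs."""
    predictor.delete_model()
    predictor.delete_endpoint()


# Usage:
# cleanup(predictor)
```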