Working with S3
One of the most common operations when working with Amazon S3 (Amazon Simple Storage Service) is to pull data from S3 to local, as well as push data from local to S3. We can use the aws command line tool to achieve this:
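For instance, a couple of minimal `aws s3` commands (the bucket name and paths here are placeholders):

```bash
# download an object from s3 to a local file
aws s3 cp s3://my-bucket/path/to/file.csv ./file.csv

# upload a local file to s3
aws s3 cp ./file.csv s3://my-bucket/path/to/file.csv

# sync an entire local directory with an s3 prefix
aws s3 sync ./local_dir s3://my-bucket/path/to/dir
```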
We'll also demonstrate how to use boto3 to perform these kinds of operations in Python.
Suppose we have a Python object in memory; one option is to use the client's put_object method to save it as a JSON file.
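A minimal sketch of this approach, where the bucket name and key are placeholders:

```python
import json
import boto3

s3_client = boto3.client("s3")

# placeholder bucket name and object key
bucket = "my-bucket"
key = "data/example.json"

# serialize an in-memory python object to json and upload it with put_object
record = {"id": 1, "name": "example"}
s3_client.put_object(Bucket=bucket, Key=key, Body=json.dumps(record).encode("utf-8"))

# read it back with get_object, the response body is a streaming object
response = s3_client.get_object(Bucket=bucket, Key=key)
record_loaded = json.loads(response["Body"].read())
print(record_loaded)
```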
Upload and Download Parquet Files
All of this is well and good until we start working with large Python objects, where we'll encounter errors such as the entity being too large.
Quoting directly from S3's documentation:
Upload an object in a single operation by using the AWS SDKs, REST API, or AWS CLI – With a single PUT operation, you can upload a single object up to 5 GB in size.
Upload an object in parts by using the AWS SDKs, REST API, or AWS CLI – Using the multipart upload API operation, you can upload a single large object, up to 5 TB in size.
Fortunately, we can rely on the upload_file method: boto3 will automatically use multipart upload under the hood without us having to worry about the lower level functions related to multipart upload. The following code chunk shows how to save our Python object as a parquet file and upload it to S3, as well as how to download files from S3 to local and read them as a pandas dataframe.
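A minimal sketch of this workflow, with placeholder bucket, key and file names:

```python
import boto3
import pandas as pd

s3_client = boto3.client("s3")

# placeholder bucket, key and local paths used for illustration
bucket = "my-bucket"
key = "data/example.parquet"
local_upload_path = "example.parquet"
local_download_path = "example_downloaded.parquet"

# save a pandas dataframe as a local parquet file
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
df.to_parquet(local_upload_path, index=False)

# upload_file transparently switches to multipart upload for large files
s3_client.upload_file(local_upload_path, bucket, key)

# download the file back to local disk and read it as a pandas dataframe
s3_client.download_file(bucket, key, local_download_path)
df_downloaded = pd.read_parquet(local_download_path)
print(df_downloaded)
```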
We can also upload an entire local directory to S3 and remove the local copy once complete.
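One way to sketch this is to walk the local directory, upload each file, then clean up with shutil (the directory, bucket and prefix names below are placeholders):

```python
import os
import shutil
import boto3

s3_client = boto3.client("s3")


def upload_directory(directory: str, bucket: str, prefix: str):
    """Upload every file under a local directory to s3, preserving relative paths."""
    for root, _, files in os.walk(directory):
        for file_name in files:
            local_path = os.path.join(root, file_name)
            relative_path = os.path.relpath(local_path, directory)
            # s3 keys always use forward slashes
            s3_key = f"{prefix}/{relative_path}".replace(os.sep, "/")
            s3_client.upload_file(local_path, bucket, s3_key)


upload_directory("local_dir", "my-bucket", "data/local_dir")

# remove the local copy once the upload completes
shutil.rmtree("local_dir")
```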
Instead of downloading our parquet files to disk first, we can also read them directly into memory by wrapping the bytes object in a pyarrow BufferReader followed by read_table.
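A minimal sketch, again with a placeholder bucket and key:

```python
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3_client = boto3.client("s3")

# placeholder bucket and key
bucket = "my-bucket"
key = "data/example.parquet"

# fetch the raw bytes without touching local disk
response = s3_client.get_object(Bucket=bucket, Key=key)
raw_bytes = response["Body"].read()

# wrap the bytes in a BufferReader and read it as an arrow table / pandas dataframe
buffer_reader = pa.BufferReader(raw_bytes)
table = pq.read_table(buffer_reader)
df = table.to_pandas()
print(df)
```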
AWSWrangler
The upload and download functionality is technically quite generic and allows us to use it with any type of object. If uploading/reading parquet files is all we need, then we could instead leverage awswrangler.
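A minimal sketch using awswrangler's s3.to_parquet and s3.read_parquet (the S3 path is a placeholder):

```python
import awswrangler as wr
import pandas as pd

# placeholder s3 path
path = "s3://my-bucket/data/example.parquet"

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# write the dataframe directly to s3 as parquet
wr.s3.to_parquet(df=df, path=path)

# read it back into a pandas dataframe
df_loaded = wr.s3.read_parquet(path=path)
print(df_loaded)
```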
Additional Tips
When conducting heavy write operations, we might encounter "slow down: please reduce your request rate" related errors. Although Amazon S3 has announced performance improvements, they might not be enough for our use case. When confronted with these types of situations, the two most common suggestions are to: 1. add retries. 2. add prefixes/partitions.
When encountering the 503 slow down error, or other errors that we suspect are not caused by our client, we can leverage the built-in retry with exponential backoff and jitter capability [3] [4]. Meaning, after we receive a server or throttling related error, we use a progressively longer wait time, with some noise added to it, between each retry. For example, we can explicitly configure our boto3 client with the retry max attempts as well as the retry mode (the algorithm that's used for conducting the retry).
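A minimal sketch of such a configuration (the exact max attempts and retry mode values should be tuned for the use case):

```python
import boto3
from botocore.config import Config

# max_attempts caps the total number of attempts,
# mode selects the retry algorithm ("legacy", "standard" or "adaptive")
config = Config(
    retries={
        "max_attempts": 10,
        "mode": "standard",
    }
)
s3_client = boto3.client("s3", config=config)
```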
If the new limits and adjusted retries still prove to be insufficient, prefixes would need to be used, which means adding a string between the bucket name and the object name, for example:
bucket/1/file
bucket/2/file
The prefixes of the object file would be: /1/ and /2/. In this example, if we spread writes evenly across both prefixes, we can achieve double the throughput.
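As an illustration, a hypothetical helper that spreads object keys across a fixed number of prefixes by hashing the file name (the helper name and hashing choice are purely for demonstration):

```python
import hashlib


def prefixed_key(file_name: str, num_prefixes: int = 2) -> str:
    """Spread objects across a fixed number of prefixes by hashing the file name."""
    prefix = int(hashlib.md5(file_name.encode()).hexdigest(), 16) % num_prefixes
    return f"{prefix}/{file_name}"


print(prefixed_key("file_a"))  # e.g. "1/file_a"
print(prefixed_key("file_b"))  # e.g. "0/file_b"
```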
The official documentation also has further suggestions on optimizing S3 performance [2].
Reference
[1] Blog: How to use Boto3 to upload files to an S3 Bucket?
[2] AWS Documentation: Best practices design patterns: optimizing Amazon S3 performance
[3] Boto3 Documentation: Retries
[4] AWS Documentation: Error retries and exponential backoff in AWS
[5] Blog: How to use Boto3 to download multiple files from S3 in parallel?