How to use the DataProxy S3 compatible interface
Introduction
The Aruna DataProxy implements a subset of the AWS S3 API specification. By providing an S3 compatible interface, the DataProxy can interact with a wide variety of modern data analysis tools that natively support S3 as a storage solution.
Here we give examples for the currently supported and most common S3 functionality that can be used directly with an Aruna DataProxy. For the best accessibility, all examples are written for the freely available tool s5cmd, an extremely fast command line client for uploading, retrieving and managing data in Amazon S3 and other cloud storage providers that support the S3 protocol.
DataProxy access
Before you can communicate directly with or upload data to a DataProxy, you first have to "register" your user by requesting credentials for the specific endpoint.
This is done for data protection reasons, so that each DataProxy only receives and stores the data that is relevant to it.
s5cmd Configuration
s5cmd makes use of the standard S3 configuration and credentials files. The examples in this section assume only a minimal configuration containing the Access Key ID, Secret Access Key and DataProxy host. The remaining parameters are optional and can be left at their defaults.
The configuration files are automatically picked up by s5cmd when they exist in the following default locations:
<user-home>/.aws/config:

[default]
endpoint_url = <data-proxy-host-url>
region = <bucket-region>
output = json
...

<user-home>/.aws/credentials:

[default]
aws_access_key_id = <access-key-id>
aws_secret_access_key = <access-secret>
...
An alternative to the configuration files is to use environment variables:
- Access Key Id:
export AWS_ACCESS_KEY_ID='<access-key-id>'
- Secret Access Key:
export AWS_SECRET_ACCESS_KEY='<access-secret>'
- Endpoint URL:
export S3_ENDPOINT_URL='<data-proxy-host-url>'
- Bucket Region:
export AWS_REGION='<bucket-region>'
... or, for some of the parameters, the global CLI options of s5cmd:
- Endpoint URL:
--endpoint-url <aruna-dataproxy-endpoint>
- Output format:
--json
- Disable SSL verification:
--no-verify-ssl
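For example, the endpoint can also be passed directly on the command line and combined with any of the commands shown in the following sections. A minimal sketch (the endpoint is a placeholder):

s5cmd --endpoint-url <aruna-dataproxy-endpoint> ls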
Object paths in Aruna
In order for Aruna to provide an S3 compatible interface, it is necessary that Objects can be accessed via a unique path instead of their id.
Currently, this path complies with the standardized specifications of AWS S3 and is represented in the format
<project-name>/<collection-name>/<dataset-name>/<object-filename>
which resembles the S3 path-style s3://bucket/key
where:
- bucket:
<project-name>
- key:
<collection-name>/<dataset-name>/<object-filename>
This also applies when the Object has a relation to multiple parents, which means that the Object is then available through multiple paths.
Example for an Object path inside a Project
- Project name:
dummy-project
- Object filename:
example.file
This would correspond to the path: s3://dummy-project/example.file
Example for an Object path inside an existing Collection
- Project name:
dummy-project
- Collection name:
sample-collection
- Object filename:
example.file
This would correspond to the path: s3://dummy-project/sample-collection/example.file
Example for an Object path inside an existing Collection and Dataset
- Project name:
dummy-project
- Collection name:
sample-collection
- Dataset name:
my-dataset
- Object filename:
example.file
This would correspond to the path: s3://dummy-project/sample-collection/my-dataset/example.file
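Such a path is used like any other S3 URI. As a sketch with s5cmd, using the names from the example above, the Object could be addressed as:

s5cmd ls s3://dummy-project/sample-collection/my-dataset/example.file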
Path limitations
The fully qualified paths of objects are unique, which implies some conditions that must be met:
- Project names are unique globally
- Project names are restricted to the following characters: [a-z0-9-] (i.e. alphanumeric lowercase and hyphens)
- Collection and Dataset names are restricted to the safe characters specified in the S3 object key naming guidelines
- Names are unique within each hierarchy (e.g. you cannot create two Objects with the same name inside the same Collection)
Create bucket
This operation is analogous to the Create Project API request with a specific endpoint.
This means that the Project is also registered in the central ArunaServer catalog with the specific DataProxy as its main data location.
Native S3 Create Bucket Specification
Required permissions
To create a new Project you only have to be a registered Aruna user.
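With s5cmd, a new bucket/Project can be created with the mb command. A minimal sketch, assuming the Project name dummy-project from the examples above:

s5cmd mb s3://dummy-project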
Put Object
Native S3 Put Request Specification
This operation is analogous to the combination of Create Object, upload data and Finish Object, which also means that on success the Object is directly available.
If an Object already exists under the provided path, that Object is updated and a new revision is created which receives the uploaded data.
Optional parent creation
The parent resources in the path, except for Projects, are created automatically if they do not exist at the time of execution. As the middle part of a path in an Aruna hierarchy can be ambiguous, the Dataset is always favoured during creation. This means you cannot create a Collection as the direct parent of an Object with an S3 request.
- Project name:
dummy-project
- Object filename:
example.file
Putting the Object to s3://dummy-project/new-collection/new-dataset/example.file would also create the Collection new-collection and the Dataset new-dataset. The created Dataset will be the direct parent of the Object.
- Project name:
dummy-project
- Object filename:
example.file
Putting the Object to s3://dummy-project/new-dataset/example.file would also create the Dataset new-dataset as the direct parent of the Object.
Required permissions
This request requires at least APPEND permission on the parent resource in which the Collection/Dataset/Object is to be created.
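With s5cmd, a Put Object is a plain cp from a local file to the target path; missing Datasets in the path are created as described above. A sketch, assuming the file and resource names from the examples above:

# Upload the file directly below the Project
s5cmd cp example.file s3://dummy-project/example.file

# Upload the file into the existing Collection and a (possibly newly created) Dataset
s5cmd cp example.file s3://dummy-project/sample-collection/my-dataset/example.file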
Multipart Upload
s5cmd also supports Amazon S3 multipart uploads. If the multipart_chunksize = <chunk-size> parameter is not set in the configuration file, multipart uploads are used automatically for files larger than 5 MB.
Natively, a multipart upload has to be created first, each part has to be uploaded individually, and the whole upload has to be completed after all parts have been transferred.
- Native S3 Create Multipart Upload Specification
- Native S3 Upload Part Specification
- Native S3 Complete Multipart Upload Specification
Required permissions
This request requires at least APPEND permission on the parent resource in which the Collection/Dataset/Object is to be created.
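With s5cmd, the multipart mechanics are handled transparently by cp; the part size (in MB) and the number of concurrently uploaded parts can optionally be tuned. A sketch, assuming a large local file named large-example.file (the -p and -c flags may differ between s5cmd versions):

s5cmd cp -p 50 -c 10 large-example.file s3://dummy-project/large-example.file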
Head Object
The HEAD action retrieves metadata from an object without returning the object itself. This action is useful if you're only interested in an object's metadata.
Native S3 Head Object Specification
Required permissions
This request requires at least READ permissions on the Object or one of its parent resources.
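Not every s5cmd release ships a dedicated head command, but listing the exact key prints at least the Object's size and modification date. A rough sketch, assuming the path from the examples above:

s5cmd ls s3://dummy-project/example.file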
Get Object
Retrieves the data associated with the specific Object.
Native S3 Get Object Specification
Required permissions
This request requires at least READ permissions on the Object or one of its parent resources.
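With s5cmd, the Object's data can be downloaded with cp or streamed to stdout with cat. A sketch, assuming the path from the examples above:

# Download the Object into the current directory
s5cmd cp s3://dummy-project/example.file .

# Stream the Object's content to stdout
s5cmd cat s3://dummy-project/example.file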
List buckets/Projects
Lists all the buckets (i.e. Projects) the user has specific permissions for.
Native S3 ListBuckets specification
Required permissions
This request does not require any specific permission, but only buckets/Projects the user has permissions on will be returned.
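With s5cmd, running ls without any arguments lists all buckets/Projects accessible with the configured credentials. A minimal sketch:

s5cmd ls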
List bucket/Project objects
Returns the Objects of a specific bucket, i.e. Project (up to 1,000 per request).
Native S3 ListObjectsV2 specification
Required permissions
This request requires at least READ permissions on the Project.
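With s5cmd, the Objects of a bucket/Project can be listed with a wildcard; a longer prefix narrows the listing to a specific Collection or Dataset. A sketch, assuming the names from the examples above (the quotes prevent shell globbing):

# List all Objects of the Project
s5cmd ls "s3://dummy-project/*"

# List only the Objects below the Collection sample-collection
s5cmd ls "s3://dummy-project/sample-collection/*"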