The S3 repository plugin lets Elasticsearch store snapshots in Amazon S3, but the number of moving parts - the AWS SDK, IAM, bucket policies, network configuration - means there are many ways for it to break. Most failures surface as opaque repository_exception or snapshot_exception messages that force you to piece together the root cause from logs across multiple systems.
IAM Permission Requirements
Elasticsearch needs both bucket-level and object-level S3 permissions. A minimal IAM policy looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions"
      ],
      "Resource": "arn:aws:s3:::my-es-snapshots"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::my-es-snapshots/*"
    }
  ]
}
Missing s3:ListBucket on the bucket resource is the single most common IAM mistake. Without it, S3 returns a 403 instead of a 404 for missing keys, which Elasticsearch misinterprets as repository corruption. Missing s3:AbortMultipartUpload causes orphaned multipart uploads that silently consume storage and can interfere with subsequent snapshots.
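Orphaned multipart uploads can be found and cleaned up with the AWS CLI. A sketch, reusing the example bucket name from the policy above; the key and upload ID shown are placeholders you would take from the listing output:

```shell
# List in-progress multipart uploads; anything old here from a
# finished or failed snapshot is likely orphaned
aws s3api list-multipart-uploads --bucket my-es-snapshots

# Abort one orphaned upload, using the Key and UploadId from the listing
# (both values below are placeholders)
aws s3api abort-multipart-upload \
  --bucket my-es-snapshots \
  --key "some/snapshot/blob" \
  --upload-id "EXAMPLE_UPLOAD_ID"
```

A more permanent fix is a bucket lifecycle rule with an AbortIncompleteMultipartUpload action, which makes S3 expire incomplete uploads automatically after a set number of days.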
If your Elasticsearch nodes run on EC2 with an instance profile, verify that IMDSv2 is supported by your version of the repository-s3 plugin. Older plugin versions bundle an AWS SDK that cannot negotiate the IMDSv2 token exchange, resulting in Access Denied errors despite correct IAM roles.
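You can verify the IMDSv2 token exchange directly from a node before suspecting the plugin. This is the standard two-step IMDSv2 flow; it must be run on the EC2 instance itself:

```shell
# Step 1: request an IMDSv2 session token (this PUT fails if IMDSv2
# is not reachable, e.g. blocked by a hop limit of 1 inside a container)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# Step 2: use the token to list the instance profile role; an empty or
# error response here means the plugin will not get credentials either
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
```

If step 1 works from the shell but the plugin still fails, the bundled AWS SDK is the likely culprit and a plugin upgrade is the fix.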
Bucket Policy Conflicts and Encryption
S3 bucket policies can override IAM permissions: an explicit Deny statement restricting access to a specific VPC endpoint or source IP will block Elasticsearch even when the IAM role grants full access. Check for deny rules with aws s3api get-bucket-policy --bucket my-es-snapshots, and confirm which identity the node actually authenticates as by running aws sts get-caller-identity from the node itself.
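The checks above can be scripted from a node. A sketch; the role ARN and account ID are placeholders for your environment:

```shell
# Dump the bucket policy and scan it for Deny statements with
# aws:SourceVpce or aws:SourceIp conditions
aws s3api get-bucket-policy --bucket my-es-snapshots \
  --query Policy --output text

# Confirm which principal this node actually authenticates as
aws sts get-caller-identity

# Simulate the calls the plugin makes, against that principal
# (placeholder role ARN - substitute the node's actual role)
aws iam simulate-principal-policy \
  --policy-source-arn "arn:aws:iam::123456789012:role/es-snapshot-role" \
  --action-names s3:ListBucket s3:PutObject s3:DeleteObject \
  --resource-arns "arn:aws:s3:::my-es-snapshots" "arn:aws:s3:::my-es-snapshots/*"
```

Note that simulate-principal-policy evaluates the identity-based policies attached to the role; a Deny in the bucket policy itself still has to be read out of the get-bucket-policy output.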
For server-side encryption, the repository-s3 plugin supports SSE-S3 (AES256) out of the box. Set it during repository registration:
PUT _snapshot/my_s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-snapshots",
    "server_side_encryption": true
  }
}
SSE-KMS requires the IAM role to also have kms:GenerateDataKey and kms:Decrypt on the KMS key ARN. If the bucket has a default encryption policy using a customer-managed KMS key, every PutObject from Elasticsearch needs those KMS permissions - even if you did not request KMS encryption in the repository settings. Missing KMS permissions produce AccessDenied errors with no mention of KMS in Elasticsearch logs. Check S3 server access logs or CloudTrail for the actual denial reason.
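The KMS permissions can be granted with one additional statement in the IAM policy shown earlier. A sketch; the key ARN below is a placeholder for your customer-managed key:

```json
{
  "Effect": "Allow",
  "Action": [
    "kms:GenerateDataKey",
    "kms:Decrypt"
  ],
  "Resource": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"
}
```

Scoping the statement to the specific key ARN, rather than "*", keeps the role from using unrelated keys in the account.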
AWS Throttling During Large Snapshots
S3 supports 3,500 PUT and 5,500 GET requests per second per prefix. During a large snapshot - especially the first one, which cannot use incremental diffs - Elasticsearch can exceed these limits. The AWS SDK surfaces this as a 503 SlowDown exception, and you will see log entries like:
AmazonS3Exception: Please reduce your request rate (Service: Amazon S3; Status Code: 503; Error Code: SlowDown)
The repository-s3 plugin retries throttled requests with exponential backoff, but the default retry count may not be enough for sustained bursts. Increase max_retries at registration time:
PUT _snapshot/my_s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-snapshots",
    "max_retries": 10
  }
}
If throttling persists, reduce snapshot concurrency by lowering max_snapshot_bytes_per_sec or schedule snapshots during off-peak hours. Spreading objects across multiple prefixes by using different base_path values for different repositories can also help, since S3 rate limits are per-prefix.
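Both mitigations are repository settings and can be combined in one registration. A sketch; the 20mb rate and base_path value are illustrative choices, not recommendations:

```
PUT _snapshot/my_s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-snapshots",
    "max_retries": 10,
    "max_snapshot_bytes_per_sec": "20mb",
    "base_path": "cluster-a"
  }
}
```

Lowering max_snapshot_bytes_per_sec trades snapshot duration for a smoother request rate, which is usually the right trade during business hours.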
Path Prefix Conflicts Between Clusters
Multiple clusters can share a single S3 bucket, but each cluster must use a unique base_path. If two clusters write to the same prefix, they will overwrite each other's index-N and index.latest metadata files, corrupting the repository for both:
PUT _snapshot/prod_repo
{
  "type": "s3",
  "settings": {
    "bucket": "shared-es-snapshots",
    "base_path": "prod-cluster"
  }
}
The base_path value must not start or end with /. A trailing slash creates a different S3 key prefix than the same string without the slash, so one cluster writing to prod/ and another reading from prod will not see each other's snapshots.
Only one cluster should have write access to a given path. If you need a second cluster to restore from the same snapshots, register the repository as readonly: true on the restoring cluster. Two clusters actively writing to the same repository path will eventually corrupt it regardless of how carefully the base paths are managed.
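On the restoring cluster, the registration mirrors the writer's settings with readonly added; the repository name on the second cluster is arbitrary:

```
PUT _snapshot/prod_repo_readonly
{
  "type": "s3",
  "settings": {
    "bucket": "shared-es-snapshots",
    "base_path": "prod-cluster",
    "readonly": true
  }
}
```

A readonly repository never touches the index-N or index.latest metadata files, so it cannot corrupt the writer's state.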
S3 Endpoint Configuration for VPC and Private Access
Nodes in a private subnet reach S3 through a NAT gateway by default, which introduces a bandwidth bottleneck and adds data transfer costs. An S3 VPC gateway endpoint eliminates both. Create the endpoint in the VPC console and the plugin uses it automatically - no Elasticsearch configuration change is needed.
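Creating the gateway endpoint can also be done from the CLI. A sketch; the VPC ID, region, and route table ID are placeholders for your environment:

```shell
# Create an S3 gateway endpoint and attach it to the route table(s)
# used by the Elasticsearch subnets (placeholder IDs below)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0def5678
```

Because a gateway endpoint works through route table entries rather than DNS, existing S3 traffic from those subnets starts flowing through it without any client-side change.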
If your environment uses an S3 interface endpoint (PrivateLink) or a custom endpoint for S3-compatible storage, set it explicitly in the Elasticsearch keystore on every node:
bin/elasticsearch-keystore add s3.client.default.endpoint
# Enter: bucket.vpce-0abc1234.s3.us-east-1.vpce.amazonaws.com
Restart the node or call POST _nodes/reload_secure_settings to apply. Misconfigured endpoints produce UnknownHostException or connection timeout errors rather than permission errors, making them easy to distinguish from IAM issues. Verify connectivity from the node with curl -I https://<endpoint>/<bucket> before blaming the plugin.