I recently upgraded EMR to release label 7.0.0. Part of my workload updates large Iceberg tables using PySpark. I moved all my S3 paths to the s3 scheme instead of s3a, as suggested here:
Previously, Amazon EMR used the s3n and s3a file systems. While both still work, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.
While running the Iceberg job I got this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4217 in stage 4.0 failed 4 times, most recent failure: Lost task 4217.3 in stage 4.0 (TID 5632) (ip-10-5-7-244.us-east-2.compute.internal executor 48): software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
This error is caused by connection pool exhaustion: the number of concurrent requests to S3 exceeds the capacity of the connection pool, so tasks time out while waiting to lease a connection. If multiple threads access S3 concurrently and connections are not released promptly, or demand simply exceeds the pool size, contention for connections leads to these timeouts.
In this case it's caused by the large number of concurrent Iceberg tasks accessing S3. The number of available connections is controlled by the Iceberg catalog property http-client.apache.max-connections.
The default value in EMR 7.0.0 is 50. You can increase it by adding this property to your Spark job:
--conf spark.sql.catalog.{ice_catalog}.http-client.apache.max-connections=3000
where {ice_catalog} is the name of your catalog.
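As a concrete illustration, here is a minimal sketch of what a spark-submit invocation might look like. The catalog name glue_catalog and the script name job.py are hypothetical placeholders, not values from this post; substitute your own catalog name and entry point:

```shell
# Hypothetical invocation: raise the Iceberg catalog's S3 HTTP connection
# pool from the EMR 7.0.0 default of 50 to 3000.
# "glue_catalog" and "job.py" are placeholder names.
spark-submit \
  --conf spark.sql.catalog.glue_catalog.http-client.apache.max-connections=3000 \
  job.py
```

The same property can also be set programmatically via SparkSession.builder.config() before the session is created, if you prefer not to change the submit command.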