2022: No FileSystem for scheme: s3? How to read/write data from AWS S3 with a standalone Apache Spark 3.2.1 setup and use Delta as the format

Mageswaran D
Mar 24, 2022 · 3 min read

How many times have you faced this error? The "No FileSystem for scheme: s3" (or s3a) error while working with AWS S3 + Spark outside of EMR?

An error occurred while calling o25.partitions.
: java.io.IOException: No FileSystem for scheme: s3
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

Most of the time I am approached to do Spark-based proofs of concept from scratch, and I always have to take care of this setup process for the latest Apache Spark version before anything else, and it is not always straightforward.

In my experience, writing Spark code is comparatively easy; the tough job comes when we want to integrate it with other systems. One such pain point is using AWS S3.

Most of the time I have to start with a local machine, sometimes with AWS EC2, and if I am lucky I can spin up my own EMR cluster where all the dependencies are already taken care of and I am good to go with my proof of concept.

Be it a local machine or EC2, the setup remains more or less the same. Reading from and writing to AWS S3 with a standalone Spark setup requires some extra Java libraries plugged into Spark.

Java Jar dependencies are very finicky.

To read S3 from a standalone Spark setup, we need the hadoop-aws and aws-java-sdk-bundle jars.

With Spark 3.2.1, after a day's effort I figured out the working combination; here are the versions and the session setup:

from pyspark.sql import SparkSession


def get_spark():
    spark = SparkSession.builder.master("local[4]").appName('SparkDelta') \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.jars.packages",
                "io.delta:delta-core_2.12:1.1.0,"
                "org.apache.hadoop:hadoop-aws:3.2.2,"
                "com.amazonaws:aws-java-sdk-bundle:1.12.180") \
        .getOrCreate()

    # These Hadoop configurations are mandatory on the Spark session to use AWS S3
    spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
    spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
                                         "com.amazonaws.auth.InstanceProfileCredentialsProvider,"
                                         "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
    # spark.sparkContext.setLogLevel("DEBUG")

    return spark
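The provider chain above expects an instance profile or credentials the default chain can discover (environment variables, ~/.aws/credentials, and so on). For a purely local run without any of those, one alternative, sketched here with placeholder keys, is to switch to the SimpleAWSCredentialsProvider and pass the keys explicitly after creating the session:

# Hedged sketch: explicit keys via s3a properties (placeholders, replace with your own)
spark = get_spark()
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", "<YOUR_ACCESS_KEY>")
hadoop_conf.set("fs.s3a.secret.key", "<YOUR_SECRET_KEY>")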

Remember to give all dependencies as comma-separated values in one shot for spark.jars.packages.

When you use the PySpark shell, use this command to download the dependencies before the SparkSession is auto-initialized:

bin/pyspark --packages io.delta:delta-core_2.12:1.1.0,org.apache.hadoop:hadoop-aws:3.2.2,com.amazonaws:aws-java-sdk-bundle:1.12.180
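If you run the code as a script instead of in the shell, the same comma-separated package list can be passed to spark-submit (the script name here is just a placeholder):

bin/spark-submit \
  --packages io.delta:delta-core_2.12:1.1.0,org.apache.hadoop:hadoop-aws:3.2.2,com.amazonaws:aws-java-sdk-bundle:1.12.180 \
  your_delta_s3_job.py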

Then, back in the PySpark shell, set the same Hadoop configuration:

spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")spark._jsc.hadoopConfiguration().set("fs.s3a.impl",
"org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider," "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")


Now store data in delta.io format:

import pandas as pd


def test_spark():
    spark = get_spark()
    data = {"col1": [1, 2, 3, 4, 5], "col2": [6, 7, 8, 9, 0]}
    pdf = pd.DataFrame(data)
    print(pdf)
    sdf = spark.createDataFrame(pdf)
    sdf.show()
    sdf.write.format("delta").mode('overwrite').save("/tmp/delta/")
    return "Done"

output:

% ls /tmp/delta
_delta_log
part-00000-403015b4-2686-4198-8d4a-ac647a8f2e1c-c000.snappy.parquet
part-00001-1f3eb0ca-bacb-4315-9676-2e1faea57ad6-c000.snappy.parquet
part-00002-c9e55dc6-7e4d-4f45-83cc-494b2b5f4a42-c000.snappy.parquet
part-00003-d2930828-8aaa-4f8b-b62f-95b5c7404c7f-c000.snappy.parquet
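To confirm the write, the table can be read back with the Delta reader; the path matches the local one used above, and an s3a:// path works the same way once the S3 configuration is in place:

# Read the Delta table back to verify the write
df = spark.read.format("delta").load("/tmp/delta/")
df.show()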
