What if I wanted to submit remote PySpark jobs to AWS EMR without worrying about library dependency versions? Two things have to come together:

  1. Remote job submission
  2. Python library dependency management

YARN Docker Container

With YARN's Docker container runtime, the Spark driver and executors run inside a Docker image, so the Python dependencies are baked into the image instead of being installed on every cluster node. The image below starts from Amazon Corretto (Java 8) and adds Python 3, NumPy and pandas:

# Java 8 base image on Amazon Linux 2 (Spark on YARN needs a JDK inside the container)
FROM amazoncorretto:8
RUN yum -y update
RUN yum -y install yum-utils
RUN yum -y groupinstall development
RUN yum list python3*
# Note: the yum package is python3-devel (python3-dev is the Debian/Ubuntu name)
RUN yum -y install python3 python3-devel python3-pip python3-virtualenv
RUN python -V
RUN python3 -V
# Point PySpark at Python 3 inside the container
ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3
# Bake the Python dependencies into the image
RUN pip3 install --upgrade pip
RUN pip3 install numpy pandas
# Sanity check: the import must succeed at build time
RUN python3 -c "import numpy as np"
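The image then has to be reachable from the cluster; a minimal sketch, assuming a Docker Hub repository (the repository name is a placeholder):

# Build the image from the Dockerfile above and push it to Docker Hub
docker build -t your-dockerhub-user/emr-pyspark:latest .
docker login
docker push your-dockerhub-user/emr-pyspark:latest

# The spark-submit settings below refer to the image as $DOCKER_IMAGE_NAME
export DOCKER_IMAGE_NAME=your-dockerhub-user/emr-pyspark:latest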
If the image sits in a private Docker Hub repository, YARN needs the registry credentials to pull it. After docker login they are stored in ~/.docker/config.json:

cat ~/.docker/config.json
{
  "auths": {
    "https://index.docker.io/v1/": {
      "auth": "123456789abcdefghijklm=="
    }
  }
}
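For YARN to read those credentials when pulling the image, the file is copied to HDFS from the master node; a minimal sketch, using the same placeholder path as the spark-submit settings below:

hadoop fs -mkdir -p hdfs:///path/to/hadoop
hadoop fs -put ~/.docker/config.json hdfs:///path/to/hadoop/config.json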
These settings are passed to spark-submit so that both the application master (which hosts the driver in cluster mode) and the executors run inside the Docker image. Put together, a submission might look like this (main.py stands in for your application file):

spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///path/to/hadoop/config.json" \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=python3 \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3 \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///path/to/hadoop/config.json" \
--conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=python3 \
--conf spark.executorEnv.PYSPARK_PYTHON=python3 \
main.py

Livy

Apache Livy exposes job submission as a REST API (port 8998 by default) on the EMR master node, so jobs can be submitted remotely over plain HTTP. An SSH tunnel forwards that port to the local machine:

export USER=hadoop
export EMR_SPARK_MASTER=ip-99-33-222-111.us-west-2.compute.internal
sudo ssh -i ~/emrkey.pem -N -L 8998:$EMR_SPARK_MASTER:8998 hadoop@$EMR_SPARK_MASTER
export LIVY_URL=localhost:8998
A quick check that Livy answers through the tunnel:

curl $LIVY_URL/sessions/ | python -m json.tool
A batch job is submitted with a POST to /batches, pointing Livy at the entry script on S3:

curl \
-X POST \
--data '{
  "name": "UDCTestLivy1",
  "file": "s3://users-dev-bucket/mageswaran/livy/pi.py",
  "driverMemory": "4G",
  "driverCores": 4,
  "executorMemory": "32G",
  "executorCores": 5,
  "numExecutors": 2,
  "conf": {"spark.scheduler.mode": "FAIR"}
}' \
-H "Content-Type: application/json" \
$LIVY_URL/batches
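The POST returns a batch id; the batch state and driver log can then be polled from the same API (batch id 0 below is only an example):

# All batches known to Livy
curl $LIVY_URL/batches | python -m json.tool
# State of a single batch (use the id from the POST response)
curl $LIVY_URL/batches/0/state | python -m json.tool
# Driver log lines for that batch
curl $LIVY_URL/batches/0/log | python -m json.tool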
The same call can also be scripted; here a small helper, livy_Spark.py (not listed in this post), takes the Livy URL, the entry point, supporting pyFiles and the job arguments and submits them as a batch:

python3 livy_Spark.py \
--url http://$LIVY_URL \
--name SparkDockerLivyTest \
--file s3://bucket/mageswaran/livy/main.py \
--pyFiles s3://bucket/mageswaran/livy/project.zip \
--args '--config_file s3://bucket/mageswarand/livy/job1_config.json'
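Whichever route is used, the job ends up as an ordinary YARN application on the cluster, so the usual YARN tooling works from the master node (the application id below is a placeholder):

# List running applications, then pull the aggregated logs once the job finishes
yarn application -list
yarn logs -applicationId application_1234567890123_0001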
