In this blog post, we set up Apache Spark and Apache Airflow using Docker containers, and in the end we run and schedule Spark jobs using an Airflow instance that is itself deployed in a Docker container. This is very important because, with Docker images, we are able to solve problems we encounter in development, for example problems that relate to differing environments, dependency issues, and so on, thereby leading to fast development and fast deployment to production.

Some of the advantages of using Docker containers to run Spark jobs are:

- You can share your development with a co-worker with all the necessary dependencies included, and they can set up the full application within a few minutes.
- You can build and deploy your development on AWS, Azure, GCP, or any other cloud platform and expect that you will get the same result.
- You have the ability to package multiple applications on the same machine. For example, you can run Spark, Kafka, and Airflow on the same machine, and all your deployments will work as expected. Even if the applications require different versions of their dependencies to run independently, you can isolate them with Docker containers.
- You can allocate the amount of RAM and CPU that each running application needs on your machine, thereby managing resources.

Interestingly, different companies and individuals have published Spark images on Docker Hub; you can pull the images and try them out to see if they address your development use case. In our case, I was able to try the images provided by Bitnami and Data Mechanics; although these companies have the highest downloads on Docker Hub, there are still many different options. I have not been able to test and explore all of these images personally, but the Spark images provided by Data Mechanics and Bitnami both work for the ETL workflow provided here. For this blog post, we will be using the one provided by Bitnami because it has more features and is easy to use.

We are not going to look at the Airflow Docker images in this blog post, but there is comprehensive documentation on how to set up Airflow on Docker here. You can extend the Spark and Airflow Docker images using Dockerfiles, including the dependencies and the versions of Java, Spark, and Scala that your external systems need. You can package your deployment suite, create a Spark image from it, and run your applications on containers.

In our case, I was working to schedule Spark jobs with Airflow sitting on a Docker container. One of the problems we encountered in this process was that Airflow needs the necessary Spark dependencies in order to run and submit Spark jobs. Also, on the Airflow instance, Java is very much required before Airflow can run and schedule Spark jobs successfully.
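To make this concrete, here is a minimal, hypothetical PySpark script that stands in for whatever ETL job you want Airflow to submit. It is not the ETL workflow referenced above; the file name, column handling, and input/output paths are placeholders you would replace with your own.

```python
# etl_job.py -- a minimal, hypothetical PySpark job used here only as a stand-in
# for the ETL workflow you actually want Airflow to submit.
from pyspark.sql import SparkSession, functions as F


def main():
    # Create (or reuse) a SparkSession; the application name is arbitrary.
    spark = SparkSession.builder.appName("sample_etl").getOrCreate()

    # Extract: read a CSV file (placeholder path -- point this at your own data).
    df = spark.read.option("header", True).csv("/opt/bitnami/spark/data/input.csv")

    # Transform: drop rows with nulls and stamp each row with a processing time.
    cleaned = df.dropna().withColumn("processed_at", F.current_timestamp())

    # Load: write the result as Parquet (placeholder output path).
    cleaned.write.mode("overwrite").parquet("/opt/bitnami/spark/data/output/")

    spark.stop()


if __name__ == "__main__":
    main()
```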
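With the Spark image and the Airflow container in place (including Java and the Spark dependencies discussed above), scheduling comes down to an Airflow DAG. Below is a minimal sketch using the SparkSubmitOperator from the apache-airflow-providers-apache-spark package; the DAG id, schedule, and application path are assumptions made for illustration, and the spark_default connection is expected to point at your Spark master (for example spark://spark:7077 when Spark runs in the same Docker network).

```python
# spark_etl_dag.py -- a minimal, hypothetical Airflow DAG that submits the job above.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_etl_example",      # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",      # run the job once a day
    catchup=False,
) as dag:

    submit_etl = SparkSubmitOperator(
        task_id="submit_etl_job",
        # Placeholder path to the script inside the Airflow container.
        application="/opt/airflow/dags/scripts/etl_job.py",
        # Airflow connection pointing at the Spark master,
        # e.g. spark://spark:7077 for a Bitnami Spark master on the same network.
        conn_id="spark_default",
        verbose=True,
    )
```

The operator ultimately shells out to spark-submit, which is exactly why the Airflow container itself needs Java and the Spark binaries on its path, as noted above.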