Since its inception in 2009, Spark has grown into a widely used cluster computing framework, with use cases ranging from real-time ETL to machine learning.
Writing Spark jobs in Clojure lets us build on the language's data-driven, immutable-first character, leveraging its runtime similarities to Scala, Spark's native language, to produce code that's as easy to read as it is to test and deploy.
In this talk, we'll cover some practical aspects of running these jobs in production, using a simple Spark job as a starting point. Topics include:

- Debugging common Spark environment issues
- Setting up a REPL workflow to run Spark jobs locally, and connecting it to external storage for a development environment closer to production
- Leveraging Spark's local deployment mode to write automated tests for job pipelines
- Fine-tuning the job: adding monitoring, using profiling to understand where the bottlenecks are, and optimising Spark operations to improve performance
Gabriel received his bachelor's degree in Computer Engineering from UNICAMP in Brazil, and worked at Nubank in São Paulo and then Asana in San Francisco before making his way to NYC. Since being bitten by the Clojure bug a few years ago, he's been very interested in applying simple, data-driven software design to different areas of software engineering. When not staring at parentheses, he can often be found taking photographs outside or rock climbing.