April 08 – 10
9:00 AM – 5:00 PM (Eastern Time)
500 7th Avenue
New York, New York 10018
This course provides an overview of Apache Spark and hands-on projects covering extract-transform-load (ETL) operations, exploratory data analysis (EDA), building machine learning models, evaluating models, and performing cross-validation.
All hands-on labs are run on Databricks Community Edition, a free cloud-based Spark environment. This lets participants spend their time using open source Apache Spark to solve real problems, rather than dealing with the complexity of setting up Spark clusters. Labs can easily be ported to run on open source Apache Spark after class.
This course is intended for data scientists with experience in machine learning and Scala or Python programming who want to adapt traditional machine learning tasks to run at scale using Apache Spark.
All participants need a laptop with an updated version of Chrome or Firefox (Internet Explorer and Safari are not supported) and an internet connection that can support GoToTraining, the platform on which the class will be delivered. Prior to class, each registrant will receive GoToTraining log-in instructions.
For more information, and to confirm that your computer can run GoToTraining, go to: https://support.logmeininc.com/gotomeeting/get-ready
COURSE LEARNING OBJECTIVES
1. On Spark:
• Improve performance through judicious use of caching and applying best practices.
• Troubleshoot slow-running DataFrame queries using explain-plan and the Spark UI.
• Visualize how jobs are broken into stages and tasks and executed within Spark.
• Troubleshoot errors and program crashes using executor logs, driver stack traces, and local-mode runtimes.
• Troubleshoot Spark jobs using the administration UIs and logs inside Databricks.
• Find answers to common Spark and Databricks questions using the documentation and other resources.
2. On Extracting, Processing and Analyzing Data:
• Extract, transform, and load (ETL) data from multiple federated data sources (JSON, relational database, etc.) with DataFrames.
• Extract structured data from unstructured data sources by parsing using Datasets (where possible) or RDDs (if not possible with Datasets), with transformations and actions (map, flatMap, filter, reduce, reduceByKey).
• Extend the capabilities of DataFrames using user defined functions (UDFs and UDAFs) in Python and Scala.
• Resolve missing fields in DataFrame rows using filtering and imputation.
• Apply best practices for data analytics using Spark.
• Perform exploratory data analysis (EDA) using DataFrames and Datasets to:
– Compute descriptive statistics
– Identify data quality issues
– Better understand a dataset
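As a small taste of the "filtering and imputation" objective above, the sketch below shows the two strategies side by side. It is plain Python rather than Spark so that it runs anywhere; the sample rows, the "age" field, and the helper names are hypothetical and are not taken from the course labs, where the same idea would be applied to DataFrames.

```python
from statistics import mean

# Hypothetical sample rows; None marks a missing field.
rows = [
    {"id": 1, "age": 30.0},
    {"id": 2, "age": None},
    {"id": 3, "age": 50.0},
]

def drop_missing(rows, field):
    """Filtering: keep only the rows where the field is present."""
    return [r for r in rows if r[field] is not None]

def impute_mean(rows, field):
    """Imputation: replace missing values with the mean of the observed ones."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    # Build new dicts so the input rows are left unchanged.
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

filtered = drop_missing(rows, "age")   # loses the row with id 2
imputed = impute_mean(rows, "age")     # keeps every row
```

Filtering shrinks the dataset, while imputation keeps every row at the cost of introducing an estimated value; which one is appropriate depends on how much data is missing and why.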
3. On Visualizing Data:
• Integrate visualizations into a Spark application using Databricks and popular visualization libraries (d3, ggplot, matplotlib).
• Develop dashboards to provide “at-a-glance” summaries and reports.
4. On Machine Learning:
• Learn to apply a range of supervised models (regression and classification) and unsupervised models (clustering).
• Train analytical models with Spark ML estimators including: linear regression, decision trees, logistic regression, and k-means.
• Use Spark ML transformers to perform pre-processing on a dataset prior to training, including: standardization, normalization, one-hot encoding, and binarization.
• Build Spark ML pipelines that chain transformations, estimations, and evaluation of analytical models.
• Evaluate model accuracy by dividing data into training and test datasets and computing metrics using Spark ML evaluators.
• Tune training hyper-parameters by integrating cross-validation into Spark ML pipelines.
• Compute using Spark MLlib functionality not present in Spark ML by converting DataFrames to RDDs and applying RDD transformations and actions. (Optional Module)
• Troubleshoot and tune machine learning algorithms in Spark.
• Understand and build a general ML pipeline for Spark.
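To illustrate the idea behind tuning hyper-parameters with cross-validation, here is a minimal sketch in plain Python. It does not use Spark ML and is not the course's lab code; the toy data, the one-feature ridge model, the candidate grid, and all function names are assumptions chosen only to keep the example short and runnable.

```python
def k_folds(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def fit_ridge(xs, ys, lam):
    """One-feature ridge regression through the origin: w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the predictions w * x against y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, lam, k=3):
    """Average held-out error over k folds for one hyper-parameter value."""
    scores = []
    for train, test in k_folds(len(xs), k):
        w = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse(w, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)

# Toy data with an exact linear relationship y = 2x (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]

# Pick the regularization strength with the lowest average held-out error.
grid = [0.0, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: cross_validate(xs, ys, lam))
```

Each candidate value is scored only on rows the model did not see during fitting, which is the property that makes cross-validation a fairer basis for choosing hyper-parameters than training error alone.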