Apache Spark™ for Machine Learning and Data Science 3-day Onsite Public Class [San Antonio – TX Pass]


3-day pass for the Apache Spark™ for Machine Learning and Data Science onsite class, March 25 – 27, San Antonio, Texas.

SKU: SPARK301SATXMAR



March 25 – 27
9:00 AM – 5:00 PM (Eastern Time)

600 Navarro Street, Suite 350
San Antonio, Texas 78205



This course provides an overview of Apache Spark and hands-on projects that use extract-transform-load (ETL) operations, exploratory data analysis (EDA), machine learning model building, model evaluation, and cross-validation.

All hands-on labs are run on Databricks Community Edition, a free cloud-based Spark environment. This lets participants spend their time using open source Apache Spark to solve real problems rather than dealing with the complexities of setting up Spark clusters. Labs can easily be ported to run on open source Apache Spark after class.


This class is intended for data scientists with experience in machine learning and Scala or Python programming who want to adapt traditional machine learning tasks to run at scale using Apache Spark.

All participants need a laptop with an updated version of Chrome or Firefox (Internet Explorer and Safari are not supported) and an internet connection that can support use of GoToTraining. GoToTraining will be the platform on which the class is delivered. Prior to class, each registrant will receive GoToTraining log-in instructions.

For more information and to confirm that your computer can run GoToTraining, go to: https://support.logmeininc.com/gotomeeting/get-ready


1. On Spark:

• Improve performance through judicious use of caching and applying best practices.
• Troubleshoot slow-running DataFrame queries using explain plans and the Spark UI.
• Visualize how jobs are broken into stages and tasks and executed within Spark.
• Troubleshoot errors and program crashes using executor logs, driver stack traces, and local-mode runtimes.
• Troubleshoot Spark jobs using the administration UIs and logs inside Databricks.
• Find answers to common Spark and Databricks questions using the documentation and other resources.

2. On Extracting, Processing and Analyzing Data:

• Extract, transform, and load (ETL) data from multiple federated data sources (JSON, relational database, etc.) with DataFrames.
• Extract structured data from unstructured data sources by parsing using Datasets (where possible) or RDDs (if not possible with Datasets), with transformations and actions (map, flatMap, filter, reduce, reduceByKey).
• Extend the capabilities of DataFrames using user-defined functions (UDFs and UDAFs) in Python and Scala.
• Resolve missing fields in DataFrame rows using filtering and imputation.
• Apply best practices for data analytics using Spark.
• Perform exploratory data analysis (EDA) using DataFrames and Datasets to:

– Compute descriptive statistics
– Identify data quality issues
– Better understand a dataset

3. On Visualizing Data:

• Integrate visualizations into a Spark application using Databricks and popular visualization libraries (d3, ggplot, matplotlib).
• Develop dashboards to provide “at-a-glance” summaries and reports.

4. On Machine Learning:

• Learn to apply various supervised and unsupervised models, including regression, classification, and clustering.
• Train analytical models with Spark ML estimators including: linear regression, decision trees, logistic regression, and k-means.
• Use Spark ML transformers to perform pre-processing on a dataset prior to training, including: standardization, normalization, one-hot encoding, and binarization.
• Build Spark ML pipelines that combine transformations, estimations, and evaluation of analytical models.
• Evaluate model accuracy by dividing data into training and test datasets and computing metrics using Spark ML evaluators.
• Tune training hyper-parameters by integrating cross-validation into Spark ML pipelines.
• Compute using Spark MLlib functionality not present in Spark ML by converting DataFrames to RDDs and applying RDD transformations and actions. (Optional Module)
• Troubleshoot and tune machine learning algorithms in Spark.
• Understand and build a general ML pipeline for Spark.

