Big Data with Amazon Cloud, Hadoop/Spark and Docker

at NYC Data Science Academy

Course Details
Price: $2,840.50 (10 seats left)
Start Date: Tue, Aug 25, 7:00pm - Oct 01, 9:30pm Eastern Time (12 sessions)

Important: Early-bird price until July 25; $2,990 thereafter.
Description
Class Level: Beginner
Age Requirements: 15 and older
Average Class Size: 15
Teacher: Jake Bialer

Flexible Reschedule Policy: This provider offers flexible, free rescheduling for any in-person workshop. Please see the cancellation policy for more details.

What you'll learn in this big data training:

This is a 6-week evening program providing a hands-on introduction to the Hadoop and Spark ecosystem of Big Data technologies. The course covers these key components of the ecosystem: HDFS, MapReduce with streaming, Hive, and Spark. All programming is done in Python, and the course begins with a review of the Python concepts needed for our examples. The format is interactive, so students need to bring laptops to class. We will do our work on AWS (Amazon Web Services); instructions on how to obtain an AWS account and connect to it will be provided ahead of time.

What is Hadoop?

Hadoop is a set of open-source programs that run on computer clusters and simplify the handling of large amounts of data. Originally, Hadoop consisted of a distributed file system tuned for large data sets and an implementation of the MapReduce parallelism paradigm, but it has since expanded in many ways. It now includes database systems, languages for parallelism, libraries for machine learning, its own job scheduler, and much more. Furthermore, MapReduce is no longer the only parallelism framework; Spark is an increasingly popular alternative. In summary, Hadoop is a very popular and rapidly growing set of cluster computing solutions, and it is becoming an essential tool for data scientists.
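
To make the MapReduce paradigm mentioned above more concrete, here is a minimal single-machine sketch of a word count written in plain Python. It is only an illustration of the map, shuffle, and reduce steps, not Hadoop code and not course material; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample sentences are assumptions made up for this example. On a real cluster, the framework would run the mappers and reducers in parallel across many machines and handle the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Illustrative input only.
counts = word_count(["Hadoop stores data", "Spark processes data"])
print(counts)
```

Because each mapper call looks at one line and each reducer call looks at one key, the work can in principle be split across machines, which is the property that makes the paradigm suit cluster computing.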

Prerequisites

To get the most out of the class, you need to be familiar with Linux file systems, the Linux command line interface (CLI), and basic Linux commands such as cd, ls, and cp. You also need basic programming skills in Python and should be comfortable with a functional programming style, for example, using the map() function to split a list of strings into a nested list. Object-oriented programming (OOP) in Python is not required.
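
As a quick self-check for the functional-style prerequisite, the map() example described above might look like the following minimal sketch. The sample strings and variable names are illustrative assumptions, not taken from the course.

```python
# Sample input: a list of strings (illustrative data only).
lines = ["big data with hadoop", "spark on aws"]

# map() applies str.split to each string; list() materializes the lazy map object
# into a nested list with one inner list of words per input string.
nested = list(map(str.split, lines))

print(nested)
```

If this reads naturally to you, you likely have the Python background the class assumes.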

Syllabus

Unit 1 – Introduction to Hadoop

1. Data Engineering Toolkits

  • Running Linux using Docker containers
  • Linux CLI commands and Bash scripts
  • Python basics

2. Hadoop and MapReduce

  • Big Data Overview
  • HDFS
  • YARN
  • MapReduce

Unit 2 – MapReduce

3. MapReduce using MRJob 1

  • Protocols for Input & Output
  • Filtering

4. MapReduce using MRJob 2

  • Top n
  • Inverted Index
  • Multi-step Jobs

Unit 3 – Apache Hive

5. Apache Hive 1

  • Databases for Big Data
  • HiveQL and Querying Data
  • Windowing and Analytics Functions
  • MapReduce Scripts

6. Apache Hive 2

  • Tables in Hive
  • Managed Tables and External Tables
  • Storage Formats
  • Partitions and Buckets

Unit 4 – Apache Pig

7. Apache Pig 1

  • Overview
  • Pig Latin: Data Types
  • Pig Latin: Relational Operators

8. Apache Pig 2

  • More Pig Latin: Relational Operators
  • More Pig Latin: Functions
  • Compiling Pig to MapReduce
  • The Parallel Clause
  • Join Optimizations

Unit 5 – Apache Spark and AWS

9. Apache Spark – Spark Core

  • Spark Overview
  • Running Spark using Databricks Notebooks
  • Working with PySpark: RDDs
  • Transformations and Actions

10. Apache Spark – Spark SQL

  • Spark DataFrame
  • SQL Operations using Spark SQL

11. Apache Spark – Spark ML

  • ML Pipeline using PySpark

12. Amazon Elastic MapReduce

  • Overview
  • Amazon Web Services: IAM, EC2, S3
  • Creating EMR Cluster
  • Submitting Jobs
  • Intro to AWS CLI

Project: Data Engineering Project


Remote Learning

This course is available for remote learning to anyone with an internet-connected device that has a microphone (this includes most computers and tablets). Classes take place with a live instructor at the dates and times listed below.

Upon registration, the instructor will send additional information about how to log on and participate in the class.

School Notes: We offer a certification licensed by the NYS Board of Education.

Refund Policy

Note: This provider has a temporary cancellation policy for COVID-19 related cancellations which is as follows: 

Students receive a full refund of their tuition fees if they cancel their enrollment any time before the first day of the course or if they don't start the course at all (no-shows). We don't charge any registration or materials fee, and there is no charge for transferring to a future session of the course. 

----

Original cancellation policy (non-COVID-19):

We offer a full refund if you are not happy with the first class and decide to drop it.

Start Dates (2)
Start Date Time Teacher # Sessions Price
7:00pm - 9:30pm Eastern Time Jake Bialer 12 $2,840.50
Thu, Aug 27 7:00pm - 9:30pm Eastern Time Jake Bialer
Tue, Sep 01 7:00pm - 9:30pm Eastern Time Jake Bialer
Thu, Sep 03 7:00pm - 9:30pm Eastern Time Jake Bialer
Tue, Sep 08 7:00pm - 9:30pm Eastern Time Jake Bialer
Thu, Sep 10 7:00pm - 9:30pm Eastern Time Jake Bialer
Tue, Sep 15 7:00pm - 9:30pm Eastern Time Jake Bialer
Thu, Sep 17 7:00pm - 9:30pm Eastern Time Jake Bialer
Tue, Sep 22 7:00pm - 9:30pm Eastern Time Jake Bialer
Thu, Sep 24 7:00pm - 9:30pm Eastern Time Jake Bialer
Tue, Sep 29 7:00pm - 9:30pm Eastern Time Jake Bialer
Thu, Oct 01 7:00pm - 9:30pm Eastern Time Jake Bialer
7:00pm - 9:30pm Eastern Time Jake Bialer 12 $2,840.50
Thu, Oct 29 7:00pm - 9:30pm Eastern Time Jake Bialer
Tue, Nov 03 7:00pm - 9:30pm Eastern Time Jake Bialer
Thu, Nov 05 7:00pm - 9:30pm Eastern Time Jake Bialer
Tue, Nov 10 7:00pm - 9:30pm Eastern Time Jake Bialer
Thu, Nov 12 7:00pm - 9:30pm Eastern Time Jake Bialer
Tue, Nov 17 7:00pm - 9:30pm Eastern Time Jake Bialer
Thu, Nov 19 7:00pm - 9:30pm Eastern Time Jake Bialer
Tue, Nov 24 7:00pm - 9:30pm Eastern Time Jake Bialer
Tue, Dec 01 7:00pm - 9:30pm Eastern Time Jake Bialer
Thu, Dec 03 7:00pm - 9:30pm Eastern Time Jake Bialer
Tue, Dec 08 7:00pm - 9:30pm Eastern Time Jake Bialer

School: NYC Data Science Academy

NYC Data Science Academy is a program designed to teach those who wish to learn.

Through hands-on projects and real-world applications, our students develop the skills they will need to pursue data science as both a hobby and profession. We also organize the NYC Open Data Meetup, which means that by...

CourseHorse Approved

This school has been carefully vetted by CourseHorse and is a verified NYC educator.
