End-to-End Big Data Project
Project Details
Problem Statement:
Imagine you are part of a data team that wants to bring in daily data on COVID-19
testing in New York State for analysis. Your team has to design a daily
workflow that runs at 9:00 AM and ingests the data into the system.
API:
https://health.data.ny.gov/api/views/xdss-u53e/rows.json?accessType=DOWNLOAD
Following the ETL process, extract the data for each county in New York State from
the above API and load it into individual tables in the database. Each county table
should contain the following columns:
❖ Test Date
❖ New Positives
❖ Cumulative Number of Positives
❖ Total Number of Tests Performed
❖ Cumulative Number of Tests Performed
❖ Load date
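The columns above map onto the API's rows.json payload. As a hedged sketch, the following shows how per-county rows might be grouped out of such a payload; the `meta.view.columns` / `data` layout follows the usual Socrata rows.json convention, and the specific field names and sample values here are assumptions for illustration, not taken from the live API:

```python
from collections import defaultdict
from datetime import date

# Assumed payload shape: column metadata under meta.view.columns,
# row values as parallel arrays under "data". Sample data is made up.
SAMPLE_PAYLOAD = {
    "meta": {"view": {"columns": [
        {"fieldName": "test_date"},
        {"fieldName": "county"},
        {"fieldName": "new_positives"},
        {"fieldName": "cumulative_number_of_positives"},
        {"fieldName": "total_number_of_tests"},
        {"fieldName": "cumulative_number_of_tests"},
    ]}},
    "data": [
        ["2021-03-01", "Albany", "50", "20000", "1500", "500000"],
        ["2021-03-01", "Bronx", "120", "90000", "4000", "1200000"],
    ],
}

def rows_by_county(payload):
    """Group row dicts by county and stamp each with a load date."""
    names = [c["fieldName"] for c in payload["meta"]["view"]["columns"]]
    grouped = defaultdict(list)
    for values in payload["data"]:
        row = dict(zip(names, values))
        row["load_date"] = date.today().isoformat()  # the Load date column
        grouped[row["county"]].append(row)
    return dict(grouped)
```

Grouping by county up front matches the one-table-per-county requirement: each list can then be loaded into its own table.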
Implementation options:
1. Python scripts to run a daily cron job
a. Utilize an SQLite in-memory database for data storage
b. There should be one main standalone script, run as a daily cron job, that
orchestrates all remaining ETL steps
c. Use a multi-threaded approach to fetch and load data for multiple counties
concurrently
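A minimal sketch of option 1's concurrency model, under two stated assumptions: `fetch_county` is a hypothetical stand-in for the real API call, and loading happens from the main thread only, since an SQLite connection is not safely shared across threads by default:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def fetch_county(county):
    # Hypothetical stand-in: the real job would slice this county's
    # rows out of the rows.json response. Values here are made up.
    return [("2021-03-01", 10, 100, 200, 2000, "2021-03-02")]

def run_daily_job(counties):
    # One in-memory connection, touched only by the main thread.
    con = sqlite3.connect(":memory:")
    for county in counties:
        con.execute(
            f'CREATE TABLE IF NOT EXISTS "{county}" ('
            "test_date TEXT, new_positives INTEGER, "
            "cumulative_positives INTEGER, total_tests INTEGER, "
            "cumulative_tests INTEGER, load_date TEXT)"
        )
    # Fetch in parallel (I/O-bound), load serially as results arrive.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for county, rows in zip(counties, pool.map(fetch_county, counties)):
            con.executemany(
                f'INSERT INTO "{county}" VALUES (?, ?, ?, ?, ?, ?)', rows
            )
    con.commit()
    return con
```

Splitting fetch (threaded) from load (single-threaded) keeps the concurrency benefit where it matters, on network I/O, while avoiding SQLite's cross-thread connection restrictions.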
2. Airflow to create a daily scheduled DAG
a. Utilize Docker to run Airflow and a Postgres database locally
b. There should be one DAG containing all tasks needed to perform the
end-to-end ETL process
c. Create and execute concurrent tasks in Airflow dynamically, one per county,
based on the number of counties available in the response
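One way to satisfy 2c is Airflow's dynamic task mapping. The sketch below assumes Airflow 2.4+ (for the `schedule` parameter; older versions use `schedule_interval`) and needs a running Airflow environment to execute; the county list and task bodies are placeholders:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="0 9 * * *",            # daily at 9:00 AM, per the spec
     start_date=datetime(2021, 1, 1),
     catchup=False)
def covid_etl():
    @task
    def list_counties():
        # Placeholder: the real task would pull county names
        # out of the rows.json response.
        return ["Albany", "Bronx"]

    @task
    def load_county(county: str):
        # Placeholder: fetch this county's rows and write them
        # into its own Postgres table, stamped with a load date.
        ...

    # Dynamic task mapping (Airflow 2.3+): one concurrent task
    # instance per county returned at runtime.
    load_county.expand(county=list_counties())

covid_etl()
```

Because `expand` maps over whatever `list_counties` returns at run time, the DAG scales to however many counties appear in the response without code changes.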
Implement unit and/or integration tests for your application.
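As an example of the kind of unit test expected, the following pytest-style test exercises a hypothetical transform helper (`add_load_date` is illustrative, not part of the spec):

```python
from datetime import date

def add_load_date(row: dict, load_date: date) -> dict:
    """Hypothetical transform step: append the load date to an API row."""
    return {**row, "load_date": load_date.isoformat()}

def test_add_load_date():
    row = {"test_date": "2021-03-01", "new_positives": 5}
    out = add_load_date(row, date(2021, 3, 2))
    assert out["load_date"] == "2021-03-02"
    assert out["new_positives"] == 5   # original fields preserved
    assert "load_date" not in row      # input row not mutated

test_add_load_date()
```

Small pure functions like this are the easiest units to test; integration tests would then cover the API call and database load end to end.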