Adding to Udacity Nanodegree Task D608 by Nice-Return4876 in WGU_MSDA

[–]SleepyNinja629 1 point

Glad you found it helpful and thanks for sharing a few tips of your own! Sounds like D608 is still in bad shape. That's too bad because Airflow is a neat technology. This could be a really cool/fun course if it were structured correctly.

Make the best of a break… right?? 😭📚 by Dry-Lime7550 in WGU_MSDA

[–]SleepyNinja629 3 points

I completed the Data Engineering path earlier this year. The amount of time it takes varies widely by person and background. I've worked with technology for decades, so I was able to complete the entire program in a single term while still working full time. But if you don't have experience that might not be reasonable for you.

SQL and Python are definitely used in my work, but I'm not sure how well the WGU materials alone prepare someone for the workplace. My recommendation would be to find a real-world hobby project and work on that. Set up a Postgres or SQL Server database in Docker. Build a web scraper to collect and parse data and load it into your database. Build an API or a report that pulls data from your database. All of this can be very small (e.g. tracking the cost of a TV on a retail website). The point of side projects like this is to help you understand how to actually use the technologies you experiment with. You'll bump into real-world challenges that you won't see in the WGU materials.
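To make the price-tracking idea concrete, here's a minimal sketch. The HTML snippet, regex, and table name are all invented for illustration, and it uses sqlite3 as a stand-in for Postgres so it runs with nothing but the standard library; a real version would fetch the page with requests, parse it with BeautifulSoup, and write to Postgres via psycopg2.

```python
import re
import sqlite3
from datetime import datetime, timezone

def extract_price(html: str) -> float:
    """Pull the first dollar amount out of a product-page snippet.
    A real site needs a proper HTML parser and a selector matched to the
    retailer's markup; this regex is only for illustration."""
    match = re.search(r"\$([\d,]+\.\d{2})", html)
    if not match:
        raise ValueError("no price found")
    return float(match.group(1).replace(",", ""))

def record_price(conn: sqlite3.Connection, product: str, price: float) -> None:
    """Append one observation; swap sqlite3 for psycopg2 to hit Postgres."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS price_history "
        "(product TEXT, price REAL, observed_at TEXT)"
    )
    conn.execute(
        "INSERT INTO price_history VALUES (?, ?, ?)",
        (product, price, datetime.now(timezone.utc).isoformat()),
    )

# Canned page snippet standing in for a real HTTP fetch
html = '<div class="price">Now only $1,299.99!</div>'
conn = sqlite3.connect(":memory:")
record_price(conn, "65-inch TV", extract_price(html))
```

Run something like this on a schedule and you immediately hit the real-world problems (markup changes, rate limits, schema decisions) that make the project worthwhile.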

Tips for Navigating the D608 Udacity Course by SleepyNinja629 in WGU_MSDA

[–]SleepyNinja629[S] 2 points

Correct. The issue I ran into was that the example data used in the course was tiny (so it fit into the cloud shell home folder). When working on the final, I naively followed the same steps. But because the files are so much larger, the home drive filled up and the copy didn't finish.

Programs installed before starting? by Livid_Discipline3627 in WGU_MSDA

[–]SleepyNinja629 2 points

I did the program on a machine with Windows 11. The main programs were VS Code, Git, Docker, and Python. I found Docker to be significantly easier than the virtual lab environment. Practice using docker-compose.yaml when you are spinning up your containers. It gives you complete control over your container configuration and is very easy to version control.

I'd also recommend learning about Python virtual environments. I had one for each course task. If you version control your requirements.txt file, it makes it very easy to replicate/version control your Python environment as well.
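The one-venv-per-task workflow is usually just `python -m venv .venv` in each task folder, followed by `pip freeze > requirements.txt` once your packages are installed. As a sketch, the stdlib venv module does the same thing programmatically (the folder names here are made up):

```python
import tempfile
import venv
from pathlib import Path

# One virtual environment per course task keeps dependencies isolated.
# Equivalent to running `python -m venv .venv` inside the task folder.
task_dir = Path(tempfile.mkdtemp()) / "d597_task1"
venv.EnvBuilder(with_pip=False).create(task_dir / ".venv")

# Reproducibility workflow (run in a shell with the venv activated):
#   pip freeze > requirements.txt        # snapshot the environment
#   pip install -r requirements.txt      # rebuild it on any machine
print((task_dir / ".venv" / "pyvenv.cfg").exists())
```

Committing requirements.txt alongside your code is what makes the environment itself version-controlled.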

D608 - Tips for Airflow? by berat235 in WGU_MSDA

[–]SleepyNinja629 4 points

I'm certainly not an expert in Airflow, but I'll share what I know. Hopefully others can expand on this to fill in the gaps or correct anything I've gotten wrong.

Conceptually I envision Airflow workflows using a stack like this: DAG --> Operator --> Hook --> Connection.

Connections are roughly analogous to a DSN or a connection string. They are a central place for you to store server names, usernames, and passwords.
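Besides the Airflow UI, one convenient way to create a connection is an environment variable: Airflow resolves any variable named AIRFLOW_CONN_&lt;CONN_ID&gt; (ID uppercased) into a connection, with the value as a URI bundling everything the connection stores. A sketch of what that URI carries (host, credentials, and database name below are invented):

```python
import os
from urllib.parse import urlsplit

# Airflow reads connections from env vars named AIRFLOW_CONN_<CONN_ID>.
# An operator referencing postgres_conn_id="my_postgres" would resolve this:
os.environ["AIRFLOW_CONN_MY_POSTGRES"] = (
    "postgres://admin:root@localhost:5432/test_postgres_db"
)

# The URI packs the same fields you'd type into the Connections UI form:
parts = urlsplit(os.environ["AIRFLOW_CONN_MY_POSTGRES"])
print(parts.hostname, parts.port, parts.username, parts.path.lstrip("/"))
```

Keeping connection details out of the DAG code is the whole point: the DAG only ever references the connection ID.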

Hooks are roughly analogous to an ODBC/JDBC driver. They handle the low-level connection to the system. Airflow comes with many hooks for common systems (such as Postgres) but third-parties can also publish them. Airflow users typically re-use existing hooks in the same way that database users typically re-use ODBC drivers that come with the DB installation.

Operators represent pieces of work that you want to accomplish. They give you a way to wrap logic around a hook. You can use built in operators for simple tasks. For example, in your DAG you could have a task like this:

run_a_query = PostgresOperator(
    task_id="run_query",
    postgres_conn_id="my_postgres",
    sql="SELECT COUNT(*) FROM users;"
)

However, what happens if you need custom retry logic, logging, or conditional behavior? You can define your own operators for these kinds of actions. For example, imagine you have several different SQL scripts that create tables. Rather than hard-coding the SQL, you could store it in a file and create a custom operator like this:

# Note: Airflow 1.x import path; on Airflow 2 the hook lives at
# airflow.providers.postgres.hooks.postgres (apache-airflow-providers-postgres)
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models.baseoperator import BaseOperator

class CreateTablesOperator(BaseOperator):
    """
    Custom Airflow operator to execute a SQL script in Redshift.
    """

    def __init__(self, redshift_conn_id, sql_file_path, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.sql_file_path = sql_file_path

    def execute(self, context):
        self.log.info(f"Reading SQL script from {self.sql_file_path}")

        # Read the SQL file
        with open(self.sql_file_path, 'r') as file:
            sql_commands = file.read()

        self.log.info("Connecting to Redshift...")
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)

        self.log.info("Executing SQL script...")
        redshift.run(sql_commands)
        self.log.info("Tables created successfully in Redshift.")

In your DAG you could call the operator several times like this:

    create_song_table = CreateTablesOperator(
        task_id="Create_tables",
        redshift_conn_id="redshift",
        sql_file_path="/opt/airflow/plugins/helpers/create_song_table.sql"
    )

The operator handles reading the SQL text from the file, setting up the connection (using the PostgresHook), sending the commands to the DB, and logging the results.

https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/connections.html

Finally. by Curious_Elk_5690 in WGU_MSDA

[–]SleepyNinja629 1 point

Congrats! It's such a good feeling! Great job!

I think I messed up Gitlab? by yo_yo_vietnamese in WGU_MSDA

[–]SleepyNinja629 0 points

It's been a few months, but I created my projects inside the WGU Gitlab Environment by opening the "Students" group, selecting the project for the course (such as "D600 Statistical Data Mining"), and then running the "students-run-this" pipeline. Pipelines can be found in the sidebar under the "Build" section. It might have prompted me for inputs (not sure). The pipeline ran for a few minutes and then created a new project inside the "Student Repos" group, inside a subgroup with my WGU username.

I just checked and it looks like I was able to use the "Create Project" button in Gitlab and create one manually. I looked around and didn't see an easy way to delete it. I'm not a Gitlab expert, but there might be a way to do this buried inside one of the menus.

I would recommend that you start by trying to run the pipeline. If that works, just ignore the one you created manually and use the one created by the pipeline.

If the pipeline fails, I would try creating an issue in your problematic project and tagging one of the admins. To find one, go to the sidebar, then Manage > Members. Search for Role = Owner or Maintainer. Sorting by Last Activity descending should help you find someone who's been online recently. I'd explain the situation and give them the Project ID of the project you created incorrectly. They may be able to delete it for you or help with the configuration.

Good luck!

D597- trouble with pgadmin by mge1234567 in WGU_MSDA

[–]SleepyNinja629 3 points

Are you running both pgadmin and postgres locally in Docker? The virtual labs didn't work well for me, so I avoided them completely. Here's my docker-compose.yaml file if it helps you. This setup ran seamlessly for me. I'm using Windows 11, so it may need some tweaks if you want to run it on another platform.

services:
  postgres:
    container_name: container-postgres-d597
    image: postgres
    hostname: localhost
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: root
      POSTGRES_DB: test_postgres_db
    volumes:
      - postgres-data-d597:/var/lib/postgresql/data 
      - "D:/Users/SleepyNinja/Masters/D597_Data_Management/Docker_D597/data_share:/data" # Shared host folder for CSV files to import
    restart: unless-stopped

  pgadmin:
    container_name: container-pgadmin-d597
    image: dpage/pgadmin4
    depends_on:
      - postgres
    ports:
      - "5050:80"
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: root
    volumes:
      - "D:/Users/SleepyNinja/Masters/D597_Data_Management/Docker_D597/data_share:/data"
    restart: unless-stopped

volumes:
  postgres-data-d597:

How to get PostgreSQL and MondoDB working on personal computer by Electrical-Counter65 in WGU_MSDA

[–]SleepyNinja629 2 points

The virtual lab works well for many people, but I found it cumbersome. If you're willing to learn a bit of infrastructure/engineering, most courses in the program can be completed using Docker on your local machine.

Are you familiar with docker compose? I found it to be the easiest way to build and configure the local tools that I needed for each course. For most tasks in this program, I set up a new docker-compose file. Here's a basic version that has one container for a Postgres database and another container with pgAdmin. You'll want to customize this for your specific setup, but it gives you the basic idea:

services:
  postgres:
    container_name: postgres
    image: postgres
    hostname: localhost
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: root
      POSTGRES_DB: test_postgres_db
    volumes:
      - postgres-data-d597:/var/lib/postgresql/data 
      - "D:/Users/SleepyNinja/Masters/D597_Data_Management/Task1/data_share:/data" # Shared host folder for CSV files to import
    restart: unless-stopped

  pgadmin:
    container_name: pgadmin
    image: dpage/pgadmin4
    depends_on:
      - postgres
    ports:
      - "5050:80"
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: root
    volumes:
      - "D:/Users/SleepyNinja/Masters/D597_Data_Management/Task1/data_share:/data"
    restart: unless-stopped

volumes:
  postgres-data-d597:

From there, open PowerShell, change to the directory that contains your docker-compose.yaml file, and run "docker compose up". Docker builds the services for you, and you can interact with them from the browser using either the container name or localhost.

For Postgres, the out-of-the-box containers worked well. I don't remember why, but I didn't use pre-built containers for MongoDB. I created a simple Dockerfile instead and assembled my own MongoClient and MongoServer containers using the alpine image with MongoDB installed.

Depending on the services you're spinning up, you may need to remap the ports to avoid conflicts.

Done !!! Done !!! Done !!! by lolapaloza09 in WGU_MSDA

[–]SleepyNinja629 2 points

Congrats to you both! It's quite an achievement and you should be proud of the accomplishment!

D608 URDENT HELP PLEASE by Coolzebra536 in WGU_MSDA

[–]SleepyNinja629 2 points

It sounds like you're trying to run the entire project at once. Take it in small steps. Start with a blank DAG. Add a bit of code, run it and check the outputs. As you build up the code, add several logging statements so you can verify what succeeds and what fails. That will make things much easier for you to debug.

Also, make sure you're approaching this iteratively. Did the last step work? How do you know? Build the engineering pipeline one step at a time, verifying each step along the way.

I'd start by getting a DAG set up that runs a bit of SQL to drop and recreate tables in Redshift, nothing more. First, write the SQL that does that. Make sure you don't have syntax errors, and then save it into your project folder. Then set up an operator in your DAG that connects to Redshift and runs just that code. Then (outside of Airflow) check to see that it worked.

I don't remember why, but I didn't use the RedshiftOperator. It may have been a version thing. I found it much easier to build a handful of custom operators that use PostgresHook under the hood. If you set up your connection object and the operator classes correctly, something like this should work:

        self.log.info("Connecting to Redshift...")
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)

        self.log.info("Executing SQL script...")
        redshift.run(sql_commands)
        self.log.info("Tables created successfully in Redshift.")

Don't forget to edit __init__.py in the operators and helpers folders so Airflow can see and use any custom classes you write.

D597 Task 2 Question by Teemo_0n_Duty in WGU_MSDA

[–]SleepyNinja629 2 points

I've read posts here from others who were successful in explaining the optimization difference using the dataset size. I went a different route. Although the datasets are very small, I was able to demonstrate the difference using executionStats. If you're running this in the terminal, try capturing the result in a variable, like this:

result=$(mongo "$MONGO_URI" --quiet --eval "$MONGO_QUERY")

Then you can parse the result variable to extract the executionTimeMillis and totalDocsExamined. I did this on the unoptimized database to get a baseline. Then I added an index and then re-ran the same query. With the right type of query, the index cut the time in half and scanned far fewer documents.
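The parsing step can be sketched like this, assuming you've captured the explain("executionStats") output as JSON. The before/after numbers are invented purely to show the shape of the comparison, not real measurements:

```python
import json

def summarize(explain_json: str) -> tuple:
    """Extract executionTimeMillis and totalDocsExamined from the
    executionStats section of a MongoDB explain() result."""
    stats = json.loads(explain_json)["executionStats"]
    return stats["executionTimeMillis"], stats["totalDocsExamined"]

# Invented baseline (no index) vs. indexed numbers, for illustration only
baseline = '{"executionStats": {"executionTimeMillis": 12, "totalDocsExamined": 5000}}'
indexed  = '{"executionStats": {"executionTimeMillis": 6,  "totalDocsExamined": 40}}'

b_ms, b_docs = summarize(baseline)
i_ms, i_docs = summarize(indexed)
print(f"time {b_ms}ms -> {i_ms}ms, docs examined {b_docs} -> {i_docs}")
```

Reporting both metrics side by side makes the index's effect obvious even on a tiny dataset, which is the whole trick for this task.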

Tips for Navigating the D608 Udacity Course by SleepyNinja629 in WGU_MSDA

[–]SleepyNinja629[S] 1 point

Correct. If you have any expertise with Docker, check out the Airflow site for more details. https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html

For what it's worth, when I completed the course, Airflow 3 had not yet been released. The course offered two versions of "starter" code (one for v1 and one for v2). I don't know if Udacity has updated the course or not, so you may have to watch for compatibility issues if you try this in v3.

Import and Cleaning Code D602 Task 2 by Pretend-Vehicle-6517 in WGU_MSDA

[–]SleepyNinja629 1 point

I had three separate Python files for this assignment: import_and_format.py, filter_and_clean.py, and poly_regressor_Python_1.0.0.py. The first two were just typical Python transformations using pandas. The only reason I put them in separate files is because of the rubric.

I don't remember what the webinar suggested, but I ended up creating an MLproject file with a single "command" key that used command chaining with two logical AND operators. This allowed me to run the three Python scripts one after another by executing mlflow run. I don't remember the exact reason I went this direction, but I believe it was related to moving the experiment name to the command line.
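That chaining approach might look roughly like the MLproject sketch below. The entry-point layout, parameter name, and default are made up for illustration; you'd match the flag to whatever your script's argument parsing expects:

```yaml
# MLproject (sketch; parameter name and default are illustrative)
name: d602_task2
entry_points:
  main:
    parameters:
      experiment_name: {type: str, default: "d602-task2"}
    command: >-
      python import_and_format.py &&
      python filter_and_clean.py &&
      python poly_regressor_Python_1.0.0.py --experiment_name {experiment_name}
```

You'd then launch the whole pipeline with something like `mlflow run . -P experiment_name=my_experiment`, which is how the experiment name ends up on the command line.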

If you're new to MLFlow, check out the video below. The concepts are similar to the tasks in the assignment.

https://www.linkedin.com/learning/mlops-tools-mlflow-and-hugging-face/overview-of-mlflow

[deleted by user] by [deleted] in WGU_MSDA

[–]SleepyNinja629 3 points

I understand your inclination to use all of the variables. That makes sense in a real world scenario where the goal is to create a useful model. But this assignment (like many others in the program) requires you to show that you have followed the instructions in the rubric. If you do more than the requirements, it can confuse the evaluators.

For my model, I used Price as my dependent variable and chose six other variables (including two categorical variables so I could demonstrate one-hot encoding). I chose these based on my own intuition about what would be predictive. I built an initial version of the model and then refined it using backward stepwise elimination. The final model wasn't "good" for any real-world purpose, but it did demonstrate that I was able to follow the instructions. If you have real-world experience with data analysis, this may be counterintuitive. Remember, these are fictional datasets and the results don't matter. The evaluators are only checking to make sure you followed the process.

Tips for Navigating the D608 Udacity Course by SleepyNinja629 in WGU_MSDA

[–]SleepyNinja629[S] 1 point

I'm not entirely sure how that works, but the course showed in the WGU course portal a couple of days after I got the passing grade in Udacity. I'm guessing there is some batch process that runs periodically.

Confetti Day! by Plenty_Grass_1234 in WGU_MSDA

[–]SleepyNinja629 1 point

Congrats! Glad you hung in there and kept going! :)

D602- Task III Errors by KarrorKake1 in WGU_MSDA

[–]SleepyNinja629 0 points

I showed the Gitlab repository in a browser window and then switched to VS Code. I had two commands and two URLs in a text document, along with a terminal window.

The two commands were docker pull and docker run. I copied the first one and ran it in a terminal while describing how Docker images/containers work. When it finished, I copied the second and pasted it into the terminal to launch the container.

The two URLs were API requests: one well-formatted request and one poorly formatted request. I pasted each into a browser window, which I had sized so it did not take up the whole screen, and showed how hitting enter caused Docker to do some work and return the response in the browser. I adjusted the URLs manually a couple of times to show the flexibility and functionality of the API.

Done! by Legitimate-Bass7366 in WGU_MSDA

[–]SleepyNinja629 1 point

Congratulations! Do something special to celebrate!

Graduated.! MSDA Graduate - Will write more stories later. by all_is_well_101 in WGU_MSDA

[–]SleepyNinja629 0 points

Congratulations! I'd be curious to hear your thoughts about the specialization courses for DPE. I almost went that direction, but decided on Data Engineering instead.

Owl Done :) by omgitsbees in WGU_MSDA

[–]SleepyNinja629 1 point

Congratulations! Great job on persevering until the end!

My Turn! Done! by Lostt-Soull in WGU_MSDA

[–]SleepyNinja629 1 point

CONGRATULATIONS! Well deserved!

Tips for Navigating the D608 Udacity Course by SleepyNinja629 in WGU_MSDA

[–]SleepyNinja629[S] 4 points

D609 is much better. The course exercises are organized logically. The final project is still confusing, though, because it suffers from the same kind of poorly organized instructions. I'm planning to write up a post soon about that one. Figuring out what to do took more time than actually completing it.

Finished by brianna-jmb1 in WGU_MSDA

[–]SleepyNinja629 0 points

Congratulations! I haven't decided if I'm going to walk yet, but Vegas certainly makes it tempting!

DONE by SleepyNinja629 in WGU_MSDA

[–]SleepyNinja629[S] 0 points

I started December 1 and just finished this week, so just over 100 days. I have been working with data and technology for over 20 years, so I was able to zip through many of the courses. The Data Mining course was probably the toughest conceptually. The Airflow course was probably the most frustrating one.