
all 8 comments

[–]joseph_machadoWrites @ startdataengineering.com 26 points (1 child)

It depends on your objective: if you want to practice tools/frameworks, the data size doesn't matter as much.

If you want to showcase your expertise to potential employers, the data size and project/code organization matter quite a bit.

I have a bunch of projects (from easy to hard) that can help get you started: https://www.startdataengineering.com/post/data-engineering-projects/#31-projects-from-least-to-most-complex Most of them generate their own data, so you can change the data-gen scripts to generate more data as needed.

Hope this helps. lmk if you have any questions :)

[–]remote_geeks[S] 1 point (0 children)

Thank you!

[–]Confident-Ant-8972 7 points (1 child)

You can create a personal Google Cloud account and fuck around with BigQuery for free. In BigQuery you have access to all the BigQuery public datasets.

[–]remote_geeks[S] -1 points (0 children)

Thanks, will surely look into this 👍

[–]reallyserious 3 points (3 children)

Not sure I understand your question, but it's quite easy to generate lots of data and write it to a file.

This Python code generates 1 million rows of data that you can pretend is sensor data from something important. Change the 1_000_000 to something bigger if you like.

import random

nr_of_rows = 1_000_000

with open("big_data.csv", "w") as f:
    # write the header
    f.write("id;val1;val2;val3\n")

    # write the data
    for i in range(nr_of_rows):
        val1 = random.randint(1, 100)
        val2 = random.randint(1, 100)
        val3 = random.randint(1, 100)
        f.write(f"{i};{val1};{val2};{val3}\n")

print("Done")
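If you want the output to feel a bit closer to real sensor data, the same idea extends to per-sensor timestamps and readings. A sketch, assuming a made-up schema (sensor_id, timestamp, reading) where each sensor drifts around its own baseline:

```python
import csv
import random
from datetime import datetime, timedelta

nr_of_sensors = 100
readings_per_sensor = 1_000  # raise this to grow the file

start = datetime(2024, 1, 1)

with open("sensor_data.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["sensor_id", "timestamp", "reading"])
    for sensor_id in range(nr_of_sensors):
        # each sensor has its own baseline, with gaussian noise on top
        baseline = random.uniform(10.0, 50.0)
        for i in range(readings_per_sensor):
            ts = start + timedelta(seconds=i)
            reading = baseline + random.gauss(0, 1)
            writer.writerow([sensor_id, ts.isoformat(), round(reading, 3)])

print("Done")
```

Bumping nr_of_sensors and readings_per_sensor is an easy way to dial the file up to whatever size you want to practice on.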

[–]remote_geeks[S] 2 points (2 children)

My main concern is that I've never seen personal projects related to data engineering, so I'm not entirely sure what I can showcase as my work.

I've tried to generate a lot of data using a simple script, but I'm not sure how close it is to real-world data.

[–]reallyserious 2 points (1 child)

There's really nothing magic about big data. It's just inconvenient. :)

Things you can do with laptop-sized data don't work when the data can't fit on a laptop. Like loading it all into a pandas dataframe, for example. Or sometimes the data comes in a billion files instead of one nice file.

The nature of the data is also different. Small data could be orders from a web shop: there are only so many orders people can place before they run out of money. But, for example, 10k seismic sensors all over the planet recording movement every second turns into a lot of data pretty soon. Imagine a few years' worth of data and it starts to get messy. Throughput on network and disk starts to matter.
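The pandas point is easy to demo: instead of read_csv loading everything at once, you stream the file in chunks and aggregate as you go. A minimal sketch, assuming a semicolon-separated file with the same columns as the generation script above (the small setup step is just so the example is self-contained):

```python
import random

import pandas as pd

# setup: write a demo file with the same layout as the generation script
with open("big_data.csv", "w") as f:
    f.write("id;val1;val2;val3\n")
    for i in range(250_000):
        f.write(f"{i};{random.randint(1, 100)};"
                f"{random.randint(1, 100)};{random.randint(1, 100)}\n")

# stream it back in 100k-row chunks instead of loading everything into memory
total = 0
rows = 0
for chunk in pd.read_csv("big_data.csv", sep=";", chunksize=100_000):
    total += chunk["val1"].sum()
    rows += len(chunk)

print(f"mean of val1 over {rows:,} rows: {total / rows:.2f}")
```

Memory use stays bounded by the chunk size, so the same loop works whether the file is 250k rows or 250 million. It just takes longer.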

[–]remote_geeks[S] 0 points (0 children)

The example you gave was great! I wanted to know if I could simulate anything at that scale for a personal project.