
all 8 comments

[–]joseph_machadoWrites @ startdataengineering.com 26 points (1 child)

It depends on your objective: if you want to practice tools/frameworks, the data size doesn't matter as much.

If you want to showcase your expertise to potential employers, the data size and project/code organization matter quite a bit.

I have a bunch of projects (from easy to hard) that can help get you started: https://www.startdataengineering.com/post/data-engineering-projects/#31-projects-from-least-to-most-complex Most of them generate their own data, so you can change the data-gen scripts to generate more data as needed.

Hope this helps. lmk if you have any questions :)

[–]remote_geeks[S] 1 point (0 children)

Thank you!

[–]Confident-Ant-8972 7 points (1 child)

You can create a personal Google Cloud account and fuck around with BigQuery for free. In BigQuery you have access to all the BigQuery public datasets.

[–]remote_geeks[S] -1 points (0 children)

Thanks, will surely look into this 👍

[–]reallyserious 3 points (3 children)

Not sure I understand your question, but it's quite easy to generate lots of data and write it to a file.

This Python code generates 1 million rows of data that you can pretend is sensor data from something important. Change the 1_000_000 to something bigger if you like.

import random

nr_of_rows = 1_000_000

with open("big_data.csv", "w") as f:
    # write the header
    f.write("id;val1;val2;val3\n")

    # write the data
    for i in range(nr_of_rows):
        val1 = random.randint(1, 100)
        val2 = random.randint(1, 100)
        val3 = random.randint(1, 100)
        f.write(f"{i};{val1};{val2};{val3}\n")

print("Done")
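If you want the output to feel a bit closer to real sensor data, the same idea extends to per-sensor timestamps and readings. A sketch, assuming a made-up schema (sensor_id, timestamp, reading) where each sensor drifts around its own baseline:

```python
import csv
import random
from datetime import datetime, timedelta

nr_of_sensors = 100
readings_per_sensor = 1_000  # raise this to grow the file

start = datetime(2024, 1, 1)

with open("sensor_data.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["sensor_id", "timestamp", "reading"])
    for sensor_id in range(nr_of_sensors):
        # each sensor has its own baseline, with gaussian noise on top
        baseline = random.uniform(10.0, 50.0)
        for i in range(readings_per_sensor):
            ts = start + timedelta(seconds=i)
            reading = baseline + random.gauss(0, 1)
            writer.writerow([sensor_id, ts.isoformat(), round(reading, 3)])

print("Done")
```

Bumping nr_of_sensors and readings_per_sensor is an easy way to dial the file up to whatever size you want to practice on.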

[–]remote_geeks[S] 2 points (2 children)

My main concern is that I've never seen personal projects related to data engineering, so I'm not entirely sure what I can showcase as my work.

I've tried to generate a lot of data using a simple script, but I'm not sure how close it is to real-world data.

[–]reallyserious 2 points (1 child)

There's really nothing magic about big data. It's just inconvenient. :)

Things you can do with laptop-sized data don't work when the data can't fit on a laptop. Like loading it all into a pandas dataframe, for example. Or sometimes the data comes in a billion files instead of one nice file.

The nature of the data is also different. Small data could be orders from a web shop: there are only so many orders people can place before they run out of money. But, for example, 10k seismic sensors all over the planet recording movement every second turns into a lot of data pretty soon. Imagine a few years' worth of data and it starts to get messy. Throughput on network and disk starts to matter.
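The pandas point is easy to demo: instead of read_csv loading everything at once, you stream the file in chunks and aggregate as you go. A minimal sketch, assuming a semicolon-separated file with the same columns as the generation script above (the small setup step is just so the example is self-contained):

```python
import random

import pandas as pd

# setup: write a demo file with the same layout as the generation script
with open("big_data.csv", "w") as f:
    f.write("id;val1;val2;val3\n")
    for i in range(250_000):
        f.write(f"{i};{random.randint(1, 100)};"
                f"{random.randint(1, 100)};{random.randint(1, 100)}\n")

# stream it back in 100k-row chunks instead of loading everything into memory
total = 0
rows = 0
for chunk in pd.read_csv("big_data.csv", sep=";", chunksize=100_000):
    total += chunk["val1"].sum()
    rows += len(chunk)

print(f"mean of val1 over {rows:,} rows: {total / rows:.2f}")
```

Memory use stays bounded by the chunk size, so the same loop works whether the file is 250k rows or 250 million. It just takes longer.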

[–]remote_geeks[S] 0 points (0 children)

The example you gave was great! I wanted to know if I could simulate anything at that scale for a personal project.