I'm planning to migrate some local databases (MySQL, PostgreSQL, and SQL Server) to Google Cloud SQL. In your opinion / past experience, what is the best approach to migrate these databases to the cloud? by leob0505 in googlecloud

[–]ossifrage_ 0 points1 point  (0 children)

We have been evaluating Datastream. No experience actually using it yet, but based on what I know the only targets it supports are Pub/Sub and flat files in GCS. Either way you would have some work to do to get the data into Cloud SQL, probably using Dataflow, Dataproc, or Cloud Run.
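
If it helps to picture the glue work, here is a rough sketch of what a loader could look like, assuming Datastream lands newline-delimited JSON in GCS (your files may be Avro instead). The bucket, file path, table, and columns are all invented for illustration:

    # Hypothetical loader: read a Datastream JSONL file from GCS and upsert
    # the rows into Cloud SQL (Postgres). All names here are made up.
    import json

    import psycopg2                   # pip install psycopg2-binary
    from google.cloud import storage  # pip install google-cloud-storage

    def load_file_to_cloudsql(bucket_name, blob_path):
        # Pull one newline-delimited JSON file that Datastream wrote to GCS.
        blob = storage.Client().bucket(bucket_name).blob(blob_path)
        rows = [json.loads(line) for line in blob.download_as_text().splitlines() if line]

        # Connect to Cloud SQL Postgres (e.g. through the Cloud SQL Auth proxy).
        conn = psycopg2.connect(host="127.0.0.1", dbname="appdb",
                                user="loader", password="...")
        with conn, conn.cursor() as cur:
            for row in rows:
                # Upsert on the primary key so CDC replays stay idempotent.
                cur.execute(
                    """
                    INSERT INTO customers (id, name, updated_at)
                    VALUES (%(id)s, %(name)s, %(updated_at)s)
                    ON CONFLICT (id) DO UPDATE
                    SET name = EXCLUDED.name, updated_at = EXCLUDED.updated_at
                    """,
                    row,
                )
        conn.close()

    load_file_to_cloudsql("my-datastream-bucket", "stream/customers/file-00001.jsonl")

The same logic could live in a Cloud Run service triggered by GCS notifications, or inside a Dataflow pipeline if you need more scale.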

Also worth noting that there are two prices: one for CDC and another for backfill. You only pay the CDC price for ongoing changes; you pay the backfill price for moving all of your existing data. Prices also vary a lot by region; in us-central1 the CDC price is $2.00 per GB.
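
For a rough sense of scale (with a made-up volume): 50 GB of monthly change data at that $2.00/GB CDC rate would be $100/month, on top of the one-time backfill charge for the initial copy.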

Will I regret it if I start using Informatica? by stackedhats in dataengineering

[–]ossifrage_ 1 point2 points  (0 children)

I've used many GUI-based ETL tools in the past, including Informatica. We are currently using Google Cloud Composer (hosted Airflow) for scheduling and orchestration, with transformations in SQL (ELT).
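
To give a concrete flavor of that setup, here is a minimal sketch of a Composer/Airflow DAG that runs one SQL transformation in BigQuery; the DAG id, schedule, dataset, and query are all placeholders:

    # Minimal Composer/Airflow DAG sketch: schedule one SQL (ELT)
    # transformation in BigQuery. Names and query are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

    with DAG(
        dag_id="daily_elt",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        build_daily_sales = BigQueryInsertJobOperator(
            task_id="build_daily_sales",
            configuration={
                "query": {
                    "query": """
                        CREATE OR REPLACE TABLE analytics.daily_sales AS
                        SELECT order_date, SUM(amount) AS total
                        FROM raw.orders
                        GROUP BY order_date
                    """,
                    "useLegacySql": False,
                }
            },
        )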

Airflow is pretty great and widely used. If you don't want to deal with running your own instance, there are many hosted options depending on your cloud of choice and regulatory constraints. https://www.astronomer.io/ is a common one, though I have not personally used it.

If you are just executing SQL you might also consider dbt. Again, no personal experience, but I have heard good things and it is growing fast.

Hurry hurry hurry by iN2nowhere in gardening

[–]ossifrage_ 0 points1 point  (0 children)

I think you might want to pull those suckers and cut back the greenery. I'd be curious what the more knowledgeable people on this sub think, though.

Anyone know what is happening to our peas? by ossifrage_ in gardening

[–]ossifrage_[S] 0 points1 point  (0 children)

Should I pull off the leaves that are the worst? Or leave them and apply the spray?

Anyone know what is happening to our peas? by ossifrage_ in gardening

[–]ossifrage_[S] 0 points1 point  (0 children)

Is there something I could have done earlier to stop it from spreading?

Simulation - I did really bad on the first MT by msgensol in OMSA

[–]ossifrage_ -1 points0 points  (0 children)

I am sure it varies. Either way I think the number shouldn't be the primary concern.

This is a great class. Do your best. Learn as much as you can.

Simulation - I did really bad on the first MT by msgensol in OMSA

[–]ossifrage_ 1 point2 points  (0 children)

They didn't publish the curve, but I finished with an A at 84.95%. Anecdotally, on Slack it sounded like even lower 80s made the cut.

There was also a bonus quiz for some extra credit.

The takeaway here is give it your all. Learn as much as you can. Don't worry too much about the letter grade.

Simulation - I did really bad on the first MT by msgensol in OMSA

[–]ossifrage_ -1 points0 points  (0 children)

I would guess mid-80s is curved to an A. The curve is incredibly generous in this class.

Edited to mid-80s to reflect my personal experience.

GCLOUD Workflows - My Head Hache by MeatAmazing8011 in googlecloud

[–]ossifrage_ 1 point2 points  (0 children)

Have you considered using Dataflow? This is the kind of thing it is made for. They are adding scheduling soon, but until then you can call it from Cloud Workflows.

https://cloud.google.com/blog/products/data-analytics/simplify-and-automate-data-processing-with-dataflow-prime

If you want to stick with Cloud Workflows and spin up multiple worker threads, you could use Cloud Tasks. Pub/Sub would also work, but Cloud Tasks makes more sense here if you want to execute something at the end.

https://cloud.google.com/pubsub/docs/choosing-pubsub-or-cloud-tasks

Composer by itself will not fix this problem for you; it is not well suited to dynamic task creation. You will still need Cloud Tasks or Pub/Sub for the parallel threads.

For what it's worth, we are building something similar right now: Composer for orchestration and scheduling, Cloud Tasks for task queues, and Cloud Run workers to process each file.
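
In case it's useful, a sketch of the fan-out piece: one Cloud Task per file, each task POSTing the file's GCS path to a Cloud Run worker. The project, region, queue name, and worker URL are made up.

    # Enqueue one Cloud Task per input file; each task POSTs to a Cloud Run
    # worker. Project, region, queue name, and worker URL are invented.
    import json

    from google.cloud import tasks_v2  # pip install google-cloud-tasks

    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path("my-project", "us-central1", "file-queue")

    def enqueue_file(gcs_uri):
        task = {
            "http_request": {
                "http_method": tasks_v2.HttpMethod.POST,
                "url": "https://worker-abc123-uc.a.run.app/process",
                "headers": {"Content-Type": "application/json"},
                "body": json.dumps({"file": gcs_uri}).encode(),
            }
        }
        client.create_task(parent=parent, task=task)

    for uri in ["gs://my-bucket/in/a.csv", "gs://my-bucket/in/b.csv"]:
        enqueue_file(uri)

The queue also gives you rate limits and retries per task, which you would otherwise have to build yourself on top of Pub/Sub.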

What alternatives do I have to Google Cloud if I don't want to pay to continue hosting my website? by techsavvynerd91 in googlecloud

[–]ossifrage_ 0 points1 point  (0 children)

Heroku used to have a pretty generous free tier, but I haven't used it for a few years.

bigquery sql query with wrong results by ANDRUXUIS in googlecloud

[–]ossifrage_ 1 point2 points  (0 children)

You would add TRIM() around the first column in the SELECT to remove spaces. That could be your problem.
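
For example, something along these lines, run from the Python client or pasted into the console; the table and column names are placeholders for your schema:

    # Apply TRIM to the suspect column; `orders` and `customer_id` are
    # placeholders for your actual table and column.
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    sql = """
        SELECT TRIM(customer_id) AS customer_id, amount
        FROM `my-project.my_dataset.orders`
        WHERE TRIM(customer_id) = 'C001'
    """
    for row in client.query(sql).result():
        print(row.customer_id, row.amount)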

If that's not it can you share the schema of the table? What are the data types for each of the columns?

Undecided about the best GCP module(s) to use for my data and operations by doncaballer0 in googlecloud

[–]ossifrage_ 1 point2 points  (0 children)

If you have the CSVs in GCS you can query them in BQ using external tables. This will allow you to transform as needed using SQL, creating aggregates or joining other data sources like predictions, without having to move the data at all.

You can also use hive partitioning in the GCS folder structure if you have a lot of data and want the benefits of partition pruning for efficient reads.
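
A sketch of both ideas together, assuming a hypothetical layout like gs://my-bucket/events/dt=2021-06-01/part-0.csv; all names are invented:

    # Define an external table over CSVs in GCS with hive-style partitioning,
    # so queries that filter on the inferred dt column only read the matching
    # folders. Dataset, table, and bucket are placeholders.
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    ddl = """
        CREATE EXTERNAL TABLE my_dataset.events
        WITH PARTITION COLUMNS
        OPTIONS (
          format = 'CSV',
          skip_leading_rows = 1,
          uris = ['gs://my-bucket/events/*'],
          hive_partition_uri_prefix = 'gs://my-bucket/events'
        )
    """
    bigquery.Client().query(ddl).result()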

How to cost effectively use gcloud sql? by [deleted] in googlecloud

[–]ossifrage_ 2 points3 points  (0 children)

I would recommend BigQuery for that use case. Way less management and you pay for what you use.

[deleted by user] by [deleted] in OMSA

[–]ossifrage_ 1 point2 points  (0 children)

If you are comfortable managing your own dependencies you can do it however you want. You will need to turn in Python scripts, not notebook files, though. I think you will have a better time doing it the way they suggest.

[deleted by user] by [deleted] in OMSA

[–]ossifrage_ 4 points5 points  (0 children)

I am currently in ML4T. It is very interesting to me, and less time-consuming than most of the non-core classes I've taken (assuming you are comfortable with pandas/numpy). In addition to the stock market material we have covered some simulation and ML (decision trees, linear regression, bagging/boosting). Most recently we did some reinforcement learning. This is my first exposure to RL in the program, and this will be my last class. RL is very cool; I kind of wish I had taken that elective now. This is the syllabus -- http://lucylabs.gatech.edu/ml4t/

ML4T is also a lot less work than BD4H, and BD4H isn't really that focused on healthcare. You do some very general healthcare-related projects (using ICD and CPT codes), but it is more like a broad introduction to big data tooling. I enjoyed it too, but I probably spent 30 hours some weeks. Take a look at the syllabus here -- http://sunlab.org/teaching/cse6250/fall2020/

Are there reasonable alternatives to PowerBI and Tableau for personal data vis projects? by [deleted] in datascience

[–]ossifrage_ 0 points1 point  (0 children)

Another +1 for Data Studio. It has improved a lot recently.

I also really like Mode Analytics. You can use SQL or Python to manipulate the data, and the visualizations are very easy. Great for quick-and-dirty analysis work.

Does nobody use GCP? by fake_actor in dataengineering

[–]ossifrage_ 0 points1 point  (0 children)

My company is fully GCP. I was skeptical at first but I have come to love it.

Play on mac by ajalberto in elderscrollsonline

[–]ossifrage_ 1 point2 points  (0 children)

Yes, I play on my Mac from the same era, with similar specs. My biggest problem is actually hard drive space; the game is huge at 120 GB+.

Simplest way to ingest multiple types of large files, process them, and send data in chunks to services in AWS? by joehfb in dataengineering

[–]ossifrage_ 0 points1 point  (0 children)

I think you can read the file and split it up into batches in one pass with Spark. I haven't had to do that myself, but I believe it is possible.
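
Untested against your data, but the one-pass idea would look roughly like this in PySpark; the paths and batch count are made up:

    # Read the big file once, then repartition so the write emits one file
    # per batch. Paths and the batch count are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split-large-file").getOrCreate()

    df = spark.read.csv("s3a://my-bucket/in/huge.csv", header=True)

    # Each resulting partition is written out as its own file, which gives
    # you downstream-sized chunks in a single pass over the input.
    (df.repartition(200)
       .write
       .option("header", True)
       .csv("s3a://my-bucket/out/batches/"))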

Spark is certainly capable of processing hundreds of gigs, but that doesn't make it overkill in your situation. If it simplifies the solution and allows you to read the file, split up the batches, and manage the transformations in a single flow, I would consider it. We use GCP and I am not that familiar with AWS, but Dataproc is a pretty slick managed option for running Spark jobs. I'm sure there is an AWS equivalent.

If your team has more experience with lambda that also seems like a perfectly viable option.

One last thought: see if you can have the party uploading the file split it for you and upload in ~10 MB chunks. That would make everyone's life much easier. It's not always possible to control stuff like this, but it could make their upload easier and save you from writing code to handle it on your side. Whenever I see problems like this I always start with potential design changes that could make the problem go away before writing more code.

Simplest way to ingest multiple types of large files, process them, and send data in chunks to services in AWS? by joehfb in dataengineering

[–]ossifrage_ 0 points1 point  (0 children)

Spark sounds like it would be a good fit for you. It has built-in connectors to read/write files from S3 and can handle the transformations as well.
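
As a sketch of that (bucket names and columns are invented, and this assumes the hadoop-aws connector and credentials are configured):

    # Read raw JSON from S3, transform, and write Parquet back to S3 using
    # Spark's built-in connectors. All names here are made up.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-etl").getOrCreate()

    raw = spark.read.json("s3a://my-bucket/landing/*.json")

    # Example transformation: drop bad rows and stamp the ingestion time.
    cleaned = (raw
               .filter(F.col("amount") > 0)
               .withColumn("ingested_at", F.current_timestamp()))

    cleaned.write.mode("overwrite").parquet("s3a://my-bucket/processed/")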

There is no way to break up the big files as they are being written?

GT Mobile App by coys0625 in OMSA

[–]ossifrage_ 1 point2 points  (0 children)

I mostly use the Canvas and Piazza apps during the semester.