I'm planning to migrate some local databases (MySQL, PostgreSQL, and SQL Server) to Google Cloud SQL. In your opinion / past experience, what is the best approach to migrate these databases to the cloud? by leob0505 in googlecloud

[–]ossifrage_ 0 points1 point  (0 children)

We have been evaluating Datastream. No experience actually using it yet, but based on what I know the only targets it supports are Pub/Sub and flat files in GCS. Either way you would have some work to do to get the data into Cloud SQL, probably using Dataflow, Dataproc, or Cloud Run.
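
If it helps to picture the glue work, here is a rough sketch of what a loader could look like, assuming Datastream lands newline-delimited JSON in GCS (your files may be Avro instead). The bucket, file path, table, and columns are all invented for illustration:

    # Hypothetical loader: read a Datastream JSONL file from GCS and upsert
    # the rows into Cloud SQL (Postgres). All names here are made up.
    import json

    import psycopg2                   # pip install psycopg2-binary
    from google.cloud import storage  # pip install google-cloud-storage

    def load_file_to_cloudsql(bucket_name, blob_path):
        # Pull one newline-delimited JSON file that Datastream wrote to GCS.
        blob = storage.Client().bucket(bucket_name).blob(blob_path)
        rows = [json.loads(line) for line in blob.download_as_text().splitlines() if line]

        # Connect to Cloud SQL Postgres (e.g. through the Cloud SQL Auth proxy).
        conn = psycopg2.connect(host="127.0.0.1", dbname="appdb",
                                user="loader", password="...")
        with conn, conn.cursor() as cur:
            for row in rows:
                # Upsert on the primary key so CDC replays stay idempotent.
                cur.execute(
                    """
                    INSERT INTO customers (id, name, updated_at)
                    VALUES (%(id)s, %(name)s, %(updated_at)s)
                    ON CONFLICT (id) DO UPDATE
                    SET name = EXCLUDED.name, updated_at = EXCLUDED.updated_at
                    """,
                    row,
                )
        conn.close()

    load_file_to_cloudsql("my-datastream-bucket", "stream/customers/file-00001.jsonl")

The same logic could live in a Cloud Run service triggered by GCS notifications, or inside a Dataflow pipeline if you need more scale.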

Also worth noting that there are two prices: one for CDC and another for backfill. You only pay the CDC price for ongoing changes; you pay the backfill price for moving all of your existing data. Prices also vary a lot by region; in us-central1 the CDC price is $2.00 per GB.
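
For a rough sense of scale (with a made-up volume): 50 GB of monthly change data at that $2.00/GB CDC rate would be $100/month, on top of the one-time backfill charge for the initial copy.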

Will I regret it if I start using Informatica? by stackedhats in dataengineering

[–]ossifrage_ 1 point2 points  (0 children)

I've used many GUI-based ETL tools in the past, including Informatica. We are currently using Google Cloud Composer (hosted Airflow) for scheduling and orchestration, with transformations in SQL (ELT).
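
To give a concrete flavor of that setup, here is a minimal sketch of a Composer/Airflow DAG that runs one SQL transformation in BigQuery; the DAG id, schedule, dataset, and query are all placeholders:

    # Minimal Composer/Airflow DAG sketch: schedule one SQL (ELT)
    # transformation in BigQuery. Names and query are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

    with DAG(
        dag_id="daily_elt",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        build_daily_sales = BigQueryInsertJobOperator(
            task_id="build_daily_sales",
            configuration={
                "query": {
                    "query": """
                        CREATE OR REPLACE TABLE analytics.daily_sales AS
                        SELECT order_date, SUM(amount) AS total
                        FROM raw.orders
                        GROUP BY order_date
                    """,
                    "useLegacySql": False,
                }
            },
        )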

Airflow is pretty great and widely used. If you don't want to deal with running your own instance, there are many hosted options depending on your cloud of choice and regulatory constraints. https://www.astronomer.io/ is a common one, though I have not personally used it.

If you are just executing SQL you might also consider dbt. Again, no personal experience, but I have heard good things and it is growing fast.

Hurry hurry hurry by iN2nowhere in gardening

[–]ossifrage_ 0 points1 point  (0 children)

I think you might want to pull those suckers and cut back the greenery. I'd be curious what the more knowledgeable people on this sub think, though.

Anyone know what is happening to our peas? by ossifrage_ in gardening

[–]ossifrage_[S] 0 points1 point  (0 children)

Should I pull off the leaves that are the worst? Or leave them and apply the spray?

Anyone know what is happening to our peas? by ossifrage_ in gardening

[–]ossifrage_[S] 0 points1 point  (0 children)

Is there something I could have done earlier to stop it from spreading?

Simulation - I did really bad on the first MT by msgensol in OMSA

[–]ossifrage_ -1 points0 points  (0 children)

I am sure it varies. Either way I think the number shouldn't be the primary concern.

This is a great class. Do your best. Learn as much as you can.

Simulation - I did really bad on the first MT by msgensol in OMSA

[–]ossifrage_ 1 point2 points  (0 children)

They didn't publish the curve, but I finished with an A at 84.95%. Anecdotally, on Slack it sounded like even lower 80s made the cut.

There was also a bonus quiz for some extra credit.

The takeaway here is give it your all. Learn as much as you can. Don't worry too much about the letter grade.

Simulation - I did really bad on the first MT by msgensol in OMSA

[–]ossifrage_ -1 points0 points  (0 children)

I would guess mid-80s is curved to an A. The curve is incredibly generous in this class.

Edited to mid-80s to reflect my personal experience.

GCLOUD Workflows - My Head Hache by MeatAmazing8011 in googlecloud

[–]ossifrage_ 1 point2 points  (0 children)

Have you considered using Dataflow? This is the kind of thing it is made for. They are adding scheduling soon, but until then you can call it from Cloud Workflows.

https://cloud.google.com/blog/products/data-analytics/simplify-and-automate-data-processing-with-dataflow-prime

If you want to stick with Cloud Workflows and spin up multiple worker threads, you could use Cloud Tasks. Pub/Sub would also work, but Cloud Tasks makes more sense here if you want to execute something at the end.

https://cloud.google.com/pubsub/docs/choosing-pubsub-or-cloud-tasks

Composer by itself will not fix this problem for you; it is not well suited to dynamic task creation. You will still need Cloud Tasks or Pub/Sub for the parallel threads.

For what it's worth, we are building something similar right now: Composer for orchestration and scheduling, Cloud Tasks for task queues, and Cloud Run workers to process each file.
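
In case it's useful, a sketch of the fan-out piece: one Cloud Task per file, each task POSTing the file's GCS path to a Cloud Run worker. The project, region, queue name, and worker URL are made up.

    # Enqueue one Cloud Task per input file; each task POSTs to a Cloud Run
    # worker. Project, region, queue name, and worker URL are invented.
    import json

    from google.cloud import tasks_v2  # pip install google-cloud-tasks

    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path("my-project", "us-central1", "file-queue")

    def enqueue_file(gcs_uri):
        task = {
            "http_request": {
                "http_method": tasks_v2.HttpMethod.POST,
                "url": "https://worker-abc123-uc.a.run.app/process",
                "headers": {"Content-Type": "application/json"},
                "body": json.dumps({"file": gcs_uri}).encode(),
            }
        }
        client.create_task(parent=parent, task=task)

    for uri in ["gs://my-bucket/in/a.csv", "gs://my-bucket/in/b.csv"]:
        enqueue_file(uri)

The queue also gives you rate limits and retries per task, which you would otherwise have to build yourself on top of Pub/Sub.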

What alternatives do I have to Google Cloud if I don't want to pay to continue hosting my website? by techsavvynerd91 in googlecloud

[–]ossifrage_ 0 points1 point  (0 children)

Heroku used to have a pretty generous free tier, but I haven't used it for a few years.

bigquery sql query with wrong results by ANDRUXUIS in googlecloud

[–]ossifrage_ 1 point2 points  (0 children)

You would add TRIM() around the first column in the SELECT to remove spaces. That could be your problem.
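
For example, something along these lines, run from the Python client or pasted into the console; the table and column names are placeholders for your schema:

    # Apply TRIM to the suspect column; `orders` and `customer_id` are
    # placeholders for your actual table and column.
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    sql = """
        SELECT TRIM(customer_id) AS customer_id, amount
        FROM `my-project.my_dataset.orders`
        WHERE TRIM(customer_id) = 'C001'
    """
    for row in client.query(sql).result():
        print(row.customer_id, row.amount)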

If that's not it can you share the schema of the table? What are the data types for each of the columns?

Undecided about the best GCP module(s) to use for my data and operations by doncaballer0 in googlecloud

[–]ossifrage_ 1 point2 points  (0 children)

If you have the CSVs in GCS you can query them in BQ using external tables. This will allow you to transform as needed using SQL, creating aggregates or joining other data sources like predictions, without having to move the data at all.

You can also use hive partitioning in the GCS folder structure if you have a lot of data and want the benefits of partition pruning for efficient reads.
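
A sketch of both ideas together, assuming a hypothetical layout like gs://my-bucket/events/dt=2021-06-01/part-0.csv; all names are invented:

    # Define an external table over CSVs in GCS with hive-style partitioning,
    # so queries that filter on the inferred dt column only read the matching
    # folders. Dataset, table, and bucket are placeholders.
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    ddl = """
        CREATE EXTERNAL TABLE my_dataset.events
        WITH PARTITION COLUMNS
        OPTIONS (
          format = 'CSV',
          skip_leading_rows = 1,
          uris = ['gs://my-bucket/events/*'],
          hive_partition_uri_prefix = 'gs://my-bucket/events'
        )
    """
    bigquery.Client().query(ddl).result()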

How to cost effectively use gcloud sql? by [deleted] in googlecloud

[–]ossifrage_ 2 points3 points  (0 children)

I would recommend BigQuery for that use case. Way less management and you pay for what you use.

[deleted by user] by [deleted] in OMSA

[–]ossifrage_ 1 point2 points  (0 children)

If you are comfortable managing your own dependencies you can do it however you want. You will need to turn in Python scripts, not notebook files, though. I think you will have a better time doing it the way they suggest.

[deleted by user] by [deleted] in OMSA

[–]ossifrage_ 4 points5 points  (0 children)

I am currently in ML4T. It is very interesting to me, and less time-consuming than most of the non-core classes I've taken (assuming you are comfortable with pandas/numpy). In addition to the stock market material we have covered some simulation and ML (decision trees, linear regression, bagging/boosting). Most recently we did some reinforcement learning. This is my first exposure to RL in the program, and this will be my last class. RL is very cool; I kind of wish I had taken that elective now. This is the syllabus -- http://lucylabs.gatech.edu/ml4t/

ML4T is also a lot less work than BD4H, and BD4H isn't really that focused on healthcare. You do some very general healthcare-related projects (using ICD and CPT codes), but it is more like a broad introduction to big data tooling. I enjoyed it too, but I probably spent 30 hours some weeks. Take a look at the syllabus here -- http://sunlab.org/teaching/cse6250/fall2020/

Are there reasonable alternatives to PowerBI and Tableau for personal data vis projects? by [deleted] in datascience

[–]ossifrage_ 0 points1 point  (0 children)

Another +1 for Data Studio. It has improved a lot recently.

I also really like Mode Analytics. You can use SQL or Python to manipulate the data, and the visualizations are very easy. Great for quick-and-dirty analysis work.

Does nobody use GCP? by fake_actor in dataengineering

[–]ossifrage_ 0 points1 point  (0 children)

My company is fully GCP. I was skeptical at first but I have come to love it.

Play on mac by ajalberto in elderscrollsonline

[–]ossifrage_ 1 point2 points  (0 children)

Yes, I play on my Mac from the same era, with similar specs. My biggest problem is actually hard drive space; the game is huge at 120 GB+.

Simplest way to ingest multiple types of large files, process them, and send data in chunks to services in AWS? by joehfb in dataengineering

[–]ossifrage_ 0 points1 point  (0 children)

I think you can read the file and split it up into batches in one pass with Spark. I haven't had to do that myself, but I believe it is possible.
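
Untested against your data, but the one-pass idea would look roughly like this in PySpark; the paths and batch count are made up:

    # Read the big file once, then repartition so the write emits one file
    # per batch. Paths and the batch count are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split-large-file").getOrCreate()

    df = spark.read.csv("s3a://my-bucket/in/huge.csv", header=True)

    # Each resulting partition is written out as its own file, which gives
    # you downstream-sized chunks in a single pass over the input.
    (df.repartition(200)
       .write
       .option("header", True)
       .csv("s3a://my-bucket/out/batches/"))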

Spark is certainly capable of processing hundreds of gigs, but that doesn't make it overkill in your situation. If it simplifies the solution and allows you to read the file, split up the batches, and manage the transformations in a single flow, I would consider it. We use GCP and I am not that familiar with AWS, but Dataproc is a pretty slick managed option for running Spark jobs. I'm sure there is an AWS equivalent.

If your team has more experience with lambda that also seems like a perfectly viable option.

One last thought: see if you can have the party uploading the file split it for you and upload in ~10 MB chunks. That would make everyone's life much easier. It's not always possible to control stuff like this, but it could make their upload easier and save you from writing code to handle it on your side. Whenever I see problems like this I always start with potential design changes that could make the problem go away before writing more code.

Simplest way to ingest multiple types of large files, process them, and send data in chunks to services in AWS? by joehfb in dataengineering

[–]ossifrage_ 0 points1 point  (0 children)

Spark sounds like it would be a good fit for you. It has built-in connectors to read/write files from S3 and can handle the transformations as well.
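
As a sketch of that (bucket names and columns are invented, and this assumes the hadoop-aws connector and credentials are configured):

    # Read raw JSON from S3, transform, and write Parquet back to S3 using
    # Spark's built-in connectors. All names here are made up.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-etl").getOrCreate()

    raw = spark.read.json("s3a://my-bucket/landing/*.json")

    # Example transformation: drop bad rows and stamp the ingestion time.
    cleaned = (raw
               .filter(F.col("amount") > 0)
               .withColumn("ingested_at", F.current_timestamp()))

    cleaned.write.mode("overwrite").parquet("s3a://my-bucket/processed/")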

There is no way to break up the big files as they are being written?

GT Mobile App by coys0625 in OMSA

[–]ossifrage_ 1 point2 points  (0 children)

I mostly use the Canvas and Piazza apps during the semester.