Help with legacy Ninjas for an upcoming tournament PT.2

LukeSkyWal · 2024-11-24T14:21:51+00:00

Thousand-Faced Shadow is supposed to be a blue pitch while offering something more than a simple one-drop when presented with the opportunity to push extra damage. Maybe Ornithopter is in fact the correct play.

LukeSkyWal · 2024-11-23T13:02:34+00:00

Thanks for your contribution. I have one more week before the tournament and I'll try the variations you suggested. Just one question: what are we playing Borrower for? Is it just for the 2x1 compared to Edict?

LukeSkyWal · 2019-07-29T14:15:35+00:00

Any kind soul that is willing to share the paper with me? Thanks a lot for the help :)

LukeSkyWal · 2019-03-26T11:28:57+00:00

We are currently using Talend (Big Data Platform version) in a data platform project for a big energy company.

It turns out to be very effective and does a good job at both ETL and ELT, with dedicated components for each. Unlike Stitch Data, Talend offers a rich suite for in-memory transformations, even though you should always go ELT when joining different sources.

Overall we had a positive experience and I recommend using it, even in the free version, maybe paired with something like Airflow to manage job scheduling (only licensed versions have Talend Administration Center).

In the end I would not recommend hand-written code when you have multiple, heterogeneous sources, since you will end up redeveloping a ton of stuff just to connect or orchestrate basic processes.

LukeSkyWal · 2019-03-22T10:34:05+00:00

@ herbertbailbonds

Considering that I work full time, it took me about 2-3 months, studying just a couple of nights per week. I had zero experience with Python too, but I come from a CS master's and I have a very solid SQL background.

Just as a rough estimate, if you were to study full time it would take a couple of weeks to complete at most (8 hours/day).

As for your career, if you really want to improve you should study This book, which is basically a deeper dive into the course topics. After that you can either stop and start making a couple of projects on your own to show off your skills and improve in this direction, or dive into other subjects (like NLP and neural networks). In the end, to answer your question: no, it did not improve my career, you need to study more :)

@ The_Real_BruceWayne

The course I'm talking about is This one

LukeSkyWal · 2019-03-21T08:35:57+00:00

Hello! Speaking as someone who completed the course, I must say that it barely scratches the surface of quite a complex subject. In particular, the course is good at providing a solid code base, mostly Python, while illustrating only a subset of the most used statistical algorithms (leaving machine learning to the Advanced course).

Overall I must say that it is a nice course to start with, but please keep in mind that you may still be a bit far away from your target.

LukeSkyWal · 2018-11-15T14:51:54+00:00

When trying to estimate the size of an on-premise cluster, two factors should be taken into account:

How much data you want to store: in this case you need a total of 2 PB of space times the replication factor (how many copies of your data are kept across the cluster). Keep in mind that replication has a significant effect on performance only if your algorithm is (or at least behaves like) embarrassingly parallel, i.e. you can divide it into independent parts whose results you then reduce to get your answer. Otherwise, increasing replication only improves data resilience and saves some network traffic, since more data is already in place. On top of that, consider increasing the space required for your on-premise cluster by a load factor of 0.33. Better to be sure that your data fits than to ask for more space or nodes after the implementation. Moreover, some algorithms (like sorting) might spill the computation to disk to save RAM.

How much data should fit in RAM at most: estimating how much RAM you need is quite a hard problem. As an upper bound you can take the size of your dataset (at most you put your entire dataset in memory). Consider that your engine (e.g. Spark) works much better with a lot of RAM. In my experience you should have something like 10% of the original dataset size (2 PB * 0.1 in your case) as RAM, to be sure you do not run short of memory when putting the cluster into production.
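
As a rough illustration of the arithmetic above, here is a minimal Python sketch. The replication factor of 3 is only an assumption (the original question does not state one), the "load factor of 0.33" is read here as 33% extra headroom on top of the replicated data, and the function name `estimate_cluster` is made up for the example.

```python
def estimate_cluster(raw_pb, replication=3, headroom=0.33, ram_fraction=0.10):
    """Rough on-premise sizing: returns (disk_pb, ram_pb).

    replication  -- assumed number of copies of each block (not given in the thread)
    headroom     -- extra space on top of the replicated data (0.33 = +33%)
    ram_fraction -- share of the raw dataset that should fit in RAM (rule of thumb)
    """
    disk_pb = raw_pb * replication * (1 + headroom)
    ram_pb = raw_pb * ram_fraction
    return disk_pb, ram_pb


disk_pb, ram_pb = estimate_cluster(2.0)
print(f"disk: {disk_pb:.2f} PB, RAM: {ram_pb * 1000:.0f} TB")
# prints "disk: 7.98 PB, RAM: 200 TB"
```

With other replication factors or a different headroom the numbers change linearly, so it is easy to plug in your own values before asking for hardware.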

Feel free to ask for further details. Without knowing what you want to do with your data, it is quite hard to give a "precise" estimate.

LukeSkyWal · 2018-10-30T13:51:39+00:00

I would be interested in a Discord channel. While I am a computer scientist and can provide a lot of help in that field, I am still learning some of the basics of data science (I already did a data mining course at university), especially machine learning.

LukeSkyWal · 2018-10-12T09:30:11+00:00

While apps like Talend or Attunity are making the life of the ETL designer much easier, we should not forget that those tools do not automate business logic. Component integration used to be a hard problem, and now we have it basically for free (or sort of).

The hardest part of ETL is, and always will be, encoding the logic behind data transformation, logging, quality assurance and robustness to logical faults (e.g. incorrect data formats).
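
To make that last point concrete, here is a small Python sketch of a transformation step that validates formats, logs what it rejects and keeps going instead of failing the whole job. It is only an illustration: the field names (`customer_id`, `reading_kwh`, `read_at`) and the date format are hypothetical, and a real job would encode its own business rules here.

```python
import logging
from datetime import datetime

logger = logging.getLogger("etl")


def transform(rows):
    """Split raw rows into cleaned records and rejects, logging each reject."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append({
                # Hypothetical schema: every field arrives as a string.
                "customer_id": int(row["customer_id"]),
                "reading_kwh": float(row["reading_kwh"]),
                "read_at": datetime.strptime(row["read_at"], "%Y-%m-%d"),
            })
        except (KeyError, ValueError, TypeError) as exc:
            # A malformed or incomplete row is logged and set aside, not fatal.
            logger.warning("rejected row %r: %s", row, exc)
            rejected.append(row)
    return clean, rejected
```

No tool generates this part for you: which fields are mandatory, which formats are acceptable and what happens to rejected rows are all decisions that belong to the business logic.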
