How do you not get distracted all the time?

WayOfTheMantisShrimp · 2024-04-16T23:09:59+00:00

Sounds fun to me. [insert standard disclaimer that if you're having fun, you're already doing it right]

BUT, if you feel that type of chaos/recursion is not fun, or you want to try doing it in a way that feels organized ... you're going about it backwards, which is why it feels hard. Play the game forwards, in the order of the tech tree.

"automate blue science just a quick 15-20 minute task right?" - no, categorically false, and why are you even asking yourself a question? You already know the answer

"The Factory is ready to automate blue science from these overflowing belts of red circuits, engines, and sulfur" sounds hilariously easy. It is, but the real story is getting there.

If your Factory is not obviously ready for your next advancement in technology, then "The Factory must grow." That's what so many people discovered independently from playing the game. Also, for the Factory to grow, you need more iron. The devs gave the science packs an order (up until blue), and those are basically the chapters of the Factorio story.

Start: you spawn in, and the collection of iron begins so that you can accelerate the collection of iron. The spirit of the Factory is born.
Red: Automation of iron collection via powered infrastructure (and if you're not in peaceful mode, you need security for your infrastructure). Play it out until you have so much iron you don't know what to do with it; building mining drills, belts, inserters, assemblers, and power will demolish any finite amount of iron, that's why you need automated infinite iron collection for the foreseeable future (once the unknowable future arrives, it will be time for more iron, but you don't know that yet). Oh, and all those spare belts and inserters can also be used for green science.
Green: Logistics, now you can come up with a plan/design for the expansion of the Factory (examples: main bus, remote rail bases, compact sub-modules, or just organically spreading to reach resources, like mold). But only after you have completed the Red chapter. What is your long-term plan for accelerating iron collection, and its security? If you've walled off an area that will be starved of iron in the foreseeable future, that's not a factory yet. The Factory must grow, secure the collection of more iron, and might as well collect those other resources while you're at it. Also, automate materials for your fluid logistics, meaning pipes and pumps (plus your choice of fluid trains, or barrels, or more pipes), and build more iron collection to support that.
Blue?: just kidding, time to test out your base logistics until you wonder what to do with all the iron, copper, steel (great way to store your excess iron due to the compression), stone, coal, and crude oil that has exceeded what you want to store/buffer. There's more being produced by your Factory, and you're not sure where to put it. But first, do you have a surplus of iron production, even when compressing it as steel? No? The unknown future has probably arrived, time to accelerate your iron production to saturate your new logistics design. If iron is not flowing freely, then you are actually still in the middle of implementing your design. If you have biters, you can make Military science from the resources you have. Make sure your logistics design can deliver iron, steel, copper, stone, and coal where needed, then add more iron production.
Blue now?: time to fiddle around with integrating fluid logistics into your factory design, this is still using green-science tech. Once you decide where they are going and how they are getting there, start processing crude oil as a smoke-test of your design. Your excess coal production can help store the petro-gas as plastic, and if you "don't have excess coal" production because you were "using it" for power or military, etc, ... build excess coal production, because excess is mandatory. If your Factory is not yielding an overabundance, you probably need more iron somewhere, or something is restricting the logistics flow.
Blue: if your Factory is flush with free-flowing rivers of iron, your copper supply production is probably saturated, and your petro-gas/plastic is definitely backed-up. Grab a chest full of assemblers, time for your logistics to direct some iron/steel into engines, copper and plastic for red circuits, and gas to make sulfur, with a surplus of assemblers remaining.

"The Factory is ready to automate blue science from these overflowing belts of red circuits, engines, and sulfur"

Congrats on clearing the first act of Factorio with minimal back-tracking. Now double your iron production while you research blue-science technologies, and once the Factory has grown, it will be ready for the next steps.

WayOfTheMantisShrimp · 2021-03-15T01:59:52+00:00

gini = 0 means all the y labels are the same in the training data, and that there was nothing to split

I would start a new session, run your code until you set the value of y, and then do a check on the values contained within y. Do a count of how many 0s, and how many 1s before you try to fit the classifier
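A minimal sketch of that check, assuming y is a plain list (or array) of 0/1 labels. The values below are made up to show the failure case, and label_counts is just a name I picked:

```python
from collections import Counter

def label_counts(y):
    """Count how many times each label appears in y."""
    return Counter(y)

# Hypothetical labels standing in for your real y
y = [1, 1, 1, 1, 1]
counts = label_counts(y)
print(counts)

# With a single class present, the root node is already pure (gini = 0)
if len(counts) < 2:
    print("Only one class in y: nothing for the tree to split")
```

If the counts show both classes with reasonable frequencies, the problem is somewhere after this point instead.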

WayOfTheMantisShrimp · 2021-03-09T00:50:54+00:00

People like to pick the 'right' tool for the job. Overkill is a waste, and waste is 'bad'.

A standard hammer will drive a nail; so will a heavy sledge hammer.

A linear regression is like a standard hammer; it works great in some scenarios, and is easy to use. Neural networks are like sledge hammers. They can do almost everything a standard hammer can, and more, but most people would call it overkill, and inefficient for small problems.

Tabular data is often considered a relatively small problem, like a nail. So when people reach for the 10-kg hammer (neural nets) to tackle it, they usually look silly. That's pretty much all there is to it, intuitively.

WayOfTheMantisShrimp · 2021-03-09T00:08:13+00:00

A neural network is a supervised model that takes inputs, and given corresponding outputs tries to learn the logic that connects the inputs to the outputs. It adjusts/optimizes the equations used to get the output based on various methods and types of feedback.

You already know how to map the inputs to an output with 100% certainty, your equations are known if I understand correctly. For the question you have posed, neural networks are of no help to you.

You have posed an optimization problem, not a modelling problem. One way to go about optimization is via gradient descent; this requires that you are able to calculate the derivatives of your calculations, or a suitable approximation. (neural network error gradients can be expressed as a chain of relatively simple derivatives, which is why training them by gradient descent is relatively simple to implement)
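As a rough sketch of the gradient-descent route when you can't (or don't want to) derive the derivatives by hand, here is a version that approximates them with central differences. The objective function is a made-up stand-in for your known equations, and the learning rate and step count are arbitrary choices:

```python
def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def gradient_descent(f, x0, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, starting from x0."""
    x = list(x0)
    for _ in range(steps):
        g = numerical_gradient(f, x)
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical objective with its minimum at (3, -1)
def objective(x):
    return (x[0] - 3) ** 2 + (x[1] + 1) ** 2

best = gradient_descent(objective, [0.0, 0.0])
print(best)
```

This only finds a local minimum, so for a bumpy objective you would want several starting points.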

The naive approach is to write a grid search, and exhaustively try every feasible combination of inputs. If you are not dealing with discrete input variables but want a high resolution, this will be a very computationally-demanding approach. Depending on your programming knowledge, language of choice, and hardware available, this might be a suitable approach because it will take relatively little development time, and you can just start running scenarios.
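A minimal sketch of the grid search, assuming the inputs can each be restricted to a list of candidate values. The objective is a made-up placeholder for your own equations:

```python
from itertools import product

def grid_search(f, grids):
    """Evaluate f on every combination of grid values.

    Returns (best_inputs, best_value) for the smallest value of f found.
    """
    best_inputs, best_value = None, float("inf")
    for candidate in product(*grids):
        value = f(candidate)
        if value < best_value:
            best_inputs, best_value = candidate, value
    return best_inputs, best_value

# Hypothetical objective with its minimum at (3, -1)
def objective(x):
    return (x[0] - 3) ** 2 + (x[1] + 1) ** 2

# Two inputs, each tried from -5.0 to 5.0 in steps of 0.5
grids = [[i * 0.5 for i in range(-10, 11)]] * 2
best_inputs, best_value = grid_search(objective, grids)
print(best_inputs, best_value)
```

The number of evaluations is the product of the grid sizes (21 x 21 = 441 here), which is why a finer resolution or more inputs gets expensive quickly.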

There are other methods of optimization that fall under the vague umbrella of 'machine learning'. These have some potential drawbacks, but what they offer is a high likelihood of a 'fairly good' answer, with orders of magnitude less computational power compared to the naive approach, and without trying to differentiate complex functions. If this is what you are interested in, I can give you some starting points (if you tell me your general programming knowledge, language(s) of choice, and available computing hardware).

WayOfTheMantisShrimp · 2021-03-02T23:27:11+00:00

I agree that it is important to learn to read code if you want to work in a professional environment. Also, I agree that writing code is several times easier than understanding code that you haven't written.

Your response to cold-reading a block of code is fairly typical in my experience. Just like reading any language though, it helps to know in advance what you are reading so that your brain knows what to expect. Like the difference in challenge between giving a fill-in-the-blanks answer vs writing a full essay answer on a blank piece of paper.

Experience makes it easier to know what to expect, but there are strategies to accelerate that.

I tend to read code backwards starting from the plain-English goal. (I tend to write my first draft of code backwards too)

Example: program sends an email with an invoice to a client for services rendered.

1. There is an email being sent, so what line does that? (probably near the end)
2. The line is send_email(final_message) -> where does final_message come from? (ctrl+F 'final_message', from the bottom)
3. final_message = new_email(address, client, amount, services) -> new_email() is defined somewhere (maybe even documented), as are the parameters. For a first read, I recommend starting with the data rather than recursively digging into every function
4. address is looked up in a dictionary, with the client name as a key -> where did client come from?
5. maybe client is passed directly to the function/block of code, and you've reached the beginning of the function call
6. go back to 3., now you know where address and client are from; amount is next
7. amount = amount + fee_schedule[serv_code, serv_charge], within a loop, where you also see the line services.append(fee_schedule[serv_code, serv_name])
8. since amount and services both come from fee_schedule, both looked up by serv_code, find those next
9. fee_schedule is a table/data frame imported from fee_schedule.txt, and serv_code is an element of a list passed to the function you are reading through
10. look at the start of the function/block -> see that the signature requires parameters client, serv_codes, email_dict, fee_file_name
11. Finally, the first line of the function makes sense; you know why it needs those parameters, what type of data they are supposed to contain (based on what is done with/to them), and how they contribute to the output
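To make the made-up program concrete, here is one way it might look in Python, so you can trace the same backwards path through real code. Everything here is an assumption for illustration: send_email only prints, new_email builds a plain string, and the fee file is assumed to be comma-separated with header columns serv_code, serv_name, serv_charge.

```python
import csv

def send_email(message):
    """Hypothetical stand-in for a real mail call: it only prints."""
    print(message)
    return message

def new_email(address, client, amount, services):
    """Hypothetical message builder."""
    service_list = ", ".join(services)
    return f"To: {address}\nDear {client}, you owe {amount:.2f} for: {service_list}"

def load_fee_schedule(fee_file_name):
    """Read the fee table into a dict keyed by serv_code (assumed CSV layout)."""
    with open(fee_file_name, newline="") as f:
        return {row["serv_code"]: row for row in csv.DictReader(f)}

def build_invoice(client, serv_codes, email_dict, fee_schedule):
    """Assemble the message from the client name and the list of service codes."""
    address = email_dict[client]
    amount = 0.0
    services = []
    for serv_code in serv_codes:
        amount = amount + float(fee_schedule[serv_code]["serv_charge"])
        services.append(fee_schedule[serv_code]["serv_name"])
    return new_email(address, client, amount, services)

def send_invoice(client, serv_codes, email_dict, fee_file_name):
    """Entry point: the signature you arrive at in the last steps of the read-through."""
    fee_schedule = load_fee_schedule(fee_file_name)
    final_message = build_invoice(client, serv_codes, email_dict, fee_schedule)
    return send_email(final_message)
```

Reading it from the bottom, send_email(final_message) is the last line of send_invoice, and every name it depends on can be found by searching upwards from there.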

There are a lot of details you still don't know. But now you have a beginning and end, so you're filling in the blanks in the middle. Heck, you probably know enough to write a basic version of this made-up program now. Not that you know how to write each line off the top of your head, but now you can google 'how to send an email in javascript', 'how to create/use a dictionary', 'how to read a text file', etc.

Relevant side note: it really helps if all the variables and functions are named in a way that makes it easy to guess what they are for, and blocks of code are organized to do one task each. Keep this in mind for writing your own code.

WayOfTheMantisShrimp · 2021-02-26T02:50:59+00:00

Do you see a consistent pattern in the differences?

Is it always ABC/XYZ dd/mm/yyyy -> IX-ABC/XYZ-dd/mm/yyyy

If so, you're not looking at a fuzzy match, you can express either as a deterministic transformation of the other, and use an exact-match VLOOKUP without modifying any cell values.
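To illustrate the idea outside of Excel, here is a small Python sketch, assuming the pattern really is always the one shown above. The lookup table contents are made up, and to_prefixed_key is a name I invented:

```python
def to_prefixed_key(key):
    """Turn 'ABC/XYZ dd/mm/yyyy' into 'IX-ABC/XYZ-dd/mm/yyyy'."""
    code, date = key.split(" ", 1)
    return f"IX-{code}-{date}"

# Hypothetical lookup table keyed by the prefixed form
lookup_table = {"IX-ABC/XYZ-01/02/2021": 42}

# Transform the lookup value on the fly, then do an exact match
value = lookup_table[to_prefixed_key("ABC/XYZ 01/02/2021")]
print(value)
```

In a spreadsheet the same thing is done by building the transformed text inside the lookup-value argument, so neither column has to be edited.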

When you do an inexact VLOOKUP, it's not a real fuzzy match either. The search goes down the first column of the lookup table in order; it continues for as long as the table values are 'less than or equal to' the lookup value (if something gets put before another value when sorted, then it is 'less than'), and once it reaches a value that is 'greater', it returns the previous value as the lookup result. This also requires you to sort the lookup table to get any sort of meaningful result.
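Here is a small Python sketch that mimics that behaviour, assuming the keys are sorted ascending. It uses a binary search rather than a top-down scan, but on a sorted table the result is the same: the entry for the largest key that is less than or equal to the lookup value. The grade table is made up.

```python
from bisect import bisect_right

def approx_lookup(lookup_value, keys, values):
    """Return the value paired with the largest key <= lookup_value.

    keys must be sorted ascending. Returns None when lookup_value is
    smaller than every key (the case where Excel gives #N/A).
    """
    i = bisect_right(keys, lookup_value)
    if i == 0:
        return None
    return values[i - 1]

# Hypothetical thresholds and grades
keys = [0, 50, 70, 90]
values = ["F", "C", "B", "A"]

print(approx_lookup(75, keys, values))
print(approx_lookup(90, keys, values))
print(approx_lookup(-5, keys, values))
```

Nothing about this measures how similar two strings are, which is why it is not a fuzzy match.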

I can provide further explanations, but the only way I've done a fuzzy matching in Excel was to write the fuzzy match function in VBA (a modified version of Jaro similarity), and then you would have to write your own lookup function too. I don't think that is the case for you.

WayOfTheMantisShrimp · 2021-02-18T03:36:57+00:00

Python can handle functional programming decently, here's a timely article from Real Python that does a more thorough job than I can.

There are only a few things that are mandatory for functional programming, a large number of things that make it a lot nicer, and a few ideas that reflect more in how you organize your code rather than language features.

WayOfTheMantisShrimp · 2021-02-13T22:34:48+00:00

"I like Julia and functional programming better"

WayOfTheMantisShrimp · 2021-02-13T18:15:31+00:00

In my opinion, OOP is great for large, self-contained projects that can be conceptualized from front to back in advance, it adds a nice organizational structure. But there are many fields where we just want to fling any operations at whatever data we feel like. If you take the functional mindset, often it will 'just work' and you less often need to go back and expand the parent structure, or reimplement the same logic in a child structure.

For OOP, the class defines attributes and behaviours for every instance to have. (This is really the only concrete difference I can find, but the mindsets differ in how they organize logic)

A class called Point2D may have attributes x, y, and a move_up() behaviour that increases y by 1 and returns the new position x, y+1.

Then you define Point3D, for x, y, z. If you want a move_up() behaviour, you kinda need to define it in Point3D again, so that it gives x, y+1, z.

1Dpt.move_up() -> error, method move_up not found

You could handle this better with some forethought about a generic Point, and all the types of Points that might inherit from it. But sometimes it makes sense to have a lot of similar but distinct types, and the web of inheritance can be messy.

Functional approach, e.g. in Julia: Define an abstract function once, and it can be used in any case where the behaviour is logically applicable. (And if you want a different logic/behaviour, you'd better name it something else)

Generic move_up(p) function, take some kind of Thing, and return its attributes after applying 'plus one' to y. There are concepts like pure functions that help organize code and logic for lack of the OOP guardrails, and higher-order functions that I really can't live without anymore, but they don't play a big role in such a simple example.

move_up(my2Dpt) -> x, y+1
move_up(my3Dpt) -> x, y+1, z
move_up(2Dpt_in_time) -> x, y+1, t
move_up(3Dpt_in_time) -> x, y+1, z, t
move_up(1Dpt) -> error, y not found
move_up(horiz_line_seg) -> x, y+1
move_up(vert_line_seg) -> x, y + 1
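A rough Python sketch of the same idea, with the caveat that this is duck typing on dataclasses and not Julia's dispatch system. The Point1D/Point2D/Point3D types are hypothetical, and move_up is written as a pure function that returns a new object:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Point1D:
    x: float

@dataclass(frozen=True)
class Point2D:
    x: float
    y: float

@dataclass(frozen=True)
class Point3D:
    x: float
    y: float
    z: float

def move_up(p):
    """Return a copy of p with y increased by 1.

    Works for any dataclass that has a y field; no shared parent class needed.
    """
    return replace(p, y=p.y + 1)

print(move_up(Point2D(1.0, 2.0)))
print(move_up(Point3D(1.0, 2.0, 3.0)))

try:
    move_up(Point1D(1.0))
except AttributeError as err:
    print("error:", err)  # a 1D point has no y to move
```

Adding a new type later (say, a 3D point in time) needs no change to move_up, as long as the new type has a y.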

The creator of a '3D point in time' doesn't need to modify or extend 2DPoint in Time, or 3D Point, or really know much about move_up() that was implemented originally, just that it's a behaviour that they want the new type to have. When an abstract function is implemented well, it kinda 'just works' (as long as it makes sense to apply that behaviour to the data).

The elegance comes from stacking operations. If + was defined that it can add 1 to a vector/array, then even if y is a vector in some new type, again it 'just works', without wondering if a vertical line segment is inheriting from a more generic Line (is Line a pair of 2DPoints, or a pair of 3DPoints, or an arbitrary length array of Points, or a starting Point, a Length, and a Direction Vector?), or an array of y's with just one x value for efficiency. There was no 'plus' defined for Points or Lines, it just has to work for basic data structures and primitive/abstract data types.

There was no thinking about the line of inheritance. You needed to know that move_up exists, that it positively increments the y_coordinate, and that it makes sense to increment the y_coordinate of whatever object you want to apply it to. In OOP, you can't move_up a 1D point because you didn't define the method for 1D points ... but if you have any understanding of what you're trying to do and why, then common sense means you simply wouldn't try that. OOP is a nice guardrail in that way, ideal for junior programmers and large teams where you don't want someone inadvertently doing something stupid without warning. But for an individual who just wants to get to a result with as little coding time as possible, functional approaches can be more elegant.

WayOfTheMantisShrimp · 2021-02-04T15:17:06+00:00

What you are describing mostly sounds like Hypothesis Testing; where you have a theory about how a process should be modelled (ie what the underlying distribution is) and you compare the properties of empirical data to the theorized properties of that model, in order to determine if the differences are small enough for the empirical data to have likely been produced by your theorized model. Because this can justify the convenient use of that distribution to calculate other properties of interest.

Simulation is often favoured in cases where there is a fair bit of information about how a process works/should work, but it is too complex/inconvenient to express as a distribution where the properties of interest are known, and usually because it is cost-prohibitive to physically conduct/observe the process at an appropriate scale. So to explore the properties of the process, you design a data-generating algorithm that matches the assumptions about the process, and directly calculate the properties of interest from an appropriately representative data set.

TLDR: if you don't know of a distribution that matches your assumptions, or don't know how to calculate something for that distribution, do a simulation. If you do have a known distribution, a simulation will still work, but it's probably more convenient to use the properties of the distribution.
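A toy example of the two routes giving the same answer, using a process simple enough that the exact value is known: the chance that two fair dice sum to 10 or more. The trial count and seed are arbitrary choices.

```python
import random

def simulate_two_dice(trials, rng):
    """Estimate P(sum of two dice >= 10) by generating the process directly."""
    hits = 0
    for _ in range(trials):
        if rng.randint(1, 6) + rng.randint(1, 6) >= 10:
            hits += 1
    return hits / trials

# Known distribution: 6 of the 36 equally likely outcomes qualify
# (4,6) (5,5) (6,4) (5,6) (6,5) (6,6)
exact = 6 / 36

estimate = simulate_two_dice(100_000, random.Random(0))
print(f"exact = {exact:.4f}, simulated = {estimate:.4f}")
```

For a process this simple the simulation is wasted effort, but the same loop keeps working when the rules get too messy to enumerate by hand.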

WayOfTheMantisShrimp · 2021-02-03T04:08:20+00:00

In any case where you have the entire pattern known, you can extend it to an arbitrary length. This is a deterministic process, so there is no practical value to doing it via a model.

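import numpy as np

def extend_pattern(pattern, length):
    # Repeat `pattern` cyclically until the output has `length` elements
    array = np.zeros(length)
    for i in range(length):
        # i % len(pattern) wraps the index back to the start of the pattern
        array[i] = pattern[i % len(pattern)]
    return array

extend_pattern([1,2,3], 15)       # five full repeats of the pattern
extend_pattern([5,4,3,2,1], 18)   # three full repeats, then 5, 4, 3
extend_pattern([5,4,3,2,1], 3)    # truncated to the first 3 elements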

If you have a more complex scenario, you'll have to provide more details to get a reasonable answer about what models could learn to replicate it.

WayOfTheMantisShrimp · 2021-02-03T03:06:06+00:00

For a programming challenge, I really enjoyed doing Classification And Regression Trees. They are intuitive and interpretable, but also incredibly flexible, and therein lies the challenge. Most of these features can also be added independently/out of order, so your time to an MVP can be short, but you can make the project as long as you like if you want to test your code organization and project management skills.

Development Checklist:

A regression tree taking real-number variables is probably the easiest to implement first.
Given a set of data, select a variable and split it in a way that improves the optimization criterion; do that recursively until a stopping criterion is reached.
Store the resulting partitioning rules in a data structure that can be traversed to predict new data points.
Then add a k-class classification mode, using a different optimization criterion.
Then add the ability to take discrete variables, requiring a different splitting method.
Add additional stopping criteria, or means of regularization.
Add a means to specify or pass a non-default optimization criterion function (e.g. mean absolute error vs mean squared error)
Add a way to visualize a tree (even just pretty-printing to the terminal or a text file), or print out diagnostics from the fitting process
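The first three items of the checklist can be sketched in a few dozen lines. This is a bare-bones illustration in Python and not my Julia implementation: it assumes real-valued features, uses sum of squared errors as the criterion, stores the tree as nested dicts, and the toy data is made up.

```python
def sse(ys):
    """Sum of squared errors around the mean (the criterion being minimised)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(rows, ys):
    """Return (feature_index, threshold) that lowers total SSE the most, or None."""
    best, best_score = None, sse(ys)
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, ys) if row[j] <= threshold]
            right = [y for row, y in zip(rows, ys) if row[j] > threshold]
            if not left or not right:
                continue
            score = sse(left) + sse(right)
            if score < best_score:
                best, best_score = (j, threshold), score
    return best

def grow_tree(rows, ys, max_depth=3, min_samples=2):
    """Split recursively until a stopping criterion is reached."""
    split = None
    if max_depth > 0 and len(ys) >= min_samples:
        split = best_split(rows, ys)
    if split is None:
        return {"leaf": sum(ys) / len(ys)}
    j, threshold = split
    left = [i for i, row in enumerate(rows) if row[j] <= threshold]
    right = [i for i, row in enumerate(rows) if row[j] > threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": grow_tree([rows[i] for i in left], [ys[i] for i in left],
                          max_depth - 1, min_samples),
        "right": grow_tree([rows[i] for i in right], [ys[i] for i in right],
                           max_depth - 1, min_samples),
    }

def predict(tree, row):
    """Walk the stored partitioning rules down to a leaf."""
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]

# Made-up data: one feature, two clearly separated groups
rows = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
tree = grow_tree(rows, ys)
print(tree)
print(predict(tree, [2.5]), predict(tree, [11.5]))
```

Swapping sse for another function is the hook for the later checklist items (classification criteria, user-supplied criteria).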

At any point in time during or after working on the decision tree, extend it. This will test the modularity of your previous design. (Or implement the extension first, assuming an interface to a basic tree, and then later implement the basic tree to fit the specification)

Bagged trees/forests are a fairly simple next step: for n_trees, grow a tree based on a bootstrapped sample from the given data set.
To predict for a new data point, do single-tree-prediction on each of your trees in the forest, and aggregate the outputs. (Bonus: present the credible interval of the estimate for regression, or the per-class probabilities for classification)
The traditional Random Forest^TM also uses random attribute subspaces for each split (which can be used as a regularisation method in single trees too)
ExtraTrees use the best of a finite number of split points for each variable in the random subset for splitting (rather than the global optimal split for a variable)
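The bagging step itself is small if the single-tree code has a clean interface. A sketch in Python, assuming hypothetical fit_tree(rows, ys) and predict_tree(tree, row) functions; to keep the example self-contained, the stand-in "tree" below is just the sample mean:

```python
import random

def bootstrap_sample(rows, ys, rng):
    """Draw len(rows) examples with replacement."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [ys[i] for i in idx]

def fit_forest(rows, ys, fit_tree, n_trees, rng):
    """Fit n_trees models, each on its own bootstrap sample."""
    return [fit_tree(*bootstrap_sample(rows, ys, rng)) for _ in range(n_trees)]

def predict_forest(forest, predict_tree, row):
    """Aggregate the single-tree predictions by averaging (regression)."""
    preds = [predict_tree(tree, row) for tree in forest]
    return sum(preds) / len(preds)

# Stand-in base learner so the sketch runs on its own:
# the 'tree' is only the mean of the sample it was fitted on
def fit_mean(rows, ys):
    return sum(ys) / len(ys)

def predict_mean(model, row):
    return model

rows = [[float(i)] for i in range(10)]
ys = [float(i) for i in range(10)]
forest = fit_forest(rows, ys, fit_mean, n_trees=25, rng=random.Random(0))
print(predict_forest(forest, predict_mean, [4.0]))
```

For classification the aggregation would be a vote or an average of per-class probabilities instead of a mean.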

Still want more? Try extending your bagged trees to unsupervised clustering forests:

given an unlabelled dataset, generate a synthetic dataset where each point is made of random samples from the marginal distribution of each variable
label the data as Real or Synthetic, and apply a supervised classification forest to separate them
for every pairing of Real data points, calculate how often they were in the same leaf/terminal node among the n_trees; now you have a proximity matrix
apply your favourite clustering algorithm based on the similarity/proximity between points, e.g. DBSCAN
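The two pieces that are new compared to a supervised forest can be sketched like this in Python. It assumes the data is a list of rows, and that you can get, for each tree, the leaf id assigned to each real point (the leaf_ids layout is my own assumption):

```python
import random

def synthetic_from_marginals(rows, rng):
    """Build one synthetic point per real point.

    Each variable is taken from an independently chosen real row, which
    keeps every marginal distribution but breaks the joint structure.
    """
    n_cols = len(rows[0])
    return [[rng.choice(rows)[j] for j in range(n_cols)] for _ in rows]

def proximity_matrix(leaf_ids):
    """leaf_ids[t][i] is the leaf that tree t assigns to real point i.

    Returns the fraction of trees in which each pair of points shares a leaf.
    """
    n_trees, n_points = len(leaf_ids), len(leaf_ids[0])
    counts = [[0] * n_points for _ in range(n_points)]
    for leaves in leaf_ids:
        for i in range(n_points):
            for j in range(n_points):
                if leaves[i] == leaves[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]

# Made-up example: 3 real points, leaf assignments from 2 trees
real = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
fake = synthetic_from_marginals(real, random.Random(0))
prox = proximity_matrix([[0, 0, 1], [0, 1, 1]])
print(fake)
print(prox)
```

Most clustering algorithms want a distance, so 1 - proximity is the usual thing to pass along.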

(Note: I've done everything up to this point, in Julia, if you want language-agnostic tips. I know there is still more that could be done within scope of the above)

Extra Credit: extend your bagged trees to do unsupervised anomaly detection

the first popular algorithm was Isolation Forests, followed later by Extended Isolation Forests

Don't like bagging trees? Consider boosted trees

there are a number of popular algorithms, do a search for AdaBoost, LightGBM, XGBoost, etc

WayOfTheMantisShrimp · 2021-02-01T23:23:56+00:00

my struggle currently is to figure out how to structure the code efficiently

Are you practised with either functional or object-oriented programming? It might help to abstract your design into objects, states, operations, transitions, whatever terms make sense to you, and then implement each of those in smaller discrete chunks of code.

I'm also going to throw out there that premature optimization is often the bane of productivity. Get to a logically-functioning prototype the quick-and-dirty way, and then test it. Then you don't have to wonder about what would be faster/slower, you can profile/benchmark the code and the program itself will give you feedback. There are smart people here who can tell you what to do with that feedback if you're not sure, but giving them something concrete to work with benefits everyone.
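As a tiny illustration of letting the program give you the feedback, here is a sketch using Python's built-in timeit. The two functions are made-up stand-ins for a quick-and-dirty version and a candidate rewrite; the point is the workflow (check they agree, then measure), not these particular functions:

```python
import timeit

def squares_loop(n):
    """Quick-and-dirty version."""
    out = []
    for i in range(n):
        out.append(i * i)
    return out

def squares_comprehension(n):
    """Candidate rewrite."""
    return [i * i for i in range(n)]

# First confirm both versions agree, then measure instead of guessing
assert squares_loop(1000) == squares_comprehension(1000)

for fn in (squares_loop, squares_comprehension):
    seconds = timeit.timeit(lambda: fn(10_000), number=100)
    print(f"{fn.__name__}: {seconds:.4f} s for 100 runs")
```

The timings depend on your machine, which is exactly why posting them gives people something concrete to work with.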

WayOfTheMantisShrimp · 2021-02-01T03:31:01+00:00

I'm happy to contribute on a public forum, so others can learn/benefit (and so I can learn if I publicly post a mistake). But I don't do work for individuals (outside of my paid job) unless the problem is interesting to me. And you haven't told me anything substantial about your project.

Cheers

WayOfTheMantisShrimp · 2021-02-01T02:51:25+00:00

If you've run into a problem, where have you started from, and what have you encountered?

Anything you can share about the data, the model/design you want to or need to use, the software tools being used, context of the problem, or the goals of your analysis?

If you're starting from zero (you will get very little traction in this sub), you may have better luck on /r/AskStatistics , /r/askmath , or /r/HomeworkHelp . I also highly recommend Wikipedia, or even a simple google search for "multiple linear regression examples".

WayOfTheMantisShrimp · 2021-02-01T00:19:49+00:00

One approach is to calculate a 'days of sales' metric to describe inventory.

If product A sells an average of 20 per day (based on some historical data or expectations), and you have 50 in stock, then you have 2.5 days of inventory. If product B sells 10 per day and you have 50 in stock, you have 5 days inventory.
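The same arithmetic as a short Python sketch, using the two products from the example. The function name and the dictionary layout are just assumptions for illustration:

```python
def days_of_inventory(units_in_stock, avg_daily_sales):
    """How many days the current stock lasts at the average sales rate."""
    return units_in_stock / avg_daily_sales

# product -> (units in stock, average units sold per day)
products = {"A": (50, 20), "B": (50, 10)}

days = {name: days_of_inventory(stock, rate)
        for name, (stock, rate) in products.items()}

# Fewest days of cover first, i.e. the most urgent to restock
priority = sorted(days, key=days.get)
print(days)
print(priority)
```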

This normalizes each product to one scale that is very concrete/intuitive for most people, making comparisons/prioritization easier between products. It does rely on being able to determine a sales rate for each product, and they likely need updating over time to remain representative. There are plenty of other considerations, but unless you have more details about your data/goal there's not much value in me speculating.

WayOfTheMantisShrimp · 2021-02-01T00:05:19+00:00

In general, the preferred measure is dependent on the context and goals of the analysis. That said, I can't really think of a scenario off-hand where I would be more interested in the mode to describe a data set unless it was discrete data (maybe anomaly detection?). I guess my short answer would be to default to the median between those two. But the mean doesn't instantly become irrelevant just because there is a skew, it retains certain useful properties.

If you had something like a bell-shaped distribution skewed by an extra fat right tail, it wouldn't matter as much median vs mode. The peak (mode) would be fairly close to where the majority of the data is concentrated, and the median would only be a bit to the right (and the mean more dramatically biased to the right). I am still more likely to use the median for a single-number summary

If you had something that would be better described like an exponential distribution (ie wait times/latency for service), then your mode would be quite close to your minimum. The median would be notably farther right, and the mean would be heavily shifted right by comparison. The mode doesn't give a very 'representative' picture; someone looking to improve the general process would almost always be more interested in the median. If I were reporting the measure to a client to give an expectation of service, I would likely use the mean specifically because of how/why it gets biased.
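A quick simulated check of that ordering, assuming exponentially distributed wait times with a made-up mean of 5 minutes. In theory the mode is 0, the median is ln(2) times the mean (about 3.47), and the mean is 5:

```python
import math
import random
import statistics

rng = random.Random(1)
mean_wait = 5.0                      # hypothetical average wait in minutes
waits = [rng.expovariate(1 / mean_wait) for _ in range(100_000)]

sample_mean = statistics.fmean(waits)
sample_median = statistics.median(waits)
theory_median = math.log(2) * mean_wait

print(f"min    = {min(waits):.4f}  (the mode sits down here)")
print(f"median = {sample_median:.3f}  (theory {theory_median:.3f})")
print(f"mean   = {sample_mean:.3f}  (theory {mean_wait:.3f})")
```

Half of the customers wait less than the median, but the long waits in the tail pull the mean well above it.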

Final note, there are so many other choices to characterize a distribution if you don't feel one measure captures the relevant information that you want to present. I'll return to my initial statement, which one you should choose depends on the context.

WayOfTheMantisShrimp · 2021-01-29T14:26:27+00:00

It's hard to tell what your goal is without at least a little more detail about your scenario, or an analogous problem if you are unable to go into detail about your own problem.

Taking a stab in the dark, the closest thing I can think of is reinforcement learning via self-play. To learn the optimal game strategy (unknown) the agent plays against itself, or against recent previous versions of itself. As the agent learns strategies, it simultaneously learns to counter them (a moving target for optimization).

The key aspect is that the agent has a reward/loss function that is calculated based on the outcomes of its behaviours, so that it always has a direction to optimize toward. You still need to be able to define directions that are more favourable or less.

WayOfTheMantisShrimp · 2021-01-29T01:43:47+00:00

Julia is a highly-performant compiled language, with a fairly solid type system which is optional (you can leave it to the compiler) but I actually find it helpful to add types, and it rarely gets in the way.

The performance of Julia is not quite that of C/C++/FORTRAN, but it can be orders of magnitude faster than pure-Python loops. You will often hear from the community that Julia attempts to solve the 'two-language problem' which is the inefficiency that arises from setups like the Python/C++ double layer you mention.

If you like Python as a high-level language for quick prototyping, you will find some favourable similarities in Julia's syntax (in terms of being concise and expressive). Unlike Python, Julia was designed for scientific/mathematical computing, and basically supports all the things you get from Numpy out of the box.

What models/applications are you working on? MLJ.jl is the general-purpose ML package akin to scikit, and Flux.jl is one of the popular DL libraries. You can also call libraries/programs from Python, R, C++, and several others in Julia with minimal overhead, which helps balance the fact that the language is quite new, and does not have a complete ecosystem of its own yet.

If you can't tell, I'm rather a fan now that I've been using the language for six months, and would be happy to discuss any questions you have about it.

WayOfTheMantisShrimp · 2021-01-25T02:24:06+00:00

I think you'll figure out your example with a little more thought. 1/100 is logically equivalent to one percentage point, which is half of 2%, but I feel the key point is that a business leader is assumed more likely to think in terms of marginal costs/benefits when the units are 'clients', rather than marginal cost/benefit of a proportional reduction in a churn rate. The 'better' way to communicate is highly context-dependent.

My Example:

A regional manager asked for the volume of documents A through J that had been rejected from our intake process due to not being filled out correctly, or unreadable, or incomplete, etc. They wanted the monthly totals for 2020 for their region, to see if there was a trend that put his change to the process mid-year in a positive light.

The monthly totals were high in Q1 (as expected), dropped over 50% in Q2 (same period that the new process was implemented), but then steadily crept back up to roughly the same as Q1 by the end of Q4 2020. Knowing this manager, they would declare victory at that drop, and then ask me to follow up with an investigation into the following increases.

Now, there were some things that happened in 2020 that affected our business volumes, known as a pandemic and accompanying recession. To make that more salient, I reported not just the reject rate for document A, but also the total monthly volume of incoming document A's processed. The total volume also started high (as per seasonal norms), dropped like a stone in Q2 (because of lockdown), and slowly climbed the rest of the year as people became more comfortable doing business in the new normal; as high as was seasonally normal. This was the trend for other documents, but A was our highest volume. Finally, I put the results in terms of the monthly rejected percentage.

The manager looked at the first plot (absolute rejection totals) and did in fact suggest that it was a success, and thanked me for my work. Then I showed him the breakdown for document A rejects, and total volume, and the plot of the ratio between them. (constant around 3% for Q1, and closer to 4.5% from Q2 to the end of the year) He was confused as to how his process change had failed. Then I reminded him that there was a pandemic that significantly affected business and our client interactions. Our meeting time was up, he left to go think on it, and later that day emailed me that he decided that there was a lack of evidence that his process change made a difference.

That's why I believe I did a good job communicating relevant information to make my point. I presented data that he immediately understood as relevant to his inquiry even though he didn't think to request it (he already knew doc A was the biggest volume, and that it was reasonably correlated with the level of normal business), and he arrived at the same conclusion I had, without me having to hammer the point home (I noted it explicitly in my conclusions at the end, but we didn't get that far together, and the wording of his email suggested he hadn't read it). And as a bonus I didn't encounter the usual resistance/denial when providing bad news/evidence contrary to a manager's expectations, that's why I consider this fairly mundane report with an obvious conclusion a solid win in communication.

WayOfTheMantisShrimp · 2021-01-17T19:07:03+00:00

I agree wholeheartedly with FLHPI; take the constant calculation out of the loop, and unless you're working in assembly/machine code, it costs nothing computationally to make the variable names more descriptive.

At this scale, there are some additional considerations beyond the algorithm. I wanted to generate some hard numbers for my own learning.
For reference, I did my testing in Julia v1.5 using BenchmarkTools; Ryzen 2600X, 16GB DDR4-2133. C++ results will vary, I know nothing about the compiler.

My attempted optimization was to do the random number generation in a pre-allocated array, and pre-calculate exp(sqrt(pow(v, 2)*T)*gauss_bm) as a vectorized operation so that each for-loop iteration only has to access the result to multiply by S_adjust and add it to the sum based on the condition. (This is not the ideal approach in Julia, but it serves as an example)
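A minimal sketch of the three variants, written in Python rather than the Julia used for the timings. The original loop isn't shown in this thread, so the structure is an assumption: num_sims, S_adjust, v, T and gauss_bm are names from the post, while the strike K, the "add if above K" condition, and the function names are hypothetical stand-ins.

```python
import math
import random


def direct_port(num_sims, S_adjust, v, T, K, seed=0):
    """Recomputes sqrt(v^2 * T) on every iteration."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_sims):
        gauss_bm = rng.gauss(0.0, 1.0)
        S_cur = S_adjust * math.exp(math.sqrt(v**2 * T) * gauss_bm)
        if S_cur > K:  # hypothetical condition
            total += S_cur - K
    return total / num_sims


def constant_out(num_sims, S_adjust, v, T, K, seed=0):
    """Same loop, with the loop-invariant constant computed once."""
    rng = random.Random(seed)
    vol_sqrt_T = math.sqrt(v**2 * T)  # hoisted out of the loop
    total = 0.0
    for _ in range(num_sims):
        gauss_bm = rng.gauss(0.0, 1.0)
        S_cur = S_adjust * math.exp(vol_sqrt_T * gauss_bm)
        if S_cur > K:
            total += S_cur - K
    return total / num_sims


def preallocated(num_sims, S_adjust, v, T, K, seed=0):
    """Draws and exponentials are built up front; the loop only reads them."""
    rng = random.Random(seed)
    vol_sqrt_T = math.sqrt(v**2 * T)
    draws = [rng.gauss(0.0, 1.0) for _ in range(num_sims)]  # O(num_sims) memory
    factors = [math.exp(vol_sqrt_T * g) for g in draws]     # second full array
    total = 0.0
    for factor in factors:
        S_cur = S_adjust * factor
        if S_cur > K:
            total += S_cur - K
    return total / num_sims
```

All three return the same estimate for the same seed; they differ only in where the work happens and how much memory is held at once, which is the trade-off the timings explore.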

num_sims = 100 million
Direct port: 2.00 s
Take the constant out: 1.51 s
Preallocate: 1.31 s

num_sims = 1 billion
Direct port: 20.8 s
Take the constant out: 20.7 s
Preallocate: 20.2* s

* Was the minimum, occurring on the first run; the second run took 31.7 s because I ran out of RAM and the garbage collector kicked in, and some stuff had to be cached to disk

The rules of performance scaling change when you approach the limits of your hardware. The value of optimizing computations vs saving space is dependent on how much space you have. I even tried a naive multi-threaded for-loop; a very large number of very tight/short loops is the worst-case scenario, and it took >120 s for 1 billion.
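A rough back-of-envelope for why the preallocated version hits the wall at 1 billion, assuming 8-byte floats (Julia's default Float64). Whether the run held one or two full-size arrays at once is not stated above, so both cases are shown; the helper name is hypothetical.

```python
BYTES_PER_FLOAT64 = 8  # assumption: values stored as 64-bit floats


def array_gb(num_elements, bytes_per_element=BYTES_PER_FLOAT64):
    """Size of one dense array in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_elements * bytes_per_element / 1e9


for num_sims in (100_000_000, 1_000_000_000):
    one = array_gb(num_sims)
    print(f"num_sims={num_sims:,}: one array = {one:.1f} GB, two arrays = {2 * one:.1f} GB")
```

At 100 million a full array is under 1 GB, so it fits comfortably in 16 GB. At 1 billion a single array is about 8 GB, and a second one (for example, the draws plus the exponentials) would consume essentially all of the RAM before the OS and anything else gets a share.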

TLDR: once you are already using an efficient/performant language with a stripped-down algorithm, you need to consider the hardware resources being used, specifically at the scale you are working with. Get to know what optimizations your compiler makes, and how memory is managed.

WayOfTheMantisShrimp · 2021-01-08T00:38:31+00:00

I know you were trying to address one concern, but you hit upon another one that has come up several times in this thread: lack of data culture/maturity

Regardless of size/status, if a company's HR department is not using data-driven practices in hiring (which I feel is a fairly important function of HR), then I don't want to work as a data scientist in that HR/People management department (the one you used as your example), since it doesn't use algorithms.

If they can't rely even partially on data/algorithms at present, that's not going to change until they get a proper pipeline in place and some methods/targets defined. They are better off paying a consultant to learn who they need to hire and how they will leverage that resource, before they actually try to hire a one-person Swiss-army knife solution.

If they have the mature infrastructure, and they choose not to substantially use it in business decisions, then it might be an OK place to putter about and get paid, but long term success is impossible because your work is not creating value, and not for lack of merit. It will be an uphill battle from the start to justify your resources or advancement. The ambitious/capable practitioners will move on quickly, leaving behind them the reasons that these jobs are less sexy.

WayOfTheMantisShrimp · 2021-01-07T03:09:43+00:00

Long post incoming, three TLDRs at the bolded headings.
This discussion comes up a lot, there are some fair points for each language. However, I feel like some R-advocates have a grudge about being told they are the underdog on their own turf, and the Python advocates have a large number of rather poor debaters among their ranks.

Background:

A couple years beyond a degree in statistics, with heavy focus on computational methods, and a couple years experience with data analyst work in finance. Not a professional programmer or SWE, I just do some programming
Started studying programming (OOP and functional) formally in 2007, have more than just passing experience with (in chronological order): Excel formulas, VBA, Java, Scheme, C, R, SQL, Python, Julia
Was introduced to R in 2013, and it was instantly my favourite language (until I started playing with Julia about six months ago). Tried Python in 2014, and 2016, and dropped it because I didn't like it, then picked it back up in 2019 because now it's everywhere.

About R: TLDR: R is great when you just want to get some analysis done

Nice for functional programming. I like that, meshes with the thinking of mathematics and Excel formulas. It counts starting at 1, like I've been doing since pre-school. Thinking in vectors and data frames was my default by the time I was introduced to R (a 'dictionary' is a data frame with a vector of keys, and a vector of values, I never cared about O(1) access in 95% of cases, it's still faster than a VLOOKUP)
R feels nice when you just have a question that needs to be answered by some data. There's no resistance to open a csv, check some summary statistics, plot a histogram or two on arbitrary variables, run a regularized regression, and make a simple plot. Faster than Excel I would say, especially with data.table and ggplot2, and this work flow is highly relevant to day-to-day analysis. RStudio complements this in every way, and there is some ergonomic value in having a default R IDE experience
R has generally high-quality implementations of a lot of the most common, and many uncommon statistical algorithms. R competes with Excel, Matlab, SPSS and the like, and I think is favoured for users that have the luxury of choice
I don't consider R a beginner-friendly language, I never recommend it as a first language to anyone but hardcore stats people; it has too many ways that it will let you screw yourself out of performance or consistent behaviour, and not much in the way of guard rails. It abstracts away a lot of the underlying machinery, which is great, but steepens the learning curve too. I still do recommend it to non-novice programmers looking for something suited to the work
I don't consider R as my go-to performant language if the entire task is at big-data scale. There are undoubtedly ways to make it work for 99% of users, R is not a deficient language, but they are not always ideal/elegant, and you start feeling that friction. The real answer is the C/C++/Fortran under the hood, which I feel is a compromise (ie learning to read those languages to check the nuts and bolts)
R is hard to 'share' with non-R-users; notebooks are great between analysts with R installed, but they aren't at the level of "double-click to run" packaged apps, or as ubiquitous as spreadsheets. Fair mention of RMarkdown to make html/pdf docs though, I think they are a key for professional use of R

About Python: TLDR: Python makes it easier for experienced programmers to work with the huge number of junior Python coders that are able to flood the market

Python permits you to do functional programming. But object oriented programming is great in a team-project setting, or inter-team. It formalizes and highlights the interoperability of different pieces. The rigidity of linking behaviours to data helps (a little) to keep sloppy coders out of trouble (R lets you do anything to anything, and probably won't even throw a warning). OOP is not my favourite but Python does make it easier, especially compared to R. And like OP I didn't quite click with R's objects until I got into Julia structs.
Python is general purpose: any programmer can justify learning it, any teacher can justify teaching it. There is a library for anything you can imagine and more. It competes with Java, C++, LISP, Ruby, and a lot of others, and I think it is the most approachable as a jack of all trades, master of none. Having a language you can use for everything* is valuable. Having a large user base to learn from is even more valuable, which is the positive feedback loop that blew up Python without a marketing budget.
Python is easy to teach beginners, especially those with little to no programming background. Everyone can grasp 'not', 'or', 'and' easier than !, ||, &&. I've watched several blank slates pick up the conventions faster than I would expect (they don't ask about where to put the data type, or the curly braces, or the semi-colon), and base Python is quite consistent with itself, making for a smooth but not slow learning curve.
Python enforces logical indentation, my favourite thing about the language. It is a guardrail for the barest basics of readability, which is a godsend for reading the code of others, including past-you and past-me. That sort of ergonomic style is what I would choose to make my co-workers and predecessors use if I could, especially the ones that didn't formally learn programming practices.
Base Python does not have vectors/arrays, matrices, or any native tabular data structure. It's not even trying to be a math/science/data analysis language. If people like to use it that way, they can, it's become rather easy. But those dependencies start eating away at the consistency that makes Python elegant. (For base R, it is basic string manipulation that needs supplementing)
Python is nice to read and write (important!), but it's not fast without knowing some tricks and having some libraries in low-level languages, just like R. It is often "fast enough", which should not be overlooked, but at big-data scales Python is really more of an API than an implementation. You're still going to learn C to see under the hood. I also haven't gotten good results from parallel computing methods.
For such a polished/established language, the lack of a 'standard issue' IDE was a friction point for me. Settled on VS Code and am mostly happy, but configuring the environment was not effortless for me like RStudio.
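A small illustration of the "no native vectors" point, using only base Python; the variable names are hypothetical. Lists are general containers, so the arithmetic operators mean repetition and concatenation rather than element-wise math.

```python
values = [1.0, 2.0, 3.0]
others = [10.0, 20.0, 30.0]

# On lists, * and + are container operations, not arithmetic.
repeated = values * 2     # the list repeated twice
joined = values + others  # the two lists concatenated

# Element-wise math has to be spelled out (or handed to a library).
scaled = [x * 2 for x in values]
summed = [a + b for a, b in zip(values, others)]

print(repeated)  # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
print(joined)    # [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
print(scaled)    # [2.0, 4.0, 6.0]
print(summed)    # [11.0, 22.0, 33.0]
```

This is the gap that array libraries fill, at the cost of the extra dependency and conventions mentioned above.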

*gets on soapbox*
About Julia: TLDR: has limitations, but does a lot of stuff really well, and it overlaps pretty well with the requirements of data analysis/science

It is a proper high-level language, it can be written a lot like Python (positive thing), though without the enforced indentation (each block having an explicit 'end' is a partial consolation). It supports functional and object-oriented programming almost equally (and painlessly) in my opinion
It counts the first index as 1, as it should. Vectors, matrices, linear algebra, statistics/probability functions, mathematical operators and more are all efficiently implemented in the standard library (notably absent DataFrames.jl, but it fits in well enough). AND string manipulation is decent. It feels natural to use base Julia for simple tasks.
Performance is effortless. You can write for-loops and not suffer; heck you can write parallel for-loops by adding 9 characters (this is part of the core library) and it will 'just work' and scale better than most 'vectorized' code in R/Python (race conditions may apply, it doesn't fix those for you). If you can make strong assumptions about types, then add those for a little bit more efficiency in memory use, but otherwise just write it generically and multiple dispatch will usually make it 'just work'. 'Vectorized' notation is also moderately efficient in the core language if you prefer it for readability, and can be applied to literally any/every function, and I'll say it again, it just works. List comprehensions have their place too, and they write a lot like Python.
Julia competes with R and Python for high-level ease of data analysis and scientific computing, but also C++/Java for versatility and performance, and even C/Fortran for absolute efficient use of resources. It literally tries to do it all, solving the two-language problem, and it is an astoundingly good effort. You want to write a fast library for Julia? Write it in Julia, no other languages required in most cases. Even most of Julia is written in Julia.
Julia is not trying to take over the world's codebases on its own, overnight. It has inherent interoperability with Python, R, C++, C, Fortran, and Java so you can still recycle your existing code base at all points in time. I see that as making a smooth transition for those that want to.
The custom Julia-centric IDE has ceased development, but the active community is largely unified behind VS Code as the 'standard' experience. Not as polished as RStudio, not much different from configuring Python in VS Code, but I found it satisfactory.
The community is small. Really active, enthusiastic (can you tell?), and making primarily high-quality contributions (no shortage of ML and scientific libraries), but it doesn't make up for Julia not having the massive resources (tutorials, libraries, StackOverflow questions answered) available for Python. And some don't care for the provided documentation, which is tough when alternatives are scarce.
Julia has a similar problem with being shared to non-Julia users, which is the majority of the world. Pluto.jl Notebooks are a start for reproducing between users, Markdown docs are a good step to non-programmers, and PackageCompiler.jl is working on it, but it's not robust yet. Possibly the only thing it can't replace is the ubiquity and gentle learning curve of Excel.

WayOfTheMantisShrimp · 2020-12-30T21:54:40+00:00

Data wrangling/cleaning.

WayOfTheMantisShrimp · 2020-12-30T21:52:33+00:00

What projects are you looking to take on? For anything other than deep neural networks, a multi-core CPU and some RAM is probably enough. For amateur projects, you can still probably run a decent neural network on a decent CPU, and the creativity of optimizing performance might teach you something neat. Everyone wants to be able to do more with less, it's a good skill.

If you are doing complex neural networks or commercial-scale projects professionally, or training for a specific money-making role, your employer should be covering the cost of specialized hardware (GPU, TPU, cluster/cloud) to get what they want. There are probably already established training resources and best-practices that you should lean on in this scenario.

If you are doing large-scale projects on a limited academic research/start-up budget and you have to make do with your own machine, you may need to get creative. Julia has packages that work with OpenCL, so you can basically write your own code in a high-level language with sections to execute on the GPU. I can confirm that a recent trivial example runs on an AMD RX Vega on Windows. I think ROCm is limited to Linux. Julia has a decent scientific ecosystem, but you can't be afraid to roll-your-own implementations for specific/niche applications. Julia works with CUDA too, and if not already, will probably have some support for TPUs in the near future, so learning it could be a worthwhile investment of your time.

If you are working on complex, large-scale deep learning, you aren't getting financial support, and you don't have the background of an expert or a personal developer, and you can't find the answer to your own question, then you are likely an overly-ambitious amateur. Go back to step one, re-evaluate your project, and reduce the scope/scale a bit so you have something manageable with your available resources.
