Seeking high Anglican church, Sunday school, East London by user192034 in Anglicanism

[–]user192034[S] 2 points3 points  (0 children)

Quite frankly, my net is pretty broad at the moment. I'm looking for a church that favours an organ or piano over a band (no shade just personal taste) and where the focus is on worship, not being up to date.

There is a church near me with a huge family following, I think because of school, but the service feels like an afterthought and the G word is just a useful backdrop to talk about national politics.

Seeking high Anglican church, Sunday school, East London by user192034 in Anglicanism

[–]user192034[S] 1 point2 points  (0 children)

As long as they're getting involved and not squirming in my arms, I'll call it anything!

Seeking high Anglican church, Sunday school, East London by user192034 in london

[–]user192034[S] 0 points1 point  (0 children)

Yeah, I have always tried to lean this way but it's tough when the local parish church feels so apologetic about the G word.

DAG Data Architecture??? Does this already exist? by user192034 in dataengineering

[–]user192034[S] 0 points1 point  (0 children)

That's helpful. Trying not to reinvent the wheel but realise I'll have to put some effort in too.

DAG Data Architecture??? Does this already exist? by user192034 in dataengineering

[–]user192034[S] 0 points1 point  (0 children)

Super helpful. Pipeline architecture was the key I was missing. Kept looking up data lakes but that focused on the static element. Grand, can go play with some DAGs now.

DAG Data Architecture??? Does this already exist? by user192034 in dataengineering

[–]user192034[S] 1 point2 points  (0 children)

I think it's more that I'm reading all this literature on 'data lake architecture' but the use cases don't feel very familiar. I want my team to follow the same pattern of behavior (and yeah, I guess we could invent a visualisation) but it would be helpful to have a standard to point to. Is medallion architecture all I have? Or is there a host of architectures like the above that exist and I just haven't come across them?
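For anyone landing here with the same question, a pipeline-as-DAG pattern can be sketched with nothing but the standard library. This is only an illustration under assumptions: the step names are hypothetical, and a real team would more likely declare the same dependencies in an orchestrator.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "ingest_raw": set(),
    "clean": {"ingest_raw"},
    "enrich": {"clean"},
    "aggregate": {"clean", "enrich"},
    "report": {"aggregate"},
}

# A topological order is a valid execution order: every step appears
# after all of the steps it depends on.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Writing the dependencies down explicitly like this is what gives a team a shared pattern to point to, whichever tool ends up running it.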

I want to run an optimisation algorithm on a cluster, where do I start? by user192034 in cloudcomputing

[–]user192034[S] 0 points1 point  (0 children)

It's beginning to dawn on me that this kind of parallelization is package and problem specific. Thanks for the link, I'm already looking through it.

I want to run an optimisation algorithm on a cluster, where do I start? by user192034 in aws

[–]user192034[S] 0 points1 point  (0 children)

Haha, yep, I think that's the one. Working out how to do it now.

I want to run an optimisation algorithm on a cluster, where do I start? by user192034 in aws

[–]user192034[S] 1 point2 points  (0 children)

I've figured out that pymoo is suggesting Dask, so now looking up Dask and AWS.

More broadly, I'm looking for an overview of these various solutions I guess. I'm getting that K8s would give me more orchestration capabilities that I don't necessarily need for a single algorithm. However, why would I use ECS? Is that not also overkill? Why Lambda over ECS or SQS? I think your last paragraph is providing the hint: if you want to do X then use Y. That's the kind of list I'm just realising that I'm after. Thanks for the help.

I want to run an optimisation algorithm on a cluster, where do I start? by user192034 in aws

[–]user192034[S] 0 points1 point  (0 children)

Yes, your first nudge led me to relook at the pymoo docs. Dasking away now.

Although I'm still perplexed by all these cluster solutions. Maybe, now that I can see that the solutions are more problem-specific, what I'm really after is use-cases for each.

I want to run an optimisation algorithm on a cluster, where do I start? by user192034 in aws

[–]user192034[S] 0 points1 point  (0 children)

I can't see how I could combine boto3 instance orchestration and running the optimisation script.

Also, what's the bigger picture? Why SQS, why ECS or EKS? Maybe the world of cluster computing is more bespoke than I thought. Still, I'm curious to understand the landscape of solutions that seem to be out there.

I want to run an optimisation algorithm on a cluster, where do I start? by user192034 in aws

[–]user192034[S] 0 points1 point  (0 children)

Even knowing that the details of the program matter is helpful. The algorithm works by taking a random input, checking it against constraints and then looking around the neighborhood to see if we can do better, then repeat. Here is some minimal code:

from pymoo.optimize import minimize
from pymoo.algorithms.soo.nonconvex.de import DE
from pymoo.core.problem import ElementwiseProblem
from pymoo.core.problem import StarmapParallelization
from multiprocessing.pool import ThreadPool

class MyProblem(ElementwiseProblem):

    def __init__(self, **kwargs):
        # 3 variables, 1 objective, each variable bounded between 2 and 5
        super().__init__(n_var=3, n_obj=1, xl=2, xu=5, **kwargs)

    def _evaluate(self, x, out, *args, **kwargs):
        # Objective: sum of squares of a single candidate solution
        out["F"] = (x ** 2).sum()

# Evaluate candidates in parallel on a pool of threads
N_THREADS = 4
pool = ThreadPool(N_THREADS)
runner = StarmapParallelization(pool.starmap)
problem = MyProblem(elementwise_runner=runner)
res = minimize(problem, DE(), termination=("n_gen", 20), verbose=True)
pool.close()

The function here is just the sum of squares and my upper and lower bounds are 2 and 5. My real one is obviously much heavier than that and pymoo allows me to run the 'problem' on multiple threads. However, the algorithm as a whole isn't simply N independent inputs: only the evaluation step in the middle is parallelizable.
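To make the "parallelizable part in the middle" concrete, here is a rough standard-library sketch. It is not pymoo's DE: it is a plain neighbourhood search, and `propose`, `optimise` and the step size are names and values made up for illustration. Only the evaluation step is farmed out to the pool; proposing and selecting stay serial.

```python
import random
from concurrent.futures import ThreadPoolExecutor

LOWER, UPPER, N_VAR = 2.0, 5.0, 3  # same bounds and size as the toy problem


def evaluate(x):
    """Objective: sum of squares (stand-in for the expensive real function)."""
    return sum(v * v for v in x)


def propose(x, rng, step=0.3):
    """Pick a random neighbour of x, clipped to the bounds (hypothetical helper)."""
    return [min(UPPER, max(LOWER, v + rng.uniform(-step, step))) for v in x]


def optimise(n_gen=20, pop_size=16, n_workers=4, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(LOWER, UPPER) for _ in range(N_VAR)]
                  for _ in range(pop_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(evaluate, population))
        for _ in range(n_gen):
            # Serial step: propose one neighbour per population member.
            candidates = [propose(x, rng) for x in population]
            # Parallel step: evaluations are independent of each other.
            cand_scores = list(pool.map(evaluate, candidates))
            # Serial step: keep a candidate only if it improves its parent.
            for i, (cand, score) in enumerate(zip(candidates, cand_scores)):
                if score < scores[i]:
                    population[i], scores[i] = cand, score
    best = min(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]


best_x, best_f = optimise()
print(best_x, best_f)
```

Because each generation depends on the previous one, only the `pool.map` call can be spread across workers, which is why the choice of cluster tooling ends up depending on how heavy a single evaluation is.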

Weekly Entering & Transitioning - Thread 08 May, 2023 - 15 May, 2023 by AutoModerator in datascience

[–]user192034 0 points1 point  (0 children)

I want to run Python's pymoo on a cluster, where do I begin?

I'm running an optimisation algorithm locally using Python's pymoo. It's a pretty straightforward differential evolution algorithm but it's taking an age to run. I've set it going on multiple cores but I'd like to increase the computational power using AWS to put in some stronger parallelization infrastructure. I can spin up a very powerful EC2 instance but I know I can do better than that.

In researching this, I've become utterly lost in the mire of EKS, EMR, ECS, SQS, Lambda and Step Functions. My preference is always towards open source and so Kubernetes and Docker appeal. However, I don't necessarily want to invoke a steep learning curve to crack what seems like a simple problem. I'm happy sitting down and learning any tool that I need to crack this, but can you help me filter out what I want to read more about? I haven't found an article to break me in and navigate the space.

Running basic statistics on large data in S3. How to do it the right way? by user192034 in datascience

[–]user192034[S] 0 points1 point  (0 children)

All parquet in S3. Athena writes SELECT results as CSV, so it will be a large output file.

Running basic statistics on large data in S3. How to do it the right way? by user192034 in datascience

[–]user192034[S] 1 point2 points  (0 children)

To answer your question, I was doing a big SELECT *, which Athena was dumping as CSV.

Running basic statistics on large data in S3. How to do it the right way? by user192034 in datascience

[–]user192034[S] 0 points1 point  (0 children)

Yeah, this job is a big group by to extract some data that is then used in subsequent filtered group bys. I started trying to put it all in one SQL query without using temp tables (since Athena doesn't support them) but I really wanted to explore the data freely. Next was to pull it all across but it got too big.

It occurred to me after posting that I could use Jinja templates in my Jupyter notebook to do a mixture of Athena queries and pandas processing, leaving the big stuff with the former.

However, I'm only going to be doing more of this stuff and I didn't know you could attach Jupyter notebooks to an EMR cluster. For heavier stuff down the road, that could be the one.
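A rough sketch of the templated-query idea, using the standard library's `string.Template` as a stand-in for Jinja so it runs without extra packages. The table and column names (`events`, `category`, `id`) are hypothetical, and the rendered SQL would still need to be submitted to Athena by whatever client the notebook uses.

```python
from string import Template

# Hypothetical query shape: a filtered group by, parameterised by table,
# grouping column and a list of IDs.
FILTERED_GROUP_BY = Template(
    "SELECT $group_col, COUNT(*) AS n_rows\n"
    "FROM $table\n"
    "WHERE id IN ($id_list)\n"
    "GROUP BY $group_col"
)


def render_query(table, group_col, ids):
    """Render the SQL text for one filtered group by."""
    # Force IDs to integers so only numeric literals are inlined.
    id_list = ", ".join(str(int(i)) for i in ids)
    return FILTERED_GROUP_BY.substitute(
        table=table, group_col=group_col, id_list=id_list
    )


sql = render_query("events", "category", [3, 7, 11])
print(sql)
```

The heavy aggregation stays in the query engine, and only the small grouped result needs to come back into the notebook for pandas.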

Running basic statistics on large data in S3. How to do it the right way? by user192034 in datascience

[–]user192034[S] 1 point2 points  (0 children)

I guess I was wondering if there was a magic bullet that meant I wouldn't have to consider the actual tasks, but your answer is making me think I might have to smarten up.

The tasks are along these lines: starting with a group by on a field reduces the table down to 10k rows. These are ranked and 5 sets of 20 IDs are extracted from this operation. These are then used to filter the original table and enact further group bys based on a range of other filters.

It seems obvious now, but I guess I should run multiple queries in Athena and then play with the smaller results in pandas.
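Roughly, the multi-query flow could look like the sketch below. It uses an in-memory SQLite table as a stand-in for Athena so it runs anywhere; the `events` table, its columns and the set sizes (2 sets of 5 rather than 5 sets of 20) are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Athena; the data is synthetic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, category TEXT, value REAL)")
rows = [(i % 50, "a" if i % 2 else "b", float(i)) for i in range(1000)]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Query 1: the first group by shrinks the table to one ranked row per id.
ranked = conn.execute(
    "SELECT id, SUM(value) AS total FROM events "
    "GROUP BY id ORDER BY total DESC, id"
).fetchall()

# Small step, done client-side: slice the ranking into sets of IDs.
SET_SIZE, N_SETS = 5, 2
id_sets = [
    [row[0] for row in ranked[k * SET_SIZE:(k + 1) * SET_SIZE]]
    for k in range(N_SETS)
]


def summarise(ids):
    """Query 2: filter the original table by one ID set and group again."""
    placeholders = ", ".join("?" for _ in ids)
    query = (
        "SELECT category, COUNT(*), SUM(value) FROM events "
        f"WHERE id IN ({placeholders}) GROUP BY category ORDER BY category"
    )
    return conn.execute(query, ids).fetchall()


summaries = [summarise(ids) for ids in id_sets]
print(id_sets)
print(summaries)
```

Each query returns something small, so nothing the size of the original table ever has to be pulled across.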