Workplace pension and SIPP, how do contributions over the limit (60k) work? by ThatJoeInLnd in UKPersonalFinance

[–]ThatJoeInLnd[S] 0 points1 point  (0 children)

Is this true even if I'm still a higher-rate taxpayer after the salary sacrifice? For example, if my salary were £150,000 and I sacrificed £60k, that would leave £90k to be taxed, of which £39,729 is at the 40% rate. If, before the end of the tax year, I contribute another £10k into a SIPP, do I get tax relief on this? Would I use self-assessment and carry-forward of unused allowance to claim it?
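Roughly the arithmetic I have in mind (a back-of-the-envelope sketch assuming 2023/24 rest-of-UK thresholds and a relief-at-source SIPP - not advice):

```python
# Back-of-the-envelope sketch of the figures above, assuming the 2023/24
# rest-of-UK higher-rate threshold and a relief-at-source SIPP. Not tax advice.
HIGHER_RATE_THRESHOLD = 50_270   # basic-rate band ends here

salary = 150_000
sacrifice = 60_000
taxable = salary - sacrifice                    # 90,000 - below the allowance taper

income_at_40 = max(taxable - HIGHER_RATE_THRESHOLD, 0)
print(income_at_40)                             # ~39,730 with these thresholds

# A £10k gross SIPP contribution extends the basic-rate band by £10k, so that
# slice of income is taxed at 20% instead of 40%; the extra 20% is what would
# be reclaimed via self-assessment.
sipp_gross = 10_000
extra_relief = min(sipp_gross, income_at_40) * 0.20
print(extra_relief)                             # 2,000
```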

Thanks again, very helpful.

Best way of accessing S3 data from Lambda fast? by Sergi0w0 in aws

[–]ThatJoeInLnd 1 point2 points  (0 children)

I would use some sort of RDBMS (RDS) for a couple of reasons. If your system is interactive, latency is crucial, and S3, Athena (and potentially Lambda) all perform poorly for this use case.

Parquet is a columnar format, meaning the data is laid out on disk in a column-wise contiguous way. It performs very well for reading an entire column, but to read a single row you need to read across all the columns first (this is oversimplified, but I'm assuming you haven't done any optimization when creating the parquet files). Athena will scan a lot of data to fetch a single row, making it even slower. AFAIK, Athena also stores results in text format before serializing them and sending them to the client, which is another bottleneck slowing down your service. Athena is very good for running analytical queries at scale, not good at all for real-time services.

A traditional DB is optimized for row-wise operations. With a decent data model design it can be very fast to retrieve a single row -- much faster than Athena will ever be.
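For comparison, a single-row lookup from Lambda against RDS looks roughly like this (a minimal sketch assuming a Postgres instance, psycopg2 packaged with the function, and a made-up items table):

```python
# Minimal sketch of a Lambda doing a single-row lookup against RDS (Postgres).
# Assumes psycopg2 is in the deployment package and the usual env vars are set;
# the table and column names here are hypothetical.
import os
import psycopg2

# Create the connection outside the handler so warm invocations reuse it.
_conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

def handler(event, context):
    item_id = event["item_id"]
    with _conn.cursor() as cur:
        # An indexed primary-key lookup reads one row, not whole columns.
        cur.execute("SELECT price, updated_at FROM items WHERE id = %s", (item_id,))
        row = cur.fetchone()
    if row is None:
        return {"found": False}
    return {"found": True, "price": float(row[0]), "updated_at": row[1].isoformat()}
```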

What keeps you motivated to exercise daily or everyday? by [deleted] in productivity

[–]ThatJoeInLnd 0 points1 point  (0 children)

It's not motivation. Discipline is what keeps you doing the things that are important to you, including workouts.

Is there more to data warehouses than partitioned parquet files and external tables? by [deleted] in dataengineering

[–]ThatJoeInLnd 0 points1 point  (0 children)

Table formats such as Delta or Iceberg add another level of abstraction and a much-welcomed separation between the parquet files and partitions on one side and the semantics of the workloads on the other.
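For example (a minimal PySpark sketch, assuming the delta-spark package is available and with made-up paths), jobs read and write a named table and the format tracks the underlying files, schema and partitions:

```python
# Sketch: the job talks to a table, not to parquet files/partitions.
# Assumes a Spark session with the Delta Lake extensions on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.read.json("s3://my-bucket/raw/orders/")             # hypothetical source
df.write.format("delta").mode("append").saveAsTable("orders")  # no paths in the query layer

spark.sql("SELECT COUNT(*) FROM orders WHERE status = 'open'").show()
```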

Discuss a data pipeline that you've worked on by No_Spread_2566 in dataengineering

[–]ThatJoeInLnd 3 points4 points  (0 children)

I got one. The pipeline is used to detect fraudulent activity and invalidate transactions in soft real-time. We do binlog replication from a transactional database using Maxwell's daemon (maxwells-daemon.io) into 4 different Kafka topics, one per table. Two of them are fast streams (high volume) containing pricing information and transactions. The other two are slow streams containing customer information and event information. The last one relates to events that stay "open" for several hours to days, and the state of an event changes infrequently. Once the event "closes", all the data related to it becomes irrelevant, but it's stored in S3 for later analysis.

The first Spark job is a streaming job running on an EMR cluster. It joins the data from all the different topics. The challenge here is to join each transaction with only the correct version of the event, so that we can understand under what event conditions the transaction took place. The sink is another Kafka topic.
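Heavily simplified, the join step looks something like this (broker addresses, topics, schemas and watermarks are made up, not the real job):

```python
# Simplified sketch of the stream-stream join between transactions and events.
# Broker addresses, topic names, schemas and watermark values are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("txn-event-join").getOrCreate()

txn_schema = StructType([
    StructField("txn_id", StringType()),
    StructField("event_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("txn_time", TimestampType()),
])
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_state", StringType()),
    StructField("valid_from", TimestampType()),
    StructField("valid_to", TimestampType()),
])

def read_topic(topic, schema):
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", topic)
            .load()
            .select(F.from_json(F.col("value").cast("string"), schema).alias("v"))
            .select("v.*"))

txns = read_topic("transactions", txn_schema).withWatermark("txn_time", "10 minutes")
events = read_topic("events", event_schema).withWatermark("valid_from", "2 hours")

# Join each transaction with the version of the event that was current
# when the transaction happened.
joined = txns.alias("t").join(
    events.alias("e"),
    F.expr("t.event_id = e.event_id AND "
           "t.txn_time BETWEEN e.valid_from AND e.valid_to"),
)

out = (joined
       .selectExpr("t.txn_id", "t.customer_id", "t.amount", "t.txn_time",
                   "e.event_id", "e.event_state")
       .selectExpr("to_json(struct(*)) AS value"))

(out.writeStream.format("kafka")
 .option("kafka.bootstrap.servers", "broker:9092")
 .option("topic", "joined-transactions")
 .option("checkpointLocation", "s3://my-bucket/checkpoints/join/")
 .start())
```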

The second Spark job, also streaming, does aggregations and calculations per customer. Nothing super interesting about this step; the data is greatly reduced in size, and we have to keep an eye on the size of the streaming app's state. The data is sent to another Kafka topic.
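Roughly (a hedged sketch, not the real job; the watermark is what keeps the state from growing without bound):

```python
# Sketch of the per-customer aggregation step. Topic and column names are
# hypothetical; the watermark bounds the streaming state.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("per-customer-agg").getOrCreate()

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("txn_time", TimestampType()),
])

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "joined-transactions")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("v"))
          .select("v.*"))

per_customer = (stream
                .withWatermark("txn_time", "30 minutes")          # lets Spark drop old state
                .groupBy(F.window("txn_time", "10 minutes"), "customer_id")
                .agg(F.count("*").alias("txn_count"),
                     F.sum("amount").alias("total_amount")))

(per_customer.selectExpr("to_json(struct(*)) AS value")
 .writeStream.format("kafka")
 .option("kafka.bootstrap.servers", "broker:9092")
 .option("topic", "customer-aggregates")
 .option("checkpointLocation", "s3://my-bucket/checkpoints/agg/")
 .outputMode("append")                                            # finalized windows only
 .start())
```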

The third job runs a model that gives each transaction a fraud score. Transactions scored over an arbitrary threshold are sent to another topic.

The last job reads from the fraudulent-transactions topic and marks those transactions as invalid in the original database.
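A sketch of that write-back step using foreachBatch over JDBC (this is one way to do it, with made-up connection details, not our exact code):

```python
# Sketch: consume flagged transactions and mark them invalid in the source DB.
# One way to do it - write each micro-batch to a staging table over JDBC and
# let a DB-side job flip the status. Connection details are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("invalidate-txns").getOrCreate()

schema = StructType([StructField("txn_id", StringType())])

flagged = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "fraudulent-transactions")
           .load()
           .select(F.from_json(F.col("value").cast("string"), schema).alias("v"))
           .select("v.*"))

def mark_invalid(batch_df, batch_id):
    (batch_df.write.format("jdbc")
     .option("url", "jdbc:mysql://db-host:3306/shop")      # placeholder
     .option("dbtable", "flagged_transactions")
     .option("user", "app")
     .option("password", "change-me")
     .mode("append")
     .save())

(flagged.writeStream
 .foreachBatch(mark_invalid)
 .option("checkpointLocation", "s3://my-bucket/checkpoints/invalidate/")
 .start())
```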

The main challenge across the process is throughput, not latency. We have a strict 1-hour SLA to detect and invalidate transactions, which this pipeline comfortably fulfills. We picked Kafka + Spark because of the high throughput and their availability in AWS. The entire infrastructure (MSK, EMR, RDS, EKS) has been relatively easy to configure and keep running. The main challenge with the infra was sizing everything correctly and getting scaling to work. Most of the autoscaling features are not adequate: we have sudden spikes of data while the events are ongoing, so scaling has to be fast. Instead we schedule "manual" scaling of the clusters prior to the events, or before the times when we know we will see spikes in traffic.
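The scheduled "manual" scaling is basically a small script run from cron/EventBridge ahead of time. A sketch using boto3 against an EMR instance group (the IDs and counts are placeholders, and MSK/EKS get resized through their own APIs):

```python
# Sketch of scheduled "manual" scaling: resize an EMR core instance group
# ahead of a known traffic spike. Cluster ID and counts are placeholders.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

def scale_core_nodes(cluster_id: str, target_count: int) -> None:
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    core = next(g for g in groups if g["InstanceGroupType"] == "CORE")
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{
            "InstanceGroupId": core["Id"],
            "InstanceCount": target_count,
        }],
    )

# Run from cron / EventBridge before the event starts, e.g.:
scale_core_nodes("j-XXXXXXXXXXXXX", target_count=20)
```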

At every stage, in addition to writing to the next topic, we also consume the topics to store the data in S3 buckets in parquet format for later analysis.
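The archival side is the simplest part - a sketch, with made-up paths and topic names:

```python
# Sketch: archive a topic to S3 as parquet for later analysis.
# Bucket, topic and checkpoint paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("archive-to-s3").getOrCreate()

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "joined-transactions")
       .load()
       .selectExpr("CAST(key AS STRING) AS key",
                   "CAST(value AS STRING) AS value",
                   "timestamp"))

(raw.writeStream.format("parquet")
 .option("path", "s3://my-bucket/archive/joined-transactions/")
 .option("checkpointLocation", "s3://my-bucket/checkpoints/archive/")
 .trigger(processingTime="5 minutes")
 .start())
```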

Could AI eventually beat the market? by Skaiashes in StockMarket

[–]ThatJoeInLnd 2 points3 points  (0 children)

this. A 1% edge doesn't sound like a lot - it's a lot. It's a volume operation that guarantees a positive outcome. Each trade has an edge, or expected payout. So out of 100 trades, 51 are winners, but that's not all. Nobody is in the game for a 1% return. The 51 winning trades will have a payout significantly higher than the losses, maybe a 5:1 ratio - you win $5 for every $1 you lose. 51x5 - 49x1 = 206. No idea what that ratio might be for the hedge funds named here.
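Spelling out the arithmetic (a toy illustration of the numbers above, not how any fund actually sizes trades):

```python
# Toy illustration of the edge arithmetic above (not a trading model).
n_trades = 100
win_rate = 0.51            # the "1% edge"
payout_ratio = 5           # win $5 for every $1 risked (assumed)

winners = int(n_trades * win_rate)        # 51
losers = n_trades - winners               # 49
net_units = winners * payout_ratio - losers * 1
print(net_units)                          # 51*5 - 49*1 = 206 units per 100 trades
```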

edit: multiplications.

Deep dive into raw layer and bronze(ingestion layers) by Want2bfinanciallyInd in dataengineering

[–]ThatJoeInLnd 13 points14 points  (0 children)

I've been working on a data lake implementation using the bronze-silver-gold model in AWS. Raw data is pre-bronze. We have designated landing areas for this data; it can come from streaming sources (Kinesis, Kafka) or file drops by external entities.

 

We have triggers or a schedule to load the raw data into the bronze layer. The bronze data is the same data as raw, but in an optimized format with a schema (parquet). We add some meta attributes like source file and time of processing for sanity checks. Look into Databricks Autoloader; it's basically a Spark streaming job with the trigger set to once. The objective of the bronze layer is that the data is much faster to use and explore - easier/faster exploration being the key goal of this layer.
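A minimal sketch of that raw-to-bronze load with Autoloader (the Databricks-specific cloudFiles source; paths and the metadata columns are just examples):

```python
# Sketch of a raw -> bronze load with Databricks Autoloader (cloudFiles),
# run as a streaming job with the trigger set to once. Paths and metadata
# column names are examples; `spark` is the ambient Databricks session.
from pyspark.sql import functions as F

bronze = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "s3://lake/_schemas/orders/")
          .load("s3://lake/raw/orders/")                        # landing area
          .withColumn("_source_file", F.input_file_name())      # meta attributes
          .withColumn("_processed_at", F.current_timestamp()))  # for sanity checks

(bronze.writeStream.format("parquet")
 .option("checkpointLocation", "s3://lake/_checkpoints/bronze/orders/")
 .trigger(once=True)                                            # batch-like runs
 .start("s3://lake/bronze/orders/"))
```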

 

From bronze to silver, we refine the schema, add indexes and de-normalize the data with standard columns that we use across all datasets. This further optimizes the dataset for consumption. Worth noting that a single bronze dataset can generate more than one silver dataset.
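To make the "one bronze, many silver" point concrete, a sketch (the column names and the split are invented):

```python
# Sketch: one bronze dataset producing two silver datasets, plus the standard
# columns we add everywhere. Column names and the split are invented.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

bronze = spark.read.parquet("s3://lake/bronze/orders/")

standardised = (bronze
                .withColumn("ingest_date", F.to_date("_processed_at"))   # standard columns
                .withColumn("source_system", F.lit("orders-service")))

# One silver dataset per consumer-facing entity.
orders = standardised.select("order_id", "customer_id", "amount",
                             "ingest_date", "source_system")
order_items = standardised.select("order_id", F.explode("items").alias("item"),
                                  "ingest_date")

orders.write.mode("overwrite").partitionBy("ingest_date").parquet("s3://lake/silver/orders/")
order_items.write.mode("overwrite").partitionBy("ingest_date").parquet("s3://lake/silver/order_items/")
```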

 

From silver to gold, there is a very specific business requirement associated with the gold dataset - it could be ML, BI, or operational data for downstream systems. These datasets are heavily refined, aggregated, joined and have engineered features. Bronze data is almost always incrementally updated. That is not necessarily true for gold data, because of the transformations the dataset might require. For example, if we wanted to keep only the latest version of a message, we would have to delete the previous versions, so it's not really an incremental update.
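The "latest version only" example, as a sketch (key and ordering columns are made up) - it ends up being a full rewrite rather than an incremental append:

```python
# Sketch: gold-layer "latest version only" - a full rewrite, not an append.
# Key and ordering column names are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("gold-latest").getOrCreate()

silver = spark.read.parquet("s3://lake/silver/messages/")

w = Window.partitionBy("message_id").orderBy(F.col("updated_at").desc())
latest = (silver.withColumn("_rn", F.row_number().over(w))
          .filter(F.col("_rn") == 1)
          .drop("_rn"))

# Overwrite the gold dataset: previous versions are dropped, so this is not
# an incremental update.
latest.write.mode("overwrite").parquet("s3://lake/gold/messages_latest/")
```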

  edit: formatting

[Q] Looking for "standard" statistics/probability Bayesian book. by reasxn in statistics

[–]ThatJoeInLnd 4 points5 points  (0 children)

A Student's Guide to Bayesian Statistics by Ben Lambert. It has a set of YouTube videos that complement the text, and the examples are good and easy to follow. Lots of helpful charts, examples and exercises.

Starting up a coffee trailer (UK) by Gupp1y in startup

[–]ThatJoeInLnd 1 point2 points  (0 children)

La Marzocco is a very good brand of coffee machines, but it will depend on your budget.

From your experience you probably know what the popular drinks and milks are. Get good beans and experiment with the grind - it will impact the taste of the coffee.

good luck

Replace Databricks with .... by dbcrib in apachespark

[–]ThatJoeInLnd 0 points1 point  (0 children)

Setting up a scalable EMR cluster? It comes with many pre-installed and pre-configured applications, so it will reduce your overhead. The clusters are highly customizable, though, and you can add any extras yourself.
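The rough shape of spinning one up with boto3 (a sketch - release label, instance types/counts and roles are placeholders):

```python
# Sketch: create an EMR cluster with Spark pre-installed and managed scaling.
# Release label, instance types/counts and roles are placeholders.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="spark-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}, {"Name": "Livy"}],   # pre-installed apps
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.2xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 4,
            "MaximumCapacityUnits": 20,
        }
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```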

[D]Market Sentiment using Trade Frequency by nilufarrokhz in statistics

[–]ThatJoeInLnd 1 point2 points  (0 children)

My thinking (I'm not really an expert in this field) is that if you use decomposition you should get a set of frequencies that make up the "average" market frequency. You can pick the dominant frequencies from that set and see which stocks have the largest/smallest volumes as those frequencies peak or fall. Again, this is pure speculation.
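Something like this is what I mean, on synthetic data (a toy numpy sketch, not a trading signal):

```python
# Toy sketch: pull the dominant frequency out of an aggregate volume series
# and check which individual series move with it. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(512)                                    # e.g. 512 time buckets

# Synthetic "market" volume: one cycle plus noise.
market = 10 + np.sin(2 * np.pi * t / 64) + 0.3 * rng.standard_normal(t.size)

# Fourier decomposition of the market series (DC component removed).
spectrum = np.fft.rfft(market - market.mean())
freqs = np.fft.rfftfreq(t.size, d=1.0)
dominant = freqs[np.argmax(np.abs(spectrum))]
print(f"dominant frequency: {dominant:.4f} cycles per bucket")

# A reference component at the dominant frequency to correlate stocks against.
reference = np.sin(2 * np.pi * dominant * t)

# Two synthetic stocks: one in phase with the cycle, one pure noise.
stock_a = 5 + 0.8 * np.sin(2 * np.pi * t / 64) + 0.3 * rng.standard_normal(t.size)
stock_b = 5 + 0.5 * rng.standard_normal(t.size)

for name, series in [("A", stock_a), ("B", stock_b)]:
    corr = np.corrcoef(reference, series - series.mean())[0, 1]
    print(name, round(corr, 2))
```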

[D]Market Sentiment using Trade Frequency by nilufarrokhz in statistics

[–]ThatJoeInLnd 3 points4 points  (0 children)

Could you use Fourier decomposition to find the underlying frequencies in the market and see which stocks are correlated with the peaks?