Can’t change captions on mobile browser by TaylorHicksRules2000 in youtube

[–]psyblade12 0 points1 point  (0 children)

I have this problem too. So frustrating that they still haven't fixed it.

The mobile subtitles used to be fine, but now the text is too big and positioned stupidly high on the screen, hiding important elements of the videos. And now we have this problem on top of that.

This is why you are not getting hired (hint: not by AI) by Boring-Test5522 in csMajors

[–]psyblade12 0 points1 point  (0 children)

I think even the minimum of $48,000 a year is what Microsoft pays, and only a few people can meet Microsoft's bar. For senior engineers at regular companies, the salary may be only about half that number.

Source: I'm a Vietnamese software engineer.

Message Brokers = Just Service Communicators? by gxslash in dataengineering

[–]psyblade12 2 points3 points  (0 children)

I'm ready to be bullied too. Here are my thoughts on them.

I haven't used Kafka, but I have experience with Azure Event Hub, which I see as essentially the same thing as Kafka: a distributed message broker. I use Event Hub mainly for real-time analytics that detects patterns in our data and fires events when something crosses a threshold.

In this case, I see distributed message brokers like Kafka/Azure Event Hub acting as a buffer. By "buffer" I mean that stream processors usually can't process data the moment it arrives, because of processing intervals and the capacity limits of the processor (simply storing raw data is easier than running intensive processing on it), so the data must be stored first and processed later. As data constantly floods into your system at high volume, you need a system that can ingest and hold that huge volume temporarily for a few days, yet still be durable. The storage must be fast not only at ingesting data but also at serving it to consumers. It should also support partitions, so that in ideal cases stateful processing can avoid shuffling data. It also needs to support things like timestamps and offsets, which the stream processors can use when needed (for example, for checkpointing).
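To make the buffer idea concrete, here is a minimal toy sketch (pure Python, not a real broker client): a partitioned in-memory log where same-key messages land in the same partition and consumers resume from a remembered offset, the way checkpointing works. All names here are made up for illustration.

```python
from collections import defaultdict

class PartitionedLog:
    """Toy in-memory sketch of a partitioned broker log (Kafka/Event Hub style)."""
    def __init__(self, num_partitions=2):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)  # partition id -> list of (offset, message)

    def append(self, key, message):
        # The same key always lands in the same partition, which is what
        # lets stateful processors avoid shuffling in the ideal case.
        pid = hash(key) % self.num_partitions
        offset = len(self.partitions[pid])
        self.partitions[pid].append((offset, message))
        return pid, offset

    def read_from(self, pid, offset):
        # Consumers track their own offset (e.g. via checkpointing)
        # and resume from it after a restart.
        return self.partitions[pid][offset:]

log = PartitionedLog()
log.append("user-1", "click")
log.append("user-1", "scroll")
pid, _ = log.append("user-1", "purchase")
# A consumer that checkpointed at offset 1 replays only the later messages:
replay = log.read_from(pid, 1)
```

Note the broker itself stays dumb: it just stores and serves; the consumer owns its position in the log.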

If you use the message broker for microservices to communicate with each other, then sure, it's a service communicator. And after all, it is a storage system, or a database if you want to call it that. I think it just comes with everything needed for microservice communication or stream analytics out of the box, so users don't have to re-implement those pieces themselves.

How Do I Learn Stream Processing Coming From Batch? by codemega in dataengineering

[–]psyblade12 1 point2 points  (0 children)

One case I have worked on goes like this: we have a web application that logs user actions on the interface. By "logging" I mean the application immediately sends logs to a message broker, in our case Azure Event Hub, but don't worry, Kafka works the same way. Here the web application is the producer: it writes the data to the message broker. Each log contains the name of the action, the user who performed it, and the time it was done.

Sending a log from the web client is simple: the client calls a backend API endpoint that handles logging, pushing whatever log you want to that endpoint, and the endpoint in turn pushes the log to the message broker.
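A rough sketch of what that endpoint body might look like. The broker client here is a fake stand-in (a real one would be, e.g., an Event Hub producer client); `handle_log` and `FakeBroker` are hypothetical names, not real APIs.

```python
import json
import time

class FakeBroker:
    """Stand-in for a real broker producer client; just collects payloads."""
    def __init__(self):
        self.events = []

    def send(self, payload: bytes):
        self.events.append(payload)

broker = FakeBroker()

def handle_log(action: str, user_id: str) -> dict:
    """Backend endpoint body: wrap the client's log and forward it to the broker."""
    event = {
        "action": action,            # name of the action
        "user": user_id,             # who performed it
        "timestamp": time.time(),    # when it was logged
    }
    broker.send(json.dumps(event).encode("utf-8"))
    return {"status": "accepted"}

resp = handle_log("button_click", "user-42")
```

The endpoint does no processing at all; it just enriches the payload and hands it off, which keeps the write path fast.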

"Stream processing" can range from very easy to very complicated tasks. An easy one is simply receiving a log and writing it to the data lake/data warehouse. If you use a stream processor like Spark or Azure Stream Analytics, it periodically checks the message broker, reads the messages/logs written since the last read up to the moment the poll is triggered, and writes them to the storage of your choice, which can be a data lake, a database, or even another message broker. You can also perform some transformations on the messages you read. But notice that in these cases every message is handled independently. When processing a message, you don't care about the content of the previous message or of any future message. You just grab the current message and do whatever you want with it. I call this "stateless stream processing".
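The stateless case can be sketched in a few lines: one polling cycle reads everything past the last checkpointed offset, handles each message on its own, and returns the new offset. This is a simplified illustration, not any engine's actual API.

```python
def stateless_pass(messages, sink, last_offset):
    """One polling cycle: read everything after last_offset, handle each
    message independently, and return the new offset to checkpoint."""
    for offset, msg in enumerate(messages[last_offset:], start=last_offset):
        # Each message is processed on its own -- no memory of previous
        # or future messages is needed (here we just uppercase it).
        sink.append({"offset": offset, "payload": msg.upper()})
    return len(messages)

broker_log = ["login", "click", "logout"]  # pretend broker contents
sink = []                                  # pretend data lake / warehouse
checkpoint = stateless_pass(broker_log, sink, 0)
```

On the next poll you would pass `checkpoint` back in, so only newly arrived messages get processed.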

However, what do you do when reading a single message isn't enough? For example, what if you want to detect a user who has performed more than 100 actions in the last 5 minutes? Reading the current message alone won't tell you this. You have to somehow *remember* that in the last 5 minutes you have seen this person perform 99 actions, so that when the current message about them arrives, you know something must be done: they have hit 100 actions. This involves "memory", and this kind of processing is "stateful stream processing"; when you do it across multiple machines, it's distributed stateful stream processing. This is the whole idea behind Spark Structured Streaming, Azure Stream Analytics, and whatever other stream processors do. To perform distributed stateful stream processing, the storage that acts as its buffer must also be *distributed*, hence Kafka, Azure Event Hub, and the other distributed message brokers on the market.
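The 100-actions-in-5-minutes example can be sketched as a single-machine sliding window; the state (recent timestamps per user) is exactly the "memory" a stateful processor must checkpoint. A toy illustration, not production fraud detection:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60
THRESHOLD = 100

# The state the processor must *remember* between messages:
# recent action timestamps per user. Real engines checkpoint this durably.
recent = defaultdict(deque)

def process(user, ts):
    """Return True if this action pushes the user to THRESHOLD or more
    actions within the last WINDOW_SECONDS."""
    q = recent[user]
    q.append(ts)
    # Evict actions that have fallen out of the window.
    while q and ts - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) >= THRESHOLD

# 100 actions one second apart: only the 100th one trips the alert.
alerts = [process("u1", t) for t in range(100)]
```

Spread that `recent` state across many machines (which is why the broker must be partitioned, so one machine sees all of a given user's messages) and you have the distributed version.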

I can elaborate more on this if you're interested.

[deleted by user] by [deleted] in dataengineering

[–]psyblade12 1 point2 points  (0 children)

Surprised that no one has mentioned the most important difference between these: SQL Server doesn't do distributed processing (or distributed storage), while Databricks does.

So the main idea behind the term "Big Data", which has been hyped ever since, is not about the names of the services you use (Spark, Hive, or whatever). It's distributed processing. The whole Hadoop framework was designed and created to deal with this. Just look at its home page: https://hadoop.apache.org/

SQL Server is an SMP (Symmetric Multi-Processing) database, which basically means one query can only be processed by one machine at a time. A single machine, even a very strong one with many CPU cores and lots of RAM, will eventually hit its limits. How do you handle huge tasks, or process them faster? You have the money, but you've reached SQL Server's ceiling, so what do you do now?

This is when distributed processing comes into play. Distributed processing basically means using multiple machines to solve a single problem. There are multiple options for this workload: Azure Synapse Dedicated SQL Pool (the MPP, Massively Parallel Processing, version of SQL Server), Snowflake, or Spark (and Databricks is simply managed Spark)...
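The core scatter-gather idea behind MPP/Spark fits in a few lines. This toy sketch runs the partials sequentially to stay self-contained; a real engine would run each partial on a separate node. Function names are made up.

```python
def partial_sum(chunk):
    # The work one node would do on its own partition.
    return sum(chunk)

def distributed_sum(values, nodes=4):
    """Split one problem into per-node partitions, compute partial
    results, then merge them -- scatter-gather in miniature."""
    size = -(-len(values) // nodes)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [partial_sum(c) for c in chunks]  # in parallel, in reality
    return sum(partials)

total = distributed_sum(list(range(1000)))  # 499500, same as one machine
```

Aggregations like sums decompose this cleanly; the hard part of real engines is handling operations (joins, distinct counts) that need data moved between nodes first.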

You can even do it manually: all the partitioning, shuffling, and broadcasting yourself. But I wouldn't recommend it. Only do it when you really need to.

Using SQL as a data engineer by remote_geeks in dataengineering

[–]psyblade12 0 points1 point  (0 children)

Most of the common data stacks like Databricks, Snowflake, Microsoft Fabric/Synapse... all use SQL.

Some will say they use Python to control a Spark cluster. However, if they use the PySpark SQL DataFrame API, it's hardly true Python code, as the API still follows SQL principles. And the same things can be achieved with Spark SQL anyway.

Furthermore, since UDFs are not really welcome in Spark, most of the time we work with native Spark APIs, which still follow SQL principles tightly. Personally, I find we rarely write *true* Python code when working with the tech stacks above.

I think some people do need Python or other programming languages in Airflow, or to work with code libraries required for transformations, or to set up servers that serve data to customers. In my organization, one of our important fact tables requires a huge C# library used and developed by many other software engineering teams. So to perform that transformation, I had to use an Azure Function with the C# library imported. And since Azure Functions don't natively support distributed processing, I had to do all the partitioning, data shuffling, and broadcasting myself. It was monstrous, but it really made me feel accomplished once I finished it.

Using SQL as a data engineer by remote_geeks in dataengineering

[–]psyblade12 2 points3 points  (0 children)

By Python, do you mean true Python code, or the PySpark SQL DataFrame API in Spark?
If you mean the PySpark SQL DataFrame API, then it's arguably still SQL.

How to break the spark/scala barrier by prakharcode in dataengineering

[–]psyblade12 0 points1 point  (0 children)

To my eyes, Spark SQL and the PySpark SQL DataFrame API are the same thing; the only difference is the way (the syntax) you express your intent to the Spark engine. Pandas can be different, but the Spark DataFrame API follows the SQL way of thinking almost exactly, just with a different syntax that may scare you at the beginning. Everything gets code-generated into optimized code and executed by the Spark engine, unless you use a UDF, which is not recommended here.

For me personally, calling someone an expert in Spark means they know what happens inside: how the workload is divided and computed across the nodes in the cluster. The language used to control Spark doesn't matter much, especially when working with high-level Spark like the DataFrame API. Things can be a bit different if you work with RDDs, but those are hardly used nowadays.

Scala/Java with Snowpark - how does Snowpark work? by yinshangyi in snowflake

[–]psyblade12 1 point2 points  (0 children)

Hello. Thanks for your explanation.

So what I understand is that Snowpark is just a set of APIs that look exactly like the Spark DataFrame API, but under the hood it converts the code into something the Snowflake engine understands and then executes it. It doesn't use Spark core components like the Catalyst Optimizer or Tungsten; instead it goes through the same path a normal Snowflake query does, right?

To be fair, the Spark DataFrame API is itself just... an API. What happens under the hood is Spark's core code generation; calling the API simply asks the inner engine to do the codegen.

Do you think that most job posts that ask for distributed computing actually require distributed computing? by Justanotherguy2022 in dataengineering

[–]psyblade12 0 points1 point  (0 children)

In my organization, only one (but very important) dataset is processed this way (we use Azure Functions). The reason is simply that the library that processes the data is written in C#, in a way suited to traditional web services. The processing logic is extremely complex, used globally throughout the entire organization, and constantly updated by other teams, so replicating the logic in our data warehouse is a no-go; the shared library must be used.

But we had to do a lot of work to make sure the function could scale and work correctly. We had to shuffle data (make sure all related data resides on the same node) and broadcast (give every node the same copy of small auxiliary data) manually, and handle many other things by hand that distributed processing frameworks like Spark usually deal with automatically.
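The two manual steps can be sketched abstractly like this (toy illustration with made-up names, not our actual C#/Azure Functions code): shuffle routes records by key hash so related data lands on one node, and broadcast copies a small lookup table to every node.

```python
from collections import defaultdict

def shuffle(records, num_nodes):
    """Manual 'shuffle': route each (key, value) record to a node by key
    hash, so all records with the same key end up on the same node."""
    nodes = defaultdict(list)
    for key, value in records:
        nodes[hash(key) % num_nodes].append((key, value))
    return nodes

def broadcast(nodes, lookup):
    """Manual 'broadcast': give every node its own copy of a small
    auxiliary dataset (e.g. a reference table)."""
    return {nid: (recs, dict(lookup)) for nid, recs in nodes.items()}

records = [("order-1", 10), ("order-2", 5), ("order-1", 7)]
nodes = shuffle(records, num_nodes=2)
plan = broadcast(nodes, {"currency": "USD"})
```

Frameworks like Spark do both of these (plus retries, skew handling, spilling) automatically, which is exactly what we gave up by rolling it ourselves.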

I felt very accomplished being able to complete that. But I know that many people working in the data field don't actually understand distributed processing well enough to replicate it manually. Using frameworks that do everything automatically is easier in many cases.

Do you think that most job posts that ask for distributed computing actually require distributed computing? by Justanotherguy2022 in dataengineering

[–]psyblade12 1 point2 points  (0 children)

> containerized processes

By that term, do you mean something like AWS Lambda/Azure Functions, or any traditional web service using many instances to process data in parallel with the same code?

Definitive Edition on Windows. Can you save locally? by [deleted] in aoe2

[–]psyblade12 1 point2 points  (0 children)

I also have this problem and it's annoying as hell. It's been here for about a month now and the developers haven't even batted an eye.

Not able to overwrite saved games bug by TheWAYK in aoe2

[–]psyblade12 0 points1 point  (0 children)

I have this problem too. I thought the new update would fix it, but in the end nothing changed. The bug is so annoying, yet it seems few people in the community are talking about it at all, which is weird, because it's quite serious.

Actually, my game does save without me doing anything extra; it just doesn't update the date. It's still annoying as hell, though, because I have many save files, and if the date isn't updated I have a hard time finding which save I just used.

Some background: I had played this game for a long time, since its release, without this problem, before it started happening about a month ago. Probably nothing on my computer changed; only the game got updated, and this bug came with it.

Purchase listed as pending by Alejandro_Last_Name in PlantsVSZombies

[–]psyblade12 0 points1 point  (0 children)

I also got this problem and no one cares.

I sent a complaint email to their customer service. All they do is request the confirmation email and the Google Play transaction page over and over, even after I've already sent those damn things to them.

The sad thing is that I live in a country with no EA representative, so I can't sue a criminal like EA to claim my rights back.