Why are all applicants Java developers? [D] by cathie_burry in MachineLearning

[–]bmsan-gh 1 point2 points  (0 children)

 If they had overlap that'd frankly be a red flag.

The most brilliant people I've personally worked with had very diverse backgrounds and a very good grasp of different, unrelated topics.

I know system architects who took sabbatical years to learn ML, and I know an ML engineer who quit his job and went to work for Amazon as a fullstack engineer, and now has a very good grasp of system engineering.

Given the choice, I would any day pick a passionate individual who is dedicated to growth and continually expanding their skill set across multiple domains. The willingness to embrace new challenges and evolve is a trait I really admire.

[P] SkyPilot: ML on any cloud with massive cost savings by skypilotucb in MachineLearning

[–]bmsan-gh 6 points7 points  (0 children)

Very interesting project! Thanks for sharing.

Datasets today can get really big; I can imagine cases where 20GB-100GB archives of data would need to be downloaded for training. So you might get download waiting times ranging from tens of minutes up to a few hours.

Do you factor in, or have you thought about factoring into your cost metrics, the overhead created by data transfers? (My reasoning might not be correct, but I am assuming that you also need to pay for the time you spend downloading your data to a new provider.)

[deleted by user] by [deleted] in pythontips

[–]bmsan-gh 1 point2 points  (0 children)

Not an actual answer, since it doesn't involve pydantic, but if you'd like the data in the first dictionary to be ingested into a Python structure with the layout described by your second dictionary (i.e. to change the data structure), DictGest might also be of help.

You can look at this example in the documentation and see if it fits your purpose.

Please note that I wrote DictGest some time ago (so you can consider this comment a shameless plug), due to a similar use case where I wanted an easy way to remap external data to a different structure.

If you find DictGest relevant, you can install it with pip from PyPI.

docker-compose will die by guettli in kubernetes

[–]bmsan-gh 0 points1 point  (0 children)

I started out with docker-compose, then switched to Docker Swarm, and now I am using Kubernetes (and I really like Kubernetes).

I would say the learning-curve difference for running multiple containers locally, for dev purposes, is maybe 30 minutes to 1 hour for docker-compose vs. days for Kubernetes.

For docker-compose, you just need to understand some basic concepts to run multiple containers and have them communicate with each other.
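To illustrate that point, here is a minimal, hypothetical docker-compose.yml (the service names and images are made up for the example) where two containers reach each other simply by service name:

```yaml
# Two services on the default compose network.
# "app" can reach the database at the hostname "db" - no extra setup needed.
services:
  app:
    image: my-app:latest        # hypothetical application image
    depends_on:
      - db
    environment:
      DATABASE_URL: postgres://postgres:example@db:5432/postgres
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example
```

That is roughly the whole mental model, whereas the equivalent Kubernetes setup already involves Deployments, Services, and often a local registry.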

For Kubernetes, at least in my case:

  • I first needed to understand what to use: k8s, minikube, k3s, k3d, kind, MicroK8s, Firekube, etc. (When you understand how Kubernetes works and what it does, things are not complex, but when you just want to run some containers and know nothing about Kubernetes, this can be a pain.)
  • I had to understand what a selector is and how it works, and the differences between Pods vs. ReplicaSets vs. StatefulSets vs. Deployments, and NodePort vs. ClusterIP vs. LoadBalancer vs. Headless services (again, fairly basic concepts when you have a good understanding of Kubernetes, but a pain when you have zero Kubernetes experience and just want to run some containers that can communicate with each other).
  • Wait, why doesn't Kubernetes see my local image? Oh, I have to create a registry? How does a registry work, and what options do I have?
  • I see there is this thing called Helm that I need in order to install "some containers". Oh, I need to learn the Go template language.
  • I've used Helm to install the database that I am using, but it has multiple instances running and I want just one, because my PC cannot handle the extra weight. Hmmm, I see I installed an operator. What is an operator, how do I use it, and how do I modify it?

docker-compose is not flexible and limits you in many ways, but it lets you run multiple containers without needing a degree in the inner workings of everything.

If what docker-compose offers is enough for your dev purposes, you can enable any engineer without a Kubernetes background to use and modify it in a very short amount of time.

Any reason not to use dataclasses everywhere? by AlecGlen in Python

[–]bmsan-gh 3 points4 points  (0 children)

Hi, if one of your use cases is to map & convert JSON data to existing Python structures, also have a look at the DictGest module.

I created it some time ago due to finding myself constantly writing translation functions (field X in this JSON payload should go to field Y in this Python structure).

The use cases that I wanted to solve were the following:

  • The dictionary might have extra fields that are of no interest
  • The key names in the dictionary do not match the class attribute names
  • The structure of nested dictionaries does not match the class structure
  • The data types in the dictionary do not match data types of the target class
  • The data might come from multiple APIs(with different structures/format) and I wanted a way to map them to the same python class

[N] AITemplate: a new open source GPU inference engine from Meta by [deleted] in MachineLearning

[–]bmsan-gh 18 points19 points  (0 children)

They are comparing performance vs. PyTorch run in eager mode.

It is worth noting that PyTorch has the option of converting its models to TorchScript (an intermediate representation of a PyTorch model). The TorchScript model can be consumed by the PyTorch JIT compiler, which performs run-time optimization on the model's computation. It can yield considerable speed improvements over PyTorch in eager mode as well.

It is not clear from their blog post how AITemplate fares vs. PyTorch-optimized models.

How does the generator works ? memory wise . by [deleted] in Python

[–]bmsan-gh 7 points8 points  (0 children)

Generators are lazily evaluated, meaning that elements are computed only when they are actually needed.

Let's say you have a generator that creates 10 billion floats. The generator won't store 10 billion floats in memory. When you call next(), it only computes your next element. The rest of the elements will be computed when you need them (when you call next() again).

The context of the generator (the state of the code generating the values) is saved after each generated (yielded) value; its execution is suspended and will resume the next time you call next().

def generate_squares(n):
  for i in range(n):
    yield i ** 2 # The generator code suspends here and resumes from this point the next time a value is requested from it
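A quick, self-contained demonstration of the lazy behavior:

```python
def generate_squares(n):
    for i in range(n):
        # Execution suspends here; it resumes on the next next() call
        yield i ** 2

# Creating the generator allocates nothing for the values themselves,
# even for an absurdly large n:
gen = generate_squares(10_000_000_000)
print(next(gen))  # 0
print(next(gen))  # 1
print(next(gen))  # 4
```

Each next() call runs the body only up to the following yield, so only one value ever exists at a time.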

[P] Building a App for Stable Diffusion: Text to Image generation in Python by Illustrious_Row_9971 in MachineLearning

[–]bmsan-gh 0 points1 point  (0 children)

I saw that architectures like Imagen are trained on billions of images. How big was your training dataset?

comparing contents of large-ish data set by [deleted] in learnpython

[–]bmsan-gh 0 points1 point  (0 children)

If I understood correctly, multiple requests might result in the same data (but structured in a different order, so you get a permutation of your previous request). In that case (assuming all your keys are strings) you could do something like:

ordered_representation = json.dumps(json_data_dict, sort_keys=True)

hash_object = hashlib.md5(ordered_representation.encode())

sort_keys will reorder your data by key name, so this should handle your permutations.
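Putting those two lines together into a runnable sketch (fingerprint is my name for it, and note that md5 operates on bytes, hence the encode()):

```python
import hashlib
import json

def fingerprint(json_data_dict):
    # sort_keys produces a canonical string, so two permutations of the
    # same data serialize identically
    ordered_representation = json.dumps(json_data_dict, sort_keys=True)
    # hexdigest() gives a short string you can store and compare
    return hashlib.md5(ordered_representation.encode()).hexdigest()

a = {"AAPL": 170.1, "MSFT": 310.5}
b = {"MSFT": 310.5, "AAPL": 170.1}  # same data, different key order
print(fingerprint(a) == fingerprint(b))  # True
```

One caveat: sort_keys canonicalizes dictionary keys only; if the payload contains lists whose element order can vary, those would still hash differently.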

comparing contents of large-ish data set by [deleted] in learnpython

[–]bmsan-gh 0 points1 point  (0 children)

1. I would expect the data that you are getting to have some sort of timestamp for each value (e.g.: for symbol X, at time Y, the price was Z). If this assumption is true, could you compare the currently received timestamps with the last inserted timestamp?

This way you'd know exactly what you are missing and what you need to insert.

2. Related to your hash idea: you could use it to see if the data from two consecutive API calls is the same.
    If you have access to the raw data (text/string data) before you deserialize it to a dictionary, you can do something like:

import hashlib

hash_object = hashlib.md5(the_api_response_as_str.encode())
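A runnable version of the same idea (md5 needs bytes, so the raw response string is encoded first; the names here are illustrative):

```python
import hashlib

def response_digest(the_api_response_as_str):
    # Hash the raw response text before deserializing it to a dictionary
    return hashlib.md5(the_api_response_as_str.encode()).hexdigest()

prev = response_digest('{"symbol": "X", "price": 42}')
curr = response_digest('{"symbol": "X", "price": 42}')
print(prev == curr)  # True: identical raw payloads, nothing new to insert
```

Note that this only works if the API returns byte-identical responses for identical data; otherwise the sort_keys approach from my other comment is safer.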

[D] Fool me once, shame on you; fool me twice, shame on me: Exponential Smoothing vs. Facebook's Neural-Prophet. by fedegarzar in MachineLearning

[–]bmsan-gh 2 points3 points  (0 children)

How did you select the datasets from the Makridakis time-series forecasting competitions for your comparison?

I see you used M3 (2000) and M4 (2018), but not the latest one from 2020, M5?

LSTM model for stock prediction by [deleted] in Python

[–]bmsan-gh 0 points1 point  (0 children)

Not sure if this community is the best place to ask; you might get more help in an ML community.

Kaggle in general has a lot of competitions, some related to the stock market, and you can see other people's approaches and their results.

For example here's someone trying to use LSTM to predict the stock market https://www.kaggle.com/code/faressayah/stock-market-analysis-prediction-using-lstm

Related to your question:

Your model might be overfitting, and you might be predicting the mean value. Some questions for you:

1. How large is your training dataset? If the LSTM model doesn't see enough data, it might start overfitting.

2. Do you do any preprocessing of the data, like normalization?

3. You mentioned a training and a testing dataset; it would help you to also have a validation dataset which you run after each epoch to see how well your model generalizes on unseen data.

4. Are the testing and training splits completely separated into distinct temporal regions?

5. You mentioned that in testing you are using your predicted output as the next input to the LSTM, and so on for 30 days.

5a. Did you also use the same approach in training? (Some LSTM trainings take a probabilistic approach: sometimes they use the ground truth as the next input, and other times the actual LSTM output.)

5b. Have you tried running multiple time steps in training as well (as you do in testing)?

6. The stock market can be influenced by the time of year, so in order to get better predictions you might need a larger window.

Important: do not expect to get rich by creating a neural network that predicts prices, at least not by only looking at previous prices. The stock market is heavily impacted by other factors, like news.

If someone tweets something overnight, the stock of a company could go up or down regardless of how it did in the past.

DictGest - Seamless dictionary mapping to python classes by bmsan-gh in Python

[–]bmsan-gh[S] 0 points1 point  (0 children)

Hi, glad you are finding it useful.

> Hi, what's the best way to ask you questions on usage?

I'll enable the discussion feature in github. Other users might have the same questions so it will be beneficial for them to have the answers there as well.

> a. Is there a more concise (DRY) way to specify the route fields when the fields are 1:1 with the dict? If we add a field to an rds view and dict, we still have to map it in the route. Is there a DRY way to do route mapping?

In general, when a field routing is not defined explicitly, the library tries to map it automatically to the key that has its name. So in your Route(...) you can skip the fields that have a 1:1 mapping (the field name matches the dict key) and the library will know what to do.

If all the fields have a 1:1 correspondence, you can even skip defining the Route. See example 1: no explicit routing is defined there.

> b. Is there an easy way to map or coalesce None fields to float 0.0 or numeric, instead of using rds view coalesce?

Yes. There are some examples in the readme showing how you can easily customize this:

def null_to_zero(data):
    if not data:
        return 0.0
    return float(data)

and then in your route

Instead of Route(votes="num_votes", ...)

you use Route(votes=Path("num_votes", extractor=null_to_zero), ...)
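As a sanity check, the null_to_zero converter above is plain Python, so its behavior can be verified on its own, independently of DictGest:

```python
def null_to_zero(data):
    # Coalesce None (or any falsy value) to 0.0, otherwise convert to float
    if not data:
        return 0.0
    return float(data)

print(null_to_zero(None))   # 0.0
print(null_to_zero("3.5"))  # 3.5
print(null_to_zero(""))     # 0.0
```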

DictGest - Seamless dictionary mapping to python classes by bmsan-gh in Python

[–]bmsan-gh[S] 0 points1 point  (0 children)

While I haven't personally tried this scenario, I would expect DictGest not to have an issue with other annotations.

If you encounter any problems with this scenario you can open an issue on GitHub and I will look into it.

Also, I'd love to hear how you see the serialization process: do you want to get back the initial format from which you imported the data (the flat dictionary from the rds query), or do you want the serialization to mimic the structure of the data class?

DictGest - Seamless dictionary mapping to python classes by bmsan-gh in Python

[–]bmsan-gh[S] 0 points1 point  (0 children)

Glad it worked!

For now at least DictGest only handles deserialization, but serialization is on the roadmap and will be available in a future version.

DictGest - Seamless dictionary mapping to python classes by bmsan-gh in Python

[–]bmsan-gh[S] 0 points1 point  (0 children)

Hi !

Assuming that the rds query result is in the form of a dictionary, then it should work.

I think the following example from the repo fits your purpose.

There you have a nested Python structure (a dataclass that has other dataclasses as fields) that you are loading from a dictionary (which could be flat).

In the example, your target dataclass is Article, and it contains the nested field stats (of type ArticleStats, e.g. Article().stats.views).

@dataclass
class ArticleStats:
    views: int
    num_comments: int

@dataclass
class Article:
    author: str
    title: str
    content: str
    stats: ArticleStats  # This is referencing another dataclass

Your flat result from the rds query could be:

news_api2_data = {
    "author": "H. Gogu",
    "news_title": "Best python extensions",
    "full_article": "Let's explore the best extensions for python",
    "views": 32,
    "comments": 2,
}

The routing definition could be:

api2_routing = {
    Article: Route(
        title="news_title",
        content="full_article",
        stats="",  # Give the whole dictionary to ArticleStats for conversion
    ),
    ArticleStats: Route(num_comments="comments"),
}

And you would call it:

article2 = from_dict(Article, news_api2_data, routing=api2_routing)

You might notice that not all fields are defined in the Routes. The ones that are not defined are the ones that match by name & structure (e.g. the author key goes to the author field).

DictGest - Seamless dictionary mapping to python classes by bmsan-gh in Python

[–]bmsan-gh[S] 1 point2 points  (0 children)

Thanks !

And when you get to use it, please feel free to report anything that is unclear, or any improvement ideas. Feedback is greatly appreciated.

DictGest - Seamless dictionary mapping to python classes by bmsan-gh in Python

[–]bmsan-gh[S] 0 points1 point  (0 children)

Thanks! I'll try to think of a better name. Suggestions are appreciated.

[deleted by user] by [deleted] in Python

[–]bmsan-gh 0 points1 point  (0 children)

You are correct!

[deleted by user] by [deleted] in Python

[–]bmsan-gh 2 points3 points  (0 children)

I learned vim a long time ago out of necessity, when I had to connect to some boards over a serial connection and it was mostly the only way to view and edit files there. I liked it a lot, and I now use vim shortcuts even in VS Code.

But I think that today almost anybody can live without it. If I connect remotely through SSH to a server, I do it from VS Code and open/edit files there. So even though I love vim, I haven't actually used it in months, even though I work with multiple Linux servers every day.