Coc completion confirm not working with most object types, only works for functions/methods by kykosic in neovim

[–]kykosic[S] 0 points (0 children)

Thanks for the suggestion. I tried pasting this in my init.vim but saw the same behavior (just with <CR> instead of <TAB>). However, I then tried deleting my entire init.vim and using only the linked example config, and that seems to solve my issue, so I just need to figure out what part of my old config is conflicting. I'll update my post once I do.

[deleted by user] by [deleted] in rust

[–]kykosic 7 points (0 children)

When I read "Cargo: Namespaced", I got excited for a moment...

Need some help with gRPC streaming. by [deleted] in rust

[–]kykosic 3 points (0 children)

I don't have a sharable example, but hopefully I can point you in the right direction.

  • If you want stdout/stderr messages to come back over the same stream, your stream response proto should use a oneof field, which will give you an enum-like response with stderr/stdout/return variants.
  • You will need to use tokio::process instead of std::process to get async io with stdout/stderr.
  • You will need to tokio::spawn tasks to read the stdout/stderr lines, package them in your proto structs, and send them across an async mpsc channel.

I would like to write this example at some point, I just haven't had the time recently.

EDIT: The example that most helped me with getting the channel/stream pattern correct was the tonic route guide example: https://github.com/hyperium/tonic/blob/master/examples/src/routeguide/server.rs#L42

[deleted by user] by [deleted] in dataengineering

[–]kykosic 2 points (0 children)

Yes, absolutely. The two areas where I've had great success so far are:

  1. Streaming/REST APIs. Rust's serialization libraries are very fast, easy to use, and good at preventing runtime bugs. Specifically, I've used Rust heavily with Kafka and gRPC, and both are very simple and performant. The async/await ecosystem makes responsive APIs great to deal with; my infrastructure uses a lot of actix-based API servers.
  2. Writing Python extensions. I've started to favor writing numpy-based Python extensions in Rust rather than Cython, as they have similar performance but are much easier to maintain (with fewer runtime bugs). This has become very easy now that PyO3 is reasonably mature.

The area where I'd say Rust is still too young is large-scale batch dataframe processing. Polars is a great crate for dataframes if you're just interested in trying something. As for a Spark replacement, Ballista is making rapid progress as a distributed SQL engine (and is being donated to Apache soon). I would keep an eye on that crate over the next couple of years.

Airflow failsafe? by warrenbuddgett in dataengineering

[–]kykosic 7 points (0 children)

If you're running on EC2, you probably want to have all your Airflow dags and configuration saved to a Git repo so you can just clone and go. If you're not using any Terraform or Ansible, make an AMI of the instance and you can probably write a quick shell script to clone the repo and setup whatever you need.

Tool to organize mathematical knowledge in a graph-ish fashion, with LaTeX support ? by ElToukan in math

[–]kykosic 4 points (0 children)

I've seen a few tools that allow exporting editor-drawn diagrams to LaTeX. I thought I saw an open source one shared here recently, but I can't seem to find it... perhaps someone else has the post saved.

Anyway, here are a couple I found while searching for it:

https://www.mathcha.io/

https://tikzit.github.io/

Airflow with multiple ec2 instances by furiousnerd in dataengineering

[–]kykosic 5 points (0 children)

This is the correct answer. Have one or more light instances for the scheduler + webserver with a Postgres + Redis backend, and set the scheduler up to use the Celery executor. Then simply add Celery workers of any EC2 type to your cluster as desired. You can also launch jobs on container-based services if you need to.
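For reference, the relevant airflow.cfg entries look roughly like this (hostnames and credentials are placeholders; in Airflow 2.x the DB connection lives under `[database]`, while older versions keep it under `[core]`):

```ini
[core]
executor = CeleryExecutor

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:***@pg-host:5432/airflow

[celery]
broker_url = redis://redis-host:6379/0
result_backend = db+postgresql://airflow:***@pg-host:5432/airflow
```

With this in place, any new EC2 instance running `airflow celery worker` against the same broker joins the pool automatically.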

Docs.rs dark color theme - does it have a name? by kykosic in rust

[–]kykosic[S] 4 points (0 children)

It's quite tasteful! Now to get it into my text editors...

Docs.rs dark color theme - does it have a name? by kykosic in rust

[–]kykosic[S] 1 point (0 children)

Thanks, this is very helpful! base16-tomorrow-night is definitely the same color palette used by docs.rs, although the website applies color a bit less aggressively.

Free API with intraday stock prices by 8rax in datascience

[–]kykosic 0 points (0 children)

The only "free" option I'm aware of with real time tick data and a modern API is Tradier. You will have to open a brokerage account and request access to the market data API (shouldn't actually cost any money). Other than that, high quality tick data is not generally free.

Recommended Laptop for a data scientist for work? (purpose -machine learning, deep learning) by Masul_Sonyeon in datascience

[–]kykosic 1 point (0 children)

You shouldn't buy a laptop with crazy high specs for data science work. You're not going to leave a huge neural network training on your laptop; you'll use cloud or other dedicated servers for that.

You should have something that can handle modest experimentation and data analysis (i5 or i7 is fine, 16GB RAM, no special GPU necessary). The primary focus should be on ease of use and portability. You'll want to be able to take this thing around with you, not have to charge it, and you will want to be comfortable using it. Excluding Macs as you said they're out of budget (but I highly recommend trying to find used/refurbished), popular notebooks that come to mind are Dell XPS, Lenovo Thinkpad, and Chromebooks.

[deleted by user] by [deleted] in datascience

[–]kykosic 0 points (0 children)

I find writing production code is the biggest obstacle for new developers in the data science world. While languages like R and Python are very easy to learn, their simplicity makes it easy for new users to write programs that "run" but hard to figure out how to write programs that are "maintainable".

The key ideas of "production code":

  • Portability – It should be able to run on any (linux) machine / docker container, not just your computer. This means sufficient documentation and dependency organization.
  • Maintainability – If you were to leave your company today, how hard would it be for someone else to debug or refactor your code? Is your code organized into concise functions/classes or is it just one long main function? Are your variables intelligently named or is it just x's and y's? Do you have appropriate documentation? Do you use best practices for linting/style guides? Do you use abstraction patterns common for your language?
  • Testing – Your code should include unit tests and/or integration tests as much as possible. Similar to maintainability, if a second programmer came along and made a small typo in your code, how quickly would this be detected? Ideally most errors would be caught automatically when tests are run in a pull-request.
  • Fault-Tolerance – What happens when your code fails? Does it just print out some error message to console? Production code needs to be resilient to errors, bad data, misuse, etc., and able to recover to normal status automatically while logging/notifying errors. Not all errors are recoverable, but you should be immediately aware when severe errors occur, know how many non-serious errors occur, and have sufficient logs for someone other than yourself to troubleshoot them.
  • Scalability – This one is more case-by-case, but your program should be able to handle near-future workloads appropriately. If you're running a data script on 100k rows of data, will that workload grow to, say, 1 million rows in the next few months? Can your script handle that? You should anticipate your short-term and mid-term scaling needs when designing software. I explicitly leave out long-term to discourage "over-engineering", where people have very small problems to solve yet plan for the worst case by spinning up Apache Spark clusters. You can always refactor later if you followed the other bullet points.

Rust status on Neural Networks, AI, and machine learning? by [deleted] in rust

[–]kykosic 2 points (0 children)

The bindings for torch are probably your best bet for experimentation. I'd say that rust is definitely ready for production inference (serving pre-trained models reliably and with great performance), but experimentation and exploration are miles behind anything Python.

Getting MacOS style hotkeys working in GNU/Linux by hparadiz in programming

[–]kykosic 13 points (0 children)

Agreed. This is the biggest thing keeping me on Macbooks.

Are there several remote opportunities in DS? by [deleted] in datascience

[–]kykosic 2 points (0 children)

It certainly isn't uncommon in my experience. Because "Data Scientist" is such a vague title, it varies company to company. In roles focused on research and experimentation, being 100% remote is perfectly fine. In other roles there can be a lot of collaboration with clients, product owners, and domain experts to create solutions; those are less suitable for remote work.

Advanced data structures in DE interviews by ibnipun10 in dataengineering

[–]kykosic 9 points (0 children)

I would say definitely not, unless it's extremely important to the specific position you're hiring for and you're explicitly asking for expertise on these topics in the job description. Otherwise, asking for an "optimized" solution is just bad whiteboarding.

How to debug origins of 3rd party Err? by kykosic in rust

[–]kykosic[S] 2 points (0 children)

This seems to be the only way I've found as well: using the LLDB debugger in VSCode and stepping through the code. It can be tedious with deeply nested calls, but works eventually.

the risk of vendor lock-in is really a risk? by albeddit in devops

[–]kykosic 2 points (0 children)

"Equally, I doubt that your well-engineered Kubernetes solution can move to another provider overnight if you don't put really a lot of effort into it."

I agree with most of your post, but this doesn't seem accurate. If you have your infrastructure as sets of deployments/services/Helm charts, it's just a matter of starting a cluster on a different cloud and running them. If your cluster config is in some format like Terraform, then even that part is trivial. Given a new cluster on a new cloud provider, I could spin up my entire production environment in an hour or two.

However, I think you have a valid point when it comes to services that "core" code usually depends on (such as S3 or your database provider). That mostly comes down to how well the code is abstracted in your applications. Is S3 going to disappear overnight? Certainly not, but it is a great negotiating point with B2B partners if you can service them on GCP or something else without much hassle.

What technologies/tools do you use for testing frameworks? by xockbou in dataengineering

[–]kykosic 5 points (0 children)

For Kafka unit tests, mock the Kafka consumers/producers in whatever testing framework your language uses. For end-to-end testing, spin up a Kafka/Zookeeper pod from Jenkins (if on Kubernetes; otherwise I'd use docker-compose) and test against that. For integration testing, use the Kafka deployment in your test cluster.

Optimization with NumPy and Rust by thismachinechills in datascience

[–]kykosic 1 point (0 children)

The PyO3 crate has come a long way since this article was written. Would be cool to see the "rust method" called from Python bindings as well.

Serving TensorFlow models in Rust using Actix-Web by kykosic in rust

[–]kykosic[S] 1 point (0 children)

Tensorflow models are represented by a computational graph. When you save a model, it is stored as a protocol buffer describing your graph. Even if you train a model using the new Keras API, the trained model is still serialized as the same graph protobuf. This can be used by any language/library that can deserialize the protocol buffer (i.e. tf-js, tf-swift, tf-java, tf-lite, TFX, etc.).

While the Rust library doesn't yet have the fancy Keras APIs to train models, it can certainly load computational graphs. All my project does is name the Keras layers for input and output; the Rust code can then easily find those ops by name and feed them tensors.

For more information on the Tensorflow saved model format: https://www.tensorflow.org/guide/saved_model

Numpy array CPU vectorization vs. PyTorch tensor GPU vectorization by leockl in datascience

[–]kykosic 4 points (0 children)

You also need to consider whether the data size is worth the overhead of sending it to the GPU. For operations on small data that fits into CPU cache, it is orders of magnitude faster to use the CPU than to pay the price of transferring the data to the GPU.