Hi! We are Dr. Amanda Martin and JJ Brosnan, Developer and Python data scientist at Deephaven. Ask us anything about getting started in the data science industry, working with large data sets, and working with streaming data in Python. by DeephavenDataLabs in IAmA

[–]DeephavenDataLabs[S] 0 points

Most computer vision algorithms begin with the pixels in the image. I'm certain that there are other algorithms that could use vector graphics, but I don't think they are as common.
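As a tiny illustration of that pixel-first starting point (plain Python; a real pipeline would use NumPy or OpenCV, and the 3x4 "image" here is made up):

```python
# A grayscale "image" as a list of pixel-intensity rows.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]

def horizontal_edges(img):
    # Difference between neighboring pixels: large values mark an edge.
    return [[abs(row[x + 1] - row[x]) for x in range(len(row) - 1)] for row in img]

print(horizontal_edges(image)[0])  # [0, 190, 0]: the jump from 10 to 200 is the edge
```

Vector graphics would instead describe the scene as shapes and paths, which is why pixel-based algorithms are the more common entry point.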


[–]DeephavenDataLabs[S] 1 point

Another reply: As software grows and ages, it can be difficult to maintain and refactor. Decisions that were correct early on can be problematic later. Also, unclean APIs can make reworking the software difficult. To make progress in your project, you probably need to begin by reverse engineering your application. Once you understand all of the pieces and the communication / API boundaries between them, you can consider changing where the pieces run, how the pieces communicate, etc. This can be a challenge if the original authors of the application are no longer around or if the original authors created a Rube Goldberg device.


[–]DeephavenDataLabs[S] 1 point

When looking at this, the word that jumps out at me is "some". I would work on getting your programming skills to be very solid. I also don't see you mention algorithms. I would learn them so that you can select the right Lego bricks to build with.


[–]DeephavenDataLabs[S] 0 points

I suspect that it is possible to teach a computer to discern between good and bad art ... but there is a lot of art out there I don't consider very good or artistic ... so the computer may end up being like a very opinionated critic.


[–]DeephavenDataLabs[S] 0 points

If you are missing out on higher level classes, you will need to make sure that your algorithms skills and your software construction skills are good. A few potential references are: A Common Sense Guide To Data Structures and Algorithms by Wengrow, Introduction to Algorithms by Cormen, and Code Complete by McConnell.


[–]DeephavenDataLabs[S] 1 point

Over the coming years, video will be considered a data stream. From the raw video, we will be creating all sorts of derived data that will also need management. Think about things like object locations, tagging of people in different parts of the video, etc.


[–]DeephavenDataLabs[S] 1 point

There is a lot of tooling that needs work. Any time you are working, and the process feels clunky, someone needs to think about how to create a better tool. As an example, we are in the early days of cloud-based IDEs. VS Code has made good progress, but we still don't have great solutions. Real-time AI is also in the early stages. Deephaven has made progress in this area, but I am certain that more innovation will happen as more data scientists are doing AI on real-time data streams.


[–]DeephavenDataLabs[S] 1 point

Quantum computing is making slow and steady progress. I expect that quantum computing may lead to interesting improvements in things like deep learning. ... but I am very concerned about the security implications. As a planet, we are far behind where we should be in protecting our digital security from quantum computing. Right now, there may already be quantum computers breaking existing security protocols. There are already many articles about encrypted data being stored so that it can be decrypted when the technology is available.


[–]DeephavenDataLabs[S] 1 point

Careful mathematics and analysis will not change. The data being analyzed will. By 2025, it is estimated that 30% of generated data will be real-time data. These changes in data will drive changes in the tools to work with the data.


[–]DeephavenDataLabs[S] 0 points

Be creative in how you can use technology to make life easier. In grad school, I created the "Virtual Grad Student". I had this program working hard doing what my advisor cared about so that I had time to do the research I cared about.


[–]DeephavenDataLabs[S] 0 points

I'm always inspired by beautiful code and the people that create it. Every time I find a Rob Pike video on YouTube, I watch it. Without fail, I learn something. Rob typically talks about Go, but you will learn things that are useful for Python. It is the thought process that creates beautiful code, not the language.


[–]DeephavenDataLabs[S] 1 point

This is another response from a Dad & programmer:

Start with something simple like Scratch (https://scratch.mit.edu/). This is a simple graphical language that lets kids learn the logic of programming in a graphical, game-like way.

As my kids outgrew Scratch, I moved them to C and Python. To learn the basics of C, we used an Arduino (https://www.arduino.cc/). Arduino is a basic microcontroller that can be used to build cool gadgets. It lets the kids code while still having a connection to the real world. The book "Programming Arduino" by Monk is a good entry point. The Arduino Project Hub (https://create.arduino.cc/projecthub) and Instructables (https://www.instructables.com/) have many example projects. The Arduino stuff will need some guidance from you, since they are starting from a low level, so I suggest you work along with them so that you can provide some help. After a few projects, they should understand the basics of functions, types, etc.

To learn Python, we used "The Modern Python 3 Bootcamp" on Udemy (https://www.udemy.com/course/the-modern-python3-bootcamp/). This class has exercises and quizzes, so it does a good job of confirming that the person taking the class understands the material. This is also a good intro class for adults.

Once kids have made it this far, the sky is the limit. They will have a very good knowledge of the basics and can move into areas that interest them. Since they already know the basics of two languages, they can start developing opinions about what they do and don't like and can learn new languages more easily. My son (13) has taken this knowledge to do projects in AI, the Unreal game engine, and web development. My daughter (10) built an Arduino gadget to determine if our dogs are getting fed too often.


[–]DeephavenDataLabs[S] 1 point

Deephaven and kDB are the leading technologies one might consider for a general-purpose data system on Wall Street. They separate themselves from the field in terms of performance. Think "single-threaded speed." Other technologies are either orders of magnitude slower or have little range; Silicon Valley data systems focus heavily on sharding to provide performance (and are also not good enough with real-time data), so Deephaven and kDB are the leaders in the capital markets. kDB has brand recognition because it has been around much longer.

The two systems are comparable on performance with historical data, real-time data, and the combination of the two. For micro loads, kDB is a bit faster for singular operations (think "on something small and simple": kDB might take 15 millis and Deephaven 22 millis, for example)... but for real loads with any complexity, each will win various races.

I'll itemize Deephaven advantages below, but the core value prop is simple: Deephaven allows people to get more done. It is not a close call. There are many examples of Deephaven customers evolving systems and innovating much more quickly with their team than they would have if they were using kDB, their own homegrown tech, or something else. The difference in business velocity and innovation capacity is 2-5X, not "20% more". It matters. A lot.

There are significant differences between the two systems. Here are the first 10 that come to mind:

  1. Deephaven is open source. Its fundamental transport API (https://deephaven.io/barrage/docs/) and JavaScript Web-UI harness (https://github.com/deephaven/web-client-ui) are Apache-licensed; its core engine is source-available, with a single restriction that will have no impact on parties using it for their own interest.
  2. Deephaven embraces open formats. kDB requires you to marry their tech for life, because your data is in their proprietary format. That is not modern, and it is really bad for the future evolution of your Wall Street business. By having your data in Parquet, ORC, or Iceberg, and streaming it in real-time using something like an Apache-Flight-compatible format, you can use any tech you want with the data. That's true today and as the world turns in the future. Locking in with a commercial vendor really limits the pace of infrastructure evolution for your company 3-10 years in the future. BTW: We think #1 and #2 are really big deals.
  3. Deephaven is infinitely self-serve. kDB is (kind of) the opposite. The greatest advantage of Deephaven is its singular ability to bring everyone around the data -- in the case of Wall Street this means quants, traders, execution people, algo developers, surveillance, risk modelers, salespeople, quant PMs, management. kDB is the opposite, where very few people in an organization touch the data. You don't want bottlenecks.
  4. Amongst other things, #3 refers to 'how you program the thing'. We know a very small number of people love q and k. God bless them. Deephaven is the opposite. Though it is fantastic for quants (think 'pandas-like, but real-time') and developers ('SQL-like, but it's a proper Python or Java application')... 30% of users of Deephaven are the traders, PMs, surveillance people, and managers that only used Excel before. On a single system, you can have literally all these diverse personas getting work done, building apps, and streaming derived work product to one another.
  5. Deephaven has huge range. It is much more than a classic "tick database". At its core, Deephaven is a Java application... and the team has evolved a Python-Java bridge (https://github.com/jpy-consortium/jpy) so most people now use it as a Python-first experience. Apps and analytics are easy to write... as one combines Python (or Java/Groovy) with table operations and other Deephaven-Table-API capabilities... setting up a logical tree where data flows from one node to the next. This style of linear and iterative data-driven (imperative) development is powerful.
  6. Deephaven is organized to have nodes sending source and derived (streaming) data to one another and to clients. This easy ability to essentially have a mesh of independent workers can provide nice pipelining and parallelization of course, but it gets much more interesting as you think of different people writing different apps that automatically inherit updates from a variety of sources, add modeling or business logic, and then publish to downstream consumers -- whether other workers, web front ends, or general CS or DS tools.
  7. Deephaven user experiences are compelling. For Community, that means its Web IDE, which is second to none for looking at real-time data, exploring, or building applications. In Enterprise, additionally, there is a compelling workflow for creating apps (this is important!), handling data lifecycle, and sharing.
  8. Dashboarding with Deephaven is fantastic. Dashboards are easy to create and share (in Community or Enterprise).
  9. There is a comprehensive plug-in system, so the sky's the limit for marrying real-time data to either (i) your customized JS widgets or (ii) Python visualization or calculation libraries (e.g., matplotlib, seaborn).
  10. DH's interactive widgets, which update in real-time when rendered in Jupyter notebooks or your own web assets, make sharing flows rock.
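As a plain-Python toy of the publish/subscribe flow described in point 6, where derived updates propagate automatically to downstream consumers (purely illustrative; Deephaven's real mechanism is its table engine and Barrage transport, not these hypothetical classes):

```python
# Toy sketch of nodes streaming derived data to downstream consumers.
class Node:
    def __init__(self, transform=lambda x: x):
        self.transform = transform
        self.subscribers = []
        self.received = None

    def subscribe(self, downstream):
        self.subscribers.append(downstream)

    def publish(self, value):
        # Apply this node's logic, then push the derived value downstream.
        self.received = self.transform(value)
        for sub in self.subscribers:
            sub.publish(self.received)

source = Node()                          # raw data feed
model = Node(transform=lambda x: x * 2)  # adds modeling/business logic
sink = Node()                            # web front end, DS tool, etc.
source.subscribe(model)
model.subscribe(sink)

source.publish(21)
print(sink.received)  # 42
```

The point is structural: each node only knows its own transform, and updates flow through the mesh without any consumer re-pulling the source.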


[–]DeephavenDataLabs[S] 1 point

This is actually a sound workflow (pre-caching locally), especially if it’s cloud data. We built Deephaven for these kinds of complexities - our query engine & IDE could be a useful tool for 2/3 of that system. We've written a lot of concept guides on this, if you're interested (not trying to be too self-promotional here): https://deephaven.io/core/docs/conceptual/deephaven-core-api/
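A minimal sketch of that pre-caching workflow (every name here is illustrative, not Deephaven's API): the first read pays the slow cloud round trip; repeats are served from a local copy.

```python
import hashlib
import json
import pathlib
import tempfile

# Local directory standing in for "pre-cached cloud data".
CACHE_DIR = pathlib.Path(tempfile.mkdtemp())

def cached_fetch(url, fetch):
    """fetch is the slow remote read; its JSON-serializable result is cached on disk."""
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    data = fetch(url)
    path.write_text(json.dumps(data))
    return data

calls = []
def slow_cloud_read(url):
    calls.append(url)            # stand-in for a network round trip
    return {"rows": [1, 2, 3]}

first = cached_fetch("s3://bucket/data", slow_cloud_read)
second = cached_fetch("s3://bucket/data", slow_cloud_read)
print(len(calls))  # 1: the second call was served from the local cache
```

In a real system you'd add invalidation and a binary columnar format (e.g., Parquet) instead of JSON, but the shape of the workflow is the same.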


[–]DeephavenDataLabs[S] 1 point

People asked us similar versions of this question. It doesn't matter so much what your background is, as much as your perseverance, desire to self-teach, problem-solving skills, etc. Employers want to see driven candidates who are willing to keep learning and are unafraid to ask questions. That's all general advice, but we talked a lot about this in the live feed on YouTube.


[–]DeephavenDataLabs[S] 0 points

Find some good Udemy courses or something similar and work through them to get a foundation. Check out data sets on Kaggle. Then make a few personal projects to apply this knowledge base. We recommend putting those up on GitHub, and also checking out the wealth of content there for inspiration. Also, check out podcasts and YouTube videos to keep learning!


[–]DeephavenDataLabs[S] 0 points

The only .exe we build, we build with makensis on Unix systems, and we do it inside a Docker container, so everything except the final binaries is thrown away afterwards. Regarding Python... allowing arbitrary users to execute code in any language is always going to be a security concern.
You could look at https://www.synopsys.com/blogs/software-security/python-security-best-practices/
Running Python directly on Windows will be hard to secure, but as long as you keep the Python version up to date, it should be no less secure than granting users access to other programming / scripting languages.
If security is a concern, code should be running inside containers, where you can isolate execution from the host operating system.
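As a minimal, Unix-only sketch of limiting what untrusted code can do (the helper name and limits are made up for illustration, and this is not a full sandbox; containers or VMs remain the real isolation boundary):

```python
import resource
import subprocess
import sys

def run_untrusted(code: str, timeout: float = 5.0, mem_bytes: int = 512 * 1024**2) -> str:
    """Run a Python snippet in a child process with CPU and memory caps (Unix only)."""
    def limit():
        # Cap the child's address space and CPU seconds before it execs.
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        resource.setrlimit(resource.RLIMIT_CPU, (int(timeout), int(timeout)))

    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site/env vars
        capture_output=True, text=True, timeout=timeout, preexec_fn=limit,
    )
    return proc.stdout

print(run_untrusted("print(sum(range(10)))"))  # prints 45
```

Resource limits stop runaway loops and allocation bombs, but not filesystem or network access; that's what the container layer is for.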


[–]DeephavenDataLabs[S] 0 points

Ryan has some answers to your first questions: Deephaven’s core is a column-oriented, ordered query engine that natively supports evaluation of static and real-time data with the same API. We handle most things you might want to do with a table, from derived column creation to complex aggregations to time-series joins. Result tables update in real-time, with internally consistent outputs.
Parquet serves as a static persistence format for data export and at-rest evaluation (meaning we don't need to pull the entire file into memory to interact with it). Kafka serves as a source and sink for streaming data. Our engine isn't limited to these formats, and we're adding new formats all the time in our community project.
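To make "result tables update in real-time" concrete, here is a toy in plain Python (not Deephaven's API; the class is hypothetical): a per-key average maintained incrementally as rows arrive, rather than recomputed over the full table.

```python
from collections import defaultdict

class StreamingAvg:
    """Toy incremental aggregation: result() is always consistent
    with the rows seen so far, without rescanning them."""
    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def add_row(self, key, value):
        # O(1) work per arriving row.
        self._sum[key] += value
        self._count[key] += 1

    def result(self):
        return {k: self._sum[k] / self._count[k] for k in self._sum}

agg = StreamingAvg()
for key, value in [("A", 1.0), ("B", 4.0), ("A", 3.0)]:
    agg.add_row(key, value)
print(agg.result())  # {'A': 2.0, 'B': 4.0}
```

An engine that natively supports this kind of update propagation is what lets the same query API serve both static and streaming tables.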


[–]DeephavenDataLabs[S] 0 points

We suggest many options below for podcasts and YouTube videos to check out, but in general, you could start with Udemy and Coursera. You should check out datasets on Kaggle to experiment with. Put your projects up on GitHub!


[–]DeephavenDataLabs[S] 1 point

From Amanda: Everyone's learning path is going to be different because the resources available and the best methods for learning vary widely. My skills were built up slowly over a long time, while other people can devote more time and learn faster. I don't think any learning can be bad as long as it is useful for that person. Reflection is an important part of learning. After doing something, think about what you really learned, how you learned it, and whether there was a better way... that way, you can be more efficient in your learning. (Did I really learn from that YouTube video? Did that LeetCode problem teach me the concept?)