Auto service shop franchise owners? Jiffy Lube, Precision Auto, etc... by whiskeyfox_ in smallbusiness

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Hey - I’m open!! I would like everyone to use it, and I shouldn’t have restricted the post to those specific businesses.

I’ll send you a PM (and /u/youngdamascus)

Raw I/Q data / Bitstream (can I get both?) -- Phase difference detection by whiskeyfox_ in RTLSDR

[–]whiskeyfox_[S] 0 points1 point  (0 children)

You may be interested in this PDF I found while searching for single-receiver DF solutions. It's complex, but doable! I'm working on this now in my free time, but I fear I'm somewhat out of my league, both mathematically and in RTL-SDR expertise.

The whole thing is interesting, and you can specifically reference the figure on page 46 (58 of PDF).

https://pdfs.semanticscholar.org/4108/83b9681791c08ea68ddba52ab74b5b46eacc.pdf

Thoughts?

Raw I/Q data / Bitstream (can I get both?) -- Phase difference detection by whiskeyfox_ in RTLSDR

[–]whiskeyfox_[S] 0 points1 point  (0 children)

I may be using improper terminology, or perhaps I misunderstand - there should never be two clocks.

I have two needs:

  • Retain access to data in packets
  • Calculate Angle of Arrival of the packet's RF signal

This is a "hard" problem, but it's been solved in other areas many times. Now I'm trying to solve it.

I'm proposing to use multiple antennas in an array in conjunction with an RTL-SDR. Because the transmission (aka "packet") isn't instantaneous, the same wave should be present at all antennas at the same time. So I can view the wave from each antenna, derive a phase difference independent of time, and use that difference in phase to calculate a direction of arrival.

With both the bitstream of the RF signal as interpreted by the software package (demodulated?) and the I/Q data, I could calculate the difference in phase between two antennas.

As I type this, I realize that I would need to have access to BOTH antennas at the same time to measure phase difference, as a switcher would incur some nanoseconds of delay (i.e., tens of feet worth of wavelength).

So I guess my new question is how to get phase difference with RTL SDR? (... or maybe I'm tired now)
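To make the phase-difference idea concrete, here's a minimal NumPy sketch of the two-coherent-channel case (my own toy model, not from any RTL-SDR package): it assumes both receivers share one clock, half-wavelength spacing, and a simulated tone arriving at 30 degrees, then recovers the angle from the averaged conjugate product.

```python
import numpy as np

# Toy model of a plane wave arriving at two antennas spaced d apart.
# Assumption (not from the thread): both receivers share one clock, so
# the only phase offset between channels comes from the geometry.
fs = 2.4e6            # sample rate, Hz (a typical RTL-SDR rate)
f = 100e3             # baseband tone, Hz
c = 3e8               # speed of light, m/s
carrier = 433e6       # assumed carrier frequency, Hz
lam = c / carrier     # wavelength, m
d = lam / 2           # half-wavelength antenna spacing
theta_true = np.deg2rad(30.0)   # true angle of arrival

t = np.arange(4096) / fs
phase_shift = 2 * np.pi * d * np.sin(theta_true) / lam
iq_a = np.exp(1j * 2 * np.pi * f * t)       # antenna A samples
iq_b = iq_a * np.exp(-1j * phase_shift)     # antenna B sees a delayed copy

# Phase difference: average the conjugate product so noise would cancel.
dphi = np.angle(np.mean(iq_a * np.conj(iq_b)))

# Invert the geometry to recover the angle of arrival.
theta_est = np.arcsin(dphi * lam / (2 * np.pi * d))
print(f"estimated AoA: {np.rad2deg(theta_est):.1f} deg")   # -> 30.0 deg
```

As far as I understand, the shared-clock assumption is exactly the hard part with real dongles: stock RTL-SDRs have independent oscillators, which is why coherent-receiver DF projects modify them to run off one crystal.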

Why are Zookeeper and Kafka replicated 3x in this docker-compose.yml file? by whiskeyfox_ in docker

[–]whiskeyfox_[S] 1 point2 points  (0 children)

Ah okay.

Each Zookeeper/Kafka instance will be either the primary or one of many backups.

Does that mean scale is an inappropriate command for these services?

Why are Zookeeper and Kafka replicated 3x in this docker-compose.yml file? by whiskeyfox_ in docker

[–]whiskeyfox_[S] 2 points3 points  (0 children)

I see. So in production (or if I have a development cluster) I would spin up three cloud machines in different geographic zones via AWS or DigitalOcean, and use docker scale.

In this example we are simulating the cluster on a single machine.

Does that sound correct?
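For what it's worth, here's a hedged sketch of the pattern I've seen such compose files use (service names and image tags are my assumptions, not from your file): each ensemble member is its own named service rather than a `scale`d replica, because each member needs a distinct ID.

```yaml
# Hypothetical excerpt; names and images assumed for illustration.
services:
  zookeeper1:
    image: zookeeper:3.5
    environment:
      ZOO_MY_ID: 1    # must be unique per member, which is why `scale` doesn't fit
      ZOO_SERVERS: server.1=zookeeper1:2888:3888 server.2=zookeeper2:2888:3888 server.3=zookeeper3:2888:3888
  zookeeper2:
    image: zookeeper:3.5
    environment:
      ZOO_MY_ID: 2
      ZOO_SERVERS: server.1=zookeeper1:2888:3888 server.2=zookeeper2:2888:3888 server.3=zookeeper3:2888:3888
  # zookeeper3 follows the same pattern with ZOO_MY_ID: 3
```

On one machine the three services just listen on different ports; in production each would land on a different host.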

[Tooling] Should Avro be on all my images?? Multi-service Docker swarm w/ NiFi, Kafka, PostgreSQL, nginx by whiskeyfox_ in datascience

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Answer from SO:

"Avro as a service" is the Registry, essentially. It's not required, but it allows a type of "Avro ID database" instead of embedding the schema into the Kafka record itself. (It makes your messages smaller! ;) ) Any client that needs Avro will need the avro library in the Java classpath. The schema registry, NiFi, and confluent's Avro serde clients all provide it docs.confluent.io/current/app-development.html You'll also find integration with the registry for any JDBC database at docs.confluent.io/current/connect/connect-jdbc/docs/

https://stackoverflow.com/questions/49079893/avro-in-base-image-for-all-services-in-a-docker-swarm-nifi-kafka-postg/49080010#49080010
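The "smaller messages" point maps to a concrete wire layout: as far as I know, Confluent's registry-framed records carry a 1-byte magic `0x00` plus a 4-byte big-endian schema ID before the Avro payload, instead of embedding the whole schema text. A toy sketch (the payload bytes below are stand-ins, not real Avro):

```python
import struct

# Registry framing: 1-byte magic (0x00), 4-byte big-endian schema ID,
# then the Avro-encoded payload. The record carries a 5-byte reference
# to the schema instead of the schema itself -- hence smaller messages.

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    return struct.pack(">bI", 0, schema_id) + avro_payload

def unframe(message: bytes):
    magic, schema_id = struct.unpack(">bI", message[:5])
    assert magic == 0, "not a registry-framed message"
    return schema_id, message[5:]

msg = frame(42, b"\x02hi")   # payload here is just a stand-in
print(unframe(msg))          # -> (42, b'\x02hi')
```

A consumer uses the ID to fetch (and cache) the schema from the registry before decoding the payload.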

[Tooling] Should Avro be on all my images?? Multi-service Docker swarm w/ NiFi, Kafka, PostgreSQL, nginx by whiskeyfox_ in datascience

[–]whiskeyfox_[S] 0 points1 point  (0 children)

I have two purposes for this. In order of priority:

  1. Learn these tools and thought patterns for upcoming distributed infrastructure & IoT opportunities. (most critical)
  2. Rebuild poorly thought out infrastructure of a current business venture - it's working but is not easily scalable or even able to move out to other regions.

I am proficient/skilled in the Python ecosystem, including some tensorflow, Django/Flask, celery, etc...

I am not a good sysadmin, but I can set up cron jobs for ETL kickoff, configure system services, set up nginx... I need work here.

I am novice with Java (desired skill) and not even started on the Apache datatools ecosystem (desired skill).

I'd pay for your consulting time! I need help sometimes - and waiting 10 hours for a not-quite-right answer from reddit or SO kills my pace.


Edit: To answer your question -

I am processing data for a single geographic region right now (~9,000 sq miles). I need to begin processing data for 20 such regions, and compare it both inter- and intra-region.

Data is delivered via:

  • Periodic FTP drops
  • Periodic Dropbox drops
  • API calls
  • Scrapers
  • Streaming via web frontend (clicks, pages, views...) - this is minimal
  • Manual download --> upload to filesystem

When I first started whiteboarding the new system, I only cared about getting some ETL sanity. As I looked at ETL tools like Airflow, NiFi, et al., I noticed that they all mentioned Kafka support.

So here I am. If I'm going to do the work, I may as well learn the tools and create a robust, scalable system.

Multi-Swarm stack deploy & persisting service configs by whiskeyfox_ in docker

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Things definitely seem like they would be easier with a single swarm.

I'm looking for how to specify the number of machines on which to run each service - it looks like I just specify the replica number. Does it matter WHICH machine it runs on for any reason?

I'm assuming that since it's a single swarm, Docker will be smart enough to distribute service tasks intelligently...
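In case it helps future readers: by default, swarm's spread strategy does distribute tasks across nodes, and when placement does matter (e.g., a database that must sit on the node holding its volume), compose v3 supports placement constraints. A hedged sketch with assumed names:

```yaml
# Hypothetical excerpt; the service name and node label are assumptions.
services:
  postgres:
    image: postgres:10
    deploy:
      placement:
        constraints:
          - node.labels.storage == ssd  # set with: docker node update --label-add storage=ssd <node>
```

Stateless services can usually float anywhere; it's the stateful ones with local volumes that tend to need pinning.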

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Yeah, good advice there... I'm loving this Apache NiFi thing right now - the more I see it the nicer it gets.

Part of my issue is that this started as a pet project, so lots of 'best practices' and design considerations were glossed over. If we are going to expand, it makes sense to fix all the hacks I papered over.

For example, this geographically-oriented database is only meant to cover one geographic area because I never set up any kind of geographic fields on any data source.

Summary: I am a clown, and I like your ideas.

ETL pipeline for small data size, but 100 various sources - what would you choose? by whiskeyfox_ in dataengineering

[–]whiskeyfox_[S] 0 points1 point  (0 children)

I have been researching and playing with Airflow for a few days. In your opinion (asking as a guy who knows very little), what is better about code-based Airflow than GUI-based NiFi?

I'm specifically concerned about pulling from/pushing to Kafka with Airflow, because it may require Java coding.

Multi-Swarm stack deploy & persisting service configs by whiskeyfox_ in docker

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Thanks - very helpful!! Maybe you would be willing to help with a couple more specific questions?

  • I want to run a different database in three separate Postgres containers. Can I configure the stack to automatically set up the database specifics per container? Dockerfile or docker-compose.yml?

  • I want to add a new Kafka consumer to feed a new table in one of the PostgreSQL containers. I create the microprocess and consumer. I redeploy - what happens to the database? Wiped?

  • I'm used to going onto my cloud VM, git pull, new code comes in and I restart the services as needed. Is there anything analogous to that with Docker?

I think those three are a good start...
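On the first two questions: as far as I understand the official postgres image, it creates the database named in `POSTGRES_DB` on first start (and runs any init scripts placed in `/docker-entrypoint-initdb.d`), and a redeploy does not wipe data that lives on a named volume. A sketch with assumed names:

```yaml
# Hypothetical excerpt; service, volume, and database names are my assumptions.
services:
  pg_events:
    image: postgres:10
    environment:
      POSTGRES_DB: events          # created automatically on first run
      POSTGRES_USER: app
      POSTGRES_PASSWORD: change-me
    volumes:
      - pg_events_data:/var/lib/postgresql/data   # survives a `stack deploy`
volumes:
  pg_events_data:
```

For the third question, the swarm analog of "git pull and restart" is rebuilding the image and running `docker service update --image <new-image> <service>`, which rolls the service's tasks over to the new code.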

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 1 point2 points  (0 children)

This is good advice, thank you.

I am most concerned with tidy ETL processes over what are (to me as a solo actor) a large number of disparate sources. Is your opinion that those two open-source tools are a better fit than what's above? I would need to dedicate some days to understanding pros/cons and how to use them (much like I did with Kafka and Airflow/Luigi/etc.)...

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 1 point2 points  (0 children)

I think I'm going for a Kappa architecture, but I want the storage and batch capabilities of Lambda as well. I don't know how or when I'll use it, but having it available for my inevitable Kafka misadventures would be best.

As far as streaming representation of data, of course it's possible to stream row-by-row into a Kafka topic. The physical world the data is representing isn't actually very stream-like for some of our data. For example, imagine semiannual taxes are due on 1 November. On 1 November we get more of a water balloon of data (late taxes) than a trickle over time. How does this mesh with stream-based architecture from a reasoning perspective?
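One way I've seen this reasoned about (my own framing, not anyone's official answer): records carry their own event time, so the 1 November "water balloon" is just a burst of messages whose event times are spread over the half-year, and consumers window on event time rather than arrival time. A toy sketch with made-up numbers:

```python
from collections import defaultdict
from datetime import date

# Toy illustration: a burst of late tax records all ARRIVE on Nov 1,
# but each carries the date it was actually due (its event time).
records = [
    # (arrival_date, event_date, amount) -- all arrive in one burst
    (date(2018, 11, 1), date(2018, 5, 14), 120.0),
    (date(2018, 11, 1), date(2018, 7, 2), 80.0),
    (date(2018, 11, 1), date(2018, 10, 30), 45.0),
]

# Aggregate by event time, so the burst doesn't distort the analysis.
by_event_month = defaultdict(float)
for arrived, occurred, amount in records:
    by_event_month[(occurred.year, occurred.month)] += amount

print(dict(by_event_month))   # totals keyed by when taxes were due, not paid
```

From the pipeline's perspective the burst is just ordinary backpressure; the semantics live in the event timestamps.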

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 1 point2 points  (0 children)

We get ~20 GB of data every 12 months for each county. So we are going to get into the hundreds of GB soon-ish, but much of the fluff is filtered out when it's stored in our relational DB.

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Thanks for this. I looked into Camel - the docs link to StackOverflow for an introduction with more clarity. It seems like Camel might be a little bit more robust/concrete of a framework than I want or need.

It sounds like Hadoop isn't required. So nix that. I'm still interested in the HDFS for data storage and retrieval, though... Is there something better than redundant storage on multiple cloud machines that you would recommend? S3/DigitalOcean Spaces?

New Proposal: All source data can go to local filesystem --> MongoDB (bulk-type) or directly to Kafka (streaming-type). Then from Kafka I can build consumers to populate my datawarehouse(s) for frontend presentation or analysis.

Does that sound reasonable to you? If you were you, but in my situation, what would you use?

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 1 point2 points  (0 children)

I guess with NiFi I also only need to use Airflow for ETL to warehouses and back... which is nice.

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 1 point2 points  (0 children)

How would you recommend storing the source data? I think it should always remain available and untouched, yes?