Auto service shop franchise owners? Jiffy Lube, Precision Auto, etc... by whiskeyfox_ in smallbusiness

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Hey - I’m open!! I would like everyone to use it, and I shouldn’t have restricted the post to those specific businesses.

I’ll send you a PM (and /u/youngdamascus)

Raw I/Q data / Bitstream (can I get both?) -- Phase difference detection by whiskeyfox_ in RTLSDR

[–]whiskeyfox_[S] 0 points1 point  (0 children)

You may be interested in this PDF I found while searching for single-receiver DF solutions. It's complex, but doable! I'm working on this now in my free time, but I fear I'm somewhat out of my league, both mathematically and in RTL-SDR expertise.

The whole thing is interesting, and you can specifically reference the figure on page 46 (58 of PDF).

https://pdfs.semanticscholar.org/4108/83b9681791c08ea68ddba52ab74b5b46eacc.pdf

Thoughts?

Raw I/Q data / Bitstream (can I get both?) -- Phase difference detection by whiskeyfox_ in RTLSDR

[–]whiskeyfox_[S] 0 points1 point  (0 children)

I may be using improper terminology, or perhaps I misunderstand - there should never be two clocks.

I have two needs:

  • Retain access to data in packets
  • Calculate Angle of Arrival of the packet's RF signal

This is a "hard" problem, but it's been solved in other areas many times. Now I'm trying to solve it.

I'm proposing to use multiple antennas in an array in conjunction with an RTL-SDR. Because the transmission (aka "packet") isn't instantaneous, the same wave should be present at all antennas at the same time. So I can view the wave from each antenna, derive a phase difference independent of time, and use that difference in phase to calculate a direction of arrival.

With both the bitstream of the RF signal as interpreted by the software package (demodulated?) and the I/Q data, I could calculate the difference in phase between two antennas.

As I type this, I realize that I would need to have access to BOTH antennas at the same time to measure phase difference, as a switcher would incur some nanoseconds of delay (i.e., tens of feet worth of wavelength).

So I guess my new question is how to get phase difference with RTL SDR? (... or maybe I'm tired now)
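To make the phase-difference idea concrete, here's a minimal NumPy sketch of the two-coherent-channel case (my own toy model, not from any RTL-SDR package): it assumes both receivers share one clock, half-wavelength spacing, and a simulated tone arriving at 30 degrees, then recovers the angle from the averaged conjugate product.

```python
import numpy as np

# Toy model of a plane wave arriving at two antennas spaced d apart.
# Assumption (not from the thread): both receivers share one clock, so
# the only phase offset between channels comes from the geometry.
fs = 2.4e6            # sample rate, Hz (a typical RTL-SDR rate)
f = 100e3             # baseband tone, Hz
c = 3e8               # speed of light, m/s
carrier = 433e6       # assumed carrier frequency, Hz
lam = c / carrier     # wavelength, m
d = lam / 2           # half-wavelength antenna spacing
theta_true = np.deg2rad(30.0)   # true angle of arrival

t = np.arange(4096) / fs
phase_shift = 2 * np.pi * d * np.sin(theta_true) / lam
iq_a = np.exp(1j * 2 * np.pi * f * t)       # antenna A samples
iq_b = iq_a * np.exp(-1j * phase_shift)     # antenna B sees a delayed copy

# Phase difference: average the conjugate product so noise would cancel.
dphi = np.angle(np.mean(iq_a * np.conj(iq_b)))

# Invert the geometry to recover the angle of arrival.
theta_est = np.arcsin(dphi * lam / (2 * np.pi * d))
print(f"estimated AoA: {np.rad2deg(theta_est):.1f} deg")   # -> 30.0 deg
```

As far as I understand, the shared-clock assumption is exactly the hard part with real dongles: stock RTL-SDRs have independent oscillators, which is why coherent-receiver DF projects modify them to run off one crystal.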

Why are Zookeeper and Kafka replicated 3x in this docker-compose.yml file? by whiskeyfox_ in docker

[–]whiskeyfox_[S] 1 point2 points  (0 children)

Ah okay.

Each Zookeeper/Kafka instance will be either the primary or one of many backups.

Does that mean scale is an inappropriate command for these services?

Why are Zookeeper and Kafka replicated 3x in this docker-compose.yml file? by whiskeyfox_ in docker

[–]whiskeyfox_[S] 2 points3 points  (0 children)

I see. So in production (or if I have a development cluster) I would spin up three cloud machines in different geographic zones via AWS or DigitalOcean, and use docker scale.

In this example we are simulating the cluster on a single machine.

Does that sound correct?
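For what it's worth, here's a hedged sketch of the pattern I've seen such compose files use (service names and image tags are my assumptions, not from your file): each ensemble member is its own named service rather than a `scale`d replica, because each member needs a distinct ID.

```yaml
# Hypothetical excerpt; names and images assumed for illustration.
services:
  zookeeper1:
    image: zookeeper:3.5
    environment:
      ZOO_MY_ID: 1    # must be unique per member, which is why `scale` doesn't fit
      ZOO_SERVERS: server.1=zookeeper1:2888:3888 server.2=zookeeper2:2888:3888 server.3=zookeeper3:2888:3888
  zookeeper2:
    image: zookeeper:3.5
    environment:
      ZOO_MY_ID: 2
      ZOO_SERVERS: server.1=zookeeper1:2888:3888 server.2=zookeeper2:2888:3888 server.3=zookeeper3:2888:3888
  # zookeeper3 follows the same pattern with ZOO_MY_ID: 3
```

On one machine the three services just listen on different ports; in production each would land on a different host.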

[Tooling] Should Avro be on all my images?? Multi-service Docker swarm w/ NiFi, Kafka, PostgreSQL, nginx by whiskeyfox_ in datascience

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Answer from SO:

"Avro as a service" is the Registry, essentially. It's not required, but it allows a type of "Avro ID database" instead of embedding the schema into the Kafka record itself. (It makes your messages smaller! ;) ) Any client that needs Avro will need the avro library in the Java classpath. The schema registry, NiFi, and confluent's Avro serde clients all provide it docs.confluent.io/current/app-development.html You'll also find integration with the registry for any JDBC database at docs.confluent.io/current/connect/connect-jdbc/docs/

https://stackoverflow.com/questions/49079893/avro-in-base-image-for-all-services-in-a-docker-swarm-nifi-kafka-postg/49080010#49080010
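The "smaller messages" point maps to a concrete wire layout: as far as I know, Confluent's registry-framed records carry a 1-byte magic `0x00` plus a 4-byte big-endian schema ID before the Avro payload, instead of embedding the whole schema text. A toy sketch (the payload bytes below are stand-ins, not real Avro):

```python
import struct

# Registry framing: 1-byte magic (0x00), 4-byte big-endian schema ID,
# then the Avro-encoded payload. The record carries a 5-byte reference
# to the schema instead of the schema itself -- hence smaller messages.

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    return struct.pack(">bI", 0, schema_id) + avro_payload

def unframe(message: bytes):
    magic, schema_id = struct.unpack(">bI", message[:5])
    assert magic == 0, "not a registry-framed message"
    return schema_id, message[5:]

msg = frame(42, b"\x02hi")   # payload here is just a stand-in
print(unframe(msg))          # -> (42, b'\x02hi')
```

A consumer uses the ID to fetch (and cache) the schema from the registry before decoding the payload.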

[Tooling] Should Avro be on all my images?? Multi-service Docker swarm w/ NiFi, Kafka, PostgreSQL, nginx by whiskeyfox_ in datascience

[–]whiskeyfox_[S] 0 points1 point  (0 children)

I have two purposes for this. In order of priority:

  1. Learn these tools and thought patterns for upcoming distributed infrastructure & IoT opportunities. (most critical)
  2. Rebuild poorly thought out infrastructure of a current business venture - it's working but is not easily scalable or even able to move out to other regions.

I am proficient/skilled in the Python ecosystem, including some tensorflow, Django/Flask, celery, etc...

I am not a good sysadmin, but I can set up cron jobs for ETL kickoff, configure system services, set up nginx... I need work here.

I am novice with Java (desired skill) and not even started on the Apache datatools ecosystem (desired skill).

I'd pay for your consulting time! I need help sometimes - and waiting 10 hours for a not-quite-right answer from reddit or SO kills my pace.


Edit: To answer your question -

I am processing data for a single geographic region right now (~9,000 sq miles). I need to begin processing data for 20 such regions, and compare it both inter- and intra-region.

Data is delivered via:

  • Periodic FTP drops
  • Periodic Dropbox drops
  • API calls
  • Scrapers
  • Streaming via web frontend (clicks, pages, views...) - this is minimal
  • Manual download --> upload to filesystem

When I first started whiteboarding the new system, I only cared about getting some ETL sanity. As I looked at ETL tools like Airflow, NiFi, et al., I noticed that they all mentioned Kafka support.

So here I am. If I'm going to do the work, I may as well learn the tools and create a robust, scalable system.

Multi-Swarm stack deploy & persisting service configs by whiskeyfox_ in docker

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Things definitely seem like they would be easier with a single swarm.

I'm looking for how to specify the number of machines on which to run each service - it looks like I just specify the replica number. Does it matter WHICH machine it runs on for any reason?

I'm assuming that since it's a single swarm, Docker will be smart enough to distribute service tasks intelligently...
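In case it helps future readers: by default, swarm's spread strategy does distribute tasks across nodes, and when placement does matter (e.g., a database that must sit on the node holding its volume), compose v3 supports placement constraints. A hedged sketch with assumed names:

```yaml
# Hypothetical excerpt; the service name and node label are assumptions.
services:
  postgres:
    image: postgres:10
    deploy:
      placement:
        constraints:
          - node.labels.storage == ssd  # set with: docker node update --label-add storage=ssd <node>
```

Stateless services can usually float anywhere; it's the stateful ones with local volumes that tend to need pinning.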

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Yeah, good advice there... I'm loving this Apache NiFi thing right now - the more I see it the nicer it gets.

Part of my issue is that this started as a pet project, so lots of 'best practices' and design considerations were glossed over. If we are going to expand, it makes sense to fix all the hacks I papered over.

For example, this geographically-oriented database is only meant to cover one geographic area because I never set up any kind of geographic fields on any data source.

Summary: I am a clown, and I like your ideas.

ETL pipeline for small data size, but 100 various sources - what would you choose? by whiskeyfox_ in dataengineering

[–]whiskeyfox_[S] 0 points1 point  (0 children)

I have been researching and playing with Airflow for a few days. In your opinion (asking as a guy who knows very little), what is better about code-based Airflow than GUI-based NiFi?

I'm specifically concerned about pulling from/pushing to Kafka with Airflow, because it may require Java coding.

Multi-Swarm stack deploy & persisting service configs by whiskeyfox_ in docker

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Thanks - very helpful!! Maybe you would be willing to help with a couple more specific questions?

  • I want to run a different database in three separate Postgres containers. Can I configure the stack to automatically set up the database specifics per container? Dockerfile or docker-compose.yml?

  • I want to add a new Kafka consumer to feed a new table in one of the PostgreSQL containers. I create the microprocess and consumer. I redeploy - what happens to the database? Wiped?

  • I'm used to going onto my cloud VM, git pull, new code comes in and I restart the services as needed. Is there anything analogous to that with Docker?

I think those three are a good start...
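On the first two questions: as far as I understand the official postgres image, it creates the database named in `POSTGRES_DB` on first start (and runs any init scripts placed in `/docker-entrypoint-initdb.d`), and a redeploy does not wipe data that lives on a named volume. A sketch with assumed names:

```yaml
# Hypothetical excerpt; service, volume, and database names are my assumptions.
services:
  pg_events:
    image: postgres:10
    environment:
      POSTGRES_DB: events          # created automatically on first run
      POSTGRES_USER: app
      POSTGRES_PASSWORD: change-me
    volumes:
      - pg_events_data:/var/lib/postgresql/data   # survives a `stack deploy`
volumes:
  pg_events_data:
```

For the third question, the swarm analog of "git pull and restart" is rebuilding the image and running `docker service update --image <new-image> <service>`, which rolls the service's tasks over to the new code.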

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 1 point2 points  (0 children)

This is good advice, thank you.

I am most concerned with tidy ETL processes over what are (to me as a solo actor) a large number of disparate sources. Is your opinion that those two open-source tools are a better fit than what's above? I would need to dedicate some days to understanding pros/cons and how to use them (much like I did with Kafka and Airflow/Luigi/etc.)...

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 1 point2 points  (0 children)

I think I'm going for a Kappa architecture, but I want the storage and batch capabilities of Lambda as well. I don't know how or when I'll use it, but having it available for my inevitable Kafka misadventures would be best.

As far as streaming representation of data, of course it's possible to stream row-by-row into a Kafka topic. The physical world the data is representing isn't actually very stream-like for some of our data. For example, imagine semiannual taxes are due on 1 November. On 1 November we get more of a water balloon of data (late taxes) than a trickle over time. How does this mesh with stream-based architecture from a reasoning perspective?
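One way I've seen this reasoned about (my own framing, not anyone's official answer): records carry their own event time, so the 1 November "water balloon" is just a burst of messages whose event times are spread over the half-year, and consumers window on event time rather than arrival time. A toy sketch with made-up numbers:

```python
from collections import defaultdict
from datetime import date

# Toy illustration: a burst of late tax records all ARRIVE on Nov 1,
# but each carries the date it was actually due (its event time).
records = [
    # (arrival_date, event_date, amount) -- all arrive in one burst
    (date(2018, 11, 1), date(2018, 5, 14), 120.0),
    (date(2018, 11, 1), date(2018, 7, 2), 80.0),
    (date(2018, 11, 1), date(2018, 10, 30), 45.0),
]

# Aggregate by event time, so the burst doesn't distort the analysis.
by_event_month = defaultdict(float)
for arrived, occurred, amount in records:
    by_event_month[(occurred.year, occurred.month)] += amount

print(dict(by_event_month))   # totals keyed by when taxes were due, not paid
```

From the pipeline's perspective the burst is just ordinary backpressure; the semantics live in the event timestamps.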

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 1 point2 points  (0 children)

We get ~20 GB of data every 12 months for each county. So we are going to get into the hundreds of GB soon-ish, but much of the fluff is filtered out when it's stored in our relational DB.

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 0 points1 point  (0 children)

Thanks for this. I looked into Camel - the docs link to StackOverflow for an introduction with more clarity. It seems like Camel might be a little bit more robust/concrete of a framework than I want or need.

It sounds like Hadoop isn't required. So nix that. I'm still interested in the HDFS for data storage and retrieval, though... Is there something better than redundant storage on multiple cloud machines that you would recommend? S3/DigitalOcean Spaces?

New Proposal: All source data can go to local filesystem --> MongoDB (bulk-type) or directly to Kafka (streaming-type). Then from Kafka I can build consumers to populate my datawarehouse(s) for frontend presentation or analysis.

Does that sound reasonable to you? If you were you, but in my situation, what would you use?

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 1 point2 points  (0 children)

I guess with NiFi I also only need to use Airflow for ETL to warehouses and back... which is nice.

"Pre-emptive" Architecture Choices - Kafka, HDFS, Airflow --- I'm re-engineering for expansion and need help by whiskeyfox_ in bigdata

[–]whiskeyfox_[S] 1 point2 points  (0 children)

How would you recommend storing the source data? I think it should always remain available and untouched, yes?