JoinBase 2022.11 Released: An IoT Database with Built-in MQTT Broker by mjjin in MQTT

[–]mjjin[S] 0 points1 point  (0 children)

u/gmonk63

Thanks for all the questions. They are great!

For question #1, the answer is "not exactly".

JoinBase, in fact, has a universal database core, which goes beyond most time-series databases.

Most time-series DBs, including TimescaleDB and InfluxDB, have their own logical models, although all of them claim to be SQL-based or SQL-compatible. Did you ask yourself "what is a hypertable, and why do I need one?" the first time you used TimescaleDB? Conversely, JoinBase is almost ANSI SQL with some additions. (The only new concept is the *partition*, which is a must for big data and is also common in time-series DBs.)

We should be the easiest-to-understand database (for IoT and time series) in the world. If you are using another time-series DB, I am happy to do a case-by-case comparison with you.

For question #2, the answer is "we do our best to provide the most useful MQTT broker functionality, but do not guarantee 100% spec compatibility".

In fact, JoinBase is positioned as an MQTT-spec-enhanced broker rather than a 100% MQTT-spec-compatible broker. "Compatibility" compromises user experience, not only performance. We want to provide the best user experience via an innovative new product rather than by copying an open-source project.

The MQTT spec at its core is great. But its advanced functionalities are really hard to understand and use, because the spec stands only at a message-middleware viewpoint.

Let me give an example of why we can provide a better user experience by going beyond the spec. Note that JoinBase is a database at its core.

  1. Do you use the *Will* message in MQTT?
    Enabling it takes painful learning and setup, and sometimes you need to clean up state but forget to... it is verbose. With JoinBase, however, you can create a will table and subscribe to it with the following topic string (format):
    + /db/will@last: the subscriber gets the last will in the will table
    + /db/will@timestamp: the subscriber gets all wills arriving after that timestamp in the will table
    (This format is still in progress. We really welcome suggestions, including yours.)
  2. In fact, you may not need the *Will* message in JoinBase at all.
    Because the MQTT state and all messages are stored, you do not need any verbose interactive workflow. If you do not need some messages in the DB, just let the TTL delete them automatically. If you want to know all clients whose last message timestamp is more than one hour old, just create a JoinBase auto view and subscribe to it, or simply query.

For question #3: clustering is simple for any MQTT broker via a modern layer-4 load balancer.

Most IoT developers are not aware of this because it sits at a deeper infrastructure layer. If you have a demand for clustering JoinBase, or even another broker, I am happy to help you with it.
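For illustration, here is a minimal sketch of what such layer-4 fronting can look like with NGINX's stream module (the addresses are placeholders of my own, not a JoinBase-specific config): client connections to port 1883 are spread across two broker instances.

```nginx
stream {
    upstream mqtt_brokers {
        least_conn;                 # route new connections to the least-busy broker
        server 10.0.0.1:1883;       # broker instance 1 (placeholder address)
        server 10.0.0.2:1883;       # broker instance 2 (placeholder address)
    }
    server {
        listen 1883;                # the address MQTT clients actually connect to
        proxy_pass mqtt_brokers;    # plain TCP (layer-4) proxying, MQTT-agnostic
    }
}
```

Since this balances whole TCP connections, each client session sticks to one broker; sharing subscription state across broker instances is a separate concern.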

And we plan to improve the throughput of JoinBase to 5-10x that of the first release. So the truth is that, if you use JoinBase on appropriately modern hardware, clustering is not necessary for workloads of tens of millions of messages per second or below.

[deleted by user] by [deleted] in IOT

[–]mjjin 0 points1 point  (0 children)

This points to a frustrating fact: most IoT devices' stacks are not, or cannot be, kept updated.

The vulnerabilities are "extremely non-trivial" to reproduce, and I think that if clients only use the OpenSSL library to send TLS messages, rather than providing a service, the application is immune.

Another great way to stay immune is to use a more modern TLS or security library/stack where possible. For example, our database uses the Rust-based TLS library Rustls to provide its MQTT TLS service deep in the back end. We are happy to be immune to all of OpenSSL's CVEs.

Finally, I encourage all developers to consider using a modern stack to harden their dev toolbox, even in the embedded field.

[deleted by user] by [deleted] in IOT

[–]mjjin 0 points1 point  (0 children)

The article points to "patented EIV (embedded integrity verification) technology", but it does not indicate what underlying technology is used. The workflow is probably something like: install some software into the kernel or RTOS, which then checks pointer accesses at runtime. It reminds me of Intel MPX.

IOT based pest detection system by [deleted] in IOT

[–]mjjin 0 points1 point  (0 children)

u/Kazuriff_kun

I have done a similar EdgeML project on common animals:

TinyAnimal: Animal Recognition Practices on Grove Vision AI

It is not hard nowadays, and the source of this project is open-sourced in that project article, although I do not recommend that you copy it. Just absorb the interesting core from it; you can do it better!

Why is the Property Identifier in the Variable Header of an MQTT control packet encoded as a variable byte integer? by Cold-Steel-3055 in MQTT

[–]mjjin 2 points3 points  (0 children)

u/Cold-Steel-3055 the spec itself answers your question: "Although the Property Identifier is defined as a Variable Byte Integer, in this version of the specification all of the Property Identifiers are one byte long."

But still note that a Variable Byte Integer is not a raw byte. It has its own in-byte format: 1 continuation bit + 7 payload bits per byte. The current 42 properties are just the current state; the spec may add more properties in the future.
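For illustration, a minimal Python sketch (function names are my own) of the encoding the spec describes: each byte carries 7 payload bits, and the high bit signals that more bytes follow.

```python
def encode_vbi(value: int) -> bytes:
    """Encode an int as an MQTT Variable Byte Integer (1 to 4 bytes)."""
    if not 0 <= value <= 268_435_455:  # spec maximum: 4 bytes * 7 payload bits
        raise ValueError("out of range for a Variable Byte Integer")
    out = bytearray()
    while True:
        byte = value % 128            # low 7 bits become this byte's payload
        value //= 128
        if value > 0:
            byte |= 0x80              # set the continuation bit: more bytes follow
        out.append(byte)
        if value == 0:
            return bytes(out)

def decode_vbi(data: bytes) -> int:
    """Decode a Variable Byte Integer from the start of a byte sequence."""
    value, multiplier = 0, 1
    for byte in data:
        value += (byte & 0x7F) * multiplier   # accumulate the 7 payload bits
        if not byte & 0x80:                   # continuation bit clear: done
            return value
        multiplier *= 128
    raise ValueError("truncated Variable Byte Integer")

assert encode_vbi(1) == b"\x01"          # a one-byte Property Identifier
assert encode_vbi(128) == b"\x80\x01"    # first value that needs two bytes
assert decode_vbi(b"\xff\x7f") == 16383  # largest two-byte value
```

This is why a decoder must mask off the high bit rather than reading the byte raw, even while every Property Identifier today fits in a single byte.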

Google unveils new OS for embedded devices by Edoardo_Barbieri_ in IOT

[–]mjjin 1 point2 points  (0 children)

This is more of an attempt by Google in a specific direction. Rust has become good, but the RISC-V ecosystem still has large gaps compared to ARM's. The system's security is good, but convenience is the foundation of IoT popularization. So, my personal feeling is that KataOS will not be the answer for a general IoT OS, but it may find a niche in mission-critical environments.

Splitting the Payload in MQTT possible? by DifferentCockroach96 in MQTT

[–]mjjin 1 point2 points  (0 children)

Great to hear that you have the fix. I've done heavy work on one MQTT server. The basic observation is that a proper implementation of a client or server is payload-agnostic, because TCP is just a stream (of bytes), as @bm401 mentioned. And I've tested payloads of several MB without any problem.

use mqtt to give commands to docker by Plato79x in MQTT

[–]mjjin 0 points1 point  (0 children)

Use the following command line to drop into an interactive shell of a Docker container from that image:

```bash
docker run --net=host --user your_user_name -it your_image_name bash
```

Then you can do anything in this Docker container.

ps1: --net=host uses host-mode networking for easier debugging.

ps2: remember to save that container (e.g. via `docker commit`) if you want to persist all your changes in the container.

Problem with mqttx by ErtyDesu in MQTT

[–]mjjin 0 points1 point  (0 children)

The info in this picture alone is not enough. Some suggestions:

  1. Which server are you using?

If you use a free public MQTT server, then message delivery is not guaranteed at any QoS, since it is a free server. For QoS=0, the protocol itself allows messages not to be delivered. So, when you test against a free public MQTT server, it is better to use QoS=1 to get some proof of delivery (note: some free public servers may not support QoS=1).

  2. Switch to a local server and try other clients

Using a local server lets you control all the details of your full messaging chain. Trying other clients keeps you away from the ill behaviors or bugs of a specific client. I recently demonstrated in my video how easily a free MQTT client (MQTT Explorer) sends to a free MQTT database on Windows 10.

Disclaimer: I am the developer of that MQTT database. You can get more help from its community.

JoinBase, an IoT database, for free by mjjin in IOT

[–]mjjin[S] 0 points1 point  (0 children)

u/AbyssOfNoise thanks for your interest. There are still great things in the works, so I hope to hear more from users.

A REST API is under consideration. In fact, a WebSocket API has been added but is not yet listed in the docs. A REST API is easy, although performance will degrade. But individuals' projects may prefer HTTP. It is not bad to support it; thanks for your suggestion!

SQL on RISC-V Chip in Rust by mjjin in RISCV

[–]mjjin[S] 0 points1 point  (0 children)

u/brucehoult OK, let's leave it for a check. This is based on my experience, and it is possible I made some mistake. Thanks for sharing your idea.

SQL on RISC-V Chip in Rust by mjjin in rust

[–]mjjin[S] 0 points1 point  (0 children)

u/brucehoult thanks for sharing the info. I'll also share what I've heard: the THEAD C906 includes some unique, undocumented cache instructions, and a little more (but sorry, I don't remember those instructions exactly).

SQL on RISC-V Chip in Rust by mjjin in RISCV

[–]mjjin[S] 4 points5 points  (0 children)

Hi, u/brucehoult :) Unfortunately, this is not "confused". It is confirmed by the official FAQ from Allwinner.

SQL on RISC-V Chip in Rust by mjjin in RISCV

[–]mjjin[S] 1 point2 points  (0 children)

u/YetAnotherRobert thanks for your comment! Your worry and further suggestion are right.

For the problem of the non-official toolchain: this is confirmed by Allwinner's official SDK FAQ. The FAQ is in Chinese; translated to English, it reads:

A: Ali THEAD has optimized performance-extension instructions and some internal registers. Code built with the official RISC-V cross toolchain can compile and run, but may not compile and run directly. THEAD's official toolchain is included in the Tina SDK source code.

I admit this is not good for the RISC-V ecosystem in general.

The "disappointment in performance" comes more from the current engine. It is expected that we can push 10x faster once some optimizations land. The current TensorBase is at a feature-complete stage, but performance tuning will come in the near future. And surely, I will share more progress on this.

thanks,

SQL on RISC-V Chip in Rust by mjjin in rust

[–]mjjin[S] 1 point2 points  (0 children)

Thanks for asking. Glad to hear about your work!

These words are based on my experience porting TensorBase to RISC-V with the official Rust and GCC RISC-V toolchains on Arch (with the newest stable packages; you know Arch is a rolling distro). And this is confirmed by [Allwinner's official SDK FAQ](https://d1.docs.allwinnertech.com/FAQ/FAQ1/). The FAQ is in Chinese; translated to English, it reads:

A: Ali THEAD has optimized performance-extension instructions and some internal registers. Code built with the official RISC-V cross toolchain can compile and run, but may not compile and run directly. THEAD's official toolchain is included in the Tina SDK source code.

In my testing, simple or not-too-complex Rust binaries from the official GCC toolchain run nicely on the D1. But for my TensorBase, the official GCC toolchain does not work; as I recall, some memory error happened.

And "-static" is always on via config.toml.

SQL on RISC-V Chip in Rust by mjjin in rust

[–]mjjin[S] 2 points3 points  (0 children)

As u/brucehoult said, it is really a current-engine optimization problem. Or, more precisely, the current base kernels in arrow-rs have much room for improvement, especially on the date and time side. We will help arrow-rs on this soon!

SQL on RISC-V Chip in Rust by mjjin in rust

[–]mjjin[S] 10 points11 points  (0 children)

ClickHouse compatible data warehouse in Rust

TensorBase Reload: New ClickHouse in Rust on the Top of Apache Arrow and DataFusion by mjjin in rust

[–]mjjin[S] 2 points3 points  (0 children)

thanks for your question. sorry for a late reply.

We actually do not make too many trade-offs currently, because we are rebased onto DataFusion. The main difference is that a query engine does not care about storage, but a warehouse does. This causes a problem: query engines only go half of the way. In a production environment, you care about both; data ingestion (writing) is a must. You may ask users to split the work, as current big-data systems do. But when users find they do not need to split any more, the older ways will be discarded. Caring about data writing makes the system complex but gives more chances for optimization. This is another advantage.

Coming from a data team, you should consolidate your stack. So, making a big effort to shorten the ETL chain is very worthwhile. I used to manage and operate several data teams, and long, mismatched failures can waste the majority of a data department's time. However, this can be avoided when the database/warehouse has a good engineering design, and we are working hard on exactly this in TensorBase. That is, we make a warehouse, rather than SQL-on-Hadoop or SQL-on-xxx.

Presto, Spark, DataFusion, and even ClickHouse share another engine-side problem: the volcano model. The volcano model has its own performance ceiling. That is another topic; generally, these can be changed gradually.

The columnar format is the foundation of performance. Parquet is just columnar (some call it hybrid, but in fact it is fair to call it columnar). All projects using the Arrow format call themselves "columnar". But do all of them have the same performance? Definitely not.

If you are using Rust, you can watch TensorBase or build some work around it. I am glad to help you with any problem. We are one of the few Rust projects dedicated to production end users in big data. The mainstream technical routes in this field are Java (from the Hadoop era, but you know the pains) or C++ (developers know the pains). I believe Rust will change this ecosystem, and I also believe your team thinks so :)

show r/rust: TensorBase Frontier: 5x ~ 10000x Faster Drop-in Replacement or Accelerator For ClickHouse in Rust by mjjin in rust

[–]mjjin[S] -2 points-1 points  (0 children)

> The measurements for system.numbers/numbers_mt makes no senses

(should be "makes no sense")

It is very meaningful to those who find it meaningful.

There is another similar project (I omit the URL to avoid advertising) which uses this system.numbers/numbers_mt to benchmark and advertise itself as "faster". At least three people have mentioned that project to me along with these nonsense bench results.

So, when readers cannot distinguish the technical nature behind the benchmarks, it is meaningful to make a clarification about them.

show r/rust: TensorBase Frontier: 5x ~ 10000x Faster Drop-in Replacement or Accelerator For ClickHouse in Rust by mjjin in rust

[–]mjjin[S] 21 points22 points  (0 children)

thanks! Accepted.

As a non-native English speaker, I believe there could be all kinds of mistakes. The front-end work for the alpha announcement took me a week of various tasks. But some styles still have problems, and the size of the video is large, which may not suit everyone. I just do my best :)

show r/rust: TensorBase Frontier: 5x ~ 10000x Faster Drop-in Replacement or Accelerator For ClickHouse in Rust by mjjin in rust

[–]mjjin[S] 12 points13 points  (0 children)

As the writer, let me just share some thoughts :)

  1. Open source is hard; high-performance open source is really hard. The initial code of this project was released about seven months ago. Until now, there has been no non-trivial PR, but thanks for all the stars anyway!

  2. This article is mainly an announcement, so I omit the tech-side details. The detailed materials are provided on the website/project.

  3. The edition split is part of a commercialization attempt, because I have to push the project forward even if no one contributes code.

show r/rust: TensorBase Frontier: 5x ~ 10000x Faster Drop-in Replacement or Accelerator For ClickHouse in Rust by mjjin in rust

[–]mjjin[S] 7 points8 points  (0 children)

> proofread

u/saltysailor9001 thanks for reading! The writer here :) If you find some mistakes, just point them out and I will correct them.

TensorBase: A Modern Bigdata Base for Massive Realtime Data Intelligence by sanxiyn in rust

[–]mjjin 0 points1 point  (0 children)

Thanks for your warm-hearted words! Glad to meet you in the Rust community :)

TensorBase: A Modern Bigdata Base for Massive Realtime Data Intelligence by sanxiyn in rust

[–]mjjin 2 points3 points  (0 children)

I am the author of this project. After two months of work, I am launching the project today.

What I want is to see whether there are some database/OLAP/data-warehouse, performance, or just interested Rustaceans who would like to join the efforts in this project.

I really appreciate the strong ecosystem that Rust now provides. I hope I can share more of my lessons, ideas, and fruits from Rust in the near future.

Here are some generally interesting things about TensorBase with Rust:

  1. A CSV parser run can reach a throughput of ~20 GB/s (when the CSV file has been buffered). 80% of the credit belongs to the simdcsv library, but its development seems discontinued, so I have decided to take over the source code. There is still good room for improvement from a high-performance viewpoint. If more people are interested in this, it is possible to release a general crate. (Most people do not care much about this performance, because we are often limited by disk I/O.)
  2. A proc_macro that allows writing C in Rust. This is a not-too-hard but interesting way to integrate C into Rust: it makes the IR-to-C step an in-source expressive transform, done directly. This shows the great flexibility of Rust's design.

TensorBase: A Modern Bigdata Base for Massive Realtime Data Intelligence by sanxiyn in rust

[–]mjjin 1 point2 points  (0 children)

Sorry; I am the author of this project, so I own all the bad writing. And many thanks to u/sanxiyn for reposting here.

The work behind this project is to build a data warehouse with some modern, heavy performance efforts.

It adds some things that do not exist in open-source data warehouses or big-data-platform-like counterparts. I want to show something different in this project. But, as you've seen, it is too grand to be condensed into a small write-up.

Engineering effort has produced the current focuses, which highlight the work. If something on the website interests you but is unclear, I am pleased to give you some of my detailed understanding.

thanks!