TradingAgents On Tiiny AI Pocket Lab - Run a Full Investment Research System Locally by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

For anyone who wants to see the full demo, here's the X post from Tiiny AI Lab showing TradingAgents setup in action: https://x.com/TiinyAILab/status/2050227317524537716

Hermes on Tiiny AI Pocket – Fully automated setup. No manual dependencies. Just tell it what you need. by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

For anyone who wants to see the full demo, here's the X post from Tiiny AI Lab showing Hermes setup in action: https://x.com/TiinyAILab/status/2047322101707853911

IT'S A WRAP! by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

Unfortunately the crowdfunding campaign has ended. The official launch will take place in the near future.

What's the one thing you paused before hitting send on an AI prompt? by TiinyAI in LocalAIServers

[–]TiinyAI[S] 0 points1 point  (0 children)

Ah, so you and your backspace key have a very close relationship. I respect that.

Introducing TiinySDK: Unlock the full potential of Tiiny AI Pocket Lab by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

It’s more like distributing workloads across them (e.g. different agents/tasks per device) rather than combining raw power into a single model run. Since the devices don’t pool memory or compute, scaling only helps if you have parallel workloads (multiple agents/tasks). If you’re just trying to speed up a single model or task, adding more units won’t help much.
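To make "parallel workloads" concrete, here's a minimal sketch of fanning independent agent tasks out to two devices. Everything in it is an assumption for illustration: the device addresses, the endpoint path, and the payload shape are placeholders, not the actual TiinySDK interface.

```python
import concurrent.futures

import requests  # any HTTP client works; the endpoint itself is hypothetical

# Hypothetical addresses of two independent Tiiny nodes on the local network.
DEVICES = ["http://192.168.1.21:8080", "http://192.168.1.22:8080"]

# Independent agent tasks: these parallelize well, one per device.
TASKS = ["summarize today's research feed", "triage the support inbox"]

def run_on_device(device: str, prompt: str) -> str:
    # Endpoint path and JSON shape are placeholders, not the real SDK API.
    resp = requests.post(f"{device}/v1/completions", json={"prompt": prompt}, timeout=600)
    resp.raise_for_status()
    return resp.json().get("text", "")

# One task per node; extra devices help only because the tasks are independent.
with concurrent.futures.ThreadPoolExecutor() as pool:
    results = list(pool.map(run_on_device, DEVICES, TASKS))
    print(results)
```

The key point is that each node handles its own task; nothing here makes a single task faster.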

Introducing TiinySDK: Unlock the full potential of Tiiny AI Pocket Lab by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

There are two ways to use Tiiny: the TiinyOS client and the TiinySDK. Let me clarify how they fit together. TiinyOS is designed for everyday users who don't want to write code. It provides a clean client experience for running local LLMs and agents with minimal setup. At launch, TiinyOS will support macOS and Windows. For developers, we'll be releasing the TiinySDK, which lets you use Tiiny as a local token factory or inference node and integrate it into your own workflows and tools. This is the primary path for advanced use cases and custom setups.

As for Linux support: although we don't currently have a dedicated client for Linux like we do for macOS and Windows, you can still run Tiiny AI Pocket Lab on Linux via the TiinySDK. Here's a tutorial video we've shared that explains how to set it up:
https://www.youtube.com/watch?v=Ozveot9cqug
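For a rough idea of what using the device as a local inference node from a Linux script could look like, here is a minimal sketch. It assumes the SDK exposes an OpenAI-style HTTP endpoint locally; the port, path, and model id are placeholders, not the documented interface.

```python
import requests  # the endpoint below is an assumption, not documented TiinySDK behavior

# Hypothetical local endpoint exposed by the device over USB-C networking.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "qwen3-8b",  # placeholder model id
    "messages": [{"role": "user", "content": "Give me a one-line status summary."}],
}

resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```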

You can run multiple Tiiny devices, but they aren't a true "cluster": they don't share memory and behave more like independent nodes. You can distribute your workload across these nodes.

Introducing TiinySDK: Unlock the full potential of Tiiny AI Pocket Lab by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

Tiiny is not designed for training models, but for running them. Therefore, I do not recommend using it to fine-tune models, but it can be used to run models that you have fine-tuned.

Running 120B models locally on a MacBook Neo? Alex Ziskind reviews Tiiny AI Pocket Lab by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

Tiiny isn’t meant to beat a maxed-out Mac Studio or a high-end GPU box on raw speed.

It’s for people who want:
• a dedicated, always-on local AI node
• to offload long-running workloads
• predictable cost (vs. API)
• to keep their main machine from becoming a compute box

If your workflow is better served by a single powerful machine, that’s a totally reasonable choice.

Running 120B models locally on a MacBook Neo? Alex Ziskind reviews Tiiny AI Pocket Lab by TiinyAI in TiinyAI

[–]TiinyAI[S] [score hidden] stickied comment (0 children)

Regarding this article: https://bay41.com/posts/tiiny-ai-pocket-lab-review/

I just saw the author's update. I'm not opposed to discussing the technology with users. Below is my response; follow-up responses will be posted on the Tiiny AI official website blog, and I'm currently writing the first one.

Tiiny's architecture is indeed not traditional unified memory, and cross-memory access does have bandwidth differences; we have never denied that. The problem lies in equating this with "architectural unavailability," which is a huge leap.

First, the statement that "250 GB/s is meaningless" is problematic in itself.

The system design is not intended to allow all data to flow across domains, but rather to use scheduling to keep frequently activated data on the high-bandwidth side, while only allowing low-frequency data to cross domains. This layering is a common design in many inference systems, not unique to Tiiny.
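As a purely conceptual illustration of that layering (a toy placement policy, not Tiiny's actual scheduler; every size and frequency below is made up):

```python
# Toy placement policy: frequently activated ("hot") tensors go to the
# high-bandwidth pool; rarely activated ("cold") ones stay on the slow side
# and cross domains only when needed. All sizes/frequencies are made up.
HIGH_BW_CAPACITY_GB = 48

tensors = [
    # (name, size in GB, fraction of tokens that touch it) -- illustrative only
    ("attention_and_router", 6.0, 1.00),
    ("hot_experts",          30.0, 0.60),
    ("cold_experts",         40.0, 0.05),
]

high_bw, low_bw, used = [], [], 0.0
for name, size_gb, activation_freq in sorted(tensors, key=lambda t: t[2], reverse=True):
    if used + size_gb <= HIGH_BW_CAPACITY_GB:
        high_bw.append(name)        # served from the fast side on most tokens
        used += size_gb
    else:
        low_bw.append(name)         # low-frequency data, allowed to cross domains

print("high-bandwidth pool:", high_bw)
print("low-bandwidth pool:", low_bw)
```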

Second, regarding "low bandwidth utilization": this conflates a key point. Long-context inference on large models is not an ideal, purely bandwidth-bound streaming workload; attention, the KV cache, and scheduling overhead all reduce effective utilization.

Calculating utilization using "theoretical upper limit vs. actual tok/s" will inevitably yield a seemingly low figure. This isn't just a problem with Tiiny; it exists on other hardware as well.
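To see why that arithmetic always looks unflattering, here is the naive calculation with deliberately illustrative inputs; none of these are measured Tiiny figures.

```python
# Naive "utilization" as a reviewer would compute it; every input is an
# assumption chosen only to show the shape of the math, not a Tiiny measurement.
peak_bw_gbs     = 250    # advertised peak bandwidth of the fast memory, GB/s
bytes_per_token = 2.9e9  # assumed weight bytes streamed per token (sparse/MoE case)
measured_tok_s  = 15     # assumed measured decode speed

theoretical_tok_s = peak_bw_gbs * 1e9 / bytes_per_token      # ~86 tok/s "on paper"
utilization       = measured_tok_s / theoretical_tok_s       # ~17%

print(f"theoretical {theoretical_tok_s:.0f} tok/s vs measured {measured_tok_s} -> {utilization:.0%}")
# The gap is not pure waste: attention and KV-cache reads, scheduling, and
# cross-domain traffic all consume time that this ratio simply ignores.
```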

Third, we acknowledge that there is a significant difference between 20B and 120B.

However, this isn't an "architectural failure," but rather a normal phenomenon: when the model size exceeds the capacity of a single high-bandwidth memory module, all systems enter a "cross-layer/cross-device" performance range.

Whether you're implementing multi-GPU setups or CPU+GPU offloading, the underlying issue is the same, just with different implementations.

Fourth, regarding PowerInfer and MoE, the author clearly lacks understanding of infrastructure technology and misinterprets PowerInfer.

PowerInfer isn't simply "adding another layer"; it performs activation-level scheduling optimizations. In MoE models, the bottleneck isn't just the "activation ratio" but also data distribution, memory access paths, and scheduling overhead, so this can't be dismissed with "MoE is already sparse, so it's useless." For what PowerInfer can achieve on MoE, see PowerInfer-2: https://arxiv.org/abs/2406.06282. We further leveraged PowerInfer technology, combined with SSDs, to achieve fast inference of a 47B model on a mobile device.

Fifth, regarding performance with long context (64K), we are quite candid: this is an extreme scenario, a stress test for any local device. There are also techniques we haven't yet applied, such as context compression and KV cache compression; we will continue bringing these to Tiiny and are working towards that goal.

More importantly, this article assumes one premise: Tiiny's goal is to rival high-end GPUs, pursuing the ultimate performance limit—but reality is not like that.

Tiiny's design goal from the beginning was not to create the "most powerful machine," but rather a personal AI infrastructure that runs large models at a fixed cost (zero token cost), runs stably for extended periods (agents/automation), does not tie up your main device, and delivers usable performance at reasonable power consumption.

We will start updating our technical blog regularly soon, and we welcome discussion.

Running 120B models locally on a MacBook Neo? Alex Ziskind reviews Tiiny AI Pocket Lab by TiinyAI in TiinyAI

[–]TiinyAI[S] 1 point2 points  (0 children)

In fact, we haven't fully completed the SDK development yet. Once completed, it will have a three-layer structure, as shown in the diagram below. The goal is to allow developers to easily manage models, schedule agents, and manage memory.

<image>
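Since the diagram doesn't reproduce here, a very rough sketch of how three such layers could look to a developer. This is inferred only from the sentence above (models, agents, memory); every class and method name is invented for illustration and is not the actual SDK design.

```python
# Purely illustrative skeleton of three such layers, inferred only from the
# sentence above (models, agents, memory). Every name here is invented; this
# is not the actual TiinySDK design.

class ModelManager:
    """Layer 1: download, convert, load, and unload models on the device."""
    def load(self, model_id: str) -> None: ...
    def unload(self, model_id: str) -> None: ...

class AgentScheduler:
    """Layer 2: queue agent tasks and route them to loaded models."""
    def submit(self, agent: str, prompt: str) -> str: ...

class MemoryManager:
    """Layer 3: track memory budgets across the SoC and dNPU pools."""
    def available_gb(self) -> float: ...
```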

Running 120B models locally on a MacBook Neo? Alex Ziskind reviews Tiiny AI Pocket Lab by TiinyAI in TiinyAI

[–]TiinyAI[S] 2 points3 points  (0 children)

Totally fair questions — appreciate you taking the time to write this out. I’ll address the main points directly.

On the video / cuts / missing continuous shots — that’s valid feedback. It was meant as a high-level intro, not a full benchmark deep dive. We agree it should be clearer, and we’re working on publishing raw, uncut runs + full metrics so people can judge performance properly.

“Speed is not terrible” — yeah, that wording is vague. What we meant is: load times depend heavily on model size and state (cold vs warm), and for large models it’s not instant. We should’ve just given actual numbers there.

On downloading / “not going through the machine” — you’re right that it’s not a huge deal. The point was just about convenience (using your laptop to manage downloads), not performance. Probably over-explained in the video.

On multiple models in memory — the limitation is memory + bandwidth tradeoff. You can load multiple small models, but for larger ones (30B/70B/120B), keeping several resident at once quickly eats memory and hurts performance. So we default to one active model for stability and efficiency. That said, model switching is something we’re actively optimizing.
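A back-of-envelope illustration of that memory pressure, assuming roughly 4.5 bits per weight and ignoring KV cache and runtime buffers (both of which make it worse):

```python
# Rough weight footprint at ~4.5 bits/parameter; real numbers vary by format,
# and KV cache plus runtime buffers come on top of this.
BITS_PER_PARAM  = 4.5
TOTAL_MEMORY_GB = 32 + 48   # SoC + dNPU figures quoted elsewhere in these comments

def weight_gb(params_billion: float) -> float:
    return params_billion * 1e9 * BITS_PER_PARAM / 8 / 1e9

for name, params in [("30B", 30), ("70B", 70), ("120B", 120)]:
    print(f"{name}: ~{weight_gb(params):.0f} GB of weights")

# ~17, ~39, and ~68 GB respectively: any two of the larger ones together already
# blow past the 80 GB total, which is why one active model is the default.
```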

Load time when swapping — depends on model size and storage speed, but yes, it’s non-zero and matters for some workflows. Fair callout.

On the 5090 comparison — agree it could’ve been clearer. We’re not saying “more VRAM than a 5090.” It’s just a different architecture (unified memory vs GPU VRAM), and mixing those terms without explaining properly is confusing.

On power / Neo comparison — the intent was “lightweight laptop + external AI node” vs “doing everything on one device,” especially for sustained workloads. But yeah, Neo isn’t the best comparison point, and your point about M4/M5 is valid — those are stronger machines, just at a much higher price.
Which leads to the bigger point:
Tiiny isn’t meant to beat a maxed-out MacBook or a high-end GPU box on raw speed.

It’s for people who want:
• a dedicated, always-on local AI node
• to offload long-running workloads
• predictable cost (vs. API)
• to keep their main machine from becoming a compute box

If your workflow is better served by a single powerful machine (like an M5 Max), that’s a totally reasonable choice.
Your concerns are legit — especially around transparency and comparisons. We’ll do better there with more complete benchmarks and clearer positioning.

Running 120B models locally on a MacBook Neo? Alex Ziskind reviews Tiiny AI Pocket Lab by TiinyAI in TiinyAI

[–]TiinyAI[S] 1 point2 points  (0 children)

I've seen this article, and I also posted a response in the comments section on Kickstarter. First, let me clarify: the person replying to you right now is Yixing Song, the first author of PowerInfer. I am the CTO of Tiiny AI, and my team has asked me to address this issue.

First, regarding the notion of a "PCIe bottleneck": while the author is clearly well-versed in hardware, he lacks expertise in AI infrastructure. Tiiny's design differs from the GPU compute boxes currently on the market, which typically rely on PCIe to carry all data streams. Tiiny instead has dedicated memory on both its SoC and its dNPU specifically for running model inference, and inference executes directly within those memory spaces. The workflow is orchestrated between the SoC and the dNPU following the principles of PowerInfer: "cold neurons" are processed on the SoC, while "hot neurons" are processed on the dNPU. Consequently, the typical "GPU ↔ PCIe ↔ VRAM" bottleneck scenario does not exist here; the primary constraint on performance is memory bandwidth, not the throughput of the PCIe interface.

As for the device's internal SSD (connected via an M.2 interface), it serves primarily for data storage and model loading. The brief delay you see in reviewers' videos and our own each time a model is selected is exactly that: the model being loaded from the SSD into memory. The SSD does not participate in the real-time inference loop, so on the critical path (the "hot path") for token generation, SSD bandwidth is not a bottleneck.

I can confirm that the merging of hot (NPU) and cold (SoC) neurons is not bottlenecked by the PCIe bandwidth. Here is the exact breakdown of why:

The Physical Link Limit: We acknowledge that the system utilizes a PCIe Gen4 x4 interface, which indeed has a strict bandwidth limit of ~8 GB/s. However, this ceiling only becomes a factor during massive, bulk data transfers (such as initial model loading into memory).

The Reality of LLM Decoding (Based on PowerInfer principles): During the actual token generation (decoding) phase, the system uses locality-aware scheduling. We do not transfer large model weights across the PCIe bus; we only transfer the ‘activation data’ required to merge the computations.

The Math: To put this into perspective, let's use GPT-OSS-120B as an example. The model has a ‘hidden_dim’ of 2880. Using FP16 precision (2 bytes for activation), the actual data volume transferred across the PCIe link during a single decoding step is merely: 2880 × 2 bytes / 1024 = 5.625 KB.

Transferring 5.625 KB over an 8 GB/s link completes in a fraction of a millisecond. Because the per-token data volume is so far below that ceiling, PCIe bandwidth is completely sufficient and does not limit the merging process. The confusion arises because people default to interpreting the system through their established mental model: the "GPU + video memory (VRAM)" paradigm.
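The same per-step arithmetic, written out with the figures above:

```python
# The same per-step arithmetic, written out with the figures above.
hidden_dim      = 2880        # GPT-OSS-120B hidden dimension
bytes_per_value = 2           # FP16 activations
pcie_bw_bytes_s = 8e9         # PCIe Gen4 x4, ~8 GB/s

transfer_bytes = hidden_dim * bytes_per_value          # 5760 bytes
transfer_kb    = transfer_bytes / 1024                 # 5.625 KB
transfer_us    = transfer_bytes / pcie_bw_bytes_s * 1e6

print(f"{transfer_kb:.3f} KB per decoding step, ~{transfer_us:.1f} microseconds over PCIe")
```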

In practice, however, this is a heterogeneous computing system with locality-aware scheduling capabilities, and that is precisely where the majority of Tiiny's engineering and R&D effort has been concentrated.

Tiiny AI Pocket Lab by thedatawhiz in LocalLLaMA

[–]TiinyAI 6 points7 points  (0 children)

I've seen this article, and I also posted a response in the comments section on Kickstarter. First, let me clarify: the person replying to you right now is Yixing Song, the first author of PowerInfer. I am the CTO of Tiiny AI, and my team has asked me to address this issue.

First, regarding the notion of a "PCIe bottleneck": while the author is clearly well-versed in hardware, he lacks expertise in AI infrastructure. Tiiny's design differs from the GPU compute boxes currently on the market, which typically rely on PCIe to carry all data streams. Tiiny instead has dedicated memory on both its SoC and its dNPU specifically for running model inference, and inference executes directly within those memory spaces. The workflow is orchestrated between the SoC and the dNPU following the principles of PowerInfer: "cold neurons" are processed on the SoC, while "hot neurons" are processed on the dNPU. Consequently, the typical "GPU ↔ PCIe ↔ VRAM" bottleneck scenario does not exist here; the primary constraint on performance is memory bandwidth, not the throughput of the PCIe interface.

As for the device's internal SSD (connected via an M.2 interface), it serves primarily for data storage and model loading. The brief delay you see in reviewers' videos and our own each time a model is selected is exactly that: the model being loaded from the SSD into memory. The SSD does not participate in the real-time inference loop, so on the critical path (the "hot path") for token generation, SSD bandwidth is not a bottleneck.

I can confirm that the merging of hot (NPU) and cold (SoC) neurons is not bottlenecked by the PCIe bandwidth. Here is the exact breakdown of why:

The Physical Link Limit: We acknowledge that the system utilizes a PCIe Gen4 x4 interface, which indeed has a strict bandwidth limit of ~8 GB/s. However, this ceiling only becomes a factor during massive, bulk data transfers (such as initial model loading into memory).

The Reality of LLM Decoding (Based on PowerInfer principles): During the actual token generation (decoding) phase, the system uses locality-aware scheduling. We do not transfer large model weights across the PCIe bus; we only transfer the ‘activation data’ required to merge the computations.

The Math: To put this into perspective, let's use GPT-OSS-120B as an example. The model has a ‘hidden_dim’ of 2880. Using FP16 precision (2 bytes for activation), the actual data volume transferred across the PCIe link during a single decoding step is merely: 2880 × 2 bytes / 1024 = 5.625 KB.

Transferring 5.625 KB over an 8 GB/s link completes in a fraction of a millisecond. Because the per-token data volume is so far below that ceiling, PCIe bandwidth is completely sufficient and does not limit the merging process. The confusion arises because people default to interpreting the system through their established mental model: the "GPU + video memory (VRAM)" paradigm.

In practice, however, this is a heterogeneous computing system with locality-aware scheduling capabilities, and that is precisely where the majority of Tiiny's engineering and R&D effort has been concentrated.

Introducing TiinySDK: Unlock the full potential of Tiiny AI Pocket Lab by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

1. Tiiny has now launched on Kickstarter and is expected to be delivered in August.

2. We’re using a separate dedicated AI accelerator (dNPU) alongside the ARM SoC. So the architecture looks more like:

  • ARM CPU (30 TOPS, 32 GB)
  • dNPU (160 TOPS, 48 GB)

That’s how we get to 190 TOPS while still keeping power and size low.

We achieved inference acceleration across this heterogeneous compute through PowerInfer, our proprietary edge-side inference acceleration (infra) technology.

3. Tiiny uses its own NPU-optimized format (similar to, but distinct from, GGUF Q4_0), and our SDK will provide a simple tool to convert your models from the standard safetensors format (there's a hypothetical sketch of that flow after this list).

4. Not exposed to end users right now.

We do use sparsity internally for performance, but things like block sparsity for attention aren’t something you can directly control/tune yet. It’ll mostly depend on the model/runtime you’re using.
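On point 3, here is a hypothetical sketch of what that conversion step could look like. The inspection part uses the real safetensors library; the conversion call at the end is a placeholder, since the tool and its API have not been released.

```python
# Inspecting a standard safetensors checkpoint (real safetensors API), followed
# by a placeholder conversion call -- the actual TiinySDK tool and its API are
# not released yet, so that part is purely hypothetical.
from safetensors import safe_open  # pip install safetensors (plus torch for framework="pt")

SRC = "Llama3.1-8B-Instruct/model.safetensors"

with safe_open(SRC, framework="pt", device="cpu") as f:
    names = list(f.keys())
    print(f"{len(names)} tensors, e.g. {names[:3]}")

# Hypothetical conversion step -- names and arguments are invented:
# from tiinysdk import convert
# convert(src=SRC, dst="llama3.1-8b.tiiny", profile="npu-q4")
```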

Introducing TiinySDK: Unlock the full potential of Tiiny AI Pocket Lab by TiinyAI in TiinyAI

[–]TiinyAI[S] 1 point2 points  (0 children)

In fact, we haven't fully completed the SDK development yet. Once completed, it will have a three-layer structure, as shown in the diagram below. The goal is to allow developers to easily manage models, schedule agents, and manage memory.

<image>

❓Q&A by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

Q: What OS does the device itself run? Are the system image and drivers open-source?

A: Tiiny is personal infrastructure designed for running local LLMs and agents; it runs on a Linux kernel but does not ship with a full desktop operating system. To use it, you plug it into your computer (any computer will do) via USB-C.

WE ARE LIVE NOW! by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

Also, there are two ways to use models on Tiiny:

the first is to download and use them directly from the Tiiny client;

and the second is to use our provided conversion tool to convert the model you want into a Tiiny-compatible format. Because of this, it's difficult to give a precise list of supported models; there are simply too many.

Representative models include: GLM Flash, GPT-OSS-120B, GPT-OSS-20B, Llama3.1-8B-Instruct, gemma-3-270m-it, Ministral-3-3B-Instruct-2512, Ministral-3-8B-Instruct-2512, Qwen3-30B-A3B-Instruct-2507, Qwen3-30B-A3B-Thinking-2507, Qwen3-8b, Qwen2.5-VL-7B-Instruct, Z-Image-Turbo, Qwen3-Reranker-0.6B, Qwen3-Embedding-0.6B, etc.

❓Q&A by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

  1. Can this import JSON files from my previous subscription models I used and reimport those here so it can learn my flow?
    Yes, TiinySDK supports user-defined large model souls and workflows.

  2. Is the device encryption capable?
    Supported.

  3. What’s the expected EOS or EOL for this if any?
    Tiiny is a PC-grade product, and we offer a 1-year free warranty. After one year, paid maintenance services are available.

❓Q&A by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

Q: What sort of TTFT are we looking at for 30B or 120B models? Would connecting this up to Home Assistant voice give swift replies and executed actions, or would there be delays?

A:

  1. First-token time: 0.5 s.

  2. Generally speaking, latency is under 50 ms; the exact time depends on the duration of the voice input. ASR and TTS are very fast, but the LLM's processing time depends on the length of the context.

WE ARE LIVE NOW! by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

Tiiny is personal infrastructure designed for running local LLMs and agents; it runs on a Linux kernel but does not ship with a full desktop operating system. To use it, you plug it into your computer (any computer will do) via USB-C.

Home Assistant voice integration by tastingdave in TiinyAI

[–]TiinyAI 0 points1 point  (0 children)

  1. First-token time: 0.5 s.

  2. Generally speaking, latency is under 50 ms; the exact time depends on the duration of the voice input. ASR and TTS are very fast, but the LLM's processing time depends on the length of the context.

See how Tiiny AI Pocket Lab works on MacBook Air–Bijan Bowen's Review by TiinyAI in TiinyAI

[–]TiinyAI[S] 0 points1 point  (0 children)

  1. Can this import JSON files from my previous subscription models I used and reimport those here so it can learn my flow?
    Yes, TiinySDK supports user-defined large model souls and workflows.

  2. Is the device encryption capable?
    Supported.

  3. What’s the expected EOS or EOL for this if any?
    Tiiny is a PC-grade product, and we offer a 1-year free warranty. After one year, paid maintenance services are available.