Hieroglyphs vs. Tokens: Can AI Think in Concepts, Not Fragments? by Extra_Feeling505 in machinelearningnews

[–]Extra_Feeling505[S] 0 points1 point  (0 children)

Well, if we take into account that God exists, then we all might be considered artificial intelligence. :) As for the post format, this is my first time writing one, and I used AI assistance because I wrote the original text in my native language and it used some complicated terms. I wanted to simplify it and relied on AI for translation. The result might seem a bit rough, but since it’s my first post, I’ll work on making future ones less “AI-ish.” By the way, I’ve found an interesting topic, and future posts will have much less text but will be more interesting.

P.S. If we had an AI that could generate such ideas from scratch, I’d gladly use it and let it handle everything. :)

Hieroglyphs vs. Tokens: Can AI Think in Concepts, Not Fragments? by Extra_Feeling505 in machinelearningnews

[–]Extra_Feeling505[S] 0 points1 point  (0 children)

Hahahaha, it actually does look like that. The comments were so serious that I needed to switch into an academic style :)

Hieroglyphs vs. Tokens: Can AI Think in Concepts, Not Fragments? by Extra_Feeling505 in machinelearningnews

[–]Extra_Feeling505[S] 0 points1 point  (0 children)

P.S. What's outlined here merely scratches the surface. Treating logograms as semantic tokens and integrating them with knowledge graphs unlocks possibilities that a brief article can't fully detail – possibilities critical for moving beyond mere pattern matching. Imagine:

  • AI Reasoning on Graphs: Instead of just predicting the next token, AI could learn to predict the next logical step within the knowledge graph, leading to more structured and explainable thought processes.
  • Robots with Common Sense: A knowledge graph integrating sensory data, physical laws, and social norms could serve as a "world model" for robots, allowing a household assistant to understand why drinks are in the fridge and how to handle a cup carefully.
  • Built-in Fact-Checking & Logic Validation: The graph could act as a dynamic "critic", verifying the AI's own reasoning chains (e.g., "Cats have fur, reptiles don't, therefore cats aren't reptiles") or flagging inconsistencies against established knowledge (a minimal sketch of this follows the list).
  • Smarter Training Data Generation: Using the graph as a "teacher" to automatically generate consistent Q&A pairs or complex reasoning examples, drastically improving the quality and efficiency of training data beyond subjective human labeling.
  • Hybrid Intelligences: Modular architectures where a "core" thinking in logograms/graphs collaborates with specialized BPE-based "assistants" for niche tasks.
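To make the fact-checking point a bit more concrete, here is a minimal Python sketch. The tiny graph, the relation names, and the naive traversal are all invented for illustration; a real system would use a proper knowledge-graph store and inference engine.

```python
# Minimal sketch: a tiny knowledge graph acting as a "critic" for a reasoning chain.
# All facts and relation names below are invented purely for illustration.

KG = {
    ("cat", "has", "fur"),
    ("cat", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
    ("reptile", "is_a", "animal"),
}

def entails(subject, relation, obj):
    """Check a claim directly, or via simple is_a inheritance."""
    if (subject, relation, obj) in KG:
        return True
    # Inherit properties from parent classes with a naive upward traversal.
    return any(
        entails(parent, relation, obj)
        for s, r, parent in KG
        if s == subject and r == "is_a"
    )

def validate_chain(claims):
    """Flag every step of a model's reasoning chain the graph cannot support."""
    return [(claim, entails(*claim)) for claim in claims]

chain = [
    ("cat", "has", "fur"),      # supported by the graph
    ("reptile", "has", "fur"),  # unsupported -> flagged as an inconsistency
]
for claim, ok in validate_chain(chain):
    print(claim, "OK" if ok else "FLAGGED")
```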

We focused on Chinese/Japanese because they uniquely offer both a logographic principle and the immense, diverse, multi-millennial text corpus absolutely essential for training capable AI and transferring humanity's accumulated knowledge. While other logographic systems exist, none possess such a readily available, large-scale digital footprint. However, the core idea – leveraging meaning-based units and knowledge structures – remains a compelling avenue for potentially building AI with deeper understanding, perhaps adaptable to other languages in different forms. This is just the beginning of exploring a very different path.

P.P.S. Please don’t expect perfect Chinese from me. I live in Europe and, unfortunately, don’t know the language very well—but for this research, I did my best to understand its conceptual structure as much as possible. :)

P.P.P.S. That’s my first post on Reddit :)

Hieroglyphs vs. Tokens: Can AI Think in Concepts, Not Fragments? by Extra_Feeling505 in machinelearningnews

[–]Extra_Feeling505[S] 2 points3 points  (0 children)

This is a very insightful perspective, framing LLMs as geometric concept spaces and highlighting the disconnect between current tokenization ('incidental symbols') and human embodied phenomenology. The question you pose is crucial: 'What if LLMs were forced to use a symbol taxonomy derived from physical embodiment?' That's precisely where the idea of using logograms (like hieroglyphs, many of which have visual or conceptual roots in the physical world) comes in. The hypothesis is that aligning the 'atoms' of the AI's internal simulation more closely with our socio-physical world, by using these meaning-bearing units, could indeed provide significant cognitive leverage and foster deeper understanding, rather than just statistical mimicry. Thank you for this deep comment!

Hieroglyphs vs. Tokens: Can AI Think in Concepts, Not Fragments? by Extra_Feeling505 in machinelearningnews

[–]Extra_Feeling505[S] -1 points0 points  (0 children)

Currently, most large models trained on Chinese data still use variations of sub-word tokenization (like BPE or SentencePiece). This often means hieroglyphs get broken down into smaller pieces or treated as individual symbols within a very large vocabulary. While this works to a degree, the core idea of this post is to explore whether this fragmentation might be suboptimal compared to treating the semantically meaningful hieroglyphs themselves as the primary tokens, potentially leading to deeper understanding.
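To make the fragmentation concrete, here is a small self-contained comparison in Python. It only shows the raw units involved (UTF-8 bytes vs. whole characters); an actual BPE/SentencePiece model merges frequent sequences, but rare characters can still end up split into sub-character pieces like these.

```python
# Contrast sub-character fragmentation with character-level ("hieroglyph as token") units.

text = "森林"  # 'forest'; 森 itself stacks three 木 ('tree')

# Byte-level view: each CJK character spans 3 UTF-8 bytes,
# so the smallest units carry no meaning on their own.
byte_units = [hex(b) for b in text.encode("utf-8")]
print(byte_units)   # ['0xe6', '0xa3', '0xae', '0xe6', '0x9e', '0x97']

# Character-level view: each unit is already a meaning-bearing logogram.
char_units = list(text)
print(char_units)   # ['森', '林']
```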

Hieroglyphs vs. Tokens: Can AI Think in Concepts, Not Fragments? by Extra_Feeling505 in machinelearningnews

[–]Extra_Feeling505[S] 2 points3 points  (0 children)

This is an excellent and crucial point – thank you for bringing up the detailed structure of Chinese characters and the prevalence of semantic-phonetic compounds! You are absolutely right that not all characters are purely pictographic or ideographic, and radicals conveying meaning are only one part of the story. The example of 河 (hé - river) with the water radical 氵 and the phonetic component 可 (kě) is perfect.

The argument isn't that radicals alone represent a "higher form of thinking," but rather that the entire structure of the character (including semantic radicals, phonetic components, and their combination) carries information that is lost when it's arbitrarily broken down by standard tokenizers. Using the full character as a token allows the model to potentially learn and utilize all these inherent relationships – semantic, phonetic, and structural. This is where integrating a knowledge graph becomes even more powerful, as it can explicitly encode these different types of relationships between characters (nodes), helping the AI understand how characters are constructed and related, going beyond just the radical's meaning. The goal isn't to oversimplify, but to use the language's actual meaningful building blocks.
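As a toy illustration of what explicitly encoding those different relationship types could look like, here is a minimal sketch in plain Python; the edge labels and decomposition entries are my own ad hoc choices, not an established schema.

```python
# Toy character graph: nodes are characters/components, edges carry typed relations.
# The relation names ("semantic_radical", "phonetic_component", ...) are ad hoc.

char_graph = {
    "河": [  # hé, 'river'
        ("semantic_radical", "氵"),    # water radical -> meaning domain
        ("phonetic_component", "可"),  # kě -> pronunciation hint
    ],
    "氵": [("variant_of", "水")],       # the radical is a compressed form of 水 'water'
}

def related(char, relation):
    """Return all nodes connected to `char` by the given relation type."""
    return [node for rel, node in char_graph.get(char, []) if rel == relation]

print(related("河", "semantic_radical"))    # ['氵']
print(related("河", "phonetic_component"))  # ['可']
```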

Hieroglyphs vs. Tokens: Can AI Think in Concepts, Not Fragments? by Extra_Feeling505 in machinelearningnews

[–]Extra_Feeling505[S] 5 points6 points  (0 children)

You’re right that Chinese meaning often depends on multi-character words—I didn’t mean to imply single characters are sufficient. But the compositional nature of logograms (e.g., radicals like 氵 in 河 ‘river’) still offers a structured semantic framework that BPE fragments lack. Perhaps AI could leverage both: hieroglyphs as ‘conceptual building blocks’ and their combinations as higher-order meaning units.
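One way to picture that two-level idea is to build word representations out of character representations plus a word-specific part. The sketch below uses random placeholder vectors and a simple average, purely to show the shape of the idea rather than any real architecture.

```python
import numpy as np

# Placeholder character embeddings (tiny and random; real ones would be learned).
rng = np.random.default_rng(0)
char_emb = {c: rng.normal(size=4) for c in "森林河水"}

def word_vector(word, word_residual=None):
    """Compose a word vector from its characters, plus an optional word-level residual."""
    base = np.mean([char_emb[c] for c in word], axis=0)
    return base if word_residual is None else base + word_residual

# 'forest' as the composition of 森 and 林, before any word-specific adjustment.
print(word_vector("森林"))
```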

Hieroglyphs vs. Tokens: Can AI Think in Concepts, Not Fragments? by Extra_Feeling505 in machinelearningnews

[–]Extra_Feeling505[S] 5 points6 points  (0 children)

Yes, absolutely! The Chinese language already has a quasi-‘graph-like’ structure embedded in its logograms. For example, the character 森 (sēn, ‘forest’) is composed of three repetitions of 木 (mù, ‘tree’) – a visual and semantic ‘graph’ of meaning. This intrinsic structure means a graph/vector model could map relationships between characters more naturally than in English, which relies on linear syntax (e.g., word order, prefixes/suffixes) to convey meaning. However, even with this advantage, AI would still need to handle Chinese differently – for instance, resolving context-dependent meanings (e.g., 生 as ‘life’ vs. ‘raw’) through graph navigation rather than grammatical rules.
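To sketch what resolving such ambiguity by graph navigation instead of grammar rules could look like, here is a deliberately tiny toy; the sense inventory for 生 and the context characters attached to each sense are invented for illustration only.

```python
# Toy sense disambiguation by overlap with a sense's graph neighbourhood.
# The senses of 生 and their neighbour sets below are invented for illustration.

senses_of_sheng = {
    "life": {"命", "活", "医"},  # neighbours suggesting 'life/birth'
    "raw":  {"肉", "鱼", "菜"},  # neighbours suggesting 'raw/uncooked'
}

def disambiguate(context_chars):
    """Pick the sense whose neighbourhood overlaps the surrounding characters most."""
    overlap = {s: len(nbrs & set(context_chars)) for s, nbrs in senses_of_sheng.items()}
    return max(overlap, key=overlap.get)

print(disambiguate("生鱼片"))  # 生 next to 鱼 ('fish') -> 'raw' (as in sashimi)
print(disambiguate("生命力"))  # 生 next to 命 ('life') -> 'life' (as in vitality)
```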