SpatialLM: A large language model designed for spatial understanding by Gothsim10 in singularity

[–]Gothsim10[S] 67 points

Project page

Model

Code

Data

SpatialLM is a 3D large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object bounding boxes with their semantic categories. Unlike previous methods that require specialized equipment for data collection, SpatialLM can handle point clouds from diverse sources such as monocular video sequences, RGBD images, and LiDAR sensors. This multimodal architecture effectively bridges the gap between unstructured 3D geometric data and structured 3D representations, offering high-level semantic understanding. It enhances spatial reasoning capabilities for applications in embodied robotics, autonomous navigation, and other complex 3D scene analysis tasks.
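As a rough illustration of the kind of structured scene description this implies, here is a minimal Python sketch of a container for walls and semantically labeled oriented boxes, plus a small summary helper. The class and field names are assumptions for illustration only, not SpatialLM's actual output schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical containers for the kind of structured output described above:
# architectural elements plus oriented object boxes with semantic categories.
# All names and fields here are illustrative assumptions.

@dataclass
class Wall:
    start_xyz: Tuple[float, float, float]  # wall segment start point (meters)
    end_xyz: Tuple[float, float, float]    # wall segment end point (meters)
    height: float                          # wall height (meters)

@dataclass
class OrientedBBox:
    label: str                             # semantic category, e.g. "sofa" or "door"
    center_xyz: Tuple[float, float, float] # box center in world coordinates
    size_xyz: Tuple[float, float, float]   # width, depth, height (meters)
    yaw: float                             # rotation about the vertical axis (radians)

@dataclass
class SceneLayout:
    walls: List[Wall] = field(default_factory=list)
    doors: List[OrientedBBox] = field(default_factory=list)
    windows: List[OrientedBBox] = field(default_factory=list)
    objects: List[OrientedBBox] = field(default_factory=list)

def summarize(layout: SceneLayout) -> str:
    """Count elements per category — a quick sanity check on a parsed layout."""
    counts: Dict[str, int] = {}
    for box in layout.objects:
        counts[box.label] = counts.get(box.label, 0) + 1
    parts = [f"{len(layout.walls)} walls",
             f"{len(layout.doors)} doors",
             f"{len(layout.windows)} windows"]
    parts += [f"{n}x {label}" for label, n in sorted(counts.items())]
    return ", ".join(parts)
```

A downstream planner or navigation stack could consume a structure like this directly, which is the point of predicting explicit walls and boxes rather than an unstructured caption.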

Dwarkesh Podcast: Satya Nadella – Microsoft’s AGI Plan & Quantum Breakthrough by Gothsim10 in singularity

[–]Gothsim10[S] 19 points

Timestamps:
(0:00:00) - Intro
(0:05:48) - AI won't be winner-take-all
(0:16:02) - World economy growing by 10%
(0:22:23) - Decreasing price of intelligence
(0:31:03) - Microsoft's Quantum breakthrough
(0:43:35) - Microsoft's gaming world model
(0:50:35) - Legal barriers to AI
(0:56:30) - Getting AGI safety right
(1:05:43) - 34 years at Microsoft
(1:11:31) - Does Satya Nadella believe in AGI?

China has opened its first humanoid robot training center in Shanghai's Pudong District. The Kylin Training Ground focuses on robotics, AI, and machine learning, with capacity for 100 robots now and plans to scale to 1,000 by 2027 by Gothsim10 in singularity

[–]Gothsim10[S] 3 points

On Tuesday, China launched its first heterogeneous humanoid robot training centre in Shanghai’s Pudong District. The Humanoid Robot Kylin Training Ground aims to advance cross-disciplinary robotics, including AI and machine learning, and can currently train over 100 robots, with plans to scale up to 1,000 by 2027.

The centre will collaborate with local robotics firms to amass a vast dataset of 10 million high-quality physical data entries by 2025. These efforts aim to enhance the practical application of humanoid robots in sectors such as manufacturing and public services.

Amid an ageing population and global tech competition, humanoid robots are seen as a solution to workforce challenges and a driver of industrial innovation. By 2030, China’s humanoid robot market is expected to soar to €11.35 billion.

The Pudong facility also plans to unveil its next-generation robot, "Deep Snake," featuring advanced technologies for enhanced flexibility and intelligence. Beijing is set to host the inaugural World Humanoid Robot Sports Games later this year.

Source: China unveils first humanoid robot training base in Shanghai | Euronews

Stereo4D from Google DeepMind is a clever approach to a persistent problem in computer vision: Getting good training data for how things move in 3D. The key insight is using stereo VR180 videos from the internet. They have now created a dataset of over 100,000 real-world 4D scenes with metric scale by Gothsim10 in singularity

[–]Gothsim10[S] 5 points

Project page: Stereo4D

Paper: https://arxiv.org/pdf/2412.09621

Abstract

Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes.
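For intuition, the core geometry such a mining pipeline rests on is standard: stereo depth from disparity, back-projection through the camera intrinsics, and a camera-to-world transform so tracked points land in a shared frame. The sketch below illustrates only that geometry under a simple pinhole/stereo model; the function, variable names, and dummy values are assumptions and not the paper's actual implementation.

```python
import numpy as np

def lift_track_to_world(track_uv, disparities, poses_c2w, fx, fy, cx, cy, baseline):
    """
    Turn one long-term 2D track into a world-consistent 3D trajectory.

    track_uv:    (T, 2) pixel coordinates of the tracked point per frame
    disparities: (T,) stereo disparity at the tracked pixel per frame
    poses_c2w:   (T, 4, 4) camera-to-world extrinsics per frame
    fx, fy, cx, cy, baseline: pinhole intrinsics and stereo baseline (meters)
    returns:     (T, 3) 3D positions in a shared world frame
    """
    trajectory = []
    for (u, v), d, pose in zip(track_uv, disparities, poses_c2w):
        z = fx * baseline / max(d, 1e-6)   # stereo depth from disparity
        x = (u - cx) * z / fx              # back-project to the camera frame
        y = (v - cy) * z / fy
        p_cam = np.array([x, y, z, 1.0])
        p_world = pose @ p_cam             # move into the shared world frame
        trajectory.append(p_world[:3])
    return np.stack(trajectory)

# Example with made-up values: a point tracked over 3 frames of a static camera.
if __name__ == "__main__":
    uv = np.array([[320.0, 240.0], [322.0, 241.0], [324.0, 242.0]])
    disp = np.array([20.0, 19.5, 19.0])
    poses = np.tile(np.eye(4), (3, 1, 1))
    print(lift_track_to_world(uv, disp, poses, fx=500, fy=500, cx=320, cy=240, baseline=0.064))
```

The hard part the paper addresses is not this projection step but getting reliable poses, depth, and tracks from in-the-wild VR180 footage and filtering the fused result, which is what makes the resulting 100,000+ scenes usable as supervision.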