DeepSeek has introduced a novel approach to visual multimodal reasoning that incorporates 'visual primitives' to improve AI's spatial reasoning capabilities. Unlike traditional methods that focus on raising image resolution, DeepSeek's approach treats bounding boxes and points as fundamental units of thought, letting a model 'point' at objects during reasoning. This addresses the 'Reference Gap' in multimodal reasoning, where language alone cannot make sufficiently precise spatial references.

DeepSeek also emphasizes efficiency in image processing: a Compressed Sparse Attention mechanism significantly reduces the number of tokens required, yielding faster inference and lower memory usage, which is crucial for real-time applications such as robotic vision and autonomous driving. Despite these advancements, DeepSeek acknowledges remaining challenges, such as trigger-word dependency and resolution limits, which it flags as areas for future development.
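To make the 'pointing' idea concrete, here is a minimal sketch of how a reasoning trace containing bounding-box references might be parsed into structured spatial references. The `<box>x1,y1,x2,y2</box>` tag syntax, the `BoxRef` type, and the `extract_references` helper are all illustrative assumptions; the summary above does not specify the model's actual output format.

```python
import re
from dataclasses import dataclass

@dataclass
class BoxRef:
    """A bounding-box 'visual primitive': pixel coordinates the model points at."""
    x1: int
    y1: int
    x2: int
    y2: int

# Hypothetical tag format; the real token syntax is an assumption here.
BOX_PATTERN = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")

def extract_references(reasoning: str) -> list[BoxRef]:
    """Pull every box the model 'pointed' at out of a reasoning trace."""
    return [BoxRef(*map(int, m.groups())) for m in BOX_PATTERN.finditer(reasoning)]

trace = ("The mug <box>120,80,260,210</box> sits to the left of "
         "the laptop <box>300,60,640,400</box>.")
refs = extract_references(trace)
# refs[0] → BoxRef(x1=120, y1=80, x2=260, y2=210)
```

The point of such a representation is that each reasoning step can be grounded in explicit coordinates rather than ambiguous phrases like "the object on the left".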