Xiaomi has released ControlFoley, an open-source framework for generating sound effects for video. Traditional AI dubbing models infer sounds only from the visuals; ControlFoley also accepts a text description or a reference audio clip alongside the video, giving creators precise control over the style of the generated audio. A knock, for example, can be rendered as a "metal strike" while staying synchronized with the on-screen action. The framework uses a spatiotemporal audiovisual encoder and a "time-timbre decoupling" strategy, and it achieves state-of-the-art performance on standard video dubbing benchmarks. It comes close to commercial systems such as Kling-Foley on semantic alignment and synchronization metrics, though it slightly underperforms on certain KL divergence metrics. The technical report, code, and demo are publicly available.
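
For readers unfamiliar with the KL divergence metrics mentioned here: in audio generation benchmarks, such metrics typically compare the class probabilities that a pretrained audio classifier assigns to a generated clip against those it assigns to the reference clip, with lower values indicating a closer match. The sketch below is a generic illustration of that calculation, not ControlFoley's evaluation code. The function name, the direction of the divergence, and the probability values are all hypothetical choices made for the example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) in nats for two discrete distributions.

    p and q are sequences of probabilities over the same classes.
    eps guards against division by zero and log(0).
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same classes")
    # Terms with p_i == 0 contribute nothing to the sum, so skip them.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical classifier outputs over three sound classes
# (illustrative values only, not taken from any benchmark).
reference_probs = [0.7, 0.2, 0.1]   # reference audio
generated_probs = [0.5, 0.3, 0.2]   # generated audio

score = kl_divergence(reference_probs, generated_probs)
print(f"KL divergence: {score:.4f}")
```

A score of zero would mean the classifier sees the two clips identically; a model that "slightly underperforms" on this kind of metric produces audio whose predicted class distribution drifts somewhat further from the reference.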