Modern vision-language-action models (VLAs) predict actions in chunks, but each new chunk starts from a stale observation, causing visible shaking at chunk boundaries. We traced the problem to three compounding delay sources, measured each via system identification, and shifted the training target to compensate.
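
As a rough illustration of what shifting the training target can look like, the sketch below sums a set of measured delays, converts the total into control steps, and pairs each observation at step t with the action chunk that begins at step t + shift rather than at t. It is a minimal sketch under stated assumptions, not the exact procedure used here: the function names `delay_steps` and `shifted_chunk_targets`, the example delay values, the 50 ms control period, and the round-to-nearest rule are all hypothetical.

```python
import math

import numpy as np


def delay_steps(delays_s, control_dt_s):
    """Convert a list of delays (seconds) into a whole number of control steps.

    Assumption: the delays add up, and the total is rounded to the nearest step.
    """
    total_s = sum(delays_s)
    return int(math.floor(total_s / control_dt_s + 0.5))


def shifted_chunk_targets(actions, chunk_len, shift):
    """Build training targets whose chunks start `shift` steps after the observation.

    actions: array of shape (T,) or (T, action_dim), one action per control step.
    Returns an array of shape (N, chunk_len, ...) where row t holds
    actions[t + shift : t + shift + chunk_len], to be paired with observation t.
    """
    actions = np.asarray(actions)
    n_valid = len(actions) - shift - chunk_len + 1
    if n_valid <= 0:
        raise ValueError("trajectory too short for this chunk length and shift")
    starts = np.arange(n_valid) + shift                       # first action index per chunk
    idx = starts[:, None] + np.arange(chunk_len)[None, :]     # (N, chunk_len) index grid
    return actions[idx]


# Hypothetical numbers: three delays measured offline, 50 ms control period.
example_delays_s = [0.03, 0.05, 0.02]
shift = delay_steps(example_delays_s, control_dt_s=0.05)

trajectory = np.arange(10)  # stand-in for a recorded action sequence
targets = shifted_chunk_targets(trajectory, chunk_len=3, shift=shift)
```

With `shift = 0` this reduces to the usual unshifted chunk targets, which makes it easy to compare the two training setups on the same data.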


