simplified:
"predict the next action"
it's the same as what LLMs do:
"predict the next token"
except we use more focused wording: we tell the model that we specifically want the action the user will take next, then we perform that action.
with this method we can take existing visual language models and turn them into general action models.
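
a rough sketch of that predict-then-perform loop in Python, under stated assumptions: `model` is a hypothetical callable standing in for whatever visual language model is used (it takes a screenshot and a prompt and returns text), `observe` and `perform` are hypothetical hooks for grabbing the screen and executing an action, and the `<kind> <argument>` reply format is made up for illustration. none of these names come from a real library.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str      # hypothetical action type, e.g. "click", "type", "done"
    argument: str  # hypothetical payload, e.g. "120,340" or text to type


# Focused wording: ask for the user's next action, not just the next token.
PROMPT = (
    "Here is a screenshot and the actions taken so far.\n"
    "History: {history}\n"
    "Predict the next action the user will take, as '<kind> <argument>'."
)


def parse_action(reply: str) -> Action:
    """Split a reply like 'click 120,340' into an Action (assumed format)."""
    kind, _, argument = reply.strip().partition(" ")
    return Action(kind=kind.lower(), argument=argument)


def run_episode(
    model: Callable[[bytes, str], str],  # stand-in for the visual language model
    observe: Callable[[], bytes],        # returns the current screenshot
    perform: Callable[[Action], None],   # executes the predicted action
    max_steps: int = 10,
) -> List[Action]:
    """Repeatedly predict the next action and perform it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = observe()
        past = "; ".join(f"{a.kind} {a.argument}" for a in history) or "none"
        action = parse_action(model(screenshot, PROMPT.format(history=past)))
        if action.kind == "done":  # assumed stop signal from the model
            break
        perform(action)
        history.append(action)
    return history
```

the model itself is left untouched here; only the prompt and the perform step around it change, which is the point of the method described above.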