Watch a line cook make a burger during lunch rush, and you’re watching thousands of tiny decisions unfold in real time: which bun to grab, which hand to use, how hard to press, when to catch a slipping tomato, and many more.
We’ve been teaching a physical AI system to make these decisions on its own. Today, our system can pick, place, and stack every component of a complete burger (buns, patty, cheese, lettuce, and tomato) in under a minute. It does so autonomously and consistently, burger after burger, the way commercial kitchens need it, and it was several orders of magnitude faster to train than a traditional modular robotics stack.
Why are burgers surprisingly difficult to assemble?
Making a burger might sound simple until you ask a physical AI system to assemble one.
Burgers contain several ingredients. Each one behaves differently and is challenging in its own way. A burger bun is light, soft, and delicate, and it squishes if you press too hard. A meat patty is firm and heavy, but if you grab it in the wrong spot, it might rip or crumble. Cheese slices might stick together when you try to pick them up. Lettuce is floppy, unpredictable, and irregular in size. Tomato slices are slippery and vary in size and thickness.
To add to this complexity, food ingredients are organic. The same ingredient might look and feel different day to day, hour to hour, and even minute to minute. Burger buns might be more delicate on some days and more crumbly on others. Each patty behaves differently based on its fat and water content, temperature, and how well it’s cooked. Cheese slices become softer as they warm up to room temperature over the course of a shift, while lettuce leaves and tomato slices vary so much that we’ll spare you the details.
As if that weren’t challenging enough, our physical AI system now needs to stack all six ingredients (counting the top and bottom buns separately) on top of one another with precision while keeping up with the fast pace of a real kitchen.
The burger assembly problem is vastly more complex than scooping rice into a meal tray. That’s why we built a new physical AI system to solve this challenge: one that doesn’t just reason about the physical world but interacts with it to get the job done, similar to how a human would.
How do you teach AI to think like a line cook?
Our physical AI system is split into two specialized components, much like different parts of the human brain, to handle perception and action separately.
The first part, our Food Foundation Model (FFM), a vision-language model (VLM), perceives the physical world around it and decides what to do next: which ingredient to pick up and where to place it. The second part, an action policy network, handles the “doing.” It performs the precise hand movements needed to handle each ingredient correctly without fumbling. The FFM runs a little more slowly, while the action policy network runs fast, keeping the system’s arms moving smoothly from one step to the next. Together, they let our physical AI system plan actions like a person and move like one, too.
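To make the division of labor concrete, here’s a minimal sketch of how a two-speed loop like this can be wired together. Every name in it (`ffm`, `policy`, `cameras`, `arms`, and their methods) is an illustrative stand-in, not our production API:

```python
def assemble_burger(ffm, policy, cameras, arms, policy_steps_per_plan=10):
    """Sketch of a two-rate control loop: a slow planner feeds a fast policy.

    All objects and methods here are hypothetical stand-ins for illustration.
    """
    while True:
        # Slow path: the FFM looks at the scene and decides the next sub-task,
        # e.g. "place the cheese slice on the patty".
        images = [cam.capture() for cam in cameras]
        scene_context = ffm.infer(images)
        if scene_context.burger_complete:
            break

        # Fast path: the action policy keeps the arms moving smoothly
        # between the FFM's slower updates.
        for _ in range(policy_steps_per_plan):
            action_chunk = policy.predict(
                scene_context,
                arms.joint_positions(),
                [cam.capture() for cam in cameras],
            )
            for action in action_chunk:  # a chunk is a short burst of motions
                arms.apply(action)
```

The key design choice is the two cadences: the planner can afford to think because the policy never stops moving.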
FFM: Seeing and understanding the physical world
VLMs are the kind of AI model that can look at a photo and describe what’s in it. Three different cameras feed into our FFM, which builds a mental picture of the physical world in front of it: where each ingredient is located, which ingredients have already been assembled, and which one needs to be placed next. This mental picture, or scene context, is what our action policy network turns into arm and hand movements.
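To give a feel for what that scene context might contain, here’s a minimal sketch of it as a data structure. The field names are illustrative guesses on our part, not the FFM’s actual output schema:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class IngredientObservation:
    name: str                               # e.g. "tomato"
    pick_point: Tuple[float, float, float]  # estimated grasp location (x, y, z)
    confidence: float                       # how sure the model is about this item

@dataclass
class SceneContext:
    """What the FFM hands to the action policy after each perception pass."""
    ingredients: List[IngredientObservation]  # everything visible in the trays
    stacked_so_far: List[str]                 # e.g. ["bottom_bun", "patty"]
    next_to_place: str                        # e.g. "cheese"
    burger_complete: bool                     # True once the top bun is placed
```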
Action policy network: Making moves
The action policy network combines the FFM’s scene context with our system’s current arm positions and camera views. It then outputs a sequence of motions, or action chunks: small bursts of movement that the system’s arms carry out immediately, one after the other. We train our system so that these movements are:
- Smooth: with minimal jerk that could knock a tomato out of place, squash a bun, or disturb what has already been stacked (we sketch a simple jerk metric after this list).
- Precise: with sub-centimeter accuracy (about the thickness of a slice of cheese), even when no two ingredients have the same shape or size.
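As a rough illustration of the smoothness criterion: jerk is the third time-derivative of position, so a recorded trajectory can be scored with a simple finite-difference estimate. This is a diagnostic sketch, not our exact training objective:

```python
import numpy as np

def mean_squared_jerk(positions: np.ndarray, dt: float) -> float:
    """Estimate the mean squared jerk of a joint trajectory.

    positions: array of shape (T, num_joints), sampled every `dt` seconds.
    Jerk is the third derivative of position; lower values mean smoother
    motion that is less likely to disturb an already-stacked burger.
    """
    jerk = np.diff(positions, n=3, axis=0) / dt**3  # finite-difference estimate
    return float(np.mean(jerk**2))
```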
Learning by doing
When we trained our two-part AI model to assemble a burger, we never actually sat down and typed out step-by-step instructions for it. Instead, skilled operators guided the system’s arms through the full assembly process by hand again and again, and our AI model learned from those demonstrations. This process, called “imitation learning” or “learning from demonstration,” is how a new line cook learns, too: by shadowing a more experienced person until the movements become natural.
This is how our system picked up on the small details that are nearly impossible to spell out in a rule book: how to grab a soft tomato slice more gently than a firmer cheese slice, or how to nudge a burger bun back into place when it lands a little off-center. We recorded these demonstrations across a wide range of variations, including different setups, ingredient sizes, and tray layouts. As a result, our physical AI system learned a real skill, not a single memorized routine.
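In machine-learning terms, this is behavior cloning: the policy learns to reproduce the action chunks our operators demonstrated. One training step might look like the sketch below, assuming a PyTorch-style policy network (all names and shapes are illustrative):

```python
import torch.nn.functional as F

def imitation_learning_step(policy, optimizer, observations, demo_actions):
    """One behavior-cloning update on a batch of operator demonstrations.

    observations:  whatever the policy conditions on (camera images, joint
                   positions, scene context), batched together.
    demo_actions:  the action chunks the operators performed next, as a
                   tensor of shape (batch, chunk_length, num_joints).
    """
    predicted_actions = policy(observations)            # predict the whole chunk
    loss = F.mse_loss(predicted_actions, demo_actions)  # match the human motion

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeated over many such batches, the policy gradually absorbs the operators’ touch, including the gentle grasps and small corrective nudges described above.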
After 8 days of training, with just over 26 hours of demonstration data provided to a pre-trained AI model, our system has achieved a 75% full-task success rate and an 81% sub-task success rate. This approach to AI model training is not only much faster than building a traditional modular robotics stack; it also generalizes better, so our system can adjust to variations beyond the exact task we taught it, and it will let us train the system on completely new tasks faster.
Seeing results
Put it all together—the FFM, the action policy network, and the imitation learning—and our physical AI system assembles a full burger in under a minute. It does so:
- Reliably: Our system consistently assembles burgers despite natural variation in food ingredients and regardless of how each ingredient is positioned in its container during picking.
- Repeatedly: Every burger is structurally sound and visually consistent.
- Robustly: Our system handles a range of configurations without reprogramming, such as varying bun sizes and intra-class ingredient swaps (red vs. green lettuce, for example), and it can even self-correct when a placement lands slightly off.
What’s next
Burger assembly is just one example of a much wider range of tasks our Food Foundation Model can handle. While our physical AI system can quickly and reliably assemble burgers today, it will soon be able to assemble other sandwich types, tacos, and burritos, and eventually a much wider range of foods. These capabilities are only possible thanks to how AI models learn: by watching, practicing, and getting a little better with every attempt.
For readers interested in the technical story behind this topic, we’re preparing a deep dive into how our AI models handle the delay between seeing and acting. Stay tuned.
Curious about bringing physical AI into your industrial or commercial kitchen, or just want to follow along as we keep building? Get in touch with our team!
Contributors
Inkyu Sa, Xiaoyi (Sherry) Chen, Somudro Gupta, Kartheek Chandu, Luis Rayas, Tina Kao, Nick LaBounty, Konstantin Stulov, Krishna Teja


