Examples of grasping attempts in the RViz simulation.

Project information

Abstract

A Foundational ROS2 Grasping Pipeline for Modular, Vision-Based Manipulation

GAM is a modular grasping pipeline designed to combine robustness, flexibility, and extensibility within the ROS2 ecosystem. Developed during my Master's thesis at the University of Padova (UniPD) in collaboration with PAL Robotics, the project integrates independent modules for Perception, 3D Reconstruction, and Motion Planning; because these modules are orchestrated through the BehaviorTree library, each component can be replaced or upgraded without disrupting the overall system. The aim was not to build the single most performant grasping solution, but to deliver a foundational, interpretable, and deployable architecture for vision-based manipulation, one that can adapt to evolving technologies and operate in realistic, unstructured environments.

High-level structure of the Behavior Tree orchestrating the pipeline.
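To illustrate how the tree keeps the modules decoupled, here is a minimal sketch using py_trees, a Python behavior-tree library. The actual project is built on BehaviorTree.CPP, so the structure and node names below are illustrative placeholders, not the real tree:

```python
import py_trees


class Stub(py_trees.behaviour.Behaviour):
    """Placeholder leaf standing in for a full pipeline module."""

    def update(self) -> py_trees.common.Status:
        print(f"[{self.name}] running")
        return py_trees.common.Status.SUCCESS


def build_tree() -> py_trees.behaviour.Behaviour:
    # Each child is an independent module: swapping one out (e.g. a new
    # segmentation backend) leaves the rest of the tree untouched.
    root = py_trees.composites.Sequence(name="GraspPipeline", memory=True)
    root.add_children([
        Stub(name="Perception"),          # GroundingDINO + SAM
        Stub(name="Reconstruction"),      # point-cloud processing
        Stub(name="GraspPoseDetection"),  # GPD
        Stub(name="MotionPlanning"),      # MoveIt Task Constructor
    ])
    return root


if __name__ == "__main__":
    build_tree().tick_once()
```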

Perception & Object Segmentation

The Perception module leverages zero-shot segmentation models — integrating GroundingDINO with the Segment Anything Model (SAM) — to detect objects from natural language prompts such as “blue mug” or “water bottle on the right”. This enables operation on both simulated and real-world data, handling variations in shape, size, and texture without the need for dataset-specific retraining.
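A sketch of this step, assuming the public GroundingDINO and segment-anything APIs; the checkpoints, paths, and thresholds below are placeholders, not the project's actual configuration:

```python
import numpy as np
import torch
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

dino = load_model("groundingdino_config.py", "groundingdino_weights.pth")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image_source, image = load_image("scene.jpg")  # (np.ndarray, torch.Tensor)

# 1) GroundingDINO: language prompt -> candidate boxes (cxcywh, normalized)
boxes, logits, phrases = predict(
    model=dino, image=image, caption="blue mug",
    box_threshold=0.35, text_threshold=0.25, device="cpu",
)

# 2) Convert the highest-scoring box to absolute xyxy pixels for SAM
#    (assumes at least one detection was returned)
h, w = image_source.shape[:2]
box = boxes[int(logits.argmax())] * torch.tensor([w, h, w, h])
cx, cy, bw, bh = box.tolist()
box_xyxy = np.array([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2])

# 3) SAM: box prompt -> binary object mask
predictor.set_image(image_source)
masks, scores, _ = predictor.predict(box=box_xyxy, multimask_output=False)
mask = masks[0]  # (H, W) boolean mask of the prompted object
```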

Perception Behavior Tree node logic.

3D Reconstruction

After segmentation, the point cloud is processed to generate a clean, geometrically accurate 3D reconstruction of the target object. Filtering, noise reduction, and surface refinement ensure the output is optimized for the grasp detection stage, even under challenging conditions like partial occlusion or sensor inaccuracies.
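A minimal cleanup pass along these lines can be sketched with Open3D; the library choice and parameter values are illustrative assumptions, not the thesis' exact settings:

```python
import open3d as o3d

pcd = o3d.io.read_point_cloud("segmented_object.ply")

# Downsample to a uniform density (the "granularity" in the figures below)
pcd = pcd.voxel_down_sample(voxel_size=0.05)

# Drop statistical outliers produced by sensor noise and depth artifacts
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

# Estimate surface normals, needed later by the grasp detection stage
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30)
)

o3d.io.write_point_cloud("reconstructed_object.ply", pcd)
```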

Example of a 3D reconstruction of a mug. Left: granularity = 0.05; right: granularity = 0.1.

Grasping Pose Detection

The Grasp Pose Detection (GPD) module was customized to address practical constraints of tabletop manipulation. Initially, grasps were generated from all directions — including from under the table — leading to infeasible plans. By augmenting the input cloud with a portion of the table surface and tuning the approach vector, grasps are now biased toward top/front approaches, reducing collision risks and improving execution success.
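The two customizations can be sketched in plain NumPy: (1) augment the object cloud with a synthetic table patch so grasps are no longer sampled through the table, and (2) reject candidates whose approach vector comes from below. The helper names are hypothetical; the real pipeline applies the equivalent logic through GPD's C++ interface:

```python
import numpy as np


def add_table_patch(cloud: np.ndarray, table_z: float,
                    half_size: float = 0.15, step: float = 0.01) -> np.ndarray:
    """Append a flat grid of table points around the object's footprint."""
    cx, cy = cloud[:, 0].mean(), cloud[:, 1].mean()
    xs = np.arange(cx - half_size, cx + half_size, step)
    ys = np.arange(cy - half_size, cy + half_size, step)
    gx, gy = np.meshgrid(xs, ys)
    patch = np.column_stack([gx.ravel(), gy.ravel(),
                             np.full(gx.size, table_z)])
    return np.vstack([cloud, patch])


def approach_is_feasible(approach: np.ndarray, tol: float = 0.1) -> bool:
    """Reject grasps whose approach vector moves upward (+z), i.e. that
    would come from under the table; keep top and front approaches."""
    return approach[2] <= tol
```

Filtering on the approach vector is a cheap geometric test applied before any motion planning, so infeasible candidates are discarded early.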

Example of a Grasp Pose Detection output. Left: n = 5; middle: n = 15; right: n = 30, where n is the number of grasp poses to detect.

Motion Planning & Execution

Integration with MoveIt Task Constructor (MTC) enables the pipeline to build transparent, modular grasping tasks. MTC decomposes the process into logical stages (pre-grasp, approach, grasp, retreat), each validated independently, allowing for fallback strategies in case of failure. This structure greatly simplifies debugging and increases execution reliability.
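A plain-Python sketch of this staged structure: it mirrors the pre-grasp / approach / grasp / retreat decomposition and the per-candidate fallback, but does not use the MoveIt Task Constructor API itself (all stage bodies are placeholder stubs):

```python
from typing import Callable, Dict, List, Optional

Stage = Callable[[Dict], bool]  # True iff the stage plans successfully


def plan_grasp(candidates: List[Dict], stages: List[Stage]) -> Optional[Dict]:
    """Return the first grasp candidate for which every stage validates."""
    for grasp in candidates:
        # Each stage is checked independently; the first failure moves the
        # planner on to the next grasp candidate (the fallback strategy).
        if all(stage(grasp) for stage in stages):
            return grasp
    return None  # no feasible candidate; report failure upstream


# Placeholder stages: in the real task each one invokes a motion planner.
def pre_grasp(g: Dict) -> bool: return True   # reach the pre-grasp pose
def approach(g: Dict) -> bool: return True    # Cartesian approach
def grasp(g: Dict) -> bool: return True       # close gripper, attach object
def retreat(g: Dict) -> bool: return True     # lift and retreat

best = plan_grasp([{"id": 0}], [pre_grasp, approach, grasp, retreat])
```

Because each stage either validates or fails on its own, a failed plan pinpoints exactly which step broke, which is what makes the MTC decomposition easy to debug.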

TIAGo attempting to grasp a cylinder in simulation (shown in RViz).

Testing & Evaluation

The pipeline was validated on PAL Robotics' TIAGo robots, first in simulation (Gazebo + RViz) and later in real-world experiments. Test objects ranged from simple cylinders and bottles to irregular shapes such as pears and joysticks. Despite the modular design being in its early stages, GAM demonstrated strong generalization to unseen objects and a reliable workflow from perception to execution.

Quantitative results. Left: simulation; right: real-world experiments.

Demo

GIF showing the GAM pipeline in action in simulation. The robot detects a drill, a pear, a mug, and a shampoo bottle.

GAM is a large project and this page covers only the highlights; see my thesis for the full project documentation.