1. Executive Summary
The global machine vision market for robotics is projected to reach $21.4 billion by 2028, growing at a compound annual growth rate (CAGR) of 7.6%. In the APAC region, growth is notably faster at 9.8% CAGR, fueled by accelerating manufacturing automation in Vietnam, Thailand, China, and South Korea. Computer vision has evolved from a supplementary sensor modality into the primary perceptual backbone of modern robotic systems, enabling capabilities that were considered research-only problems just five years ago.
The convergence of three technology waves is reshaping what robots can see and do. First, 3D cameras have become affordable: entry-level depth sensors now cost under $500, while industrial-grade structured light systems deliver sub-millimeter accuracy at a fraction of their historical price. Second, real-time deep learning inference on edge AI accelerators (NVIDIA Jetson AGX Orin delivers 275 TOPS) allows sophisticated neural networks to run directly on the robot without cloud dependency. Third, foundation models like Meta's Segment Anything Model (SAM) and open-vocabulary detectors such as Grounding DINO enable robots to recognize and manipulate objects they have never been explicitly trained on, collapsing the deployment timeline from months of data collection to hours of prompt engineering.
This technical guide provides a complete reference for integrating computer vision into robotic systems. We cover the full pipeline from camera selection and calibration through 3D perception, object detection, visual servoing control loops, and production deployment on edge hardware. Every section includes practical code examples, comparison tables, and architecture recommendations drawn from our deployment experience across 30+ industrial vision systems in APAC manufacturing and logistics facilities.
2. Camera Technologies
2.1 2D Cameras
Two-dimensional cameras remain the workhorse of industrial machine vision, providing high-resolution texture and color information essential for inspection, barcode reading, and feature-based detection. The two primary architectures serve different application profiles.
Area Scan Cameras: Capture a complete 2D frame in a single exposure using a rectangular sensor array. Resolutions range from VGA (0.3 MP) to 151 MP in current industrial models from Basler, FLIR (Teledyne), and Allied Vision. Global shutter sensors (e.g., Sony IMX sensor family) freeze motion without rolling shutter artifacts, making them essential for inspecting fast-moving objects on conveyor lines. For robotics pick-and-place, 5-12 MP area scan cameras with GigE Vision or USB3 Vision interfaces provide the optimal balance of resolution, field of view, and frame rate. A 12 MP camera with a 16mm lens at 600mm working distance delivers approximately 0.15mm/pixel resolution, sufficient for most grasping applications.
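The working-distance arithmetic behind that 0.15mm/pixel figure can be sketched with the thin-lens approximation. The pixel pitch below (3.45 microns, common in Sony IMX industrial sensors) is an assumption for illustration, not a value stated above:

```python
# Object-space resolution of one pixel (ground sample distance) under
# the pinhole/thin-lens approximation: GSD = pitch * distance / focal.
# Assumed pixel pitch: 3.45 um (0.00345 mm), typical of Sony IMX sensors.
def ground_sample_distance(pixel_pitch_mm, working_distance_mm, focal_length_mm):
    """Approximate size of one pixel projected onto the object plane."""
    return pixel_pitch_mm * working_distance_mm / focal_length_mm

gsd = ground_sample_distance(0.00345, 600.0, 16.0)
print(round(gsd, 3))  # ~0.129 mm/pixel, consistent with the ~0.15 mm/pixel cited
```

The exact figure depends on the sensor's actual pixel pitch, which is why quoted values for the same lens and distance vary slightly between camera models.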
Line Scan Cameras: Use a single row of pixels (or trilinear RGB rows) to build images line by line as the object moves beneath the camera. Resolutions reach 16K pixels per line (16,384 pixels), delivering continuous inspection of materials without the field-of-view limitations of area scan sensors. Line scan excels at web inspection (textiles, paper, film), printed circuit board inspection, and any scenario with continuous linear motion. DALSA (Teledyne), Basler racer, and e2v line scan cameras dominate this segment. Triggering requires an encoder signal synchronized to material transport speed, making integration more complex than area scan but yielding superior results for high-speed continuous processes.
2.2 3D Cameras
Depth perception is the gateway to robotic manipulation in unstructured environments. Three principal 3D sensing technologies compete for different application niches.
Structured Light: Projects a known pattern (typically a sequence of coded stripes or speckle patterns) onto the scene and triangulates depth from the observed pattern deformation. Delivers the highest accuracy among 3D technologies, with leading systems (Photoneo PhoXi, Ensenso, Zivid) achieving point cloud accuracy of 0.05-0.3mm. Structured light excels at bin picking, precision assembly, and quality measurement. The primary limitation is sensitivity to ambient infrared light, which can interfere with the projected pattern in outdoor or brightly lit environments. Acquisition time ranges from 0.1s to 2s depending on exposure settings, making it best suited for quasi-static scenes.
Time-of-Flight (ToF): Measures depth by calculating the round-trip time of emitted infrared light pulses. Modern ToF sensors (Basler blaze, Lucid Helios2+, ifm O3D) deliver real-time depth maps at 30-60 fps with accuracy of 5-15mm. ToF cameras operate well in varied lighting conditions and provide consistent frame rates regardless of scene complexity. Their lower accuracy compared to structured light limits precision applications but makes them ideal for navigation, obstacle avoidance, volumetric measurement, and coarse bin picking where cycle time matters more than sub-millimeter precision.
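The round-trip principle also explains why ToF accuracy is bounded by timing precision. A minimal sketch of the arithmetic (pulse-based ToF; many commercial sensors instead measure phase shift of a modulated signal, but the depth-per-time relation is the same):

```python
C = 299_792_458.0  # speed of light in m/s

def tof_depth_m(round_trip_s):
    """Depth from the round-trip time of an emitted light pulse:
    the pulse travels to the surface and back, so depth = c * t / 2."""
    return C * round_trip_s / 2.0

def timing_for_depth_resolution_s(depth_step_m):
    """Round-trip timing precision needed to resolve a given depth step."""
    return 2.0 * depth_step_m / C

print(round(tof_depth_m(6.67e-9), 3))        # a ~6.67 ns round trip is ~1 m of depth
print(timing_for_depth_resolution_s(0.005))  # resolving 5 mm requires ~33 ps timing
```

The picosecond-scale timing needed for millimeter resolution is why ToF sits in the 5-15mm accuracy band rather than competing with triangulation-based methods.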
Stereo Vision: Computes depth by triangulating corresponding features across two or more calibrated cameras. Intel RealSense D400 series and Stereolabs ZED 2i are popular embedded stereo systems. Stereo vision works outdoors, produces dense depth maps, and scales to long range (10m+). Active stereo systems project an IR texture pattern to assist correspondence matching on textureless surfaces, significantly improving reliability for indoor robotics. Accuracy depends on baseline distance and resolution, typically 1-5% of range, placing it between ToF and structured light.
2.3 Specialized Sensors
Event Cameras (Dynamic Vision Sensors): Neuromorphic sensors that independently report per-pixel brightness changes asynchronously, with microsecond temporal resolution. Prophesee and iniVation DVS sensors deliver 120dB dynamic range (versus 60dB for conventional cameras) and zero motion blur. In robotics, event cameras enable high-speed tracking of fast-moving objects, visual servoing at kilohertz rates, and reliable perception in extreme lighting conditions. Research demonstrations show event-camera-based grasping of objects thrown at the robot, with reaction times under 10ms. Adoption is accelerating as ROS2 drivers and deep learning frameworks mature.
Thermal Cameras (LWIR): Long-wave infrared cameras (8-14 micron wavelength) from FLIR, Seek Thermal, and InfiRay detect thermal radiation independent of visible lighting. Critical applications in robotics include weld seam tracking on hot workpieces, food safety inspection, predictive maintenance (detecting overheating bearings or motors), and human detection for safety systems in collaborative robot cells. Thermal cameras are also invaluable in agricultural robotics for crop stress monitoring and fruit ripeness detection.
| Technology | Accuracy | Range | Speed | Ambient Light | Cost Range | Best For |
|---|---|---|---|---|---|---|
| Structured Light | 0.05-0.3mm | 0.2-2.0m | 0.1-2s | Indoor only | $3K-$15K | Bin picking, precision assembly |
| Time-of-Flight | 5-15mm | 0.1-6.0m | 30-60 fps | Good | $1K-$5K | Navigation, volume measurement |
| Active Stereo | 1-5% of range | 0.3-10m | 30-90 fps | Good | $300-$2K | Mobile robots, SLAM |
| Event Camera | N/A (temporal) | 0.1-10m | 1M events/s | 120dB HDR | $2K-$8K | High-speed tracking |
| Thermal (LWIR) | N/A (thermal) | 0.5-50m+ | 30-60 fps | Independent | $500-$10K | Inspection, safety |
3. 3D Perception & Point Cloud Processing
3.1 Point Cloud Fundamentals
A point cloud is an unstructured set of 3D coordinates, optionally augmented with color (XYZRGB), normals, or intensity values. Raw point clouds from depth cameras contain hundreds of thousands to millions of points and must be processed through a pipeline of filtering, segmentation, and feature extraction before they are useful for robotic manipulation decisions. Two open-source libraries dominate point cloud processing in robotics.
Point Cloud Library (PCL): The original C++ point cloud processing toolkit, deeply integrated with ROS/ROS2 through the pcl_ros bridge. PCL provides mature implementations of voxel grid downsampling, statistical outlier removal, RANSAC plane fitting, Euclidean clustering, ICP registration, and surface reconstruction. While development has slowed, PCL remains the standard for production ROS2 deployments due to its C++ performance and extensive ROS message support.
Open3D: A modern Python/C++ library from Intel with GPU-accelerated processing, superior visualization, and cleaner APIs than PCL. Open3D excels at reconstruction tasks (TSDF fusion, Poisson surface reconstruction), provides native tensor-based operations for deep learning integration, and supports CUDA-accelerated ICP and RANSAC. For rapid prototyping and research, Open3D is the preferred choice, while PCL remains stronger for hard real-time ROS2 applications.
3.2 Standard Processing Pipeline
A typical robotic perception pipeline processes raw point clouds through the following stages: passthrough filtering (crop to workspace volume), downsampling (voxel grid to reduce density), outlier removal (statistical or radius-based), plane segmentation (RANSAC to remove table/floor), Euclidean clustering (separate individual objects), and feature extraction (centroid, bounding box, surface normals for grasp planning).
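The downsampling stage of that pipeline can be sketched in a few lines. This is a pure-Python illustration of what PCL's VoxelGrid filter and Open3D's `voxel_down_sample` do internally (averaging all points that fall in each voxel); production code would use those library implementations:

```python
from collections import defaultdict

def voxel_downsample(points, voxel_size):
    """Voxel-grid downsampling sketch: bucket points by voxel index,
    then replace each bucket with its centroid. `points` is a list of
    (x, y, z) tuples."""
    bins = defaultdict(list)
    for p in points:
        key = tuple(int(c // voxel_size) for c in p)  # integer voxel index
        bins[key].append(p)
    # Centroid of each occupied voxel
    return [tuple(sum(axis) / len(pts) for axis in zip(*pts))
            for pts in bins.values()]

cloud = [(0.01, 0.0, 0.0), (0.02, 0.01, 0.0), (1.0, 1.0, 1.0)]
print(len(voxel_downsample(cloud, 0.1)))  # 3 points collapse into 2 voxels
```

The same bucketing idea, with occupancy counts instead of centroids, underlies the radius-based outlier removal stage as well.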
3.3 Surface Reconstruction
For applications requiring solid geometry rather than point samples, surface reconstruction converts point clouds into triangular meshes. Poisson Surface Reconstruction produces watertight meshes suitable for volume calculation, CAD comparison, and physics simulation. The algorithm requires oriented normals and works best on uniformly sampled surfaces. TSDF (Truncated Signed Distance Function) Fusion integrates multiple depth frames into a volumetric representation, producing dense and accurate reconstructions ideal for multi-view scanning setups. Open3D implements GPU-accelerated TSDF with ScalableTSDFVolume, supporting real-time reconstruction from RGBD streams.
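The per-voxel update at the heart of TSDF fusion is a weight-normalized running average of truncated signed distances. A single-voxel sketch (the truncation distance of 5cm here is illustrative; real systems tune it to sensor noise):

```python
def tsdf_update(tsdf, weight, sdf_obs, w_obs=1.0, trunc=0.05):
    """Fuse one depth observation into a voxel: clamp the observed
    signed distance to [-trunc, +trunc], then blend it into the stored
    value via a weighted running average. This is the core operation
    TSDF fusion repeats for every voxel of every incoming frame."""
    d = max(-trunc, min(trunc, sdf_obs))
    new_weight = weight + w_obs
    new_tsdf = (tsdf * weight + d * w_obs) / new_weight
    return new_tsdf, new_weight

# Start from an empty voxel and fuse three noisy observations of a
# surface sitting ~1 cm in front of it: noise averages out.
v, w = 0.0, 0.0
for obs in (0.012, 0.009, 0.010):
    v, w = tsdf_update(v, w, obs)
print(round(v, 4), w)  # converges toward the ~0.0103 m mean
```

Averaging in distance space is what makes the fused surface smoother than any single depth frame; the mesh is then extracted from the zero crossing of the fused field.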
3.4 6-DOF Pose Estimation
Determining the full six-degree-of-freedom pose (X, Y, Z, roll, pitch, yaw) of known objects is essential for precision pick-and-place and assembly tasks. Classical approaches use feature-based matching: extract 3D features (FPFH, SHOT) from the scene point cloud, match against a CAD model template, and refine with ICP. Modern deep learning methods trained on synthetic data (e.g., DenseFusion, PVN3D, FoundationPose) directly regress 6-DOF poses from RGBD input, achieving state-of-the-art accuracy on benchmarks like BOP Challenge. NVIDIA's FoundationPose (2024) introduced a foundation model approach that generalizes to novel objects using only a CAD model or a small set of reference images, eliminating per-object training entirely.
For high-accuracy applications with known CAD models and controlled lighting (e.g., automotive assembly), classical ICP-based methods with structured light cameras deliver repeatable sub-millimeter accuracy. For high-variety applications with unknown or changing objects (e.g., e-commerce fulfillment), deep learning methods like FoundationPose provide superior generalization with 2-5mm accuracy sufficient for grasping. Hybrid pipelines that use deep learning for coarse detection and ICP for fine refinement combine the best of both approaches.
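The fine-refinement step in these hybrid pipelines reduces to repeatedly solving a closed-form rigid alignment between matched points. The sketch below shows that closed-form step (the Kabsch/Procrustes solution via SVD, which ICP runs after each nearest-neighbor matching round), using synthetic correspondences rather than a real scan:

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst for
    known point correspondences: the inner solve of each ICP iteration."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1 so R is a proper rotation
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = c_dst - R @ c_src
    return R, t

# Recover a known 90-degree yaw plus a small translation
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
R_true = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
dst = src @ R_true.T + np.array([0.1, 0.2, 0.3])
R, t = kabsch(src, dst)
print(np.allclose(R, R_true), np.allclose(t, [0.1, 0.2, 0.3]))  # True True
```

Full ICP alternates this solve with re-matching correspondences, which is why a good coarse pose from the deep learning stage matters: it determines whether the first matching round is mostly correct.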
4. Object Detection & Recognition
4.1 YOLO Family for Real-Time Detection
The YOLO (You Only Look Once) architecture family remains the dominant choice for real-time object detection in robotics due to its single-pass inference speed and excellent accuracy-latency tradeoff. The progression from YOLOv5 through YOLOv8 to the current YOLOv11 has delivered consistent improvements in both accuracy and inference efficiency.
YOLOv8 (Ultralytics): The current production standard, offering five model scales (nano, small, medium, large, extra-large) that span the full spectrum from embedded edge deployment (YOLOv8n at 3.2M parameters, 1.8ms on Jetson Orin) to maximum accuracy (YOLOv8x at 68.2M parameters). YOLOv8 natively supports detection, segmentation, pose estimation, and oriented bounding boxes (OBB) from a unified training framework. The Ultralytics Python API provides single-line training, validation, and export to ONNX, TensorRT, CoreML, and OpenVINO formats.
YOLOv11: The latest release introduces improved C3K2 backbone blocks and attention mechanisms that push accuracy 2-3% higher on COCO while maintaining comparable inference speed. For robotics deployments where retraining is feasible, YOLOv11 is now the recommended starting point.
4.2 Detectron2 for Instance Segmentation
Meta's Detectron2 framework provides state-of-the-art instance segmentation using Mask R-CNN, Cascade R-CNN, and PointRend architectures. While slower than YOLO (typically 5-15 fps on desktop GPU), Detectron2 produces pixel-precise segmentation masks that are essential for deformable object manipulation, suction cup grasp planning (identifying flat surfaces within the mask), and measuring object dimensions from segmented regions. Detectron2's model zoo includes pre-trained weights on COCO (80 classes), LVIS (1203 classes), and Cityscapes, providing strong transfer learning baselines for custom robotic vision datasets.
4.3 Segment Anything Model (SAM)
Meta's SAM and its successor SAM 2 represent a paradigm shift toward foundation models for visual perception. SAM segments any object in an image given a point prompt, bounding box, or coarse mask, without object-specific training. For robotics, this capability is transformative: a robot can segment novel objects it has never seen during training by providing a rough location (from YOLO detection or user click) and receiving a pixel-precise mask. SAM 2 extends this to video with real-time temporal tracking, enabling consistent object segmentation across frames as the robot moves. The primary deployment challenge is computational cost; SAM's ViT-H backbone requires 2.5GB of GPU memory and runs at 2-4 fps on Jetson Orin. Distilled variants (MobileSAM, FastSAM, EfficientSAM) reduce this to near-real-time on edge hardware.
4.4 Open-Vocabulary and Foundation Models
The frontier of robotic perception has moved beyond fixed-class detectors to open-vocabulary models that accept natural language descriptions of target objects. Grounding DINO combines a DINO-based detector with grounded language understanding, allowing queries like "the red screw on the left side of the PCB" to produce bounding box detections without any task-specific training. OWLv2 (Google) provides open-world localization with text and image prompts. For robotic manipulation, these models are combined with SAM in a detect-then-segment pipeline: Grounding DINO localizes the object from a language prompt, and SAM produces the precise segmentation mask for grasp planning.
| Model | Task | Speed (Jetson Orin) | Accuracy (COCO) | Custom Training | Best For |
|---|---|---|---|---|---|
| YOLOv8-nano | Detection | 1.8ms / 555 fps | 37.3 mAP | Easy (Ultralytics) | Edge real-time |
| YOLOv8-large | Detection | 12ms / 83 fps | 52.9 mAP | Easy (Ultralytics) | Accuracy-critical |
| YOLOv8-seg | Segmentation | 15ms / 66 fps | 44.6 mask mAP | Easy (Ultralytics) | Grasp planning |
| Detectron2 Mask R-CNN | Segmentation | 80ms / 12 fps | 46.3 mask mAP | Moderate | Precision masks |
| SAM (ViT-H) | Segmentation | 250ms / 4 fps | N/A (zero-shot) | None needed | Novel objects |
| Grounding DINO | Open-vocab Det. | 180ms / 5 fps | 52.5 mAP (zero-shot) | Optional fine-tune | Language-guided |
5. Visual Servoing
5.1 Fundamentals
Visual servoing (VS) is a control technique that uses real-time visual feedback to guide robot motion, closing the loop between perception and actuation. Unlike open-loop pick-and-place (detect, plan, execute), visual servoing continuously adjusts the robot's trajectory based on what the camera currently observes, enabling the robot to correct for calibration errors, object drift, and environmental perturbations. This makes VS indispensable for tasks requiring sub-millimeter precision, dynamic target tracking, or operating with imprecise kinematic models.
5.2 Image-Based Visual Servoing (IBVS)
IBVS operates entirely in the 2D image plane, computing velocity commands from the error between current and desired image features (typically point coordinates, line parameters, or image moments). The control law minimizes the feature error e = s - s* using the image Jacobian (interaction matrix) L that maps camera velocities to feature motion. The velocity command is v = -lambda * L_pinv * e, where lambda is the control gain.
Advantages of IBVS: no 3D model required, inherently robust to calibration errors since control happens in image space, and reliable local convergence for small displacements. Disadvantages: the camera trajectory in 3D space is not predictable (it may produce unintuitive Cartesian paths), singularity risks when the image Jacobian loses rank, and difficulty handling large rotations (particularly 180-degree rotations, where features leave the field of view).
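The IBVS law above can be sketched concretely for point features. The interaction matrix below is the standard one for a normalized image point at depth Z; stacking one per feature and applying v = -lambda * L_pinv * e gives the commanded camera twist. Depths here are assumed known (in practice they come from the depth camera or are approximated):

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Interaction matrix of one normalized image point (x, y) at
    depth Z: maps the 6-DOF camera twist [vx vy vz wx wy wz] to the
    point's image-plane velocity."""
    return np.array([
        [-1/Z,    0, x/Z,     x*y, -(1 + x*x),  y],
        [   0, -1/Z, y/Z, 1 + y*y,       -x*y, -x],
    ])

def ibvs_velocity(points, desired, depths, lam=0.5):
    """v = -lambda * pinv(L) * e for a stack of point features."""
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(points, depths)])
    e = (np.asarray(points) - np.asarray(desired)).ravel()
    return -lam * np.linalg.pinv(L) @ e

# Four points already at their desired positions -> zero commanded twist
pts = [(0.1, 0.1), (-0.1, 0.1), (-0.1, -0.1), (0.1, -0.1)]
v = ibvs_velocity(pts, pts, depths=[1.0] * 4)
print(np.allclose(v, 0.0))  # True: the servo has converged
```

Four points give an 8x6 stacked Jacobian, which is why four well-spread features are the usual minimum for controlling all six degrees of freedom without ambiguity.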
5.3 Position-Based Visual Servoing (PBVS)
PBVS first estimates the full 3D pose of the target from visual features, then computes a Cartesian velocity command to move the robot end-effector toward the desired 3D pose. The control law operates on the 6-DOF pose error in SE(3), producing straight-line Cartesian trajectories that are intuitive and predictable. PBVS requires accurate camera calibration and a 3D model of the target for pose estimation.
Advantages of PBVS: predictable Cartesian motion paths, natural handling of large displacements and rotations, and easy integration with collision avoidance systems that operate in Cartesian space. Disadvantages: sensitivity to calibration errors (camera intrinsics, hand-eye transform), reliance on accurate 3D pose estimation, and potential for features to leave the camera's field of view during servoing if the initial pose error is large.
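The quantity a PBVS law drives to zero is the SE(3) error between current and desired poses. A minimal sketch of splitting that error into translation and rotation components, using 4x4 homogeneous matrices (the rotation angle is recovered from the trace of the relative rotation):

```python
import numpy as np

def pose_error(T_cur, T_des):
    """SE(3) error between current and desired poses (4x4 homogeneous
    matrices): returns the translation error vector and the rotation
    error angle in radians. A PBVS controller commands a Cartesian
    twist proportional to these errors."""
    dT = np.linalg.inv(T_cur) @ T_des
    t_err = dT[:3, 3]
    # angle of the relative rotation, from trace(R) = 1 + 2*cos(angle)
    cos_a = np.clip((np.trace(dT[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return t_err, np.arccos(cos_a)

T_cur = np.eye(4)
T_des = np.eye(4)
T_des[:3, 3] = [0.0, 0.0, 0.1]  # desired pose sits 10 cm ahead
t_err, ang = pose_error(T_cur, T_des)
print(t_err, float(ang))  # pure translation error, zero rotation error
```

Because the error lives in Cartesian space, gains can be tuned separately for translation and rotation, which is one reason PBVS integrates cleanly with Cartesian collision avoidance.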
5.4 Hybrid Approaches
Modern implementations increasingly adopt hybrid visual servoing strategies that combine IBVS and PBVS advantages. A common approach uses PBVS for translational motion (predictable Cartesian path) and IBVS for rotational control (robust to calibration errors). Partitioned approaches decouple the control into translational and rotational components, each using the most appropriate servoing strategy. Deep learning-based visual servoing replaces hand-crafted features with learned feature representations, using convolutional neural networks to directly predict velocity commands from raw images. These end-to-end methods show promise for unstructured environments but remain less reliable than classical approaches for precision industrial tasks.
Use open-loop when: camera and robot are well-calibrated, objects are stationary, cycle time is critical, and 1-2mm accuracy is sufficient (e.g., standard bin picking with vacuum grippers).
Use visual servoing when: objects may move during approach (conveyor tracking), sub-millimeter accuracy is required (PCB insertion, precision assembly), calibration drift is expected (thermal expansion, mobile robots), or the robot must react to real-time visual feedback (welding seam tracking, wire insertion).
6. Bin Picking
6.1 Random Bin Picking
Random bin picking is widely considered the most challenging practical application of computer vision in industrial robotics. The task requires the robot to identify, localize, and grasp individual parts from a disordered heap of randomly oriented objects in a bin. The difficulty arises from severe occlusion (objects partially hidden by other objects), entanglement (parts interlocking), reflective or dark surfaces that challenge 3D cameras, and the need to plan collision-free approach paths into a confined bin volume.
A production random bin picking pipeline typically follows this sequence: (1) acquire 3D point cloud of the bin contents from a structured light camera mounted above the bin; (2) segment the bin walls and remove background points; (3) detect individual objects using 3D instance segmentation or 2D detection projected onto the 3D cloud; (4) estimate the 6-DOF pose of the topmost (most accessible) objects; (5) score candidate grasps based on collision clearance, grasp quality metrics, and reachability; (6) execute the highest-scored grasp with the robot; (7) verify grasp success using force/torque feedback or re-scan.
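Step (5), grasp scoring, is typically a gated weighted sum: hard constraints (reachability, minimum collision clearance) prune candidates outright, and the survivors are ranked on normalized metrics. A toy sketch; the weights, thresholds, and 50mm saturation point below are illustrative, not values from any particular product:

```python
def score_grasp(clearance_mm, quality, reachable,
                w_clear=0.4, w_qual=0.6, min_clearance_mm=10.0):
    """Toy grasp-candidate scorer: hard constraints gate a weighted
    sum of normalized metrics. All weights are illustrative."""
    if not reachable or clearance_mm < min_clearance_mm:
        return 0.0                                # hard-pruned candidate
    clear_norm = min(clearance_mm / 50.0, 1.0)    # saturate beyond 50 mm
    return w_clear * clear_norm + w_qual * quality

candidates = [
    {"clearance_mm": 35.0, "quality": 0.9,  "reachable": True},
    {"clearance_mm": 60.0, "quality": 0.6,  "reachable": True},
    {"clearance_mm": 80.0, "quality": 0.95, "reachable": False},  # pruned
]
best = max(candidates, key=lambda c: score_grasp(**c))
print(best["quality"])  # the high-quality, adequately-clear grasp wins
```

Production systems add more terms (approach-axis alignment with gravity, distance to bin walls, expected entanglement), but the gate-then-rank structure is the same.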
Leading commercial bin picking solutions (Photoneo Bin Picking Studio, Mech-Mind Mech-Vision, SICK PLB, Zivid + Pickit) bundle calibrated 3D cameras with integrated perception software, reducing deployment from months of custom development to days of configuration. These systems achieve 99%+ pick success rates on well-characterized parts at cycle times of 6-12 seconds per pick.
6.2 Structured Picking and Depalletizing
Structured Bin Picking: When parts arrive in known arrangements (trays, blister packs, organized layers), simplified vision algorithms suffice. Template matching or CAD-guided detection locates parts with known spacing, requiring only compensation for tray misalignment and layer height variation. Cycle times drop to 2-4 seconds per pick with higher reliability than random picking.
Depalletizing with Vision: Vision-guided depalletizing uses overhead 3D cameras to detect the top layer of cases or bags on a pallet, determine pick points, and generate layer-by-layer unloading sequences. ToF cameras (for speed) or structured light cameras (for accuracy on shiny packaging) provide the 3D scene understanding. The algorithm must handle mixed-SKU pallets, damaged cartons, shrink-wrapped loads, and slip sheets between layers. Modern depalletizing systems from Cognex, SICK, and Mech-Mind achieve 99.5%+ reliability at 600-1000 cases per hour.
7. Quality Inspection
7.1 Defect Detection with Deep Learning
Deep learning has fundamentally transformed automated visual inspection, replacing hand-crafted feature engineering with data-driven defect models that generalize across variations in lighting, positioning, and material surface properties. The dominant architectures for industrial defect detection fall into three categories.
Supervised Classification/Segmentation: When labeled defect data is available, models like U-Net, DeepLabv3+, and YOLO-seg directly segment defect regions at pixel level. Training requires 200-2000 annotated defect images per class, which can be augmented with synthetic data generation. This approach delivers the highest accuracy (95-99.5% defect detection rate) but requires labeled datasets for each defect type.
Anomaly Detection (Unsupervised): When defect samples are rare or unknown, anomaly detection models learn the distribution of normal (good) parts and flag deviations. PatchCore, PaDiM, and FastFlow architectures from the anomalib library achieve 95-98% AUROC on the MVTec Anomaly Detection benchmark using only defect-free training images. This approach dramatically reduces data collection requirements and naturally handles novel defect types not seen during training. For production deployment, anomaly detection is particularly valuable in industries like semiconductor fabrication and precision machining where new defect modes emerge unpredictably.
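The scoring idea behind memory-bank methods like PatchCore reduces to nearest-neighbor distance in feature space: a test feature is anomalous if nothing similar exists among features extracted from defect-free parts. A sketch with synthetic feature vectors standing in for the deep network embeddings:

```python
import numpy as np

def anomaly_score(feature, memory_bank):
    """PatchCore-style scoring sketch: the score of a test feature is
    its distance to the nearest entry in a memory bank built only from
    defect-free training samples."""
    d = np.linalg.norm(memory_bank - feature, axis=1)
    return float(d.min())

rng = np.random.default_rng(42)
bank = rng.normal(0.0, 0.1, size=(500, 8))   # features of good parts
good_patch = rng.normal(0.0, 0.1, size=8)    # in-distribution patch
defect_patch = np.full(8, 2.0)               # far from anything normal
print(anomaly_score(good_patch, bank) < anomaly_score(defect_patch, bank))  # True
```

In the real pipelines, the bank holds mid-level CNN patch embeddings and is coreset-subsampled to keep the nearest-neighbor search fast; thresholding the per-patch scores yields a defect localization map.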
Few-Shot and Foundation Models: Vision-language models (CLIP, BLIP-2) and segment-anything approaches enable defect detection with minimal labeled data. An inspector can describe a defect type in natural language ("scratch on polished surface", "solder bridge between pads") and the model identifies matching regions. While accuracy lags behind fully supervised approaches, the near-zero setup time makes this attractive for low-volume, high-mix manufacturing common in Vietnamese contract manufacturing facilities.
7.2 Surface Inspection Systems
Surface inspection requires specialized illumination strategies to reveal defects. Bright-field illumination (direct on-axis light) highlights color defects, contamination, and markings. Dark-field illumination (low-angle grazing light) reveals surface topography defects like scratches, dents, and texture irregularities by scattering light at defect edges. Dome illumination provides diffuse, shadow-free lighting for inspecting curved or reflective surfaces. Structured illumination (photometric stereo with multiple light directions) reconstructs surface normal maps that reveal micro-topography invisible under standard lighting. Production surface inspection systems from Cognex (In-Sight), Keyence, and ISRA Vision combine optimized illumination with specialized optics and real-time deep learning inference for throughputs exceeding 10 parts per second.
7.3 Dimensional Measurement
Vision-based dimensional measurement replaces contact gauging (calipers, CMMs) with non-contact optical methods. 2D measurement using calibrated telecentric lenses achieves 5-20 micron accuracy for in-plane dimensions. 3D measurement using structured light or laser triangulation extends this to height, flatness, and volumetric dimensions with 10-50 micron accuracy. Critical success factors include thermal stability (camera and lens expand with temperature), vibration isolation, and traceable calibration against certified reference artifacts. For GD&T (Geometric Dimensioning and Tolerancing) compliance, vision measurement systems must be validated per MSA (Measurement System Analysis) protocols with documented Gage R&R studies.
8. Calibration
8.1 Intrinsic Calibration
Camera intrinsic calibration determines the internal parameters that map 3D world points to 2D pixel coordinates: focal length (fx, fy), principal point (cx, cy), and lens distortion coefficients (radial k1-k6, tangential p1-p2). Standard calibration uses Zhang's method with a planar checkerboard pattern captured from 15-30 viewpoints. OpenCV's calibrateCamera() function implements this with sub-pixel corner detection. For production-grade calibration, use a precision-manufactured target (lithographic or chrome-on-glass, not an office laser-printed paper pattern) mounted on a flat glass or ceramic substrate. Reprojection error below 0.3 pixels indicates good calibration; below 0.1 pixels is excellent. Recalibrate whenever the lens is adjusted, the camera is remounted, or operating temperature changes significantly.
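The reprojection error being minimized is the pixel distance between an observed corner and its position predicted by the forward camera model. A minimal sketch of that forward model with a single radial distortion term (the intrinsic values below are arbitrary examples, not from any real calibration):

```python
import math

def project(X, Y, Z, fx, fy, cx, cy, k1=0.0):
    """Pinhole projection with one radial distortion term: the forward
    model whose reprojection residuals calibrateCamera() minimizes
    over all views and corners."""
    x, y = X / Z, Y / Z            # normalized image coordinates
    r2 = x * x + y * y
    d = 1.0 + k1 * r2              # radial distortion factor
    return fx * x * d + cx, fy * y * d + cy

def reprojection_error(observed_px, predicted_px):
    """Euclidean pixel distance between observed and predicted corner."""
    return math.dist(observed_px, predicted_px)

u, v = project(0.05, -0.02, 0.6, fx=1400.0, fy=1400.0, cx=960.0, cy=600.0)
err = reprojection_error((u + 0.2, v - 0.1), (u, v))
print(round(err, 3))  # a ~0.224-pixel residual: inside the "good" band
```

OpenCV reports the RMS of these residuals over every corner in every view, which is the number to compare against the 0.3/0.1 pixel thresholds above.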
8.2 Hand-Eye Calibration
Hand-eye calibration determines the rigid transformation between the robot end-effector (hand) and the camera (eye). This transform is essential for converting object poses detected in camera coordinates to robot base coordinates for manipulation. Two configurations exist.
Eye-in-Hand: Camera mounted on the robot's wrist, moving with the end-effector. The calibration solves AX = XB, where A is the robot motion between two poses (known from forward kinematics), B is the camera motion (computed from observing a fixed calibration target), and X is the unknown hand-eye transform. At least 3 non-degenerate motions are required; 8-15 motions with diverse orientations yield robust results. OpenCV implements Tsai-Lenz, Park, Horaud, and Daniilidis solvers via calibrateHandEye().
Eye-to-Hand: Camera fixed in the workspace, observing the robot. The calibration solves AX = ZB, where Z is the transform from robot base to camera. This configuration is standard for overhead bin picking cameras. Calibration requires the robot to present a calibration target (mounted on the flange) at 8-15 diverse poses within the camera's field of view. OpenCV's calibrateRobotWorldHandEye() solves this variant.
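The AX = XB relation from the eye-in-hand case can be sanity-checked numerically: for the true hand-eye transform X, every consistent motion pair (A, B) satisfies it exactly, and solvers such as Tsai-Lenz find the X minimizing the residual over many pairs. A sketch with 4x4 homogeneous matrices (the 30-degree twist and 5cm offset are made-up illustrative values):

```python
import numpy as np

def rot_z(angle_rad):
    """Homogeneous transform for a rotation about the z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

# Illustrative ground-truth hand-eye transform X (camera in flange frame)
X = rot_z(np.deg2rad(30.0))
X[:3, 3] = [0.05, 0.0, 0.0]

# For any camera motion B, the consistent robot motion is A = X B X^-1,
# so the AX = XB residual of the correct X is (numerically) zero.
B = rot_z(np.deg2rad(45.0))
B[:3, 3] = [0.0, 0.1, 0.02]
A = X @ B @ np.linalg.inv(X)

residual = np.linalg.norm(A @ X - X @ B)
print(residual < 1e-12)  # True
```

With noisy real measurements the residual is never zero, which is why 8-15 diverse motion pairs are used: the over-determined system averages out per-pose estimation noise.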
8.3 Multi-Camera Systems
Complex robotic cells often employ multiple cameras for complete scene coverage: an overhead camera for coarse localization, an eye-in-hand camera for fine alignment, and side cameras for quality verification. Calibrating multi-camera systems requires establishing a common coordinate frame. Extrinsic calibration between cameras can use shared observations of a calibration target visible to both cameras simultaneously, or transitively through the robot coordinate frame (if each camera is independently hand-eye calibrated to the same robot). For multi-camera stereo setups, OpenCV's stereoCalibrate() jointly optimizes both camera intrinsics and the relative extrinsic transformation.
9. Edge AI Platforms
9.1 Why Edge Inference for Robotics
Robotic vision systems demand low-latency, deterministic inference that cloud-based processing cannot reliably provide. Network round-trip latency (10-100ms to cloud) exceeds the response time requirements of visual servoing (5-10ms loop), safety-rated obstacle detection (under 20ms), and high-speed conveyor tracking. Edge AI accelerators colocated with the robot's camera system eliminate network dependency, provide deterministic latency, and operate in air-gapped manufacturing environments that prohibit external data transmission for IP protection. The economics also favor edge: after initial hardware cost, inference is essentially free, versus per-query cloud API costs that scale linearly with throughput.
9.2 Platform Comparison
| Platform | AI Performance | Power | GPU/NPU | Price (Module) | Best For |
|---|---|---|---|---|---|
| NVIDIA Jetson Orin NX 16GB | 100 TOPS (INT8) | 10-25W | 1024 CUDA + 32 Tensor | ~$600 | Multi-model pipelines |
| NVIDIA Jetson AGX Orin 64GB | 275 TOPS (INT8) | 15-60W | 2048 CUDA + 64 Tensor | ~$1,600 | Autonomous robots, multi-cam |
| NVIDIA Jetson Orin Nano 8GB | 40 TOPS (INT8) | 7-15W | 512 CUDA + 16 Tensor | ~$250 | Single-camera detection |
| Intel Neural Compute Stick 2 | 4 TOPS (INT8) | ~1.5W | Myriad X VPU | ~$70 | Low-power classification |
| Google Coral Edge TPU (M.2) | 4 TOPS (INT8) | 2W | Edge TPU ASIC | ~$30 | TFLite single-model |
| Hailo-8 (M.2) | 26 TOPS (INT8) | 2.5W | Custom dataflow NPU | ~$100 | Multi-stream, high efficiency |
| Hailo-15H (coming) | 20 TOPS + ISP | 3W | NPU + vision proc. | ~$40 | Smart cameras |
9.3 NVIDIA Jetson Ecosystem
The NVIDIA Jetson platform dominates robotic edge AI due to its combination of GPU compute, mature software ecosystem, and direct compatibility with training frameworks. The deployment pipeline typically flows: train on desktop/cloud GPU (PyTorch/TensorFlow) -> export to ONNX -> optimize with TensorRT -> deploy on Jetson. TensorRT optimization routinely delivers 2-5x speedup over native PyTorch inference through layer fusion, precision calibration (FP32 to FP16/INT8), and kernel auto-tuning. NVIDIA's Isaac ROS platform provides pre-built, GPU-accelerated ROS2 nodes for stereo depth estimation (using DNN-based disparity), visual SLAM (cuVSLAM), object detection (DOPE, CenterPose), and 3D perception (nvblox occupancy mapping).
For production robotics, the Jetson Orin NX 16GB represents the sweet spot: sufficient performance to run a YOLOv8-medium detector, a depth estimation model, and point cloud processing simultaneously at 15+ fps within a 25W power envelope. The AGX Orin 64GB is reserved for autonomous mobile robots running concurrent SLAM, multi-camera detection, and path planning, or for running large models like SAM alongside real-time detection.
9.4 Hailo and Emerging Alternatives
Hailo's dataflow architecture NPU has emerged as a compelling alternative for multi-stream edge applications. The Hailo-8 delivers 26 TOPS at just 2.5W, offering 10x better TOPS/Watt than Jetson Orin. The Hailo Dataflow Compiler converts ONNX/TFLite models with automatic quantization and scheduling. Particularly attractive for multi-camera quality inspection systems where 4-8 camera streams must be processed simultaneously on a single edge device. Hailo's partnership with Raspberry Pi (Hailo AI Kit for RPi5) is driving adoption in research and low-cost robotics applications.
10. Software Frameworks
10.1 Open-Source Frameworks
OpenCV (Open Source Computer Vision Library): The foundational library for computer vision across all domains. OpenCV 4.x provides 2500+ algorithms covering image processing, feature detection, camera calibration, stereo vision, object detection (DNN module for running ONNX/TensorFlow/Caffe models), ArUco/ChArUco marker detection, and optical flow. OpenCV's DNN module supports inference on CPU, CUDA GPU, and OpenVINO backends, making it the universal preprocessing and inference layer for robotics vision. Every serious robotics vision system uses OpenCV, either directly or through higher-level wrappers.
OpenCV Contrib: Extended modules include ArUco marker detection (essential for robot calibration and fiducial tracking; promoted into the main modules as of OpenCV 4.7), structured light pattern generation, surface matching (3D object recognition), and xfeatures2d (SURF and other experimental feature detectors; SIFT graduated to the main OpenCV module in 4.4 once its patent expired). The cv2.aruco module is particularly valuable for robotics, providing robust 6-DOF pose estimation from printed markers for calibration validation, fixture alignment, and simple object tracking.
10.2 Commercial Frameworks
MVTec HALCON: The gold standard for industrial machine vision, used in over 90% of major automotive inspection systems worldwide. HALCON provides an integrated development environment (HDevelop) with 2000+ operators covering blob analysis, template matching (shape-based, correlation-based, deformable), 3D vision (surface matching, 3D pose estimation), deep learning (anomaly detection, classification, semantic segmentation), barcode/OCR, and calibration. HALCON's shape-based matching is unrivaled in speed and robustness, locating trained patterns in under 10ms even under significant rotation, scaling, and partial occlusion. License cost: $3,500-$8,000 per runtime seat, which is justified for high-reliability industrial deployments.
Cognex VisionPro: Cognex's PC-based vision software provides PatMax (geometric pattern matching), PatInspect (defect detection), IDMax (barcode reading), and deep learning tools. Cognex hardware (In-Sight cameras, DataMan readers) integrates tightly with VisionPro for turnkey inspection solutions. Cognex's deep learning edge (Cognex ViDi) requires minimal training data (as few as 20 images) and deploys to In-Sight cameras for standalone edge inference. Market strength: strongest in consumer electronics and semiconductor inspection.
Matrox Imaging Library (MIL): Matrox provides high-performance image processing with particular strength in multi-camera systems, line scan processing, and GigE Vision/Camera Link capture. MIL X is the latest generation supporting GPU-accelerated processing and deep learning inference. Matrox's SureDotOCR and PatternFinder are widely used in pharmaceutical packaging inspection and PCB assembly verification.
| Framework | License | Strengths | Deep Learning | 3D Vision | ROS2 Support |
|---|---|---|---|---|---|
| OpenCV | Free (Apache 2.0) | Universal, huge community | DNN inference only | Basic (stereo, calib) | Native via cv_bridge |
| HALCON | $3.5K-$8K/seat | Industrial reliability, matching | Integrated (train+deploy) | Excellent | C++ interface |
| Cognex VisionPro | $5K-$15K/seat | PatMax, turnkey hardware | Cognex ViDi | Good (3D-A5000) | Limited |
| Matrox MIL | $3K-$10K/seat | Multi-cam, line scan | MIL DL | Good | Limited |
| Open3D | Free (MIT) | Point clouds, reconstruction | Tensor integration | Excellent | Python bridge |
11. Integration with ROS2
11.1 ROS2 Vision Architecture
ROS2 (Robot Operating System 2) provides the middleware layer that connects camera drivers, perception algorithms, and robot controllers into a coherent vision-guided manipulation pipeline. The key architectural components for vision integration in ROS2 Humble/Iron/Jazzy are organized into standardized packages with well-defined message types and topic conventions.
image_pipeline: The core set of packages for camera processing in ROS2. image_transport handles efficient image transmission with pluggable compression (raw, compressed JPEG/PNG, theora video). image_proc performs debayering (converting raw Bayer patterns to color), rectification (undistorting images using calibration parameters), and resizing. depth_image_proc converts depth images to point clouds and registers color onto depth. stereo_image_proc computes disparity maps and 3D point clouds from calibrated stereo camera pairs. These packages form the foundation that downstream perception nodes build upon.
cv_bridge: The bridge between ROS2 sensor_msgs/Image messages and OpenCV cv::Mat (C++) or NumPy arrays (Python). Every vision node that processes images uses cv_bridge for format conversion. When publisher and subscriber run in the same process, ROS2's intra-process communication enables zero-copy transport, eliminating the image copy overhead that can bottleneck high-resolution pipelines.
11.2 Point Cloud Topics and Processing
3D perception in ROS2 centers on the sensor_msgs/PointCloud2 message type, which carries dense or organized point clouds with arbitrary fields (XYZ, RGB, normals, intensity). The pcl_ros package bridges PCL data structures with ROS2 messages. Key topic conventions include /camera/depth/points for raw depth camera point clouds, /camera/depth_registered/points for color-registered clouds, and custom topic names for processed/filtered clouds. Downstream nodes subscribe to these topics for object detection, segmentation, and grasp planning.
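A minimal sketch of the PointCloud2 memory layout, assuming the common unpadded x/y/z FLOAT32 field arrangement (real drivers often pad point_step to 16 or 32 bytes), shows how the flat data buffer round-trips through NumPy:

```python
import numpy as np

# Three XYZ points in meters as float32 -- three FLOAT32 fields x/y/z,
# point_step = 12 bytes, no padding assumed here.
points = np.array([[0.1, 0.0, 0.5],
                   [0.2, -0.1, 0.6],
                   [0.0, 0.3, 0.7]], dtype=np.float32)

# Serialize to the flat byte buffer a sensor_msgs/PointCloud2 'data'
# field carries over the wire.
data = points.tobytes()

# A subscriber can view the same buffer zero-copy as a structured array.
cloud = np.frombuffer(data, dtype=np.dtype(
    [("x", np.float32), ("y", np.float32), ("z", np.float32)]))
print(cloud["z"])  # [0.5 0.6 0.7]
```

The field names, offsets, and point_step are declared in the message's PointField list, so a robust subscriber should read them from the message rather than hard-coding the layout as this sketch does.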
For GPU-accelerated 3D perception, NVIDIA's Isaac ROS provides nvblox (real-time occupancy mapping and ESDF computation for collision avoidance), isaac_ros_depth_segmentation, and isaac_ros_cuMotion (GPU-accelerated motion planning aware of 3D obstacles). These nodes leverage CUDA and Jetson hardware for throughput that pure CPU implementations cannot match.
11.3 Complete Vision-Guided Pick Pipeline
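The full pipeline listing is beyond this excerpt, but its core geometric step — turning a detection's pixel location and measured depth into a pick position in the robot base frame — can be sketched as follows; the intrinsics and hand-eye transform are illustrative assumptions, not calibrated values:

```python
import numpy as np

# Illustrative pinhole intrinsics (fx, fy, cx, cy); real values come from
# camera calibration.
fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0

# Illustrative camera-to-base transform from hand-eye calibration;
# translation only here -- a real mount also has a rotation component.
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.4, 0.0, 0.6]

def pixel_to_base(u, v, depth_m):
    """Deproject pixel (u, v) at measured depth into the camera frame,
    then transform the point into the robot base frame."""
    p_cam = np.array([(u - cx) * depth_m / fx,
                      (v - cy) * depth_m / fy,
                      depth_m, 1.0])
    return (T_base_cam @ p_cam)[:3]

# A detection at the image center, 0.5 m from the camera.
print(pixel_to_base(320, 240, 0.5))  # [0.4 0.  1.1]
```

In the full pipeline this computation sits between the detector node and the grasp planner, usually expressed via tf2 frame lookups rather than a hard-coded matrix.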
11.4 Launch File and Configuration
11.5 ROS2 Vision Topic Map
12. Pre-Deployment Checklist
Before deploying a vision-guided robotic system to production, validate the following:
- Camera intrinsic calibration reprojection error below 0.3 pixels
- Hand-eye calibration consistency error below 0.5mm
- Object detection model validated on 500+ held-out test images with target domain data
- Edge inference latency benchmarked end-to-end (capture to pick pose output) under 100ms
- Lighting variation tested: worst-case ambient conditions still yield reliable detection
- Thermal stability verified: calibration accuracy after 4+ hours of continuous operation
- Failure mode handling: behavior defined for zero detections, low confidence, sensor dropout
- Logging and monitoring: all detections, confidences, and cycle metrics recorded for analysis
Seraphim Vietnam provides end-to-end computer vision and robotics engineering, from camera selection and calibration through deep learning model development, edge AI deployment, and ROS2 system integration. Schedule a consultation to discuss your machine vision requirements.

