
Computer Vision for Robotics
3D Perception, Detection & Visual Servoing

A comprehensive technical guide to machine vision in robotics covering camera technologies, 3D point cloud perception, object detection with YOLO and foundation models, visual servoing control, bin picking, automated quality inspection, hand-eye calibration, edge AI deployment, and full ROS2 integration pipelines.

ROBOTICS | January 2026 | 28 min read | Technical Depth: Advanced

1. Executive Summary

The global machine vision market for robotics is projected to reach $21.4 billion by 2028, growing at a compound annual growth rate (CAGR) of 7.6%. In the APAC region, growth is notably faster at 9.8% CAGR, fueled by accelerating manufacturing automation in Vietnam, Thailand, China, and South Korea. Computer vision has evolved from a supplementary sensor modality into the primary perceptual backbone of modern robotic systems, enabling capabilities that were considered research-only problems just five years ago.

The convergence of three technology waves is reshaping what robots can see and do. First, affordable high-resolution 3D cameras now cost under $500 for industrial-grade depth sensors that deliver sub-millimeter accuracy. Second, real-time deep learning inference on edge AI accelerators (NVIDIA Jetson Orin delivers 275 TOPS) allows sophisticated neural networks to run directly on the robot without cloud dependency. Third, foundation models like Meta's Segment Anything Model (SAM) and open-vocabulary detectors such as Grounding DINO enable robots to recognize and manipulate objects they have never been explicitly trained on, collapsing the deployment timeline from months of data collection to hours of prompt engineering.

This technical guide provides a complete reference for integrating computer vision into robotic systems. We cover the full pipeline from camera selection and calibration through 3D perception, object detection, visual servoing control loops, and production deployment on edge hardware. Every section includes practical code examples, comparison tables, and architecture recommendations drawn from our deployment experience across 30+ industrial vision systems in APAC manufacturing and logistics facilities.

Key figures:

- $21.4B: Global Machine Vision Market by 2028
- 275 TOPS: NVIDIA Jetson Orin INT8 Performance
- <2ms: YOLOv8-nano Inference on Edge GPU
- 0.1mm: Structured Light 3D Camera Accuracy

2. Camera Technologies

2.1 2D Cameras

Two-dimensional cameras remain the workhorse of industrial machine vision, providing high-resolution texture and color information essential for inspection, barcode reading, and feature-based detection. The two primary architectures serve different application profiles.

Area Scan Cameras: Capture a complete 2D frame in a single exposure using a rectangular sensor array. Resolutions range from VGA (0.3 MP) to 151 MP in current industrial models from Basler, FLIR (Teledyne), and Allied Vision. Global shutter sensors (e.g., Sony IMX sensor family) freeze motion without rolling shutter artifacts, making them essential for inspecting fast-moving objects on conveyor lines. For robotics pick-and-place, 5-12 MP area scan cameras with GigE Vision or USB3 Vision interfaces provide the optimal balance of resolution, field of view, and frame rate. A 12 MP camera with a 16mm lens at 600mm working distance delivers approximately 0.15mm/pixel resolution, sufficient for most grasping applications.
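That resolution figure follows from simple thin-lens geometry. The sketch below assumes a 3.45 µm pixel pitch (typical for current 12 MP industrial sensors; not stated in the text) and lands in the same ballpark as the quoted ~0.15 mm/pixel:

```python
def ground_resolution_mm_per_px(pixel_pitch_um, focal_length_mm, working_distance_mm):
    """Approximate object-space resolution of a lens/sensor combination.

    Thin-lens approximation: magnification ~ f / (WD - f), so one pixel
    covers pixel_pitch / magnification in the object plane.
    """
    magnification = focal_length_mm / (working_distance_mm - focal_length_mm)
    return (pixel_pitch_um / 1000.0) / magnification

# 12 MP sensor with assumed 3.45 um pixels, 16 mm lens, 600 mm working distance
res = ground_resolution_mm_per_px(3.45, 16.0, 600.0)
print(f"{res:.3f} mm/pixel")  # roughly 0.13 mm/pixel
```

The same arithmetic, run in reverse, is a quick way to pick a focal length for a required defect or feature size.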

Line Scan Cameras: Use a single row of pixels (or trilinear RGB rows) to build images line by line as the object moves beneath the camera. Resolutions reach 16K pixels per line (16,384 pixels), delivering continuous inspection of materials without the field-of-view limitations of area scan sensors. Line scan excels at web inspection (textiles, paper, film), printed circuit board inspection, and any scenario with continuous linear motion. DALSA (Teledyne), Basler racer, and e2v line scan cameras dominate this segment. Triggering requires an encoder signal synchronized to material transport speed, making integration more complex than area scan but yielding superior results for high-speed continuous processes.

2.2 3D Cameras

Depth perception is the gateway to robotic manipulation in unstructured environments. Three principal 3D sensing technologies compete for different application niches.

Structured Light: Projects a known pattern (typically a sequence of coded stripes or speckle patterns) onto the scene and triangulates depth from the observed pattern deformation. Delivers the highest accuracy among 3D technologies, with leading systems (Photoneo PhoXi, Ensenso, Zivid) achieving point cloud accuracy of 0.05-0.3mm. Structured light excels at bin picking, precision assembly, and quality measurement. The primary limitation is sensitivity to ambient infrared light, which can interfere with the projected pattern in outdoor or brightly lit environments. Acquisition time ranges from 0.1s to 2s depending on exposure settings, making it best suited for quasi-static scenes.

Time-of-Flight (ToF): Measures depth by calculating the round-trip time of emitted infrared light pulses. Modern ToF sensors (Basler blaze, Lucid Helios2+, ifm O3D) deliver real-time depth maps at 30-60 fps with accuracy of 5-15mm. ToF cameras operate well in varied lighting conditions and provide consistent frame rates regardless of scene complexity. Their lower accuracy compared to structured light limits precision applications but makes them ideal for navigation, obstacle avoidance, volumetric measurement, and coarse bin picking where cycle time matters more than sub-millimeter precision.

Stereo Vision: Computes depth by triangulating corresponding features across two or more calibrated cameras. Intel RealSense D400 series and Stereolabs ZED 2i are popular embedded stereo systems. Stereo vision works outdoors, produces dense depth maps, and scales to long range (10m+). Active stereo systems project an IR texture pattern to assist correspondence matching on textureless surfaces, significantly improving reliability for indoor robotics. Accuracy depends on baseline distance and resolution, typically 1-5% of range, placing it between ToF and structured light.
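The baseline dependence follows from the standard disparity-to-depth error model, dZ ~ Z^2 * d_disp / (f * b). The numbers below (50 mm baseline, 640 px focal length, quarter-pixel disparity noise) are illustrative assumptions, not any specific product's spec:

```python
def stereo_depth_error(z_m, baseline_m, focal_px, disparity_err_px=0.25):
    """Expected depth error from disparity uncertainty:
    dZ ~= Z^2 * d_disp / (f * b), with f in pixels and b in meters."""
    return (z_m ** 2) * disparity_err_px / (focal_px * baseline_m)

# Illustrative short-baseline stereo head: 50 mm baseline, 640 px focal length
for z in (0.5, 1.0, 2.0, 4.0):
    err = stereo_depth_error(z, 0.050, 640.0)
    print(f"Z = {z:.1f} m -> depth error ~ {err*1000:.1f} mm ({100*err/z:.2f}% of range)")
```

Note the quadratic growth with range: doubling the distance quadruples the depth error, which is why stereo accuracy is quoted as a percentage of range rather than a fixed value.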

2.3 Specialized Sensors

Event Cameras (Dynamic Vision Sensors): Neuromorphic sensors that independently report per-pixel brightness changes asynchronously, with microsecond temporal resolution. Prophesee and iniVation DVS sensors deliver 120dB dynamic range (versus 60dB for conventional cameras) and zero motion blur. In robotics, event cameras enable high-speed tracking of fast-moving objects, visual servoing at kilohertz rates, and reliable perception in extreme lighting conditions. Research demonstrations show event-camera-based grasping of objects thrown at the robot, with reaction times under 10ms. Adoption is accelerating as ROS2 drivers and deep learning frameworks mature.

Thermal Cameras (LWIR): Long-wave infrared cameras (8-14 micron wavelength) from FLIR, Seek Thermal, and InfiRay detect thermal radiation independent of visible lighting. Critical applications in robotics include weld seam tracking on hot workpieces, food safety inspection, predictive maintenance (detecting overheating bearings or motors), and human detection for safety systems in collaborative robot cells. Thermal cameras are also invaluable in agricultural robotics for crop stress monitoring and fruit ripeness detection.

| Technology | Accuracy | Range | Speed | Ambient Light | Cost Range | Best For |
|---|---|---|---|---|---|---|
| Structured Light | 0.05-0.3mm | 0.2-2.0m | 0.1-2s | Indoor only | $3K-$15K | Bin picking, precision assembly |
| Time-of-Flight | 5-15mm | 0.1-6.0m | 30-60 fps | Good | $1K-$5K | Navigation, volume measurement |
| Active Stereo | 1-5% of range | 0.3-10m | 30-90 fps | Good | $300-$2K | Mobile robots, SLAM |
| Event Camera | N/A (temporal) | 0.1-10m | 1M events/s | 120dB HDR | $2K-$8K | High-speed tracking |
| Thermal (LWIR) | N/A (thermal) | 0.5-50m+ | 30-60 fps | Independent | $500-$10K | Inspection, safety |

3. 3D Perception & Point Cloud Processing

3.1 Point Cloud Fundamentals

A point cloud is an unstructured set of 3D coordinates, optionally augmented with color (XYZRGB), normals, or intensity values. Raw point clouds from depth cameras contain hundreds of thousands to millions of points and must be processed through a pipeline of filtering, segmentation, and feature extraction before they are useful for robotic manipulation decisions. Two open-source libraries dominate point cloud processing in robotics.

Point Cloud Library (PCL): The original C++ point cloud processing toolkit, deeply integrated with ROS/ROS2 through the pcl_ros bridge. PCL provides mature implementations of voxel grid downsampling, statistical outlier removal, RANSAC plane fitting, Euclidean clustering, ICP registration, and surface reconstruction. While development has slowed, PCL remains the standard for production ROS2 deployments due to its C++ performance and extensive ROS message support.

Open3D: A modern Python/C++ library from Intel with GPU-accelerated processing, superior visualization, and cleaner APIs than PCL. Open3D excels at reconstruction tasks (TSDF fusion, Poisson surface reconstruction), provides native tensor-based operations for deep learning integration, and supports CUDA-accelerated ICP and RANSAC. For rapid prototyping and research, Open3D is the preferred choice, while PCL remains stronger for hard real-time ROS2 applications.

3.2 Standard Processing Pipeline

A typical robotic perception pipeline processes raw point clouds through the following stages: passthrough filtering (crop to workspace volume), downsampling (voxel grid to reduce density), outlier removal (statistical or radius-based), plane segmentation (RANSAC to remove table/floor), Euclidean clustering (separate individual objects), and feature extraction (centroid, bounding box, surface normals for grasp planning).

```python
# Point Cloud Processing Pipeline with Open3D
import open3d as o3d
import numpy as np

def process_point_cloud(pcd_path, voxel_size=0.003):
    """
    Complete point cloud processing pipeline for bin picking.
    Input: raw point cloud from structured light camera
    Output: list of segmented object clusters with centroids and normals
    """
    # Load raw point cloud
    pcd = o3d.io.read_point_cloud(pcd_path)
    print(f"Raw points: {len(pcd.points)}")

    # 1. Passthrough filter - crop to workspace volume (meters)
    bbox = o3d.geometry.AxisAlignedBoundingBox(
        min_bound=np.array([-0.4, -0.3, 0.01]),  # XYZ min
        max_bound=np.array([0.4, 0.3, 0.35])     # XYZ max
    )
    pcd = pcd.crop(bbox)

    # 2. Voxel grid downsampling
    pcd_down = pcd.voxel_down_sample(voxel_size=voxel_size)
    print(f"After downsampling: {len(pcd_down.points)}")

    # 3. Statistical outlier removal
    pcd_clean, inlier_idx = pcd_down.remove_statistical_outlier(
        nb_neighbors=20, std_ratio=2.0
    )

    # 4. Estimate normals (required for plane segmentation)
    pcd_clean.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(
            radius=0.01, max_nn=30
        )
    )

    # 5. RANSAC plane segmentation (remove table surface)
    plane_model, inliers = pcd_clean.segment_plane(
        distance_threshold=0.005, ransac_n=3, num_iterations=1000
    )
    objects_pcd = pcd_clean.select_by_index(inliers, invert=True)
    print(f"Objects after plane removal: {len(objects_pcd.points)}")

    # 6. DBSCAN clustering to separate individual objects
    labels = np.array(objects_pcd.cluster_dbscan(
        eps=0.015, min_points=50, print_progress=False
    ))
    n_clusters = labels.max() + 1
    print(f"Detected {n_clusters} object clusters")

    # 7. Extract per-object properties
    objects = []
    for i in range(n_clusters):
        cluster_idx = np.where(labels == i)[0]
        cluster = objects_pcd.select_by_index(cluster_idx)
        centroid = cluster.get_center()
        obb = cluster.get_oriented_bounding_box()
        objects.append({
            'cluster_id': i,
            'centroid': centroid,
            'num_points': len(cluster_idx),
            'bounding_box': obb,
            'dimensions': obb.extent,
            'orientation': obb.R
        })
    return objects
```

3.3 Surface Reconstruction

For applications requiring solid geometry rather than point samples, surface reconstruction converts point clouds into triangular meshes. Poisson Surface Reconstruction produces watertight meshes suitable for volume calculation, CAD comparison, and physics simulation. The algorithm requires oriented normals and works best on uniformly sampled surfaces. TSDF (Truncated Signed Distance Function) Fusion integrates multiple depth frames into a volumetric representation, producing dense and accurate reconstructions ideal for multi-view scanning setups. Open3D implements GPU-accelerated TSDF with ScalableTSDFVolume, supporting real-time reconstruction from RGBD streams.

3.4 6-DOF Pose Estimation

Determining the full six-degree-of-freedom pose (X, Y, Z, roll, pitch, yaw) of known objects is essential for precision pick-and-place and assembly tasks. Classical approaches use feature-based matching: extract 3D features (FPFH, SHOT) from the scene point cloud, match against a CAD model template, and refine with ICP. Modern deep learning methods trained on synthetic data (e.g., DenseFusion, PVN3D, FoundationPose) directly regress 6-DOF poses from RGBD input, achieving state-of-the-art accuracy on benchmarks like BOP Challenge. NVIDIA's FoundationPose (2024) introduced a foundation model approach that generalizes to novel objects using only a single reference image or a CAD model, eliminating per-object training entirely.
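Classical ICP refinement alternates correspondence search with a closed-form rigid alignment step. That inner step, the Kabsch/Umeyama SVD alignment, can be sketched in a few lines of numpy, assuming correspondences have already been matched:

```python
import numpy as np

def best_fit_transform(src, dst):
    """Closed-form least-squares rigid transform (Kabsch/Umeyama).

    src, dst: (N, 3) arrays of corresponding points.
    Returns a 4x4 homogeneous transform T mapping src onto dst.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Sanity check: recover a known 30-degree rotation plus translation
rng = np.random.default_rng(0)
src = rng.uniform(-0.1, 0.1, size=(200, 3))
a = np.deg2rad(30)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.05, -0.02, 0.30])
dst = src @ R_true.T + t_true
T = best_fit_transform(src, dst)
```

Full ICP wraps this in a loop: find nearest-neighbor correspondences, solve for T, transform, repeat until the residual converges. Libraries like Open3D and PCL implement exactly this loop with robust correspondence rejection added.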

Practical Tip: Choosing Between Classical and Deep Learning Pose Estimation

For high-accuracy applications with known CAD models and controlled lighting (e.g., automotive assembly), classical ICP-based methods with structured light cameras deliver repeatable sub-millimeter accuracy. For high-variety applications with unknown or changing objects (e.g., e-commerce fulfillment), deep learning methods like FoundationPose provide superior generalization with 2-5mm accuracy sufficient for grasping. Hybrid pipelines that use deep learning for coarse detection and ICP for fine refinement combine the best of both approaches.

4. Object Detection & Recognition

4.1 YOLO Family for Real-Time Detection

The YOLO (You Only Look Once) architecture family remains the dominant choice for real-time object detection in robotics due to its single-pass inference speed and excellent accuracy-latency tradeoff. The progression from YOLOv5 through YOLOv8 to the current YOLOv11 has delivered consistent improvements in both accuracy and inference efficiency.

YOLOv8 (Ultralytics): The current production standard, offering five model scales (nano, small, medium, large, extra-large) that span the full spectrum from embedded edge deployment (YOLOv8n at 3.2M parameters, 1.8ms on Jetson Orin) to maximum accuracy (YOLOv8x at 68.2M parameters). YOLOv8 natively supports detection, segmentation, pose estimation, and oriented bounding boxes (OBB) from a unified training framework. The Ultralytics Python API provides single-line training, validation, and export to ONNX, TensorRT, CoreML, and OpenVINO formats.

YOLOv11: The latest release introduces improved C3K2 backbone blocks and attention mechanisms that push accuracy 2-3% higher on COCO while maintaining comparable inference speed. For robotics deployments where retraining is feasible, YOLOv11 is now the recommended starting point.

```python
# YOLO Object Detection for Robotic Pick-and-Place
from ultralytics import YOLO
import cv2
import numpy as np

# Load trained model (exported to TensorRT for Jetson deployment)
model = YOLO("best.engine", task="detect")  # TensorRT engine

def detect_objects(image_bgr, conf_threshold=0.7):
    """
    Detect graspable objects in camera frame.
    Returns list of detections with class, confidence, and pixel coordinates.
    """
    results = model.predict(
        source=image_bgr,
        conf=conf_threshold,
        iou=0.45,        # NMS IoU threshold
        imgsz=640,
        device=0,        # GPU device
        verbose=False
    )
    detections = []
    for r in results:
        for box in r.boxes:
            x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
            detections.append({
                'class_id': int(box.cls[0]),
                'class_name': model.names[int(box.cls[0])],
                'confidence': float(box.conf[0]),
                'bbox_xyxy': [int(x1), int(y1), int(x2), int(y2)],
                'center_px': [int((x1 + x2) / 2), int((y1 + y2) / 2)],
                'area_px': int((x2 - x1) * (y2 - y1))
            })
    # Sort by confidence descending for pick priority
    detections.sort(key=lambda d: d['confidence'], reverse=True)
    return detections

# Usage in robot pick loop
cap = cv2.VideoCapture(0)
ret, frame = cap.read()
objects = detect_objects(frame)
if objects:
    target = objects[0]  # Highest confidence
    print(f"Pick target: {target['class_name']} at {target['center_px']}")
    print(f"Confidence: {target['confidence']:.2f}")
```

4.2 Detectron2 for Instance Segmentation

Meta's Detectron2 framework provides state-of-the-art instance segmentation using Mask R-CNN, Cascade R-CNN, and PointRend architectures. While slower than YOLO (typically 5-15 fps on desktop GPU), Detectron2 produces pixel-precise segmentation masks that are essential for deformable object manipulation, suction cup grasp planning (identifying flat surfaces within the mask), and measuring object dimensions from segmented regions. Detectron2's model zoo includes pre-trained weights on COCO (80 classes), LVIS (1203 classes), and Cityscapes, providing strong transfer learning baselines for custom robotic vision datasets.

4.3 Segment Anything Model (SAM)

Meta's SAM and its successor SAM 2 represent a paradigm shift toward foundation models for visual perception. SAM segments any object in an image given a point, box, or mask prompt, without object-specific training. For robotics, this capability is transformative: a robot can segment novel objects it has never seen during training by providing a rough location (from YOLO detection or user click) and receiving a pixel-precise mask. SAM 2 extends this to video with real-time temporal tracking, enabling consistent object segmentation across frames as the robot moves. The primary deployment challenge is computational cost; SAM's ViT-H backbone requires 2.5GB of GPU memory and runs at 2-4 fps on Jetson Orin. Distilled variants (MobileSAM, FastSAM, EfficientSAM) reduce this to near-real-time on edge hardware.

4.4 Open-Vocabulary and Foundation Models

The frontier of robotic perception has moved beyond fixed-class detectors to open-vocabulary models that accept natural language descriptions of target objects. Grounding DINO combines a DINO-based detector with grounded language understanding, allowing queries like "the red screw on the left side of the PCB" to produce bounding box detections without any task-specific training. OWLv2 (Google) provides open-world localization with text and image prompts. For robotic manipulation, these models are combined with SAM in a detect-then-segment pipeline: Grounding DINO localizes the object from a language prompt, and SAM produces the precise segmentation mask for grasp planning.

| Model | Task | Speed (Jetson Orin) | Accuracy (COCO) | Custom Training | Best For |
|---|---|---|---|---|---|
| YOLOv8-nano | Detection | 1.8ms / 555 fps | 37.3 mAP | Easy (Ultralytics) | Edge real-time |
| YOLOv8-large | Detection | 12ms / 83 fps | 52.9 mAP | Easy (Ultralytics) | Accuracy-critical |
| YOLOv8-seg | Segmentation | 15ms / 66 fps | 44.6 mask mAP | Easy (Ultralytics) | Grasp planning |
| Detectron2 Mask R-CNN | Segmentation | 80ms / 12 fps | 46.3 mask mAP | Moderate | Precision masks |
| SAM (ViT-H) | Segmentation | 250ms / 4 fps | N/A (zero-shot) | None needed | Novel objects |
| Grounding DINO | Open-vocab Det. | 180ms / 5 fps | 52.5 mAP (zero-shot) | Optional fine-tune | Language-guided |

5. Visual Servoing

5.1 Fundamentals

Visual servoing (VS) is a control technique that uses real-time visual feedback to guide robot motion, closing the loop between perception and actuation. Unlike open-loop pick-and-place (detect, plan, execute), visual servoing continuously adjusts the robot's trajectory based on what the camera currently observes, enabling the robot to correct for calibration errors, object drift, and environmental perturbations. This makes VS indispensable for tasks requiring sub-millimeter precision, dynamic target tracking, or operating with imprecise kinematic models.

5.2 Image-Based Visual Servoing (IBVS)

IBVS operates entirely in the 2D image plane, computing velocity commands from the error between current and desired image features (typically point coordinates, line parameters, or image moments). The control law minimizes the feature error e = s - s* using the image Jacobian (interaction matrix) L that maps camera velocities to feature motion. The velocity command is v = -lambda * L_pinv * e, where lambda is the control gain.
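For a point feature at normalized image coordinates (x, y) and depth Z, the interaction matrix has the classical closed form from Chaumette's tutorials. The sketch below implements the control law above with illustrative feature values (four points, a small offset from the goal configuration):

```python
import numpy as np

def interaction_matrix(pts_xy, depths):
    """Stacked 2x6 interaction matrices for point features at normalized
    image coordinates (x, y) and depth Z (classical point-feature form)."""
    rows = []
    for (x, y), z in zip(pts_xy, depths):
        rows.append([-1 / z, 0, x / z, x * y, -(1 + x * x), y])
        rows.append([0, -1 / z, y / z, 1 + y * y, -x * y, -x])
    return np.array(rows)

def ibvs_velocity(current, desired, depths, gain=0.5):
    """v = -lambda * pinv(L) * (s - s*): 6-DOF camera velocity command."""
    e = (np.asarray(current) - np.asarray(desired)).ravel()
    L = interaction_matrix(current, depths)
    return -gain * np.linalg.pinv(L) @ e

# Four point features, slightly offset from the goal configuration
current = [(0.11, 0.10), (-0.09, 0.10), (-0.09, -0.10), (0.11, -0.10)]
desired = [(0.10, 0.10), (-0.10, 0.10), (-0.10, -0.10), (0.10, -0.10)]
v = ibvs_velocity(current, desired, depths=[0.5] * 4)
print("camera velocity [vx vy vz wx wy wz]:", np.round(v, 4))
```

In practice the true depths Z are unknown; implementations substitute the desired depths or an estimate, which affects the transient trajectory but not local convergence.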

Advantages of IBVS: no 3D model required, inherently robust to calibration errors since control happens in image space, and guaranteed convergence for small displacements. Disadvantages: camera trajectory in 3D space is not predictable (may produce unintuitive Cartesian paths), singularity risks when image Jacobian loses rank, and difficulty handling large rotations (particularly 180-degree rotations where features leave the field of view).

5.3 Position-Based Visual Servoing (PBVS)

PBVS first estimates the full 3D pose of the target from visual features, then computes a Cartesian velocity command to move the robot end-effector toward the desired 3D pose. The control law operates on the 6-DOF pose error in SE(3), producing straight-line Cartesian trajectories that are intuitive and predictable. PBVS requires accurate camera calibration and a 3D model of the target for pose estimation.

Advantages of PBVS: predictable Cartesian motion paths, natural handling of large displacements and rotations, and easy integration with collision avoidance systems that operate in Cartesian space. Disadvantages: sensitivity to calibration errors (camera intrinsics, hand-eye transform), reliance on accurate 3D pose estimation, and potential for features to leave the camera's field of view during servoing if the initial pose error is large.
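The PBVS control law reduces to a proportional feedback on the SE(3) pose error: the translation error directly, and the rotation error expressed as an axis-angle vector via the SO(3) log map. A minimal numpy sketch with illustrative poses (a goal 10 cm away with a 20-degree yaw offset):

```python
import numpy as np

def rotation_log(R):
    """Axis-angle vector theta*u of a rotation matrix (SO(3) log map,
    valid away from theta = pi)."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-9:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2 * np.sin(theta)) * w

def pbvs_command(T_cur, T_des, gain=0.5):
    """Proportional velocity command from the 6-DOF pose error in SE(3)."""
    v_lin = gain * (T_des[:3, 3] - T_cur[:3, 3])
    R_err = T_des[:3, :3] @ T_cur[:3, :3].T   # rotation still to perform
    v_ang = gain * rotation_log(R_err)
    return v_lin, v_ang

# Illustrative demo: goal is 10 cm away with a 20-degree yaw offset
a = np.deg2rad(20)
T_cur = np.eye(4)
T_cur[:3, :3] = [[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]]
T_cur[:3, 3] = [0.10, 0.0, 0.40]
T_des = np.eye(4)
T_des[:3, 3] = [0.0, 0.0, 0.50]
v_lin, v_ang = pbvs_command(T_cur, T_des)
```

Because the translational command points straight at the goal, integrating this law yields the straight-line Cartesian trajectories described above.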

5.4 Hybrid Approaches

Modern implementations increasingly adopt hybrid visual servoing strategies that combine IBVS and PBVS advantages. A common approach uses PBVS for translational motion (predictable Cartesian path) and IBVS for rotational control (robust to calibration errors). Partitioned approaches decouple the control into translational and rotational components, each using the most appropriate servoing strategy. Deep learning-based visual servoing replaces hand-crafted features with learned feature representations, using convolutional neural networks to directly predict velocity commands from raw images. These end-to-end methods show promise for unstructured environments but remain less reliable than classical approaches for precision industrial tasks.

When to Use Visual Servoing vs. Open-Loop Pick-and-Place

Use open-loop when: camera and robot are well-calibrated, objects are stationary, cycle time is critical, and 1-2mm accuracy is sufficient (e.g., standard bin picking with vacuum grippers).

Use visual servoing when: objects may move during approach (conveyor tracking), sub-millimeter accuracy is required (PCB insertion, precision assembly), calibration drift is expected (thermal expansion, mobile robots), or the robot must react to real-time visual feedback (welding seam tracking, wire insertion).

6. Bin Picking

6.1 Random Bin Picking

Random bin picking is widely considered the most challenging practical application of computer vision in industrial robotics. The task requires the robot to identify, localize, and grasp individual parts from a disordered heap of randomly oriented objects in a bin. The difficulty arises from severe occlusion (objects partially hidden by other objects), entanglement (parts interlocking), reflective or dark surfaces that challenge 3D cameras, and the need to plan collision-free approach paths into a confined bin volume.

A production random bin picking pipeline typically follows this sequence: (1) acquire 3D point cloud of the bin contents from a structured light camera mounted above the bin; (2) segment the bin walls and remove background points; (3) detect individual objects using 3D instance segmentation or 2D detection projected onto the 3D cloud; (4) estimate the 6-DOF pose of the topmost (most accessible) objects; (5) score candidate grasps based on collision clearance, grasp quality metrics, and reachability; (6) execute the highest-scored grasp with the robot; (7) verify grasp success using force/torque feedback or re-scan.
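Step (5), grasp scoring, is typically a weighted combination of accessibility heuristics with reachability as a hard constraint. The sketch below is illustrative only: the weights, the 3 cm clearance saturation, and the candidate fields are made-up tuning values, not taken from any commercial system:

```python
import numpy as np

def score_grasp(candidate, bin_top_z, weights=(0.6, 0.4)):
    """Illustrative grasp score for step (5). Hypothetical weights and
    thresholds; real systems tune these per part and gripper."""
    if candidate['reachable'] < 0.5:           # hard constraint: IK solution required
        return -np.inf
    w_height, w_clear = weights
    height = np.clip(candidate['centroid_z'] / bin_top_z, 0.0, 1.0)  # prefer topmost
    clearance = np.clip(candidate['clearance_m'] / 0.03, 0.0, 1.0)   # approach-path room
    return w_height * height + w_clear * clearance

candidates = [
    {'id': 0, 'centroid_z': 0.18, 'clearance_m': 0.010, 'reachable': 1.0},
    {'id': 1, 'centroid_z': 0.26, 'clearance_m': 0.004, 'reachable': 1.0},
    {'id': 2, 'centroid_z': 0.28, 'clearance_m': 0.030, 'reachable': 0.0},
]
best = max(candidates, key=lambda c: score_grasp(c, bin_top_z=0.30))
print("pick candidate", best['id'])
```

Note that candidate 2 sits highest in the bin but is rejected outright: an unreachable grasp should never win on soft criteria, which is why reachability is a gate rather than a weighted term.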

Leading commercial bin picking solutions (Photoneo Bin Picking Studio, Mech-Mind Mech-Vision, SICK PLB, Zivid + Pickit) bundle calibrated 3D cameras with integrated perception software, reducing deployment from months of custom development to days of configuration. These systems achieve 99%+ pick success rates on well-characterized parts at cycle times of 6-12 seconds per pick.

6.2 Structured Picking and Depalletizing

Structured Bin Picking: When parts arrive in known arrangements (trays, blister packs, organized layers), simplified vision algorithms suffice. Template matching or CAD-guided detection locates parts with known spacing, requiring only compensation for tray misalignment and layer height variation. Cycle times drop to 2-4 seconds per pick with higher reliability than random picking.

Depalletizing with Vision: Vision-guided depalletizing uses overhead 3D cameras to detect the top layer of cases or bags on a pallet, determine pick points, and generate layer-by-layer unloading sequences. ToF cameras (for speed) or structured light cameras (for accuracy on shiny packaging) provide the 3D scene understanding. The algorithm must handle mixed-SKU pallets, damaged cartons, shrink-wrapped loads, and slip sheets between layers. Modern depalletizing systems from Cognex, SICK, and Mech-Mind achieve 99.5%+ reliability at 600-1000 cases per hour.

Key figures:

- 99%+: Pick Success Rate (Commercial Systems)
- 6-12s: Random Bin Pick Cycle Time
- 1000: Cases/Hour Depalletizing Rate
- 0.5mm: Typical Pose Estimation Accuracy

7. Quality Inspection

7.1 Defect Detection with Deep Learning

Deep learning has fundamentally transformed automated visual inspection, replacing hand-crafted feature engineering with data-driven defect models that generalize across variations in lighting, positioning, and material surface properties. The dominant architectures for industrial defect detection fall into three categories.

Supervised Classification/Segmentation: When labeled defect data is available, models like U-Net, DeepLabv3+, and YOLO-seg directly segment defect regions at pixel level. Training requires 200-2000 annotated defect images per class, which can be augmented with synthetic data generation. This approach delivers the highest accuracy (95-99.5% defect detection rate) but requires labeled datasets for each defect type.

Anomaly Detection (Unsupervised): When defect samples are rare or unknown, anomaly detection models learn the distribution of normal (good) parts and flag deviations. PatchCore, PaDiM, and FastFlow architectures from the anomalib library achieve 95-98% AUROC on the MVTec Anomaly Detection benchmark using only defect-free training images. This approach dramatically reduces data collection requirements and naturally handles novel defect types not seen during training. For production deployment, anomaly detection is particularly valuable in industries like semiconductor fabrication and precision machining where new defect modes emerge unpredictably.
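The core idea behind PatchCore-style scoring can be sketched with a nearest-neighbor memory bank. This is a toy version on random features; a real implementation uses CNN patch embeddings and coreset subsampling of the bank:

```python
import numpy as np

def anomaly_score(test_feats, memory_bank):
    """PatchCore-style scoring: the image-level score is the maximum,
    over test patches, of the distance to the nearest 'normal' feature."""
    # pairwise distances: (n_test, n_bank)
    d = np.linalg.norm(test_feats[:, None, :] - memory_bank[None, :, :], axis=2)
    patch_scores = d.min(axis=1)   # nearest-neighbor distance per patch
    return patch_scores.max(), patch_scores

rng = np.random.default_rng(1)
memory_bank = rng.normal(0, 1, size=(500, 32))  # features from defect-free parts
normal = rng.normal(0, 1, size=(64, 32))        # in-distribution test patches
defect = normal.copy()
defect[10] += 8.0                                # one strongly deviating patch
s_normal, _ = anomaly_score(normal, memory_bank)
s_defect, _ = anomaly_score(defect, memory_bank)
print(f"normal: {s_normal:.2f}  defect: {s_defect:.2f}")
```

The per-patch scores double as a localization map: upsampling them to image resolution produces the anomaly heatmaps reported on MVTec AD.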

Few-Shot and Foundation Models: Vision-language models (CLIP, BLIP-2) and segment-anything approaches enable defect detection with minimal labeled data. An inspector can describe a defect type in natural language ("scratch on polished surface", "solder bridge between pads") and the model identifies matching regions. While accuracy lags behind fully supervised approaches, the near-zero setup time makes this attractive for low-volume, high-mix manufacturing common in Vietnamese contract manufacturing facilities.

7.2 Surface Inspection Systems

Surface inspection requires specialized illumination strategies to reveal defects. Bright-field illumination (direct on-axis light) highlights color defects, contamination, and markings. Dark-field illumination (low-angle grazing light) reveals surface topography defects like scratches, dents, and texture irregularities by scattering light at defect edges. Dome illumination provides diffuse, shadow-free lighting for inspecting curved or reflective surfaces. Structured illumination (photometric stereo with multiple light directions) reconstructs surface normal maps that reveal micro-topography invisible under standard lighting. Production surface inspection systems from Cognex (In-Sight), Keyence, and ISRA Vision combine optimized illumination with specialized optics and real-time deep learning inference for throughputs exceeding 10 parts per second.
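For a Lambertian surface, the photometric stereo step mentioned above reduces to a per-pixel least-squares solve against the known light directions: I = L g, with g = albedo * normal. A minimal numpy sketch on synthetic data (light directions and image size are illustrative):

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """Lambertian photometric stereo: per pixel, solve I = L @ g where
    g = albedo * normal, via the pseudo-inverse of the light matrix."""
    k, h, w = images.shape
    I = images.reshape(k, -1)                  # (k, h*w) stacked intensities
    G = np.linalg.pinv(light_dirs) @ I         # (3, h*w), g = albedo * n
    albedo = np.linalg.norm(G, axis=0)
    normals = (G / np.maximum(albedo, 1e-9)).T.reshape(h, w, 3)
    return normals, albedo.reshape(h, w)

# Synthetic sanity check: a flat Lambertian patch facing the camera
lights = np.array([[0.0, 0.0, 1.0],
                   [0.5, 0.0, 0.866],
                   [0.0, 0.5, 0.866]])        # three non-coplanar light directions
n_true = np.array([0.0, 0.0, 1.0])
images = np.stack([np.full((4, 4), lights[i] @ n_true) for i in range(3)])
normals, albedo = photometric_stereo(images, lights)
```

Three non-coplanar lights are the minimum; production systems use four or more and solve in a least-squares sense, which also suppresses shadows and specular outliers.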

7.3 Dimensional Measurement

Vision-based dimensional measurement replaces contact gauging (calipers, CMMs) with non-contact optical methods. 2D measurement using calibrated telecentric lenses achieves 5-20 micron accuracy for in-plane dimensions. 3D measurement using structured light or laser triangulation extends this to height, flatness, and volumetric dimensions with 10-50 micron accuracy. Critical success factors include thermal stability (camera and lens expand with temperature), vibration isolation, and traceable calibration against certified reference artifacts. For GD&T (Geometric Dimensioning and Tolerancing) compliance, vision measurement systems must be validated per MSA (Measurement System Analysis) protocols with documented Gage R&R studies.

8. Calibration

8.1 Intrinsic Calibration

Camera intrinsic calibration determines the internal parameters that map 3D world points to 2D pixel coordinates: focal length (fx, fy), principal point (cx, cy), and lens distortion coefficients (radial k1-k6, tangential p1-p2). Standard calibration uses Zhang's method with a planar checkerboard pattern captured from 15-30 viewpoints. OpenCV's calibrateCamera() function implements this with sub-pixel corner detection. For production-grade calibration, use a machine-printed target (not laser-printed) on flat glass or ceramic substrate. Reprojection error below 0.3 pixels indicates good calibration; below 0.1 pixels is excellent. Recalibrate whenever the lens is adjusted, the camera is remounted, or operating temperature changes significantly.
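The reprojection-error metric quoted above is straightforward to compute directly. The sketch below uses a simplified pinhole model with two radial terms (a subset of OpenCV's full distortion model) and synthetic corner observations with 0.1 px noise; all numeric values are illustrative:

```python
import numpy as np

def project(points_cam, fx, fy, cx, cy, k1=0.0, k2=0.0):
    """Pinhole projection with two radial distortion terms (a subset of
    the full OpenCV distortion model)."""
    x = points_cam[:, 0] / points_cam[:, 2]
    y = points_cam[:, 1] / points_cam[:, 2]
    r2 = x * x + y * y
    d = 1 + k1 * r2 + k2 * r2 * r2
    return np.stack([fx * x * d + cx, fy * y * d + cy], axis=1)

def rms_reprojection_error(observed_px, points_cam, **intrinsics):
    """RMS pixel distance between observed and reprojected points."""
    diff = project(points_cam, **intrinsics) - observed_px
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))

# Synthetic check: corners with 0.1 px observation noise should score
# well under the 0.3 px 'good calibration' threshold
rng = np.random.default_rng(2)
pts = np.column_stack([rng.uniform(-0.2, 0.2, 200),
                       rng.uniform(-0.2, 0.2, 200),
                       rng.uniform(0.5, 1.0, 200)])
intr = dict(fx=1400.0, fy=1400.0, cx=960.0, cy=600.0, k1=-0.1)
observed = project(pts, **intr) + rng.normal(0.0, 0.1, (200, 2))
err = rms_reprojection_error(observed, pts, **intr)
print(f"RMS reprojection error: {err:.3f} px")
```

OpenCV's calibrateCamera() reports this same statistic after optimizing the parameters, so the function above is mainly useful for re-validating a stored calibration against fresh target captures.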

8.2 Hand-Eye Calibration

Hand-eye calibration determines the rigid transformation between the robot end-effector (hand) and the camera (eye). This transform is essential for converting object poses detected in camera coordinates to robot base coordinates for manipulation. Two configurations exist.

Eye-in-Hand: Camera mounted on the robot's wrist, moving with the end-effector. The calibration solves AX = XB, where A is the robot motion between two poses (known from forward kinematics), B is the camera motion (computed from observing a fixed calibration target), and X is the unknown hand-eye transform. At least 3 non-degenerate motions are required; 8-15 motions with diverse orientations yield robust results. OpenCV implements Tsai-Lenz, Park, Horaud, and Daniilidis solvers via calibrateHandEye().

Eye-to-Hand: Camera fixed in the workspace, observing the robot. The calibration solves AX = ZB, where Z is the transform from robot base to camera. This configuration is standard for overhead bin picking cameras. Calibration requires the robot to present a calibration target (mounted on the flange) at 8-15 diverse poses within the camera's field of view. OpenCV's calibrateRobotWorldHandEye() solves this variant.

8.3 Multi-Camera Systems

Complex robotic cells often employ multiple cameras for complete scene coverage: an overhead camera for coarse localization, an eye-in-hand camera for fine alignment, and side cameras for quality verification. Calibrating multi-camera systems requires establishing a common coordinate frame. Extrinsic calibration between cameras can use shared observations of a calibration target visible to both cameras simultaneously, or transitively through the robot coordinate frame (if each camera is independently hand-eye calibrated to the same robot). For multi-camera stereo setups, OpenCV's stereoCalibrate() jointly optimizes both camera intrinsics and the relative extrinsic transformation.
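Transitive extrinsic calibration through the robot frame reduces to chaining homogeneous transforms: if each camera's pose in the robot base frame is known from hand-eye calibration, the camera-to-camera transform follows by composition. A small numpy sketch with illustrative poses:

```python
import numpy as np

def chain_extrinsics(base_T_cam1, base_T_cam2):
    """Relative transform between two cameras that were each hand-eye
    calibrated to the same robot base:
    cam1_T_cam2 = inv(base_T_cam1) @ base_T_cam2."""
    return np.linalg.inv(base_T_cam1) @ base_T_cam2

def make_pose(rot_z_deg, t):
    """Helper: 4x4 transform from a yaw angle and translation."""
    a = np.deg2rad(rot_z_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]]
    T[:3, 3] = t
    return T

base_T_cam1 = make_pose(90, [1.0, 0.0, 2.0])    # overhead camera (illustrative)
base_T_cam2 = make_pose(-30, [0.5, 0.8, 1.5])   # side camera (illustrative)
cam1_T_cam2 = chain_extrinsics(base_T_cam1, base_T_cam2)

# A point observed by camera 2 maps consistently into camera 1's frame
p_cam2 = np.array([0.1, -0.2, 0.9, 1.0])
p_base = base_T_cam2 @ p_cam2
p_cam1_direct = np.linalg.inv(base_T_cam1) @ p_base
p_cam1_chained = cam1_T_cam2 @ p_cam2
```

The caveat is error accumulation: the camera-to-camera transform inherits the errors of both hand-eye calibrations, so a shared-target extrinsic calibration is preferred when the cameras' fields of view overlap.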

# Hand-Eye Calibration with OpenCV (Eye-in-Hand Configuration)
import cv2
import numpy as np

def perform_hand_eye_calibration(robot_poses, target_poses):
    """
    Compute hand-eye transform from paired robot and camera observations.
    Args:
        robot_poses: list of 4x4 homogeneous transforms (base_T_ee)
        target_poses: list of 4x4 homogeneous transforms (cam_T_target)
    Returns:
        4x4 hand-eye transform (ee_T_cam)
    """
    R_gripper2base, t_gripper2base = [], []
    R_target2cam, t_target2cam = [], []
    for pose in robot_poses:
        R_gripper2base.append(pose[:3, :3])
        t_gripper2base.append(pose[:3, 3].reshape(3, 1))
    for pose in target_poses:
        R_target2cam.append(pose[:3, :3])
        t_target2cam.append(pose[:3, 3].reshape(3, 1))

    # Solve AX = XB using Tsai-Lenz method
    R_cam2ee, t_cam2ee = cv2.calibrateHandEye(
        R_gripper2base, t_gripper2base,
        R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI
    )

    # Construct 4x4 homogeneous transform
    ee_T_cam = np.eye(4)
    ee_T_cam[:3, :3] = R_cam2ee
    ee_T_cam[:3, 3] = t_cam2ee.flatten()

    # Report the recovered transform; validate with the consistency check below
    print(f"Hand-eye translation: {t_cam2ee.flatten()}")
    print(f"Hand-eye rotation (rodrigues): {cv2.Rodrigues(R_cam2ee)[0].flatten()}")
    return ee_T_cam

# Validate calibration quality
def validate_calibration(ee_T_cam, robot_poses, target_poses):
    """Compute pose-consistency error across all calibration poses.
    Since the target is fixed, its base-frame position should be
    identical when chained through every robot pose."""
    errors = []
    prev_target = None
    for i in range(len(robot_poses)):
        base_T_ee = robot_poses[i]
        cam_T_target = target_poses[i]
        # Chain: base_T_target = base_T_ee @ ee_T_cam @ cam_T_target
        base_T_target = base_T_ee @ ee_T_cam @ cam_T_target
        # Compare across pose pairs for consistency
        if prev_target is not None:
            delta = np.linalg.norm(base_T_target[:3, 3] - prev_target[:3, 3])
            errors.append(delta)
        prev_target = base_T_target
    print(f"Mean consistency error: {np.mean(errors)*1000:.2f} mm")
    print(f"Max consistency error: {np.max(errors)*1000:.2f} mm")

9. Edge AI Platforms

9.1 Why Edge Inference for Robotics

Robotic vision systems demand low-latency, deterministic inference that cloud-based processing cannot reliably provide. Network round-trip latency (10-100ms to cloud) exceeds the response time requirements of visual servoing (5-10ms loop), safety-rated obstacle detection (under 20ms), and high-speed conveyor tracking. Edge AI accelerators colocated with the robot's camera system eliminate network dependency, provide deterministic latency, and operate in air-gapped manufacturing environments that prohibit external data transmission for IP protection. The economics also favor edge: after initial hardware cost, inference is essentially free, versus per-query cloud API costs that scale linearly with throughput.
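Because those control budgets are set by worst-case rather than average latency, it is worth benchmarking tail latency explicitly. A minimal harness (with a stand-in inference function; substitute your real detector call) that reports p50/p99:

```python
import time
import numpy as np

def benchmark_latency(infer_fn, n_warmup=20, n_runs=200):
    """Measure per-call latency of infer_fn and report percentiles.
    Tail latency (p99/max), not the mean, determines whether a
    vision stage fits a real-time control budget."""
    for _ in range(n_warmup):   # warm caches, GPU clocks, JIT paths
        infer_fn()
    samples_ms = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        infer_fn()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    return {
        'p50_ms': float(np.percentile(samples_ms, 50)),
        'p99_ms': float(np.percentile(samples_ms, 99)),
        'max_ms': float(np.max(samples_ms)),
    }

# Stand-in for a real inference call (e.g., executing a TensorRT engine)
def fake_inference():
    np.dot(np.ones((64, 64)), np.ones((64, 64)))

stats = benchmark_latency(fake_inference)
print(stats)
```

Running this on the target edge device, with the camera pipeline active, gives the number that should be compared against the 5-20ms budgets above.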

9.2 Platform Comparison

| Platform | AI Performance | Power | GPU/NPU | Price (Module) | Best For |
|---|---|---|---|---|---|
| NVIDIA Jetson Orin NX 16GB | 100 TOPS (INT8) | 10-25W | 1024 CUDA + 32 Tensor | ~$600 | Multi-model pipelines |
| NVIDIA Jetson AGX Orin 64GB | 275 TOPS (INT8) | 15-60W | 2048 CUDA + 64 Tensor | ~$1,600 | Autonomous robots, multi-cam |
| NVIDIA Jetson Orin Nano 8GB | 40 TOPS (INT8) | 7-15W | 512 CUDA + 16 Tensor | ~$250 | Single-camera detection |
| Intel Neural Compute Stick 2 | 4 TOPS (INT8) | ~1.5W | Myriad X VPU | ~$70 | Low-power classification |
| Google Coral Edge TPU (M.2) | 4 TOPS (INT8) | 2W | Edge TPU ASIC | ~$30 | TFLite single-model |
| Hailo-8 (M.2) | 26 TOPS (INT8) | 2.5W | Custom dataflow NPU | ~$100 | Multi-stream, high efficiency |
| Hailo-15H (coming) | 20 TOPS + ISP | 3W | NPU + vision proc. | ~$40 | Smart cameras |

9.3 NVIDIA Jetson Ecosystem

The NVIDIA Jetson platform dominates robotic edge AI due to its combination of GPU compute, mature software ecosystem, and direct compatibility with training frameworks. The deployment pipeline typically flows: train on desktop/cloud GPU (PyTorch/TensorFlow) -> export to ONNX -> optimize with TensorRT -> deploy on Jetson. TensorRT optimization routinely delivers 2-5x speedup over native PyTorch inference through layer fusion, precision calibration (FP32 to FP16/INT8), and kernel auto-tuning. NVIDIA's Isaac ROS platform provides pre-built, GPU-accelerated ROS2 nodes for stereo depth estimation (using DNN-based disparity), visual SLAM (cuVSLAM), object detection (DOPE, CenterPose), and 3D perception (nvblox occupancy mapping).

For production robotics, the Jetson Orin NX 16GB represents the sweet spot: sufficient performance to run a YOLOv8-medium detector, a depth estimation model, and point cloud processing simultaneously at 15+ fps within a 25W power envelope. The AGX Orin 64GB is reserved for autonomous mobile robots running concurrent SLAM, multi-camera detection, and path planning, or for running large models like SAM alongside real-time detection.

9.4 Hailo and Emerging Alternatives

Hailo's dataflow-architecture NPU has emerged as a compelling alternative for multi-stream edge applications. The Hailo-8 delivers 26 TOPS at just 2.5W, offering roughly 10x better TOPS/watt than Jetson Orin. The Hailo Dataflow Compiler converts ONNX/TFLite models with automatic quantization and scheduling. The platform is particularly attractive for multi-camera quality inspection systems where 4-8 camera streams must be processed simultaneously on a single edge device. Hailo's partnership with Raspberry Pi (the Hailo-based AI Kit for the Raspberry Pi 5) is driving adoption in research and low-cost robotics applications.

10. Software Frameworks

10.1 Open-Source Frameworks

OpenCV (Open Source Computer Vision Library): The foundational library for computer vision across all domains. OpenCV 4.x provides 2500+ algorithms covering image processing, feature detection, camera calibration, stereo vision, object detection (DNN module for running ONNX/TensorFlow/Caffe models), ArUco/ChArUco marker detection, and optical flow. OpenCV's DNN module supports inference on CPU, CUDA GPU, and OpenVINO backends, making it the universal preprocessing and inference layer for robotics vision. Every serious robotics vision system uses OpenCV, either directly or through higher-level wrappers.

OpenCV Contrib: Extended modules include ArUco marker detection (essential for robot calibration and fiducial tracking), structured light pattern generation, surface matching (3D object recognition), and xfeatures2d (SIFT, SURF feature detectors). The cv2.aruco module is particularly valuable for robotics, providing robust 6-DOF pose estimation from printed markers for calibration validation, fixture alignment, and simple object tracking.

10.2 Commercial Frameworks

MVTec HALCON: The gold standard for industrial machine vision, used in over 90% of major automotive inspection systems worldwide. HALCON provides an integrated development environment (HDevelop) with 2000+ operators covering blob analysis, template matching (shape-based, correlation-based, deformable), 3D vision (surface matching, 3D pose estimation), deep learning (anomaly detection, classification, semantic segmentation), barcode/OCR, and calibration. HALCON's shape-based matching is unrivaled in speed and robustness, locating trained patterns in under 10ms even under significant rotation, scaling, and partial occlusion. License cost: $3,500-$8,000 per runtime seat, which is justified for high-reliability industrial deployments.

Cognex VisionPro: Cognex's PC-based vision software provides PatMax (geometric pattern matching), PatInspect (defect detection), IDMax (barcode reading), and deep learning tools. Cognex hardware (In-Sight cameras, DataMan readers) integrates tightly with VisionPro for turnkey inspection solutions. Cognex's deep learning edge (Cognex ViDi) requires minimal training data (as few as 20 images) and deploys to In-Sight cameras for standalone edge inference. Market strength: strongest in consumer electronics and semiconductor inspection.

Matrox Imaging Library (MIL): Matrox provides high-performance image processing with particular strength in multi-camera systems, line scan processing, and GigE Vision/Camera Link capture. MIL X is the latest generation supporting GPU-accelerated processing and deep learning inference. Matrox's SureDotOCR and PatternFinder are widely used in pharmaceutical packaging inspection and PCB assembly verification.

| Framework | License | Strengths | Deep Learning | 3D Vision | ROS2 Support |
|---|---|---|---|---|---|
| OpenCV | Free (Apache 2.0) | Universal, huge community | DNN inference only | Basic (stereo, calib) | Native via cv_bridge |
| HALCON | $3.5K-$8K/seat | Industrial reliability, matching | Integrated (train+deploy) | Excellent | C++ interface |
| Cognex VisionPro | $5K-$15K/seat | PatMax, turnkey hardware | Cognex ViDi | Good (3D-A5000) | Limited |
| Matrox MIL | $3K-$10K/seat | Multi-cam, line scan | MIL DL | Good | Limited |
| Open3D | Free (MIT) | Point clouds, reconstruction | Tensor integration | Excellent | Python bridge |

11. Integration with ROS2

11.1 ROS2 Vision Architecture

ROS2 (Robot Operating System 2) provides the middleware layer that connects camera drivers, perception algorithms, and robot controllers into a coherent vision-guided manipulation pipeline. The key architectural components for vision integration in ROS2 Humble/Iron/Jazzy are organized into standardized packages with well-defined message types and topic conventions.

image_pipeline: The core set of packages for camera processing in ROS2. image_transport handles efficient image transmission with pluggable compression (raw, compressed JPEG/PNG, theora video). image_proc performs debayering (converting raw Bayer patterns to color), rectification (undistorting images using calibration parameters), and resizing. depth_image_proc converts depth images to point clouds and registers color onto depth. stereo_image_proc computes disparity maps and 3D point clouds from calibrated stereo camera pairs. These packages form the foundation that downstream perception nodes build upon.

cv_bridge: The bridge between ROS2 sensor_msgs/Image messages and OpenCV cv::Mat (C++) or NumPy arrays (Python). Every vision node that processes images uses cv_bridge for format conversion. In ROS2, cv_bridge supports zero-copy transport when publisher and subscriber are in the same process, eliminating the image copy overhead that can bottleneck high-resolution pipelines.

11.2 Point Cloud Topics and Processing

3D perception in ROS2 centers on the sensor_msgs/PointCloud2 message type, which carries dense or organized point clouds with arbitrary fields (XYZ, RGB, normals, intensity). The pcl_ros package bridges PCL data structures with ROS2 messages. Key topic conventions include /camera/depth/points for raw depth camera point clouds, /camera/depth_registered/points for color-registered clouds, and custom topic names for processed/filtered clouds. Downstream nodes subscribe to these topics for object detection, segmentation, and grasp planning.
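Before clustering or grasp planning, downstream nodes typically crop the cloud to the robot's workspace. With the cloud as an Nx3 NumPy array (as obtained from read_points or pcl_ros), the filter is a single boolean mask; the points below are a hypothetical example:

```python
import numpy as np

def crop_to_workspace(points, x_lim, y_lim, z_lim):
    """Keep only points inside an axis-aligned workspace box.
    points: (N, 3) array of XYZ in the camera or base frame."""
    mask = (
        (points[:, 0] >= x_lim[0]) & (points[:, 0] <= x_lim[1]) &
        (points[:, 1] >= y_lim[0]) & (points[:, 1] <= y_lim[1]) &
        (points[:, 2] >= z_lim[0]) & (points[:, 2] <= z_lim[1])
    )
    return points[mask]

# Hypothetical cloud: two valid points plus background and near-field noise
cloud = np.array([[0.0,  0.0, 0.5],
                  [0.1, -0.2, 0.9],
                  [2.0,  0.0, 3.0],    # background clutter, outside box
                  [0.0,  0.0, 0.05]])  # too close: likely sensor noise
cropped = crop_to_workspace(cloud, (-0.5, 0.5), (-0.5, 0.5), (0.2, 1.5))
print(len(cropped))  # -> 2
```

The output of such a filter is what would typically be republished on a topic like /vision/filtered_cloud for the segmentation and grasp-planning nodes downstream.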

For GPU-accelerated 3D perception, NVIDIA's Isaac ROS provides nvblox (real-time occupancy mapping and ESDF computation for collision avoidance), isaac_ros_depth_segmentation, and isaac_ros_cuMotion (GPU-accelerated motion planning aware of 3D obstacles). These nodes leverage CUDA and Jetson hardware for throughput that pure CPU implementations cannot match.

11.3 Complete Vision-Guided Pick Pipeline

# ROS2 Vision-Guided Pick-and-Place Node (Python)
# Subscribes to camera image, runs YOLO detection, publishes pick targets
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, PointCloud2, CameraInfo
from geometry_msgs.msg import PoseStamped, Point
from visualization_msgs.msg import Marker, MarkerArray
from cv_bridge import CvBridge
import cv2
import numpy as np
from ultralytics import YOLO
import sensor_msgs_py.point_cloud2 as pc2

class VisionPickNode(Node):
    def __init__(self):
        super().__init__('vision_pick_node')

        # Parameters
        self.declare_parameter('model_path', 'best.engine')
        self.declare_parameter('confidence_threshold', 0.75)
        self.declare_parameter('camera_frame', 'camera_color_optical_frame')
        model_path = self.get_parameter('model_path').value
        self.conf_thresh = self.get_parameter('confidence_threshold').value
        self.camera_frame = self.get_parameter('camera_frame').value

        # Initialize YOLO model
        self.model = YOLO(model_path, task='detect')
        self.bridge = CvBridge()
        self.camera_info = None
        self.latest_cloud = None

        # Subscribers
        self.create_subscription(
            Image, '/camera/color/image_raw', self.image_callback, 10)
        self.create_subscription(
            PointCloud2, '/camera/depth_registered/points',
            self.cloud_callback, 10)
        self.create_subscription(
            CameraInfo, '/camera/color/camera_info',
            self.caminfo_callback, 10)

        # Publishers
        self.pick_pub = self.create_publisher(
            PoseStamped, '/vision/pick_target', 10)
        self.marker_pub = self.create_publisher(
            MarkerArray, '/vision/detections_viz', 10)
        self.debug_pub = self.create_publisher(
            Image, '/vision/debug_image', 10)

        self.get_logger().info('Vision Pick Node initialized')

    def caminfo_callback(self, msg):
        self.camera_info = msg

    def cloud_callback(self, msg):
        self.latest_cloud = msg

    def image_callback(self, msg):
        # Convert ROS Image to OpenCV
        cv_image = self.bridge.imgmsg_to_cv2(msg, 'bgr8')

        # Run YOLO detection
        results = self.model.predict(
            source=cv_image, conf=self.conf_thresh,
            imgsz=640, device=0, verbose=False)

        detections = []
        for r in results:
            for box in r.boxes:
                x1, y1, x2, y2 = box.xyxy[0].cpu().numpy().astype(int)
                cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
                cls_name = self.model.names[int(box.cls[0])]
                conf = float(box.conf[0])
                detections.append({
                    'class': cls_name, 'conf': conf,
                    'center': (cx, cy), 'bbox': (x1, y1, x2, y2)
                })
                # Draw on debug image
                cv2.rectangle(cv_image, (x1, y1), (x2, y2), (0, 255, 0), 2)
                cv2.putText(cv_image, f'{cls_name} {conf:.2f}', (x1, y1 - 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

        # Publish debug image
        self.debug_pub.publish(self.bridge.cv2_to_imgmsg(cv_image, 'bgr8'))

        # Project best detection to 3D using point cloud
        if detections and self.latest_cloud is not None:
            best = max(detections, key=lambda d: d['conf'])
            cx, cy = best['center']
            # Read 3D point at detection center from organized cloud
            points = list(pc2.read_points(
                self.latest_cloud, field_names=('x', 'y', 'z'),
                skip_nans=True, uvs=[(cx, cy)]))
            if points:
                x3d, y3d, z3d = points[0]
                pick_pose = PoseStamped()
                pick_pose.header.frame_id = self.camera_frame
                pick_pose.header.stamp = self.get_clock().now().to_msg()
                pick_pose.pose.position = Point(
                    x=float(x3d), y=float(y3d), z=float(z3d))
                # Default orientation (approach from above)
                pick_pose.pose.orientation.w = 1.0
                self.pick_pub.publish(pick_pose)
                self.get_logger().info(
                    f'Pick target: {best["class"]} at '
                    f'[{x3d:.3f}, {y3d:.3f}, {z3d:.3f}]m')

def main():
    rclpy.init()
    node = VisionPickNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()

11.4 Launch File and Configuration

# ROS2 Launch File: vision_pick_launch.py
from launch import LaunchDescription
from launch_ros.actions import Node
from launch.actions import DeclareLaunchArgument
from launch.substitutions import LaunchConfiguration

def generate_launch_description():
    return LaunchDescription([
        DeclareLaunchArgument('model_path', default_value='best.engine',
                              description='Path to TensorRT YOLO model'),
        DeclareLaunchArgument('confidence', default_value='0.75',
                              description='Detection confidence threshold'),

        # Intel RealSense D435i camera driver
        Node(
            package='realsense2_camera',
            executable='realsense2_camera_node',
            name='camera',
            parameters=[{
                'enable_color': True,
                'enable_depth': True,
                'align_depth.enable': True,
                'pointcloud.enable': True,
                'enable_sync': True,
                'color_module.profile': '1280x720x30',
                'depth_module.profile': '1280x720x30',
            }],
        ),

        # Image rectification
        Node(
            package='image_proc',
            executable='rectify_node',
            name='rectify_color',
            remappings=[
                ('image', '/camera/color/image_raw'),
                ('camera_info', '/camera/color/camera_info'),
                ('image_rect', '/camera/color/image_rect'),
            ],
        ),

        # Vision pick node
        Node(
            package='robot_vision',
            executable='vision_pick_node',
            name='vision_pick',
            parameters=[{
                'model_path': LaunchConfiguration('model_path'),
                'confidence_threshold': LaunchConfiguration('confidence'),
                'camera_frame': 'camera_color_optical_frame',
            }],
        ),

        # TF2 static transform: camera mount to robot base
        Node(
            package='tf2_ros',
            executable='static_transform_publisher',
            arguments=[
                '--x', '0.05', '--y', '0.0', '--z', '0.12',
                '--roll', '0.0', '--pitch', '0.785', '--yaw', '0.0',
                '--frame-id', 'tool0', '--child-frame-id', 'camera_link',
            ],
        ),
    ])

11.5 ROS2 Vision Topic Map

# Standard ROS2 Vision Topics for a Pick-and-Place System
#
# Camera Driver Output:
#   /camera/color/image_raw          [sensor_msgs/Image]              RGB image
#   /camera/color/camera_info        [sensor_msgs/CameraInfo]         Intrinsics
#   /camera/depth/image_rect_raw     [sensor_msgs/Image]              Depth map (16UC1, mm)
#   /camera/depth_registered/points  [sensor_msgs/PointCloud2]        XYZRGB cloud
#
# image_pipeline Output:
#   /camera/color/image_rect         [sensor_msgs/Image]              Undistorted RGB
#
# Vision Node Output:
#   /vision/detections               [vision_msgs/Detection2DArray]   2D detections
#   /vision/pick_target              [geometry_msgs/PoseStamped]      3D pick pose
#   /vision/debug_image              [sensor_msgs/Image]              Annotated image
#   /vision/detections_viz           [visualization_msgs/MarkerArray] RViz markers
#
# Point Cloud Processing:
#   /vision/filtered_cloud           [sensor_msgs/PointCloud2]        Cropped + denoised
#   /vision/object_clusters          [sensor_msgs/PointCloud2]        Segmented objects
#   /vision/plane_cloud              [sensor_msgs/PointCloud2]        Detected surfaces
Deployment Checklist for Production Vision Systems

Before deploying a vision-guided robotic system to production, validate the following:

  • Camera intrinsic calibration reprojection error below 0.3 pixels
  • Hand-eye calibration consistency error below 0.5mm
  • Object detection model validated on 500+ held-out test images with target domain data
  • Edge inference latency benchmarked end-to-end (capture to pick pose output) under 100ms
  • Lighting variation tested: worst-case ambient conditions still yield reliable detection
  • Thermal stability verified: calibration accuracy after 4+ hours of continuous operation
  • Failure mode handling: behavior defined for zero detections, low confidence, sensor dropout
  • Logging and monitoring: all detections, confidences, and cycle metrics recorded for analysis
Ready to Integrate Computer Vision into Your Robotic Systems?

Seraphim Vietnam provides end-to-end computer vision and robotics engineering, from camera selection and calibration through deep learning model development, edge AI deployment, and ROS2 system integration. Schedule a consultation to discuss your machine vision requirements.


© 2026 Seraphim Co., Ltd.