1. Executive Summary
The global machine vision market for robotics is projected to reach $21.4 billion by 2028, growing at a compound annual growth rate (CAGR) of 7.6%. In the APAC region, growth is notably faster at 9.8% CAGR, fueled by accelerating manufacturing automation in Vietnam, Thailand, China, and South Korea. Computer vision has evolved from a supplementary sensor modality into the primary perceptual backbone of modern robotic systems, enabling capabilities that were considered research-only problems just five years ago.
The convergence of three technology waves is reshaping what robots can see and do. First, 3D cameras have become affordable: entry-level depth sensors now cost under $500, while industrial-grade structured light systems deliver sub-millimeter accuracy at a fraction of their historical price. Second, real-time deep learning inference on edge AI accelerators (NVIDIA Jetson AGX Orin delivers 275 TOPS) allows sophisticated neural networks to run directly on the robot without cloud dependency. Third, foundation models like Meta's Segment Anything Model (SAM) and open-vocabulary detectors such as Grounding DINO enable robots to recognize and manipulate objects they have never been explicitly trained on, collapsing the deployment timeline from months of data collection to hours of prompt engineering.
This technical guide provides a complete reference for integrating computer vision into robotic systems. We cover the full pipeline from camera selection and calibration through 3D perception, object detection, visual servoing control loops, and production deployment on edge hardware. Every section includes practical code examples, comparison tables, and architecture recommendations drawn from our deployment experience across 30+ industrial vision systems in APAC manufacturing and logistics facilities.
2. Camera Technologies
2.1 2D Cameras
Two-dimensional cameras remain the workhorse of industrial machine vision, providing high-resolution texture and color information essential for inspection, barcode reading, and feature-based detection. The two primary architectures serve different application profiles.
Area Scan Cameras: Capture a complete 2D frame in a single exposure using a rectangular sensor array. Resolutions range from VGA (0.3 MP) to 151 MP in current industrial models from Basler, FLIR (Teledyne), and Allied Vision. Global shutter sensors (e.g., Sony IMX sensor family) freeze motion without rolling shutter artifacts, making them essential for inspecting fast-moving objects on conveyor lines. For robotics pick-and-place, 5-12 MP area scan cameras with GigE Vision or USB3 Vision interfaces provide the optimal balance of resolution, field of view, and frame rate. A 12 MP camera with a 16mm lens at 600mm working distance delivers approximately 0.15mm/pixel resolution, sufficient for most grasping applications.
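The working-distance arithmetic behind that 0.15mm/pixel figure can be sketched with the thin-lens approximation. The pixel pitch below (3.45 microns, common in Sony IMX industrial sensors) is an assumption for illustration, not a value stated above:

```python
# Object-space resolution of one pixel (ground sample distance) under
# the pinhole/thin-lens approximation: GSD = pitch * distance / focal.
# Assumed pixel pitch: 3.45 um (0.00345 mm), typical of Sony IMX sensors.
def ground_sample_distance(pixel_pitch_mm, working_distance_mm, focal_length_mm):
    """Approximate size of one pixel projected onto the object plane."""
    return pixel_pitch_mm * working_distance_mm / focal_length_mm

gsd = ground_sample_distance(0.00345, 600.0, 16.0)
print(round(gsd, 3))  # ~0.129 mm/pixel, consistent with the ~0.15 mm/pixel cited
```

The exact figure depends on the sensor's actual pixel pitch, which is why quoted values for the same lens and distance vary slightly between camera models.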
Line Scan Cameras: Use a single row of pixels (or trilinear RGB rows) to build images line by line as the object moves beneath the camera. Resolutions reach 16K pixels per line (16,384 pixels), delivering continuous inspection of materials without the field-of-view limitations of area scan sensors. Line scan excels at web inspection (textiles, paper, film), printed circuit board inspection, and any scenario with continuous linear motion. DALSA (Teledyne), Basler racer, and e2v line scan cameras dominate this segment. Triggering requires an encoder signal synchronized to material transport speed, making integration more complex than area scan but yielding superior results for high-speed continuous processes.
2.2 3D Cameras
Depth perception is the gateway to robotic manipulation in unstructured environments. Three principal 3D sensing technologies compete for different application niches.
Structured Light: Projects a known pattern (typically a sequence of coded stripes or speckle patterns) onto the scene and triangulates depth from the observed pattern deformation. Delivers the highest accuracy among 3D technologies, with leading systems (Photoneo PhoXi, Ensenso, Zivid) achieving point cloud accuracy of 0.05-0.3mm. Structured light excels at bin picking, precision assembly, and quality measurement. The primary limitation is sensitivity to ambient infrared light, which can interfere with the projected pattern in outdoor or brightly lit environments. Acquisition time ranges from 0.1s to 2s depending on exposure settings, making it best suited for quasi-static scenes.
Time-of-Flight (ToF): Measures depth by calculating the round-trip time of emitted infrared light pulses. Modern ToF sensors (Basler blaze, Lucid Helios2+, ifm O3D) deliver real-time depth maps at 30-60 fps with accuracy of 5-15mm. ToF cameras operate well in varied lighting conditions and provide consistent frame rates regardless of scene complexity. Their lower accuracy compared to structured light limits precision applications but makes them ideal for navigation, obstacle avoidance, volumetric measurement, and coarse bin picking where cycle time matters more than sub-millimeter precision.
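The round-trip principle also explains why ToF accuracy is bounded by timing precision. A minimal sketch of the arithmetic (pulse-based ToF; many commercial sensors instead measure phase shift of a modulated signal, but the depth-per-time relation is the same):

```python
C = 299_792_458.0  # speed of light in m/s

def tof_depth_m(round_trip_s):
    """Depth from the round-trip time of an emitted light pulse:
    the pulse travels to the surface and back, so depth = c * t / 2."""
    return C * round_trip_s / 2.0

def timing_for_depth_resolution_s(depth_step_m):
    """Round-trip timing precision needed to resolve a given depth step."""
    return 2.0 * depth_step_m / C

print(round(tof_depth_m(6.67e-9), 3))        # a ~6.67 ns round trip is ~1 m of depth
print(timing_for_depth_resolution_s(0.005))  # resolving 5 mm requires ~33 ps timing
```

The picosecond-scale timing needed for millimeter resolution is why ToF sits in the 5-15mm accuracy band rather than competing with triangulation-based methods.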
Stereo Vision: Computes depth by triangulating corresponding features across two or more calibrated cameras. Intel RealSense D400 series and Stereolabs ZED 2i are popular embedded stereo systems. Stereo vision works outdoors, produces dense depth maps, and scales to long range (10m+). Active stereo systems project an IR texture pattern to assist correspondence matching on textureless surfaces, significantly improving reliability for indoor robotics. Accuracy depends on baseline distance and resolution, typically 1-5% of range, placing it between ToF and structured light.
2.3 Specialized Sensors
Event Cameras (Dynamic Vision Sensors): Neuromorphic sensors that independently report per-pixel brightness changes asynchronously, with microsecond temporal resolution. Prophesee and iniVation DVS sensors deliver 120dB dynamic range (versus 60dB for conventional cameras) and zero motion blur. In robotics, event cameras enable high-speed tracking of fast-moving objects, visual servoing at kilohertz rates, and reliable perception in extreme lighting conditions. Research demonstrations show event-camera-based grasping of objects thrown at the robot, with reaction times under 10ms. Adoption is accelerating as ROS2 drivers and deep learning frameworks mature.
Thermal Cameras (LWIR): Long-wave infrared cameras (8-14 micron wavelength) from FLIR, Seek Thermal, and InfiRay detect thermal radiation independent of visible lighting. Critical applications in robotics include weld seam tracking on hot workpieces, food safety inspection, predictive maintenance (detecting overheating bearings or motors), and human detection for safety systems in collaborative robot cells. Thermal cameras are also invaluable in agricultural robotics for crop stress monitoring and fruit ripeness detection.
| Technology | Accuracy | Range | Speed | Ambient Light | Cost Range | Best For |
|---|---|---|---|---|---|---|
| Structured Light | 0.05-0.3mm | 0.2-2.0m | 0.1-2s | Indoor only | $3K-$15K | Bin picking, precision assembly |
| Time-of-Flight | 5-15mm | 0.1-6.0m | 30-60 fps | Good | $1K-$5K | Navigation, volume measurement |
| Active Stereo | 1-5% of range | 0.3-10m | 30-90 fps | Good | $300-$2K | Mobile robots, SLAM |
| Event Camera | N/A (temporal) | 0.1-10m | 1M events/s | 120dB HDR | $2K-$8K | High-speed tracking |
| Thermal (LWIR) | N/A (thermal) | 0.5-50m+ | 30-60 fps | Independent | $500-$10K | Inspection, safety |
3. 3D Perception & Point Cloud Processing
3.1 Point Cloud Fundamentals
A point cloud is an unstructured set of 3D coordinates, optionally augmented with color (XYZRGB), normals, or intensity values. Raw point clouds from depth cameras contain hundreds of thousands to millions of points and must be processed through a pipeline of filtering, segmentation, and feature extraction before they are useful for robotic manipulation decisions. Two open-source libraries dominate point cloud processing in robotics.
Point Cloud Library (PCL): The original C++ point cloud processing toolkit, deeply integrated with ROS/ROS2 through the pcl_ros bridge. PCL provides mature implementations of voxel grid downsampling, statistical outlier removal, RANSAC plane fitting, Euclidean clustering, ICP registration, and surface reconstruction. While development has slowed, PCL remains the standard for production ROS2 deployments due to its C++ performance and extensive ROS message support.
Open3D: A modern Python/C++ library from Intel with GPU-accelerated processing, superior visualization, and cleaner APIs than PCL. Open3D excels at reconstruction tasks (TSDF fusion, Poisson surface reconstruction), provides native tensor-based operations for deep learning integration, and supports CUDA-accelerated ICP and RANSAC. For rapid prototyping and research, Open3D is the preferred choice, while PCL remains stronger for hard real-time ROS2 applications.
3.2 Standard Processing Pipeline
A typical robotic perception pipeline processes raw point clouds through the following stages: passthrough filtering (crop to workspace volume), downsampling (voxel grid to reduce density), outlier removal (statistical or radius-based), plane segmentation (RANSAC to remove table/floor), Euclidean clustering (separate individual objects), and feature extraction (centroid, bounding box, surface normals for grasp planning).
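The downsampling stage of that pipeline can be sketched in a few lines. This is a pure-Python illustration of what PCL's VoxelGrid filter and Open3D's `voxel_down_sample` do internally (averaging all points that fall in each voxel); production code would use those library implementations:

```python
from collections import defaultdict

def voxel_downsample(points, voxel_size):
    """Voxel-grid downsampling sketch: bucket points by voxel index,
    then replace each bucket with its centroid. `points` is a list of
    (x, y, z) tuples."""
    bins = defaultdict(list)
    for p in points:
        key = tuple(int(c // voxel_size) for c in p)  # integer voxel index
        bins[key].append(p)
    # Centroid of each occupied voxel
    return [tuple(sum(axis) / len(pts) for axis in zip(*pts))
            for pts in bins.values()]

cloud = [(0.01, 0.0, 0.0), (0.02, 0.01, 0.0), (1.0, 1.0, 1.0)]
print(len(voxel_downsample(cloud, 0.1)))  # 3 points collapse into 2 voxels
```

The same bucketing idea, with occupancy counts instead of centroids, underlies the radius-based outlier removal stage as well.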
3.3 Surface Reconstruction
For applications requiring solid geometry rather than point samples, surface reconstruction converts point clouds into triangular meshes. Poisson Surface Reconstruction produces watertight meshes suitable for volume calculation, CAD comparison, and physics simulation. The algorithm requires oriented normals and works best on uniformly sampled surfaces. TSDF (Truncated Signed Distance Function) Fusion integrates multiple depth frames into a volumetric representation, producing dense and accurate reconstructions ideal for multi-view scanning setups. Open3D implements GPU-accelerated TSDF with ScalableTSDFVolume, supporting real-time reconstruction from RGBD streams.
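The per-voxel update at the heart of TSDF fusion is a weight-normalized running average of truncated signed distances. A single-voxel sketch (the truncation distance of 5cm here is illustrative; real systems tune it to sensor noise):

```python
def tsdf_update(tsdf, weight, sdf_obs, w_obs=1.0, trunc=0.05):
    """Fuse one depth observation into a voxel: clamp the observed
    signed distance to [-trunc, +trunc], then blend it into the stored
    value via a weighted running average. This is the core operation
    TSDF fusion repeats for every voxel of every incoming frame."""
    d = max(-trunc, min(trunc, sdf_obs))
    new_weight = weight + w_obs
    new_tsdf = (tsdf * weight + d * w_obs) / new_weight
    return new_tsdf, new_weight

# Start from an empty voxel and fuse three noisy observations of a
# surface sitting ~1 cm in front of it: noise averages out.
v, w = 0.0, 0.0
for obs in (0.012, 0.009, 0.010):
    v, w = tsdf_update(v, w, obs)
print(round(v, 4), w)  # converges toward the ~0.0103 m mean
```

Averaging in distance space is what makes the fused surface smoother than any single depth frame; the mesh is then extracted from the zero crossing of the fused field.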
3.4 6-DOF Pose Estimation
Determining the full six-degree-of-freedom pose (X, Y, Z, roll, pitch, yaw) of known objects is essential for precision pick-and-place and assembly tasks. Classical approaches use feature-based matching: extract 3D features (FPFH, SHOT) from the scene point cloud, match against a CAD model template, and refine with ICP. Modern deep learning methods trained on synthetic data (e.g., DenseFusion, PVN3D, FoundationPose) directly regress 6-DOF poses from RGBD input, achieving state-of-the-art accuracy on benchmarks like BOP Challenge. NVIDIA's FoundationPose (2024) introduced a foundation model approach that generalizes to novel objects using only a CAD model or a small set of reference images, eliminating per-object training entirely.
For high-accuracy applications with known CAD models and controlled lighting (e.g., automotive assembly), classical ICP-based methods with structured light cameras deliver repeatable sub-millimeter accuracy. For high-variety applications with unknown or changing objects (e.g., e-commerce fulfillment), deep learning methods like FoundationPose provide superior generalization with 2-5mm accuracy sufficient for grasping. Hybrid pipelines that use deep learning for coarse detection and ICP for fine refinement combine the best of both approaches.
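The fine-refinement step in these hybrid pipelines reduces to repeatedly solving a closed-form rigid alignment between matched points. The sketch below shows that closed-form step (the Kabsch/Procrustes solution via SVD, which ICP runs after each nearest-neighbor matching round), using synthetic correspondences rather than a real scan:

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst for
    known point correspondences: the inner solve of each ICP iteration."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1 so R is a proper rotation
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = c_dst - R @ c_src
    return R, t

# Recover a known 90-degree yaw plus a small translation
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
R_true = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
dst = src @ R_true.T + np.array([0.1, 0.2, 0.3])
R, t = kabsch(src, dst)
print(np.allclose(R, R_true), np.allclose(t, [0.1, 0.2, 0.3]))  # True True
```

Full ICP alternates this solve with re-matching correspondences, which is why a good coarse pose from the deep learning stage matters: it determines whether the first matching round is mostly correct.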
4. Object Detection & Recognition
4.1 YOLO Family for Real-Time Detection
The YOLO (You Only Look Once) architecture family remains the dominant choice for real-time object detection in robotics due to its single-pass inference speed and excellent accuracy-latency tradeoff. The progression from YOLOv5 through YOLOv8 to the current YOLOv11 has delivered consistent improvements in both accuracy and inference efficiency.
YOLOv8 (Ultralytics): The current production standard, offering five model scales (nano, small, medium, large, extra-large) that span the full spectrum from embedded edge deployment (YOLOv8n at 3.2M parameters, 1.8ms on Jetson Orin) to maximum accuracy (YOLOv8x at 68.2M parameters). YOLOv8 natively supports detection, segmentation, pose estimation, and oriented bounding boxes (OBB) from a unified training framework. The Ultralytics Python API provides single-line training, validation, and export to ONNX, TensorRT, CoreML, and OpenVINO formats.
YOLOv11: The latest release introduces improved C3K2 backbone blocks and attention mechanisms that push accuracy 2-3% higher on COCO while maintaining comparable inference speed. For robotics deployments where retraining is feasible, YOLOv11 is now the recommended starting point.
4.2 Detectron2 for Instance Segmentation
Meta's Detectron2 framework provides state-of-the-art instance segmentation using Mask R-CNN, Cascade R-CNN, and PointRend architectures. While slower than YOLO (typically 5-15 fps on desktop GPU), Detectron2 produces pixel-precise segmentation masks that are essential for deformable object manipulation, suction cup grasp planning (identifying flat surfaces within the mask), and measuring object dimensions from segmented regions. Detectron2's model zoo includes pre-trained weights on COCO (80 classes), LVIS (1203 classes), and Cityscapes, providing strong transfer learning baselines for custom robotic vision datasets.
4.3 Segment Anything Model (SAM)
Meta's SAM and its successor SAM 2 represent a paradigm shift toward foundation models for visual perception. SAM segments any object in an image given a point prompt, bounding box, or coarse mask, without object-specific training. For robotics, this capability is transformative: a robot can segment novel objects it has never seen during training by providing a rough location (from YOLO detection or user click) and receiving a pixel-precise mask. SAM 2 extends this to video with real-time temporal tracking, enabling consistent object segmentation across frames as the robot moves. The primary deployment challenge is computational cost; SAM's ViT-H backbone requires 2.5GB of GPU memory and runs at 2-4 fps on Jetson Orin. Distilled variants (MobileSAM, FastSAM, EfficientSAM) reduce this to near-real-time on edge hardware.
4.4 Open-Vocabulary and Foundation Models
The frontier of robotic perception has moved beyond fixed-class detectors to open-vocabulary models that accept natural language descriptions of target objects. Grounding DINO combines a DINO-based detector with grounded language understanding, allowing queries like "the red screw on the left side of the PCB" to produce bounding box detections without any task-specific training. OWLv2 (Google) provides open-world localization with text and image prompts. For robotic manipulation, these models are combined with SAM in a detect-then-segment pipeline: Grounding DINO localizes the object from a language prompt, and SAM produces the precise segmentation mask for grasp planning.
| Model | Task | Speed (Jetson Orin) | Accuracy (COCO) | Custom Training | Best For |
|---|---|---|---|---|---|
| YOLOv8-nano | Detection | 1.8ms / 555 fps | 37.3 mAP | Easy (Ultralytics) | Edge real-time |
| YOLOv8-large | Detection | 12ms / 83 fps | 52.9 mAP | Easy (Ultralytics) | Accuracy-critical |
| YOLOv8-seg | Segmentation | 15ms / 66 fps | 44.6 mask mAP | Easy (Ultralytics) | Grasp planning |
| Detectron2 Mask R-CNN | Segmentation | 80ms / 12 fps | 46.3 mask mAP | Moderate | Precision masks |
| SAM (ViT-H) | Segmentation | 250ms / 4 fps | N/A (zero-shot) | None needed | Novel objects |
| Grounding DINO | Open-vocab Det. | 180ms / 5 fps | 52.5 mAP (zero-shot) | Optional fine-tune | Language-guided |
5. Visual Servoing
5.1 Fundamentals
Visual servoing (VS) is a control technique that uses real-time visual feedback to guide robot motion, closing the loop between perception and actuation. Unlike open-loop pick-and-place (detect, plan, execute), visual servoing continuously adjusts the robot's trajectory based on what the camera currently observes, enabling the robot to correct for calibration errors, object drift, and environmental perturbations. This makes VS indispensable for tasks requiring sub-millimeter precision, dynamic target tracking, or operating with imprecise kinematic models.
5.2 Image-Based Visual Servoing (IBVS)
IBVS operates entirely in the 2D image plane, computing velocity commands from the error between current and desired image features (typically point coordinates, line parameters, or image moments). The control law minimizes the feature error e = s - s* using the image Jacobian (interaction matrix) L that maps camera velocities to feature motion. The velocity command is v = -lambda * L_pinv * e, where lambda is the control gain.
Advantages of IBVS: no 3D model required, inherently robust to calibration errors since control happens in image space, and reliable local convergence for small displacements. Disadvantages: the camera trajectory in 3D space is not predictable (it may produce unintuitive Cartesian paths), singularity risks when the image Jacobian loses rank, and difficulty handling large rotations (particularly 180-degree rotations, where features leave the field of view).
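The IBVS law above can be sketched concretely for point features. The interaction matrix below is the standard one for a normalized image point at depth Z; stacking one per feature and applying v = -lambda * L_pinv * e gives the commanded camera twist. Depths here are assumed known (in practice they come from the depth camera or are approximated):

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Interaction matrix of one normalized image point (x, y) at
    depth Z: maps the 6-DOF camera twist [vx vy vz wx wy wz] to the
    point's image-plane velocity."""
    return np.array([
        [-1/Z,    0, x/Z,     x*y, -(1 + x*x),  y],
        [   0, -1/Z, y/Z, 1 + y*y,       -x*y, -x],
    ])

def ibvs_velocity(points, desired, depths, lam=0.5):
    """v = -lambda * pinv(L) * e for a stack of point features."""
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(points, depths)])
    e = (np.asarray(points) - np.asarray(desired)).ravel()
    return -lam * np.linalg.pinv(L) @ e

# Four points already at their desired positions -> zero commanded twist
pts = [(0.1, 0.1), (-0.1, 0.1), (-0.1, -0.1), (0.1, -0.1)]
v = ibvs_velocity(pts, pts, depths=[1.0] * 4)
print(np.allclose(v, 0.0))  # True: the servo has converged
```

Four points give an 8x6 stacked Jacobian, which is why four well-spread features are the usual minimum for controlling all six degrees of freedom without ambiguity.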
5.3 Position-Based Visual Servoing (PBVS)
PBVS first estimates the full 3D pose of the target from visual features, then computes a Cartesian velocity command to move the robot end-effector toward the desired 3D pose. The control law operates on the 6-DOF pose error in SE(3), producing straight-line Cartesian trajectories that are intuitive and predictable. PBVS requires accurate camera calibration and a 3D model of the target for pose estimation.
Advantages of PBVS: predictable Cartesian motion paths, natural handling of large displacements and rotations, and easy integration with collision avoidance systems that operate in Cartesian space. Disadvantages: sensitivity to calibration errors (camera intrinsics, hand-eye transform), reliance on accurate 3D pose estimation, and potential for features to leave the camera's field of view during servoing if the initial pose error is large.
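The quantity a PBVS law drives to zero is the SE(3) error between current and desired poses. A minimal sketch of splitting that error into translation and rotation components, using 4x4 homogeneous matrices (the rotation angle is recovered from the trace of the relative rotation):

```python
import numpy as np

def pose_error(T_cur, T_des):
    """SE(3) error between current and desired poses (4x4 homogeneous
    matrices): returns the translation error vector and the rotation
    error angle in radians. A PBVS controller commands a Cartesian
    twist proportional to these errors."""
    dT = np.linalg.inv(T_cur) @ T_des
    t_err = dT[:3, 3]
    # angle of the relative rotation, from trace(R) = 1 + 2*cos(angle)
    cos_a = np.clip((np.trace(dT[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return t_err, np.arccos(cos_a)

T_cur = np.eye(4)
T_des = np.eye(4)
T_des[:3, 3] = [0.0, 0.0, 0.1]  # desired pose sits 10 cm ahead
t_err, ang = pose_error(T_cur, T_des)
print(t_err, float(ang))  # pure translation error, zero rotation error
```

Because the error lives in Cartesian space, gains can be tuned separately for translation and rotation, which is one reason PBVS integrates cleanly with Cartesian collision avoidance.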
5.4 Hybrid Approaches
Modern implementations increasingly adopt hybrid visual servoing strategies that combine IBVS and PBVS advantages. A common approach uses PBVS for translational motion (predictable Cartesian path) and IBVS for rotational control (robust to calibration errors). Partitioned approaches decouple the control into translational and rotational components, each using the most appropriate servoing strategy. Deep learning-based visual servoing replaces hand-crafted features with learned feature representations, using convolutional neural networks to directly predict velocity commands from raw images. These end-to-end methods show promise for unstructured environments but remain less reliable than classical approaches for precision industrial tasks.
Use open-loop when: camera and robot are well-calibrated, objects are stationary, cycle time is critical, and 1-2mm accuracy is sufficient (e.g., standard bin picking with vacuum grippers).
Use visual servoing when: objects may move during approach (conveyor tracking), sub-millimeter accuracy is required (PCB insertion, precision assembly), calibration drift is expected (thermal expansion, mobile robots), or the robot must react to real-time visual feedback (welding seam tracking, wire insertion).
6. Bin Picking
6.1 Random Bin Picking
Random bin picking is widely considered the most challenging practical application of computer vision in industrial robotics. The task requires the robot to identify, localize, and grasp individual parts from a disordered heap of randomly oriented objects in a bin. The difficulty arises from severe occlusion (objects partially hidden by other objects), entanglement (parts interlocking), reflective or dark surfaces that challenge 3D cameras, and the need to plan collision-free approach paths into a confined bin volume.
A production random bin picking pipeline typically follows this sequence: (1) acquire 3D point cloud of the bin contents from a structured light camera mounted above the bin; (2) segment the bin walls and remove background points; (3) detect individual objects using 3D instance segmentation or 2D detection projected onto the 3D cloud; (4) estimate the 6-DOF pose of the topmost (most accessible) objects; (5) score candidate grasps based on collision clearance, grasp quality metrics, and reachability; (6) execute the highest-scored grasp with the robot; (7) verify grasp success using force/torque feedback or re-scan.
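Step (5), grasp scoring, is typically a gated weighted sum: hard constraints (reachability, minimum collision clearance) prune candidates outright, and the survivors are ranked on normalized metrics. A toy sketch; the weights, thresholds, and 50mm saturation point below are illustrative, not values from any particular product:

```python
def score_grasp(clearance_mm, quality, reachable,
                w_clear=0.4, w_qual=0.6, min_clearance_mm=10.0):
    """Toy grasp-candidate scorer: hard constraints gate a weighted
    sum of normalized metrics. All weights are illustrative."""
    if not reachable or clearance_mm < min_clearance_mm:
        return 0.0                                # hard-pruned candidate
    clear_norm = min(clearance_mm / 50.0, 1.0)    # saturate beyond 50 mm
    return w_clear * clear_norm + w_qual * quality

candidates = [
    {"clearance_mm": 35.0, "quality": 0.9,  "reachable": True},
    {"clearance_mm": 60.0, "quality": 0.6,  "reachable": True},
    {"clearance_mm": 80.0, "quality": 0.95, "reachable": False},  # pruned
]
best = max(candidates, key=lambda c: score_grasp(**c))
print(best["quality"])  # the high-quality, adequately-clear grasp wins
```

Production systems add more terms (approach-axis alignment with gravity, distance to bin walls, expected entanglement), but the gate-then-rank structure is the same.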
Leading commercial bin picking solutions (Photoneo Bin Picking Studio, Mech-Mind Mech-Vision, SICK PLB, Zivid + Pickit) bundle calibrated 3D cameras with integrated perception software, reducing deployment from months of custom development to days of configuration. These systems achieve 99%+ pick success rates on well-characterized parts at cycle times of 6-12 seconds per pick.
6.2 Structured Picking and Depalletizing
Structured Bin Picking: When parts arrive in known arrangements (trays, blister packs, organized layers), simplified vision algorithms suffice. Template matching or CAD-guided detection locates parts with known spacing, requiring only compensation for tray misalignment and layer height variation. Cycle times drop to 2-4 seconds per pick with higher reliability than random picking.
Depalletizing with Vision: Vision-guided depalletizing uses overhead 3D cameras to detect the top layer of cases or bags on a pallet, determine pick points, and generate layer-by-layer unloading sequences. ToF cameras (for speed) or structured light cameras (for accuracy on shiny packaging) provide the 3D scene understanding. The algorithm must handle mixed-SKU pallets, damaged cartons, shrink-wrapped loads, and slip sheets between layers. Modern depalletizing systems from Cognex, SICK, and Mech-Mind achieve 99.5%+ reliability at 600-1000 cases per hour.
7. Quality Inspection
7.1 Defect Detection with Deep Learning
Deep learning has fundamentally transformed automated visual inspection, replacing hand-crafted feature engineering with data-driven defect models that generalize across variations in lighting, positioning, and material surface properties. The dominant architectures for industrial defect detection fall into three categories.
Supervised Classification/Segmentation: When labeled defect data is available, models like U-Net, DeepLabv3+, and YOLO-seg directly segment defect regions at pixel level. Training requires 200-2000 annotated defect images per class, which can be augmented with synthetic data generation. This approach delivers the highest accuracy (95-99.5% defect detection rate) but requires labeled datasets for each defect type.
Anomaly Detection (Unsupervised): When defect samples are rare or unknown, anomaly detection models learn the distribution of normal (good) parts and flag deviations. PatchCore, PaDiM, and FastFlow architectures from the anomalib library achieve 95-98% AUROC on the MVTec Anomaly Detection benchmark using only defect-free training images. This approach dramatically reduces data collection requirements and naturally handles novel defect types not seen during training. For production deployment, anomaly detection is particularly valuable in industries like semiconductor fabrication and precision machining where new defect modes emerge unpredictably.
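The scoring idea behind memory-bank methods like PatchCore reduces to nearest-neighbor distance in feature space: a test feature is anomalous if nothing similar exists among features extracted from defect-free parts. A sketch with synthetic feature vectors standing in for the deep network embeddings:

```python
import numpy as np

def anomaly_score(feature, memory_bank):
    """PatchCore-style scoring sketch: the score of a test feature is
    its distance to the nearest entry in a memory bank built only from
    defect-free training samples."""
    d = np.linalg.norm(memory_bank - feature, axis=1)
    return float(d.min())

rng = np.random.default_rng(42)
bank = rng.normal(0.0, 0.1, size=(500, 8))   # features of good parts
good_patch = rng.normal(0.0, 0.1, size=8)    # in-distribution patch
defect_patch = np.full(8, 2.0)               # far from anything normal
print(anomaly_score(good_patch, bank) < anomaly_score(defect_patch, bank))  # True
```

In the real pipelines, the bank holds mid-level CNN patch embeddings and is coreset-subsampled to keep the nearest-neighbor search fast; thresholding the per-patch scores yields a defect localization map.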
Few-Shot and Foundation Models: Vision-language models (CLIP, BLIP-2) and segment-anything approaches enable defect detection with minimal labeled data. An inspector can describe a defect type in natural language ("scratch on polished surface", "solder bridge between pads") and the model identifies matching regions. While accuracy lags behind fully supervised approaches, the near-zero setup time makes this attractive for low-volume, high-mix manufacturing common in Vietnamese contract manufacturing facilities.
7.2 Surface Inspection Systems
Surface inspection requires specialized illumination strategies to reveal defects. Bright-field illumination (direct on-axis light) highlights color defects, contamination, and markings. Dark-field illumination (low-angle grazing light) reveals surface topography defects like scratches, dents, and texture irregularities by scattering light at defect edges. Dome illumination provides diffuse, shadow-free lighting for inspecting curved or reflective surfaces. Structured illumination (photometric stereo with multiple light directions) reconstructs surface normal maps that reveal micro-topography invisible under standard lighting. Production surface inspection systems from Cognex (In-Sight), Keyence, and ISRA Vision combine optimized illumination with specialized optics and real-time deep learning inference for throughputs exceeding 10 parts per second.
7.3 Dimensional Measurement
Vision-based dimensional measurement replaces contact gauging (calipers, CMMs) with non-contact optical methods. 2D measurement using calibrated telecentric lenses achieves 5-20 micron accuracy for in-plane dimensions. 3D measurement using structured light or laser triangulation extends this to height, flatness, and volumetric dimensions with 10-50 micron accuracy. Critical success factors include thermal stability (camera and lens expand with temperature), vibration isolation, and traceable calibration against certified reference artifacts. For GD&T (Geometric Dimensioning and Tolerancing) compliance, vision measurement systems must be validated per MSA (Measurement System Analysis) protocols with documented Gage R&R studies.
8. Calibration
8.1 Intrinsic Calibration
Camera intrinsic calibration determines the internal parameters that map 3D world points to 2D pixel coordinates: focal length (fx, fy), principal point (cx, cy), and lens distortion coefficients (radial k1-k6, tangential p1-p2). Standard calibration uses Zhang's method with a planar checkerboard pattern captured from 15-30 viewpoints. OpenCV's calibrateCamera() function implements this with sub-pixel corner detection. For production-grade calibration, use a precision-manufactured target (lithographic or chrome-on-glass, not an office laser-printed paper pattern) mounted on a flat glass or ceramic substrate. Reprojection error below 0.3 pixels indicates good calibration; below 0.1 pixels is excellent. Recalibrate whenever the lens is adjusted, the camera is remounted, or operating temperature changes significantly.
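The reprojection error being minimized is the pixel distance between an observed corner and its position predicted by the forward camera model. A minimal sketch of that forward model with a single radial distortion term (the intrinsic values below are arbitrary examples, not from any real calibration):

```python
import math

def project(X, Y, Z, fx, fy, cx, cy, k1=0.0):
    """Pinhole projection with one radial distortion term: the forward
    model whose reprojection residuals calibrateCamera() minimizes
    over all views and corners."""
    x, y = X / Z, Y / Z            # normalized image coordinates
    r2 = x * x + y * y
    d = 1.0 + k1 * r2              # radial distortion factor
    return fx * x * d + cx, fy * y * d + cy

def reprojection_error(observed_px, predicted_px):
    """Euclidean pixel distance between observed and predicted corner."""
    return math.dist(observed_px, predicted_px)

u, v = project(0.05, -0.02, 0.6, fx=1400.0, fy=1400.0, cx=960.0, cy=600.0)
err = reprojection_error((u + 0.2, v - 0.1), (u, v))
print(round(err, 3))  # a ~0.224-pixel residual: inside the "good" band
```

OpenCV reports the RMS of these residuals over every corner in every view, which is the number to compare against the 0.3/0.1 pixel thresholds above.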
8.2 Hand-Eye Calibration
Hand-eye calibration determines the rigid transformation between the robot end-effector (hand) and the camera (eye). This transform is essential for converting object poses detected in camera coordinates to robot base coordinates for manipulation. Two configurations exist.
Eye-in-Hand: Camera mounted on the robot's wrist, moving with the end-effector. The calibration solves AX = XB, where A is the robot motion between two poses (known from forward kinematics), B is the camera motion (computed from observing a fixed calibration target), and X is the unknown hand-eye transform. At least 3 non-degenerate motions are required; 8-15 motions with diverse orientations yield robust results. OpenCV implements Tsai-Lenz, Park, Horaud, and Daniilidis solvers via calibrateHandEye().
Eye-to-Hand: Camera fixed in the workspace, observing the robot. The calibration solves AX = ZB, where Z is the transform from robot base to camera. This configuration is standard for overhead bin picking cameras. Calibration requires the robot to present a calibration target (mounted on the flange) at 8-15 diverse poses within the camera's field of view. OpenCV's calibrateRobotWorldHandEye() solves this variant.
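The AX = XB relation from the eye-in-hand case can be sanity-checked numerically: for the true hand-eye transform X, every consistent motion pair (A, B) satisfies it exactly, and solvers such as Tsai-Lenz find the X minimizing the residual over many pairs. A sketch with 4x4 homogeneous matrices (the 30-degree twist and 5cm offset are made-up illustrative values):

```python
import numpy as np

def rot_z(angle_rad):
    """Homogeneous transform for a rotation about the z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

# Illustrative ground-truth hand-eye transform X (camera in flange frame)
X = rot_z(np.deg2rad(30.0))
X[:3, 3] = [0.05, 0.0, 0.0]

# For any camera motion B, the consistent robot motion is A = X B X^-1,
# so the AX = XB residual of the correct X is (numerically) zero.
B = rot_z(np.deg2rad(45.0))
B[:3, 3] = [0.0, 0.1, 0.02]
A = X @ B @ np.linalg.inv(X)

residual = np.linalg.norm(A @ X - X @ B)
print(residual < 1e-12)  # True
```

With noisy real measurements the residual is never zero, which is why 8-15 diverse motion pairs are used: the over-determined system averages out per-pose estimation noise.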
8.3 Multi-Camera Systems
Complex robotic cells often employ multiple cameras for complete scene coverage: an overhead camera for coarse localization, an eye-in-hand camera for fine alignment, and side cameras for quality verification. Calibrating multi-camera systems requires establishing a common coordinate frame. Extrinsic calibration between cameras can use shared observations of a calibration target visible to both cameras simultaneously, or transitively through the robot coordinate frame (if each camera is independently hand-eye calibrated to the same robot). For multi-camera stereo setups, OpenCV's stereoCalibrate() jointly optimizes both camera intrinsics and the relative extrinsic transformation.
9. Edge AI Platforms
9.1 Why Edge Inference for Robotics
Robotic vision systems demand low-latency, deterministic inference that cloud-based processing cannot reliably provide. Network round-trip latency (10-100ms to cloud) exceeds the response time requirements of visual servoing (5-10ms loop), safety-rated obstacle detection (under 20ms), and high-speed conveyor tracking. Edge AI accelerators colocated with the robot's camera system eliminate network dependency, provide deterministic latency, and operate in air-gapped manufacturing environments that prohibit external data transmission for IP protection. The economics also favor edge: after initial hardware cost, inference is essentially free, versus per-query cloud API costs that scale linearly with throughput.
9.2 Platform Comparison
| Platform | AI Performance | Power | GPU/NPU | Price (Module) | Best For |
|---|---|---|---|---|---|
| NVIDIA Jetson Orin NX 16GB | 100 TOPS (INT8) | 10-25W | 1024 CUDA + 32 Tensor | ~$600 | Multi-model pipelines |
| NVIDIA Jetson AGX Orin 64GB | 275 TOPS (INT8) | 15-60W | 2048 CUDA + 64 Tensor | ~$1,600 | Autonomous robots, multi-cam |
| NVIDIA Jetson Orin Nano 8GB | 40 TOPS (INT8) | 7-15W | 512 CUDA + 16 Tensor | ~$250 | Single-camera detection |
| Intel Neural Compute Stick 2 | 4 TOPS (INT8) | ~1.5W | Myriad X VPU | ~$70 | Low-power classification |
| Google Coral Edge TPU (M.2) | 4 TOPS (INT8) | 2W | Edge TPU ASIC | ~$30 | TFLite single-model |
| Hailo-8 (M.2) | 26 TOPS (INT8) | 2.5W | Custom dataflow NPU | ~$100 | Multi-stream, high efficiency |
| Hailo-15H (coming) | 20 TOPS + ISP | 3W | NPU + vision proc. | ~$40 | Smart cameras |
9.3 NVIDIA Jetson Ecosystem
The NVIDIA Jetson platform dominates robotic edge AI due to its combination of GPU compute, mature software ecosystem, and direct compatibility with training frameworks. The deployment pipeline typically flows: train on desktop/cloud GPU (PyTorch/TensorFlow) -> export to ONNX -> optimize with TensorRT -> deploy on Jetson. TensorRT optimization routinely delivers 2-5x speedup over native PyTorch inference through layer fusion, precision calibration (FP32 to FP16/INT8), and kernel auto-tuning. NVIDIA's Isaac ROS platform provides pre-built, GPU-accelerated ROS2 nodes for stereo depth estimation (using DNN-based disparity), visual SLAM (cuVSLAM), object detection (DOPE, CenterPose), and 3D perception (nvblox occupancy mapping).
For production robotics, the Jetson Orin NX 16GB represents the sweet spot: sufficient performance to run a YOLOv8-medium detector, a depth estimation model, and point cloud processing simultaneously at 15+ fps within a 25W power envelope. The AGX Orin 64GB is reserved for autonomous mobile robots running concurrent SLAM, multi-camera detection, and path planning, or for running large models like SAM alongside real-time detection.
9.4 Hailo and Emerging Alternatives
Hailo's dataflow architecture NPU has emerged as a compelling alternative for multi-stream edge applications. The Hailo-8 delivers 26 TOPS at just 2.5W, offering 10x better TOPS/Watt than Jetson Orin. The Hailo Dataflow Compiler converts ONNX/TFLite models with automatic quantization and scheduling. Particularly attractive for multi-camera quality inspection systems where 4-8 camera streams must be processed simultaneously on a single edge device. Hailo's partnership with Raspberry Pi (Hailo AI Kit for RPi5) is driving adoption in research and low-cost robotics applications.
10. Software Frameworks
10.1 Open-Source Frameworks
OpenCV (Open Source Computer Vision Library): The foundational library for computer vision across all domains. OpenCV 4.x provides 2500+ algorithms covering image processing, feature detection, camera calibration, stereo vision, object detection (DNN module for running ONNX/TensorFlow/Caffe models), ArUco/ChArUco marker detection, and optical flow. OpenCV's DNN module supports inference on CPU, CUDA GPU, and OpenVINO backends, making it the universal preprocessing and inference layer for robotics vision. Every serious robotics vision system uses OpenCV, either directly or through higher-level wrappers.
OpenCV Contrib: Extended modules include ArUco marker detection (essential for robot calibration and fiducial tracking; promoted into the main modules as of OpenCV 4.7), structured light pattern generation, surface matching (3D object recognition), and xfeatures2d (SURF and other experimental feature detectors; SIFT graduated to the main OpenCV module in 4.4 once its patent expired). The cv2.aruco module is particularly valuable for robotics, providing robust 6-DOF pose estimation from printed markers for calibration validation, fixture alignment, and simple object tracking.
10.2 Commercial Frameworks
MVTec HALCON: The gold standard for industrial machine vision, used in over 90% of major automotive inspection systems worldwide. HALCON provides an integrated development environment (HDevelop) with 2000+ operators covering blob analysis, template matching (shape-based, correlation-based, deformable), 3D vision (surface matching, 3D pose estimation), deep learning (anomaly detection, classification, semantic segmentation), barcode/OCR, and calibration. HALCON's shape-based matching is unrivaled in speed and robustness, locating trained patterns in under 10ms even under significant rotation, scaling, and partial occlusion. License cost: $3,500-$8,000 per runtime seat, which is justified for high-reliability industrial deployments.
Cognex VisionPro: Cognex's PC-based vision software provides PatMax (geometric pattern matching), PatInspect (defect detection), IDMax (barcode reading), and deep learning tools. Cognex hardware (In-Sight cameras, DataMan readers) integrates tightly with VisionPro for turnkey inspection solutions. Cognex's deep learning edge (Cognex ViDi) requires minimal training data (as few as 20 images) and deploys to In-Sight cameras for standalone edge inference. Market strength: strongest in consumer electronics and semiconductor inspection.
Matrox Imaging Library (MIL): Matrox provides high-performance image processing with particular strength in multi-camera systems, line scan processing, and GigE Vision/Camera Link capture. MIL X is the latest generation supporting GPU-accelerated processing and deep learning inference. Matrox's SureDotOCR and PatternFinder are widely used in pharmaceutical packaging inspection and PCB assembly verification.
| Framework | License | Strengths | Deep Learning | 3D Vision | ROS2 Support |
|---|---|---|---|---|---|
| OpenCV | Free (Apache 2.0) | Universal, huge community | DNN inference only | Basic (stereo, calib) | Native via cv_bridge |
| HALCON | $3.5K-$8K/seat | Industrial reliability, matching | Integrated (train+deploy) | Excellent | C++ interface |
| Cognex VisionPro | $5K-$15K/seat | PatMax, turnkey hardware | Cognex ViDi | Good (3D-A5000) | Limited |
| Matrox MIL | $3K-$10K/seat | Multi-cam, line scan | MIL DL | Good | Limited |
| Open3D | Free (MIT) | Point clouds, reconstruction | Tensor integration | Excellent | Python bridge |
11. Integration with ROS2
11.1 ROS2 Vision Architecture
ROS2 (Robot Operating System 2) provides the middleware layer that connects camera drivers, perception algorithms, and robot controllers into a coherent vision-guided manipulation pipeline. The key architectural components for vision integration in ROS2 Humble/Iron/Jazzy are organized into standardized packages with well-defined message types and topic conventions.
image_pipeline: The core set of packages for camera processing in ROS2. image_transport handles efficient image transmission with pluggable compression (raw, compressed JPEG/PNG, theora video). image_proc performs debayering (converting raw Bayer patterns to color), rectification (undistorting images using calibration parameters), and resizing. depth_image_proc converts depth images to point clouds and registers color onto depth. stereo_image_proc computes disparity maps and 3D point clouds from calibrated stereo camera pairs. These packages form the foundation that downstream perception nodes build upon.
cv_bridge: The bridge between ROS2 sensor_msgs/Image messages and OpenCV cv::Mat (C++) or NumPy arrays (Python). Every vision node that processes images uses cv_bridge for format conversion. When publisher and subscriber run in the same process, ROS2's intra-process communication enables zero-copy transport, eliminating the image copy overhead that can bottleneck high-resolution pipelines.
11.2 Point Cloud Topics and Processing
3D perception in ROS2 centers on the sensor_msgs/PointCloud2 message type, which carries dense or organized point clouds with arbitrary fields (XYZ, RGB, normals, intensity). The pcl_ros package bridges PCL data structures with ROS2 messages. Key topic conventions include /camera/depth/points for raw depth camera point clouds, /camera/depth_registered/points for color-registered clouds, and custom topic names for processed/filtered clouds. Downstream nodes subscribe to these topics for object detection, segmentation, and grasp planning.
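A minimal sketch of the PointCloud2 memory layout, assuming the common unpadded x/y/z FLOAT32 field arrangement (real drivers often pad point_step to 16 or 32 bytes), shows how the flat data buffer round-trips through NumPy:

```python
import numpy as np

# Three XYZ points in meters as float32 -- three FLOAT32 fields x/y/z,
# point_step = 12 bytes, no padding assumed here.
points = np.array([[0.1, 0.0, 0.5],
                   [0.2, -0.1, 0.6],
                   [0.0, 0.3, 0.7]], dtype=np.float32)

# Serialize to the flat byte buffer a sensor_msgs/PointCloud2 'data'
# field carries over the wire.
data = points.tobytes()

# A subscriber can view the same buffer zero-copy as a structured array.
cloud = np.frombuffer(data, dtype=np.dtype(
    [("x", np.float32), ("y", np.float32), ("z", np.float32)]))
print(cloud["z"])  # [0.5 0.6 0.7]
```

The field names, offsets, and point_step are declared in the message's PointField list, so a robust subscriber should read them from the message rather than hard-coding the layout as this sketch does.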
For GPU-accelerated 3D perception, NVIDIA's Isaac ROS provides nvblox (real-time occupancy mapping and ESDF computation for collision avoidance), isaac_ros_depth_segmentation, and isaac_ros_cuMotion (GPU-accelerated motion planning aware of 3D obstacles). These nodes leverage CUDA and Jetson hardware for throughput that pure CPU implementations cannot match.
11.3 Complete Vision-Guided Pick Pipeline
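The full pipeline listing is beyond this excerpt, but its core geometric step — turning a detection's pixel location and measured depth into a pick position in the robot base frame — can be sketched as follows; the intrinsics and hand-eye transform are illustrative assumptions, not calibrated values:

```python
import numpy as np

# Illustrative pinhole intrinsics (fx, fy, cx, cy); real values come from
# camera calibration.
fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0

# Illustrative camera-to-base transform from hand-eye calibration;
# translation only here -- a real mount also has a rotation component.
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.4, 0.0, 0.6]

def pixel_to_base(u, v, depth_m):
    """Deproject pixel (u, v) at measured depth into the camera frame,
    then transform the point into the robot base frame."""
    p_cam = np.array([(u - cx) * depth_m / fx,
                      (v - cy) * depth_m / fy,
                      depth_m, 1.0])
    return (T_base_cam @ p_cam)[:3]

# A detection at the image center, 0.5 m from the camera.
print(pixel_to_base(320, 240, 0.5))  # [0.4 0.  1.1]
```

In the full pipeline this computation sits between the detector node and the grasp planner, usually expressed via tf2 frame lookups rather than a hard-coded matrix.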
11.4 Launch File and Configuration
11.5 ROS2 Vision Topic Map
12. Pre-Deployment Checklist
Before deploying a vision-guided robotic system to production, validate the following:
- Camera intrinsic calibration reprojection error below 0.3 pixels
- Hand-eye calibration consistency error below 0.5mm
- Object detection model validated on 500+ held-out test images with target domain data
- Edge inference latency benchmarked end-to-end (capture to pick pose output) under 100ms
- Lighting variation tested: worst-case ambient conditions still yield reliable detection
- Thermal stability verified: calibration accuracy after 4+ hours of continuous operation
- Failure mode handling: behavior defined for zero detections, low confidence, sensor dropout
- Logging and monitoring: all detections, confidences, and cycle metrics recorded for analysis
Seraphim Vietnam provides end-to-end computer vision and robotics engineering, from camera selection and calibration through deep learning model development, edge AI deployment, and ROS2 system integration. Schedule a consultation to discuss your machine vision requirements.

