Full-Stack Optimization for Low-Light Video on Jetson Orin NX: From 400 ms to 28 ms

 

Jaehoon Lee
Technical Content Manager, Nota AI

 

When enterprises adopt AI, the most common bottleneck is not model development. It is the deployment stage: getting a finished model to run reliably on the actual target device.

This CCTV low-light video restoration project followed the same pattern, with deployment-stage optimization as the core challenge. The client's existing model measured 400 ms per frame on the target edge environment (Jetson Orin NX), far short of the 33 ms (30 FPS) real-time target.

Nota AI brought per-frame processing time on the same device to under 28 ms, above the 30 FPS target, while meeting the client's quality requirements.

This improvement did not come from generic post-hoc optimization alone. When standard methods such as quantization and pruning hit their limits, Nota AI combined hardware constraints with domain-specific video characteristics to redesign the model from scratch.

The following sections walk through exactly where standard optimization stalled on the edge, and how Nota AI's integrated design capability, spanning model architecture through hardware optimization in a single continuous flow, resolved the real-time processing challenge.

 

CCTV Operational Criteria and the Limits of Existing Vision Models

The starting specifications for the collaboration are summarized below.

[Table: six starting specifications for the project]

Figure 1: Starting specifications for the field deployment project

30 FPS is not an arbitrary number. It is the lower bound for downstream object detection and tracking. Below this rate, fast-moving objects fall between frames and detection becomes unreliable.

Treating this case as a pure "speed problem" misses the point. Because the deployment domain is commercial CCTV operation, image quality and operational conditions are intertwined. Beyond 30 FPS and edge deployment, the following operational conditions must hold simultaneously for the result to matter:

  • Ground Truth (GT)-free environment: No reference images are available on site, so quality must be validated with no-reference metrics.

  • Brightness order preservation: The relative ordering of light and dark regions must survive enhancement; distorting it degrades downstream object detection accuracy.

  • Natural image quality: Output must be free of artificial noise so that downstream recognition algorithms can extract stable features.

  • Multi-channel concurrent processing: Multiple camera channels must run concurrently on a single GPU for infrastructure efficiency.

Among these four operational conditions, the first barrier existing models face is the difference in evaluation metrics. Conventional algorithm validation typically relies on PSNR (Peak Signal-to-Noise Ratio) or SSIM (Structural Similarity Index Measure), both of which compare a restored image against a reference ground truth (GT) at the pixel level.

In actual CCTV deployments, however, no such reference ground truth exists. The evaluation must instead rely on no-reference metrics: NIQE (Naturalness Image Quality Evaluator) for perceived naturalness, and LOE (Lightness Order Error) for the relative ordering of brightness within each frame. Once the evaluation criteria shift to operational metrics like these, even models that lead public-dataset leaderboards no longer guarantee a quality advantage in production.
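
As a rough illustration of what LOE measures, the sketch below follows the commonly cited definition: lightness is the per-pixel maximum of R, G, and B, and the score counts pixel pairs whose brightness ordering differs between the input frame and the enhanced frame. The function names and the tiny 2×2 frames are hypothetical, and this is not Nota AI's evaluation code. Practical implementations usually downsample first, because the pairwise comparison is quadratic in pixel count.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]          # (R, G, B)
Frame = List[List[Pixel]]             # rows of pixels


def lightness(frame: Frame) -> List[int]:
    """Flatten a frame into per-pixel lightness, taken as max(R, G, B)."""
    return [max(px) for row in frame for px in row]


def loe(source: Frame, enhanced: Frame) -> float:
    """Average number of pixel pairs per pixel whose brightness order flipped."""
    src, enh = lightness(source), lightness(enhanced)
    if len(src) != len(enh):
        raise ValueError("frames must have the same number of pixels")
    flips = 0
    for i in range(len(src)):
        for j in range(len(src)):
            # XOR of the two order relations: counts only when the order changed.
            flips += (src[i] >= src[j]) != (enh[i] >= enh[j])
    return flips / len(src)


# Hypothetical toy frames: one dark input and two candidate outputs.
dark = [[(10, 8, 6), (20, 18, 15)], [(30, 25, 22), (40, 33, 30)]]
brighter = [[tuple(min(255, 4 * c) for c in px) for px in row] for row in dark]
inverted = [[tuple(50 - c for c in px) for px in row] for row in dark]

print(loe(dark, brighter))   # brightness order preserved
print(loe(dark, inverted))   # brightness order reversed everywhere
```

A uniform gain keeps every pairwise ordering intact and scores zero, while an output that reverses light and dark scores high. Note that the comparison is against the low-light input itself, not against a ground-truth image, which is why the metric works on site.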

Even when a model satisfies these no-reference quality criteria, the real-time constraint (33 ms per frame) remains a separate challenge. The measurements in the next section show how far the existing baseline model overshot this budget on real hardware.

 

The 33 ms Wall: Where Generic Optimization Fell Short

The existing model the client prepared for deployment was a heavy architecture: a base low-light enhancement module followed by a separate denoising module to suppress noise amplified during brightness lifting. When the original model ran on the deployment target (Jetson Orin NX), inference time reached approximately 400 ms, more than 10× the 33 ms real-time target. To gauge the optimization ceiling, we profiled the model on a higher-spec development board (AGX Orin). Even there, latency measured around 180 ms, still far from the target.
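
To make the size of the gap concrete, the arithmetic can be sketched as follows. The helper names are hypothetical; the latency values are the ones quoted above, and the budget is simply 1000 ms divided by the target frame rate.

```python
def frame_budget_ms(target_fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / target_fps


def overshoot(latency_ms: float, target_fps: float = 30.0) -> float:
    """How many times a measured latency exceeds the per-frame budget."""
    return latency_ms / frame_budget_ms(target_fps)


measured = {"Jetson Orin NX": 400.0, "AGX Orin": 180.0}  # ms, from the text
for device, latency in measured.items():
    fps = 1000.0 / latency
    print(f"{device}: {latency:.0f} ms -> {fps:.1f} FPS, "
          f"{overshoot(latency):.1f}x over budget")
```

At 400 ms the target device delivers 2.5 FPS, about 12 times over the per-frame budget, and even the higher-spec board sits more than 5 times over it.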

We applied every viable optimization technique within the development environment (AGX Orin). First, by modifying the original model's ONNX graph to reduce Transpose and Reshape operations and by increasing the downsample ratio, we cut latency from 180 ms to roughly 60 ms. However, once TensorRT 10.3.x was fixed as the runtime, it performed the same graph optimizations automatically, so the manual edits left no further headroom on that path. INT8 quantization actually ran slower than FP16 because only a limited proportion of layers were eligible for INT8. 50% pruning stalled at around 15 FPS, half the target frame rate, and produced visible quality degradation, so pushing it further offered no value.

[Table: three generic optimization attempts on the existing model (ONNX graph edits plus a higher downsample ratio, INT8 quantization, 50% pruning) and their results]

Figure 2: Results of generic optimization attempts on the existing model

We concluded that post-hoc optimization on the existing model's architecture alone could not reach 33 ms. As an alternative, we experimented with four widely cited models in this field: RUAS, Zero-DCE++, LiteIE, and Wave-mamba. None reached the target either.

Figure 3: Existing model optimization and all four alternative models hit the same 33 ms wall

After a sequence of experiments, the root cause became clear. Both the existing model and the four alternatives relied on a heavy computational approach: generating or restoring the entire image from scratch. Under this pattern, no amount of optimization could push past the 33 ms line. Nota AI therefore stopped tuning off-the-shelf models and shifted direction toward designing a lightweight model from scratch, fully accounting for the edge hardware constraints. This decision was only possible because Nota AI handles algorithm development, optimization, and deployment as a single continuous flow.

 

In-house Lightweight ViT (98K) with YUV Channel Separation

Before designing the in-house architecture, we redefined the problem. The essence of low-light video enhancement is restoration of the luminance (Y) channel, not regeneration of the chrominance (UV) channels. In nighttime CCTV environments, most of the identifiable information is luminance data buried in darkness. Color information can be handled by a lightweight correction layer instead of a heavy deep-learning decoder, with negligible impact on downstream recognition algorithms. With this problem definition in place, three structural design decisions followed.

  • YUV color space separation

    We split the input image into Y (luminance) and UV (chrominance) channels, and routed only the Y channel through the model backbone. This reduced the input information the model had to process to one-third, and removed the need for a heavy color reconstruction decoder.

  • Aggressive downsampling at roughly 2/5 ratio

    With the UV channels excluded from the model backbone, we could apply more aggressive downsampling. The luminance channel is relatively robust to the loss of high-frequency spatial information, whereas color channels suffer from pronounced color bleeding when downsampled. By isolating color into a separate lightweight correction layer, we freed the model backbone from this constraint and passed the 1920×1080 input through the backbone at roughly 2/5 of the original size.

  • Bilinear upsampling and color-corrected UV merge

    The Y channel, restored after downsampling, is upsampled back to the original resolution using bilinear interpolation. It is then merged with the original UV channels, which have passed through a lightweight color correction layer to compensate for luminance-driven color shifts, producing the final image. Because color is not regenerated through a heavy learning-based model, color consistency is preserved naturally without a separate decoder.

Figure 4: Data flow separating Y channel computation from UV channel correction
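
The data flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the DeltaViT pipeline: `toy_restore` (a gamma lift) stands in for the learned Y-channel backbone, `identity_uv` stands in for the color correction layer, bilinear interpolation is assumed for the downsampling step as well as the upsampling step, and all function names are hypothetical.

```python
from typing import List, Tuple

Plane = List[List[float]]  # one channel, values in [0, 1] for Y


def resize_bilinear(plane: Plane, out_h: int, out_w: int) -> Plane:
    """Resize one channel with bilinear interpolation (corner-aligned)."""
    in_h, in_w = len(plane), len(plane[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = plane[y0][x0] * (1 - fx) + plane[y0][x1] * fx
            bottom = plane[y1][x0] * (1 - fx) + plane[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def toy_restore(y_small: Plane) -> Plane:
    """Placeholder for the learned Y-channel model: a simple gamma lift."""
    return [[p ** 0.5 for p in row] for row in y_small]


def identity_uv(u: Plane, v: Plane) -> Tuple[Plane, Plane]:
    """Placeholder for the lightweight color correction layer."""
    return u, v


def enhance(y, u, v, scale=0.4, restore=toy_restore, correct_uv=identity_uv):
    """Run Y through the reduced-resolution path; keep UV at full size."""
    h, w = len(y), len(y[0])
    small_h, small_w = max(1, round(h * scale)), max(1, round(w * scale))
    y_small = resize_bilinear(y, small_h, small_w)   # downsample Y only
    y_restored = restore(y_small)                    # backbone sees Y only
    y_full = resize_bilinear(y_restored, h, w)       # back to input size
    u_out, v_out = correct_uv(u, v)                  # UV bypass the backbone
    return y_full, u_out, v_out


# Tiny hypothetical frame: uniformly dark luminance, constant chrominance.
h, w = 10, 15
y = [[0.25] * w for _ in range(h)]
u = [[0.1] * w for _ in range(h)]
v = [[-0.1] * w for _ in range(h)]
y_out, u_out, v_out = enhance(y, u, v)
print(len(y_out), len(y_out[0]), round(y_out[0][0], 3))
```

The point of the structure is that only one channel, at reduced resolution, ever reaches the expensive step, while the chrominance planes stay at full resolution and skip it entirely.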

Within the pipeline built on these three mechanisms, the most critical role belongs to the Luminance (Y) Restoration block. The backbone deployed there is DeltaViT, Nota AI's in-house lightweight Vision Transformer (ViT) with approximately 98K parameters. Compared to RetinexFormer (around 1.6M parameters), a top-ranked leaderboard model for the same task, DeltaViT uses roughly 16× fewer parameters, which gives a direct sense of how lightweight the architecture is.

Note: This design is optimized for low-light CCTV environments where luminance correction is the primary need. Specialized domains with rapidly shifting color temperature, such as stage lighting or industrial spectrum-specific illumination where UV components vary significantly, may require additional validation.

 

28 ms on Jetson Orin NX and Top-Ranked Quality Metrics

After exporting the resulting Torch model to ONNX and converting to TensorRT, we measured under 10 ms latency on the development environment (AGX Orin). Moving to the actual deployment target (Jetson Orin NX), the result was GPU compute time under 28 ms. The quantitative target of 33 ms was cleared with over 5 ms of headroom. On the same device where the client's existing model previously ran at approximately 400 ms, DeltaViT now runs at under 28 ms.
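
The article does not describe how the latency figures were collected, so the following is only a generic sketch of how per-frame latency is often checked against a budget: discard warm-up runs, time repeated executions, and compare the slowest sample with the limit. `fake_inference` is a hypothetical stand-in for one engine execution; for real GPU work the timed call must block until the device has finished.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(run_once: Callable[[], None],
                       warmup: int = 10,
                       iterations: int = 100) -> Dict[str, float]:
    """Time repeated calls and report median and worst-case latency in ms."""
    for _ in range(warmup):          # let caches, clocks, and allocators settle
        run_once()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()                   # for GPU work this must block until done
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


def meets_budget(stats: Dict[str, float], budget_ms: float = 33.0) -> bool:
    """True when even the slowest sampled frame fits inside the budget."""
    return stats["max_ms"] <= budget_ms


def fake_inference() -> None:
    """Hypothetical stand-in for one engine execution on one frame."""
    time.sleep(0.002)


stats = measure_latency_ms(fake_inference, warmup=2, iterations=20)
print(stats, meets_budget(stats))
```

Checking the worst sample rather than the mean is a deliberate choice here: a real-time pipeline drops frames whenever any single frame overruns, regardless of the average.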

With the primary challenge of real-time inference speed resolved, we proceeded to quantitative image quality validation. The model had already passed the client's qualitative visual review against operational requirements. The next step was to quantify the gap against existing general-purpose models.

We ran benchmarks on an RTX 3090 24GB across five standard low-light datasets, comparing the in-house DeltaViT (98K) against LYT-Net (45K), RUAS (3.4K), Zero-DCE++ (10.6K), and RetinexFormer (1.6M). Notably, LYT-Net failed at 1080p inference due to an out-of-memory (OOM) error, making real-world execution unfeasible. Excluding that model, the chart below summarizes the four remaining models on the metrics most relevant to CCTV production environments: LOE, NIQE, and 1080p inference time.

Figure 5: 4-model 3-metric benchmark results, RTX 3090, average across 5 datasets

As the chart shows, DeltaViT ranked first in both NIQE and LOE. It did so with 16× fewer parameters than RetinexFormer (98K vs 1.6M) while running approximately 63× faster. At 1080p inference, DeltaViT measured 8.92 ms, around 112 FPS, or 3.7× headroom over the 30 FPS minimum threshold. Most notably, when the input pixel count increased 6.7× from 480p to 1080p, DeltaViT's inference time grew by only 1.93×, compared with 6.72× for Zero-DCE++ and 7.87× for RetinexFormer.
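
The headline ratios can be rechecked with a few lines of arithmetic. One assumption is made loudly: "480p" is taken to mean 640×480, which is the resolution consistent with the quoted 6.7× pixel increase. All other numbers come from the text.

```python
def pixels(width: int, height: int) -> int:
    """Total pixel count of a frame."""
    return width * height


# Assumption: "480p" means 640x480; the article does not state the exact size.
pixel_ratio = pixels(1920, 1080) / pixels(640, 480)

latency_1080p_ms = 8.92                 # DeltaViT at 1080p, from the text
fps = 1000.0 / latency_1080p_ms
headroom = fps / 30.0

# Latency growth factors from 480p to 1080p, as reported in the text.
growth = {"DeltaViT": 1.93, "Zero-DCE++": 6.72, "RetinexFormer": 7.87}
for name, factor in growth.items():
    # Below 1.0 means latency grows more slowly than the pixel count.
    print(f"{name}: {factor / pixel_ratio:.2f} of proportional scaling")

print(f"pixel ratio {pixel_ratio:.2f}x, {fps:.1f} FPS, {headroom:.2f}x headroom")
```

Read this way, Zero-DCE++ scales almost exactly in proportion to pixel count, RetinexFormer scales worse than proportionally, and DeltaViT's latency grows far more slowly than the input does.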

 

3 vs 0 Channels: The Throughput Gap on a Single RTX 3090

The primary deployment goal of this collaboration was a single channel on Jetson Orin NX. However, the client's operational requirements also included a scenario where multiple camera channels are processed concurrently on a central monitoring server. The number of channels a single GPU can sustain translates directly into deployment and operating cost.

We measured how many channels each of the same four models could process concurrently on a single RTX 3090 24GB while sustaining 1080p at 30 FPS. The results: DeltaViT held 3 channels stably, Zero-DCE++ reached 1 channel at the threshold, RUAS stayed below 1 channel with instability, and RetinexFormer reached 0 channels. Even on a high-end GPU like the RTX 3090, RetinexFormer could not sustain a single 1080p 30 FPS channel.
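
A back-of-the-envelope estimate shows why the channel counts come out this way. The sketch assumes frames from all channels are processed one after another on the GPU with no batching and no other overhead, which is a simplification rather than the measurement method used here. The RetinexFormer latency is derived from the "roughly 63× faster" figure, not a measured value, and `max_channels` is a hypothetical helper.

```python
import math


def max_channels(latency_ms: float, fps: float = 30.0) -> int:
    """Whole channels that fit when frames are processed one after another."""
    budget_ms = 1000.0 / fps            # time between frames of one channel
    return math.floor(budget_ms / latency_ms)


deltavit_ms = 8.92                      # 1080p latency from the text
# Derived, not measured: the text says DeltaViT is roughly 63x faster.
retinexformer_ms = 63 * deltavit_ms

print(max_channels(deltavit_ms))
print(max_channels(retinexformer_ms))
```

For these two models the estimate lands on the same channel counts as the measurements above. Actual capacity also depends on video decoding, memory, and scheduling, so it should be confirmed on the target server.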

DeltaViT was designed for single-channel edge deployment, yet the same model also handles the multi-channel server workload, and needs fewer GPUs to do so than the alternatives tested.

 

A Continuous Flow from Model Design to Optimization

Fitting a low-light video enhancement model into the 33 ms constraint on Jetson Orin NX, without falling below the client's quality thresholds, was beyond the reach of generic post-hoc optimization alone. When standard compression techniques such as quantization and pruning hit the wall of computational overhead inherent to existing restoration models, the solution Nota AI chose was an integrated design approach: combining hardware constraints and domain characteristics from the design stage onward.

By separating YUV channels to reduce the information volume to one-third and applying a lightweight ViT only to the luminance (Y) channel while preserving color relationships, this structural approach did more than shorten processing time. It cleared the target line within the edge device's computational constraints. The result: under 28 ms latency on the target device, alongside first-place rankings on LOE and NIQE in standard benchmark environments.

This case offers an important lesson for the edge deployment domain, where benchmark scores do not translate directly into real-time performance in production environments. In areas where generic graph optimization or compression alone cannot reach the target, what determines the outcome is the ability to treat algorithm modeling and hardware-aware optimization as a single continuous flow rather than disconnected pipelines. In this collaboration, Nota AI validated that integrated design capability on a production target.

These differentiated technical capabilities are built on NetsPresso®, Nota AI's in-house AI optimization platform. To explore Nota AI's tailored AI optimization services, which leverage the NetsPresso engine to maximize performance on target hardware, get in touch today.

 
 
