While single-frame OCR (Optical Character Recognition) has reached high accuracy, mobile video capture introduces motion blur, glare, and perspective distortion that vary from frame to frame. This paper introduces an expanded dataset focusing on high-variability environmental conditions. We propose a Multi-Frame Fusion Network (MFFN) that uses temporal information across the video stream to "denoise" document fields, achieving a 15% increase in field-level accuracy over static baselines.

2. Introduction
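
As a point of reference for the multi-frame idea, the sketch below fuses per-frame OCR readings of a single field by confidence-weighted voting. It is not the MFFN, which this text does not specify. It is a simple non-learned baseline, and the function name `fuse_field_readings`, the `(text, confidence)` input format, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict


def fuse_field_readings(readings):
    """Fuse per-frame readings of one document field.

    `readings` is a list of (text, confidence) pairs, one per video frame.
    This input format is an assumption for illustration, not the paper's.
    Returns the text with the highest summed confidence, or None if empty.
    """
    scores = defaultdict(float)
    for text, confidence in readings:
        # Each frame votes for its reading, weighted by OCR confidence.
        scores[text] += confidence
    if not scores:
        return None
    # Pick the candidate with the largest accumulated confidence.
    return max(scores, key=scores.get)


# Hypothetical readings of the same field across four frames:
# two clean frames and two degraded by blur or glare.
frames = [
    ("AB123456", 0.91),
    ("A8123456", 0.40),
    ("AB123456", 0.85),
    ("AB1Z3456", 0.35),
]
print(fuse_field_readings(frames))  # prints AB123456
```

Voting of this kind only helps when errors differ between frames. A learned fusion model could in principle also exploit frame quality and character-level alignment, which this sketch ignores.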


Midv586 -
