Andrea Pinna
Efficient translation of HTML to JSON for enhanced web content production
Di Stefano A.; Ramzan F.; Reforgiato Recupero D.
2026-01-01
Abstract
The automated transformation of unstructured HTML into schema-compliant data is a foundational challenge for platform interoperability and the scalability of modern no-code web editors. While powerful, Large Language Models (LLMs) are often ill-suited for this task due to their inherent stochasticity and computational cost, failing to guarantee the deterministic precision required at scale. This paper addresses this challenge by introducing a novel, deterministic pipeline to translate arbitrary HTML emails into the proprietary, grid-based JSON of the Beefree content platform. Our core contributions are: (1) a hybrid methodology that combines Document Object Model (DOM) analysis for semantics with computer vision for geometric layout interpretation; (2) a vision-based abstraction technique using visual placeholders for robust row-column detection, resilient to DOM structural variations; and (3) a rigorous, dual-faceted validation of its real-world viability via a large-scale assessment on over 16,000 HTML emails and qualitative usability studies (SUS) with 16 industry professionals and 10 academic researchers. The results confirm our deterministic, vision-augmented approach is a highly effective and scalable alternative to generative models for structured content creation in production environments.

| File | Type | Size | Format |
|---|---|---|---|
| s11042-026-21232-7-1.pdf (archive managers only) | Publisher's version | 2.38 MB | Adobe PDF |
| s11042-026-21232-7-1 (1) (1) (1).pdf (embargo until 23/01/2027) | Author’s Accepted Manuscript (AAM), post-print (version accepted by the publisher) | 2.22 MB | Adobe PDF |
University of Cagliari