Portuguese Construction Dataset for AI BOQ Text Extraction and Synthetic Data Augmentation Using LLMs

Luís Jacques de Sousa1,2, João Poças Martins2, Luís Sanhudo1, João Miguel Silva1
1 BUILT CoLAB—The Collaborative Laboratory for the Built Environment of the Future, Portugal
2 CONSTRUCT-GEQUALTEC, Department of Civil and Georesources Engineering, Faculty of Engineering (FEUP), University of Porto
DOI: 10.35490/EC3.2025.454
Abstract: Manual classification of Bills of Quantities (BOQs) in construction procurement is labor-intensive and error-prone, which limits efficiency in bidding and contract management. No structured datasets for BOQ classification exist in the literature, which restricts automation routes. To address this, we present a labeled dataset of BOQ tasks from Portuguese public procurement contracts, structured for multilabel classification. Synthetic augmentation using GPT-4o Mini and cosine similarity-based batching mitigated class imbalance, expanding the training data to 23,542 examples per fold (3 folds). This dataset provides a Portuguese construction corpus and enables Artificial Intelligence-driven BOQ task classification, fostering procurement automation and expanding automation routes in construction contract analysis.
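
The abstract does not describe how the cosine similarity-based batching works, so the following is only an illustrative sketch of one way such batching could be done. It assumes a simple bag-of-words representation and a greedy grouping rule. The function names (`bow`, `cosine`, `batch_by_similarity`), the `threshold` value, and the English example descriptions are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words term counts (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def batch_by_similarity(texts, threshold=0.5):
    """Greedy grouping: a text joins the first batch whose seed text is
    at least `threshold` similar; otherwise it seeds a new batch."""
    batches = []  # list of (seed_vector, member_texts)
    for text in texts:
        vec = bow(text)
        for seed, members in batches:
            if cosine(seed, vec) >= threshold:
                members.append(text)
                break
        else:
            batches.append((vec, [text]))
    return [members for _, members in batches]


# Hypothetical BOQ task descriptions, for illustration only.
tasks = [
    "supply and installation of ceramic floor tiles",
    "supply and installation of ceramic wall tiles",
    "excavation of foundation trenches",
    "mechanical excavation of trenches for foundations",
]
for batch in batch_by_similarity(tasks):
    print(batch)
```

Under these assumptions, each batch of similar task descriptions could then be passed together to a language model as context for generating synthetic variants of under-represented classes. A real pipeline would more plausibly compare dense sentence embeddings than raw word counts.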
