Portuguese Construction Dataset for AI BOQ Text Extraction and Synthetic Data Augmentation Using LLMs
DOI: 10.35490/EC3.2025.454
Abstract: Manual classification of Bills of Quantities (BOQ) in construction procurement is labor-intensive and error-prone, which reduces efficiency in bidding and contract management. No structured datasets for BOQ classification exist in the literature, which restricts routes to automation. To address this, we present a labeled dataset of BOQ tasks from Portuguese public procurement contracts, structured for multilabel classification. Synthetic augmentation using GPT-4o Mini with cosine similarity-based batching mitigated class imbalance, expanding the training data to 23,542 examples per fold (3 folds). The dataset provides a Portuguese construction corpus and enables Artificial Intelligence (AI)-driven BOQ task classification, fostering procurement automation and opening further automation routes in construction contract analysis.
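As an illustration of what cosine similarity-based batching could look like, the following minimal Python sketch groups similar task descriptions before they would be passed to a language model for augmentation. It is not the procedure used to build the dataset: the bag-of-words representation, the greedy grouping rule, the function names (`bow`, `cosine`, `batch_by_similarity`), the `threshold` and `max_size` parameters, and the sample descriptions are all assumptions made for this example.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts (an assumed stand-in for a real embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def batch_by_similarity(texts, threshold=0.5, max_size=4):
    """Greedily place each text in the first batch whose seed text is similar.

    threshold and max_size are hypothetical parameters chosen for this sketch.
    """
    vectors = [bow(t) for t in texts]
    batches = []  # each batch is a list of indices; batch[0] is its seed
    for i, vec in enumerate(vectors):
        for batch in batches:
            seed = vectors[batch[0]]
            if len(batch) < max_size and cosine(vec, seed) >= threshold:
                batch.append(i)
                break
        else:
            # No sufficiently similar batch with free space: start a new one.
            batches.append([i])
    return [[texts[i] for i in batch] for batch in batches]


if __name__ == "__main__":
    # Invented sample BOQ-style descriptions, not taken from the dataset.
    tasks = [
        "escavação de valas",
        "escavação de valas em rocha",
        "pintura de paredes interiores",
        "pintura de paredes exteriores",
    ]
    for batch in batch_by_similarity(tasks):
        print(batch)
```

Under these assumptions, each batch holds descriptions that share most of their vocabulary, so a generation prompt built from one batch would stay focused on one kind of task. A practical pipeline would more likely compare dense sentence embeddings than raw word counts.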