Since the 1990s, federal legislation and state and local politics have changed the landscape for metropolitan land use and transportation planning, and Metropolitan Planning Organizations (MPOs) are scrambling to react. There is an urgent need for improved models that address the interdependencies between land use and transportation at state and regional levels, and considerable new work is underway to develop such models (see, for example, ULTRANS ITS UC Davis and HBA Specto Incorporated, 2011; Waddell et al., 2010; Weidner et al., 2009). These models, and the planning practices that integrate land use and transportation, however, require the integration of massive amounts of land use and socio-economic data that are messy and incomplete. As much as 70% of the total effort in developing integrated land use and transportation models has been suggested to be directly or indirectly associated with data development, integration and cleaning (Waddell et al., 2005). Even with such painstaking effort, current data development practice in most planning agencies is largely ad hoc and not reusable. This practice is under growing pressure as data sources are updated more frequently (for example, the Household and Population Census moved from a decennial survey to the annual rolling American Community Survey) or even in real time (such as traffic count data). In addition, the data quality issues that come with these new sources leave agencies scrambling to cope.
In recent decades, there have been considerable advances in computer science and statistics, such as Bayesian statistics, data mining and machine learning, which have been applied to such data problems in a wide range of domains. These domains are as varied as cleaning web data, detecting fraud in credit data, reconciling medical records, and mining the vast streams of email and web content for targeted advertising (see, for example, Hu et al., 2012).
To date, most attempts to tackle the problem in the modeling community (for example, Abraham et al., 2009, 2005; Waddell et al., 2005) are tied to a specific model system and a chosen study area. Few systematic efforts have applied these technological advances to produce reusable tools for land use and transportation data. The data integration project aims to address these challenging data problems by leveraging interdisciplinary techniques to develop reusable methods and tools. Specifically, we focus on making the data preparation process for modeling more systematic and reproducible through a harmonized data scheme and reusable tools.