Data Done Right: The Four Dimensions of Data Quality

Sherrie Clevenger, Principal, Product Management, and Mark Weaver, Principal, Business Development, CoreLogic
As product experts at CoreLogic®, Sherrie and Mark are responsible for transforming industries including real estate, mortgage, insurance, ReTech, PropTech, and FinTech through a variety of choices in data innovations and delivery.

Everyone has heard the saying “garbage in, garbage out.” It’s a saying that especially applies to how companies use data. Quality data is the foundation of any business application or model, and it can make or break product adoption.

Take a startup aiming to rapidly grow and scale new artificial intelligence (AI) models for valuing financial portfolios. When the startup brings this service to market, the underlying data will directly affect, for better or worse, the decisions it hopes to empower its customers to make. Any data that has not been fully vetted can lead to issues in a portfolio, and a portfolio valuation that deviates by even a few percentage points can mean billions of dollars in risk. In short, garbage in, garbage out.

This raises the question: what does “high-quality data” really look like? At its core, high-quality data has four key dimensions: accuracy, coverage, currency, and completeness. Let’s break them down.

Data Accuracy

Data accuracy is the most intuitive feature of high-quality data. It means that the value stored for any given data point represents the truth about a real-world object.

For example, a national dataset of property sale prices is necessarily a blend of actual and estimated sale prices. If the actual sale amount is available, it is captured directly. Otherwise, if the tax stamp amount is available, a tax rate table is used to calculate the sale price. Although this is a calculated sale price, its reliability is comparable to that of an “actual” sale price.

If neither of the above is available, the industry’s common practice is to estimate the sale price from the mortgage amount by applying a multiplication factor based on the loan type.
While this method of calculating an estimated sale price is a commonly accepted industry practice, the usefulness of the resulting sale prices is questionable. Under this industry standard, over 47% of current estimated sale prices have an error greater than 10%*.
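The fallback order described above (actual price, then tax stamp, then mortgage factor) can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the record fields, the tax rate, and the loan-type factors below are all hypothetical placeholders, and real rate tables vary by jurisdiction.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical values for illustration only. Real transfer-tax rates vary by
# jurisdiction, and real loan factors vary by loan type.
TRANSFER_TAX_RATE = {"county_a": 0.0011}  # tax paid per dollar of sale price
LOAN_TYPE_FACTOR = {"conventional": 1.25, "fha": 1.04}  # price / mortgage


@dataclass
class SaleRecord:
    county: str
    actual_price: Optional[float] = None
    tax_stamp_amount: Optional[float] = None
    mortgage_amount: Optional[float] = None
    loan_type: Optional[str] = None


def derive_sale_price(rec: SaleRecord) -> Tuple[Optional[float], str]:
    """Return (price, method), preferring the most reliable source available."""
    if rec.actual_price is not None:
        return rec.actual_price, "actual"
    rate = TRANSFER_TAX_RATE.get(rec.county)
    if rec.tax_stamp_amount is not None and rate:
        # Back the price out of the tax paid using the rate table.
        return rec.tax_stamp_amount / rate, "tax_stamp"
    factor = LOAN_TYPE_FACTOR.get(rec.loan_type or "")
    if rec.mortgage_amount is not None and factor:
        # Least reliable: scale the mortgage amount by a loan-type factor.
        return rec.mortgage_amount * factor, "mortgage_factor"
    return None, "unavailable"


def share_with_large_error(estimates, actuals, threshold=0.10):
    """Fraction of estimates whose relative error exceeds the threshold."""
    pairs = list(zip(estimates, actuals))
    misses = sum(1 for est, act in pairs if abs(est - act) / act > threshold)
    return misses / len(pairs)
```

Returning the method alongside the price lets downstream models weight or filter records by how the price was obtained. The second function shows how a figure like the share of estimates with more than 10% error could be measured wherever actual prices are later known.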

A statistically created model, using machine learning (ML), is another option which, although not as reliable as a reported actual price, provides a superior estimate over the simple multiplication factor approach when the price information is unavailable. Machine learning can learn from data without relying on rules-based programming, while statistical models require prior assumptions about the data.

Data Coverage

Coverage speaks to the breadth and scope of data across geographies. If property data isn’t broadly available in the targeted geographies, it won’t facilitate a consistent, quality customer experience.

The information for many properties is not accessible from a single source such as public record data, enterprise proprietary sources or derived data. That’s why it’s critical to rely on software solutions that offer a data-dense “single source of truth” for residential property that blends data beyond a single source (for example, by blending public record data with other datasets).
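To make the idea concrete, here is a small sketch of how blending sources can raise coverage in a target geography. The parcel IDs, field names, and source layout are assumptions made for the example; a real blend would also need rules for resolving conflicts between sources.

```python
from collections import defaultdict


def blend_sources(*sources):
    """Merge records keyed by parcel ID; earlier sources win on conflicts."""
    blended = {}
    for source in sources:
        for parcel_id, record in source.items():
            merged = blended.setdefault(parcel_id, {})
            for field, value in record.items():
                # Only fill fields that are still missing.
                if value is not None and merged.get(field) is None:
                    merged[field] = value
    return blended


def coverage_by_geography(target_parcels, blended):
    """Share of target parcels per geography with a non-empty record."""
    totals, covered = defaultdict(int), defaultdict(int)
    for parcel_id, geography in target_parcels.items():
        totals[geography] += 1
        if blended.get(parcel_id):
            covered[geography] += 1
    return {geo: covered[geo] / totals[geo] for geo in totals}


# Hypothetical sample data.
public_record = {"P1": {"beds": 3, "sqft": None}, "P2": {"beds": 2, "sqft": 900}}
proprietary = {"P1": {"sqft": 1500}, "P3": {"beds": 4, "sqft": 2100}}
targets = {"P1": "county_a", "P2": "county_a", "P3": "county_b", "P4": "county_b"}

print(coverage_by_geography(targets, blend_sources(public_record)))
# {'county_a': 1.0, 'county_b': 0.0}
print(coverage_by_geography(targets, blend_sources(public_record, proprietary)))
# {'county_a': 1.0, 'county_b': 0.5}
```

In this toy data, public records alone leave county_b with no coverage, while adding the second source covers half of its target parcels and also fills the missing square footage for P1.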

Data Currency

Currency refers to the frequency at which your data sources are updated and how readily you are able to access that refreshed data, which is critical in a world that is constantly changing. For example, many PropTech applications cite number of bedrooms and square footage of a home as important data points. Over time, the probability of the home being remodeled to change both bedroom count and square footage increases. In addition, capturing changes to transactions and taxes is critical to have the most current information.

Another example that highlights the importance of data currency is how land data is recorded. Land may be subdivided to create multiple sub-lots, with each sub-lot having a different address, parcel number, and owner than the original lot. Traditional data systems have no means of linking the original lot to the new lots. Thus, an original lot and a new lot may appear to be different properties altogether.
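One way to preserve that relationship is to store a link from each sub-lot back to its original lot. The sketch below is illustrative only: the `parent_parcel` field and the helper names are hypothetical, not part of any standard parcel schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Parcel:
    parcel_number: str
    address: str
    owner: str
    parent_parcel: Optional[str] = None  # hypothetical lineage field


def subdivide(registry: Dict[str, Parcel], parent_number: str,
              children: List[Parcel]) -> None:
    """Register new sub-lots and link each one to the original lot."""
    for child in children:
        child.parent_parcel = parent_number
        registry[child.parcel_number] = child


def lineage(registry: Dict[str, Parcel], parcel_number: str) -> List[str]:
    """Walk parent links from a parcel back to its original lot."""
    chain, seen = [], set()
    current = registry.get(parcel_number)
    while current is not None and current.parcel_number not in seen:
        seen.add(current.parcel_number)  # guard against circular links
        chain.append(current.parcel_number)
        parent = current.parent_parcel
        current = registry.get(parent) if parent else None
    return chain


registry = {"100": Parcel("100", "1 Main St", "Original Owner")}
subdivide(registry, "100", [
    Parcel("100-A", "1A Main St", "Owner A"),
    Parcel("100-B", "1B Main St", "Owner B"),
])
print(lineage(registry, "100-B"))  # ['100-B', '100']
```

With the link in place, a query on either sub-lot can be traced back to the original lot, even though the address, parcel number, and owner all differ.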

Across properties, fluctuations are happening daily and in significant volumes. Ultimately, maintaining up-to-date data can greatly impact loan-to-value ratios or other metrics which drive sophisticated models.
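A quick worked example shows how a stale property value distorts loan-to-value (LTV), taken here in its usual form of loan balance divided by property value. The dollar amounts and dates are made up for illustration.

```python
from datetime import date


def loan_to_value(loan_balance: float, property_value: float) -> float:
    """LTV ratio: loan balance divided by property value."""
    return loan_balance / property_value


def days_stale(last_updated: date, as_of: date) -> int:
    """Number of days since the record was last refreshed."""
    return (as_of - last_updated).days


loan_balance = 400_000
stale_value = 500_000    # hypothetical value from an old record
current_value = 450_000  # hypothetical value after a refresh

print(loan_to_value(loan_balance, stale_value))               # 0.8
print(round(loan_to_value(loan_balance, current_value), 3))   # 0.889
print(days_stale(date(2023, 1, 1), date(2023, 7, 1)))         # 181
```

Here the stale record reports an LTV of 80%, while the refreshed value puts it near 89%. A gap of that size could move a loan across a risk threshold in a downstream model.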

Data Completeness

The fourth dimension of data quality is completeness; that is, the proportion of elements that are populated in a record. Having accurate data won’t yield a high-quality outcome if only a small fraction of possible attributes is populated. Data completeness can be improved by blending multiple sources of data and by using deterministic rules to estimate other characteristics.
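Completeness as defined above is straightforward to measure. The sketch below computes it for one record and then applies a single deterministic rule to fill a gap. The field names and the rule itself are hypothetical examples, chosen only to show the mechanics.

```python
def completeness(record: dict, expected_fields: list) -> float:
    """Proportion of expected fields that are populated in the record."""
    populated = sum(
        1 for field in expected_fields if record.get(field) not in (None, "")
    )
    return populated / len(expected_fields)


def apply_story_rule(record: dict) -> dict:
    """Hypothetical rule: if the ground floor accounts for all of the living
    area, the home has a single story."""
    filled = dict(record)
    living = filled.get("living_sqft")
    ground = filled.get("ground_floor_sqft")
    if filled.get("stories") is None and living and living == ground:
        filled["stories"] = 1
    return filled


FIELDS = ["beds", "baths", "living_sqft", "ground_floor_sqft", "stories"]
record = {"beds": 3, "baths": None, "living_sqft": 1500,
          "ground_floor_sqft": 1500, "stories": None}

print(completeness(record, FIELDS))                    # 0.6
print(completeness(apply_story_rule(record), FIELDS))  # 0.8
```

The rule raises completeness from 60% to 80% without a new source. Values filled this way are inferred rather than observed, so it can help to flag them separately from recorded values.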

Conclusion

A lack of industry standards for real property identification makes it difficult for businesses to manage and use property data to make accurate decisions involving property locations, pricing, competition analysis, prospect targeting, and more.

Without four-dimensional data quality — accuracy, coverage, currency and completeness — the foundation on which data is built is flawed. Start with high-quality data, and you can build innovative and disruptive solutions from the ground up on a strong, reliable foundation and maximize customer delight at scale.