What is CLEAN data?

Without quality data, there is no artificial intelligence.

Defence does not have the luxury of the data volumes available to other industries, so as the size of the available datasets decreases, their quality must increase if truly trustworthy machine learning models are to be deployed.

The process of improving the quality of data is sometimes referred to as ‘cleaning’ it, and it is often time-consuming - at least initially. Even the most modern simulators built on the latest standards still produce data that was designed for system use, not for analytics.

Data scientists must study the data, convert formats, and standardise and curate it. This work can be automated once the data is understood, but reaching that understanding takes time. Only when the data has been converted into a more analytics-friendly state, and then stored irrespective of its present value or current project needs, does its value stand a chance of being realised.
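
To make that conversion step concrete, here is a minimal sketch of the kind of work involved, using pandas. The file name, column names, and target format are assumptions for illustration, not a prescription for any particular system.

```python
import pandas as pd

# Hypothetical raw export from a simulator or system log (file and column names assumed)
raw = pd.read_csv("sensor_log.csv")

# Standardise column names: lower-case, underscores instead of spaces
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]

# Convert system-specific representations into analytics-friendly types
raw["timestamp"] = pd.to_datetime(raw["timestamp"], utc=True, errors="coerce")
raw["speed_kts"] = pd.to_numeric(raw["speed_kts"], errors="coerce")

# Drop exact duplicate records and persist in a columnar format for later reuse,
# regardless of whether a current project needs it yet
clean = raw.drop_duplicates()
clean.to_parquet("sensor_log_clean.parquet", index=False)
```

Once the quirks of a given source are understood, steps like these can be wrapped into an automated pipeline and rerun on every new delivery of data.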

The US Department of Defense recently republished its guidance on improving fundamental data management to increase the quality and availability of relevant DoD data. It has established seven “Data Quality Dimensions” to help define the quality of a particular dataset (a short sketch of how a few of them might be measured follows the list):

  • Accuracy: Data that correctly reflect proven, true values or the specified action, person, or entity. Accuracy includes data structure, content, and variability.

  • Completeness: The data present at a specified time contain the expected information or statistics, as measured at the data set, row, or column level.

  • Conformity: Data sets follow agreed upon internal policies, standards, procedures, and architectural requirements.

  • Consistency: The degree to which a value is uniformly represented within and across data sets.

  • Uniqueness: Ensures there is a one-to-one alignment between each observed event and the record that describes such an event.

  • Integrity: A data set's pedigree, provenance, and lineage are known and aligned with relevant business rules.

  • Timeliness: Measures the time between an event occurring and the data's availability for use.
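
Several of these dimensions lend themselves to simple, repeatable checks. The sketch below, again in pandas, illustrates one way completeness, uniqueness, and timeliness might be measured; the column names (event_id, event_time, ingest_time) are assumptions, and this is not the DoD's prescribed measurement method.

```python
import pandas as pd

# Hypothetical cleaned dataset with assumed columns: event_id, event_time, ingest_time
df = pd.read_parquet("sensor_log_clean.parquet")

# Completeness: share of non-null cells, per column and overall
completeness = df.notna().mean()

# Uniqueness: one record per observed event (no duplicate event identifiers)
uniqueness = 1.0 - df["event_id"].duplicated().mean()

# Timeliness: delay between an event occurring and the data becoming available
lag = pd.to_datetime(df["ingest_time"]) - pd.to_datetime(df["event_time"])
timeliness = lag.median()

print(f"Completeness (overall): {completeness.mean():.1%}")
print(f"Uniqueness:             {uniqueness:.1%}")
print(f"Median ingest lag:      {timeliness}")
```

Tracked over time, metrics like these give a dataset a measurable quality baseline rather than a vague sense that the data is "probably fine".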

Poor-quality data will inevitably undermine trustworthiness, raise security concerns, and be of limited value to analytical and AI efforts. The only thing better than cleaned, processed, analytics-friendly data is data that was designed to be that way from the very beginning of the procurement process.

We’re already helping a number of defence companies and MOD organisations shape their data strategies to mitigate the effects of dirty data, so reach out anytime if this is a problem you or your organisation is also facing - happy to chat. hello@missiondecisions.com
