What Makes Data Good?
It’s all about Usability, Relevance, and Completeness
Mar 03, 2021
As Artificial Intelligence (AI) and Machine Learning (ML) technologies gain increased adoption, organizations are launching efforts to extract additional value from their data using these new technologies. While some initiatives have been successful, many industries have struggled. A prominent example of the latter can be found within the healthcare sector.
Hospitals and healthcare organizations are sitting on mountains of data, and are gathering more every day. Yet performing meaningful analysis on that data at scale remains a daunting task. The data exists, but it is often not good data.
Here, Software Engineer Sumeet Shah explores what makes data good, by looking at three key attributes: usability, relevance, and completeness.
How easy is it for users to access and interact with the data? As we mentioned, hospitals collect and store massive amounts of data. Unfortunately, much of it is not in a usable state. There are many aspects of usability that can serve as potential barriers.
- Medium: Data in the form of physical records is valuable (and sometimes legally mandated) but very difficult to use at scale. For large-scale analysis and AI/ML, data needs to be digital. First, new data should be recorded digitally as it is generated. Second, pre-existing physical records should be digitized. Thankfully, manual data re-entry is not the only option. Automation tools like optical character recognition (OCR) can help to speed things up! Nonetheless, forms and pictures take time to scan in high quality, and the resulting data will still have to be mapped and organized in digital data stores.
- Structure: A common problem is data of variable structure. Different hospitals and research labs may collect the same data but organize and label it differently. In fact, it is not uncommon for independent operators within the same organization to structure their data differently. To make things worse, the same piece of data may go by different names depending on context. A common set of conventions must be agreed upon, and non-conforming data must be restructured before the data can be aggregated.
Given that every project’s needs will be different, there’s no one data structure that will solve all of our problems. What we can do, however, is adopt common, widely used data standards and present the data in compliance with them. Users will know exactly what they’re getting when they access the data and will have an easier time mapping data into and out of the system. A prime example of this practice in action is the Fast Healthcare Interoperability Resources (FHIR) standard developed by Health Level Seven International. The FHIR standard defines a modular, extensible data format for healthcare and research data, along with programmatic standards for interacting with the data. FHIR is gaining traction all over the world as a common standard and is helping create a rich and cohesive global health data environment.
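To make the restructuring step concrete, here is a minimal Python sketch of mapping records from two sources onto one shared convention. Everything in it is an assumption made for illustration: the source names, the field names, the date formats, and the target schema are hypothetical and deliberately much simpler than a real standard such as FHIR.

```python
from datetime import datetime

# Hypothetical mappings: each source's field name -> the shared convention's name.
FIELD_MAPS = {
    "hospital_a": {"pt_id": "patient_id", "dob": "birth_date", "sex": "gender"},
    "hospital_b": {
        "PatientNumber": "patient_id",
        "DateOfBirth": "birth_date",
        "Gender": "gender",
    },
}

# Hypothetical date formats used by each source.
DATE_FORMATS = {"hospital_a": "%m/%d/%Y", "hospital_b": "%Y-%m-%d"}

# Different spellings of the same value collapse to one code.
GENDER_CODES = {"m": "male", "male": "male", "f": "female", "female": "female"}


def normalize(record, source):
    """Rename fields and normalize values so records from any source line up."""
    mapping = FIELD_MAPS[source]
    out = {mapping[key]: value for key, value in record.items() if key in mapping}

    # Store every date as an ISO 8601 string, whatever format the source used.
    parsed = datetime.strptime(out["birth_date"], DATE_FORMATS[source])
    out["birth_date"] = parsed.date().isoformat()

    # Unrecognized gender values become "unknown" instead of raising an error.
    out["gender"] = GENDER_CODES.get(str(out["gender"]).strip().lower(), "unknown")
    out["patient_id"] = str(out["patient_id"])
    out["source"] = source  # keep track of where the record came from
    return out


records = [
    normalize({"pt_id": 17, "dob": "03/09/1961", "sex": "F"}, "hospital_a"),
    normalize(
        {"PatientNumber": "B-204", "DateOfBirth": "1958-11-30", "Gender": "male"},
        "hospital_b",
    ),
]
```

Once both records share the same keys and value formats, they can be aggregated and queried together. In practice the target would more likely be an established standard than a homegrown schema like this one.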
- Discoverability: Even if we have digitized our data sets in a common structure, data that cannot be searched for and found may as well not exist. The desired audience, be it entities within an organization, or the internet as a whole, should be able to easily access and search the data. This doesn’t mean that every record in the data set must be directly searchable, but rather that there should be searchable manifests that contain descriptions of the data and include tags/keywords.
Digitizing and structuring data makes it easier to analyze, search, and share. A great example of these usability practices in action can be found at the UCI Machine Learning Repository. Data in this repository is extremely diverse, pertaining to a wide range of subjects and coming from many different sources. Each data set comes with a manifest that includes documentation on the data’s origin, how it is structured, and keywords to make the data set searchable. As a result, users are able to access the repository online, search for data sets that relate to the subject they’re studying, and quickly understand how the data is structured so they can parse it and use it for their machine learning efforts.
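As a rough illustration of the manifest idea, the Python sketch below searches a small catalog by keyword without touching the underlying records. The `Manifest` fields, the `search` function, and the catalog entries are all hypothetical. They are not the UCI repository's actual format or API.

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    """A hypothetical, minimal description of one data set."""
    name: str
    description: str
    origin: str
    keywords: set = field(default_factory=set)


def search(manifests, query):
    """Return data set names ranked by how many query terms each manifest matches."""
    terms = query.lower().split()
    hits = []
    for m in manifests:
        # Only the manifest is searched, never the individual records.
        haystack = {k.lower() for k in m.keywords} | set(m.description.lower().split())
        score = sum(1 for term in terms if term in haystack)
        if score:
            hits.append((score, m.name))
    # Best match first; ties are broken alphabetically by name.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [name for _, name in hits]


# Illustrative catalog; names, origins, and keywords are made up.
CATALOG = [
    Manifest("post_op_infections", "Infection outcomes after surgery",
             "Hospital A surgical ward", {"surgery", "infection"}),
    Manifest("cardiac_screening", "EKG, cholesterol and blood pressure readings",
             "Hospital B cardiology", {"cardiac", "ekg"}),
    Manifest("hand_hygiene_audit", "Hygiene compliance observations",
             "Hospital A infection control", {"hygiene", "infection"}),
]

results = search(CATALOG, "infection surgery")
```

A real catalog would want stemming, synonyms, and access controls, but even this much makes a data set findable by someone who has never seen it.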
Does the data pertain to the problem being analyzed? The relevance of data is, by definition, relative to an objective. So, when there is a clear objective in mind, it should be easy to assess the relevance of the data. Let’s say a medical team has the objective of minimizing the number of post-surgery infections. Having this objective is a prerequisite both to filtering existing data and to identifying new data to gather if the relevant data does not yet exist. This team can safely filter out any data gathered on non-surgical patients, as it would just be noise in the context of their objective. A clear objective is critical to identifying relevant metrics.
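The filtering step for the post-surgery infection example might look something like the Python sketch below. The record fields (`had_surgery`, `infection`) and the sample patients are invented for illustration. The point is only that the objective is what defines the relevance test.

```python
# Hypothetical patient records; field names and values are illustrative only.
patients = [
    {"id": "p1", "had_surgery": True, "infection": False},
    {"id": "p2", "had_surgery": True, "infection": True},
    {"id": "p3", "had_surgery": False, "infection": True},
    {"id": "p4", "had_surgery": True, "infection": False},
]


def is_relevant(record):
    """Relevance test for the objective 'minimize post-surgery infections'."""
    return record.get("had_surgery") is True


# Partition the records: the non-surgical ones are set aside, not deleted.
relevant = [p for p in patients if is_relevant(p)]
set_aside = [p for p in patients if not is_relevant(p)]

# The metric is computed over relevant records only, so the non-surgical
# infection (p3) does not distort it.
infection_rate = sum(p["infection"] for p in relevant) / len(relevant)
print(f"{len(relevant)} relevant records, infection rate {infection_rate:.0%}")
```

Changing the objective means changing `is_relevant`, and the same raw data yields a different working set.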
It is not uncommon for organizations to lack clear objectives. They have data and they know they have room for improvement, but they may not be sure where to start. That’s okay! The subjectivity of data relevance can work to our advantage. When there isn’t a clear goal, we can sometimes work backwards and derive problems that the data may be relevant to. Perhaps that non-surgical patient data can be studied to try and find ways to improve hygiene and manage other sources of infection in the hospital. One study’s trash can be another’s treasure.
- Adaptability: Without an objective, we can’t really measure relevance. But once there is a goal, we can start by using our best judgement. Common sense goes a long way, and things that seem relevant or irrelevant probably are. That being said, we may learn that some pieces aren’t as relevant as they seemed, or that data labeled as irrelevant is actually important. Be flexible, and don’t be afraid to adjust what data you gather and use as you build domain knowledge.
Data allows us to paint a picture of the problem domain being analyzed. Relevant data will lie within this domain. Completeness is the percentage of the relevant domain that the data covers.
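If we assume, purely for illustration, that the relevant dimensions of a problem can be written down as a finite list, this percentage can be computed directly. The Python sketch below does so for a hypothetical cardiac study. The dimension names and the `dimensional_coverage` helper are assumptions, and a real list of relevant dimensions is rarely this tidy.

```python
def dimensional_coverage(collected, relevant):
    """Return (fraction of relevant dimensions collected, sorted missing dimensions)."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("at least one relevant dimension is required")
    covered = set(collected) & relevant
    return len(covered) / len(relevant), sorted(relevant - covered)


# Hypothetical dimensions for a cardiac health study.
relevant_dimensions = {
    "ekg", "cholesterol", "blood_pressure", "smoking", "diet",
    "exercise", "genetic_factors", "family_history",
}
collected_dimensions = {"ekg", "cholesterol", "blood_pressure"}

coverage, missing = dimensional_coverage(collected_dimensions, relevant_dimensions)
print(f"coverage: {coverage:.1%}, missing: {missing}")
```

The number is only as trustworthy as the list of relevant dimensions behind it, so it is better read as a prompt for discussion than as a precise score.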
- Dimensional Coverage: The first facet of data completeness is the set of dimensions being gathered as data. For discrete problems in well-understood environments, the relevant dimensions may be easy to identify. However, for highly complex problems like human health, completeness may be more of an ideal to strive for than a concrete milestone. For example, if we were studying the cardiac health of people over 50, we might have collected EKG readings and information about cholesterol levels and blood pressure. These dimensions are relevant, but incomplete. To cover the full domain of the problem, we would also need to know about smoking habits, diet, exercise, genetic factors, and family history. Given the complexity of the human body, that list may go on to encompass countless other factors. At some point, additional data starts to seem less relevant and more peripheral.
While it would be ideal to be able to gather every piece of information about a problem, this is seldom possible. In terms of limiting the scope of the data being gathered, sometimes we just have to draw a line in the sand at a point that seems reasonable. Otherwise, the onus of data collection will become crippling. As always, one must balance the cost of data collection with the potential value it may yield. And keep in mind that some dimensions initially identified as relevant may turn out not to be, and vice versa.
For example, in the cardiac study above, it would seem completely reasonable to ignore some seemingly unrelated pieces of data like, let’s say, the shape of the patients’ ears. That’s probably peripheral, irrelevant data. Except that it’s not. By ignoring that metric, we would miss out on being able to observe Frank’s Sign, a diagonal crease in the ear lobe that has been associated with cardiovascular disease. But there’s no way we could have known that prior to the discovery of that relationship! Since it can be very difficult to discern whether data is relevant or not, complete dimensional coverage is often better thought of as an ideal.
- Representativeness: Once a set of relevant dimensions has been identified, it’s important that the population from which these metrics are gathered is representative of the entire domain. If data is not representative, then analyses conducted on it will not generalize to the actual population. This has been recognized as a persistent issue in clinical trials: despite the high ethnic diversity of America’s population, the ethnic diversity of clinical trial participants is comparatively very low. As a result, the data gathered in these trials is not representative of the actual population, and generalizing the outcomes can lead to misinformed medical decisions with potentially harmful outcomes.
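One simple way to check representativeness is to compare each group's share of the sample with its share of the target population. The Python sketch below does this with made-up group labels and made-up numbers. The `representation_gaps` helper and the five-percentage-point threshold are assumptions chosen for illustration, not established guidelines.

```python
from collections import Counter


def representation_gaps(sample_labels, population_shares):
    """Return each group's sample share minus its population share."""
    total = len(sample_labels)
    if total == 0:
        raise ValueError("sample is empty")
    counts = Counter(sample_labels)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Made-up numbers: who is in the population vs. who was actually sampled.
population_shares = {"group_a": 0.60, "group_b": 0.25, "group_c": 0.15}
sample = ["group_a"] * 80 + ["group_b"] * 15 + ["group_c"] * 5

gaps = representation_gaps(sample, population_shares)

# Flag groups under-represented by more than 5 percentage points
# (the threshold is an arbitrary choice for this sketch).
under_represented = sorted(group for group, gap in gaps.items() if gap < -0.05)
print(under_represented)
```

A negative gap means the group makes up a smaller share of the sample than of the population, which is a cue to recruit differently or to reweight before generalizing any results.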
Taking the time to adjust data pipelines and enhance data quality may be difficult, but it’s a prerequisite for enabling meaningful higher level analysis. Together, usability, relevance, and completeness make for good data that will serve as the lifeblood of your AI/ML initiatives.