A unified picture of the virus and everything related to it continues to elude us, as different approaches (even belief systems) capture different sets of data, compounding the problem of a lack of integrated data systems. Standardization and sophistication in data capture, followed by an informed use of big data, are options for creating an integrated data system supported by an appropriate network.
If you have been staying in touch with the Covid-19 virus, you will recall that it took some time before even the experts recognized the severity of the problem. Although there are aspects common to any virus, there are new dimensions too, which typically take a while to reveal themselves. The corollary of this is that data capture would have had to undergo serious revisions, because as experts ‘saw’ more, they also realized what data they needed to capture, which could effectively be done only prospectively. This also means that some of the data captured earlier, if at all, could have been fundamentally inadequate.
Hence the dictum ‘Data doesn’t lie’ doesn’t really help, except perhaps in the sense of showing gaps in our understanding. To a great extent, what is captured is a function of what people want captured, and that could well be defined by some hypothesis, provisional or otherwise. One of the challenges (related to data) is the ability to define what data needs to be captured without any informing hypothesis, or at least while suspending its exclusive influence on data gathering. We need to practise disinterested interest.
Infrastructure for data capture
The primary question, to me, is the level of granularity relevant to the problem. Clearly, this begs the question: is there adequate and appropriate infrastructure at data capture points (locations), including trained staff, to capture data in as much detail as is deemed necessary? The tragic truth in India, as perhaps in many other countries too, is that healthcare infrastructure itself is so inadequate that it is naïve to expect sophistication in data capture. And during the pandemic, the healthcare staff must have been under so much strain tending to the afflicted that data capture would not have been one of their priorities. They would have done whatever they needed to comply with registration requirements. No one knows for certain how the data that everybody is now using was captured. In a survey, for example, the quality of information captured is a direct function of the questionnaire, and developing a good questionnaire needs training. Perhaps training in data capture could be made part of emergency response; it can be done if, and it is a big if, there were a well-articulated, structured questionnaire. Standardized templates are a simple solution.
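What a standardized capture template might look like, at its simplest, can be sketched in code. This is a minimal, hypothetical illustration: the field names, allowed values, and the `validate` helper are my assumptions, not an actual health-reporting standard.

```python
# A minimal sketch of a standardized data-capture template with
# simple validation. Field names and allowed values are illustrative
# assumptions, not drawn from any real health-reporting system.
TEMPLATE = {
    "age": int,
    "sex": {"male", "female", "other"},
    "test_result": {"positive", "negative"},
    "district": str,
}

def validate(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, rule in TEMPLATE.items():
        if field not in record:
            problems.append(f"missing: {field}")
        elif isinstance(rule, set):
            if record[field] not in rule:
                problems.append(f"bad value for {field}: {record[field]!r}")
        elif not isinstance(record[field], rule):
            problems.append(f"wrong type for {field}")
    return problems

good = {"age": 45, "sex": "female", "test_result": "positive", "district": "Pune"}
bad = {"age": "45", "sex": "F", "district": "Pune"}
print(validate(good))  # []
print(validate(bad))
```

Even a template this crude means every capture point records the same fields in the same form, which is the precondition for merging data later.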
ET on Sunday (January 9, 2022) carried an interesting article titled ‘Masked numbers’, which focused on the question of data and related problems. Everyone ET spoke to lamented the problems created by disintegrated data systems. This was annoying because it is common knowledge that there is no integrated data system simply because there is no network connecting all those who are part of the healthcare system. No connected network, no integrated data system. This is the key point. (Tata Trusts, for instance, funded a project to build a computer network connecting all cancer hospitals in India.) Even within a large enterprise or institution, especially one that is geographically spread, an integrated data system is not an automatic result; it has to be created. That calls for more than a network. It calls for standardization across all data capture points: common data formats and finely defined granularity, to capture information in as much detail as possible.
Let me cite tennis as an example. Much of ‘analysis’ in tennis is a simple outcome of the way information (about a match) has been captured. A shot is an ace, a service winner, a fault or double fault, a winner (forehand or backhand), an error (forced or unforced), and so on. One shot, but it could be any of these. Tabulating such information is a cakewalk! Extending further, one point can be just one shot or a rally, short or long, measured in number of shots, all of which is captured, enabling us to know who won the short or long rallies. The server, at certain times, could be under pressure because they are trailing (facing a break point), and the system captures this too! We then know who won how many pressure points, how many high-intensity points. The tabulated data itself becomes analysis because the data has been captured by people with a deep understanding of the sport. You will find this trend across all sports, probably the first field to extensively experiment with ‘analytics’.
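The tennis point can be made concrete. Below is a minimal sketch; the field names (`winner_of_point`, `shot_type`, `rally_length`, `break_point`) and the sample points are invented for illustration and are not any real tennis-analytics schema. Once each point is captured with rich categories, ‘analysis’ really is just tabulation.

```python
from collections import Counter

# Hypothetical, simplified point-by-point records for a few points.
# The fields are illustrative assumptions, not a real analytics schema.
points = [
    {"winner_of_point": "A", "shot_type": "ace",             "rally_length": 1,  "break_point": False},
    {"winner_of_point": "B", "shot_type": "forehand winner", "rally_length": 7,  "break_point": True},
    {"winner_of_point": "A", "shot_type": "unforced error",  "rally_length": 12, "break_point": False},
    {"winner_of_point": "B", "shot_type": "double fault",    "rally_length": 1,  "break_point": True},
]

# Because the capture is categorized, tabulation *is* the analysis.
points_won = Counter(p["winner_of_point"] for p in points)
break_points_won = Counter(p["winner_of_point"] for p in points if p["break_point"])
long_rally_wins = Counter(p["winner_of_point"] for p in points if p["rally_length"] >= 9)

print(points_won)        # total points won by each player
print(break_points_won)  # pressure (break) points won
print(long_rally_wins)   # long rallies (9+ shots) won
```

Every ‘insight’ in the printout was decided at capture time, by whoever chose those fields; the code merely counts.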
This is the central point: data capture has been designed ‘backwards’, from the output that we need. Anyone who has designed an online application form should know this. If this understanding doesn’t inform data capture, anything good that results is a matter of chance! Any system can be created; that is no big deal. What is a big deal is the quality of the information the system holds, and that is a function of deep understanding, not technology.
Big data and the virus
A substantial chunk of what we need now would accrue from what is understood as big data, a fair bit of which is unstructured. Rather than get caught in a debate on quantitative versus qualitative data, we need to focus on creating structured data from big data.
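What ‘creating structured data from big data’ might look like at its simplest can be sketched as follows. This is a hedged illustration: the free-text case notes, their format, and the extracted field names are all invented for the example, not taken from any real surveillance system.

```python
import re

# Hypothetical free-text case notes; format and fields are illustrative.
notes = [
    "Patient 41, male, age 63, admitted 2021-05-02, oxygen required",
    "Patient 87, female, age 34, admitted 2021-05-03, home isolation",
]

pattern = re.compile(
    r"Patient (?P<id>\d+), (?P<sex>male|female), age (?P<age>\d+), "
    r"admitted (?P<admitted>\d{4}-\d{2}-\d{2}), (?P<disposition>.+)"
)

# Datafication: turn each unstructured note into a structured record.
records = []
for note in notes:
    m = pattern.match(note)
    if m:  # tolerate messiness: notes that do not parse are skipped
        rec = m.groupdict()
        rec["age"] = int(rec["age"])
        records.append(rec)

print(records)
```

The `if m:` branch is the pragmatic concession to messy data: rather than demanding every note be pristine, we harvest structure from whatever does parse.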
In an article titled ‘The Rise of Big Data: How It’s Changing the Way We Think about the World’ by Kenneth Cukier and Viktor Mayer-Schönberger, part of the book ‘The Best Writing on Mathematics 2014’ edited by Mircea Pitici (easily downloadable), the authors make a fundamental observation: ‘Big data is about more than just communication: the idea is that we can learn from a large body of information things that we could not comprehend when we used only smaller amounts’. What they say next is central to my argument: not just scale, ‘big data is also characterized by the ability to render into data many aspects of the world that have never been quantified before; call it “datafication”’. Now, just recall what I have written about tennis.
Their observation about using scale is of great significance against the background of modern sampling but that will take us away from our concern with the virus; I will discuss this aspect later in an article specifically devoted to it. They mention three changes needed in our approach to data. “The first is to collect and use a lot of data rather than settle for small amounts or samples, as statisticians have done for well over a century. The second is to shed our preference for highly curated and pristine data and instead accept messiness: in an increasing number of situations, a bit of inaccuracy can be tolerated because the benefits of using vastly more data of variable quality outweigh the costs of using smaller amounts of very exact data. Third, in many instances, we need to give up our quest to discover the cause of things, in return for accepting correlations”. As the authors say, “big data is a matter not just of creating somewhat larger samples but of harnessing as much of the existing data as possible about what is being studied. We still need statistics; we just no longer need to rely on small samples”.
The point about correlation is vital, as is the earlier observation about the difference between big data and sampling, but that is another story.
Takeaways
That data systems are disintegrated is common knowledge
Need to address fundamental questions about what and how data is captured
Creating a connected network is primary; an integrated system then arises as a natural outcome
‘Datafication’ out of big data is critical
Use of big data calls for significant changes in our approach to data