Relationships matter, in life and among data. And, just as in life, we need to steer clear of spurious relationships and stay invested in the genuine ones. Both take a lot of effort and thinking. It gets more challenging with data because of the multiple ways in which data is captured and the multiple formats and databases in which it is stored.

‘Profiling’ became a dirty word when security agencies and prisons were found to be targeting people from certain groups and discriminating against them in more ways than one. While we must unambiguously deplore such malpractices, profiling per se is not at fault. In fact, its use is inevitable.

Most useful data is complex, which means it is not ‘present’ in the world, at least not in the form we want. We have to construct it from chosen parameters, which means building relationships across parameters to ‘construct new data’. The operative word is ‘construct’ because it automatically lays emphasis on the purpose and object of the exercise.

This, at its simplest, is what is meant by data profiling, which rests on three aspects: structure discovery (whether data is consistent and correctly formatted), content discovery (ensuring data quality) and relationship discovery (correctly identifying relationships across different datasets).
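For readers who would like to see the idea concretely, here is a minimal sketch in Python (using the pandas library and an invented table of incomes and regions); it merely illustrates what the three aspects can look like in practice, not how any particular tool does it.

```python
import pandas as pd

# An invented dataset purely for illustration: household incomes and a lookup
# table of regions. Any real profiling exercise starts from data like this.
incomes = pd.DataFrame({
    "household_id": [1, 2, 3, 4, 4],           # note the duplicate id
    "region_code": ["N", "S", "E", "W", "W"],
    "annual_income": [42000, 38000, None, 51000, 51000],
})
regions = pd.DataFrame({
    "region_code": ["N", "S", "E"],             # "W" is deliberately missing
    "region_name": ["North", "South", "East"],
})

# 1. Structure discovery: are the columns typed and formatted as expected?
print(incomes.dtypes)

# 2. Content discovery: missing values and duplicates are basic quality checks.
print(incomes["annual_income"].isna().sum(), "missing income values")
print(incomes.duplicated(subset="household_id").sum(), "duplicate household ids")

# 3. Relationship discovery: does every income record match a known region?
merged = incomes.merge(regions, on="region_code", how="left", indicator=True)
print(merged[merged["_merge"] == "left_only"][["household_id", "region_code"]])
```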

My focus is on the third aspect, while recognizing that it is deeply affected by the other two. I am steering clear of technicalities in order to grasp the core dimension, but we cannot escape mentioning visualization, because it can portray relationships among data points better than anything else and is used by professionals across different areas.

Breaking it down

Let us start with a simple piece of data (before we make it complex) – per capita income (PCI), which is gross national income divided by population, yielding the average income per person. Now, let us set against it (not necessarily in a graph) the PCI of the top 1% of the population and the PCI of the rest, contrast the two and analyse the difference. We can further compute the PCI of the top 5% and 10%, and the PCI of the balance in each case. Each such comparison tells us something about the figure we began with, such as the degree of inequality and the depth of concentration of income and wealth.
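For those who prefer to see the arithmetic spelt out, here is a short Python sketch with invented incomes (not real national accounts figures) that computes the PCI of the top 1%, 5% and 10% against the PCI of the rest:

```python
import numpy as np

# Invented individual incomes, purely to illustrate the calculation.
rng = np.random.default_rng(seed=0)
incomes = rng.lognormal(mean=10, sigma=1, size=100_000)

overall_pci = incomes.mean()  # per capita income of everyone

def split_pci(incomes, top_share):
    """PCI of the top `top_share` fraction of earners versus the rest."""
    cutoff = np.quantile(incomes, 1 - top_share)
    top = incomes[incomes >= cutoff]
    rest = incomes[incomes < cutoff]
    return top.mean(), rest.mean()

for share in (0.01, 0.05, 0.10):
    top_pci, rest_pci = split_pci(incomes, share)
    print(f"top {share:.0%}: {top_pci:,.0f}  rest: {rest_pci:,.0f}  "
          f"ratio: {top_pci / rest_pci:.1f}x  overall: {overall_pci:,.0f}")
```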

Alternatively, I can take the undifferentiated PCI and set it against medical expenditure, especially out-of-pocket expenses. Or I can match it against the cost of education, or the cost of housing, or a consumption basket. And so on. Or I can take the differentiated ‘real PCI’ and see how it matches against the various costs necessary for dignified living.
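In code, this kind of matching is again only a ratio or a join. The sketch below uses entirely made-up figures to ask how many times over the PCI covers each cost:

```python
# Entirely made-up figures, for illustration only: how many times over does
# per capita income cover various costs necessary for dignified living?
pci = 180_000  # hypothetical per capita income

annual_costs = {
    "out-of-pocket health expenses": 22_000,
    "education": 35_000,
    "housing": 60_000,
    "consumption basket": 95_000,
}

for item, cost in annual_costs.items():
    print(f"{item}: PCI covers it {pci / cost:.1f} times over")
```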

In every example, we are essentially portraying or finding relationships and variations among seemingly discrete data sets. This is what is normally meant by complex data: something we construct out of a given set of data. We can also notice that sorting the given set of data is a necessary step in creating a complex data set.

(The most rigorous definition of complexity involves not just the many constituent elements but also the interrelationships among them, which, many agree, cannot be fully modelled mathematically. Anyone interested in the subject of complexity may look at the Santa Fe Institute, which is fully devoted to it.)

Staying clear of technicalities – the appropriate software to capture complex data, the various techniques of data visualization – I am going to explore how to make sense of complex data. One commonly used approach is to consider three or four Vs: Volume, Variety, Velocity and Veracity. The fourth V is a relatively recent addition to the trinity, a response to the growing threat of fake/false data, which is part of any data set. It is important to grasp this point: false or fake data is not some separate data residing somewhere by itself. False data injection is a reality, present everywhere, which forces us to recognise the fourth V and include it in understanding complex data. Otherwise, whatever ‘picture’ we portray will be woefully off the mark and lead to potentially disastrous results, especially in security and healthcare. So much of social media is infected with fake/false data that algorithmic ‘studies’ of such data are an enormous waste of time and effort.
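A crude illustration of why the fourth V matters: even a simple plausibility check (the records and thresholds below are invented) shows that suspect values sit inside the same table as genuine ones and have to be flagged there, not hunted for somewhere else.

```python
import pandas as pd

# Invented records; the negative age and the absurd income stand in for injected data.
records = pd.DataFrame({
    "age": [34, 29, -5, 41, 230],
    "reported_income": [40_000, 55_000, 48_000, 9_999_999_999, 37_000],
})

# Veracity check: flag values outside plausible ranges (thresholds are illustrative).
suspect = ~records["age"].between(0, 120) | (records["reported_income"] > 10_000_000)
print(records[suspect])
```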

Why data technology is important

Anyone who has used a search engine will recognise words like tags, which usually appear at the bottom of a piece of writing on the internet and help in the search. Every piece of data is placed in some basket based on some parameters and organised along certain lines. Normally, we refer to this as taxonomy. Every library uses it. In fact, any storing of information arranged with the object of facilitating retrieval has to follow some principle of classifying data.
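At its simplest, the principle is a mapping from tags to items. The toy example below, with invented articles and tags, is only a sketch, but every taxonomy-based retrieval system is an elaboration of it.

```python
from collections import defaultdict

# Invented articles and tags, to show the principle of classification for retrieval.
articles = {
    "Ulcers and H. pylori": ["health", "research"],
    "Measuring inequality": ["economics", "statistics"],
    "Data profiling basics": ["data", "statistics"],
}

# Build an index: tag -> list of articles carrying that tag.
index = defaultdict(list)
for title, tags in articles.items():
    for tag in tags:
        index[tag].append(title)

print(index["statistics"])  # retrieval by tag
```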

The reason technology is important is that, as Keith Foote observes, “Complex data models have now become the norm. A single stream of the data can travel through many hubs, and many different technologies”. And each schema has its own technology-specific terminology and syntax, as well as data types, as Foote notes. Modelling data thus becomes difficult.

One of the principal reasons for this increasing complexity is the multiplicity of data systems within a ‘single information system’, each following different foundational principles. If we do not pay attention to these specificities, the data ‘generated’ or ‘produced’ by the system will be of doubtful credibility. Integrating such diverse systems is no easy task, but it is essential to produce data that will stand up to scrutiny.

Textual data and profiling  

Textual data is a large component of building relationships to create relevant and correct profiles. It is ordinarily unstructured, which makes it more difficult to store and retrieve. A domain-specific approach is the ideal way to manage such data, since familiarity with the domain improves success in retrieving it and building appropriate relationships.
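A domain-specific approach can be as modest as a vocabulary of terms that matter in the field. The sketch below, with an invented vocabulary and an invented clinical note, tags free text so that it can later be retrieved and related to other records; real systems are far more sophisticated, but the principle is the same.

```python
# Invented domain vocabulary and note, for illustration: a tiny keyword tagger
# that turns unstructured text into retrievable tags.
domain_terms = {
    "ulcer": "gastroenterology",
    "h. pylori": "infection",
    "antibiotic": "treatment",
}

note = "Patient presented with a peptic ulcer; H. pylori confirmed, antibiotic course prescribed."

tags = {label for term, label in domain_terms.items() if term in note.lower()}
print(tags)  # {'gastroenterology', 'infection', 'treatment'} (order may vary)
```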

This is especially serious in healthcare, where the way data about a disease is captured will determine its entire life cycle and help uncover the correct causes, leading us to the correct group of people likely to be exposed. Incorrect profiling can lead to wrong data being captured and analysed. The ulcer is a classic example. For decades, ulcers were mistakenly believed to be caused by stress and spicy food, and treated accordingly, until two Australian scientists – J. Robin Warren and Barry J. Marshall – published a paper in 1982 demonstrating that they were caused by the bacterium H. pylori. The two were awarded the Nobel Prize in Physiology or Medicine in 2005. An essay on this points out that “In 1997, the Centers for Disease Control and Prevention, with other government agencies, academic institutions, and industry, launched a national education campaign to inform health care providers and consumers about the link between H. pylori and ulcers. This campaign reinforced the news that ulcers are a curable infection, and that health can be greatly improved and money saved by disseminating information about H. pylori” (https://www.tulipgroup.com/Crux_magazine/2009/Crux_34.pdf).

Let me end by coming back to the beginning: relationships are (almost) everything. Getting them right will lead us to our goals. Getting them wrong leads only to pain, harm and destruction.