‘Data is the new oil’ – there is now a continuous stream of articles on how data is the new resource, but the phrase should really be ‘correct, relevant and complete data is the new oil’. While that lacks the catchiness of the original, it draws attention to what is at stake because, as a corollary, false or incorrect data can be a drain on resources.
If anyone ever doubted the value of data, here is the proof. A NITI Aayog report, quoted in The Times of India (Page 14, June 25, 2025), shows savings of Rs 40,000 cr from three clean-up actions – Rs 9,000 cr from deleting 1.7 crore ineligible PM-Kisan names, Rs 21,000 cr from weeding out 3.5 crore LPG connections (over two years) and Rs 10,000 cr from dropping 1.6 crore fake ration cards. According to an article, “IBM estimates that poor data quality costs the US economy around $3.1 trillion annually”.
Understand, understand, understand
Understanding the behaviour of data is central to unearthing false data because it tells us when something is amiss, out of sync or misleading. It is not necessary to use complex econometric models; good old common sense ought to do the job, at least as a warning signal. An increase in sales will be reflected either in an increase in debtors or in the bank balance. And an increase in debtors should be realized in cash within some defined period of time. Every odd behaviour is a red flag.
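To make this concrete, here is a minimal sketch in Python of the kind of common-sense cross-check described above; the field names, figures and the 90-day collection period are illustrative assumptions, not drawn from any report cited here.

    # Illustrative red-flag checks: sales growth should show up in debtors or
    # in the bank balance, and debtors should be realized within a defined period.
    def red_flags(prev, curr, max_collection_days=90):
        flags = []
        sales_up = curr["sales"] - prev["sales"]
        debtors_up = curr["debtors"] - prev["debtors"]
        cash_up = curr["bank"] - prev["bank"]
        if sales_up > 0 and debtors_up <= 0 and cash_up <= 0:
            flags.append("sales rose but neither debtors nor bank balance did")
        # Rough days-sales-outstanding check against the assumed collection period.
        dso = curr["debtors"] / curr["sales"] * 365 if curr["sales"] else float("inf")
        if dso > max_collection_days:
            flags.append(f"debtors outstanding ~{dso:.0f} days, beyond {max_collection_days}")
        return flags

    # Example: sales jump while both debtors and the bank balance fall -- a red flag.
    print(red_flags({"sales": 100, "debtors": 20, "bank": 30},
                    {"sales": 140, "debtors": 18, "bank": 28}))

Such checks prove nothing by themselves; they merely point a reviewer towards the entries worth examining.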
An article titled ‘The challenge of data accuracy’ by Dennis O’Reilly says that “Data accuracy is measured by determining how closely information represents the objects and events it describes. A common example is the accuracy of GPS systems in directing people to their destination: Do they lead you directly to the doorstep, or point you to a spot a block or two away?” (https://www.dataversity.net/the-challenge-of-data-accuracy/).
Confirming the accuracy or otherwise of data is a function of patterns and relationships, apart from subject matter and environment; anything that seems amiss is a good indicator of poor data quality. Anything that does not fit certain established parameters is at least a warning that the data could be compromised.
False data injection
False data per se is not a problem, but false data in an active system (not merely in archives) is, because such a system is the basis for financial transactions. Interestingly, in certain cases, lack of interoperability is a blessing. Consider the financial services industry, including markets, which is home to a wide range of software acquired over the years – organically and inorganically – that remains in silos and in different formats. Something as simple as formatting can lead to inaccurate data, and it is often ignored because it just does not strike people that formatting can actually change data. Anyone who has worked on a reasonably sized IT/software project ought to know. Certainly, anyone who has worked on data migration will.
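As a hedged illustration (the record below and its values are invented), here is how innocent-looking formatting steps during a migration can silently change data: leading zeros in identifiers disappear, day/month order in dates flips, and locale-specific number formats get mangled.

    from datetime import datetime

    # An invented source record, as it might arrive from a legacy system.
    record = {"account_no": "007412", "booked_on": "04/07/2025", "amount": "1,23,456.78"}

    # 1. Treating an identifier as a number drops its leading zeros.
    as_number = int(record["account_no"])            # 7412 -- no longer matches the source system

    # 2. Parsing the same date string under two conventions gives two different dates.
    us_style = datetime.strptime(record["booked_on"], "%m/%d/%Y")   # 7 April 2025
    in_style = datetime.strptime(record["booked_on"], "%d/%m/%Y")   # 4 July 2025

    # 3. Stripping separators works for this Indian-style amount but would
    #    garble a European-style "1.234,56".
    naive_amount = float(record["amount"].replace(",", ""))

    print(as_number, us_style.date(), in_style.date(), naive_amount)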
One obvious question is the source of false data. Either some old, outdated data has not been weeded out, or false data has been injected into the system, whatever the system. In most banking frauds, creating multiple fictitious accounts to siphon off money is a standard technique. The technical term is ‘false data injection’, which can threaten the integrity and therefore the performance of any system.
Economic data
Economic data, especially macroeconomic data, is an attractive target because it can fetch a bonanza until caught. Banking frauds run into a minimum of hundreds of crores. Markets of all kinds are a target because false data is not easily detected. A study of smart power grids in Transactive Energy Markets has demonstrated how adversaries can actively inject false data during energy trading, either by maliciously acting as a market participant or by exploiting fragile security mechanisms (https://ieeexplore.ieee.org/document/9715633). The authors state in the Abstract: “The result shows that adversaries gain significant financial benefits while victims face economic losses. The result also indicates that attackers could severely impact market operations just by injecting a small false demand”.
Keith Hall, Senior Research Fellow, Mercatus Center, George Mason University, in his Testimony to Congress on ‘The challenge of producing economic data for the 21st century’ in late 2024, made some perceptive observations: “Like physical infrastructure, statistical systems become obsolete over time. The economy is consistently changing, new industries emerge while old industries restructure and sometimes decline, business practices change, and households change how they make economic decisions. Keeping up the coverage and quality of economic data has been, and is likely to continue to be, constrained by tight budgets and the complexity of data collection and analysis. It has always been a problem that data users often need new information quickly while it takes agencies a long time to design and produce new, high-quality statistics”.
The ingenuity of the corrupt knows no bounds, but improving data quality and accuracy can act as a check. And it is not a one-time but a continuous exercise to keep looking for red flags. Apart from the corrupt, the inefficient also breed poor data, with dubious practices. Digitized data is not necessarily or always accurate; often, it is easier to hide real data and push false data upfront, leading to flawed decisions. Hiding positions in derivatives trading at Barings and in copper trading at Sumitomo are classic examples of hiding within a sophisticated system.
In almost every country, macroeconomic data is a matter of aggregation. In a country as vast and uneven (in every respect) as India, with a Union Government, 28 States and 8 Union Territories, each of which has its own data collection mechanisms, the Union Government’s aggregation will simply carry forward all the inaccuracies of the data collected by State governments and UTs. Thus, it is vital to establish data accuracy before commencing the process of aggregation, with as many layers of scrutiny as necessary. There is an excellent article on the challenges faced in data collection in India.
What’s the way out?
As Einstein observed, intuition is the outcome of prior intellectual experience. Experience of working with data, ably aided by sound common sense, does help, but there are obvious limits, especially when the dataset is large. Fortunately, there are techniques available to detect false data and fraud.
One well-known technique is Benford’s Law, named after the physicist Frank Benford, who popularised it in 1938, though it was originally discovered by the astronomer Simon Newcomb in 1881 (no, there is no mistake, it is 1881). The law describes the distribution of leading digits in numbers across many large datasets, forming the basis for detecting anomalies. As Jim Frost explains, “Leading digits with smaller values occur more frequently than larger values. This law states that approximately 30% of numbers start with a 1 while less than 5% start with a 9. According to this law, leading 1s appear 6.5 times as often as leading 9s! Benford’s law is also known as the First Digit Law”.
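For readers who want to try this on their own datasets, here is a minimal sketch in Python of a first-digit screen against Benford’s expected distribution; the chi-square threshold used (the 5% critical value for 8 degrees of freedom) is a conventional choice of mine, and exceeding it is only a warning, not proof of manipulation.

    import math
    from collections import Counter

    def benford_expected(d):
        """Benford's expected proportion of numbers with leading digit d (1-9)."""
        return math.log10(1 + 1 / d)

    def leading_digit(x):
        """First significant digit of a number, or 0 if none can be found."""
        s = str(abs(x)).lstrip("0.")
        return int(s[0]) if s and s[0].isdigit() else 0

    def benford_screen(values, chi2_threshold=15.51):
        """Compare observed leading digits with Benford's Law; flag large deviations."""
        digits = [d for d in (leading_digit(v) for v in values if v) if d > 0]
        if not digits:
            return False
        counts, n, chi2 = Counter(digits), len(digits), 0.0
        for d in range(1, 10):
            observed, expected = counts.get(d, 0), len(digits) * benford_expected(d)
            chi2 += (observed - expected) ** 2 / expected
            print(f"digit {d}: observed {observed / n:.3f}, expected {benford_expected(d):.3f}")
        print(f"chi-square = {chi2:.2f} (red flag if > {chi2_threshold})")
        return chi2 > chi2_threshold

    # Usage: pass in a list of transaction amounts, invoice values, population
    # counts, etc., e.g. benford_screen(amounts)

Note that the law applies only to data spanning several orders of magnitude and not, say, to constrained numbers such as phone numbers or assigned identifiers, which is why the legitimate areas of application mentioned below matter.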
An excellent article on this law in Scientific American shows how it has practical implications and consequences.
Many studies have confirmed Benford’s Law, which has helped put people behind bars because of its ability to detect fraud. “Financial adviser Wesley Rhodes was convicted of defrauding investors when prosecutors argued in court that his documents did not accord with the expected distribution of leading digits and therefore were probably fabricated. The principle later helped computer scientist Jennifer Golbeck uncover a Russian bot network on Twitter. She observed that for most users, the number of followers that their followers have adheres to Benford’s law, but artificial accounts significantly veer from the pattern. Examples of Benford’s law being applied to fraud detection abound, from Greece manipulating macroeconomic data in its application to join the eurozone to vote rigging in Iran’s 2009 presidential election. The message is clear: organic processes generate numbers that favor small leading digits, whereas naive methods of falsifying data do not”.
There is a significant volume of literature on Benford’s Law, which should be of use to anyone working on large datasets, subject to legitimate areas of application. Hopefully, NITI Aayog is aware of the law.