
The Significance of "Clean" Data And Bias.




Algorithms for machine learning are only as good as the data used to train them.

Making sure the data you are using is as clean as possible is a crucial aspect of governance. 

We typically refer to data as being "clean" when it is both of high quality and bias-free. 


Data Quality.


Quality refers to a set of measures that can be used to judge whether data is fit for the purpose you intend it for. 

To make sure you meet all the requirements for data quality, examine each of the measures below, since they are all equally significant. 

The first is consistency. This means that all of the data in a dataset was collected and recorded in the same manner. 

For instance, if records have several fields, every record should have every field filled in. 

Fields should be used consistently across all records so that the data can always be used together: if we "know" something about one item in the dataset, we should "know" the same thing about every other item. 
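The consistency rule above (every record has every field filled in) can be sketched as a simple check. This is a minimal illustration, not a full audit tool: the function name, the field names, and the sample records are all hypothetical.

```python
def find_incomplete_records(records, required_fields):
    """Return (index, missing_fields) for each record with absent or empty fields."""
    problems = []
    for index, record in enumerate(records):
        # A field counts as missing if the key is absent, None, or an empty string.
        missing = [
            field for field in required_fields
            if record.get(field) in (None, "")
        ]
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical customer records: one has an empty email, one has no country.
records = [
    {"name": "Ada", "email": "ada@example.com", "country": "UK"},
    {"name": "Grace", "email": "", "country": "US"},
    {"name": "Alan", "email": "alan@example.com"},
]
required = ["name", "email", "country"]

incomplete = find_incomplete_records(records, required)
print(incomplete)  # [(1, ['email']), (2, ['country'])]
```

Reporting the index and the missing fields, rather than silently dropping records, leaves the decision about how to repair them to whoever owns the data.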

Next is accuracy. This simply means that the data is free of errors. 

For observations and measurements to be considered accurate, the tools or sensors used to gather or input them must be audited and shown to be operating properly. 

Because mistakes can still happen when data is entered by humans, we must make sure that the stakeholders responsible for data entry are trained in, and aware of, all the governance standards. 


Another crucial measure is uniqueness, which simply refers to the absence of duplicate entries. 

If the same piece of information is stored more than once in different entries, your database is likely to become inaccurate as the data is processed. 
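A uniqueness check can be as simple as counting how often each key appears. The sketch below assumes records are dictionaries and that a hypothetical `sku` field identifies an item; keys are normalized (trimmed and lowercased) so that near-identical entries are still caught.

```python
from collections import Counter


def find_duplicates(records, key_fields):
    """Return a dict of normalized key -> count for keys that appear more than once."""
    keys = [
        tuple(str(record[field]).strip().lower() for field in key_fields)
        for record in records
    ]
    counts = Counter(keys)
    return {key: count for key, count in counts.items() if count > 1}


# Hypothetical product records: the first two differ only in case and whitespace.
records = [
    {"sku": "A-100", "name": "Kettle"},
    {"sku": "a-100 ", "name": "Kettle"},
    {"sku": "B-200", "name": "Toaster"},
]

duplicates = find_duplicates(records, ["sku"])
print(duplicates)  # {('a-100',): 2}
```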

Validity measures whether each record or piece of information in a database is in a form appropriate for its intended use. 

For instance, are dates stored in a consistent format, and are all numbers stored consistently, as integers or rounded to a fixed number of decimal places? 

Timeliness measures how likely your data is to still be relevant, given when it was collected. 

Some processes, like the movement of glaciers, may be monitored and understood with just sporadic measurements. 

Others, such as the positions of protons and electrons in a subatomic structure, need measurements down to the millionth of a second. 


For processes that call for real-time datasets, measurements must be collected and recorded with a delay that is as close to zero as possible. 
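One way to monitor timeliness is to flag readings that are older than an agreed maximum age. The sketch below is illustrative only: the 30-second threshold, the sensor names, and the record layout are assumptions, and the right threshold depends entirely on the process being measured.

```python
from datetime import datetime, timedelta, timezone


def stale_readings(readings, max_age, now):
    """Return the readings recorded longer ago than max_age, relative to now."""
    return [r for r in readings if now - r["recorded_at"] > max_age]


# A fixed "now" keeps the example reproducible; a live system would use
# datetime.now(timezone.utc) instead.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# Hypothetical sensor readings.
readings = [
    {"sensor": "s1", "recorded_at": now - timedelta(seconds=2)},
    {"sensor": "s2", "recorded_at": now - timedelta(minutes=10)},
]

stale = stale_readings(readings, max_age=timedelta(seconds=30), now=now)
print([r["sensor"] for r in stale])  # ['s2']
```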

Last but not least, completeness measures how much of the entire amount of data that is available on a topic is represented in your dataset. 

If you're using a database of your items and prices to determine which are the most popular, make sure every item in your inventory is reflected in the database. 

For other uses, such as tracking animal migration routes, it isn't possible to capture all the information, so a sample is selected for tracking and analysis. 

However, the more complete your dataset is, the more closely your insights will reflect reality. 
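For the inventory example, completeness can be expressed as the share of expected items that actually appear in the database. This is a minimal sketch with hypothetical SKU values; it assumes you have an authoritative list of what should be present, which is exactly what is unavailable in cases like animal migration.

```python
def coverage(dataset_ids, expected_ids):
    """Return (share of expected ids present in the dataset, set of missing ids)."""
    expected = set(expected_ids)
    if not expected:
        # Nothing was expected, so nothing can be missing.
        return 1.0, set()
    present = expected & set(dataset_ids)
    return len(present) / len(expected), expected - present


# Hypothetical SKUs: what is physically in stock vs. what the database lists.
inventory_skus = ["A-100", "B-200", "C-300", "D-400"]
database_skus = ["A-100", "B-200", "D-400"]

ratio, missing = coverage(database_skus, inventory_skus)
print(ratio, missing)  # 0.75 {'C-300'}
```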

Data governance requires auditing your data using metrics that monitor these parameters in order to confirm that you are dealing with high-quality data. 
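As one example of such an audit metric, a validity rate reports the share of records whose fields follow the agreed format. The sketch below assumes, purely for illustration, that dates must be ISO `YYYY-MM-DD` strings and prices must be strings with exactly two decimal places; the field names and rules are hypothetical.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
TWO_DECIMAL_PRICE = re.compile(r"\d+\.\d{2}")


def is_valid_date(value):
    """True if value is a YYYY-MM-DD string naming a real calendar date."""
    if not isinstance(value, str) or not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False  # Right shape, but not a real date (e.g. 30 February).
    return True


def is_valid_price(value):
    """True if value is a string such as '19.90' with exactly two decimals."""
    return isinstance(value, str) and TWO_DECIMAL_PRICE.fullmatch(value) is not None


def validity_rate(records):
    """Return the share of records whose date and price are both valid."""
    if not records:
        return 1.0
    valid = sum(
        1 for r in records
        if is_valid_date(r.get("date")) and is_valid_price(r.get("price"))
    )
    return valid / len(records)


# Hypothetical records: only the first one passes both checks.
records = [
    {"date": "2024-01-15", "price": "19.90"},
    {"date": "15/01/2024", "price": "19.90"},
    {"date": "2024-02-30", "price": "5.00"},
    {"date": "2024-03-01", "price": "7.5"},
]

rate = validity_rate(records)
print(rate)  # 0.25
```

Tracking a rate like this over time, alongside the other measures, makes it visible when data quality starts to slip.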



Biased Data.



Bias is the second component of "clean data." Biased data does not accurately represent the subject it describes. 

Typically, this is caused by factors inherent in the method of data collection. 

For instance, your data would be intrinsically skewed if you used feedback forms to gauge customer satisfaction but only distributed them to customers who had already given favorable reviews. 
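The effect of this kind of collection bias is easy to see with a toy calculation. The customers and scores below are invented for illustration; the point is only that surveying a pre-filtered group shifts the result.

```python
from statistics import mean

# Hypothetical customers with the satisfaction score each would report (0-10).
customers = [
    {"id": 1, "prior_review": "favorable", "satisfaction": 9},
    {"id": 2, "prior_review": "favorable", "satisfaction": 8},
    {"id": 3, "prior_review": "unfavorable", "satisfaction": 3},
    {"id": 4, "prior_review": "none", "satisfaction": 6},
    {"id": 5, "prior_review": "unfavorable", "satisfaction": 4},
]

# Survey sent to everyone.
all_scores = [c["satisfaction"] for c in customers]

# Survey sent only to customers who had already left a favorable review.
biased_scores = [
    c["satisfaction"] for c in customers if c["prior_review"] == "favorable"
]

true_average = mean(all_scores)
biased_average = mean(biased_scores)
print(true_average, biased_average)  # 6 vs 8.5
```

Every individual score here is accurate, yet the second average overstates satisfaction because of who was asked.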

Because the datasets used in AI and machine learning projects are so large and intricate, there is always a risk that bias will creep in. 

Skewed data means your insights won't be based on objective truth, which is a significant hurdle for many data projects. 

In fact, experts in the field believe that eliminating bias (or at least reducing the harm it may cause) is one of the main problems society will have to solve if the promise of AI is to be fulfilled. 


Biased data may be a consequence of poor data quality, but bias can also appear when your data is of high quality overall. 

This is because bias can exist even when the data is accurate, unique, valid, and timely. 

In that case, it indicates that you aren't casting your net wide enough to collect data from a range of sources or perspectives. 

As a consequence, the simulations and models you create won't accurately reflect reality. 
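One rough way to check whether the net has been cast wide enough is to compare how groups are represented in the dataset with how they are represented in the population you want to model. The sketch below assumes you have trustworthy reference shares, which is itself a strong assumption; the group names, shares, and 5% tolerance are hypothetical.

```python
from collections import Counter


def representation_gaps(samples, reference_shares, group_key, tolerance=0.05):
    """Return {group: (observed_share, expected_share)} where the two differ
    by more than tolerance."""
    counts = Counter(sample[group_key] for sample in samples)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = (observed, expected)
    return gaps


# Hypothetical dataset: 80% of samples come from one region, although the
# reference population is split evenly.
samples = [{"region": "north"}] * 8 + [{"region": "south"}] * 2
reference = {"north": 0.5, "south": 0.5}

gaps = representation_gaps(samples, reference, "region")
print(gaps)  # {'north': (0.8, 0.5), 'south': (0.2, 0.5)}
```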


Data bias has some extremely dangerous consequences. 

When face recognition technologies used by police forces in the US to locate criminals in crowds were audited, young, Black, female citizens were found to be considerably more likely to be incorrectly identified than people in any other demographic group. 

The algorithm's accuracy was found to be 34% lower for this group than for other groups. 

Left unchecked, this could well result in more wrongful arrests, stops, or searches of people in this demographic. 
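An audit of this kind boils down to measuring accuracy separately for each demographic group rather than only in aggregate. The following is a minimal sketch with invented group labels and outcomes; it is not the method used in the audits mentioned above.

```python
from collections import defaultdict


def accuracy_by_group(results):
    """results: iterable of (group, predicted, actual).
    Return {group: share of predictions that matched the actual label}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in results:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical evaluation results for two groups.
results = [
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_b", "match", "no_match"),
    ("group_b", "match", "match"),
    ("group_b", "no_match", "match"),
    ("group_b", "no_match", "no_match"),
]

per_group = accuracy_by_group(results)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)  # {'group_a': 1.0, 'group_b': 0.5} 0.5
```

Overall accuracy in this toy example is 75%, which hides the fact that one group is served far worse than the other.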

Recruitment is another instance where data bias might really be problematic. 

According to Noel Sharkey, the datasets used by the recruiting algorithms he has investigated are so rife with bias that they simply shouldn't be used unless they can be controlled and reviewed with the same level of rigor as is required for data used in pharmaceutical trials. 


In 2018, Amazon stopped using a machine learning algorithm it had built to evaluate job applications after realizing that it was essentially sexist. 

The algorithm was found to discriminate against women, passing them over for opportunities for no reason other than that its training dataset contained too little data on female applicants for these roles: far fewer women than men had applied to work for the company over the previous 10 years. 

Matters are complicated further by the fact that it may sometimes be appropriate to deliberately add bias to a system in order to compensate for societal factors that tend towards injustice or intolerance. 


Microsoft and IBM both introduced AI-powered chatbots during the previous decade, but they eventually had to be modified (or, in Microsoft's case, taken offline) to stop them from behaving in an offensive and bigoted way. 

This happened because they were developing their communication skills from social media conversations, which are, of course, sometimes racist or hostile. 

Fixing this meant telling the bot not to learn from racist or abusive material, which required injecting a deliberate bias into the system. 

Naturally, this means the bots are trained on less realistic data. 

There weren't many alternatives, though: it's clearly inappropriate for a computer speaking for a business like IBM to use racial slurs and quote Hitler. 
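The deliberate filtering described above can be sketched as a simple blocklist applied to the training messages. This is only an illustration of the idea, not how Microsoft or IBM implemented it; the function name, the placeholder blocked terms, and the sample messages are hypothetical, and real moderation systems are considerably more sophisticated.

```python
def filter_training_messages(messages, blocked_terms):
    """Split messages into (kept, dropped), dropping any message that
    contains a blocked term as a whole word."""
    blocked = {term.lower() for term in blocked_terms}
    kept, dropped = [], []
    for message in messages:
        # Compare whole words, ignoring case and surrounding punctuation.
        words = {word.strip(".,!?").lower() for word in message.split()}
        if words & blocked:
            dropped.append(message)
        else:
            kept.append(message)
    return kept, dropped


# Placeholder terms stand in for a real list of abusive words.
blocked_terms = ["badword1", "badword2"]
messages = [
    "Great match last night!",
    "You are a badword1.",
    "Anyone tried the new cafe?",
]

kept, dropped = filter_training_messages(messages, blocked_terms)
print(len(kept), len(dropped))  # 2 1
```

Keeping the dropped messages, rather than discarding them silently, makes it possible to review how much of the real conversation the filter removes.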


Another crucial step in the governance process is to weigh the harm that might result from using biased data against the harm of omitting it. 


~ Jai Krishna Ponnappan




References And Further Reading


  1. Roff, HM and Moyes, R (2016) Meaningful Human Control, Artificial Intelligence and Autonomous Weapons, Briefing paper prepared for the Informal Meeting of Experts on Lethal Autonomous Weapons Systems, UN Convention on Certain Conventional Weapons, April, article36.org/wp-content/uploads/2016/04/MHC-AI-and-AWS-FINAL.pdf (archived at https://perma.cc/LE7C-TCDV)
  2. Wakefield, J (2018) The man who was fired by a machine, BBC, 21 June, www.bbc.co.uk/news/technology-44561838 (archived at https://perma.cc/KWD2-XPGR)
  3. Kande, M and Sönmez, M (2020) Don’t fear AI. It will lead to long-term job growth, WEF, 26 October, www.weforum.org/agenda/2020/10/dont-fear-ai-it-will-lead-to-long-term-job-growth/ (archived at https://perma.cc/LY4N-NCKM)
  4. The Royal Society (2019) Explainable AI: the basics, November, royalsociety.org/-/media/policy/projects/explainable-ai/AI-and-interpretability-policy-briefing.pdf (archived at https://perma.cc/XXZ9-M27U)
  5. Hao, K (2019) Training a single AI model can emit as much carbon as five cars in their lifetimes, MIT Technology Review, 6 June, www.technologyreview.com/2019/06/06/239031/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/ (archived at https://perma.cc/AYN9-C8X9)
  6. Najibi, A (2020) Racial discrimination in face recognition technology, SITN Harvard University, 24 October, sitn.hms.harvard.edu/flash/2020/racial-discrimination-in-face-recognition-technology/ (archived at https://perma.cc/F8TC-RPHW)
  7. McDonald, H (2019) AI expert calls for end to UK use of ‘racially biased’ algorithms, Guardian, 12 December, www.theguardian.com/technology/2019/dec/12/ai-end-uk-use-racially-biased-algorithms-noel-sharkey (archived at https://perma.cc/WX8L-YEK8)
  8. Dastin, J (2018) Amazon scraps secret AI recruiting tool that showed bias against women, Reuters, 11 October, www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G (archived at https://perma.cc/WYS6-R7CC)
  9. Johnson, J (2021) Cyber crime: number of breaches and records exposed 2005–2020, Statista, 3 March, www.statista.com/statistics/273550/data-breaches-recorded-in-the-united-states-by-number-of-breaches-and-records-exposed/ (archived at https://perma.cc/BQ95-2YW2)
  10. Palmer, D (2021) These new vulnerabilities put millions of IoT devices at risk, so patch now, ZDNet, 13 April, www.zdnet.com/article/these-new-vulnerabilities-millions-of-iot-devives-at-risk-so-patch-now/ (archived at https://perma.cc/RM6B-TSL3)