Work I Completed

This week in data science I started the task on data bias. The task involved reading a PowerPoint to understand a few key types of data bias, and then creating a case study for each of them. The bias types were response bias, selection bias, presentation bias, omitted variable bias and social bias. I was able to study the PowerPoint and build an understanding of all the bias types, but I only completed case studies for two of the five. In this blog I will discuss the content I learned from the PowerPoint.

Response Bias

Note: the PowerPoint discussed response bias as a specific data bias type, when it is more of an umbrella term. The bias type it was referring to sounds like a combination of non-response bias and selection bias.

Response bias, as the PowerPoint describes it, is a skew in survey responses or similar online social data that comes from who is producing the data.

  • 7% of Facebook users produce 50% of posts.
  • 4% of users produce 50% of Amazon reviews.
  • 0.04% of Wikipedia editors produced the first version of 50% of Wikipedia articles.

These users represent only a small demographic, yet they heavily influence the data, which creates bias.
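To make this kind of concentration concrete, here is a minimal Python sketch that works out the smallest share of users needed to account for half of all posts. The function name `smallest_user_share` and the post counts are made up for illustration; they are not taken from the PowerPoint or from real Facebook, Amazon or Wikipedia data.

```python
def smallest_user_share(post_counts, target=0.5):
    """Return the smallest fraction of users whose posts cover `target` of all posts."""
    total = sum(post_counts)
    covered = 0
    # Count the most active users first.
    ranked = sorted(post_counts, reverse=True)
    for n_users, count in enumerate(ranked, start=1):
        covered += count
        if covered >= target * total:
            return n_users / len(post_counts)
    return 1.0


# Hypothetical data: 2 very active users and 18 occasional ones.
post_counts = [50, 40] + [5] * 18
share = smallest_user_share(post_counts)
print(f"{share:.0%} of users produce at least half of the posts")
```

Any analysis that treats each post as one independent opinion would mostly be hearing from that small group of very active users.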

Selection Bias

Selection bias occurs when the selected survey group does not represent the population as a whole, and proper randomization is not achieved.

YouTube's video recommendation algorithm suffers from this.

  • Video suggestions are inferred from views, clicks, and scrolls.
  • The first video suggestions are based on non-random factors, and hence a proper random selection of videos (the survey group) is not achieved.
  • The algorithm creates new video suggestions based on data from the previous ones, and hence all subsequent video suggestions are biased.
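The feedback loop in these points can be sketched with a toy simulation. This is not how YouTube's real algorithm works; it is a simplified model that assumes every video is equally appealing and that the recommender always shows the videos with the most clicks so far. The function `simulate_recommendations` and all of its parameters are hypothetical.

```python
import random


def simulate_recommendations(n_videos=10, n_shown=3, n_rounds=1000, seed=0):
    """Toy recommender: always show the most-clicked videos so far."""
    rng = random.Random(seed)
    clicks = [0] * n_videos
    for _ in range(n_rounds):
        # Rank videos by clicks; ties are broken by position in the catalogue,
        # which is the non-random starting factor in this model.
        shown = sorted(range(n_videos), key=lambda v: -clicks[v])[:n_shown]
        # Every video is equally appealing, so the viewer picks one
        # of the shown videos uniformly at random.
        clicks[rng.choice(shown)] += 1
    return clicks


clicks = simulate_recommendations()
print(clicks)
```

In this model the videos that were not in the first set of suggestions never receive a click, even though they are just as appealing, because the click data only ever describes what was already shown.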

Presentation Bias

Presentation bias is a bias caused by how items are presented during data collection. Fonts, bolding, size and other factors can create it, because people see items that are presented differently as more important.

Additionally, Western audiences read left-to-right, so items on the left are subconsciously seen as more important.

Omitted Variable Bias

Omitted variable bias is bias due to one or more crucial survey points not being included. It can also arise when responders are inclined to give inaccurate data for a variety of reasons (this is in the PowerPoint, but I think it is really response bias).

For example: users are scored on how likely they are to buy a product again, and agents target the users with high scores, spending more time trying to sell them products. This agent time is not recorded, so the scores seem to work properly, when the result could be solely due to the agents' increased selling time.
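The example above can be sketched as a small simulation. All of the numbers below are assumptions chosen for illustration: repurchase is made to depend only on agent time and never on the score, and agents give extra time to most high-score customers. The names `simulate_customers` and `repurchase_rate` are hypothetical.

```python
import random


def simulate_customers(n=10_000, seed=0):
    """Return (high_score, extra_time, bought) tuples for n simulated customers."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_score = rng.random() < 0.5
        # Assumption: agents give extra time to most high-score customers.
        extra_time = rng.random() < (0.8 if high_score else 0.2)
        # Assumption: repurchase depends only on agent time, not on the score.
        bought = rng.random() < (0.6 if extra_time else 0.2)
        rows.append((high_score, extra_time, bought))
    return rows


def repurchase_rate(rows, high_score, extra_time=None):
    """Repurchase rate for one score group, optionally at a fixed agent time."""
    group = [
        bought
        for hs, et, bought in rows
        if hs == high_score and (extra_time is None or et == extra_time)
    ]
    return sum(group) / len(group)


rows = simulate_customers()
# Agent time omitted: the score appears to predict repurchase.
print("naive:", repurchase_rate(rows, True), repurchase_rate(rows, False))
# Agent time included: compare customers who received the same treatment.
print("extra time:", repurchase_rate(rows, True, True), repurchase_rate(rows, False, True))
```

Under these assumptions the naive comparison shows high-score customers repurchasing more often, while the gap largely disappears once customers with the same agent time are compared.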

Social Bias

Human-generated data is often inherently biased due to existing culture. For example: Amazon's AI screened women out of the workplace, as most technical roles were filled by men. Also, Microsoft Tay was a tweeting bot trained on Twitter conversations, and within a day it was posting racist, pro-Nazi tweets.

Reflection

Why could you not complete all the work?

Although I was able to build a good understanding of all the bias types, I was not able to make a case study for each of them. This is due to a number of reasons. First, I had not gotten a chance to start this work earlier, as I was trying to switch to web dev and was reluctant to do any work. Second, it is now exam week, and I have three other exams to prepare for on top of data science, so I have not had the time. Third, it was nearly impossible to find any case studies on these topics, as companies do not want to give examples of their failures. I mainly ended up looking at journal articles that analyze bad cases of data bias.

Do you feel sufficiently prepared for the exam?

I think that most people in my class did not write a summary of the PowerPoint; however, completing the case studies would give them an edge over me. I think I have a fairly solid understanding of these bias types, but finding and creating a case study applies your knowledge, and makes it easier to use when the exam comes. I may have a chance to finish this work before the exam, but it may be better for me to spend my time on something else.
