In a close race, it is very difficult to predict a winner with confidence. The problem with this election cycle is that most pollsters didn't even predict that the race would be this close. Regardless of the actual election result, one could say that the biggest losers of the 2016 and 2020 election cycles were the pollsters and experts.
How could they be so far off twice in a row, and by such large margins? And not just for the presidential election, but for congressional races too? Like many other analytics geeks, I attributed the wrong prediction in the last presidential election to sampling errors. For one, a minor misrepresentation of key factors can lead to wrong answers when you are studying sparsely populated areas. And if voting patterns are broken down further, say into rural and suburban areas, each segment's sample gets smaller and this sample bias gets even worse.
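To put rough numbers on the small-segment problem, here is a minimal sketch using the textbook margin-of-error formula for a proportion. It assumes a simple random sample, which real polls are not, and the segment names and sample sizes are made up for illustration.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a sampled proportion.

    Assumes a simple random sample; proportion=0.5 is the worst case.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Hypothetical sample sizes: a full poll and two smaller segments carved out of it.
segments = [("full sample", 1000), ("suburban segment", 250), ("rural segment", 60)]

for label, n in segments:
    moe_points = margin_of_error(n) * 100
    print(f"{label:>16}: n={n:4d}, margin of error = +/-{moe_points:.1f} points")
```

Under these assumptions the margin is roughly 3 points for the full sample of 1,000 but close to 13 points for a segment of 60, which is far wider than the gap in most close races.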
To avoid making the same mistake, I heard that some analysts tried to oversample segments like "White Men Without a College Degree" this time. Apparently that wasn't good enough, was it? If the sample sizes of certain segments are to be manipulated, how should it be done, and by how much? How do analysts stay impartial in the face of such challenges? It would be hard to find even two statisticians who completely agree on the methodology.
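One common way to handle a deliberately oversampled segment is to weight it back to its assumed share of the population (post-stratification). The sketch below is only an illustration: the segment shares and support rates are invented numbers, and I am not claiming this is what any particular pollster did.

```python
def poststratification_weights(population_share, sample_share):
    """Weight per segment = assumed population share / share in the sample."""
    return {seg: population_share[seg] / sample_share[seg] for seg in population_share}

def weighted_support(support, sample_share, weights):
    """Support for a candidate after reweighting each segment."""
    total_weight = sum(sample_share[seg] * weights[seg] for seg in sample_share)
    weighted_sum = sum(support[seg] * sample_share[seg] * weights[seg] for seg in sample_share)
    return weighted_sum / total_weight

# All numbers below are hypothetical, chosen only to show the mechanics.
population_share = {"non_college_white_men": 0.20, "everyone_else": 0.80}
sample_share = {"non_college_white_men": 0.35, "everyone_else": 0.65}  # oversampled
support = {"non_college_white_men": 0.60, "everyone_else": 0.45}

weights = poststratification_weights(population_share, sample_share)
unweighted = sum(support[seg] * sample_share[seg] for seg in sample_share)
weighted = weighted_support(support, sample_share, weights)

print(f"unweighted support: {unweighted:.4f}")
print(f"weighted support:   {weighted:.4f}")
```

With these made-up inputs the raw sample shows the candidate just above 50%, while the weighted figure is 48%. The answer flips depending on the population shares the analyst assumes, which is exactly where the judgment calls come in.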
Then there are human factors. They say modeling is half science, half art, and I have always wholeheartedly agreed with that statement. But when I look at totally wrong predictions, I start to think that the "art" part could be problematic in at least some cases. Old-fashioned modeling involves heavy human intervention in selecting variables and in choosing the final algorithm from among the test models. Statisticians, who are known for their strong opinions, can argue about seemingly simple things like the "optimal number of predictors in a model" until the cows come home.
In reality, no statistical effort can reliably compensate for sample bias or, even worse, for "incorrect" data. An old expression applies here: garbage in, garbage out. If a respondent did not answer the question honestly, the answer should be treated as incorrect data. Incorrect data does not come only from data entry or processing errors. Some of it is simply wrong at the source.
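A back-of-the-envelope sketch shows why a bigger sample does not rescue dishonest answers. It assumes, purely for illustration, that a fixed fraction of one candidate's supporters give a different answer and that nobody misreports in the other direction. The true share and the misreport rate are hypothetical.

```python
import math

def expected_reported_share(true_share, misreport_rate):
    """Expected share who report supporting the candidate when a fraction
    of the true supporters give a different answer (assumed one-way)."""
    return true_share * (1 - misreport_rate)

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error under simple random sampling."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

TRUE_SHARE = 0.52      # hypothetical true support
MISREPORT_RATE = 0.05  # hypothetical share of supporters who answer otherwise

bias = expected_reported_share(TRUE_SHARE, MISREPORT_RATE) - TRUE_SHARE

for n in (500, 5000, 50000):
    print(f"n={n:6d}: sampling error = +/-{margin_of_error(n) * 100:.1f} points, "
          f"bias from misreporting = {bias * 100:+.1f} points")
```

The sampling error shrinks as the sample grows, but the bias stays at the same 2.6 points no matter how many people are interviewed.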
The human factor goes beyond model development. Given a prediction, how would a decision maker respond? Let's say a weather forecaster predicts a 60% chance of a shower tomorrow morning. How would a user apply this information? Would he carry an umbrella all day? Would he cancel his golf trip? People use information as they see fit, and that has nothing to do with the validity of the data or the modeling methods used. Would it make a difference if the forecaster were 90% sure of rain? Perhaps. In any case, it is almost impossible to ask ordinary users to set aside all emotion when making decisions.
Granted that we can't expect to eliminate emotional factors on the user side, data scientists still need to find a more impartial way to build predictive models. They may not have complete control over the inputs and outputs of the data flow, but they can transform the available data and choose predictor variables as they wish. And they have full control over the variable selection methods and model development. In other words, there is room to be creative beyond what they are used to or trained to do.