Two Painful Facts about Data Science

The limitations of two critical techniques

Pavan B Govindaraju
2 min read · Aug 21, 2021

Data scientists work on real-world problems, which rarely come with the simplifying assumptions found in textbooks. But there is a reason why textbook problems are “simplified” in the first place: theoretical results need assumptions about the constraints and the problem formulation before anything can be proved. Real-world problems, however, often have a few painful characteristics that make them provably intractable in general. In this post, we are going to look at two such facts.

Photo by Usman Yousaf on Unsplash

Non-Linear Optimization

Many problems in business settings involve maximizing a particular outcome, such as revenue. Optimization is a well-established field that studies techniques for finding the “optimal” solution. Problems governed by equations can also be solved this way, by driving the residual towards zero. However, this works reliably only when the quantity being optimized satisfies certain assumptions. When the objective function or the constraints are non-linear, one popular approach is Branch and Bound, which recursively splits the problem into sub-problems and uses existing state-of-the-art optimization techniques to solve or bound each of them. If the problem is complex enough, which is often the case in practice, this approach can degenerate into an exhaustive search or, even worse, an endless one.
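To make the idea concrete, here is a minimal sketch of Branch and Bound on a one-dimensional toy problem. It is an illustration under stated assumptions, not a production solver: the names `branch_and_bound` and `objective` are hypothetical, and the lower bound assumes the function is Lipschitz continuous with a known constant (3.5 here, which follows from the derivative of this particular toy function).

```python
import heapq
import math


def objective(x):
    # Hypothetical non-convex toy function with several local minima.
    return math.sin(3.0 * x) + 0.5 * x


def branch_and_bound(f, lo, hi, lipschitz, tol=1e-3, max_nodes=100_000):
    """Minimize f on [lo, hi], assuming |f(x) - f(y)| <= lipschitz * |x - y|."""

    def evaluate(a, b):
        # f cannot drop below the midpoint value minus L times the half-width.
        mid = 0.5 * (a + b)
        val = f(mid)
        return val - lipschitz * 0.5 * (b - a), mid, val

    bound, best_x, best_val = evaluate(lo, hi)
    heap = [(bound, lo, hi)]  # min-heap ordered by lower bound
    nodes = 0
    while heap and nodes < max_nodes:
        bound, a, b = heapq.heappop(heap)
        if bound > best_val - tol:
            break  # no remaining interval can beat the incumbent by more than tol
        nodes += 1
        mid = 0.5 * (a + b)
        for c, d in ((a, mid), (mid, b)):  # branch: split the interval in two
            child_bound, child_mid, child_val = evaluate(c, d)
            if child_val < best_val:
                best_x, best_val = child_mid, child_val
            if child_bound < best_val - tol:  # bound: keep only promising intervals
                heapq.heappush(heap, (child_bound, c, d))
    return best_x, best_val, nodes


best_x, best_val, nodes = branch_and_bound(objective, -3.0, 3.0, lipschitz=3.5)
print(f"x = {best_x:.4f}, f(x) = {best_val:.4f}, intervals explored = {nodes}")
```

The `nodes` count is the number to watch. Tightening `tol`, loosening the bound (a larger Lipschitz constant), or moving to more dimensions all increase the number of regions that cannot be pruned, which is how the search drifts towards exhaustive enumeration. The `max_nodes` cap is there precisely because termination in reasonable time is not guaranteed.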

Causal Inference

A big part of every data scientist’s job is to figure out the fundamental characteristics that drive the business. This involves establishing “causality” between two factors. Causal inference is a field that is foundational to all sciences and deals with determining the independent effect of one component in a complex system. For example, to study whether a vaccine works, one would ideally have to both give and not give the vaccine to the same person, so that all other factors stay constant. Since this is impossible, the next best solution is to randomly split the group into two, hoping that no systematic differences remain between the two halves, although this can never be decisively shown. This approach forms the basis of a randomized controlled trial and is commonplace in science.
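The effect of random assignment can be sketched with a small simulation. Everything here is synthetic and assumed for illustration: the function name `simulate_rct`, the effect size of 2.0, and the normally distributed `baseline` standing in for all the individual factors we cannot observe.

```python
import random
import statistics


def simulate_rct(n=20_000, true_effect=2.0, seed=0):
    """Simulate a randomized trial and return the difference in group means."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        baseline = rng.gauss(0.0, 1.0)  # unobserved individual factors
        if rng.random() < 0.5:  # coin flip, independent of baseline
            treated.append(baseline + true_effect)
        else:
            control.append(baseline)
    return statistics.fmean(treated) - statistics.fmean(control)


estimate = simulate_rct()
print(f"estimated effect: {estimate:.3f} (effect built into the simulation: 2.0)")
```

Because the coin flip ignores `baseline`, the unobserved factors average out across the two groups and the difference in means lands close to the built-in effect. With a small `n`, the groups can still differ by chance, which is the "can never be decisively shown" caveat in practice.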

It gets more complicated in fields such as macroeconomics, where controlled trials cannot be performed and the underlying causal model has to be inferred from observational data. This is possible only under strict assumptions about the data, and arguments must be made to show that certain factors are independent of each other.
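A small synthetic example shows why those assumptions matter. The sketch below is hypothetical throughout: a single binary confounder `z` influences both who gets treated and the outcome, and the effect sizes (1.0 for the treatment, 3.0 for the confounder) are made up. Stratifying on `z` recovers the treatment effect only because the simulation assumes `z` is the sole confounder and is fully observed.

```python
import random
import statistics


def simulate_observational(n=40_000, true_effect=1.0, seed=0):
    """Generate (z, t, y) rows where z drives both treatment t and outcome y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        p_treat = 0.8 if z else 0.2  # treatment is NOT randomly assigned
        t = 1 if rng.random() < p_treat else 0
        y = true_effect * t + 3.0 * z + rng.gauss(0.0, 1.0)
        rows.append((z, t, y))
    return rows


def mean_outcome(rows, t, z=None):
    # Mean outcome for treatment status t, optionally within stratum z.
    return statistics.fmean(
        [yi for zi, ti, yi in rows if ti == t and (z is None or zi == z)]
    )


def naive_estimate(rows):
    # Compares treated and untreated directly, ignoring the confounder.
    return mean_outcome(rows, 1) - mean_outcome(rows, 0)


def stratified_estimate(rows):
    # Compares within each stratum of z, then averages by stratum size.
    estimate = 0.0
    for z in (0, 1):
        weight = sum(1 for zi, _, _ in rows if zi == z) / len(rows)
        estimate += weight * (mean_outcome(rows, 1, z) - mean_outcome(rows, 0, z))
    return estimate


rows = simulate_observational()
print(f"naive: {naive_estimate(rows):.2f}, stratified: {stratified_estimate(rows):.2f}")
```

The naive comparison overstates the effect because treated units are more likely to have `z = 1`. If `z` were unobserved, or if a second unmeasured confounder existed, the stratified estimate would be biased as well, and nothing in the data alone would reveal it.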

Summary

Many problems faced by data scientists can be abstracted into a non-linear optimization or causal inference problem. Both fields, however, face fundamental limitations, and these limit the utility of a data scientist in a business setting. This is why decisions can only be “data-driven” but not “data-automated”, and the final call must be made by executives based on intuition.

