How to Extract Data Subsets Using Boolean Algebra and First-Order Logic with pandas & Python

A guide on selecting your data by turning complex conditioning into writing a simple boolean expression.

Senhaji Rhazi hamza
Python in Plain English

--

Photo by Anthony Riera on Unsplash

What you will learn in this article?

Assuming you know what a pandas DataFrame is in Python, you will learn how you can express on this DataFrame, some complex conditions to extract a subset of your data, into easier ones by using some notions of boolean algebra & first-order logic to extract the subset of data that satisfies these conditions

What this article is not?

This article isn’t another article that focuses on how many ways exist to select DataFrames in pandas, it puts the focus more on how you can use boolean algebra & first-order logic to express some quite complex conditions on your DataFrame.

How did I write this article?

Last time, a biologist post-doc PhD friend of mine, needed to do some analysis on some genome tabular data he works on, the conditions that he needed were quite complex to write on a DataFrame, so while we were working together, as lazy as all developers are, I knew that the formalism we were using to express the required conditions could be improved, then I remembered some of my college courses of logic (propositional logic/boolean algebra, first-order logic, second-order logic) and eureka, I told him, we would express these conditions in a much easier and flexible manner using propositional logic and maybe some first-order logic.

I’m not sure whether he understood what I meant, but I’m sure he trusted me, so we rewrote all the conditions, the old and new ones, with this formalism.

The aim here is to present to you how we expressed conditions using boolean algebra & first-order logic

A Quick Reminder on Logic

Boolean Algebra

Boolean algebra or propositional logic, deal with the truth valuation of propositional statements, it’s just fancy jargon to say that we can calculate if a statement is true or false. Like elementary algebra where we have numbers, that we can combine with operators (+, *,-,/) here we have values {True (sometimes referred to as 1), False (sometimes referred to as 0)} and operators :

  • AND (conjunction), denoted xy (sometimes x AND y or Kxy), satisfies xy = 1 if x = y = 1, and xy = 0 otherwise.
  • OR (disjunction), denoted xy (sometimes x OR y or Axy), satisfies xy = 0 if x = y = 0, and xy = 1 otherwise.
  • NOT (negation), denoted ¬x (sometimes NOT x, Nx, x̅, x ‘ or !x), satisfies ¬x = 0 if x = 1 and ¬x = 1 if x = 0.

See more details in Wikipedia

First-order Logic

First-order logic is an extension to propositional logic (which is an equivalent system to boolean algebra) adds quantifiers {∃, ∀} and predicates P(x). Predicates are like lambda functions (that take one argument and return a value) that return True or false, see how they are combined:

  • ∃ x : P(x) means there is at least one x such that P(x) is true.
  • x: P(x) or (x) P(x) means P(x) is true for all x.

See more details on Wikipedia

Let's Go Back to Some Coding

We will not be using my friend’s data, since it’s his lab’s data, instead, we will use mock data corresponding to a person’s details, then simulate conditioning, we will pose a set of criteria in a form of questions that increase gradually in complexity, we will answer the first 5 by the formalism of boolean algebra, then the last two ones we will use first-order logic, the code will be written in a notebook and will be linked by the end of this article, so here are the steps again:

  1. Generating mock data (person details type data)
  2. Define our operators and extractors
  3. Answer this set of questions :

Using only Boolean Algebra

  • A person whose first name starts with an a
  • A person whose last name starts with a b
  • A person whose first name doesn’t end with an a and her last name doesn’t end with a b
  • A person whose country code is not US and verify the condition/question (1 and 2) or 3
  • A person who verifies condition 4 and verifies either condition 1 or 3 then currency code isn’t ZWD

Using Some First-order Logic

  • A person who verifies that at least one of his columns start with a +
  • A person who verifies that all of her first name, job and last name start with an a

1. Generating Mock Data (Data on Personal Details)

We will use the python lib faker, define a function, and a config argument, instantiate the function with config argument, get the DataFrame result:

Result:

Here we have generated 8000 random row records as specified in the config argument fake_config for the columns [‘first_name’, ‘last_name’, ‘email’, ‘country_code’, ‘phone_number’, ‘job’, ‘currency_code’]

2. Define Our Operators and Extractors

Operators:

Extractors:

3) Answer a Set of Questions

Boolean Algebra Questions

3.1 A person whose first name starts with an a.

The corresponding code:

#A person whose first name starts with an alambda_func_startswith_a = lambda cell : cell.lower().startswith('a')column = 'first_name'c1 = get_condition_on_df_col(df, column, lambda_func_startswith_a)df[c1].head()

3.2 A person whose last name starts with a b.

The corresponding code:

lambda_func_startswith_b = lambda cell : cell.lower().startswith('b')column = 'last_name'c2 = get_condition_on_df_col(df, column, lambda_func_startswith_b)df[c2].head()

3.3 A person whose first name doesn’t end with an a and her last name doesn’t end with a b.

The corresponding code:

lambda_func_endswith_a = lambda cell : cell.lower().endswith('a')column = 'first_name'c3A = get_condition_on_df_col(df, column, lambda_func_endswith_a)lambda_func_endswith_b = lambda cell : cell.lower().endswith('b')column = 'last_name'c3B = get_condition_on_df_col(df, column, lambda_func_endswith_b)c3 = and_(not_(c3B), not_(c3A)) # could be written also as not_(and_(c3B, c3A))df[c3].head()

3.4 A person whose country code is not US and verify the previous conditions (1 and 2) or 3.

The corresponding code:

#A person whose country code is not US and verify the condition (1 and 2) or 3lambda_func_country_code_not_us = lambda cell : cell != 'US'column = 'country_code'c4A = get_condition_on_df_col(df, column, lambda_func_country_code_not_us)c4 = and_(c4A, or_(and_(c1, c2), c3))df[c4].head()

3.5 A person who verifies condition 4 and verifies either condition 1 or 3 then currency code isn’t ZWD.

The corresponding code:

lambda_func_country_code_not_zwd = lambda cell : cell != 'ZWD'column = "country_code"c5A = and_(c4, or_(c1, c3))c5B =  get_condition_on_df_col(df, column, lambda_func_country_code_not_zwd)c5 = and_(c5A, c5B)df[c5].head()

First-order Logic Questions

3.6 A person who verifies that all of her first name, job and last name start with an a.

The corresponding code:

lambda_func_startswith_plus = lambda cell : cell.lower().startswith('+')c6 = get_condition_on_at_least_one_col(df, lambda_func_startswith_plus)df[c6].head()

3.7 A person who verifies that every of his first name, job and last name start with an a.

The corresponding code:

lambda_func_startswith_a = lambda cell : cell.lower().startswith('a')columns = ['first_name', 'last_name', 'job']c7 = get_condition_on_every_col(df, lambda_func_startswith_a, columns)df[c7].head()

NB: If we wanted to be more formal in simulating the operator ∀ we should have applied the condition for every column, but the result will be empty for our data, since we are not mathematicians but engineers we have given ourselves some flexibility by applying ∀ only on the columns ['first_name', 'last_name', 'job'] (Sorry dear mathematicians for our blasphemies, and thank you for your work)

Conclusion

Starting from questions 4, 5 the conditions start to get a bit harder to express without a formalism, using the formalism of boolean algebra and first-order logic (for questions 6 and 7) made it easier to express and combine conditions in our dataset.

This formalism is also useful because when using it, the calculations and the theorems associated with it are still valid. Thus we can simplify complex conditions into simpler equivalent ones on the paper before expressing them in the DataFrame.

Here is the notebook containing all the code shown here.

Hope you liked it, feel welcome to clap or/and connect.

More content at plainenglish.io

--

--