Pandas DataFrame Validation with Pydantic - Part 2

In Part 1 of this article series we learned about dynamic typing, Pydantic and decorators.

In this part we will learn how to combine these concepts for Pandas DataFrame validation in our codebase.

  1. Combining Decorators, Pydantic and Pandas - Combine sections 2 and 3 of Part 1 to show how to use them for output validation.
  2. Let's define ourselves a proper spaceship!
  3. Summary

1. Combining Decorators, Pydantic and Pandas

Let's define a validate_data_schema decorator that does data validation for functions returning a pandas.DataFrame:

import pandas as pd
from pydantic import BaseModel  # needed for the ValidationWrap helper class below
from pydantic.main import ModelMetaclass
from typing import List

def validate_data_schema(data_schema: ModelMetaclass):
    """This decorator will validate a pandas.DataFrame against the given data_schema."""

    def Inner(func):
        def wrapper(*args, **kwargs):
            res = func(*args, **kwargs)
            if isinstance(res, pd.DataFrame):
                # check result of the function execution against the data_schema
                df_dict = res.to_dict(orient="records")
                
                # Wrap the data_schema into a helper class for validation
                class ValidationWrap(BaseModel):
                    df_dict: List[data_schema]
                # Do the validation
                _ = ValidationWrap(df_dict=df_dict)
            else:
                raise TypeError("Your Function is not returning an object of type pandas.DataFrame.")

            # return the function result
            return res
        return wrapper
    return Inner

Now that was easy, and we can go on using it. Just kidding ;) Let's look at it a little closer.

The decorator first takes a Pydantic model data_schema (e.g. the DictValidator class from Part 1) as an input.

The Inner() and wrapper() functions are the nesting that a decorator with arguments needs. Inside wrapper(), the decorated function is executed and its result is saved to the res object.

If the res object is not a pandas.DataFrame, a TypeError is raised.

If it is a pandas.DataFrame, the frame is transformed to a List[Dict], a small ValidationWrap helper class is created, and then the validation happens by passing the df_dict to the ValidationWrap class.

If there is any problem with the content of the pandas.DataFrame that does not fit with the defined Pydantic model, an error will be raised.

If everything is fine, the res object is returned.
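
The three nesting levels are easier to see with the Pydantic part stripped away. Below is a minimal sketch of the same pattern, assuming a hypothetical validate_return_type decorator (not part of the code above) that only checks the type of the return value. It also uses functools.wraps, which the decorator above omits, so that the decorated function keeps its name and docstring.

```python
import functools


def validate_return_type(expected_type):
    """Hypothetical decorator factory: the outer call receives the decorator's argument."""

    def inner(func):
        # inner() receives the function that is being decorated.
        @functools.wraps(func)  # keep func.__name__ and func.__doc__ on the wrapper
        def wrapper(*args, **kwargs):
            # wrapper() runs on every call: execute the function, then check the result.
            res = func(*args, **kwargs)
            if not isinstance(res, expected_type):
                raise TypeError(
                    f"{func.__name__} returned {type(res).__name__}, "
                    f"expected {expected_type.__name__}."
                )
            return res

        return wrapper

    return inner


@validate_return_type(expected_type=list)
def split_words(text, as_list=True):
    """Split text into words."""
    words = text.split()
    return words if as_list else tuple(words)


print(split_words("to boldly go"))  # a list, so the result is returned unchanged
try:
    split_words("to boldly go", as_list=False)  # returns a tuple instead of a list
except TypeError as e:
    print(e)
```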

Let's test it out with a toy example!

from pydantic import BaseModel, Field
# We expect a DataFrame with the columns id, name and height.
# id must be an int >= 1, name a string of at most 20 characters, and height a float between 0 and 250.
class AvatarFrameDefinition(BaseModel):
    id: int = Field(..., ge=1)
    name: str = Field(..., max_length=20)
    height: float = Field(..., ge=0, le=250, description="Height in cm.")


@validate_data_schema(data_schema=AvatarFrameDefinition)
def return_user_avatars(user_id: int) -> pd.DataFrame:
    # Let's use the user_id as the height for the Mustermann avatar to trigger the validation.
    return pd.DataFrame(
        [
            {"id": user_id, "name": "Sebastian", "height": 178.0},
            {"id": user_id, "name": "Max", "height": 218.0},
            {"id": user_id, "name": "Mustermann", "height": user_id},
        ]
    )

return_user_avatars(user_id=42)  # works

from pydantic import ValidationError

try:
    return_user_avatars(user_id=342) # gives an error!
except ValidationError as e:
    print(e)

1 validation error for ValidationWrap
df_dict -> 2 -> height
  ensure this value is less than or equal to 250 (type=value_error.number.not_le; limit_value=250)
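
The location df_dict -> 2 -> height points at the record at position 2 of the list produced by to_dict(orient="records"), i.e. the third row. Below is a sketch of how these positions could be collected programmatically with ValidationError.errors(). The Avatar and AvatarRows models and the failing_rows helper are hypothetical names for this example, and it works on a plain list of records so that it runs without pandas.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Avatar(BaseModel):
    id: int = Field(..., ge=1)
    name: str = Field(..., max_length=20)
    height: float = Field(..., ge=0, le=250)


class AvatarRows(BaseModel):
    # Same idea as ValidationWrap: one field holding a list of records.
    rows: List[Avatar]


def failing_rows(records):
    """Return the sorted list positions of all records that fail validation."""
    try:
        AvatarRows(rows=records)
    except ValidationError as exc:
        # Each error carries a "loc" tuple such as ("rows", 2, "height").
        return sorted({err["loc"][1] for err in exc.errors() if len(err["loc"]) > 1})
    return []


records = [
    {"id": 342, "name": "Sebastian", "height": 178.0},
    {"id": 342, "name": "Max", "height": 218.0},
    {"id": 342, "name": "Mustermann", "height": 342.0},
]
print(failing_rows(records))  # positions of the records that failed validation
```

Note that these are positions in the list of records, not DataFrame index labels.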

The decorator works as expected! To be sure, let's test it some more!

Let's pass it a float as a user_id. Since user_id is used for the id column, which we defined as an int in the Pydantic model, we should get an error.

return_user_avatars(34.0)  # works? T.T

First, we see that PyCharm picks up our type hints and highlights that 34.0 is a float and not an int, but Pydantic does not raise an error. The function is executed without any issue, so what is going on? Is our good idea worthless in the end?

Here we have to better understand how Pydantic works. We have to be specific about how strict we want to be in the evaluation.

If we define a type as int in Pydantic, Pydantic will try to call int(value), and if that does not throw an error, everything is fine. This could introduce unwanted bugs, so we have to be more specific.
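
To see why this is risky, it helps to look at what int() accepts on its own. The snippet below uses only built-in Python. The is_strict_int helper is a hypothetical illustration of what a strict check means; it is not how Pydantic implements it.

```python
# int() happily converts many values that are not integers.
print(int(34.0))   # 34
print(int(34.9))   # 34, the fractional part is silently dropped
print(int("7"))    # 7
print(int(True))   # 1


def is_strict_int(value):
    """Hypothetical strict check: accept only real ints, not floats, strings or bools."""
    # bool is a subclass of int, so it has to be excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


print(is_strict_int(34))    # True
print(is_strict_int(34.0))  # False
```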

Let's use conint (constrained int), confloat and constr from Pydantic to combine the strict=True mode with our conditions. (If there are no further conditions, we could also directly import the type StrictInt.)

from typing import Optional
from pydantic import BaseModel, Field, conint, confloat, constr

class StrictAvatarFrameDefinition(BaseModel):

    id: conint(strict=True, ge=1)
    name: constr(max_length=20)
    height: Optional[confloat(strict=True, ge=0, le=250)] = Field(None, description="Height in cm.")

@validate_data_schema(data_schema=StrictAvatarFrameDefinition)
def return_user_avatars(user_id: int) -> pd.DataFrame:
    # Let's use the user_id as the height for the Mustermann avatar to trigger the validation.
    user_avatars = pd.DataFrame(
        [
            # Heights are written as floats so the column stays float for any user_id.
            {"id": user_id, "name": "Sebastian", "height": 178.0},
            {"id": user_id, "name": "Max", "height": 218.0},
            {"id": user_id, "name": "Mustermann", "height": user_id},
        ]
    )

    return user_avatars

try:
    return_user_avatars(user_id=42.0) # does not work anymore
except ValidationError as e:
    print(e)

3 validation errors for ValidationWrap
df_dict -> 0 -> id
  value is not a valid integer (type=type_error.integer)
df_dict -> 1 -> id
  value is not a valid integer (type=type_error.integer)
df_dict -> 2 -> id
  value is not a valid integer (type=type_error.integer)
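
For fields without extra conditions, the StrictInt type mentioned above is enough. Below is a minimal sketch comparing a plain int field with StrictInt. LaxId and StrictId are hypothetical model names used only for this comparison.

```python
from pydantic import BaseModel, StrictInt, ValidationError


class LaxId(BaseModel):
    id: int  # plain int: a float like 42.0 is coerced to 42


class StrictId(BaseModel):
    id: StrictInt  # strict: only real ints are accepted


print(LaxId(id=42.0).id)  # the float was coerced to an int

try:
    StrictId(id=42.0)
except ValidationError as e:
    print(e)  # the float is rejected
```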

So we see that Pydantic is a package with its own syntax that one has to get comfortable with. Once accustomed to it, Pydantic can be a real help in maintaining type consistency, output shape, and content validation for arbitrarily complex DataFrame outputs of properties and functions.

Also note that the decorator did not have to be touched to fix our float error; only the Pydantic model definition had to be adjusted. This shows the general flexibility and usefulness of decorators.

Let's look at one more complex example to finish up the article.

2. Let's define ourselves a proper spaceship!

First we define the ship types using an Enum class to make sure no one tries to define an alien spaceship.

from enum import Enum
class SpaceShipTypes(
    str,
    Enum,
):
    """Possible spaceship Types."""

    light_fighter = "light_fighter"
    cruiser = "cruiser"
    battlecruiser = "battlecruiser"
    destroyer = "destroyer"
    death_star = "death_star"
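
Because the enum also inherits from str, its members compare equal to plain strings, and looking up an unknown value raises a ValueError. Below is a short, self-contained sketch. ShipTypeDemo is a hypothetical, shortened copy of the enum above so that the snippet runs on its own.

```python
from enum import Enum


class ShipTypeDemo(str, Enum):
    """Shortened, hypothetical copy of the SpaceShipTypes enum."""

    cruiser = "cruiser"
    death_star = "death_star"


# Lookup by value returns the matching member.
ship = ShipTypeDemo("cruiser")
print(ship is ShipTypeDemo.cruiser)  # True
print(ship == "cruiser")             # True, thanks to the str mixin

try:
    ShipTypeDemo("alien_spaceship")
except ValueError as e:
    print(e)  # unknown values are rejected
```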

Now we define our SpaceShipClass.

A proper spaceship needs

  • A ship_type (matching the Enum class above)
  • A ssin [spaceship identification number], which is defined as a 12-character string starting with 2 uppercase letters, followed by 9 uppercase letters or digits, followed by 1 digit. We will validate it using a regular expression.
  • A build_date. We will use a custom validator that checks that the spaceship was not built in the future (although possible), and was not built before the era of space travel.
  • A deprecation_date (for tax purposes). We will validate that it is later than the build date.

import datetime
from pydantic import validator

# Let's define our SpaceShipClass!
class SpaceShipClass(BaseModel):
    ship_type: SpaceShipTypes  # the previously defined Enum class
    
    ssin: str = Field(  # ssin validation using regex
        ...,
        max_length=12,
        min_length=12,
        regex="^([A-Z]{2})([0-9A-Z]{9})([0-9]{1})$",
        title="SSIN",
        description="A spaceship identification number.",
    )

    build_date: datetime.date = Field(  # build date with custom @validator below
        ...,
        title="Build Date",
        description="The date the spaceship was produced. Has to be a str in ISO 8601 format, like: 2020-09-25.",
    )

    @validator("build_date")
    def build_date_ok(cls, build_date):
        """Validate build_date value."""
        min_build_date = datetime.date(year=1961, month=4, day=12)
        if build_date > datetime.date.today():
            raise ValueError("Build date should not be in the future.")
        elif build_date < min_build_date:
            raise ValueError(
                "Build date must be larger than {min_date}.".format(min_date=min_build_date),
            )
        return build_date

    # Deprecation date with custom @validator that ensures that it is later than the build date.
    # By adding a `values` argument to the @validator we can access previously validated fields of the class.
    deprecation_date: datetime.date = Field(
        None,
        title="Deprecation Date",
        description="The date the spaceship will retire. Has to be a str in ISO 8601 format, like: 2020-09-25.",
    )

    @validator("deprecation_date")
    def deprecation_date_ok(cls, deprecation_date, values):
        """Validate the deprecation_date value."""
        if "build_date" in values:  
            # 'if "build_date" in values:' is necessary because if the validation already fails on the build_date variable,
            # it will not be present for validation of this field, which would therefore create a key error
            # (behaviour given by the Pydantic package).
            if deprecation_date <= values["build_date"]:
                raise ValueError("The deprecation date must be larger than the build date.")
        return deprecation_date

Now that is a lot of checks and balances to ensure we only define proper spaceships!
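
The SSIN pattern and the date rules can also be tried in isolation, which is handy when tuning them. Below is a sketch using only the standard library. SSIN_PATTERN, MIN_BUILD_DATE, ssin_ok and dates_ok are hypothetical names, and today is passed in explicitly (instead of calling datetime.date.today()) so that the results are reproducible.

```python
import datetime
import re

# Same regular expression and minimum date as in SpaceShipClass.
SSIN_PATTERN = re.compile(r"^([A-Z]{2})([0-9A-Z]{9})([0-9]{1})$")
MIN_BUILD_DATE = datetime.date(1961, 4, 12)


def ssin_ok(ssin):
    """Return True if ssin matches the 12-character SSIN pattern."""
    return SSIN_PATTERN.match(ssin) is not None


def dates_ok(build_date, deprecation_date, today):
    """Return True if both dates satisfy the rules used in SpaceShipClass."""
    return MIN_BUILD_DATE <= build_date <= today and deprecation_date > build_date


today = datetime.date(2024, 1, 1)  # fixed date so the example is reproducible
print(ssin_ok("DE342INWT944"))  # True
print(ssin_ok("FR123"))         # False, too short
print(ssin_ok("DE342INWT94X"))  # False, the last character must be a digit
print(dates_ok(datetime.date(2020, 9, 25), datetime.date(2030, 9, 25), today))  # True
print(dates_ok(datetime.date(2020, 9, 25), datetime.date(2010, 9, 25), today))  # False
```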

Let's use it in a spaceship creation function.

@validate_data_schema(data_schema=SpaceShipClass)
def diy_ss(ship_type, ssin, build_date, deprecation_date):
    return pd.DataFrame(
        {"ship_type": ship_type,
         "ssin": ssin,
         "build_date": build_date,
         "deprecation_date": deprecation_date},
        index=[0]
    )

diy_ss(
    ship_type= "battlecruiser",
    ssin= "DE342INWT944",
    build_date= "2020-09-25",
    deprecation_date= "2030-09-25"
)

The creation of the "DE342INWT944" was a success. Let's check what happens if someone makes an error defining a spaceship.

try:
    diy_ss(
        ship_type= "battlecruiser",
        ssin= "FR123",
        build_date= "2020-09-25",
        deprecation_date= "2030-09-25"
    )
except ValueError as e:
    print(e)

1 validation error for ValidationWrap
df_dict -> 0 -> ssin
  ensure this value has at least 12 characters (type=value_error.any_str.min_length; limit_value=12)

try:
    diy_ss(
        ship_type= "battlecruiser",
        ssin= "DE342INWT944",
        build_date= "1900-09-25",
        deprecation_date= "2030-09-25"
    )
except ValueError as e:
    print(e)

1 validation error for ValidationWrap
df_dict -> 0 -> build_date
  Build date must be larger than 1961-04-12. (type=value_error)

try:
    diy_ss(
        ship_type= "battlecruiser",
        ssin= "DE342INWT944",
        build_date= "2030-09-25",
        deprecation_date= "2030-09-25"
    )
except ValueError as e:
    print(e)

1 validation error for ValidationWrap
df_dict -> 0 -> build_date
  Build date should not be in the future. (type=value_error)

try:
    diy_ss(
        ship_type= "battlecruiser",
        ssin= "DE342INWT944",
        build_date= "2020-09-25",
        deprecation_date= "2010-09-25"
    )
except ValueError as e:
    print(e)

1 validation error for ValidationWrap
df_dict -> 0 -> deprecation_date
  The deprecation date must be larger than the build date. (type=value_error)

The complex validation is working like a charm.

We will have to get accustomed to the Pydantic syntax to really get the hang of things, but many functions can already be validated on the fly now.

3. Summary

In the first part of this article series we discussed the downsides of Python's dynamic typing with regard to data quality and code maintainability. We gave an introduction to the Pydantic package and showed how decorators work.

In this second part we combined the two concepts to show how to create a simple, yet flexible and powerful (thanks to Pydantic) way to do complex DataFrame validation. The decorator syntax allows easy addition to an existing code base.

This way, unit tests for functions that return DataFrames can be reduced, and the data quality of production pipelines can be ensured.

We also showed the importance of really understanding the Pydantic package and syntax to ensure you are actually validating what you think you are validating.

This article shows an interesting way to combine existing Python packages and concepts to tackle some problems of the Python programming language.