pandas

October 25, 2023 (last updated September 21, 2024)

pandas is a Python library for working with tabular data. It is useful for AI Engineering, among other things.

conditional selection

reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
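A minimal runnable sketch of the same kind of selection. The small `reviews` frame here is a hypothetical stand-in, and the column names `country` and `points` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in data; the column names are assumptions.
reviews = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "US"],
    "points": [92, 85, 95, 90],
})

# Each comparison produces a boolean Series, and & combines them element-wise.
# The parentheses are required because & binds more tightly than == and >=.
top_italian = reviews.loc[(reviews.country == "Italy") & (reviews.points >= 90)]
print(top_italian)
```

Use `|` for "or" and `~` for "not" in the same way. Python's `and`, `or`, and `not` do not work element-wise on a Series.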

categorical data

When working with categorical data, it is common to do one-hot encoding. In pandas this can be a two-step process. The first step is to mark the data (like "male" and "female" string values) as categorical using pd.Categorical, which stores each value internally as an integer code (like 1 and 0). See Pandas User Guide: Categorical data.
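A small sketch of what pd.Categorical does to a string column. The frame and the column name `sex` are made up for illustration.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# Convert the string column to the category dtype.
df["sex"] = pd.Categorical(df["sex"])

# The distinct labels, and the integer code stored for each row.
categories = list(df["sex"].cat.categories)
codes = list(df["sex"].cat.codes)

print(df["sex"].dtype)  # category
print(categories)       # ['female', 'male']
print(codes)            # [1, 0, 0, 1]
```

The codes index into the sorted list of categories, so "female" is 0 and "male" is 1 here.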

Once you have done that, you can then do the one-hot encoding via pd.get_dummies(data_frame). This is a reshaping of the data. See Pandas User Guide: Reshaping and pivot tables, particularly the section on get_dummies, and the API docs on get_dummies. This will one-hot encode each categorical column with n categories into n "dummy columns", which you then train on.

Most importantly, you can just call get_dummies without categorizing your data first. It will detect columns with object data types, assume they are categorical, and go from there.

From the Pandas User Guide: Reshaping and pivot tables: section on get_dummies

get_dummies() also accepts a DataFrame. By default, object, string, or categorical type columns are encoded as dummy variables with other columns unaltered.
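As a rough illustration of that default behavior, here is a sketch with a made-up frame. The column names `age` and `embarked` are assumptions for the example.

```python
import pandas as pd

# Hypothetical data: one numeric column and one string column.
df = pd.DataFrame({
    "age": [22, 38, 26],
    "embarked": ["S", "C", "S"],
})

# No pd.Categorical step: the string column is detected and encoded,
# while the numeric column passes through unchanged.
encoded = pd.get_dummies(df)

print(list(encoded.columns))  # ['age', 'embarked_C', 'embarked_S']
```

Each dummy column is named after the original column and the category value, joined by an underscore.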

So, if you have categorical data that is already a number, don't forget to force it to be categorical, e.g., df.cabin_class = pd.Categorical(df.cabin_class). Otherwise get_dummies will treat it as an ordinary numeric column and leave it unaltered.
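A sketch of the difference this makes, using invented values for `cabin_class` and a hypothetical `fare` column:

```python
import pandas as pd

# Hypothetical data: cabin_class is categorical in meaning but stored as integers.
df = pd.DataFrame({
    "cabin_class": [1, 3, 2, 3],
    "fare": [71.3, 7.9, 13.0, 8.1],
})

# Integer columns are not encoded by default.
untouched = pd.get_dummies(df)

# After forcing the column to be categorical, it is one-hot encoded.
df["cabin_class"] = pd.Categorical(df["cabin_class"])
encoded = pd.get_dummies(df)

print(list(untouched.columns))
print(list(encoded.columns))
```

An alternative that leaves the original dtype alone is to name the column explicitly, as in pd.get_dummies(df, columns=["cabin_class"]).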

You can also use pd.cut() to "bin" continuous values into discrete intervals, which can then be treated as categorical values. See the cut section of the Pandas User Guide: Reshaping and pivot tables.
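A minimal sketch of binning followed by one-hot encoding. The bin edges and labels below are arbitrary choices for illustration, not recommended values.

```python
import pandas as pd

# Hypothetical continuous values.
ages = pd.Series([5, 17, 30, 64, 80])

# Bins are right-inclusive by default: (0, 18], (18, 65], (65, 120].
age_group = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

# The result already has the category dtype, so it can go straight to get_dummies.
encoded = pd.get_dummies(age_group)

print(age_group.tolist())     # ['child', 'child', 'adult', 'adult', 'senior']
print(list(encoded.columns))  # ['child', 'adult', 'senior']
```

Values that fall outside all bins come back as NaN, so it is worth choosing edges that cover the full range of the data.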