pandas is a Python library for working with tabular data. It is useful for AI engineering, among other things.
conditional selection
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
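A minimal, runnable sketch of this kind of selection. The small reviews DataFrame and its numeric points column are made up here for illustration; they are assumptions, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data; column names are assumptions.
reviews = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "US"],
    "points": [92, 85, 95, 90],
})

# Each comparison produces a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than == and >=.
top_italian = reviews.loc[(reviews.country == "Italy") & (reviews.points >= 90)]
print(top_italian)
```

Use | for "or" and ~ for "not" in the same way; Python's and/or keywords do not work on a Series.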
categorical data
When working with categorical data, it is common to do one-hot encoding. In pandas this can be a two-step process. The first step turns categorical data (like "male" and "female" string values) into numbers (like 1 and 0), which is done using pd.Categorical. See Pandas User Guide: Categorical data.
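As a rough sketch of what pd.Categorical does with string values, assuming a made-up sex column: the strings become categories, and each row is stored as an integer code pointing at its category.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# Convert the string column to a categorical column.
df["sex"] = pd.Categorical(df["sex"])

# The distinct values, sorted lexically: female, male.
print(df["sex"].cat.categories)

# The integer code for each row: female is 0, male is 1.
print(df["sex"].cat.codes.tolist())
```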
Once you have done that, you can then do the one-hot encoding via get_dummies(data_frame). This is a reshaping of the data. See Pandas User Guide: Reshaping and pivot tables, particularly the section on get_dummies, and the API docs for get_dummies. This one-hot encodes each categorical column into n "dummy columns", one per category, which you then train on.
Most importantly, you can just call get_dummies without categorizing your data first. It will detect columns with object data types, assume they are categorical, and go from there.
From the get_dummies section of the Pandas User Guide: Reshaping and pivot tables:
"get_dummies() also accepts a DataFrame. By default, object, string, or categorical type columns are encoded as dummy variables with other columns unaltered."
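A small sketch of that default behavior, using a made-up DataFrame with one string column and one numeric column. No call to pd.Categorical is needed first.

```python
import pandas as pd

# Hypothetical example data: "sex" holds strings, "age" holds integers.
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [22, 38, 26],
})

# The string column is replaced by one dummy column per distinct value;
# the numeric column is passed through unaltered.
encoded = pd.get_dummies(df)
print(encoded)
```

The dummy columns are named after the original column and the value, here sex_female and sex_male. Their dtype depends on the pandas version (boolean in recent releases), so cast with astype(int) if a model needs 0/1 integers.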
So, if you have categorical data that is already a number, don't forget to force it to be categorical, otherwise get_dummies will leave it as a plain numeric column, e.g., df.cabin_class = pd.Categorical(df.cabin_class)
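To illustrate the difference, here is a sketch with a made-up cabin_class column stored as integers. The fare column and all values are assumptions for the example.

```python
import pandas as pd

# Hypothetical example data: cabin_class is a category stored as an integer.
df = pd.DataFrame({
    "cabin_class": [1, 3, 2, 3],
    "fare": [80.0, 7.5, 20.0, 8.0],
})

# With integer dtype, get_dummies has nothing to encode and leaves it alone.
untouched = pd.get_dummies(df)
print(untouched.columns.tolist())

# After forcing the column to be categorical, it is one-hot encoded.
df["cabin_class"] = pd.Categorical(df["cabin_class"])
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```

Another option, if you would rather not change the column's dtype, is to name it explicitly with the columns parameter: pd.get_dummies(df, columns=["cabin_class"]).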
You can also use pd.cut() to "bin" continuous values into discrete intervals, which can then be treated as categorical values. See the cut section of the reshaping user guide.
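A minimal sketch of binning followed by one-hot encoding. The ages, bin edges, and labels are all made up for illustration.

```python
import pandas as pd

# Hypothetical continuous values.
ages = pd.Series([5, 17, 30, 64, 80], name="age")

# Bin into three labeled intervals: (0, 18], (18, 65], (65, 120].
# By default each bin includes its right edge.
age_group = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])
print(age_group.tolist())

# pd.cut returns categorical data, so get_dummies can encode it directly.
encoded = pd.get_dummies(age_group, prefix="age")
print(encoded.columns.tolist())
```

If you do not have meaningful edges in mind, pd.qcut() bins by quantiles instead, so each bin holds roughly the same number of rows.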