Python for Excel Analysts: Working with Columns

This is the second in a series of blog posts that teaches you how to work with tables of data using Python code. This post introduces you to working with the columns of pandas DataFrame objects.

If you’re unfamiliar with the pandas library, check out Part 1 (The Basics) of this blog series.

Each post in the series has an accompanying Microsoft Excel workbook to download and use to build your skills. This post’s workbook is available for download here.

For convenience, here are links to all the blog posts in this series:

Part 1 – The Basics
Part 2 – Working with Columns (this post)
Part 3 – Filtering Tables
Part 4 – Data Cleaning and Wrangling
Part 5 – Combining Tables

Note: To reproduce the examples in this post, install the Python in Excel trial. If you like this blog series, check out my Anaconda-certified course, Data Analysis with Python in Excel.

Accessing Columns

The last blog post in this series provided an overview of how data tables are represented in Python using the pandas library. You also learned how your Microsoft Excel knowledge maps to Python code.

Mapping your Excel knowledge to Python will be a recurring theme throughout this series. Just as you use functions with columns of data in Excel, you do the same when writing Python code.

For example, it is common in Excel to apply a function to a column to clean data or to calculate some value (e.g., the average). You will do the same using Python code as you move into more advanced data analysis scenarios (e.g., cluster analysis).

You have several options for accessing the columns of pandas DataFrame objects. This blog post will focus on one of the most flexible options, while later blog posts will cover other approaches.

Consider how you would write Excel code to access the SalesAmount column of the InternetSales table:

The above Excel formula treats SalesAmount (a column) as an attribute of the InternetSales object (a table). DataFrame objects work similarly.

The following Python code accesses SalesAmount (a pandas Series) of the internet_sales object (a DataFrame):

DataFrames provide access to columns via the use of square braces. When using this option, you provide the column name you wish to access as a quoted string.

The Python coding convention is to use single quotes, but using double quotes (e.g., “SalesAmount”) is also valid.

The prime benefit of accessing DataFrame columns this way is that complex column names (e.g., names with spaces) are supported.

You can also use square brackets to access multiple columns simultaneously. However, this requires you to use what is known as a Python list. A Python list is a way of declaring a collection of values.

For more information on Python lists, check out this course from Anaconda.

The following Python code accesses both the TotalProductCost and SalesAmount columns of the internet_sales DataFrame:

The Python code above has a nested pair of square braces. The inner pair of square braces creates the Python list containing two values:

‘TotalProductCost’
‘SalesAmount’

Python lists are commonly used when writing pandas code. You will see more examples in later blog posts. For simplicity, this post will only work with one DataFrame column (i.e., a pandas Series object) at a time.

Working with Numeric Columns

As an Excel analyst, you commonly work with columns of numbers. When working with a new dataset, it is common to calculate summary statistics for the numeric columns. Using summary statistics provides insights into the distribution of values within the numeric column.

Some examples of summary statistics include:

The count of values
The minimum value
The average value
The maximum value

Summary Statistics with Excel

The following Excel code shows calculating summary statistics for the SalesAmount column of the InternetSales table:

The output of Fig 5 tells us that the minimum SalesAmount is $2.29 and the average SalesAmount is $486.087. Even with just two summary statistics calculated, we’ve already learned a lot about the SalesAmount column:

Sales can be relatively small compared to the average.
The distribution of SalesAmount values might be skewed.

Summary Statistics with Python

Conceptually, calculating summary statistics using the pandas library in Python is the same as with Microsoft Excel—you request that a function be run on a collection of values. The Python function names are often the same as in Excel!

Here’s an example of finding the minimum value of the SalesAmount column using Python.

First, use the PY() function to create your Python formula:

Next, when you type the “(“ the cell will indicate it contains Python code:

You can use the min() method of a pandas Series object to get the minimum value contained within a column:

Hitting <Ctrl+Enter> on your keyboard runs the Python code and produces the output:

Another name for the average is mean. The following Python formula uses the mean() method of the pandas Series to calculate the average of the SalesAmount column:

Running the above Python formula with <Ctrl+Enter> produces the following output:

Fig 11 – Python code output for Fig 6

s the above illustrates, pandas code for calculating summary statistics is very similar to Excel code. While the above examples are simple, what’s important is the pattern of Python code you use when working with pandas Series objects.

In a later blog post, you will learn how to create new columns in a DataFrame from existing columns to provide more insight into the data. This process is known as feature engineering and builds on the skills you’ve learned in this post.

Describing a Numeric Column with Python

The pandas Series data type provides a handy method for calculating several summary statistics—the describe() method. The following code calls the describe() method on the SalesAmount column:

Running the Python formula will produce a new pandas Series object containing the calculated summary statistics. Using your mouse, hover over the card:

Fig 13 – Hovering your mouse over the card

Clicking the card will display the contents of the Series object:

Fig 14 – The card for the describe() Series object

The Series object returned by the describe() method provides a wealth of information regarding the SalesAmount column and is quicker/easier than writing many Excel formulas to call the corresponding Excel functions:

Of the 60,398 values, 50% are below $30 – despite the average value being $486.087.
75% of the values are below $540.
The SalesAmount column is skewed with many small values and relatively few large values.

Many more methods are available for working with numeric pandas Series objects. See the online documentation for more information.

Working with Text Columns

Text is a common form of business data. Textual data comes in many forms—product names, geographies, addresses, etc. In Python the term string is used to refer to textual data.

Cleaning and transforming string data is very common. It’s likely that in your work as an Excel analyst, you’ve had to wrangle string data before it could be used in analyses. Some common examples of wrangling string data include:

Transforming strings to be upper/lowercase
Extracting substrings
Replacing parts of strings
Removing special characters

String Wrangling with Excel

The following Excel code shows a common data wrangling scenario when working with string data—removing commas.

The ProductName column of the InternetSales table is string data that contains commas:

The following Excel code replaces every instance of a comma followed by a space (i.e., “, ”) with the pipe character (i.e., “|”) contained in the ProductName column:

This type of string wrangling is common when data is shared using comma-separated values (CSV) files.

Running the above Excel code produces the following output:

All the string wrangling operations performed in Excel are available using Python. To use Python for string wrangling, you need to map your Excel knowledge to the applicable pandas Series methods.

String Wrangling with Python

When performing string wrangling using pandas Series objects, the first step is to use the str attribute. Using the str attribute gives you access to the methods you need to perform string wrangling:

The replace() method provides functionality like Excel’s SUBSTITUTE() function. The following code replaces every instance of “, ” with “|”:

The replace() method returns a new pandas Series object containing the wrangled string data. This is like what you saw above in the Excel worksheet after running the SUBSTITUTE() function.

The above Python code stores the new Series object in a variable named prod_name_clean. Using a named variable allows you to easily reuse the object in later Python formulas you might write.

You can inspect the contents of this object by hovering over the card in the cell with your mouse and clicking. Excel will provide a preview of the contents of the prod_name_clean Series object:

AI PLATFORM

ANACONDA AI PLATFORM

PRICING

BY INDUSTRY

PROFESSIONAL SERVICES

RESOURCE CENTER

FOR USERS

COMPANY

PARTNER NETWORK

CONTACT

Technical Notes

Python for Excel Analysts: Working with Columns

Accessing Columns