Loading...

Blog

Latest posts

Pandas tips I wish I knew before

How does pivot work? What is the main pandas building block? And more …

Photo by Ilona Froehlich on Unsplash

I’ve been a pandas power user for a few years now. There were times when I thought I’ve mastered it and after a few months discovered I was coding like a noob. We’ve all been there 😊 The tips I’m sharing here with you are the ones I learned most recently.

To run the examples download this Jupyter notebook.

To Step Up Your Pandas Game, read:

Let’s start with an Explosive tip 💥

Gif downloaded from giphy

Pandas has a function explode. Don’t worry, it is completely safe to use it 😊 The function is useful when you have lists stored in a DataFrame column. It unpacks the values in a list and it duplicates all other values — hence explode 💥

Let’s create a DataFrame with a column that has a random number of elements in lists.

n = 10
df = pd.DataFrame(
{
"list_col": [[random.randint(0, 10) for _ in range(random.randint(3, 5))] for _ in range(10)],
}
)
df.shape(10, 1) # output

Now, let’s execute the explode function.

df = df.explode("list_col")
df.shape(40, 1) #output

In the table below, we can observe that the index was duplicated due to unpacking values from the list.

DataFrame is not pandas main building block

Meme created imgflip.com

Pandas is almost synonymous with DataFrames, but you may be surprised if I tell you that it is not its main building block. Pandas DataFrame is build on top of pandas Series. So each column in a DataFrame is a Series. We could go deeper, what is pandas Series based on? Numpy! But let’s don’t go that deep.

I even wrap Python’s list in a pandas Series when benefits outweigh the cost of installing pandas in the project — I am referring to the backend and not analytics projects here.

Let’s calculate a rolling sum of values in a list with window 2.

values = [1, 2, 0, 0, 0, 0, 1, 2, 3]
sr = pd.Series(values)
sr.rolling(2).sum()

The code is much cleaner this way.

Pivot — Data Scientist’s Essential tool

Gif downloaded from giphy

What is a pivot? A pivot table is a table of statistics that summarizes the data of a more extensive table. Sounds useful! Let’s try it.

Let’s say we have a fruit shop and a few customers.

df = pd.DataFrame(
{
"fruit": ["apple", "orange", "apple", "avocado", "orange"],
"customer": ["ben", "alice", "ben", "josh", "steve"],
"quantity": [1, 2, 3, 1, 2],
}
)

Now, let’s pivot the table with fruit and customer columns and aggregate quantity values. We expect a table where fruits are on one axis and customers on the other. Quantity values should be aggregated

df.pivot(index="fruit", columns="customer", values="quantity")

We get a “ValueError: Index contains duplicate entries, cannot reshape”. What does it mean?

Let’s try pivot with a different command (but easier to interpret), which has the same effect as pivot function.

df.set_index(["fruit", "customer"])["quantity"].unstack()

The command returns the same error as the pivot function, but it is more clear what is happening behind the scenes. It seems that fruit and customer columns when combined don’t have a unique index.

Now, let’s try with pivot_table function, which is also the recommended approach. Note, that the function also supports aggfunc, which is np.mean by default. We use np.sum in the example below.

df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=np.sum)

Few worthy mentions

Gif downloaded from giphy

Accessing elements

Accessing the first element (and the last element) in a DataFrame is as easy as:

df = pd.DataFrame({"col": [1, 2, 3, 4, 5]})df.iloc[0]df.iloc[-1] # the last element

You might ask what is the difference between loc and iloc? We use loc when accessing rows by index, where index can be string, integer or some other type.

We also use loc to set a new column value on a certain index (we cannot do that with iloc — it would return ValueError):

df.loc[0, "col"] = 1

Appending values

Pandas has an append function that enables you to append the values to the DataFrame. But the trick is that the values appended also need to be in the DataFrame.

df = df.append(pd.DataFrame({'col': [999]}))
df

Rows to columns

My favorite way to turn rows into columns is to use transpose. It also works in numpy.

df.T

Conclusion

These were the tips that I wish someone would tell me when I started using pandas. Sometimes, a friend asks me, how to turn rows into columns or how to change certain values in a DataFrame. It is as simple as described above. Hopefully, these tips save you a minute or two of unnecessary googling.

This post is originally from Towards Data Science.

APPLY NOW