How to iterate through pandas dataframe

How to Iterate over rows in Pandas Dataframe – Definitive Guide

A pandas DataFrame is a two-dimensional data structure that stores data in tabular form. It is similar to a spreadsheet or a database table.

You can iterate over the pandas dataframe using the df.itertuples() method.

This tutorial teaches you how to iterate over rows using different methods.

If You’re in a Hurry…

You can use the below code to iterate over rows in pandas dataframe.

This is one of the fastest methods available to iterate over rows in the pandas dataframe.

Example
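A minimal sketch of this approach, assuming a small sample dataframe with hypothetical values (the lang and Difficulty column names follow the ones used later in this guide):

```python
import pandas as pd

# Hypothetical sample data; only the column names come from this guide.
df = pd.DataFrame({
    "lang": ["Python", "Java", "C"],
    "Difficulty": ["Easy", "Medium", "Hard"],
})

rows = []
for row in df.itertuples():
    # Each row is a namedtuple: row.Index holds the index label,
    # and every column is available as an attribute.
    print(row.Index, row.lang, row.Difficulty)
    rows.append((row.Index, row.lang, row.Difficulty))
```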

You’ll see the below output.

Each row in the dataframe will be iterated over and printed using the print statement.

Output

If You Want to Understand Details, Read on…

In this tutorial, you’ll learn the various methods available to iterate over rows in the Pandas Dataframe.

Sample DataFrame

DataFrame Visualization

Now let’s discuss the various methods available to iterate over the rows in the pandas dataframe.

Using Itertuples() Function

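In this section, you’ll learn how to iterate over rows in a pandas dataframe using the itertuples() method.

The itertuples() method iterates over the dataframe rows and returns each row as a named tuple.

It accepts two parameters: index (default True), which controls whether the index is included as the first element of each tuple, and name (default "Pandas"), which sets the name of the returned named tuples.

Of the row-iteration methods covered in this tutorial, this is generally the fastest.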
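To illustrate the index and name parameters of itertuples(), here is a short sketch. The dataframe values and the tuple name Row are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"lang": ["Python", "Java"], "Difficulty": ["Easy", "Medium"]})

# index=False drops the index field; name sets the namedtuple's type name.
no_index = list(df.itertuples(index=False, name="Row"))
print(no_index[0])  # Row(lang='Python', Difficulty='Easy')

# name=None yields plain tuples instead of namedtuples.
plain = list(df.itertuples(name=None))
print(plain[0])
```

Plain tuples (name=None) skip namedtuple construction, so they can be a reasonable choice when attribute access is not needed.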

Example

You’ve not passed the index parameter or the name parameter. Hence, the default values of both parameters are used.

Each tuple displayed below is named Pandas and contains the index as its first element.

Output

Using Iterrows() Function

In this section, you’ll learn how to use the iterrows() method to iterate through rows in a pandas dataframe.

The iterrows() method iterates over the dataframe as (index, Series) pairs.
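A minimal sketch of iterrows(), assuming a hypothetical dataframe with lang and Difficulty columns:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"lang": ["Python", "Java"], "Difficulty": ["Easy", "Medium"]})

pairs = []
for index, row in df.iterrows():
    # row is a pandas Series keyed by column name.
    print(index, row["lang"], row["Difficulty"])
    pairs.append((index, row["lang"], row["Difficulty"]))
```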

Example

Output

This is how you can use the iterrows() method to iterate through the pandas dataframe and access the index and series of data in the dataframe.

Next, you’ll see the index attribute of the Dataframe.

Using Index Attribute

In this section, you’ll learn how to iterate over rows in the dataframe using the index attribute.

The index attribute is an immutable sequence used for indexing the rows of a dataframe.

You can use this attribute to iterate over rows by combining each index value with a column name.

A dataframe is two-dimensional, so each value is addressed by a column name and an index label.

Example

In the example, you iterate over the dataframe index, select a column by name, and then use the current index value to access that column’s value in the row.
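This can be sketched as follows, assuming a hypothetical dataframe with lang and Difficulty columns:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "lang": ["Python", "Java", "C"],
    "Difficulty": ["Easy", "Medium", "Hard"],
})

values = []
for i in df.index:
    # Select the column first, then the row label within that column.
    lang = df["lang"][i]
    difficulty = df["Difficulty"][i]
    print(i, lang, difficulty)
    values.append((lang, difficulty))
```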

Output

This is how you can iterate through the pandas dataframe using the index attribute.

Next, you’ll learn about the loc[] attribute.

Using loc[] Attribute

In this section, you’ll learn how to use the loc[] attribute to iterate over the dataframe.

The loc[] attribute is primarily label based: you access a value by specifying the row label (index) and the column label (column name).

Example

In the below example, you’ll access the dataframe using the loc attribute, passing the row index and the column names lang and Difficulty during each iteration to access the values of these columns.
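A minimal sketch of this loop, with hypothetical values for the lang and Difficulty columns:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "lang": ["Python", "Java", "C"],
    "Difficulty": ["Easy", "Medium", "Hard"],
})

seen = []
for i in df.index:
    # loc takes a row label and a column label.
    lang = df.loc[i, "lang"]
    difficulty = df.loc[i, "Difficulty"]
    print(lang, difficulty)
    seen.append((lang, difficulty))
```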

Output

This is how the loc[] attribute is used to iterate through the dataframe.

Next, you’ll see how to use the iloc[] attribute.

Using iloc[] Attribute of DataFrame

In this section, you’ll learn how to use the iloc[] attribute to iterate over the dataframe.

The iloc[] attribute is primarily integer-position based: you access a particular row or column by specifying its integer position.
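As a sketch of position-based access, the hypothetical dataframe below uses string index labels so that the difference from label-based access is visible. The data and labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data with non-integer index labels.
df = pd.DataFrame(
    {"lang": ["Python", "Java", "C"], "Difficulty": ["Easy", "Medium", "Hard"]},
    index=["a", "b", "c"],
)

by_position = []
for pos in range(len(df)):
    # iloc takes integer positions: (row position, column position).
    lang = df.iloc[pos, 0]
    difficulty = df.iloc[pos, 1]
    print(pos, lang, difficulty)
    by_position.append((lang, difficulty))
```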

Example

In the below example, you’ll access the dataframe using the iloc attribute during each row iteration.

Output

This is how you can iterate over the dataframe using the iloc[] attribute.

Next, you’ll see the iteritems() method to iterate over the dataframe.

Using Iteritems() Function

In this section, you’ll use the iteritems() method to iterate over the dataframe.

The iteritems() method iterates over the dataframe columns and returns a tuple with the column name and the column content as a Series.

iteritems() was deprecated in pandas 1.5 and removed in pandas 2.0. Use the items() method instead.

Example

Output

This is how you can use the iteritems() method.

Using Items() Function

In this section, you’ll use the items() method to iterate over the dataframe column by column.

The items() method iterates over the dataframe columns and returns a tuple with the column name and the column content as a Series.
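A minimal sketch of items(), assuming a hypothetical dataframe with lang and Difficulty columns:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"lang": ["Python", "Java"], "Difficulty": ["Easy", "Medium"]})

columns = {}
for column_name, column_data in df.items():
    # column_data is a Series holding every value of that column.
    print(column_name, column_data.tolist())
    columns[column_name] = column_data.tolist()
```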

Example

Output

pandas iterate over rows by column name

In this subsection, you’ll use items() (the replacement for iteritems()) to iterate over the dataframe and use the columnName and columnData fields to access the column data.

Example

Output

This is also known as Pandas Iterate Over Columns.

pandas iterate over rows with condition

In this subsection, you’ll use items() (the replacement for iteritems()) to iterate over the dataframe and use an if condition to check whether the current column is a specific column. If the condition is true, you access the column data. Otherwise, the column is skipped.
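The conditional loop can be sketched like this. The dataframe values and the choice of lang as the target column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"lang": ["Python", "Java"], "Difficulty": ["Easy", "Medium"]})

selected = []
for columnName, columnData in df.items():
    # Only process the column we are interested in; skip the rest.
    if columnName == "lang":
        for value in columnData:
            print(columnName, value)
            selected.append(value)
```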

Example

Output

Conclusion

To summarize, you’ve learned how to iterate over rows in a pandas dataframe using the different methods available.

Among the methods covered here, itertuples() is generally the fastest way to iterate over the rows of a pandas dataframe.

If you have any questions, feel free to comment below.

Here’s the most efficient way to iterate through your Pandas Dataframe

Achieve 280x faster data frame iteration

Pandas is one of the most popular Python libraries in the data science community, as it offers a vast API with flexible data structures for data exploration and visualization. It is widely preferred for cleaning, transforming, manipulating, and analyzing data.

The vast API makes Pandas easy to use, but when it comes to handling and processing large datasets, it does not scale computations across all CPU cores. Dask and Vaex are open-source libraries that scale the computations to speed up the workflow.

Feature engineering and feature exploration often require iterating through the data frame. There are various methods to do so, iterrows() being one of them, but iterating with iterrows() is slow.

Sometimes it’s tedious to shift from Pandas to other scalable libraries just to speed up iteration. In this article, we will discuss various data frame iteration techniques and benchmark their run times.

Iterrows():

iterrows() is a built-in Pandas method for iterating through a data frame. It is best avoided, as it is very slow compared to other iteration techniques. iterrows() makes multiple function calls per iteration and builds a Series object for each row, which makes it slower.

iterrows() takes 790 seconds to iterate through a data frame with 10 million records.

Several techniques (discussed below) perform considerably better than iterrows().

Itertuples():

itertuples() is a built-in Pandas method for iterating through a data frame. It makes comparatively fewer function calls than iterrows() and carries much less overhead. itertuples() iterates through the data frame by yielding each row as a named tuple.

itertuples() takes 16 seconds to iterate through a data frame with 10 million records, which is around 50x faster than iterrows().

Read the below-mentioned article to get an in-depth understanding of why iterrows() is slower compared to itertuples()

Iterate Through Rows of a DataFrame in Pandas

We will use the below dataframe as an example in the following sections.

index Attribute to Iterate Through Rows in Pandas DataFrame

The index attribute of a Pandas DataFrame with the default index is a RangeIndex that runs from the top row to the bottom row. We can loop over it to iterate over rows in Pandas.

It adds Income_1 and Income_2 of each row and prints total income.
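A sketch of such a loop follows. The Income_1 and Income_2 column names come from the text above; the Name column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data; only the Income_1 and Income_2 names come from the text.
df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Income_1": [100, 200],
    "Income_2": [50, 75],
})

totals = []
for i in df.index:
    # Look up both income columns for the current index label and add them.
    total = int(df["Income_1"][i] + df["Income_2"][i])
    print("Total income of", df["Name"][i], "is", total)
    totals.append(total)
```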

loc[] Method to Iterate Through Rows of DataFrame in Python

loc[] is used to access one row at a time by label. When we use loc[] inside a loop over the DataFrame, we can iterate through its rows.

Here, range(len(df)) generates a range object to loop over entire rows in the DataFrame.

iloc[] Method to Iterate Through Rows of DataFrame in Python

The iloc attribute of a Pandas DataFrame is very similar to the loc attribute. The only difference is that with loc we specify the label (name) of the row or column to be accessed, while with iloc we specify its integer position.

pandas.DataFrame.iterrows() to Iterate Over Rows Pandas

pandas.DataFrame.itertuples to Iterate Over Rows Pandas

pandas.DataFrame.itertuples returns an object to iterate over tuples for each row with the first field as an index and remaining fields as column values. Hence, we could also use this function to iterate over rows in Pandas DataFrame.

pandas.DataFrame.apply to Iterate Over Rows Pandas

pandas.DataFrame.apply returns a Series or DataFrame as the result of applying the given function along the given axis of the DataFrame.

Here, func represents the function to be applied and axis represents the axis along which the function is applied. We can use axis=1 or axis='columns' to apply the function to each row.

Here, lambda keyword is used to define an inline function that is applied to each row.
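A minimal sketch of row-wise apply with a lambda. The Income_1 and Income_2 columns follow the earlier sections; the values are hypothetical.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Income_1": [100, 200],
    "Income_2": [50, 75],
})

# axis=1 passes each row to the lambda as a Series keyed by column name.
totals = df.apply(lambda row: row["Income_1"] + row["Income_2"], axis=1)
print(totals.tolist())  # [150, 275]
```

For simple arithmetic like this, the vectorized expression df["Income_1"] + df["Income_2"] gives the same result without calling a Python function per row.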

Iterate pandas dataframe

You can loop over a pandas DataFrame with a for statement, either column by column or row by row.

The examples below use a pandas DataFrame.

This outputs this dataframe:

Loop over columns

If you put the DataFrame directly into a for loop, the column names are retrieved in order as follows:

Iterate dataframe

.iteritems()

You can use the iteritems() method to get (column name, Series) tuples, one per column, containing the column name and the column data as a pandas.Series. In pandas 2.0 and later, use items() instead.

.iterrows()

You can use the iterrows() method to get (index, Series) tuples, one per row, containing the index name (row name) and the row data as a pandas.Series.

This results in:

.itertuples()

You can use the itertuples() method to retrieve the index name (row name) and the data for that row, one row at a time. The first element of the tuple is the index name.

By default, it returns a namedtuple named Pandas. A namedtuple lets you access each element by attribute name in addition to positional indexing with [].

This outputs the following:

Retrieve column values

It’s possible to get the values of a specific column in order.

The iterrows(), itertuples() method described above can retrieve elements for all columns in each row, but can also be written as follows if you only need elements for a particular column:

When you apply a Series to a for loop, you can get its value in order. If you specify a column in the DataFrame and apply it to a for loop, you can get the value of that column in order.

It is also possible to obtain the values of multiple columns together using the built-in function zip().

If you want to get the index (row name), use the index attribute.
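The column-wise loops described above can be sketched as follows, assuming a hypothetical dataframe with name and age columns:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [24, 42]})

# Iterating a single column (a Series) yields its values in order.
ages = [age for age in df["age"]]

# zip() walks several columns together; df.index supplies the row names.
records = []
for index, name, age in zip(df.index, df["name"], df["age"]):
    print(index, name, age)
    records.append((index, name, age))
```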

What is the most efficient way to loop through dataframes with pandas?

I want to perform my own complex operations on financial data in dataframes in a sequential manner.

For example I am using the following MSFT CSV file taken from Yahoo Finance:

I then do the following:

Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that one also retrieves the index (possibly through a generator to be memory efficient)? df.iteritems unfortunately only iterates column by column.

The newest versions of pandas now include a built-in function for iterating over rows.

Or, if you want it faster, use itertuples()

But, unutbu’s suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.

Pandas is based on NumPy arrays. The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.

For example, if close is a 1-d array, and you want the day-over-day percent change,

This computes the entire array of percent changes as one statement, instead of

So try to avoid the Python loop for i, row in enumerate(...) entirely, and think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.
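As a sketch of the contrast described above, here is the day-over-day percent change computed once on the whole array and once item by item. The close values are hypothetical.

```python
import numpy as np

# Hypothetical closing prices.
close = np.array([100.0, 102.0, 101.0, 105.0])

# Whole-array version: one statement, no Python-level loop.
pct_change = (close[1:] - close[:-1]) / close[:-1]

# Item-by-item version, shown only for comparison.
pct_change_loop = []
for i in range(1, len(close)):
    pct_change_loop.append((close[i] - close[i - 1]) / close[i - 1])

print(pct_change)
```

Both versions produce the same numbers; the whole-array form pushes the loop into compiled NumPy code.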

As has been mentioned before, pandas objects are most efficient when you process the whole array at once. However, for those who really need to loop through a pandas DataFrame to perform something, like me, I found at least three ways to do it. I have done a short test to see which of the three is the least time-consuming.

This is probably not the best way to measure the time consumption but it’s quick for me.

Here are some pros and cons IMHO:

EDIT 2020/11/10

For what it is worth, here is an updated benchmark with some other alternatives (performance measured on a MacBook Pro, 2.4 GHz Intel Core i9, 8 cores, 32 GB 2667 MHz DDR4)

You can loop through the rows by transposing and then calling iteritems:
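A sketch of the transpose approach, assuming a hypothetical dataframe. It uses items(), since iteritems() is no longer available in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"lang": ["Python", "Java"], "Difficulty": ["Easy", "Medium"]})

rows = {}
# After transposing, each original row becomes a column,
# so column-wise iteration walks the original rows.
for index, row in df.T.items():
    print(index, row.tolist())
    rows[index] = row.tolist()
```

Transposing copies the data and can upcast mixed column types to object, so this is mainly of interest for small frames.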

I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython, so you could do something like:

I would recommend writing the algorithm in pure Python first, make sure it works and see how fast it is— if it’s not fast enough, convert things to Cython like this with minimal work to get something that’s about as fast as hand-coded C/C++.

You have three options:

With iterrows (most used):

Three options display something like:

I checked out iterrows after noticing Nick Crawford’s answer, but found that it yields (index, Series) tuples. Not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1, ...) tuples.

Just as a small addition, you can also do an apply if you have a complex function that you apply to a single column:
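A minimal sketch of apply on a single column. The Close column, its values, and the classify helper are all hypothetical.

```python
import pandas as pd

# Hypothetical price data.
df = pd.DataFrame({"Close": [95.0, 120.0, 101.5]})

def classify(price):
    # Stand-in for a more complex per-value function.
    return "high" if price > 100 else "low"

# Series.apply calls the function once per value in the column.
df["label"] = df["Close"].apply(classify)
print(df)
```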

Not pythonic? Sure. But fast.

If you want to squeeze more juice out of the loop you will want to look into cython. Cython will let you gain huge speedups (think 10x-100x). For maximum performance check memory views for cython.

Another suggestion would be to combine groupby with vectorized calculations if subsets of the rows shared characteristics which allowed you to do so.
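One way this might look is sketched below, assuming hypothetical group and value columns. Each value is compared against the mean of its own group without a Python-level loop over rows.

```python
import pandas as pd

# Hypothetical data: rows sharing a group label are processed together.
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "value": [1.0, 3.0, 10.0],
})

# transform("mean") returns the group mean aligned to the original rows.
group_mean = df.groupby("group")["value"].transform("mean")
df["demeaned"] = df["value"] - group_mean
print(df)
```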

look at last one

I believe the most simple and efficient way to loop through DataFrames is using numpy and numba. In that case, looping can be approximately as fast as vectorized operations in many cases. If numba is not an option, plain numpy is likely to be the next best option. As has been noted many times, your default should be vectorization, but this answer merely considers efficient looping, given the decision to loop, for whatever reason.

For a test case, let’s use the example from @DSM’s answer of calculating a percentage change. This is a very simple situation and as a practical matter you would not write a loop to calculate it, but as such it provides a reasonable baseline for timing vectorized approaches vs loops.

Let’s set up the 4 approaches with a small DataFrame, and we’ll time them on a larger dataset below.

And here are the timings on a DataFrame with 100,000 rows (timings performed with Jupyter’s %timeit function, collapsed to a summary table for readability):

Summary: for simple cases, like this one, you would go with (vectorized) pandas for simplicity and readability, and (vectorized) numpy for speed. If you really need to use a loop, do it in numpy. If numba is available, combine it with numpy for additional speed. In this case, numpy + numba is almost as fast as vectorized numpy code.
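A sketch of three of these approaches on the percent-change example follows. The close column and its values are hypothetical, and the numba variant is left out so that the example depends only on pandas and NumPy; assuming numba is installed, the loop function could be decorated with numba.njit.

```python
import numpy as np
import pandas as pd

# Hypothetical price data.
df = pd.DataFrame({"close": [100.0, 102.0, 101.0, 105.0]})

# 1. Vectorized pandas.
pct_pandas = df["close"].pct_change()

# 2. Vectorized NumPy on the underlying array.
close = df["close"].to_numpy()
pct_numpy = np.full(len(close), np.nan)
pct_numpy[1:] = close[1:] / close[:-1] - 1

# 3. Explicit loop over the NumPy array (not over DataFrame rows).
def pct_change_loop(values):
    out = np.full(len(values), np.nan)
    for i in range(1, len(values)):
        out[i] = values[i] / values[i - 1] - 1
    return out

pct_loop = pct_change_loop(close)
```

All three produce the same values, with NaN in the first position where no previous day exists.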
