How to create dataframe in pandas
How to Create Pandas DataFrame in Python
In this short guide, you’ll see two different methods to create a Pandas DataFrame: typing the values in Python, and importing the values from a CSV file.
Method 1: typing values in Python to create Pandas DataFrame
To create Pandas DataFrame in Python, you can follow this generic template:
Note that you don’t need to use quotes around numeric values (unless you wish to capture those values as strings).
Now let’s see how to apply the above template using a simple example.
To start, let’s say that you have the following data about products, and that you want to capture that data in Python using Pandas DataFrame:
product_name | price |
laptop | 1200 |
printer | 150 |
tablet | 300 |
desk | 450 |
chair | 200 |
You may then use the code below in order to create the DataFrame for our example:
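A minimal sketch of that step, assuming the column-oriented dictionary form of the template (one key per column header, one list of values per column):

```python
import pandas as pd

# Each key becomes a column header; each list holds that column's values.
data = {
    "product_name": ["laptop", "printer", "tablet", "desk", "chair"],
    "price": [1200, 150, 300, 450, 200],  # numeric values need no quotes
}

df = pd.DataFrame(data)
print(df)
```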
Run the code in Python, and you’ll get the following DataFrame:
You may have noticed that each row is represented by a number (also known as the index) starting from 0. Alternatively, you may assign another value/name to represent each row.
For example, in the code below, index=['product_1', 'product_2', 'product_3', 'product_4', 'product_5'] was added:
You’ll now see the newly assigned index in place of the default numbers:
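As a sketch, the same products data with custom row labels could be built like this (the labels are the ones named above):

```python
import pandas as pd

data = {
    "product_name": ["laptop", "printer", "tablet", "desk", "chair"],
    "price": [1200, 150, 300, 450, 200],
}

# The index list must have one label per row.
df = pd.DataFrame(
    data,
    index=["product_1", "product_2", "product_3", "product_4", "product_5"],
)
print(df)
```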
Let’s now review the second method of importing the values into Python to create the DataFrame.
Method 2: importing values from a CSV file to create Pandas DataFrame
You may use the following template to import a CSV file into Python in order to create your DataFrame:
Let’s say that you have the following data stored in a CSV file (where the CSV file name is ‘products’):
product_name | price |
laptop | 1200 |
printer | 150 |
tablet | 300 |
desk | 450 |
chair | 200 |
In the Python code below, you’ll need to change the path name to reflect the location where the CSV file is stored on your computer.
For example, let’s suppose that the CSV file is stored under the following path:
‘C:\Users\Ron\Desktop\products.csv’
Here is the full Python code for our example:
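The path above only exists on the author's machine, so the sketch below first writes a stand-in products.csv to a temporary folder and then reads it back with pd.read_csv(). The temporary folder is an assumption made only to keep the example runnable; on your computer you would pass your own path instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

csv_text = (
    "product_name,price\n"
    "laptop,1200\n"
    "printer,150\n"
    "tablet,300\n"
    "desk,450\n"
    "chair,200\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for the real file location (hypothetical temporary path).
    csv_path = Path(tmp_dir) / "products.csv"
    csv_path.write_text(csv_text)

    # With the real file you would write something like:
    # df = pd.read_csv(r"C:\Users\Ron\Desktop\products.csv")
    df = pd.read_csv(csv_path)

print(df)
```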
As before, you’ll get the same Pandas DataFrame in Python:
You can also create the same DataFrame by importing an Excel file into Python using Pandas.
Find the maximum value in the DataFrame
Once you have your values in the DataFrame, you can perform a large variety of operations. For example, you may calculate stats using Pandas.
For instance, let’s say that you want to find the maximum price among all the products within the DataFrame.
Obviously, you can derive this value just by looking at the dataset, but the method presented below would work for much larger datasets.
To get the maximum price for our example, you’ll need to add the following portion to the Python code (and then print the results):
Here is the complete Python code:
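A short sketch of that step, rebuilding the products DataFrame from the values above and calling max() on the price column:

```python
import pandas as pd

data = {
    "product_name": ["laptop", "printer", "tablet", "desk", "chair"],
    "price": [1200, 150, 300, 450, 200],
}
df = pd.DataFrame(data)

# Select the column, then take its maximum.
max_price = df["price"].max()
print(max_price)  # 1200
```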
Once you run the code, you’ll get the value of 1200, which is indeed the maximum price:
You may check the Pandas Documentation to learn more about creating a DataFrame.
How to Create Pandas DataFrames: A Hands-On Guide
The first and most important steps in any data science project are creating and loading data into Pandas DataFrames. Pandas is an essential part of the Python data science ecosystem. Data scientists merge, manipulate, and analyze tabular data with Pandas DataFrames. Finally, they prepare data for machine learning.
Real-world data typically comes unclean, unorganized, and unstructured. The main challenges at the beginning of any project are to:
In this article, I am going to explain how to load data from the most common data sources into Pandas DataFrames. You will learn how to combine steps 3 and 4: cleaning and shaping up the data as much as possible when loading it into the row and column format of a DataFrame. This saves a lot of time and extra workflows. Put simply, the smoother the data import, the more efficient the entire project.
Data sources – an overview
We can categorize data sources into four groups: Python (and its basic data structures), local files, the internet, and databases. Some argue that there is another source: data stored in other Pandas DataFrames and Pandas Series.
Understanding Pandas DataFrames
Let’s have a look at the following small DataFrame df that contains information on five football players:
A DataFrame is a 2-dimensional labeled data structure. In our example, df has five rows and five columns. Each row is a football player (for example, “Lionel Messi”). Each column contains information on the players (for example, height in meters). What looks like a column “Name” on the left side isn’t a column. It’s the index of the DataFrame. The index labels the rows. In our example, the rows are labeled by the players’ names. If not specified, DataFrames have a RangeIndex with ascending integers (0, 1, 2, …). At the top, we can find the column headers (for example, “Country”). It’s best practice to have unique row labels and unique column headers. This allows you to identify rows and columns clearly.
Follow three rules when creating DataFrames:
You can check the data types with the info() method.
The dtype object either indicates string/text data or mixed data types. In our example, we have homogeneous string/text data in the columns “Country” and “Club_2019”. Therefore, df meets all three conditions.
How to create DataFrames with basic data structures in Python
As a first step, import the Pandas Library with import pandas as pd whenever you work with Pandas.
In case you already have the data in basic Python structures, you can create a Pandas DataFrame object with pd.DataFrame(). The next steps you follow depend on how the data is organized. There are two major scenarios: 1. you already have the columns in lists/arrays. 2. you already have the rows in lists/arrays.
Let’s assume you have the columns (excl. headers) already stored in the lists: country, club, wc, height, goals. In addition, you have the row labels in the list: names.
In this scenario, it’s best to create a dictionary with all columns. Each key-value pair in the dictionary consists of an appropriate column header (for example, Club_2019) as key and the respective list as value (club). Let’s create the dictionary data:
We are ready to create the DataFrame object df with pd.DataFrame(). Pass the dictionary data to the parameter data and define that names should be the index of the DataFrame with index = names.
Finally, you can assign a name for the index with df.index.name =. In our example, we assign “Name”.
Let’s have a final look at df:
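The scenario can be sketched as follows. The article's original figures are not reproduced here: the values below are illustrative placeholders, and the headers "World_Champion" and "Height" are assumed names (only Country, Club_2019, and Goals_2019 appear in the text).

```python
import pandas as pd

# Placeholder data for illustration only.
names = ["Lionel Messi", "Cristiano Ronaldo", "Neymar", "Kylian Mbappe", "Manuel Neuer"]
country = ["Argentina", "Portugal", "Brazil", "France", "Germany"]
club = ["FC Barcelona", "Juventus FC", "Paris SG", "Paris SG", "FC Bayern"]
wc = [0, 0, 0, 1, 1]
height = [1.70, 1.87, 1.75, 1.78, 1.93]
goals = [10, 20, 30, 40, 0]  # made-up numbers

# One key-value pair per column: header -> list of values.
data = {
    "Country": country,
    "Club_2019": club,
    "World_Champion": wc,
    "Height": height,
    "Goals_2019": goals,
}

df = pd.DataFrame(data=data, index=names)
df.index.name = "Name"

print(df)
df.info()  # shows the data type of each column
```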
Let’s assume you have the selected rows (incl. row labels) already stored in the lists: messi, ronaldo, neymar, mbappe, neuer.
In addition, you have the desired column headers in the list “headers.”
In this scenario, it’s best to create a list of lists. Let’s put all rows into the nested list “data”:
We are ready to create the DataFrame object df with pd.DataFrame(). Pass the nested list “data” to the parameter data and define that “headers” should be the column headers of the DataFrame with columns = headers.
In this scenario, we end up with six columns (incl. the names) and a RangeIndex. You can set the index to the column Name with the method set_index(). To change the DataFrame object df itself, set inplace = True. Otherwise, set_index() returns a new DataFrame and df keeps its RangeIndex.
We are finally there:
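A sketch of scenario 2 with the same illustrative placeholder values (again, "World_Champion" and "Height" are assumed header names, and the numbers are not the article's):

```python
import pandas as pd

# One list per row, including the row label as first element (placeholder values).
messi = ["Lionel Messi", "Argentina", "FC Barcelona", 0, 1.70, 10]
ronaldo = ["Cristiano Ronaldo", "Portugal", "Juventus FC", 0, 1.87, 20]
neymar = ["Neymar", "Brazil", "Paris SG", 0, 1.75, 30]
mbappe = ["Kylian Mbappe", "France", "Paris SG", 1, 1.78, 40]
neuer = ["Manuel Neuer", "Germany", "FC Bayern", 1, 1.93, 0]

headers = ["Name", "Country", "Club_2019", "World_Champion", "Height", "Goals_2019"]

# Nested list: each inner list becomes one row.
data = [messi, ronaldo, neymar, mbappe, neuer]

df = pd.DataFrame(data=data, columns=headers)
df.set_index("Name", inplace=True)  # move the Name column into the index
print(df)
```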
There is one scenario left: What if we start with a dictionary data that has the wrong data organization: Each key-value pair is a row/observation?
If you pass the dictionary data to pd.DataFrame(), you’ll end up with a DataFrame where observations are in columns and features are in rows. This isn’t a DataFrame you can work with! You can fix that problem with a few Pandas commands. But there is a better way that allows you to avoid that issue completely. It’s best to reorganize the dictionary data and create a nested list.
We are back in scenario 2:
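One way to do that reorganization, sketched with a reduced set of placeholder columns: turn each key-value pair of the row-oriented dictionary into one inner list and then proceed as in scenario 2.

```python
import pandas as pd

# Row-oriented dictionary: each key is a player, each value is that player's row.
data = {
    "Lionel Messi": ["Argentina", "FC Barcelona"],
    "Cristiano Ronaldo": ["Portugal", "Juventus FC"],
    "Manuel Neuer": ["Germany", "FC Bayern"],
}
headers = ["Name", "Country", "Club_2019"]

# Reorganize into a nested list: [row label, value, value, ...] per player.
rows = [[name, *values] for name, values in data.items()]

df = pd.DataFrame(data=rows, columns=headers)
df.set_index("Name", inplace=True)
print(df)
```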
How to load datasets from local files into Pandas DataFrames
You can load datasets from local files on your computer into Pandas with the pd.read_xxx() family:
pd.read_csv() and pd.read_excel() are very similar and share most of the options and parameters.
5 Things you should know when loading data from CSV and Excel files
The first and most important thing you need to know when loading data from local files: the location of the file. Pass the full file path/name as a string to the parameter filepath_or_buffer. The following is a template to create the DataFrame object df from CSV and Excel files:
Note that you can omit ‘filepath_or_buffer =’.
Let’s assume that the CSV file players.csv is located on my desktop. When opening the file, we can see the following structure:
A CSV file is a delimited text file that uses a comma to separate values. You can still see the tabular data structure. Each line of the file is a data record (a football player). Each record comprises one or more values, separated by commas.
The full file name on Windows could be C:\Users\alex\desktop\players.csv
The full file name on macOS and Linux could be: /Users/alex/desktop/players.csv
Please note that Windows uses the backslash (“\”) instead of the slash (“/”). Since backslash is a special character in Python, using the following code will raise an error:
There are two ways to fix this issue:
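The list of fixes did not survive in this copy of the article. Commonly used options are a raw string, forward slashes, or doubled backslashes; the sketch below shows that they all describe the same Windows path.

```python
from pathlib import PureWindowsPath

# A plain string like "C:\Users\..." fails because \U starts a unicode escape.

# Raw string: backslashes are taken literally.
path_raw = r"C:\Users\alex\desktop\players.csv"

# Forward slashes: also accepted on Windows.
path_slash = "C:/Users/alex/desktop/players.csv"

# Escaped (doubled) backslashes.
path_escaped = "C:\\Users\\alex\\desktop\\players.csv"

print(path_raw == path_escaped)
print(PureWindowsPath(path_slash) == PureWindowsPath(path_raw))
```

Any of these strings can then be passed to pd.read_csv().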
On macOS and Linux the single best solution is:
In case the file players.csv is in your current working directory (CWD), it is sufficient to pass the filename players.csv without the full path. Be aware that the CWD can vary and depends on your system and your Python Installation.
Loading the players dataset from the Excel file players.xlsx works accordingly.
You can select a column as the index of the DataFrame. The column of your choice should contain unique values only (no duplicates). In our example, setting the Name column as the index is reasonable and can be done with index_col = “Name”.
Instead of passing the column header, you can also pass the column index position. In our example, Name is at column index position 0.
If you do not specify an index, Pandas creates a RangeIndex.
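The three variants can be compared side by side. To keep the sketch self-contained, it reads from an in-memory string instead of players.csv; the rows and the reduced column set are illustrative placeholders.

```python
from io import StringIO

import pandas as pd

# Stand-in for the contents of players.csv (placeholder values).
csv_text = """Name,Country,Club_2019,Goals_2019
Lionel Messi,Argentina,FC Barcelona,10
Cristiano Ronaldo,Portugal,Juventus FC,20
Manuel Neuer,Germany,FC Bayern,0
"""

# Index selected by column header.
df_by_name = pd.read_csv(StringIO(csv_text), index_col="Name")

# Index selected by column position.
df_by_pos = pd.read_csv(StringIO(csv_text), index_col=0)

# No index specified: pandas creates a RangeIndex.
df_default = pd.read_csv(StringIO(csv_text))

print(df_by_name)
print(df_default.index)
```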
(Loading the players dataset from the Excel file players.xlsx works the same way.)
There is no need to load all columns into Pandas. You can select specific columns by passing a list with the column headers to the parameter usecols. As an example, you can load the columns Name, Country, and Goals_2019 with usecols = [“Name”, “Country”, “Goals_2019”]
This creates the DataFrame df with a RangeIndex. Of course, you can combine usecols and index_col:
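A sketch of that combination, again reading placeholder rows from an in-memory string rather than from players.csv:

```python
from io import StringIO

import pandas as pd

csv_text = """Name,Country,Club_2019,Goals_2019
Lionel Messi,Argentina,FC Barcelona,10
Cristiano Ronaldo,Portugal,Juventus FC,20
Manuel Neuer,Germany,FC Bayern,0
"""

# Load three of the four columns and use Name as the index.
# The index column must be part of usecols.
df = pd.read_csv(
    StringIO(csv_text),
    usecols=["Name", "Country", "Goals_2019"],
    index_col="Name",
)
print(df)
```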
Loading the players dataset from the Excel file players.xlsx works accordingly. But there is one more option. Instead of passing a list with column headers, you can also specify Excel columns (A, B, C,…) in a string: usecols = “A, B, D”.
This loads the Excel columns A, B and D into Pandas.
Sometimes, there are no column headers in the external file. Let’s consider the CSV file players.csv without column headers. This dataset starts with the first observation (Lionel Messi):
With header = None you specify that there are no column headers in the file. header = None is typically used in combination with the parameter names. You can pass a list of appropriate column headers to names:
In case the file contains column headers that are not appropriate, you can replace those headers with names. Don’t use header = None here; pass header = 0 instead, so that the existing header row is replaced rather than loaded as data.
Some datasets have columns that contain date and time information (‘datetime’). The following CSV file stocks.csv contains daily stock prices for Microsoft (MSFT) and Apple (AAPL):
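Both cases can be sketched with in-memory stand-ins for the file (placeholder rows; the short headers "n,c,g" are a made-up example of unsuitable headers):

```python
from io import StringIO

import pandas as pd

new_headers = ["Name", "Country", "Goals_2019"]

# Case 1: the file has no header row at all.
csv_no_header = "Lionel Messi,Argentina,10\nManuel Neuer,Germany,0\n"
df_no_header = pd.read_csv(StringIO(csv_no_header), header=None, names=new_headers)

# Case 2: the file has a header row that should be replaced.
csv_bad_header = "n,c,g\n" + csv_no_header
df_replaced = pd.read_csv(StringIO(csv_bad_header), header=0, names=new_headers)

print(df_no_header)
print(df_replaced)
```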
If not specified, Pandas loads datetime information as string/object data type. Most times, it is desirable to convert the data type of those columns into datetime64 by passing the column header(s) in a list to the parameter parse_dates. This is often used in combination with index_col to create a DatetimeIndex. Managing and analyzing financial data with Pandas is easy with a DatetimeIndex.
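A sketch of that combination. The column headers Date, MSFT, and AAPL and the prices are assumptions for illustration, since the content of stocks.csv is not shown here.

```python
from io import StringIO

import pandas as pd

# Stand-in for stocks.csv (made-up prices).
csv_text = """Date,MSFT,AAPL
2020-01-02,100.0,200.0
2020-01-03,101.0,201.0
2020-01-06,102.0,202.0
"""

# Parse the Date column as datetime and use it as the index.
df = pd.read_csv(StringIO(csv_text), parse_dates=["Date"], index_col="Date")

print(df.index)
print(df.loc["2020-01-03"])
```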
There are more options to customize the data import with pd.read_csv() and pd.read_excel(). Learn more on how to import data from messy and unclean CSV and Excel files.
How to load data from JSON files
The following is a template to create the DataFrame object df from the JSON file players.json with pd.read_json():
There are very few options available when loading data from JSON files. JSON files are used to store and transfer complex and nested datasets. Sometimes, you must use the parameter orient or flatten the data with pd.json_normalize().
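As a sketch, flat JSON records load directly with pd.read_json(), while nested records can be flattened with pd.json_normalize(). The records and the nested "Club" field below are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Flat records: one JSON object per row.
flat_json = (
    '[{"Name": "Lionel Messi", "Country": "Argentina"},'
    ' {"Name": "Manuel Neuer", "Country": "Germany"}]'
)
df_flat = pd.read_json(StringIO(flat_json))

# Nested records: the "Club" value is itself an object.
nested = [
    {"Name": "Lionel Messi", "Club": {"Name": "FC Barcelona", "Country": "Spain"}},
    {"Name": "Manuel Neuer", "Club": {"Name": "FC Bayern", "Country": "Germany"}},
]
# json_normalize flattens nested keys into dotted column names.
df_nested = pd.json_normalize(nested)

print(df_flat)
print(df_nested.columns.tolist())
```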
How to load datasets from the internet into Pandas DataFrames
Platforms like Twitter, Yahoo Finance, or The Movie Database allow users to retrieve data via their web APIs. The API documentation contains detailed instructions about using the web API. Users need to send HTTP requests (as defined in the API documentation) to the web server and receive the data in CSV or JSON file format. The requests library is the standard for making HTTP requests in Python. Finally, the data can be loaded into Pandas.
In simple cases, you can directly load CSV files from the web into Pandas with pd.read_csv() by passing the URL as a string to filepath_or_buffer =.
With pd.read_html() you can read all tables from a website by passing the URL to io =.
Note that pd.read_html() returns a list of DataFrames.
How to load datasets from SQL Databases into Pandas DataFrames
You can read tables from SQL Databases like SQLite, MySQL, PostgreSQL, and more with pd.read_sql().
The sql parameter requires a SQL query in SQL language. Before you can pull data from the database, you must create a connection to the database and pass the connection object to the parameter con.
Depending on the database system you choose, you also have to import (and possibly install) Python libraries like sqlite3 or SQLAlchemy to create the connection. Note that sqlite3 ships with Python’s standard library.
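A minimal sketch with SQLite, using an in-memory database so that it runs without any file. The table name players and its rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# Create a connection (here: a throwaway in-memory database).
con = sqlite3.connect(":memory:")

# Set up a small example table.
con.execute("CREATE TABLE players (Name TEXT, Country TEXT)")
con.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Lionel Messi", "Argentina"), ("Manuel Neuer", "Germany")],
)
con.commit()

# Pass the SQL query and the connection object to pd.read_sql().
df = pd.read_sql("SELECT * FROM players", con=con)
con.close()

print(df)
```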
How to create new Pandas DataFrames from other DataFrames and Pandas Series
Many workflows create new DataFrames from existing DataFrames and Series: filtering, aggregation, manipulation, merging, joining, concatenating, and more. Let me show two examples:
Let’s filter the players DataFrame df and create a new DataFrame tall with only those players that are taller than 1.75 meters:
To avoid any problems when working with tall and df, chain the copy() method. This creates and saves a new DataFrame object in memory that is independent of the original DataFrame.
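A sketch of the filter with a chained copy(). The column header "Height" and the values are illustrative assumptions.

```python
import pandas as pd

names = ["Lionel Messi", "Cristiano Ronaldo", "Neymar", "Kylian Mbappe", "Manuel Neuer"]
df = pd.DataFrame({"Height": [1.70, 1.87, 1.75, 1.78, 1.93]}, index=names)

# Boolean filter, then copy() to get an independent DataFrame.
tall = df.loc[df["Height"] > 1.75].copy()

# Changing tall does not touch df.
tall["Height"] = tall["Height"] * 100  # convert to centimeters in tall only

print(tall)
print(df)
```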
When selecting one column of a DataFrame (for example, “Goals_2019”), Pandas creates a Pandas Series. Let’s create the Series “goals”:
A Pandas Series is a one-dimensional labeled array. DataFrame objects and Series objects behave similarly and share many methods. But they are not identical. Sometimes, it’s beneficial to convert a Series into a DataFrame with one column by using the method to_frame().
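Both steps can be sketched as follows (the goal counts are made-up placeholder values):

```python
import pandas as pd

names = ["Lionel Messi", "Cristiano Ronaldo", "Manuel Neuer"]
df = pd.DataFrame(
    {"Country": ["Argentina", "Portugal", "Germany"], "Goals_2019": [10, 20, 0]},
    index=names,
)

# Selecting a single column returns a Series.
goals = df["Goals_2019"]

# to_frame() turns the Series into a DataFrame with one column.
goals_df = goals.to_frame()

print(type(goals))
print(type(goals_df))
```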
Conclusion
You can create Pandas DataFrames in many ways. The first and most important question you have to answer is: Where is the data coming from? Once you know the data source, you can select the appropriate tool to load the data into Pandas. The following table gives an overview:
pandas.DataFrame
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
Parameters
data ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index.
Changed in version 0.25.0: If data is a list of dicts, column order follows insertion-order.
index Index or array-like
Index to use for resulting frame. Will default to RangeIndex if no indexing information is part of the input data and no index is provided.
columns Index or array-like
Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.
dtype dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.
copy bool or None, default None
Changed in version 1.3.0.
Constructor from tuples, also record arrays.
From dicts of Series, arrays, or dicts.
Read a comma-separated values (csv) file into DataFrame.
Read general delimited file into DataFrame.
Read text from clipboard into DataFrame.
Constructing DataFrame from a dictionary.
Notice that the inferred dtype is int64.
To enforce a single dtype:
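The code for these two examples is missing from this copy. A sketch along the same lines (the dictionary d and its values are illustrative):

```python
import numpy as np
import pandas as pd

d = {"col1": [1, 2], "col2": [3, 4]}

# dtype is inferred from the data.
df = pd.DataFrame(data=d)
print(df.dtypes)

# Force a single dtype for all columns.
df_int8 = pd.DataFrame(data=d, dtype=np.int8)
print(df_int8.dtypes)
```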
Constructing DataFrame from a dictionary including Series:
Constructing DataFrame from numpy ndarray:
Constructing DataFrame from a numpy ndarray that has labeled columns:
Constructing DataFrame from dataclass:
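The examples for these four cases are also missing here. The following sketch covers them with small illustrative inputs; the variable names are assumptions.

```python
from dataclasses import make_dataclass

import numpy as np
import pandas as pd

# Dictionary including a Series: the Series is aligned on its own index,
# so rows without a matching label get NaN.
d = {"col1": [0, 1, 2, 3], "col2": pd.Series([2, 3], index=[2, 3])}
df_series = pd.DataFrame(data=d, index=[0, 1, 2, 3])

# Plain NumPy ndarray: column labels are supplied via columns.
df_array = pd.DataFrame(
    np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=["a", "b", "c"]
)

# Structured ndarray with labeled fields: columns selects (and orders) fields.
structured = np.array(
    [(1, 2, 3), (4, 5, 6), (7, 8, 9)],
    dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")],
)
df_structured = pd.DataFrame(structured, columns=["c", "a"])

# List of dataclass instances: field names become column labels.
Point = make_dataclass("Point", [("x", int), ("y", int)])
df_points = pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])

print(df_series)
print(df_structured)
print(df_points)
```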
Access a single value for a row/column label pair.
Dictionary of global attributes of this dataset.
Return a list representing the axes of the DataFrame.
The column labels of the DataFrame.
Return the dtypes in the DataFrame.
Indicator whether Series/DataFrame is empty.
Get the properties associated with this pandas object.
Access a single value for a row/column pair by integer position.
Purely integer-location based indexing for selection by position.
The index (row labels) of the DataFrame.
Access a group of rows and columns by label(s) or a boolean array.
Return an int representing the number of axes / array dimensions.
Return a tuple representing the dimensionality of the DataFrame.
Return an int representing the number of elements in this object.
Returns a Styler object.
Return a Numpy representation of the DataFrame.
T
Return a Series/DataFrame with absolute numeric value of each element.
add (other[, axis, level, fill_value])
Get Addition of dataframe and other, element-wise (binary operator add ).
Aggregate using one or more operations over the specified axis.
Aggregate using one or more operations over the specified axis.
Align two objects on their axes with the specified join method.
all ([axis, bool_only, skipna, level])
Return whether all elements are True, potentially over an axis.
any ([axis, bool_only, skipna, level])
Return whether any element is True, potentially over an axis.
(DEPRECATED) Append rows of other to the end of caller, returning a new object.
apply (func[, axis, raw, result_type, args])
Apply a function along an axis of the DataFrame.
Apply a function to a Dataframe elementwise.
Convert time series to specified frequency.
Assign new columns to a DataFrame.
astype (dtype[, copy, errors])
Select values at particular time of day (e.g., 9:30AM).
backfill ([axis, inplace, limit, downcast])
Select values between particular times of the day (e.g., 9:00-9:30 AM).
bfill ([axis, inplace, limit, downcast])
Return the bool of a single element Series or DataFrame.
Make a box plot from DataFrame columns.
clip ([lower, upper, axis, inplace])
Trim values at input threshold(s).
combine (other, func[, fill_value, overwrite])
Perform column-wise combine with another DataFrame.
Compare to another DataFrame and show the differences.
Make a copy of this object’s indices and data.
Compute pairwise correlation of columns, excluding NA/null values.
corrwith (other[, axis, drop, method])
Compute pairwise correlation.
count ([axis, level, numeric_only])
Count non-NA cells for each column or row.
Compute pairwise covariance of columns, excluding NA/null values.
Return cumulative maximum over a DataFrame or Series axis.
Return cumulative minimum over a DataFrame or Series axis.
Return cumulative product over a DataFrame or Series axis.
Return cumulative sum over a DataFrame or Series axis.
Generate descriptive statistics.
First discrete difference of element.
div (other[, axis, level, fill_value])
Get Floating division of dataframe and other, element-wise (binary operator truediv ).
divide (other[, axis, level, fill_value])
Get Floating division of dataframe and other, element-wise (binary operator truediv ).
Compute the matrix multiplication between the DataFrame and other.
Drop specified labels from rows or columns.
Return DataFrame with duplicate rows removed.
Return Series/DataFrame with requested index / column level(s) removed.
dropna ([axis, how, thresh, subset, inplace])
Remove missing values.
Return boolean Series denoting duplicate rows.
eq (other[, axis, level])
Get Equal to of dataframe and other, element-wise (binary operator eq ).
Test whether two objects contain the same elements.
Evaluate a string describing operations on DataFrame columns.
Provide exponentially weighted (EW) calculations.
expanding ([min_periods, center, axis, method])
Provide expanding window calculations.
Transform each element of a list-like to a row, replicating index values.
ffill ([axis, inplace, limit, downcast])
Fill NA/NaN values using the specified method.
filter ([items, like, regex, axis])
Subset the dataframe rows or columns according to the specified index labels.
Select initial periods of time series data based on a date offset.
Return index for first non-NA value or None, if no non-NA value is found.
floordiv (other[, axis, level, fill_value])
Get Integer division of dataframe and other, element-wise (binary operator floordiv ).
from_dict (data[, orient, dtype, columns])
Construct DataFrame from dict of array-like or dicts.
Convert structured or record ndarray to DataFrame.
ge (other[, axis, level])
Get Greater than or equal to of dataframe and other, element-wise (binary operator ge ).
Get item from object for given key (ex: DataFrame column).
Group DataFrame using a mapper or by a Series of columns.
gt (other[, axis, level])
Get Greater than of dataframe and other, element-wise (binary operator gt ).
Return the first n rows.
Make a histogram of the DataFrame’s columns.
Return index of first occurrence of maximum over requested axis.
Return index of first occurrence of minimum over requested axis.
Attempt to infer better dtypes for object columns.
Print a concise summary of a DataFrame.
insert (loc, column, value[, allow_duplicates])
Insert column into DataFrame at specified location.
Fill NaN values using an interpolation method.
Whether each element in the DataFrame is contained in values.
Detect missing values.
DataFrame.isnull is an alias for DataFrame.isna.
Iterate over (column name, Series) pairs.
Iterate over (column name, Series) pairs.
Iterate over DataFrame rows as (index, Series) pairs.
Iterate over DataFrame rows as namedtuples.
join (other[, on, how, lsuffix, rsuffix, sort])
Join columns of another DataFrame.
Get the ‘info axis’ (see Indexing for more).
kurt ([axis, skipna, level, numeric_only])
Return unbiased kurtosis over requested axis.
kurtosis ([axis, skipna, level, numeric_only])
Return unbiased kurtosis over requested axis.
Select final periods of time series data based on a date offset.
Return index for last non-NA value or None, if no non-NA value is found.
le (other[, axis, level])
Get Less than or equal to of dataframe and other, element-wise (binary operator le ).
(DEPRECATED) Label-based «fancy indexing» function for DataFrame.
lt (other[, axis, level])
Get Less than of dataframe and other, element-wise (binary operator lt ).
mad ([axis, skipna, level])
Return the mean absolute deviation of the values over the requested axis.
Replace values where the condition is True.
max ([axis, skipna, level, numeric_only])
Return the maximum of the values over the requested axis.
mean ([axis, skipna, level, numeric_only])
Return the mean of the values over the requested axis.
median ([axis, skipna, level, numeric_only])
Return the median of the values over the requested axis.
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
Return the memory usage of each column in bytes.
Merge DataFrame or named Series objects with a database-style join.
min ([axis, skipna, level, numeric_only])
Return the minimum of the values over the requested axis.
mod (other[, axis, level, fill_value])
Get Modulo of dataframe and other, element-wise (binary operator mod ).
mode ([axis, numeric_only, dropna])
Get the mode(s) of each element along the selected axis.
mul (other[, axis, level, fill_value])
Get Multiplication of dataframe and other, element-wise (binary operator mul ).
multiply (other[, axis, level, fill_value])
Get Multiplication of dataframe and other, element-wise (binary operator mul ).
ne (other[, axis, level])
Get Not equal to of dataframe and other, element-wise (binary operator ne ).
Return the first n rows ordered by columns in descending order.
Detect existing (non-missing) values.
DataFrame.notnull is an alias for DataFrame.notna.
Return the first n rows ordered by columns in ascending order.
Count number of distinct elements in specified axis.
pad ([axis, inplace, limit, downcast])
pct_change ([periods, fill_method, limit, freq])
Percentage change between the current and a prior element.
pipe (func, *args, **kwargs)
Apply chainable functions that expect Series or DataFrames.
pivot ([index, columns, values])
Return reshaped DataFrame organized by given index / column values.
Create a spreadsheet-style pivot table as a DataFrame.
alias of pandas.plotting._core.PlotAccessor
Return item and drop from frame.
pow (other[, axis, level, fill_value])
Get Exponential power of dataframe and other, element-wise (binary operator pow ).
Return the product of the values over the requested axis.
Return the product of the values over the requested axis.
quantile ([q, axis, numeric_only, interpolation])
Return values at the given quantile over requested axis.
Query the columns of a DataFrame with a boolean expression.
radd (other[, axis, level, fill_value])
Get Addition of dataframe and other, element-wise (binary operator radd ).
Compute numerical data ranks (1 through n) along axis.
rdiv (other[, axis, level, fill_value])
Get Floating division of dataframe and other, element-wise (binary operator rtruediv ).
Conform Series/DataFrame to new index with optional filling logic.
Return an object with matching indices as other object.
Alter axes labels.
Set the name of the axis for the index or columns.
Rearrange index levels using input order.
Resample time-series data.
Reset the index, or a level of it.
rfloordiv (other[, axis, level, fill_value])
Get Integer division of dataframe and other, element-wise (binary operator rfloordiv ).
rmod (other[, axis, level, fill_value])
Get Modulo of dataframe and other, element-wise (binary operator rmod ).
rmul (other[, axis, level, fill_value])
Get Multiplication of dataframe and other, element-wise (binary operator rmul ).
Provide rolling window calculations.
Round a DataFrame to a variable number of decimal places.
rpow (other[, axis, level, fill_value])
Get Exponential power of dataframe and other, element-wise (binary operator rpow ).
rsub (other[, axis, level, fill_value])
Get Subtraction of dataframe and other, element-wise (binary operator rsub ).
rtruediv (other[, axis, level, fill_value])
Get Floating division of dataframe and other, element-wise (binary operator rtruediv ).
Return a random sample of items from an axis of object.
Return a subset of the DataFrame’s columns based on the column dtypes.
sem ([axis, skipna, level, ddof, numeric_only])
Return unbiased standard error of the mean over requested axis.
set_axis (labels[, axis, inplace])
Assign desired index to given axis.
set_flags (*[, copy, allows_duplicate_labels])
Return a new object with updated flags.
Set the DataFrame index using existing columns.
shift ([periods, freq, axis, fill_value])
skew ([axis, skipna, level, numeric_only])
Return unbiased skew over requested axis.
(DEPRECATED) Equivalent to shift without copying data.
Sort object by labels (along an axis).
Sort by the values along either axis.
alias of pandas.core.arrays.sparse.accessor.SparseFrameAccessor
Squeeze 1 dimensional axis objects into scalars.
Stack the prescribed level(s) from columns to index.
std ([axis, skipna, level, ddof, numeric_only])
Return sample standard deviation over requested axis.
sub (other[, axis, level, fill_value])
Get Subtraction of dataframe and other, element-wise (binary operator sub ).
subtract (other[, axis, level, fill_value])
Get Subtraction of dataframe and other, element-wise (binary operator sub ).
Return the sum of the values over the requested axis.
Interchange axes and swap values axes appropriately.
Return the last n rows.
take (indices[, axis, is_copy])
Return the elements in the given positional indices along an axis.
Copy object to the system clipboard.
Write object to a comma-separated values (csv) file.
Convert the DataFrame to a dictionary.
Write object to an Excel sheet.
Write a DataFrame to the binary Feather format.
Write a DataFrame to a Google BigQuery table.
Write the contained data to an HDF5 file using HDFStore.
Render a DataFrame as an HTML table.
Convert the object to a JSON string.
Render object to a LaTeX tabular, longtable, or nested table.
to_markdown ([buf, mode, index, storage_options])
Print DataFrame in Markdown-friendly format.
to_numpy ([dtype, copy, na_value])
Convert the DataFrame to a NumPy array.
Write a DataFrame to the binary parquet format.
Convert DataFrame from DatetimeIndex to PeriodIndex.
Pickle (serialize) object to file.
to_records ([index, column_dtypes, index_dtypes])
Convert DataFrame to a NumPy record array.
Write records stored in a DataFrame to a SQL database.
Export DataFrame object to Stata dta format.
Render a DataFrame to a console-friendly tabular output.
Cast to DatetimeIndex of timestamps, at beginning of period.
Return an xarray object from the pandas object.
Render a DataFrame to an XML document.
Call func on self producing a DataFrame with the same axis shape as self.
Transpose index and columns.
truediv (other[, axis, level, fill_value])
Get Floating division of dataframe and other, element-wise (binary operator truediv ).
truncate ([before, after, axis, copy])
Truncate a Series or DataFrame before and after some index value.
tshift ([periods, freq, axis])
(DEPRECATED) Shift the time index, using the index’s frequency if available.
tz_convert (tz[, axis, level, copy])
Convert tz-aware axis to target time zone.
Localize tz-naive index of a Series or DataFrame to target time zone.
Pivot a level of the (necessarily hierarchical) index labels.
Modify in place using non-NA values from another DataFrame.
Return a Series containing counts of unique rows in the DataFrame.
var ([axis, skipna, level, ddof, numeric_only])
Return unbiased variance over requested axis.
Replace values where the condition is False.
xs (key[, axis, level, drop_level])
Return cross-section from the Series/DataFrame.
DataFrame
Constructor
DataFrame ([data, index, columns, dtype, copy])
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Attributes and underlying data
Axes
The index (row labels) of the DataFrame.
The column labels of the DataFrame.
Return the dtypes in the DataFrame.
Print a concise summary of a DataFrame.
Return a subset of the DataFrame’s columns based on the column dtypes.
Return a Numpy representation of the DataFrame.
Return a list representing the axes of the DataFrame.
Return an int representing the number of axes / array dimensions.
Return an int representing the number of elements in this object.
Return a tuple representing the dimensionality of the DataFrame.
Return the memory usage of each column in bytes.
Indicator whether Series/DataFrame is empty.
Return a new object with updated flags.
Conversion
Attempt to infer better dtypes for object columns.
Make a copy of this object’s indices and data.
Return the bool of a single element Series or DataFrame.
Indexing, iteration
Return the first n rows.
Access a single value for a row/column label pair.
Access a single value for a row/column pair by integer position.
Access a group of rows and columns by label(s) or a boolean array.
Purely integer-location based indexing for selection by position.
Insert column into DataFrame at specified location.
Iterate over info axis.
Iterate over (column name, Series) pairs.
Iterate over (column name, Series) pairs.
Get the ‘info axis’ (see Indexing for more).
Iterate over DataFrame rows as (index, Series) pairs.
Iterate over DataFrame rows as namedtuples.
(DEPRECATED) Label-based “fancy indexing” function for DataFrame.
Return item and drop from frame.
Return the last n rows.
DataFrame.xs(key[, axis, level, drop_level])
Return cross-section from the Series/DataFrame.
Get item from object for given key (ex: DataFrame column).
Whether each element in the DataFrame is contained in values.
Replace values where the condition is False.
Replace values where the condition is True.
Query the columns of a DataFrame with a boolean expression.
Binary operator functions
DataFrame.add(other[, axis, level, fill_value])
Get Addition of dataframe and other, element-wise (binary operator add ).
DataFrame.sub(other[, axis, level, fill_value])
Get Subtraction of dataframe and other, element-wise (binary operator sub ).
DataFrame.mul(other[, axis, level, fill_value])
Get Multiplication of dataframe and other, element-wise (binary operator mul ).
DataFrame.div(other[, axis, level, fill_value])
Get Floating division of dataframe and other, element-wise (binary operator truediv ).
Get Floating division of dataframe and other, element-wise (binary operator truediv ).
Get Integer division of dataframe and other, element-wise (binary operator floordiv ).
DataFrame.mod(other[, axis, level, fill_value])
Get Modulo of dataframe and other, element-wise (binary operator mod ).
DataFrame.pow(other[, axis, level, fill_value])
Get Exponential power of dataframe and other, element-wise (binary operator pow ).
Compute the matrix multiplication between the DataFrame and other.
DataFrame.radd(other[, axis, level, fill_value])
Get Addition of dataframe and other, element-wise (binary operator radd ).
DataFrame.rsub(other[, axis, level, fill_value])
Get Subtraction of dataframe and other, element-wise (binary operator rsub ).
DataFrame.rmul(other[, axis, level, fill_value])
Get Multiplication of dataframe and other, element-wise (binary operator rmul ).
DataFrame.rdiv(other[, axis, level, fill_value])
Get Floating division of dataframe and other, element-wise (binary operator rtruediv ).
Get Floating division of dataframe and other, element-wise (binary operator rtruediv ).
Get Integer division of dataframe and other, element-wise (binary operator rfloordiv ).
DataFrame.rmod(other[, axis, level, fill_value])
Get Modulo of dataframe and other, element-wise (binary operator rmod ).
DataFrame.rpow(other[, axis, level, fill_value])
Get Exponential power of dataframe and other, element-wise (binary operator rpow ).
Get Less than of dataframe and other, element-wise (binary operator lt ).
Get Greater than of dataframe and other, element-wise (binary operator gt ).
Get Less than or equal to of dataframe and other, element-wise (binary operator le ).
Get Greater than or equal to of dataframe and other, element-wise (binary operator ge ).
Get Not equal to of dataframe and other, element-wise (binary operator ne ).
Get Equal to of dataframe and other, element-wise (binary operator eq ).
Perform column-wise combine with another DataFrame.
Function application, GroupBy & window
Apply a function along an axis of the DataFrame.
Apply a function to a Dataframe elementwise.
Apply chainable functions that expect Series or DataFrames.
Aggregate using one or more operations over the specified axis.
Aggregate using one or more operations over the specified axis.
Call func on self producing a DataFrame with the same axis shape as self.
Group DataFrame using a mapper or by a Series of columns.
Provide rolling window calculations.
Provide expanding window calculations.
Provide exponentially weighted (EW) calculations.
Computations / descriptive stats
Return a Series/DataFrame with absolute numeric value of each element.
DataFrame.all([axis, bool_only, skipna, level])
Return whether all elements are True, potentially over an axis.
DataFrame.any([axis, bool_only, skipna, level])
Return whether any element is True, potentially over an axis.
DataFrame.clip([lower, upper, axis, inplace])
Trim values at input threshold(s).
Compute pairwise correlation of columns, excluding NA/null values.
Compute pairwise correlation.
Count non-NA cells for each column or row.
Compute pairwise covariance of columns, excluding NA/null values.
Return cumulative maximum over a DataFrame or Series axis.
Return cumulative minimum over a DataFrame or Series axis.
Return cumulative product over a DataFrame or Series axis.
Return cumulative sum over a DataFrame or Series axis.
Generate descriptive statistics.
First discrete difference of element.
Evaluate a string describing operations on DataFrame columns.
Return unbiased kurtosis over requested axis.
Return unbiased kurtosis over requested axis.
Return the mean absolute deviation of the values over the requested axis.
Return the maximum of the values over the requested axis.
Return the mean of the values over the requested axis.
Return the median of the values over the requested axis.
Return the minimum of the values over the requested axis.
DataFrame.mode([axis, numeric_only, dropna])
Get the mode(s) of each element along the selected axis.
Percentage change between the current and a prior element.
Return the product of the values over the requested axis.
Return the product of the values over the requested axis.
Return values at the given quantile over requested axis.
Compute numerical data ranks (1 through n) along axis.
Round a DataFrame to a variable number of decimal places.
Return unbiased standard error of the mean over requested axis.
Return unbiased skew over requested axis.
Return the sum of the values over the requested axis.
Return sample standard deviation over requested axis.
Return unbiased variance over requested axis.
Count number of distinct elements in specified axis.
Return a Series containing counts of unique rows in the DataFrame.
Reindexing / selection / label manipulation
Align two objects on their axes with the specified join method.
Select values at particular time of day (e.g., 9:30AM).
Select values between particular times of the day (e.g., 9:00-9:30 AM).
Drop specified labels from rows or columns.
Return DataFrame with duplicate rows removed.
Return boolean Series denoting duplicate rows.
Test whether two objects contain the same elements.
Subset the dataframe rows or columns according to the specified index labels.
Select initial periods of time series data based on a date offset.
Return the first n rows.
Return index of first occurrence of maximum over requested axis.
Return index of first occurrence of minimum over requested axis.
Select final periods of time series data based on a date offset.
Conform Series/DataFrame to new index with optional filling logic.
Return an object with matching indices as other object.
Alter axes labels.
Set the name of the axis for the index or columns.
Reset the index, or a level of it.
Return a random sample of items from an axis of object.
Assign desired index to given axis.
Set the DataFrame index using existing columns.
Return the last n rows.
Return the elements in the given positional indices along an axis.
Truncate a Series or DataFrame before and after some index value.
Missing data handling
DataFrame.bfill([axis, inplace, limit, downcast])
Remove missing values.
DataFrame.ffill([axis, inplace, limit, downcast])
Fill NA/NaN values using the specified method.
Fill NaN values using an interpolation method.
Detect missing values.
DataFrame.isnull is an alias for DataFrame.isna.
Detect existing (non-missing) values.
DataFrame.notnull is an alias for DataFrame.notna.
DataFrame.pad([axis, inplace, limit, downcast])
Reshaping, sorting, transposing
Return Series/DataFrame with requested index / column level(s) removed.
Return reshaped DataFrame organized by given index / column values.
Create a spreadsheet-style pivot table as a DataFrame.
Rearrange index levels using input order.
Sort by the values along either axis.
Sort object by labels (along an axis).
Return the first n rows ordered by columns in descending order.
Return the first n rows ordered by columns in ascending order.
Stack the prescribed level(s) from columns to index.
Pivot a level of the (necessarily hierarchical) index labels.
Interchange axes and swap values axes appropriately.
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
Transform each element of a list-like to a row, replicating index values.
Squeeze 1 dimensional axis objects into scalars.
Return an xarray object from the pandas object.
Transpose index and columns.
Combining / comparing / joining / merging
(DEPRECATED) Append rows of other to the end of caller, returning a new object.
Assign new columns to a DataFrame.
Compare to another DataFrame and show the differences.
Join columns of another DataFrame.
Merge DataFrame or named Series objects with a database-style join.
Modify in place using non-NA values from another DataFrame.
Time Series-related
Convert time series to specified frequency.
(DEPRECATED) Equivalent to shift without copying data.
(DEPRECATED) Shift the time index, using the index’s frequency if available.
Return index for first non-NA value or None, if no non-NA value is found.
Return index for last non-NA value or None, if no non-NA value is found.
Resample time-series data.
Convert DataFrame from DatetimeIndex to PeriodIndex.
Cast to DatetimeIndex of timestamps, at beginning of period.
Convert tz-aware axis to target time zone.
Localize tz-naive index of a Series or DataFrame to target time zone.
Flags
Flags(obj, *, allows_duplicate_labels)
Flags that apply to pandas objects.
Metadata
DataFrame.attrs is a dictionary for storing global metadata for this DataFrame.
DataFrame.attrs is considered experimental and may change without warning.
Dictionary of global attributes of this dataset.
Plotting
DataFrame plotting accessor and method
Draw a stacked area plot.
Vertical bar plot.
Make a horizontal bar plot.
Make a box plot of the DataFrame columns.
Generate Kernel Density Estimate plot using Gaussian kernels.
Generate a hexagonal binning plot.
Draw one histogram of the DataFrame’s columns.
Generate Kernel Density Estimate plot using Gaussian kernels.
Plot Series or DataFrame as lines.
Generate a pie plot.
Create a scatter plot with varying marker point size and color.
Make a box plot from DataFrame columns.
Make a histogram of the DataFrame’s columns.
Sparse accessor
Sparse-dtype specific methods and attributes are provided under the DataFrame.sparse accessor.
Ratio of non-sparse points to total (dense) data points.
Create a new DataFrame from a scipy sparse matrix.
Return the contents of the frame as a sparse SciPy COO matrix.
Convert a DataFrame with sparse values to dense.
Serialization / IO / conversion
Construct DataFrame from dict of array-like or dicts.
Convert structured or record ndarray to DataFrame.
Write a DataFrame to the binary parquet format.
Pickle (serialize) object to file.
Write object to a comma-separated values (csv) file.
Write the contained data to an HDF5 file using HDFStore.
Write records stored in a DataFrame to a SQL database.
Convert the DataFrame to a dictionary.
Write object to an Excel sheet.
Convert the object to a JSON string.
Render a DataFrame as an HTML table.
Write a DataFrame to the binary Feather format.
Render object to a LaTeX tabular, longtable, or nested table.
Export DataFrame object to Stata dta format.
Write a DataFrame to a Google BigQuery table.
Convert DataFrame to a NumPy record array.
Render a DataFrame to a console-friendly tabular output.
Copy object to the system clipboard.
Print DataFrame in Markdown-friendly format.
15 ways to create a Pandas DataFrame
A learner’s reference for different ways of creating a DataFrame with Pandas
Motivation
While doing EDA (exploratory data analysis) or developing / testing models, it is very common to use the powerful yet elegant pandas DataFrame for storing and manipulating data. And usually, it starts with “creating a dataframe”.
I usually encounter the following scenarios while starting some EDA or modeling with pandas:
I need to quickly create a dataframe of a few records to test some code.
I need to load a CSV or JSON file into a dataframe.
I need to read an HTML table from a web page into a dataframe.
I need to load JSON-like records into a dataframe without creating a JSON file.
I need to load CSV-like records into a dataframe without creating a CSV file.
I need to merge two dataframes, vertically or horizontally.
I need to transform a column of a dataframe into one-hot columns.
Each of these scenarios made me google the syntax or look up the documentation every single time, until I slowly memorized them over months and years of practice.
Knowing how painful those lookups can be, I thought a quick reference sheet for the multiple ways to create a dataframe in pandas might save some time. It may help learners until they become seasoned data analysts or data scientists.
So here are a few ways we can create a dataframe. If anyone reading this finds other elegant ways or methods, please feel free to comment or message me; I would love to add them to this page with your reference.
Using DataFrame constructor pd.DataFrame()
The pandas DataFrame() constructor offers many different ways to create and initialize a dataframe.
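As a rough illustration of a few of those options, here is a minimal sketch. The variable names and the product values are made up for the example; only `pd.DataFrame` and its `columns` and `index` parameters come from pandas itself.

```python
import pandas as pd

# From a dict that maps each column name to a list of values
df_from_dict = pd.DataFrame({
    "product_name": ["laptop", "printer", "tablet"],
    "price": [1200, 150, 300],
})

# From a list of dicts (JSON-like records, no file needed)
records = [
    {"product_name": "desk", "price": 450},
    {"product_name": "chair", "price": 200},
]
df_from_records = pd.DataFrame(records)

# From a list of rows, with explicit column names and row labels
df_from_rows = pd.DataFrame(
    [["laptop", 1200], ["printer", 150]],
    columns=["product_name", "price"],
    index=["product_1", "product_2"],
)

print(df_from_dict)
print(df_from_records)
print(df_from_rows)
```

The dict form is usually the quickest for a handful of test records, while the list-of-dicts form matches data that already looks like JSON.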
Using pandas library functions — read_csv, read_json
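A minimal sketch of both readers follows. To keep it self-contained, the CSV and JSON content is held in strings wrapped in `io.StringIO`; the sample text is invented for illustration, and in practice you would typically pass a file path instead.

```python
import io

import pandas as pd

# CSV-like text held in memory; a path such as "products.csv" works the same way
csv_text = """product_name,price
laptop,1200
printer,150
"""
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON-like text (a list of records), also read from an in-memory buffer
json_text = (
    '[{"product_name": "tablet", "price": 300},'
    ' {"product_name": "desk", "price": 450}]'
)
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Wrapping the text in `io.StringIO` also covers the case of loading CSV-like or JSON-like records without creating a file first.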
From other dataframes
Note: The two methods shown above are different. The copy() function creates a completely new dataframe object that is independent of the original one, while plain variable assignment only creates an alias for the original dataframe; no new dataframe object is created. Any change to the original dataframe is also reflected in the alias, as shown below:
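A small sketch of that difference, with hypothetical variable names and sample values:

```python
import pandas as pd

original = pd.DataFrame({
    "product_name": ["laptop", "printer"],
    "price": [1200, 150],
})

alias = original               # plain assignment: same object, new name
independent = original.copy()  # copy(): a separate dataframe object

# Change a value through the original dataframe
original.loc[0, "price"] = 999

print(alias is original)             # the alias is the very same object
print(alias.loc[0, "price"])         # the change is visible through the alias
print(independent.loc[0, "price"])   # the copy still holds the old value
```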
In the above example, the index of the second dataframe is preserved in the concatenated dataframe. To renumber the index so that it runs continuously over the whole concatenated dataframe, use the dataframe's reset_index() function.
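The following sketch reproduces that behaviour with two small, made-up dataframes stacked vertically with `pd.concat`. It assumes the default integer index on both inputs.

```python
import pandas as pd

df1 = pd.DataFrame({"product_name": ["laptop", "printer"], "price": [1200, 150]})
df2 = pd.DataFrame({"product_name": ["tablet", "desk"], "price": [300, 450]})

# Vertical concatenation keeps each input's own index labels
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())

# reset_index(drop=True) renumbers the rows and discards the old labels
renumbered = stacked.reset_index(drop=True)
print(renumbered.index.tolist())
```

Passing `ignore_index=True` to `pd.concat` is an alternative that renumbers the rows in the same step.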
NOTE: For horizontal concatenation,
One-hot encoding converts the values of a column into a set of derived binary columns. For each row, exactly one column in the set is 1 and the rest are 0.
If we know that a car has body types = SEDAN, SUV, VAN, TRUCK, then a Toyota Corolla with body = ‘SEDAN’ will become one-hot encoded to
Each one-hot column name is made up of the original column name, an underscore, and the value (for example, body_SEDAN).
Below is an example:
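A minimal sketch using `pd.get_dummies`, one common way to do this in pandas. The `model` column and every car other than the Corolla are hypothetical sample data added for the illustration.

```python
import pandas as pd

cars = pd.DataFrame({
    "model": ["Corolla", "RAV4", "Sienna", "Tacoma"],
    "body": ["SEDAN", "SUV", "VAN", "TRUCK"],
})

# Replace the "body" column with one 0/1 column per distinct value
one_hot = pd.get_dummies(cars, columns=["body"], dtype=int)

print(one_hot)
```

The derived columns are named with the original column name as a prefix, so the Corolla row has 1 in `body_SEDAN` and 0 in the other `body_` columns.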
I hope this “cheat-sheet” helps in the initial phases of learning EDA or modeling. With time and constant practice, all of this will become second nature.
All the best then 🙂
Do share your valuable inputs if you have any other elegant ways of dataframe creation or if there is any new function that can create a dataframe for some specific purpose.