How to read pdf python

Working with PDF files in Python

You are probably familiar with PDFs already: they are among the most widely used digital document formats. PDF stands for Portable Document Format, and PDF files use the .pdf extension. The format is used to present and exchange documents reliably, independent of software, hardware, or operating system.
Invented by Adobe, PDF is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video, and business logic.
In this article, we will learn how to perform various operations on PDF files using simple Python scripts.
Installation
We will be using a third-party module, PyPDF2.
PyPDF2 is a Python library built as a PDF toolkit.

To install PyPDF2, run pip install PyPDF2 from the command line.

This module name is case-sensitive, so make sure the y is lowercase and everything else is uppercase.
1. Extracting text from PDF file

The output of the above program looks like this:

Let us try to understand the above code in chunks:

Note: While PDF files are great for laying out text in a way that’s easy for people to print and read, they’re not straightforward for software to parse into plaintext. As such, PyPDF2 might make mistakes when extracting text from a PDF and may even be unable to open some PDFs at all. Unfortunately, there isn’t much you can do about this: PyPDF2 may simply be unable to work with some of your particular PDF files.
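One way to cope with such files is to extract the text page by page and flag the pages that come back empty, which often means the page is a scanned image. The following is a minimal sketch, assuming PyPDF2 3.x (where the reader class is called PdfReader); the function names are illustrative and not part of PyPDF2.

```python
def extract_page_texts(pdf_path):
    """Return a list with the extracted text of each page."""
    # Imported inside the function so the helper below stays usable
    # even on a machine where PyPDF2 is not installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 3.x: pip install PyPDF2

    reader = PdfReader(pdf_path)
    # extract_text() may yield an empty string for scanned or unusual pages.
    return [page.extract_text() or "" for page in reader.pages]


def pages_without_text(page_texts):
    """Return the 1-based numbers of pages whose extracted text is blank."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]
```

Calling pages_without_text(extract_page_texts("example.pdf")), where example.pdf is a placeholder name, lists the pages that probably need OCR or manual review.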

2. Rotating PDF pages

How to Work With a PDF in Python

The Portable Document Format, or PDF, is a file format that can be used to present and exchange documents reliably across operating systems. While the PDF was originally invented by Adobe, it is now an open standard that is maintained by the International Organization for Standardization (ISO). You can work with a preexisting PDF in Python by using the PyPDF2 package.

PyPDF2 is a pure-Python package that you can use for many different types of PDF operations.

By the end of this article, you’ll know how to extract document information from a PDF, rotate pages, merge PDFs, split PDFs, add watermarks, and encrypt a PDF.

Let’s get started!

pdfrw: An Alternative

The biggest difference when it comes to pdfrw is that it integrates with the ReportLab package so that you can take a preexisting PDF and build a new one with ReportLab using some or all of the preexisting PDF.

Installation

Installing PyPDF2 can be done with pip or conda if you happen to be using Anaconda instead of regular Python.

Here’s how you would install PyPDF2 with pip: run pip install PyPDF2 from your terminal.

The install is quite quick as PyPDF2 does not have any dependencies. You will likely spend as much time downloading the package as you will installing it.

Now let’s move on and learn how to extract some information from a PDF.

How to Extract Document Information From a PDF in Python

You can use PyPDF2 to extract metadata and some text from a PDF. This can be useful when you’re doing certain types of automation on your preexisting PDF files.

Here are the current types of data that can be extracted: author, creator, producer, subject, title, and number of pages.

Let’s write some code using that PDF and learn how you can get access to these attributes:

Note: That last code block uses Python 3’s new f-strings for string formatting. If you’d like to learn more, you can check out Python 3’s f-Strings: An Improved String Formatting Syntax (Guide).

The information variable has several instance attributes that you can use to get the rest of the metadata you want from the document. You print out that information and also return it for potential future use.
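The description above can be sketched as follows. This is a minimal example under stated assumptions, not the original listing: it assumes PyPDF2 3.x, where the metadata is exposed as reader.metadata, and the names extract_information and summarize_metadata are chosen here for illustration.

```python
METADATA_FIELDS = ("author", "creator", "producer", "subject", "title")


def summarize_metadata(info, number_of_pages):
    """Collect the standard fields into a dict; missing ones become None."""
    summary = {field: getattr(info, field, None) for field in METADATA_FIELDS}
    summary["number_of_pages"] = number_of_pages
    return summary


def extract_information(pdf_path):
    """Print and return the document information of the PDF at pdf_path."""
    from PyPDF2 import PdfReader  # assumes PyPDF2 3.x: pip install PyPDF2

    reader = PdfReader(pdf_path)
    # reader.metadata can be None when the PDF carries no document info.
    information = summarize_metadata(reader.metadata, len(reader.pages))
    for field, value in information.items():
        print(f"{field}: {value}")
    return information
```

Keeping the summary in a plain dict makes it easy to log the values or write them to a spreadsheet later.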

Now you’re ready to learn about rotating PDF pages.

How to Rotate Pages

Occasionally, you will receive PDFs that contain pages that are in landscape mode instead of portrait mode. Or perhaps they are even upside down. This can happen when someone scans a document to PDF or email. You could print the document out and read the paper version or you can use the power of Python to rotate the offending pages.

For this example, you can go and pick out a Real Python article and print it to PDF.

Let’s learn how to rotate a few of the pages of that article with PyPDF2:

Note: The PyPDF2 package only allows you to rotate a page in increments of 90 degrees. You will receive an AssertionError otherwise.
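A rotation script could look roughly like this. It is a sketch assuming PyPDF2 3.x, where a page is rotated clockwise with page.rotate(angle); the helper normalize_rotation and the rotations mapping are illustrative names, and the angle check is done up front so that an invalid value is reported before any file is written.

```python
def normalize_rotation(angle):
    """Validate an angle and reduce it to 0, 90, 180, or 270 degrees clockwise."""
    if angle % 90 != 0:
        raise ValueError("pages can only be rotated in steps of 90 degrees")
    return angle % 360


def rotate_pages(pdf_path, output_path, rotations):
    """Copy a PDF, rotating selected pages.

    rotations maps a 0-based page index to clockwise degrees,
    for example {0: 90, 2: -90}.
    """
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 3.x

    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    for index, page in enumerate(reader.pages):
        angle = normalize_rotation(rotations.get(index, 0))
        if angle:
            page.rotate(angle)  # clockwise; -90 was normalized to 270
        writer.add_page(page)
    with open(output_path, "wb") as out:
        writer.write(out)
```

Pages that are not listed in rotations are copied unchanged.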

Now let’s learn how you can merge multiple PDFs into one.

How to Merge PDFs

There are many situations where you will want to take two or more PDFs and merge them together into a single PDF. For example, you might have a standard cover page that needs to go on to many types of reports. You can use Python to help you do that sort of thing.

For this example, you can open up a PDF and print a page out as a separate PDF. Then do that again, but with a different page. That will give you a couple of inputs to use for example purposes.

Let’s go ahead and write some code that you can use to merge PDFs together:

You can use merge_pdfs() when you have a list of PDFs that you want to merge together. You will also need to know where to save the result, so this function takes a list of input paths and an output path.

Once you’re finished iterating over all of the pages of all of the PDFs in your list, you will write out the result at the end.
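A merge_pdfs() along those lines might be written as below. This is a sketch assuming PyPDF2 3.x (PdfReader, PdfWriter, add_page); the parameter names paths and output are illustrative.

```python
def merge_pdfs(paths, output):
    """Append every page of every PDF in paths and save the result to output."""
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 3.x

    writer = PdfWriter()
    for path in paths:
        reader = PdfReader(path)
        # Copy the pages in their original order.
        for page in reader.pages:
            writer.add_page(page)
    # Write once, after all inputs have been processed.
    with open(output, "wb") as out:
        writer.write(out)
```

For example, merge_pdfs(["cover.pdf", "report.pdf"], "merged.pdf") would put a cover page in front of a report; the file names are placeholders.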

One item I would like to point out is that you could enhance this script a bit by adding in a range of pages to be added if you didn’t want to merge all the pages of each PDF. If you’d like a challenge, you could also create a command line interface for this function using Python’s argparse module.

Let’s find out how to do the opposite of merging!

How to Split PDFs

There are times where you might have a PDF that you need to split up into multiple PDFs. This is especially true of PDFs that contain a lot of scanned-in content, but there are a plethora of good reasons for wanting to split a PDF.

Here’s how you can use PyPDF2 to split your PDF into multiple files:

In this example, you once again create a PDF reader object and loop over its pages. For each page in the PDF, you will create a new PDF writer instance and add a single page to it. Then you will write that page out to a uniquely named file. When the script is finished running, you should have each page of the original PDF split into separate PDFs.
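The loop described above can be sketched as follows, again assuming PyPDF2 3.x. The naming scheme (original name plus _page_N) and the function names are assumptions made for this example.

```python
from pathlib import Path


def split_output_path(pdf_path, page_number, out_dir="."):
    """Build a unique file name such as report_page_1.pdf."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_page_{page_number}.pdf"


def split_pdf(pdf_path, out_dir="."):
    """Write each page of pdf_path to its own PDF and return the new paths."""
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 3.x

    reader = PdfReader(pdf_path)
    written = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()  # a fresh writer per page
        writer.add_page(page)
        target = split_output_path(pdf_path, number, out_dir)
        with open(target, "wb") as out:
            writer.write(out)
        written.append(target)
    return written
```

Returning the list of written paths lets a caller process or clean up the single-page files afterwards.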

Now let’s take a moment to learn how you can add a watermark to your PDF.

How to Add Watermarks

Watermarks are identifying images or patterns on printed and digital documents. Some watermarks can only be seen in special lighting conditions. The reason watermarking is important is that it allows you to protect your intellectual property, such as your images or PDFs. Another term for watermark is overlay.

You can use Python and PyPDF2 to watermark your documents. You need to have a PDF that only contains your watermark image or text.

Let’s learn how to add a watermark now:

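create_watermark() accepts three arguments: the path of the PDF to be watermarked (input_pdf), the path where the watermarked PDF should be saved, and the path of the PDF that contains the watermark.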

In the code, you open up the watermark PDF and grab just the first page from the document as that is where your watermark should reside. Then you create a PDF reader object using the input_pdf and a generic pdf_writer object for writing out the watermarked PDF.

Finally, you write the newly watermarked PDF out to disk, and you’re done!
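Put together, the steps above might look like this. It is a sketch assuming PyPDF2 3.x, where one page is overlaid on another with merge_page(); the parameter names output and watermark are assumptions for this example.

```python
def create_watermark(input_pdf, output, watermark):
    """Overlay the first page of the watermark PDF on every page of input_pdf."""
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 3.x

    # The watermark is expected on the first page of its own PDF.
    watermark_page = PdfReader(watermark).pages[0]

    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    for page in reader.pages:
        page.merge_page(watermark_page)  # draws the watermark over the page
        writer.add_page(page)

    with open(output, "wb") as out:
        writer.write(out)
```

Because the watermark is merged onto each page after the page's own content, it is drawn on top, so a watermark with a transparent background works best.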

The last topic you will learn about is how PyPDF2 handles encryption.

How to Encrypt a PDF

PyPDF2 currently only supports adding a user password and an owner password to a preexisting PDF. In PDF land, an owner password will basically give you administrator privileges over the PDF and allow you to set permissions on the document. On the other hand, the user password just allows you to open the document.

As far as I can tell, PyPDF2 doesn’t actually allow you to set any permissions on the document even though it does allow you to set the owner password.

Regardless, this is how you can add a password, which will also inherently encrypt the PDF:

add_encryption() takes in the input and output PDF paths as well as the password that you want to add to the PDF. It then opens a PDF writer and a reader object, as before. Since you will want to encrypt the entire input PDF, you will need to loop over all of its pages and add them to the writer.
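As a rough illustration of that description, assuming PyPDF2 3.x, the function could be sketched like this. The password is passed positionally as the user password; when no owner password is given, PyPDF2 uses the same value for both.

```python
def add_encryption(input_pdf, output_pdf, password):
    """Copy input_pdf to output_pdf, protected with the given password."""
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 3.x

    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    # Encryption applies to the whole document, so copy every page first.
    for page in reader.pages:
        writer.add_page(page)

    # First positional argument is the user password.
    writer.encrypt(password)

    with open(output_pdf, "wb") as out:
        writer.write(out)
```

A call such as add_encryption("report.pdf", "report_encrypted.pdf", "a-strong-password") leaves the original file untouched and writes a protected copy; the names here are placeholders.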

Note: PDF encryption uses either RC4 or AES (Advanced Encryption Standard) to encrypt the PDF according to pdflib.com.

Just because you have encrypted your PDF does not mean it is necessarily secure. There are tools to remove passwords from PDFs. If you’d like to learn more, Carnegie Mellon University has an interesting paper on the topic.

Conclusion

The PyPDF2 package is quite useful and is usually pretty fast. You can use PyPDF2 to automate large jobs and leverage its capabilities to help you do your job better!

In this tutorial, you learned how to extract document information from a PDF, rotate pages, merge and split PDFs, add watermarks, and encrypt a PDF.

About Mike Driscoll

Mike has been programming in Python for over a decade and loves writing about Python!

How to Process Text from PDF Files in Python?

PDFs are a common way to share text. PDF stands for Portable Document Format and uses the .pdf file extension. It was created in the early 1990s by Adobe Systems.

Reading PDF documents with Python can help you automate a wide variety of tasks.

In this tutorial we will learn how to extract text from a PDF file in Python.

Let’s get started.

Reading and Extracting Text from a PDF File in Python

For the purpose of this tutorial, we are creating a sample PDF with 2 pages. You can do so using any word processor, such as Microsoft Word or Google Docs, and save the file as a PDF.

Using PyPDF2 to Extract PDF Text

You can use PyPDF2 to extract text from a PDF. Let’s see how it works.

1. Install the package

To install PyPDF2 on your system, enter pip install PyPDF2 in your terminal.

2. Import PyPDF2

Open a new Python notebook and start by importing PyPDF2.

3. Open the PDF in read-binary mode

Start with opening the PDF in read binary mode using the following line of code:

This opens the PDF as a binary file object and stores it in the variable ‘pdf’. The PdfFileReader object is created from it in the next step.

4. Use PyPDF2.PdfFileReader() to read text

Now you can pass the open file to the PdfFileReader class from PyPDF2 to read it. PyPDF2 3.0 removed this older name in favor of PdfReader.

To get the text from the first page of the PDF, use the following lines of code:

We get the output as:

Here we used the getPage method to store the page as an object. Then we used the extractText() method to get the text from the page object. In PyPDF2 3.0 and later, the equivalents are reader.pages[n] and extract_text().

The text we get is a Python string (type str).

Similarly, to get the second page from the PDF, use:

We get the output as :

Complete Code to Read PDF Text using PyPDF2

The complete code from this section is given below:
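A minimal sketch of those four steps follows, assuming PyPDF2 3.x, where PdfFileReader, getPage, and extractText became PdfReader, pages, and extract_text. The function name read_first_two_pages is chosen here for illustration.

```python
def read_first_two_pages(pdf_path):
    """Return the text of the first and second page of a PDF."""
    import PyPDF2  # step 2: import the package (pip install PyPDF2)

    # Step 3: open the PDF in read-binary mode.
    with open(pdf_path, "rb") as pdf:
        # Step 4: create the reader and pull text from each page.
        reader = PyPDF2.PdfReader(pdf)
        first_page = reader.pages[0].extract_text()
        second_page = reader.pages[1].extract_text()
    return first_page, second_page
```

The with statement closes the file even if the PDF turns out to be unreadable, which a bare open() call would not do.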

You may notice that the formatting of the first page is a little off in the output above. This is because PyPDF2 does not always preserve a page’s layout and whitespace when it extracts text.

Luckily, Python has a better alternative to PyPDF2. We are going to look at that next.

Using PDFplumber to Extract Text

PDFplumber is another tool that can extract text from a PDF. It is more powerful than PyPDF2 when it comes to text extraction.

1. Install the package

Let’s get started by installing PDFplumber with pip install pdfplumber.

2. Import pdfplumber

Start with importing PDFplumber using the following line of code :

3. Using PDFplumber to read pdfs

You can start reading PDFs using PDFplumber with the following piece of code:

This will get the text from the first page of our PDF. The output comes as:

You can compare this with the output of PyPDF2 and see how PDFplumber is better when it comes to formatting.

PDFplumber also provides options to get other information from the PDF.

For example, you can use .page_number to get the page number.

To learn more about the methods under PDFPlumber refer to its official documentation.
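The pdfplumber steps described in this section can be sketched as below. pdfplumber.open(), pages, extract_text(), and page_number are part of pdfplumber; the function name first_page_text is illustrative.

```python
def first_page_text(pdf_path):
    """Return the page number and the text of the first page of a PDF."""
    import pdfplumber  # pip install pdfplumber

    # The context manager closes the underlying file when the block ends.
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        # page_number counts from 1, unlike the 0-based list index.
        return page.page_number, page.extract_text()
```

Note that pages is indexed from 0 while page_number starts at 1, so the first page is pages[0] with page_number 1.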

Conclusion

This tutorial was about reading text from PDFs. We looked at two different tools and saw how one is better than the other.

Now that you know how to read text from a PDF, you should read our tutorial on tokenization to get started with Natural Language Processing!

Can Python Read PDF Files?

Python is a great tool for task automation, and it makes working with text files and data sheets really easy. But can you use Python to read PDF files?

There are plenty of great Python libraries that can be used to parse PDF files, for example: PDFMiner, PyPDF2, tabula-py, slate, PDFQuery, xpdf_python, pdflib and PyMuPDF.

In this brief tutorial I’ll show you how to install and use each of these libraries to read PDFs.

1. Reading PDF File Contents With PDFMiner

PDFMiner is a library for extracting text from PDF files. It can be used as an importable module in your Python scripts, but it also comes with a command-line interface, so you can invoke pdfminer directly from the command line as well.

Attention: The original pdfminer package is deprecated, as the repo has been abandoned by the original author. Make sure to install its community fork, pdfminer.six instead!

If you want to use it in your Python script you can simply do:

2. Extracting Text With PyPDF2

PyPDF2 is a feature-rich Python library that makes manipulating PDF files easier. It can extract metadata, text and images, and can also modify PDF files by cropping, merging and splitting PDFs.

You can install it by running pip install PyPDF2.

To read text from PDF files you can use the PdfFileReader class (renamed PdfReader in PyPDF2 3.0), like so:

This little snippet gets the number of pages from the metadata, then iterates through all the pages, and extracts the text content from each page one-by-one.

3. Importing Tabular Data Into Pandas With Tabula-py

Tabula-py is a more specialized tool: it reads tables from PDF files. It returns the data as pandas DataFrames, but you can also export it into TSV or CSV format.

Installation is simple with pip: run pip install tabula-py.

Using it is pretty straightforward as well:

df will hold all the data that tabula-py manages to find in tabular format inside the input file. In tabula-py 2.0 and later, read_pdf returns a list of pandas DataFrames by default, one per table it detects.
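A short sketch of reading and exporting tables, assuming tabula-py 2.x with a Java runtime installed (tabula-py wraps the Java tool Tabula). The function names and the CSV naming scheme are illustrative.

```python
def read_tables(pdf_path, pages="all"):
    """Return the tables tabula-py detects, as a list of DataFrames."""
    import tabula  # pip install tabula-py; needs Java on the PATH

    # Assumes tabula-py 2.x, where read_pdf returns a list of DataFrames.
    return tabula.read_pdf(pdf_path, pages=pages)


def save_tables_as_csv(tables, prefix):
    """Write each DataFrame to prefix_1.csv, prefix_2.csv, and so on."""
    paths = []
    for index, table in enumerate(tables, start=1):
        path = f"{prefix}_{index}.csv"
        table.to_csv(path, index=False)  # standard pandas export
        paths.append(path)
    return paths
```

Writing one CSV per table avoids mixing tables that have different columns.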

4. Slate

Slate is a wrapper around PDFMiner. It provides roughly the same feature set, but with a much cleaner, pythonic interface.

contents will be a list of strings, where each element holds the text of one page.

5. Scraping And Querying PDF Files With PDFQuery

If you need to do some more sophisticated manipulation of PDF data besides just dumping all the contents of the file as raw text, your best bet would be PDFQuery. It allows you to traverse the document tree, just like you would with an XML or HTML document.

PDFQuery supports both XPath and jQuery syntax for querying.

The pdf variable will now contain a traversable and searchable representation of the PDF document. Contents of this document can be exported in an arbitrary, user-defined format.

You can also search the contents of the document, for example:

6. Xpdf_python

xpdf_python is a wrapper for xpdf. It can export pdf files to text format.

As always installation is easy with pip:

To get the contents of a pdf file as a string:

7. Pdflib

Pdflib provides Python bindings for the Poppler PDF library. Pdflib can be installed by running:

Parsing pdf files is pretty easy using pdflib:

The above snippet will gather all the text in the pdf in the content variable line-by-line.

8. PyMuPDF

PyMuPDF provides Python bindings for MuPDF, a lightweight PDF/e-book viewer.

Reading a PDF file into a variable:

content will be a list of pages, containing the content of each page as a string element.
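Reading a file this way might look like the sketch below. It assumes a reasonably recent PyMuPDF, which is imported as fitz and exposes page.get_text(); the function name read_pages is illustrative.

```python
def read_pages(pdf_path):
    """Return the text of every page as a list of strings."""
    import fitz  # the import name of PyMuPDF: pip install PyMuPDF

    # A Document can be used as a context manager and iterated page by page.
    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Joining the list with "\n".join(read_pages("example.pdf")) gives the whole document as one string; the file name is a placeholder.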

Summary

Those were eight popular Python libraries that can be used to read PDF data. So which one should you pick?

However, if you need nothing fancy and just want to dump the contents of the file, any of the others will do, but I’d probably go with pdflib or PyMuPDF. They are actively maintained, fast, robust, easy to install, and provide a clean interface to work with.

How to read PDF files with Python

Background

In a previous article, we talked about how to scrape tables from PDF files with Python. In this post, we’ll cover how to extract text from several types of PDFs. To read PDF files with Python, we can focus most of our attention on two packages – pdfminer and pytesseract.

pdfminer (specifically pdfminer.six, which is a more up-to-date fork of pdfminer) is an effective package to use if you’re handling PDFs that are typed and you’re able to highlight the text. On the other hand, to read scanned-in PDF files with Python, the pytesseract package comes in handy, which we’ll see later in the post.

Scraping highlightable text

For the first example, let’s scrape a 10-K form from Apple. First, we’ll just download this file to a local directory and save it as “apple_10k.pdf”. The first package we’ll be using to extract text is pdfminer. To download the version of the package we need, you can use pip (note we’re downloading pdfminer.six): pip install pdfminer.six

Next, let’s import the extract_text function from pdfminer.high_level. This module within pdfminer provides higher-level functions for scraping text from PDF files. With extract_text, we can extract text from a PDF with one line of code (minus the package import)! This is an advantage of pdfminer versus some other packages like PyPDF2.

The code above will extract the text from each page in the PDF. If we want to limit our extraction to specific pages, we just need to pass that specification to extract_text using the page_numbers parameter.

Scraping a password-protected PDF

If the PDF we want to scrape is password-protected, we just need to pass the password as a parameter to the same method as above.
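Both options can be combined in one small wrapper. The sketch below uses extract_text from pdfminer.high_level with its password and page_numbers parameters; page_numbers is zero-indexed in pdfminer.six, so the wrapper converts from the 1-based numbers people usually use. The wrapper and helper names are illustrative.

```python
def to_zero_based(page_numbers):
    """Convert 1-based page numbers to pdfminer's zero-based ones."""
    return sorted({number - 1 for number in page_numbers})


def extract_pdf_text(pdf_path, pages=None, password=""):
    """Extract text from all pages, or only from the given 1-based pages."""
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    page_numbers = to_zero_based(pages) if pages else None
    return extract_text(pdf_path, password=password, page_numbers=page_numbers)
```

For instance, extract_pdf_text("apple_10k.pdf", pages=[1, 3]) would read only the first and third pages.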

Scraping text from scanned-in images

If a PDF contains scanned-in images of text, then it can still be scraped, but this requires a few additional steps. In this case, we’re going to be using two other Python packages: pytesseract and Wand. The second of these is used to convert PDFs into image files, while pytesseract is used to extract text from images. Since pytesseract doesn’t work directly on PDFs, we have to first convert our sample PDF into an image (or collection of image files).

Initial setup

Let’s get started by setting up the Wand package. Wand can be installed using pip: pip install Wand

This package also requires a tool called ImageMagick to be installed.

There are other options for packages that convert PDFs into image files. For example, pdf2image is another choice, but we’ll use Wand in this tutorial.

Additionally, let’s go ahead and install pytesseract. This package can also be installed using pip: pip install pytesseract

pytesseract depends upon tesseract being installed separately. tesseract is an underlying utility that performs OCR (Optical Character Recognition) on images to extract text.

Converting PDFs into image files

Now, once our setup is complete, we can convert a PDF into a collection of image files. The way we do this is by converting each individual page into an image file. In addition to using Wand, we’re also going to import the os package to help create the name of each image output file.

For this example, we’re going to take a scanned-in version of the first three pages of the 10-K form from earlier in this post.

In the with statement above, we open a connection to the PDF file. The resolution parameter specifies the DPI we want for the image outputs – in this case 500. Within the for loop, we specify the output filename, save the image using Image.save, and lastly append the filename to the list of image files. This way, we can loop over the list of image files, and scrape the text from each.
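The conversion described here can be sketched as follows, assuming Wand with ImageMagick (and Ghostscript for PDF support) installed. The output naming scheme and the function names are assumptions for this example.

```python
import os


def page_image_name(pdf_path, page_index, out_dir="."):
    """Build an output name such as apple_10k_1.png for the first page."""
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, f"{base}_{page_index + 1}.png")


def pdf_to_images(pdf_path, out_dir=".", resolution=500):
    """Save every page of a PDF as a PNG and return the list of file names."""
    from wand.image import Image  # pip install Wand

    image_files = []
    # resolution is the DPI used when the PDF pages are rasterized.
    with Image(filename=pdf_path, resolution=resolution) as pdf:
        for index, page in enumerate(pdf.sequence):
            name = page_image_name(pdf_path, index, out_dir)
            # Wrap the single page in its own Image so it can be saved.
            with Image(image=page) as single_page:
                single_page.save(filename=name)
            image_files.append(name)
    return image_files
```

A lower resolution such as 300 produces smaller files and faster OCR, at some cost in accuracy on small print.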

This should create three separate image files:

Using pytesseract on each image file

Next, we can use pytesseract to extract the text from each image file. In the code below, we store the extracted text from each page as a separate element in a list.

Alternatively, we can use a list comprehension like below:
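A sketch of the OCR step, assuming pytesseract and Pillow are installed alongside the tesseract binary. pytesseract.image_to_string() is the library's text extraction call; the function names ocr_image and ocr_images are illustrative.

```python
def ocr_image(image_path):
    """Run OCR on one image file and return the recognized text."""
    import pytesseract  # pip install pytesseract (needs tesseract itself)
    from PIL import Image  # pip install Pillow

    # Open the image only for as long as tesseract needs it.
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def ocr_images(image_files):
    """Return one text string per image, in the same order as image_files."""
    return [ocr_image(path) for path in image_files]
```

The result keeps one element per page, so the text of page 2 is at index 1 of the returned list.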

That’s all for now. If you enjoyed this post, please follow my blog on Twitter!
