How to parse xml in python

How to parse xml in python

Parsing XML files in python

How to efficiently extract data from an XML file using simple python code in an easily manipulative form

XML (Extensible Markup Language) is a markup language which is very similar to HTML (Hypertext Markup Language). XML is used to structure data for transport and storage. It uses text and tags to describe the components in a file. XML files are a type of semi-structured data.

While working on a personal project in Python, I realized the need to extract the data from XML files into a suitable formats like CSV. It is easier to work with data present in such formats. Following is a simple way which I used to get the task done.

How to interpret the data in an XML file?

Above is a part of the XML file which we will be using for extracting data. You can download the original file from Kaggle.

1. Importing necessary python libraries

we will be using the xml package from which we import etree.cElementTree as ‘et’ for working simplicity. ‘Element’ is a flexible container object which is used to store hierarchical data structures in the memory.

2. Parsing the XML file

3. Creating lists for record values

4. Converting the data

Now using the root, we iterate through our file and store values in our lists that we created in the previous step.

Using root.iter(‘__’) function by filling in value for each respective field/tag, we can write a separate for loop for each field. I have added a print statement in each for loop to understand the process flow during execution.

We have now parsed our data and stored the values in separate lists. The next step is to convert it into a pandas dataframe to make the data easily usable.

5. Creating a pandas dataframe

Selecting all the lists in desired order and adding column names for the respective columns, we used pd.DataFrame() function to create a dataframe.

After executing the above code, you can view the created dataframe, it will look as follows,

How to parse xml in python. Смотреть фото How to parse xml in python. Смотреть картинку How to parse xml in python. Картинка про How to parse xml in python. Фото How to parse xml in python

6. Saving the extracted data (CSV Format)

We can now save our dataframe as a csv file for easy storage and usage later.

We have successfully parsed the XML file and extracted data for use!

For complete code and data, you can also visit the following Github repository.

Парсинг XML Python

Вы когда-нибудь сталкивались с надоедливым XML-файлом, который вам нужно проанализировать, чтобы получить важные значения? Давайте узнаем, как создать парсер Python XML.How to parse xml in python. Смотреть фото How to parse xml in python. Смотреть картинку How to parse xml in python. Картинка про How to parse xml in python. Фото How to parse xml in python

Мы рассмотрим, как мы можем анализировать подобные XML-файлы с помощью Python, чтобы получить соответствующие атрибуты и значения.

Метод 1: Использование ElementTree (рекомендуется)

Мы можем использовать библиотеку ElementTree Python для решения этой задачи.

Это самый простой и рекомендуемый вариант для создания синтаксического анализатора Python XML, поскольку эта библиотека по умолчанию входит в состав Python.

Она не только обеспечивает легкий доступ, поскольку уже установлена, но и работает довольно быстро. Давайте посмотрим, как именно мы можем извлечь атрибуты из нашего тестового файла.

Мы будем использовать интерфейс xml.etree.ElementTree внутри основного xml пакета.

Дерево синтаксического анализатора

Давайте сначала построим корневой узел этого дерева синтаксического анализа. Это самый верхний узел, он необходим нам для начала синтаксического анализа.

К счастью для нас, в этом API уже есть следующий метод:

Это автоматически прочитает входной XML-файл и получит для нас корневой узел.

Похоже, он проанализирован. Но мы пока не можем это проверить. Итак, давайте проанализируем другие атрибуты и попробуем получить значение.

Получение значения соответствующих атрибутов

Итак, теперь наша задача — получить значение внутри атрибута с помощью нашего Python XML Parser.

Его позиция от корневого узла

Мы получили все значения на этом уровне нашего дерева синтаксического анализа XML! Мы успешно проанализировали наш XML-файл.

Возьмем другой пример, чтобы все прояснить.

Теперь предположим, что XML-файл выглядит так:

Получить текстовое значение просто. Просто используйте:

Итак, наша полная программа для этого парсера будет:

Вы можете расширить эту логику на любое количество уровней и для файлов XML произвольной длины! Вы также можете записать новое дерево синтаксического анализа в другой файл XML.

Метод 2: использование BeautifulSoup (надежный)

Это также еще один хороший выбор, если по какой-то причине исходный XML плохо отформатирован. XML может работать не очень хорошо, если вы не выполните предварительную обработку файла.

Оказывается, BeautifulSoup очень хорошо работает со всеми этими типами файлов, поэтому, если вы хотите проанализировать любой XML-файл, используйте этот подход.

Чтобы установить его, используйте pip и установите модуль bs4 :

Я дам вам небольшой фрагмент нашего предыдущего XML-файла:

XML parsing in Python

This article focuses on how one can parse a given XML file and extract some useful data out of it in a structured way.

XML: XML stands for eXtensible Markup Language. It was designed to store and transport data. It was designed to be both human- and machine-readable.That’s why, the design goals of XML emphasize simplicity, generality, and usability across the Internet.
The XML file to be parsed in this tutorial is actually a RSS feed.

RSS: RSS(Rich Site Summary, often called Really Simple Syndication) uses a family of standard web feed formats to publish frequently updated informationlike blog entries, news headlines, audio, video. RSS is XML formatted plain text.

Python Module used: This article will focus on using inbuilt xml module in python for parsing XML and the main focus will be on the ElementTree XML API of this module.

Implementation:

Above code will:

Let us try to understand the code in pieces:

Here, we are using xml.etree.ElementTree (call it ET, in short) module. Element Tree has two classes for this purpose – ElementTree represents the whole XML
document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level.

Ok, so let’s go through the parseXML() function now:

Here, we create an ElementTree object by parsing the passed xmlfile.

getroot() function return the root of tree as an Element object.

Now, once you have taken a look at the structure of your XML file, you will notice that we are interested only in item element.
./channel/item is actually XPath syntax (XPath is a language for addressing parts of an XML document). Here, we want to find all item grand-children of channel children of the root(denoted by ‘.’) element.
You can read more about supported XPath syntax here.

Now, we know that we are iterating through item elements where each item element contains one news. So, we create an empty news dictionary in which we will store all data available about news item. To iterate though each child element of an element, we simply iterate through it, like this:

Now, notice a sample item element here:
How to parse xml in python. Смотреть фото How to parse xml in python. Смотреть картинку How to parse xml in python. Картинка про How to parse xml in python. Фото How to parse xml in python

We will have to handle namespace tags separately as they get expanded to their original value, when parsed. So, we do something like this:

child.attrib is a dictionary of all the attributes related to an element. Here, we are interested in url attribute of media:content namespace tag.
Now, for all other children, we simply do:

child.tag contains the name of child element. child.text stores all the text inside that child element. So, finally, a sample item element is converted to a dictionary and looks like this:

So now, here is how our formatted data looks like now:
How to parse xml in python. Смотреть фото How to parse xml in python. Смотреть картинку How to parse xml in python. Картинка про How to parse xml in python. Фото How to parse xml in python

As you can see, the hierarchical XML file data has been converted to a simple CSV file so that all news stories are stored in form of a table. This makes it easier to extend the database too.
Also, one can use the JSON-like data directly in their applications! This is the best alternative for extracting data from websites which do not provide a public API but provide some RSS feeds.

All the code and files used in above article can be found here.

What next?

This article is contributed by Nikhil Kumar. If you like GeeksforGeeks and would like to contribute, you can also write an article and mail your article to review-team@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above

A Roadmap to XML Parsers in Python

Table of Contents

If you’ve ever tried to parse an XML document in Python before, then you know how surprisingly difficult such a task can be. On the one hand, the Zen of Python promises only one obvious way to achieve your goal. At the same time, the standard library follows the batteries included motto by letting you choose from not one but several XML parsers. Luckily, the Python community solved this surplus problem by creating even more XML parsing libraries.

Jokes aside, all XML parsers have their place in a world full of smaller or bigger challenges. It’s worthwhile to familiarize yourself with the available tools.

In this tutorial, you’ll learn how to:

You can use this tutorial as a roadmap to guide you through the confusing world of XML parsers in Python. By the end of it, you’ll be able to pick the right XML parser for a given problem. To get the most out of this tutorial, you should already be familiar with XML and its building blocks, as well as how to work with files in Python.

Free Bonus: 5 Thoughts On Python Mastery, a free course for Python developers that shows you the roadmap and the mindset you’ll need to take your Python skills to the next level.

Choose the Right XML Parsing Model

It turns out that you can process XML documents using a few language-agnostic strategies. Each demonstrates different memory and speed trade-offs, which can partially justify the wide range of XML parsers available in Python. In the following section, you’ll find out their differences and strengths.

Document Object Model (DOM)

Historically, the first and the most widespread model for parsing XML has been the DOM, or the Document Object Model, originally defined by the World Wide Web Consortium (W3C). You might have already heard about the DOM because web browsers expose a DOM interface through JavaScript to let you manipulate the HTML code of your websites. Both XML and HTML belong to the same family of markup languages, which makes parsing XML with the DOM possible.

The DOM is arguably the most straightforward and versatile model to use. It defines a handful of standard operations for traversing and modifying document elements arranged in a hierarchy of objects. An abstract representation of the entire document tree is stored in memory, giving you random access to the individual elements.

While the DOM tree allows for fast and omnidirectional navigation, building its abstract representation in the first place can be time-consuming. Moreover, the XML gets parsed at once, as a whole, so it has to be reasonably small to fit the available memory. This renders the DOM suitable only for moderately large configuration files rather than multi-gigabyte XML databases.

Use a DOM parser when convenience is more important than processing time and when memory is not an issue. Some typical use cases are when you need to parse a relatively small document or when you only need to do the parsing infrequently.

Simple API for XML (SAX)

To address the shortcomings of the DOM, the Java community came up with a library through a collaborative effort, which then became an alternative model for parsing XML in other languages. There was no formal specification, only organic discussions on a mailing list. The end result was an event-based streaming API that operates sequentially on individual elements rather than the whole tree.

Elements are processed from top to bottom in the same order they appear in the document. The parser triggers user-defined callbacks to handle specific XML nodes as it finds them in the document. This approach is known as “push” parsing because elements are pushed to your functions by the parser.

SAX also lets you discard elements if you’re not interested in them. This means it has a much lower memory footprint than DOM and can deal with arbitrarily large files, which is great for single-pass processing such as indexing, conversion to other formats, and so on.

However, finding or modifying random tree nodes is cumbersome because it usually requires multiple passes on the document and tracking the visited nodes. SAX is also inconvenient for handling deeply nested elements. Finally, the SAX model just allows for read-only parsing.

In short, SAX is cheap in terms of space and time but more difficult to use than DOM in most cases. It works well for parsing very large documents or parsing incoming XML data in real time.

Streaming API for XML (StAX)

Although somewhat less popular in Python, this third approach to parsing XML builds on top of SAX. It extends the idea of streaming but uses a “pull” parsing model instead, which gives you more control. You can think of StAX as an iterator advancing a cursor object through an XML document, where custom handlers call the parser on demand and not the other way around.

Note: It’s possible to combine more than one XML parsing model. For example, you can use SAX or StAX to quickly find an interesting piece of data in the document and then build a DOM representation of only that particular branch in memory.

Using StAX gives you more control over the parsing process and allows for more convenient state management. The events in the stream are only consumed when requested, enabling lazy evaluation. Other than that, its performance should be on par with SAX, depending on the parser implementation.

Learn About XML Parsers in Python’s Standard Library

In this section, you’ll take a look at Python’s built-in XML parsers, which are available to you in nearly every Python distribution. You’re going to compare those parsers against a sample Scalable Vector Graphics (SVG) image, which is an XML-based format. By processing the same document with different parsers, you’ll be able to choose the one that suits you best.

The sample image, which you’re about to save in a local file for reference, depicts a smiley face. It consists of the following XML content:

It starts with an XML declaration, followed by a Document Type Definition (DTD) and the root element. The DTD is optional, but it can help validate your document structure if you decide to use an XML validator. The root element specifies the default namespace xmlns as well as a prefixed namespace xmlns:inkscape for editor-specific elements and attributes. The document also contains:

Go ahead, save the XML in a file named smiley.svg, and open it using a modern web browser, which will run the JavaScript snippet present at the end:

How to parse xml in python. Смотреть фото How to parse xml in python. Смотреть картинку How to parse xml in python. Картинка про How to parse xml in python. Фото How to parse xml in python

The code adds an interactive component to the image. When you hover the mouse over the smiley face, it blinks its eyes. If you want to edit the smiley face using a convenient graphical user interface (GUI), then you can open the file using a vector graphics editor such as Adobe Illustrator or Inkscape.

Note: Unlike JSON or YAML, some features of XML can be exploited by hackers. The standard XML parsers available in the xml package in Python are insecure and vulnerable to an array of attacks. To safely parse XML documents from an untrusted source, prefer secure alternatives. You can jump to the last section in this tutorial for more details.

It’s worth noting that Python’s standard library defines abstract interfaces for parsing XML documents while letting you supply concrete parser implementation. In practice, you rarely do that because Python bundles a binding for the Expat library, which is a widely used open-source XML parser written in C. All of the following Python modules in the standard library use Expat under the hood by default.

Unfortunately, while the Expat parser can tell you if your document is well-formed, it can’t validate the structure of your documents against an XML Schema Definition (XSD) or a Document Type Definition (DTD). For that, you’ll have to use one of the third-party libraries discussed later.

xml.dom.minidom : Minimal DOM Implementation

Considering that parsing XML documents using the DOM is arguably the most straightforward, you won’t be that surprised to find a DOM parser in the Python standard library. What is surprising, though, is that there are actually two DOM parsers.

The xml.dom package houses two modules to work with DOM in Python:

The second module has a slightly misleading name because it defines a streaming pull parser, which can optionally produce a DOM representation of the current node in the document tree. You’ll find more information about the pulldom parser later.

There are two functions in minidom that let you parse XML data from various data sources. One accepts either a filename or a file object, while another one expects a Python string:

The triple-quoted string helps embed a multiline string literal without using the continuation character ( \ ) at the end of each line. In any case, you’ll end up with a Document instance, which exhibits the familiar DOM interface, letting you traverse the tree.

Apart from that, you’ll be able to access the XML declaration, DTD, and the root element:

Definition StyleImplementation
DTD
PythonlinearGradient.setIdAttribute(«id»)

However, using a DTD isn’t enough to fix the problem if your document has a default namespace, which is the case for the sample SVG image. To address this, you can visit all elements recursively in Python, check whether they have the id attribute, and indicate it as their ID in one go:

Now, you’re getting the expected XML element corresponding to the id attribute’s value.

The first argument must be the XML namespace, which typically has the form of a domain name, while the second argument is the tag name. Notice that the namespace prefix is irrelevant! To search all namespaces, you can provide a wildcard character ( * ).

Note: To find the namespaces declared in your XML document, you can check out the root element’s attributes. In theory, they could be declared on any element, but the top-level one is where you’d usually find them.

Once you locate the element you’re interested in, you may use it to walk over the tree. However, another jarring quirk with minidom is how it handles whitespace characters between elements:

The newline characters and leading indentation are captured as separate tree elements, which is what the specification requires. Some parsers let you ignore these, but not the Python one. What you can do, however, is collapse whitespace in such nodes manually:

Elements expose a few helpful methods and properties to let you query their details:

For instance, you can check an element’s namespace, tag name, or attributes. If you ask for a missing attribute, then you’ll get an empty string ( » ).

Dealing with namespaced attributes isn’t much different. You just have to remember to prefix the attribute name accordingly or provide the domain name:

Since this tutorial is only about XML parsing, you’ll need to check the minidom documentation for methods that modify the DOM tree. They mostly follow the W3C specification.

As you can see, the minidom module isn’t terribly convenient. Its main advantage comes from being part of the standard library, which means you don’t have to install any external dependencies in your project to work with the DOM.

xml.sax : The SAX Interface for Python

To start working with SAX in Python, you can use the same parse() and parseString() convenience functions as before, but from the xml.sax package instead. You also have to provide at least one more required argument, which must be a content handler instance. In the spirit of Java, you provide one by subclassing a specific base class:

The content handler receives a stream of events corresponding to elements in your document as it’s being parsed. Running this code won’t do anything useful yet because your handler class is empty. To make it work, you’ll need to overload one or more callback methods from the superclass.

Fire up your favorite editor, type the following code, and save it in a file named svg_handler.py :

This modified content handler prints out a few events onto the standard output. The SAX parser will call these three methods for you in response to finding the start tag, end tag, and some text between them. When you open an interactive session of the Python interpreter, import your content handler and give it a test drive. It should produce the following output:

That’s essentially the observer design pattern, which lets you translate XML into another hierarchical format incrementally. Say you wanted to convert that SVG file into a simplified JSON representation. First, you’ll want to store your content handler object in a separate variable to extract information from it later:

The SAX parser gives you attributes as a mapping that you can convert to a plain Python dictionary with a call to the dict() function. The element value is often spread over multiple pieces that you can concatenate using the plus operator ( + ) or a corresponding augmented assignment statement:

Aggregating text in such a way will ensure that multiline content ends up in the current element. For example, the

How to parse XML and get instances of a particular node attribute?

I have many rows in XML and I’m trying to get instances of a particular node attribute.

How to parse xml in python. Смотреть фото How to parse xml in python. Смотреть картинку How to parse xml in python. Картинка про How to parse xml in python. Фото How to parse xml in python

18 Answers 18

Trending sort

Trending sort is based off of the default sorting method — by highest score — but it boosts votes that have happened recently, helping to surface more up-to-date answers.

It falls back to sorting by highest score if no posts are trending.

Switch to Trending sort

First build an Element instance root from the XML, e.g. with the XML function, or by parsing a file with something like:

How to parse xml in python. Смотреть фото How to parse xml in python. Смотреть картинку How to parse xml in python. Картинка про How to parse xml in python. Фото How to parse xml in python

minidom is the quickest and pretty straight forward.

How to parse xml in python. Смотреть фото How to parse xml in python. Смотреть картинку How to parse xml in python. Картинка про How to parse xml in python. Фото How to parse xml in python

How to parse xml in python. Смотреть фото How to parse xml in python. Смотреть картинку How to parse xml in python. Картинка про How to parse xml in python. Фото How to parse xml in python

The relevant metrics can be found in the table below, copied from the cElementTree website:

As pointed out by @jfs, cElementTree comes bundled with Python:

How to parse xml in python. Смотреть фото How to parse xml in python. Смотреть картинку How to parse xml in python. Картинка про How to parse xml in python. Фото How to parse xml in python

I suggest xmltodict for simplicity.

It parses your XML to an OrderedDict;

Taking your sample text:

Python has an interface to the expat XML parser.

It’s a non-validating parser, so bad XML will not be caught. But if you know your file is correct, then this is pretty good, and you’ll probably get the exact info you want and you can discard the rest on the fly.

How to parse xml in python. Смотреть фото How to parse xml in python. Смотреть картинку How to parse xml in python. Картинка про How to parse xml in python. Фото How to parse xml in python

Just to add another possibility, you can use untangle, as it is a simple xml-to-python-object library. Here you have an example:

Your XML file (a little bit changed):

Accessing the attributes with untangle :

The output will be:

More information about untangle can be found in «untangle».

Also, if you are curious, you can find a list of tools for working with XML and Python in «Python and XML». You will also see that the most common ones were mentioned by previous answers.

I might suggest declxml.

Full disclosure: I wrote this library because I was looking for a way to convert between XML and Python data structures without needing to write dozens of lines of imperative parsing/serialization code with ElementTree.

With declxml, you use processors to declaratively define the structure of your XML document and how to map between XML and Python data structures. Processors are used to for both serialization and parsing as well as for a basic level of validation.

Parsing into Python data structures is straightforward:

Which produces the output:

You can also use the same processor to serialize data to XML

Which produces the following output

If you want to work with objects instead of dictionaries, you can define processors to transform data to and from objects as well.

Источники информации:

Добавить комментарий

Ваш адрес email не будет опубликован. Обязательные поля помечены *