How to Read Multiple Files in Python
Introduction
If you have been part of the information science (or any data!) manufacture, you would know the challenge of working with different data types. Dissimilar formats, dissimilar compression, different parsing on unlike systems – you could be quickly pulling your hair! Oh and I take non talked nearly the unstructured data or semi-structured data yet.
For any information scientist or data engineer, dealing with different formats can get a slow task. In existent-world, people rarely go neat tabular information. Thus, it is mandatory for any data scientist (or a data engineer) to be aware of different file formats, common challenges in handling them and the best / efficient means to handle this data in existent life.
This article provides mutual formats a data scientist or a information engineer must be aware of. I volition first introduce you to unlike common file formats used in the industry. Later, nosotros'll see how to read these file formats in Python.
P.Southward. In residual of this article, I will be referring to a data scientist, just the same applies to a data engineer or any data scientific discipline professional.
Table of Contents
- What is a file format?
- Why should a data scientist understand dissimilar file formats?
- Different file formats and how to read them in Python?
- Comma-separated values
- XLSX
- ZIP
- Plain Text (txt)
- JSON
- XML
- HTML
- Images
- Hierarchical Information Format
- DOCX
- MP3
- MP4
1. What is a file format?
A file format is a standard way in which information is encoded for storage in a file. First, the file format specifies whether the file is a binary or ASCII file. Second, it shows how the information is organized. For example, comma-separated values (CSV) file format stores tabular data in plain text.
To place a file format, you lot can normally look at the file extension to get an idea. For example, a file saved with name "Information" in "CSV" format will announced every bit "Data.csv". By noticing ".csv" extension we can conspicuously identify that it is a "CSV" file and data is stored in a tabular format.
2. Why should a information scientist understand different file formats?
Usually, the files y'all volition come across will depend on the awarding y'all are building. For example, in an image processing system, you lot need image files as input and output. So you will by and large see files in jpeg, gif or png format.
As a data scientist, you need to sympathise the underlying structure of various file formats, their advantages and dis-advantages. Unless you understand the underlying structure of the data, you lot will not be able to explore information technology. Besides, at times you need to make decisions about how to shop data.
Choosing the optimal file format for storing data can amend the performance of your models in information processing.
At present, we will look at the following file formats and how to read them in Python:
- Comma-separated values
- XLSX
- Cypher
- Plain Text (txt)
- JSON
- XML
- HTML
- Images
- Hierarchical Data Format
- DOCX
- MP3
- MP4
3. Different file formats and how to read them in Python
3.1 Comma-separated values
Comma-separated values file format falls under spreadsheet file format.
What is Spreadsheet File Format?
In spreadsheet file format, information is stored in cells. Each cell is organized in rows and columns. A column in the spreadsheet file tin can have different types. For instance, a column can be of string type, a date type or an integer type. Some of the most popular spreadsheet file formats are Comma Separated Values ( CSV ), Microsoft Excel Spreadsheet ( xls ) and Microsoft Excel Open XML Spreadsheet ( xlsx ).
Each line in CSV file represents an ascertainment or commonly called a tape. Each record may comprise one or more than fields which are separated past a comma.
Sometimes you may come across files where fields are not separated past using a comma but they are separated using tab. This file format is known as TSV (Tab Separated Values) file format.
The below prototype shows a CSV file which is opened in Notepad.
Reading the data from CSV in Python
Let us look at how to read a CSV file in Python. For loading the data you can use the "pandas" library in python.
import pandas every bit pd
df = pd.read_csv("/domicile/Loan_Prediction/train.csv")
Higher up code volition load the railroad train.csv file in DataFrame df.
iii.ii XLSX files
XLSX is a Microsoft Excel Open XML file format. It also comes nether the Spreadsheet file format. Information technology is an XML-based file format created past Microsoft Excel. The XLSX format was introduced with Microsoft Function 2007.
In XLSX information is organized under the cells and columns in a sheet. Each XLSX file may comprise one or more sheets. So a workbook can contain multiple sheets.
The below image shows a "xlsx" file which is opened in Microsoft Excel.
In above paradigm, yous can encounter that there are multiple sheets present (bottom left) in this file, which are Customers, Employees, Invoice, Order. The image shows the data of only ane canvass – "Invoice".
Reading the data from XLSX file
Allow's load the information from XLSX file and define the canvass proper noun. For loading the information y'all can utilise the Pandas library in python.
import pandas as pd
df = pd.read_excel("/habitation/Loan_Prediction/railroad train.xlsx", sheetname = "Invoice")
Above code will load the sheet "Invoice" from "train.xlsx" file in DataFrame df.
3.3 Null files
ZIP format is an annal file format.
What is Annal File format?
In Annal file format, you create a file that contains multiple files forth with metadata. An archive file format is used to collect multiple data files together into a single file. This is done for but compressing the files to use less storage space.
There are many pop computer data archive format for creating archive files. Cypher, RAR and Tar existence the most popular annal file format for compressing the data.
So, a ZIP file format is a lossless compression format, which ways that if yous compress the multiple files using Nothing format you can fully recover the data after decompressing the ZIP file. ZIP file format uses many compression algorithms for compressing the documents. You can easily place a Nix file by the .zip extension.
Reading a .Goose egg file in Python
You can read a cypher file by importing the "zipfile" parcel. Below is the python code which tin read the "railroad train.csv" file that is inside the "T.zero".
import zipfile archive = zipfile.ZipFile('T.zip', 'r') df = archive.read('train.csv') Here, I have discussed i of the famous annal format and how to open up information technology in python. I am not mentioning other annal formats. If you want to read about different archive formats and their comparisons you can refer this link.
three.4 Plain Text (txt) file format
In Manifestly Text file format, everything is written in evidently text. Usually, this text is in unstructured class and there is no meta-data associated with it. The txt file format can easily exist read by any program. But interpreting this is very hard by a calculator program.
Permit's take a simple instance of a text File.
The following case shows text file data that contain text:
"In my previous article, I introduced you lot to the basics of Apache Spark, different data representations (RDD / DataFrame / Dataset) and basics of operations (Transformation and Activity). We even solved a machine learning trouble from one of our by hackathons. In this article, I will continue from the place I left in my previous article. I will focus on manipulating RDD in PySpark by applying operations (Transformation and Actions)."
Suppose the above text written in a file called text.txt and y'all desire to read this so y'all can refer the below code.
text_file = open("text.txt", "r") lines = text_file.read() 3.5 JSON file format
JavaScript Object Annotation(JSON) is a text-based open standard designed for exchanging the data over web. JSON format is used for transmitting structured data over the web. The JSON file format tin can be easily read in any programming language because information technology is linguistic communication-independent data format.
Let's take an instance of a JSON file
The following example shows how a typical JSON file stores data of employees.
{ "Employee": [ { "id":"1", "Name": "Ankit", "Sal": "1000", }, { "id":"2", "Name": "Faizy", "Sal": "2000", } ] } Reading a JSON file
Let'south load the data from JSON file. For loading the data yous tin can employ the pandas library in python.
import pandas every bit pd
df = pd.read_json("/home/kunal/Downloads/Loan_Prediction/train.json")
3.half dozen XML file format
XML is besides known as Extensible Markup Language. As the proper name suggests, it is a markup linguistic communication. It has certain rules for encoding data. XML file format is a human being-readable and automobile-readable file format. XML is a self-descriptive linguistic communication designed for sending information over the internet. XML is very similar to HTML, merely has some differences. For example, XML does not utilise predefined tags as HTML.
Let'south accept the simple example of XML File format.
The following example shows an xml certificate that contains the data of an employee.
<?xml version="1.0"?> <contact-info> <proper noun>Ankit</name> <company>Anlytics Vidhya</visitor> <phone>+9187654321</phone> </contact-info>
The "<?xml version="i.0″?>" is a XML declaration at the start of the file (it is optional). In this deceleration, version specifies the XML version and encoding specifies the character encoding used in the document. <contact-info> is a tag in this document. Each XML-tag needs to exist airtight.
Reading XML in python
For reading the data from XML file yous tin import xml.etree. ElementTree library.
Let'southward import an xml file called train and print its root tag.
import xml.etree.ElementTree as ET tree = ET.parse('/habitation/sunilray/Desktop/two sigma/train.xml') root = tree.getroot() print root.tag 3.7 HTML files
HTML stands for Hyper Text Markup Linguistic communication. It is the standard markup linguistic communication which is used for creating Web pages. HTML is used to draw structure of web pages using markup. HTML tags are same as XML simply these are predefined. You lot tin can hands identify HTML document subsection on basis of tags such as <head> represent the heading of HTML document. <p> "paragraph" paragraph in HTML. HTML is non example sensitive.
The following case shows an HTML document.
<!DOCTYPE html> <html> <head> <title>Page Title</championship> </head> <body><h1>My First Heading</h1> <p>My kickoff paragraph.</p></body> </html>
Each tag in HTML is enclosed under the angular bracket(<>). The <!DOCTYPE html> tag defines that document is in HTML format. <html> is the root tag of this certificate. The <head> element contains heading function of this document. The <title>, <body>, <h1>, <p> represent the title, torso, heading and paragraph respectively in the HTML document.
Reading the HTML file
For reading the HTML file, you tin apply BeautifulSoup library. Please refer to this tutorial, which volition guide you how to parse HTML documents. Beginner'due south guide to Web Scraping in Python (using BeautifulSoup)
3.8 Epitome files
Prototype files are probably the almost fascinating file format used in data science. Whatsoever computer vision application is based on image processing. So it is necessary to know different image file formats.
Usual image files are 3-Dimensional, having RGB values. But, they tin also exist 2-Dimensional (grayscale) or 4-Dimensional (having intensity) – an Image consisting of pixels and meta-data associated with it.
Each image consists 1 or more than frames of pixels. And each frame is made up of two-dimensional array of pixel values. Pixel values tin can exist of whatsoever intensity. Meta-data associated with an image, tin be an image type (.png) or pixel dimensions.
Let'south take the example of an image by loading information technology.
from scipy import misc f = misc.confront() misc.imsave('face.png', f) # uses the Image module (PIL) import matplotlib.pyplot as plt plt.imshow(f) plt.show()
Now, let'due south check the blazon of this paradigm and its shape.
type(f) , f.shape
numpy.ndarray,(768, 1024, 3)
If you want to read about image processing you lot can refer this commodity. This article volition teach you image processing with an instance – Basics of Image Processing in Python
3.ix Hierarchical Data Format (HDF)
In Hierarchical Data Format ( HDF ), you tin store a large corporeality of data easily. Information technology is not but used for storing high volumes or complex information but also used for storing pocket-size volumes or simple data.
The advantages of using HDF are as mentioned below:
- It can be used in every size and blazon of organization
- It has flexible, efficient storage and fast I/O.
- Many formats back up HDF.
In that location are multiple HDF formats present. But, HDF5 is the latest version which is designed to address some of the limitations of the older HDF file formats. HDF5 format has some similarity with XML. Like XML, HDF5 files are self-describing and permit users to specify circuitous data relationships and dependencies.
Let's take the example of an HDF5 file format which tin can be identified using .h5 extension.
Read the HDF5 file
Y'all tin read the HDF file using pandas. Beneath is the python code can load the train.h5 data into the "t".
t = pd.read_hdf('train.h5')
3.10 PDF file format
PDF (Portable Certificate Format) is an incredibly useful format used for interpretation and brandish of text documents along with incorporated graphics. A special feature of a PDF file is that it can be secured by a countersign.
Here's an example of a pdf file.
Reading a PDF file
On the other hand, reading a PDF format through a plan is a complex chore. Although there exists a library which do a good job in parsing PDF file, one of them is PDFMiner. To read a PDF file through PDFMiner, yous take to:
- Download PDFMiner and install it through the website
- Excerpt PDF file past the following lawmaking
pdf2txt.py <pdf_file>.pdf
three.xi DOCX file format
Microsoft word docx file is another file format which is regularly used by organizations for text based information. Information technology has many characteristics, like inline addition of tables, images, hyperlinks, etc. which helps in making docx an incredibly important file format.
The reward of a docx file over a PDF file is that a docx file is editable. You can also change a docx file to any other format.
Here'due south an example of a docx file:
Reading a docx file
Similar to PDF format, python has a customs contributed library to parse a docx file. It is called python-docx2txt.
Installing this library is easy through pip by:
pip install docx2txt
To read a docx file in Python use the following code:
import docx2txt text = docx2txt.procedure("file.docx") iii.12 MP3 file format
MP3 file format comes nether the multimedia file formats. Multimedia file formats are similar to paradigm file formats, but they happen to be ane the most complex file formats.
In multimedia file formats, you tin can store variety of data such as text image, graphical, video and audio data. For instance, A multimedia format can allow text to be stored every bit Rich Text Format (RTF) information rather than ASCII information which is a plain-text format.
MP3 is ane of the most mutual sound coding formats for digital audio. A mp3 file format uses the MPEG-1 (Film Experts Group – 1) encoding format which is a standard for lossy compression of video and audio. In lossy compression, one time you lot have compressed the original file, yous cannot recover the original information.
A mp3 file format compresses the quality of audio by filtering out the audio which can non be heard by humans. MP3 compression usually achieves 75 to 95% reduction in size, so it saves a lot of space.
mp3 File Format Structure
A mp3 file is made up of several frames. A frame tin be further divided into a header and data block. Nosotros call these sequence of frames an elementary stream.
A header in mp3 unremarkably, identify the first of a valid frame and a information blocks contain the (compressed) sound information in terms of frequencies and amplitudes. If you lot desire to know more about mp3 file structure you can refer this link.
Reading the multimedia files in python
For reading or manipulating the multimedia files in Python y'all can use a library called PyMedia.
3.13 MP4 file format
MP4 file format is used to store videos and movies. It contains multiple images (called frames), which play in course of a video every bit per a specific time menstruation. There are two methods for interpreting a mp4 file. One is a closed entity, in which the whole video is considered as a single entity. And other is mosaic of images, where each image in the video is considered equally a different entity and these images are sampled from the video.
Here's is an case of mp4 video
Reading an mp4 file
MP4 also has a community built library for reading and editing mp4 files, called MoviePy.
Y'all tin can install the library from this link. To read a mp4 video clip, in Python use the following code.
from moviepy.editor import VideoFileClip prune = VideoFileClip('<video_file>.mp4') You lot can so display this in jupyter notebook every bit beneath
ipython_display(clip)
Terminate Notes
In this article, I have introduced you to some of the basic file formats, which are used by data scientist on a mean solar day to day footing. At that place are many file formats I take not covered. Good matter is that I don't demand to comprehend all of them in one article.
I promise you found this commodity helpful. I would encourage you to explore more than file formats. Proficient luck! If yous notwithstanding have whatsoever difficulty in understanding a specific information format, I'd similar to interact with y'all in comments. If you lot have any more doubts or queries feel complimentary to drop in your comments beneath.
Learn, compete, hack and become hired
Source: https://www.analyticsvidhya.com/blog/2017/03/read-commonly-used-formats-using-python/
Belum ada Komentar untuk "How to Read Multiple Files in Python"
Posting Komentar