Document Processing¶
HTML¶
Boilerpipe¶
Boilerplate Removal and Fulltext Extraction from HTML pages
Goose3¶
Goose3 takes any news article or article-type web page and extracts not only the main body of the article but also its metadata and the most probable image candidate.
Newspaper3k¶
Article scraping & curation
Python-Readability¶
Given an HTML document, python-readability pulls out the main body text and cleans it up.
Office¶
File Extensions:
Type | MS Office | Open Document
---|---|---
Text | .doc/.docx | .odt
Text Template | .dot | .ott
Master Document | .doc/.docx | .odm
Spreadsheet | .xls/.xlsx | .ods
Spreadsheet Template | .xlt | .ots
Drawing | | .odg
Drawing Template | | .otg
Presentation | .ppt/.pptx | .odp
Presentation Template | .pot | .otp
Formula | | .odf
Chart | | .odc
Database | .mdb | .odb
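When routing files to the right parser, a lookup keyed on the extension is often enough. The following is a minimal sketch using only the standard library; the `OFFICE_TYPES` mapping and the `classify` helper are illustrative names, and the mapping covers only a subset of the extensions listed above. Because `.doc`/`.docx` can be either a text or a master document, the sketch simply treats them as text.

```python
from pathlib import Path

# Extension -> (document type, format family). Illustrative subset only.
OFFICE_TYPES = {
    ".doc": ("Text", "MS Office"),
    ".docx": ("Text", "MS Office"),
    ".odt": ("Text", "Open Document"),
    ".xls": ("Spreadsheet", "MS Office"),
    ".xlsx": ("Spreadsheet", "MS Office"),
    ".ods": ("Spreadsheet", "Open Document"),
    ".ppt": ("Presentation", "MS Office"),
    ".pptx": ("Presentation", "MS Office"),
    ".odp": ("Presentation", "Open Document"),
    ".odg": ("Drawing", "Open Document"),
    ".mdb": ("Database", "MS Office"),
    ".odb": ("Database", "Open Document"),
}


def classify(filename):
    """Return (type, family) for a known office extension, else None."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    return OFFICE_TYPES.get(suffix)


print(classify("Budget.XLSX"))
print(classify("slides.odp"))
print(classify("notes.txt"))
```

Extensions are only a hint: a renamed file keeps its old content, so a production pipeline would likely also check the file's actual signature.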
Spreadsheets¶
openpyxl:
openpyxl is a Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files.
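A small round trip gives a feel for the read/write API. This is a sketch under simple assumptions: the sheet name `Data`, the sample rows, and the helper names `write_rows` and `read_rows` are invented for the example, and the file is written to a temporary directory.

```python
import tempfile
from pathlib import Path

from openpyxl import Workbook, load_workbook


def write_rows(path, rows):
    """Write a list of rows to a new .xlsx file with one sheet named 'Data'."""
    wb = Workbook()
    ws = wb.active
    ws.title = "Data"
    for row in rows:
        ws.append(row)  # each call adds one row below the previous one
    wb.save(str(path))


def read_rows(path):
    """Read every row of the 'Data' sheet back as plain Python lists."""
    wb = load_workbook(str(path))
    ws = wb["Data"]
    return [list(row) for row in ws.iter_rows(values_only=True)]


rows = [["name", "pages"], ["report", 12], ["invoice", 3]]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.xlsx"
    write_rows(path, rows)
    round_tripped = read_rows(path)

print(round_tripped)
```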
xlrd:
xlrd is a library for reading data and formatting information from Excel files in the historical .xls format.
PDF¶
OCRmyPDF¶
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.
PDFMiner¶
PDFMiner is a tool for extracting information from PDF documents.
PyPDF4¶
PyPDF4 is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.