Document Processing¶

HTML¶

Boilerpipe¶

Boilerplate Removal and Fulltext Extraction from HTML pages

GitHub

Goose3¶

Goose3 takes any news article or article-type web page and extracts not only the main body of the article but also its metadata and the most probable image candidate.

Docs
GitHub

Newspaper3k¶

Article scraping & curation

Docs
GitHub

Python-Readability¶

Given an HTML document, it pulls out the main body text and cleans it up.

GitHub

Office¶

File Extensions:

Type	MS Office	Open Document
Text	.doc/.docx	.odt
Text Template	.dot	.ott
Master Document	.doc/.docx	.odm
Spreadsheet	.xls/.xlsx	.ods
Spreadsheet Template	.xlt	.ots
Drawing		.odg
Drawing Template		.otg
Presentation	.ppt/.pptx	.odp
Presentation Template	.pot	.otp
Formula		.odf
Chart		.odc
Database	.mdb	.odb
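
As a rough illustration of how the table above might be used in code, the sketch below maps a file name to its document type and format family using only the standard library. The names `EXTENSION_TYPES` and `classify` are hypothetical and chosen for this example; the mapping assumes the standard extensions, with `.ods` for Open Document spreadsheets and `.xlt` for Excel templates.

```python
from pathlib import Path

# Extension -> (document type, format family), following the table above.
# EXTENSION_TYPES and classify are hypothetical names used for this sketch.
EXTENSION_TYPES = {
    ".doc": ("Text", "MS Office"),
    ".docx": ("Text", "MS Office"),
    ".dot": ("Text Template", "MS Office"),
    ".xls": ("Spreadsheet", "MS Office"),
    ".xlsx": ("Spreadsheet", "MS Office"),
    ".xlt": ("Spreadsheet Template", "MS Office"),
    ".ppt": ("Presentation", "MS Office"),
    ".pptx": ("Presentation", "MS Office"),
    ".pot": ("Presentation Template", "MS Office"),
    ".mdb": ("Database", "MS Office"),
    ".odt": ("Text", "Open Document"),
    ".ott": ("Text Template", "Open Document"),
    ".odm": ("Master Document", "Open Document"),
    ".ods": ("Spreadsheet", "Open Document"),
    ".ots": ("Spreadsheet Template", "Open Document"),
    ".odg": ("Drawing", "Open Document"),
    ".otg": ("Drawing Template", "Open Document"),
    ".odp": ("Presentation", "Open Document"),
    ".otp": ("Presentation Template", "Open Document"),
    ".odf": ("Formula", "Open Document"),
    ".odc": ("Chart", "Open Document"),
    ".odb": ("Database", "Open Document"),
}


def classify(filename):
    """Return (type, family) for a file name, or None if the extension is unknown."""
    suffix = Path(filename).suffix.lower()  # compare case-insensitively
    return EXTENSION_TYPES.get(suffix)
```

Because `.doc` and `.docx` serve both as text and as master documents in MS Office, the sketch reports them simply as "Text"; the extension alone cannot tell the two apart.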

Spreadsheets¶

openpyxl:

openpyxl is a Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files.

Source
Docs
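
A minimal sketch of writing and reading an .xlsx file with openpyxl, assuming the package is installed. The helper names `write_rows` and `read_rows` are hypothetical and exist only for this example; the openpyxl calls used are `Workbook`, `load_workbook`, `append`, `save`, and `iter_rows`.

```python
from openpyxl import Workbook, load_workbook


def write_rows(path, rows):
    """Write an iterable of rows to the first sheet of a new workbook."""
    wb = Workbook()
    ws = wb.active  # a new workbook starts with one active sheet
    for row in rows:
        ws.append(list(row))  # append adds one row below the existing data
    wb.save(path)


def read_rows(path):
    """Read all rows of the active sheet back as a list of tuples of cell values."""
    wb = load_workbook(path)
    ws = wb.active
    # values_only=True yields plain cell values instead of Cell objects
    return [tuple(row) for row in ws.iter_rows(values_only=True)]
```

For example, `write_rows("stock.xlsx", [["name", "qty"], ["apple", 3]])` followed by `read_rows("stock.xlsx")` should round-trip the two rows; the file name here is only a placeholder.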

xlrd:

xlrd is a library for reading data and formatting information from Excel files in the historical .xls format.

GitHub
Docs

Working with Excel Files in Python

PDF¶

OCRmyPDF¶

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

GitHub

PDFMiner¶

PDFMiner is a tool for extracting information from PDF documents.

GitHub
Docs

PyPDF4¶

PyPDF4 is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and retrieve text and metadata from them.

GitHub