Document Processing

HTML

Boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

Goose3

Goose3 takes any news article or article-type web page and extracts not only the main body of the article but also all metadata and the most probable image candidate.
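A minimal sketch of that extraction, assuming the `goose3` package is installed and that `Goose.extract(raw_html=...)` and the article attributes used here (`title`, `cleaned_text`, `meta_description`, `top_image`) behave as in current releases. The helper names `article_to_dict` and `extract_article` are illustrative and not part of Goose3.

```python
def article_to_dict(article):
    """Flatten a Goose3-style article object into a plain dict."""
    image = getattr(article, "top_image", None)
    return {
        "title": getattr(article, "title", None),
        "text": getattr(article, "cleaned_text", None),
        "description": getattr(article, "meta_description", None),
        # top_image is None when no image candidate was found.
        "image_url": getattr(image, "src", None),
    }


def extract_article(raw_html):
    """Run Goose3 on an HTML string and return the flattened result."""
    # Imported lazily so this module still loads when goose3 is missing.
    from goose3 import Goose

    goose = Goose()
    try:
        return article_to_dict(goose.extract(raw_html=raw_html))
    finally:
        goose.close()
```

Returning a plain dict keeps the rest of a pipeline independent of Goose3's own article class.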

Newspaper3k

Article scraping & curation

Python-Readability

Given an HTML document, it pulls out the main body text and cleans it up.
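As a rough illustration, the sketch below assumes the `readability-lxml` package (imported as `readability`) and its `Document` class. `Document.summary()` returns cleaned HTML rather than plain text, so a small standard-library helper strips the remaining tags. The names `strip_tags` and `main_text` are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html_fragment):
    """Reduce an HTML fragment to whitespace-normalized plain text."""
    collector = _TextCollector()
    collector.feed(html_fragment)
    collector.close()
    # Joining with a space keeps adjacent paragraphs apart; it can also
    # split a word that is interrupted by inline markup.
    return " ".join(" ".join(collector.chunks).split())


def main_text(html):
    """Return (title, body text) for an HTML page via python-readability."""
    # Imported lazily so this module still loads without readability-lxml.
    from readability import Document

    doc = Document(html)
    return doc.short_title(), strip_tags(doc.summary())
```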

Office

File Extensions:

Type                   MS Office    Open Document
Text                   .doc/.docx   .odt
Text Template          .dot         .ott
Master Document        .doc/.docx   .odm
Spreadsheet            .xls/.xlsx   .ods
Spreadsheet Template   .xlt         .ots
Drawing                -            .odg
Drawing Template       -            .otg
Presentation           .ppt/.pptx   .odp
Presentation Template  .pot         .otp
Formula                -            .odf
Chart                  -            .odc
Database               .mdb         .odb
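When routing files to the right parser, a lookup keyed by extension is often enough. The sketch below covers a subset of the extensions listed above, using `.ods` for OpenDocument spreadsheets. The names `EXTENSIONS` and `classify` are illustrative.

```python
from pathlib import Path

# (document type, format family) keyed by lower-case file extension.
EXTENSIONS = {
    ".doc": ("Text", "MS Office"),
    ".docx": ("Text", "MS Office"),
    ".odt": ("Text", "Open Document"),
    ".xls": ("Spreadsheet", "MS Office"),
    ".xlsx": ("Spreadsheet", "MS Office"),
    ".ods": ("Spreadsheet", "Open Document"),
    ".ppt": ("Presentation", "MS Office"),
    ".pptx": ("Presentation", "MS Office"),
    ".odp": ("Presentation", "Open Document"),
    ".odg": ("Drawing", "Open Document"),
    ".odf": ("Formula", "Open Document"),
    ".odc": ("Chart", "Open Document"),
    ".mdb": ("Database", "MS Office"),
    ".odb": ("Database", "Open Document"),
}


def classify(path):
    """Return (type, family) for a known extension, or None otherwise."""
    return EXTENSIONS.get(Path(path).suffix.lower())
```

The lookup is case-insensitive, so `classify("Report.DOCX")` and `classify("report.docx")` give the same answer.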

Spreadsheets

openpyxl:

openpyxl is a Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files.

xlrd:

xlrd is a library for reading data and formatting information from Excel files in the historical .xls format.
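Since the two libraries cover different formats, a reader can dispatch on the file extension. The following is a sketch, assuming both `openpyxl` and `xlrd` are installed; `pick_reader` and `read_first_sheet` are hypothetical helper names, and only the first worksheet is read.

```python
from pathlib import Path

# Formats openpyxl handles, per its own description.
OPENPYXL_SUFFIXES = {".xlsx", ".xlsm", ".xltx", ".xltm"}


def pick_reader(path):
    """Name the library suited to the file's extension."""
    suffix = Path(path).suffix.lower()
    if suffix in OPENPYXL_SUFFIXES:
        return "openpyxl"
    if suffix == ".xls":
        return "xlrd"
    raise ValueError(f"unsupported spreadsheet extension: {suffix!r}")


def read_first_sheet(path):
    """Return the first worksheet as a list of rows (lists of cell values)."""
    if pick_reader(path) == "openpyxl":
        # Imported lazily so this module still loads without openpyxl.
        from openpyxl import load_workbook

        workbook = load_workbook(str(path), read_only=True, data_only=True)
        try:
            sheet = workbook.worksheets[0]
            return [list(row) for row in sheet.iter_rows(values_only=True)]
        finally:
            workbook.close()

    import xlrd  # legacy .xls files only

    book = xlrd.open_workbook(str(path))
    sheet = book.sheet_by_index(0)
    return [sheet.row_values(i) for i in range(sheet.nrows)]
```

Passing `data_only=True` asks openpyxl for the cached values of formula cells instead of the formula strings.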

Working with Excel Files in Python

PDF

OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.
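OCRmyPDF is typically driven from the command line. The sketch below wraps that call from Python, assuming the `ocrmypdf` executable is on the `PATH` and that the `--language` and `--skip-text` options are available in the installed version. The function names are illustrative.

```python
import subprocess


def build_ocr_command(src, dst, language="eng", skip_text=True):
    """Build the ocrmypdf argument list without running it."""
    command = ["ocrmypdf", "--language", language]
    if skip_text:
        # Leave pages that already contain text untouched.
        command.append("--skip-text")
    command += [str(src), str(dst)]
    return command


def ocr_pdf(src, dst, **options):
    """Run ocrmypdf and raise CalledProcessError if it fails."""
    subprocess.run(build_ocr_command(src, dst, **options), check=True)
```

Keeping command construction separate from execution makes the options easy to inspect or log before any file is processed.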

PDFMiner

PDFMiner is a tool for extracting information from PDF documents.
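A short sketch of text extraction, assuming the maintained `pdfminer.six` fork, which provides `pdfminer.high_level.extract_text`. The cleanup helper `normalize_text` and the wrapper `pdf_to_text` are hypothetical names added for illustration.

```python
def normalize_text(text):
    """Collapse whitespace runs within lines and drop empty lines."""
    # str.splitlines also breaks on form feeds, which separate pages
    # in the extracted text.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def pdf_to_text(path):
    """Extract the text of a PDF and tidy its whitespace."""
    # Imported lazily so this module still loads without pdfminer.six.
    from pdfminer.high_level import extract_text

    return normalize_text(extract_text(str(path)))
```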

PyPDF4

PyPDF4 is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.
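Splitting can be sketched as follows, assuming PyPDF4 keeps the camelCase API it inherited from PyPDF2 1.x (`PdfFileReader`, `PdfFileWriter`, `getNumPages`, `getPage`, `addPage`). The helpers `page_ranges` and `split_pdf`, and the output naming scheme, are illustrative choices.

```python
from pathlib import Path


def page_ranges(n_pages, chunk_size):
    """Return half-open (start, stop) page index ranges of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, n_pages))
        for start in range(0, n_pages, chunk_size)
    ]


def split_pdf(src, out_dir, chunk_size=1):
    """Write src out as several PDFs of chunk_size pages each."""
    # Imported lazily so this module still loads without PyPDF4.
    from PyPDF4 import PdfFileReader, PdfFileWriter

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    with open(src, "rb") as source:
        reader = PdfFileReader(source)
        for start, stop in page_ranges(reader.getNumPages(), chunk_size):
            writer = PdfFileWriter()
            for index in range(start, stop):
                writer.addPage(reader.getPage(index))
            # File names use 1-based page numbers, e.g. report_1-2.pdf.
            target = out_dir / f"{Path(src).stem}_{start + 1}-{stop}.pdf"
            with open(target, "wb") as out:
                writer.write(out)
            outputs.append(target)
    return outputs
```

The source file stays open while the chunks are written because the reader loads page content on demand.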