Full disclosure, I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for python 3.
Nowadays, it has multiple api's to extract text from a PDF, depending on your needs. Behind the scenes, all of these api's use the same logic for parsing and analyzing the layout.
(All the examples assume your PDF file is called example.pdf)
Commandline
If you want to extract text just once you can use the commandline tool pdf2txt.py:
$ pdf2txt.py example.pdf
High-level api
If you want to extract text (properties) with Python, you can use the high-level api. This approach is the go-to solution if you want to programmatically extract information from a PDF.
from pdfminer.high_level import extract_text
# Extract text from a pdf.
text = extract_text('example.pdf')
# Extract iterable of LTPage objects.
pages = extract_pages('example.pdf')
Composable api
There is also a composable api that gives a lot of flexibility in handling the resulting objects. For example, it allows you to create your own layout algorithm. This method is suggested in the other answers, but I would only recommend this when you need to customize some component.
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
parser = PDFParser(in_file)
doc = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(doc):
interpreter.process_page(page)
print(output_string.getvalue())