Convert CreationTime of PDF to a readable format in Python
S

5

12

I'm working with PDF files in Python and accessing each file's metadata using PDFMiner. I extract the info like this:

from pdfminer.pdfparser import PDFParser, PDFDocument    
fp = open('diveintopython.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize()

print doc.info[0]['CreationDate']
# This prints the value "D:20130501200439+01'00'"

How can I convert D:20130501200439+01'00' into a readable format in Python?

Sharmainesharman answered 12/5, 2013 at 0:39 Comment(4)
That looks readable already? The first eight digits appear to be yyyymmdd, certainly. Just slice it and convert to an integer, e.g. year = int(doc.info[0]['CreationDate'][2:6])Maladroit
Yes, it is yyyymm but not for dd. The actual date of the file is Thursday, May 02, 2013, 3:04:39 AM.Sharmainesharman
@Sharmainesharman Are you sure about that? in what timezone?Ivoryivorywhite
Yup @SanjayManohar, no issue with the timezone. The PDF files are just downloaded from the internet and I am just extracting the creation time of each file.Sharmainesharman
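The gap between the two timestamps in these comments is consistent with a time zone conversion rather than a wrong day. A minimal sketch, assuming Python 3 and assuming the machine that showed "May 02, 2013, 3:04:39 AM" was set to UTC+08:00 (the asker never states the zone, so `viewer_zone` is a guess):

```python
from datetime import datetime, timedelta, timezone

# Value reported by PDFMiner in the question.
raw = "D:20130501200439+01'00'"

# Drop the apostrophes so %z can read the offset as +0100.
created = datetime.strptime(raw.replace("'", ""), "D:%Y%m%d%H%M%S%z")

# Assumption: the viewer's clock runs at UTC+08:00.
viewer_zone = timezone(timedelta(hours=8))
local_view = created.astimezone(viewer_zone)

print(created.isoformat())     # 2013-05-01T20:04:39+01:00
print(local_view.isoformat())  # 2013-05-02T03:04:39+08:00
```

Both values describe the same instant; only the offset used for display differs.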
L
7

Is "+01'00'" the timezone information? Not taking that into account, you can create a datetime object as follows...

>>> from time import mktime, strptime
>>> from datetime import datetime
...
>>> datestring = doc.info[0]['CreationDate'][2:-7]
>>> ts = strptime(datestring, "%Y%m%d%H%M%S")
>>> dt = datetime.fromtimestamp(mktime(ts))
>>> dt
datetime.datetime(2013, 5, 1, 20, 4, 39)
Lolitaloll answered 12/5, 2013 at 1:6 Comment(1)
Good one, I tried this with other PDF files and got the yyyymmdd accurate but not the time, don't need time anyway.Sharmainesharman
C
10

I found the format documented here. I needed to cope with the timezones too because I have 160k documents from all over to deal with. Here is my full solution:

import datetime
import re
from dateutil.tz import tzutc, tzoffset


pdf_date_pattern = re.compile(''.join([
    r"(D:)?",
    r"(?P<year>\d\d\d\d)",
    r"(?P<month>\d\d)",
    r"(?P<day>\d\d)",
    r"(?P<hour>\d\d)",
    r"(?P<minute>\d\d)",
    r"(?P<second>\d\d)",
    r"(?P<tz_offset>[-+zZ])?",  # '-' goes first so it is a literal, not a '+'-to-'z' range
    r"(?P<tz_hour>\d\d)?",
    r"'?(?P<tz_minute>\d\d)?'?"]))


def transform_date(date_str):
    """
    Convert a pdf date such as "D:20120321183444+07'00'" into a usable datetime
    http://www.verypdf.com/pdfinfoeditor/pdf-date-format.htm
    (D:YYYYMMDDHHmmSSOHH'mm')
    :param date_str: pdf date string
    :return: datetime object
    """
    global pdf_date_pattern
    match = re.match(pdf_date_pattern, date_str)
    if match:
        date_info = match.groupdict()

        for k, v in date_info.items():  # transform values (items() works on Python 2 and 3)
            if v is None:
                pass
            elif k == 'tz_offset':
                date_info[k] = v.lower()  # so we can treat Z as z
            else:
                date_info[k] = int(v)

        if date_info['tz_offset'] in ('z', None):  # UTC
            date_info['tzinfo'] = tzutc()
        else:
            multiplier = 1 if date_info['tz_offset'] == '+' else -1
            # A missing hour or minute field counts as zero
            offset_seconds = 3600 * (date_info['tz_hour'] or 0) + 60 * (date_info['tz_minute'] or 0)
            date_info['tzinfo'] = tzoffset(None, multiplier * offset_seconds)

        for k in ('tz_offset', 'tz_hour', 'tz_minute'):  # no longer needed
            del date_info[k]

        return datetime.datetime(**date_info)
Coot answered 7/11, 2014 at 7:58 Comment(1)
I've been looking for something explaining this format. +1 for linking an explanation.Aliber
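For anyone on Python 3 who would rather avoid the dateutil dependency, here is a rough standard-library sketch of the same regex idea. It additionally treats every field after the year as optional, which the PDF date format permits. The names `PDF_DATE` and `parse_pdf_date` are made up for this example, and treating a missing offset as UTC simply follows the answer above; it is a convention, not something the date string tells you.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical pattern: everything after the four-digit year is optional.
PDF_DATE = re.compile(
    r"(?:D:)?"
    r"(?P<year>\d{4})"
    r"(?P<month>\d{2})?"
    r"(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?"
    r"(?P<minute>\d{2})?"
    r"(?P<second>\d{2})?"
    r"(?P<sign>[-+Zz])?"
    r"(?P<tz_hour>\d{2})?'?"
    r"(?P<tz_minute>\d{2})?'?$"
)


def parse_pdf_date(text):
    """Return an aware datetime, or None if text does not match the pattern.

    Out-of-range field values (e.g. month 13) raise ValueError from datetime.
    """
    match = PDF_DATE.match(text.strip())
    if match is None:
        return None
    g = match.groupdict()
    offset = timedelta(hours=int(g["tz_hour"] or 0), minutes=int(g["tz_minute"] or 0))
    if g["sign"] == "-":
        offset = -offset
    # Missing month/day default to 1, missing time fields to 0.
    # A missing offset is treated as UTC (an assumption, see above).
    return datetime(
        int(g["year"]), int(g["month"] or 1), int(g["day"] or 1),
        int(g["hour"] or 0), int(g["minute"] or 0), int(g["second"] or 0),
        tzinfo=timezone(offset),
    )


print(parse_pdf_date("D:20120321183444+07'00'").isoformat())  # 2012-03-21T18:34:44+07:00
print(parse_pdf_date("D:2013").isoformat())                   # 2013-01-01T00:00:00+00:00
```

Returning None for non-matching input keeps a bulk run over many documents going; raising an exception instead would be an equally reasonable choice.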
R
5

Use Python 3's datetime.strptime; just remove the apostrophes first:

from datetime import datetime

creation_date = "D:20130501200439+01'00'"

dt = datetime.strptime(creation_date.replace("'", ""), "D:%Y%m%d%H%M%S%z")

print(repr(dt))
# datetime.datetime(2013, 5, 1, 20, 4, 39, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600)))

print(dt.isoformat())
# 2013-05-01T20:04:39+01:00

Once you have a datetime object, you can format it back to a string however you like for "readable" output; see the strptime/strftime directives.
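As a small illustration of that last step, here are two possible output formats. The format strings are only examples, and `%A`, `%B` and `%p` give English text only under an English or default "C" locale.

```python
from datetime import datetime

creation_date = "D:20130501200439+01'00'"
dt = datetime.strptime(creation_date.replace("'", ""), "D:%Y%m%d%H%M%S%z")

# Long form with weekday and 12-hour clock (locale-dependent names).
readable = dt.strftime("%A, %B %d, %Y, %I:%M:%S %p")

# Compact numeric form that keeps the UTC offset.
compact = dt.strftime("%Y-%m-%d %H:%M:%S %z")

print(readable)  # Wednesday, May 01, 2013, 08:04:39 PM (default locale)
print(compact)   # 2013-05-01 20:04:39 +0100
```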

Receivership answered 16/8, 2021 at 8:20 Comment(0)
R
2

Guess I don't have the rep to comment on Paul Whipp's illustrative answer, but I've amended it to handle a form of the Y2K bug present in some of my old files. The year 2000 was written 19100, so the relevant line of pdf_date_pattern became

r"(?P<year>191\d\d|\d\d\d\d)",

and I added an elif to the transform values loop:

elif k == 'year' and len(v) == 5:
    date_info[k] = int('20' + v[3:])
Rivalry answered 30/11, 2019 at 16:59 Comment(0)
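The same year repair can also be done as a separate pre-processing step before any parser sees the string. This is only a sketch: `Y2K_DATE` and `fix_y2k_year` are hypothetical names, and it assumes the affected files always carry the full MMDDHHmmSS part after the five-digit year, which is what distinguishes "19100" (meaning 2000) from a genuine 1910 date.

```python
import re

# "191" + two digits + exactly ten more digits (MMDDHHmmSS)
# marks a five-digit year such as 19100 for 2000.
Y2K_DATE = re.compile(r"(?P<prefix>D:)?191(?P<yy>\d\d)(?P<rest>\d{10})(?!\d)")


def fix_y2k_year(date_str):
    """Rewrite a 191YY year as 20YY; return other strings unchanged."""
    match = Y2K_DATE.match(date_str)
    if match is None:
        return date_str
    prefix = match.group("prefix") or ""
    tail = date_str[match.end():]  # timezone part, if any
    return prefix + "20" + match.group("yy") + match.group("rest") + tail


print(fix_y2k_year("D:191000101120000+01'00'"))  # D:20000101120000+01'00'
print(fix_y2k_year("D:19100101120000"))          # unchanged: a real 1910 date
```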
C
0

I vote for FObersteiner's answer.

For the lazy ones, the reverse function could be:

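from datetime import datetime

def get_pdf_time_from_datetime(datetime_obj: datetime) -> str:
    dt = datetime_obj.strftime("D:%Y%m%d%H%M%S")
    tz = datetime_obj.strftime("%z")  # e.g. "+0100" or "-0530"; empty for naive datetimes
    if not tz:
        return dt  # no offset known, so emit the date without one
    sign, h, m = tz[0], tz[1:3], tz[3:5]  # keeps negative offsets working too
    result = f"{dt}{sign}{h}'{m}'"
    return result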
Chelsiechelsy answered 5/6 at 13:50 Comment(0)
