How to parse .ttl files with RDFLib?
T

4

14

I have a file in .ttl form. It has 4 attributes/columns containing quadruples of the following form:

  1. (id, student_name, student_address, student_phoneno).
  2. (id, faculty_name, faculty_address, faculty_phoneno).

I know how to parse N-Triples (.nt) files with RDFLib:

from rdflib import Graph
g = Graph()
g.parse("demo.nt", format="nt")

but I am not sure as to how to parse these quadruples.

My intent is to parse and extract all the information pertaining to a particular id. The id can be the same for both student and faculty.

How can I use RDFLib to process these quadruples and use it for aggregation based on id?

Example snippet from .ttl file:

#@ <id1>
<Alice> <USA> <12345>

#@ <id1>
<Jane> <France> <78900>
Tollgate answered 2/3, 2013 at 7:3 Comment(5)
Is the ttl referenced in the question the same as the one referenced by the tag?Cowpoke
I think it's Turtle - Terse RDF Triple LanguageElisabeth
@Elisabeth Yes, you are correct: Turtle, the Terse RDF Triple Language.Tollgate
@KeiraShaw why not just regex?Cowpoke
@SnakesandCoffee Thanks, but I don't understand how I can use regex on it. The ids have the form "#@<id1>". I am new to Python. Can you please explain? Thanks for the reply.Tollgate
B
14

Turtle is a subset of Notation 3 syntax, so rdflib should be able to parse it using format='n3'. Check whether rdflib preserves comments: in your sample the ids are specified in comments (#...). If it does not, and the input format is as simple as shown in your example, then you could parse it manually:

import re
from collections import namedtuple
from itertools import takewhile

Entry = namedtuple('Entry', 'id name address phone')

def get_entries(path):
    with open(path) as file:
        # an entry starts with `#@` line and ends with a blank line
        for line in file:
            if line.startswith('#@'):
                buf = [line]
                buf.extend(takewhile(str.strip, file)) # read until blank line
                yield Entry(*re.findall(r'<([^>]+)>', ''.join(buf)))

print("\n".join(map(str, get_entries('example.ttl'))))

Output:

Entry(id='id1', name='Alice', address='USA', phone='12345')
Entry(id='id1', name='Jane', address='France', phone='78900')

To save entries to a db:

import sqlite3

with sqlite3.connect('example.db') as conn:
    conn.execute('''CREATE TABLE IF NOT EXISTS entries
             (id text, name text, address text, phone text)''')
    conn.executemany('INSERT INTO entries VALUES (?,?,?,?)',
                     get_entries('example.ttl'))
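
If the aggregation you need can be expressed in SQL, the database can do the grouping itself. The following is only a minimal sketch: it uses an in-memory database and two hard-coded rows copied from the question's snippet instead of example.db and get_entries(), and the count and name list are just one assumed example of what "aggregation based on id" might mean.

```python
import sqlite3

# Rows copied from the question's sample; in practice they would come
# from get_entries('example.ttl').
rows = [
    ('id1', 'Alice', 'USA', '12345'),
    ('id1', 'Jane', 'France', '78900'),
]

conn = sqlite3.connect(':memory:')  # throwaway database for the sketch
conn.execute('CREATE TABLE entries (id text, name text, address text, phone text)')
conn.executemany('INSERT INTO entries VALUES (?,?,?,?)', rows)

# One result row per id: the id, how many entries share it, and their names.
summary = conn.execute(
    'SELECT id, COUNT(*), group_concat(name) FROM entries GROUP BY id ORDER BY id'
).fetchall()
conn.close()

for entry_id, count, names in summary:
    print(entry_id, count, names)
```

SQLite does not promise any particular order for the names inside group_concat, so do not rely on it.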

To group by id if you need some postprocessing in Python:

import sqlite3
from itertools import groupby
from operator import itemgetter

with sqlite3.connect('example.db') as c:
    rows = c.execute('SELECT * FROM entries ORDER BY id LIMIT ?', (10,))
    for id, group in groupby(rows, key=itemgetter(0)):
        print("%s:\n\t%s" % (id, "\n\t".join(map(str, group))))

Output:

id1:
    ('id1', 'Alice', 'USA', '12345')
    ('id1', 'Jane', 'France', '78900')
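
If a database is more than you need, the same grouping can be sketched in memory with a dict of lists. The helper name group_by_id is made up for this illustration, and the sample records are hard-coded from the question rather than read from a file.

```python
from collections import defaultdict, namedtuple

# Same record shape as the Entry tuple used above.
Entry = namedtuple('Entry', 'id name address phone')

def group_by_id(entries):
    """Collect entries into a dict keyed by id, keeping input order per id."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.id].append(entry)
    return dict(groups)

# Sample records taken from the question's snippet.
sample = [
    Entry('id1', 'Alice', 'USA', '12345'),
    Entry('id1', 'Jane', 'France', '78900'),
]

grouped = group_by_id(sample)
for key, members in grouped.items():
    print(key, [member.name for member in members])
```

Unlike itertools.groupby, this does not require the input to be sorted by id, but it does hold every entry in memory at once.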
Birth answered 2/3, 2013 at 15:9 Comment(1)
Just use format="ttl", as shown by the next answer below!Highhat
P
6

Looks like turtle is supported at least as of rdflib 5.0.0. I did

from rdflib import Graph
graph = Graph()
graph.parse('myfile.ttl', format='ttl')

This parsed just fine.

Padraic answered 10/7, 2021 at 18:22 Comment(0)
S
0

You can do as Snakes and Coffee suggests, only wrap that function (or its code) in a loop with yield statements. This creates a generator, which can be called iteratively to create the next line's dicts on the fly. Assuming you were going to write these to a csv, for instance, using Snakes' parse_to_dict:

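import re
import csv

outfile = "entries.csv"  # hypothetical output path; use whatever you like
# In Python 3, open CSV files in text mode with newline="" (not "wb")
writer = csv.DictWriter(open(outfile, "w", newline=""), fieldnames=["id", "name", "address", "phone"])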

You can create a generator as a function or with an inline comprehension:

def dict_generator(lines): 
    for line in lines: 
        yield parse_to_dict(line)

--or--

dict_generator = (parse_to_dict(line) for line in lines)

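These are pretty much equivalent, except that the function form must be called first (dict_generator(lines)) to obtain the generator. At this point you can get a dict-parsed line by calling next() on the generator (dict_generator.next() in Python 2), and you'll get one at a time, with no additional RAM thrashing involved.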

If you have 16 gigs of raw data, you might consider making a generator to pull the lines in, too. They're really useful.
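
A rough sketch of that idea follows. Note that parse_to_dict here is a hypothetical stand-in, since the original helper is not shown in this thread, and it assumes all four <...> tokens of a record sit on one line, which is not how the question's sample is laid out. io.StringIO objects stand in for the real input and output files.

```python
import csv
import io
import re

FIELDS = ["id", "name", "address", "phone"]

def parse_to_dict(line):
    # Hypothetical stand-in for the missing helper: maps the four
    # <...> tokens on a line to the field names, in order.
    return dict(zip(FIELDS, re.findall(r'<([^>]+)>', line)))

def read_lines(stream):
    # Yield non-blank lines one at a time instead of loading everything.
    for line in stream:
        if line.strip():
            yield line

def write_csv(lines, out):
    # Pull parsed rows lazily and write each one as soon as it is ready.
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for row in (parse_to_dict(line) for line in lines):
        writer.writerow(row)
        count += 1
    return count

# In-memory stand-ins for the input and output files.
source = io.StringIO(
    "#@ <id1> <Alice> <USA> <12345>\n"
    "\n"
    "#@ <id1> <Jane> <France> <78900>\n"
)
target = io.StringIO()
written = write_csv(read_lines(source), target)
print(written, "rows written")
```

Because both read_lines and the inner generator expression are lazy, only one line is held in memory at a time.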

More info on generators from SO and some docs: What can you use Python generator functions for? http://wiki.python.org/moin/Generators

Shadwell answered 2/3, 2013 at 7:56 Comment(1)
Snakes and Coffee's parse_to_dict code is no longer there, and I forgot what it was intended to do.Tollgate
E
-2

It seems there is currently no library available to parse Turtle - Terse RDF Triple Language.

As you already know the grammar, your best bet is to use PyParsing to first create a grammar and then parse the file.

I would also suggest adapting the following EBNF implementation to your needs.

Elisabeth answered 2/3, 2013 at 7:31 Comment(0)
