How to convert .dat to .csv using python?

G

8

I have a file.dat which looks like:

id       | user_id | venue_id | latitude  | longitude | created_at

---------+---------+----------+-----------+-----------+-----------------

984301   |2041916  |5222      |           |           |2012-04-21 17:39:01

984222   |15824    |5222      |38.8951118 |-77.0363658|2012-04-21 17:43:47

984315   |1764391  |5222      |           |           |2012-04-21 17:37:18

984234   |44652    |5222      |33.800745  |-84.41052  | 2012-04-21 17:43:43

I need to get csv file with deleted empty latitude and longtitude rows, like:

id,user_id,venue_id,latitude,longitude,created_at

984222,15824,5222,38.8951118,-77.0363658,2012-04-21T17:43:47

984234,44652,5222,33.800745,-84.41052,2012-04-21T17:43:43

984291,105054,5222,45.5234515,-122.6762071,2012-04-21T17:39:22

I try to do that, using next code:

with open('file.dat', 'r') as input_file:
    lines = input_file.readlines()
    newLines = []
    for line in lines:
        newLine = line.strip('|').split()
        newLines.append(newLine)

with open('file.csv', 'w') as output_file:
    file_writer = csv.writer(output_file)
    file_writer.writerows(newLines)

But all the same I get a csv file with "|" symbols and empty latitude/longtitude rows. Where is mistake? In general I need to use resulting csv-file in DateFrame, so maybe there is some way to reduce number of actions.

Gymno answered 25/4, 2016 at 15:34 Comment(1)

didn't you meant newLine = line.strip().split('|') instead – Welborn 25/4, 2016 at 15:40

C

11

str.strip() removes leading and trailing characters from a string.
You want to split the lines on "|", then strip each element of the resulting list:

import csv

with open('file.dat') as dat_file, open('file.csv', 'w') as csv_file:
    csv_writer = csv.writer(csv_file)

    for line in dat_file:
        row = [field.strip() for field in line.split('|')]
        if len(row) == 6 and row[3] and row[4]:
            csv_writer.writerow(row)

Chibouk answered 25/4, 2016 at 15:42 Comment(0)

C

5

Use this:

data = pd.read_csv('file.dat', sep='|', header=0, skipinitialspace=True)
data.dropna(inplace=True)

Cadmium answered 20/10, 2017 at 14:25 Comment(1)

data.dropna(inplace=True) drop all rows with NaN values. For delete header "separator" better use data.drop(0) – Roe 28/5, 2020 at 7:12

F

2

I used standard python features without pre-processing data. I got an idea from one of the previous answers and improved it. If data headers contain spaces (it is often the situation in CSV) we should determine column names by ourselves and skip line 1 with headers. After that, we can remove NaN values only by specific columns.

data = pd.read_csv("checkins.dat", sep='|', header=None, skiprows=1,
                   low_memory = False, skipinitialspace=True,
                   names=['id','user_id','venue_id','latitude','longitude','created_at'])
data.dropna(subset=['latitude', 'longitude'], inplace = True)

Forrestforrester answered 20/10, 2020 at 6:25 Comment(0)

N

1

Using split() without parameters will result in splitting after a space example "test1 test2".split() results in ["test1", "test2"]

instead, try this:

newLine = line.split("|")

Noakes answered 25/4, 2016 at 15:41 Comment(0)

G

1

Maybe it's better to use a map() function instead of list comprehensions as it must be working faster. Also writing a csv-file is easy with csv module.

import csv
with open('file.dat', 'r') as fin:
with open('file.csv', 'w') as fout:
    for line in fin:
        newline = map(str.strip, line.split('|'))
        if len(newline) == 6 and newline[3] and newline[4]:
            csv.writer(fout).writerow(newline)

Gravimetric answered 19/7, 2016 at 6:58 Comment(4)

Please edit with more information. Code-only and "try this" answers are discouraged, because they contain no searchable content, and don't explain why someone should "try this". – Propaedeutic 19/7, 2016 at 7:20

1. The syntax is invalid: the block starting from line 3 needs to be indented one level deeper. – Pennyworth 2/4, 2018 at 14:34

2. Line 6 is invalid in Python 3, because map is an interator and it's not possible to get its len. – Pennyworth 2/4, 2018 at 14:37

3. In Python 3, opening a file for writing CSV requires passing newline=''. Otherwise, the output will be interspersed with empty lines. – Pennyworth 2/4, 2018 at 14:39

S

1

with open("filename.dat") as f:
    with open("filename.csv", "w") as f1:
        for line in f:
            f1.write(line)

This can be used to convert a .dat file to .csv file

Sentry answered 12/4, 2018 at 0:24 Comment(0)

K

0

Combining previous answers I wrote my code for Python 2.7:

import csv

lat_index = 3
lon_index = 4
fields_num = 6
csv_counter = 0

with open("checkins.dat") as dat_file:
    with open("checkins.csv", "w") as csv_file:
        csv_writer = csv.writer(csv_file)
        for dat_line in dat_file:
            new_line = map(str.strip, dat_line.split('|'))
            if len(new_line) == fields_num and new_line[lat_index] and new_line[lon_index]:
                csv_writer.writerow(new_line)
                csv_counter += 1

print("Done. Total rows written: {:,}".format(csv_counter))

Karlis answered 24/11, 2018 at 5:58 Comment(0)

I

0

This has worked for me:

data = pd.read_csv('file.dat',sep='::',names=list_for_names_of_columns)

Incomplete answered 12/8, 2019 at 8:15 Comment(0)

Recommended topics

Hot tags