Python script to concatenate all the files in the directory into one file

I have written the following script to concatenate all the files in the directory into a single file.

Can this be optimized, in terms of

  1. idiomatic Python

  2. time

Here is the snippet:

import time, glob

outfilename = 'all_' + str((int(time.time()))) + ".txt"

filenames = glob.glob('*.txt')

with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")
Cottonwood answered 19/7, 2013 at 15:8 Comment(2)
optimized for time? use "cat *.txt > all.txt" :) – Coeternal
possible duplicate of combine multiple text files into one text file using python – Breeden

Use shutil.copyfileobj to copy data:

import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

shutil reads from the readfile object in chunks, writing them to the outfile file object directly. Do not use readline() or an iteration buffer, since you do not need the overhead of finding line endings.
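
A rough sketch of what copyfileobj does under the hood (simplified; the actual chunk size varies across Python versions):

def copyfileobj_sketch(fsrc, fdst, length=64 * 1024):
    # read one fixed-size chunk at a time and pass it straight through
    while True:
        chunk = fsrc.read(length)
        if not chunk:   # an empty read means end of file
            break
        fdst.write(chunk)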

Use the same mode for reading and writing; this is especially important on Python 3. I've used binary mode for both here.
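
Putting it together with the question's timestamped output name gives a complete version (a sketch; the "\n\n" separators from the question are dropped, since copyfileobj copies the bytes verbatim):

import glob, shutil, time

outfilename = 'all_' + str(int(time.time())) + '.txt'

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            continue  # don't copy the output into the output
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)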

Healion answered 19/7, 2013 at 15:11 Comment(4)
Why is it important to use the same mode for writing and reading? – Athanasian
@JuanDavid: because shutil will use .read() calls on one file object and .write() calls on the other, passing the read data from one to the other. If one is open in binary mode and the other in text mode, you are passing through incompatible data (binary data to a text file, or text data to a binary file). – Healion
The code here doesn't work with CSV files, dang. But it did give me some good inspiration for how to accomplish this with CSV. I'm relatively new to Python. – Myrta
@bretts: the contents of the files shouldn't matter; perhaps your CSV files are missing the last newline separator, or are using different delimiter formats? – Healion

Using Python 2.7, I did some "benchmark" testing of

outfile.write(infile.read())

vs

shutil.copyfileobj(readfile, outfile)

I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined size of ~2.6 GB. In both methods, normal (text) read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.

When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant:

outfile.write, binary mode: 43 seconds, on average.

shutil.copyfileobj, normal mode: 27 seconds, on average.

The outfile had a final size of 2620 MB in normal read mode vs 2578 MB in binary read mode.
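
The timing code isn't shown in the answer; a hypothetical harness along these lines could produce such numbers (the function names and output filename are illustrative, not the answerer's):

import glob, shutil, time

def concat_read_write(filenames, outname, mode):
    # read each input fully, then write it out in one call
    with open(outname, 'w' + mode) as outfile:
        for fname in filenames:
            with open(fname, 'r' + mode) as infile:
                outfile.write(infile.read())

def concat_copyfileobj(filenames, outname, mode):
    # let shutil stream the data between the files in chunks
    with open(outname, 'w' + mode) as outfile:
        for fname in filenames:
            with open(fname, 'r' + mode) as infile:
                shutil.copyfileobj(infile, outfile)

filenames = glob.glob('*.txt')
for mode in ('', 'b'):  # '' = normal (text) mode, 'b' = binary mode
    for func in (concat_read_write, concat_copyfileobj):
        start = time.time()
        func(filenames, 'combined_output.txt', mode)
        print(func.__name__, mode or 'text', time.time() - start)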

Bellona answered 27/10, 2015 at 10:57 Comment(2)
Interesting. What platform was that? – Hibbler
I roughly work on two platforms: Linux Fedora 16, different nodes, or Windows 7 Enterprise SP1 with an Intel Core(TM)2 Quad CPU Q9550, 2.83 GHz. I think it was the latter. – Bellona

You can iterate over the lines of a file object directly, without reading the whole thing into memory:

with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)
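
In the context of the question's script, the whole thing becomes (a sketch reusing the question's timestamped name and "\n\n" separator):

import glob, time

outfilename = 'all_' + str(int(time.time())) + '.txt'

with open(outfilename, 'w') as outfile:
    for fname in glob.glob('*.txt'):
        if fname == outfilename:
            continue  # skip the output file itself
        with open(fname, 'r') as readfile:
            for line in readfile:  # streams line by line, constant memory
                outfile.write(line)
            outfile.write('\n\n')  # separator between files, as in the question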
Mcspadden answered 19/7, 2013 at 15:11 Comment(0)

I was curious to check more on performance, so I built on the answers of Martijn Pieters and Stephen Miller.

I tried binary and text modes, with and without shutil, merging 270 files.

Text mode -

def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())

Binary mode -

def using_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())

Running times for binary mode -

Shutil - 20.161773920059204
Normal - 17.327500820159912

Running times for text mode -

Shutil - 20.47757601737976
Normal - 13.718038082122803

It looks like shutil performs about the same in both modes, while the plain read/write approach is noticeably faster in text mode than in binary mode.

OS: macOS 10.14 Mojave, on a 2017 MacBook Air.
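
The measurement code isn't included above; one hypothetical way to obtain timings in this format, using the four functions as defined (the output filename is arbitrary):

import timeit

for func in (using_shutil_text, without_shutil_text,
             using_shutil_binary, without_shutil_binary):
    # number=1: each call merges all 270 files once
    print(func.__name__, timeit.timeit(lambda: func('merged.txt'), number=1))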

Jello answered 22/3, 2019 at 8:37 Comment(0)

No need to use that many variables.

with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")
Arlinda answered 19/7, 2013 at 15:15 Comment(0)

The fileinput module provides a natural way to iterate over multiple files:

import fileinput, glob

with open(outfilename, 'w') as outfile:
    for line in fileinput.input(glob.glob("*.txt")):
        outfile.write(line)
Shirlyshiroma answered 19/7, 2013 at 15:15 Comment(2)
This would be even better if it didn't confine itself to reading a line at a time. – Hike
@Marcin, that is correct. I used to think this was a cool solution - until I saw Martijn Pieter's shutil.copyfileobj humdinger. – Shirlyshiroma

I found the above answers a bit difficult to implement; here's a simplified version that concatenates all CSV files using pandas:

import glob
import os

import pandas as pd

path = r"C:\Users\anirudh.sharma\OneDrive - Nihilent Limited\Schneider\Forecast\ARIMA\Target_may23"

all_files = glob.glob(os.path.join(path, "*.csv"))
df_from_each_file = [pd.read_csv(f) for f in all_files]
concatenated_df = pd.concat(df_from_each_file, ignore_index=False)

All you need to do is change the path.
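
To end up with a single file on disk, as in the rest of the thread, one more line writes the combined frame out (the name combined.csv is an arbitrary choice):

# write the combined DataFrame to one CSV file
concatenated_df.to_csv(os.path.join(path, "combined.csv"), index=False)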

Inspector answered 31/1 at 9:13 Comment(0)
