How do I concatenate text files in Python?
Asked Answered
E

12

233

I have a list of 20 file names, like ['file1.txt', 'file2.txt', ...]. I want to write a Python script to concatenate these files into a new file. I could open each file by f = open(...), read line by line by calling f.readline(), and write each line into that new file. It doesn't seem very "elegant" to me, especially the part where I have to read/write line by line.

Is there a more "elegant" way to do this in Python?

Enright answered 28/11, 2012 at 19:54 Comment(5)
It's not Python, but in shell scripting you could do something like cat file1.txt file2.txt file3.txt ... > output.txt. In Python, if you don't like readline(), there is always readlines() or simply read().Grizel
@Grizel simply run the cat file1.txt file2.txt file3.txt command using the subprocess module and you're done. But I am not sure if cat works on Windows.Nine
As a note, the way you describe is a terrible way to read a file. Use the with statement to ensure your files are closed properly, and iterate over the file to get lines, rather than using f.readline().Elope
@Grizel cat doesn't work when the text file is unicode.Discretionary
Actual analysis waymoot.org/home/python_stringCaras
A
323

This should do it

For large files:

filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

For small files:

filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())

… and another interesting one that I thought of:

import itertools

filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    # itertools.imap is Python 2 only; on Python 3 use the builtin map instead
    for line in itertools.chain.from_iterable(itertools.imap(open, filenames)):
        outfile.write(line)

Sadly, this last method leaves a few open file descriptors, which the GC should take care of anyway. I just thought it was interesting.
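A variant of that last idea (my own sketch, Python 3, hypothetical file names) that closes each file as soon as it has been drained, so nothing is left dangling for the GC:

def chained_lines(fnames):
    for fname in fnames:
        with open(fname) as infile:   # each file is closed before the next one opens
            yield from infile

filenames = ['file1.txt', 'file2.txt']
with open('combined.txt', 'w') as outfile:
    outfile.writelines(chained_lines(filenames))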

Antony answered 28/11, 2012 at 19:57 Comment(20)
This will, for large files, be very memory inefficient.Elope
@Antony I don't see this code as very time efficient for files that aren't large and can be read entirely at once. In my opinion, it's impossible to write code that is as efficient for big files as for not-big filesDigest
@eyquem: Have you actually done performance tests or profiling on any of these solutions, or are you just guessing what's going to be fast based on your intuitions of how computers work?Berty
@inspectorG4dget: I wasn't asking you, I was asking eyquem, who complained that your solution wasn't going to be efficient. I'm willing to bet it's more than efficient enough for the OP's use case, and for whatever use case eyquem has in mind. If he thinks it isn't, it's his responsibility to prove that before demanding that you optimize it.Berty
@Berty I haven't tested running times of programs recently, but that's not intuition. Some months ago I was fond of testing run times for plenty of programs and I keep in mind the results. Though, there are sometimes surprises in the results, and you are right, the best would be to do the tests. I haven't enough motivation for that presently.Digest
You could just read a certain size until the end: f.read(4096) # or whatever sizeForebear
what are we considering a large file to be?Orva
@dee: a file so large that it's contents don't fit into main memoryAntony
why would you decode and re-encode the whole thing? and search for newlines and all the unnecessary stuff when all that’s required is concatenating the files. the shutil.copyfileobj answer below will be much faster.Botanist
I actually did some brief profiling for small files (i.e. size of each file < 32GB = system memory). The results are: 1) The method for large files is fastest. 2) The method for small files is slower by 20%. 3) Using shutil (Meow's answer) is slower by 150%. 4) Using fileinput did not work. Usual disclaimers apply. CiaoWisecrack
@Antony how to concatenate only first 130 lines of all the files in the filenamesSubjunctive
@GurminderBharani: replace for line in infile with for _, line in zip(range(130), infile)Antony
@Wisecrack I did a "large" text profiling (each file was 5MB). You can reproduce my results by copying and pasting their code and using this as your text. Using filenames = ['big.txt',]*1000 I found that this answer's large-file version averaged ~130s over three runs, while Meow's answer averaged ~60s over three. Meow's solution seemed to be a clear winner for large files.Lactometer
For reference, abarnert's solution (modified to work in Python 2.7) took ~300s.Lactometer
Just to reiterate: this is the wrong answer, shutil.copyfileobj is the right answer.Internationale
Be careful if, for the list filenames, you use something like os.listdir(...). The result is not sorted in alphanumeric order. Use sorted(os.listdir(..))Pilch
@Antony when concatenating files I noted that sometimes the last line from one file gets merged with the first line from the next file and printed on the same line. Is it possible to detect such situations in Python? I guess this happens when a newline is missing. A check of the type if '\n' not in line: print("line"+"\n") somehow solved the problem, but is it possible to detect this situation automatically in Python?Polystyrene
@AlexanderCska: yes, that's possible. It can be done with if not line.endswith('\n'): outfile.write('\n'). But that would require you to add the check for each file-openAntony
Is there a method to use this code, but don't write all file names in the filenames, just one code to load all files of a datafolderNations
@J_Martin: filenames = glob.glob(os.path.join('datafolder', "*"))Antony
B
280

Use shutil.copyfileobj.

It automatically reads the input files chunk by chunk for you, which is more efficient than reading the input files in all at once, and it will work even if some of the input files are too large to fit into memory:

import shutil

with open('output_file.txt','wb') as wfd:
    for f in ['seg1.txt','seg2.txt','seg3.txt']:
        with open(f,'rb') as fd:
            shutil.copyfileobj(fd, wfd)
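If you do want a separator between files (see the comments below about missing end-of-line characters), a sketch of one way to add it; the b'\n' is my addition, not part of the original recipe:

import shutil

with open('output_file.txt', 'wb') as wfd:
    for f in ['seg1.txt', 'seg2.txt', 'seg3.txt']:
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)
        wfd.write(b'\n')   # optional separator; drop it for a byte-for-byte concatenation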
Bloch answered 22/11, 2014 at 12:35 Comment(5)
for i in glob.glob(r'c:/Users/Desktop/folder/putty/*.txt'): well i replaced the for statement to include all the files in directory but my output_file started growing really huge like in 100's of gb in very quick time.Derekderelict
Note that it will merge the last line of each file with the first line of the next file if there are no EOL characters. In my case I got a totally corrupted result after using this code. I added wfd.write(b"\n") after copyfileobj to get a normal resultSamaniego
@Samaniego I would say that is not a pure concatenation in that case, but hey, whatever suits your needs.Clearcole
This is by far the best answer!Oxyacetylene
This is super fast and exactly what I required. Yes, it does not add a new line between one file's end and the next file's start, and that is exactly what I needed, so don't update it :DPorta
B
66

That's exactly what fileinput is for:

import fileinput
with open(outfilename, 'w') as fout, fileinput.input(filenames) as fin:
    for line in fin:
        fout.write(line)

For this use case, it's really not much simpler than just iterating over the files manually, but in other cases, having a single iterator that iterates over all of the files as if they were a single file is very handy. (Also, the fact that fileinput closes each file as soon as it's done means there's no need to with or close each one, but that's just a one-line savings, not that big of a deal.)

There are some other nifty features in fileinput, like the ability to do in-place modifications of files just by filtering each line.
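For example, a minimal sketch of that in-place feature (Python 3; the file name is hypothetical); whatever you print replaces the corresponding line in the file itself:

import fileinput

with fileinput.input('data.txt', inplace=True) as f:
    for line in f:
        print(line.rstrip('\n').upper())   # each printed line overwrites the original line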


As noted in the comments, and discussed in another post, fileinput for Python 2.7 will not work as indicated. Here is a slight modification to make the code Python 2.7 compliant:

with open(outfilename, 'w') as fout:
    fin = fileinput.input(filenames)
    for line in fin:
        fout.write(line)
    fin.close()
Berty answered 28/11, 2012 at 20:7 Comment(10)
@Lattyware: I think most people who learn about fileinput are told that it's a way to turn a simple sys.argv (or what's left as args after optparse/etc.) into a big virtual file for trivial scripts, and don't think to use it for anything else (i.e., when the list isn't command-line args). Or they do learn, but then forget—I keep re-discovering it every year or two…Berty
@abament I think for line in fileinput.input() isn't the best way to choose in this particular case: the OP wants to concatenate files, not read them line by line which is a theoretically longer process to executeDigest
@eyquem: It's not a longer process to execute. As you yourself pointed out, line-based solutions don't read one character at a time; they read in chunks and pull lines out of a buffer. The I/O time will completely swamp the line-parsing time, so as long as the implementor didn't do something horribly stupid in the buffering, it will be just as fast (and possibly even faster than trying to guess at a good buffer size yourself, if you think 10000 is a good choice).Berty
@Berty NO, 10000 isn't a good choice. It is indeed a very bad choice because it isn't a power of 2 and it is ridiculously small. Better sizes would be 2097152 (2**21), 16777216 (2**24) or even 134217728 (2**27), why not? 128 MB is nothing in a RAM of 4 GB.Digest
Huge buffers really don't help much. In fact, if you're reading more than your OS's typical readahead cache size, you'll end up waiting around for data when you could be writing. Plus, run a dozen apps that all think 128MB is nothing, and suddenly your system is thrashing swap and slowing to a crawl. It really is very easy to test this stuff, so try it and see.Berty
@Berty Yes yes yes, but a learned guy who understands what he does won't trigger such a program while running 2**4 other applications. - You're right, I should better test - And oh I understand, by your use of the 'readahead' word, that you are a Linux user, aren't you ? That's why you know more about the innards than commonly, I guessDigest
@eyquem: Actually, I'm on a Mac. I'm currently running 147 processes, most of which are using at least 64MB of VM. In fact, it's very hard not to be running a whole lot more than 2**4 processes on any modern Windows, Mac, Linux system… or, for that matter, iOS or Android phone.Berty
@Berty Gargl. There are presently 33 processes in my computer, only 7 among them being more than 20 000 KB....Digest
You probably don't have the "show all processes" (or whatever it's called in current Windows) enabled. But anyway, a learned guy like you or me is always running well over 2**4 other applications when he triggers anything.Berty
Example code not quite valid for Python 2.7.10 and later: #30835590Cabasset
P
9
outfile.write(infile.read()) # time: 2.1085190773010254s
shutil.copyfileobj(fd, wfd, 1024*1024*10) # time: 0.60599684715271s

A simple benchmark shows that shutil performs better.
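For what it's worth, a rough sketch of how such a timing could be reproduced (file names, chunk size and the absolute numbers are assumptions; results will vary by machine):

import shutil
import time

filenames = ['file1.txt', 'file2.txt', 'file3.txt']

start = time.time()
with open('merged.txt', 'wb') as wfd:
    for fname in filenames:
        with open(fname, 'rb') as fd:
            shutil.copyfileobj(fd, wfd, 1024 * 1024 * 10)   # 10 MiB buffer
print('shutil time:', time.time() - start)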

Panicstricken answered 26/4, 2018 at 8:10 Comment(0)
A
8

I don't know about elegance, but this works:

import glob
import os

for f in glob.glob("file*.txt"):
    os.system("cat " + f + " >> OutFile.txt")
Aarau answered 3/6, 2014 at 1:39 Comment(3)
you can even avoid the loop: import os; os.system("cat file*.txt >> OutFile.txt")Wotton
not cross-platform and will break for file names with spaces in themBotanist
This is insecure; also, cat can take a list of files, so no need to repeatedly call it. You can easily make it safe by calling subprocess.check_call instead of os.systemFretted
M
7

What's wrong with UNIX commands? (given you're not working on Windows):

ls | xargs cat | tee output.txt does the job (you can call it from Python with subprocess if you want)
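If you do drive it from Python, a sketch of one way to call the (corrected) pipeline; it assumes a Unix-like shell with cat and tee available and file names without spaces:

import subprocess

subprocess.run("cat file*.txt | tee output.txt > /dev/null", shell=True, check=True)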

Midget answered 28/11, 2012 at 20:0 Comment(6)
because this is a question about python.Weismannism
Nothing wrong in general, but this answer is broken (don't pass the output of ls to xargs, just pass the list of files to cat directly: cat * | tee output.txt).Fretted
If it can insert filename as well that would be great.Appel
@Appel To specify input file names, you can use cat file1.txt file2.txt | tee output.txtAnti
... and you can disable sending to stdout (printing in Terminal) by adding 1> /dev/null to the end of the commandAnti
How will this solution hold up if the files have spaces or (other) obscure characters in their names?Clearcole
G
6

If you have a lot of files in the directory then glob2 might be a better option to generate a list of filenames rather than writing them by hand.

import glob2

filenames = glob2.glob('*.txt')  # list of all .txt files in the directory

with open('outfile.txt', 'w') as f:
    for file in filenames:
        with open(file) as infile:
            f.write(infile.read()+'\n')
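As a comment below points out, the standard-library glob module can build the same list without a third-party dependency; a minimal sketch:

import glob

filenames = sorted(glob.glob('*.txt'))   # sorted, so the concatenation order is deterministic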
Gibbs answered 6/5, 2017 at 9:45 Comment(2)
What does this have to do with the question? Why use glob2 instead of the glob module, or the globbing functionality in pathlib?Thomasinathomasine
very good and complete Python code. Works brilliant. Thanks,Robinetta
A
3

An alternative to @inspectorG4dget's answer (the best answer to date, 29-03-2016). I tested with 3 files of 436MB.

@inspectorG4dget's solution: 162 seconds

The following solution: 125 seconds

from subprocess import Popen

filenames = ['file1.txt', 'file2.txt', 'file3.txt']
fbatch = open('batch.bat', 'w')
cmd = "type "                     # Windows' type command prints each file's contents
for f in filenames:
    cmd += f + " "
fbatch.write(cmd + " > file4results.txt")
fbatch.close()
p = Popen("batch.bat", cwd=r"Drive:\Path\to\folder")
stdout, stderr = p.communicate()

The idea is to create a batch file and execute it, taking advantage of "good old technology". It's semi-Python but works faster. Works on Windows.

Amherst answered 29/3, 2016 at 3:53 Comment(0)
E
2

Check out the .read() method of the File object:

http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects

You could do something like:

concat = ""
for file in files:
    concat += open(file).read()

or a more 'elegant' python-way:

concat = ''.join([open(f).read() for f in files])

which, according to this article: http://www.skymind.com/~ocrow/python_string/ would also be the fastest.

Enriquetaenriquez answered 28/11, 2012 at 20:4 Comment(1)
This will produce a giant string, which, depending on the size of the files, could be larger than the available memory. As Python provides easy lazy access to files, it's a bad idea.Elope
D
2

If the files are not gigantic:

with open('newfile.txt','wb') as newf:
    for filename in list_of_files:
        with open(filename,'rb') as hf:
            newf.write(hf.read())
            # newf.write(b'\n\n\n')   if you want to introduce some blank lines
            # between the contents of the copied files (bytes, since the file
            # is opened in binary mode)

If the files are too big to be read entirely and held in RAM, the algorithm must be a little different: read each file to be copied in a loop, by chunks of fixed length, using read(10000) for example.
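For instance, a sketch of that chunked loop (the chunk size here is an arbitrary choice, and list_of_files is the same list as above):

with open('newfile.txt', 'wb') as newf:
    for filename in list_of_files:
        with open(filename, 'rb') as hf:
            while True:
                chunk = hf.read(1024 * 1024)   # read 1 MiB at a time
                if not chunk:
                    break
                newf.write(chunk)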

Digest answered 28/11, 2012 at 20:4 Comment(17)
@Lattyware Because I'm quite sure the execution is faster. By the way, in fact, even when the code orders to read a file line by line, the file is read by chunks, that are put in cache in which each line is then read one after the other. The better procedure would be to put the length of read chunk equal to the size of the cache. But I don't know how to determine this cache's size.Digest
That's the implementation in CPython, but none of that is guaranteed. Optimizing like that is a bad idea as while it may be effective on some systems, it may not on others.Elope
Yes, of course line-by-line reading is buffered. That's exactly why it's not that much slower. (In fact, in some cases, it may even be slightly faster, because whoever ported Python to your platform chose a much better chunk size than 10000.) If the performance of this really matters, you'll have to profile different implementations. But 99.99…% of the time, either way is more than fast enough, or the actual disk I/O is the slow part and it doesn't matter what your code does.Berty
Also, if you really do need to manually optimize the buffering, you'll want to use os.open and os.read, because plain open uses Python's wrappers around C's stdio, which means either 1 or 2 extra buffers getting in your way.Berty
PS, as for why 10000 is bad: Your files are probably on a disk, with blocks that are some power of two bytes long. Let's say they're 4096 bytes. So, reading 10000 bytes means reading two blocks, then part of the next. Reading another 10000 means reading the rest of the next, then two blocks, then part of the next. Count up how many partial or complete block reads you have, and you're wasting a lot of time. Fortunately, the Python, stdio, filesystem, and kernel buffering and caching will hide most of these problems from you, but why try to create them in the first place?Berty
@Berty You're perfectly right concerning the size 10000 being bad. I wrote it too rapidly, though I knew that I is better to choose a size being a power of 2, but I had forgotten why exactly. As you said, I keep re-learning things that I already knew once.Digest
@Berty Another point is that I don't know if transfers of data are controlled more by Python's implementation or by the OS. And I wonder how one can know that.Digest
@Berty I did a mistake when I wrote "cache" instead of "buffer". When refering to the buffering process, I meant that the reason of this process is that it gives more efficiency in the reading of data. And that what is true for reading lines one after the other in a buffer and re-writing them one after the other on disk, is also true at a higher level for reading chunks one after the other and putting them in RAM one after the other before re-writing them from RAM to disk one after the other. - The point being, as you said, that the reading and writing on disks are the slowest processDigest
@eyquem: You can know by profiling, debugging, and/or reading the code (assuming your OS is open source, at least in the relevant parts—Python of course is). If it doesn't seem worth the effort to do any of those things, you probably don't really need to know the answer. (Usually, even if you need to optimize your code, you care more about profiling your code than what's happening under the covers. But occasionally you do need to know what's happening under the covers to figure it out—or you're just curious and motivated.)Berty
@Berty So, it seems to me that a code that reads large (but not gigantic; say 3 MB, why not) chunks of a gigantic (5 GB) file one after the other, and re-write the chunks one after the other on disk will be faster than a code reading line after line. Because in this last case, what will happen is in fact that the file will be read by chunks of the buffer's size and that this will be equivalent to read and re-write more chunks (of buffer's size), going from disk to buffer then to disk, while reading by big chunks will do less movements between I/O because the data are temporarily RAm-storedDigest
@Berty Wow, it's difficult to me to express such complex things in english. Excuse my poor english. And maybe I have false ideas concerning all that ? I don't pretend to be a specialist. I would appreciate links to in-depth explanations concerning this complex subject.Digest
@Berty I am curious and motivated to learn about the innards, yes. And it seems to me that optimization of a code isn't possible if one doesn't know a little about what is under the hood.Digest
@eyquem: Again, both the reading and writing are buffered. So when you call outf.write(line), it doesn't go rewrite a disk block just to write those 80 characters; those 80 characters go into a buffer, and if the buffer's now over, say, 8KB, the first 8KB gets written. If 3MB were faster than 8KB, they'd use a 3MB buffer instead. So the only difference between reading and writing 3MB chunks is that you also need to do a bit of RAM work and string processing—which is much faster than disk, so it usually doesn't matter.Berty
@Berty In fact, my subconscious idea is that when Python/OS have to write a 3 MB chunk, the process isn't going through the buffer, the data are sent directly from RAM on disk in a unique transfer and writing. Maybe am I wrong ?Digest
@eyquem: Python is not calling a routine to DMA 3MB of RAM to physical disk blocks. When you use Python's file objects, they're either wrapped around C stdio, or internally buffered in a similar way. Even when it does actual reads and writes to file descriptors, those will be cached by the OS. And modern disk drives have their own caches too, not to mention that the blocks aren't even real physical blocks anymore. Unless you're writing for an Apple ][ or something, this just isn't how things work.Berty
@Berty Thank you. I think you know more than me about the subject. I don't remind to have read this kind of explanation. How do you know all that ? I would like to study this subject, but I don't know what to consult: explanations on OS, on C, on Python ...? And where to find them ? It seems to me that people are not interested by precise innards, in general, I find it is a pity.Digest
@eyquem: Well, I originally learned by using systems like the Apple ][ that were so simple you actually could understand all the details, and taking an OS class in college probably helped, but mainly it's just spending decades making stupid mistakes and either being corrected or figuring out the right answer… Nowadays they have free online tutorials and even course materials for just about everything, which hopefully makes things a lot easier, but I wouldn't know where to start.Berty
K
0
import os

def concatFiles():
    path = 'input/'
    files = os.listdir(path)
    for idx, infile in enumerate(files):
        print("File #" + str(idx) + "  " + infile)
    concat = ''.join([open(path + f).read() for f in files])
    with open("output_concatFile.txt", "w") as fo:
        fo.write(concat)   # write only the concatenated text, not the path

if __name__ == "__main__":
    concatFiles()
Kubetz answered 28/9, 2013 at 0:3 Comment(0)
D
-2
import os

files = os.listdir()
print(files)
print('#', tuple(files))
name = input('Enter the inclusive file name: ')
exten = input('Enter the type(extension): ')
filename = name + '.' + exten
output_file = open(filename, 'w+')
for i in files:
    print(i)
    f_i = open(i, 'r')
    for x in f_i:
        output_file.write(x)   # was "outfile.write(x)", a NameError in the original
    f_i.close()
output_file.close()
Doukhobor answered 3/12, 2019 at 16:18 Comment(0)
