I have 200 separate CSV files here, named from SH (1) to SH (200). I want to merge them into a single CSV file. How can I do that?
As ghostdog74 said, but this time with headers:
with open("out.csv", "ab") as fout:
# First file:
with open("sh1.csv", "rb") as f:
fout.writelines(f)
# Now the rest:
for num in range(2, 201):
with open("sh" + str(num) + ".csv", "rb") as f:
next(f) # Skip the header, portably
fout.writelines(f)
It's f.__next__() instead of f.next() in Python 3.x. – Voroshilov
Better to use the with open syntax and avoid manually .close()ing the files. – Amaral
What's the difference between f.next() and f.__next__()? When I use the former, I get '_io.TextIOWrapper' object has no attribute 'next'. – Leighleigha
Before fout.write(line) I would do: if line[-1] != '\n': line += '\n'. – Raybin
Better to use next(f), which works on every version of Python from 2.6 onwards. – Bolo
Python 3 renamed .next (which was not properly reserved) to .__next__ (which, thanks to beginning and ending in __, can't be reused for other purposes by user code without violating language requirements). The portable approach is to use neither, and just call the top-level function next on the iterator, making it next(f). – Bolo
You can just use sed 1d sh*.csv > merged.csv.
Sometimes you don't even have to use Python!
for f in mydir/*.csv; do sed 1d "$f" >> merged.csv; done – Backstop
I tried sed 1d *.csv > merged.csv and it ran for a while before my computer crashed because there was no storage space left! :( – Skiba
Use the accepted Stack Overflow answer to create a list of the CSV files that you want to append (a glob-based sketch for building that list follows the code), and then run this code:
import pandas as pd
combined_csv = pd.concat( [ pd.read_csv(f) for f in filenames ] )
And if you want to export it to a single CSV file, use this:
combined_csv.to_csv("combined_csv.csv", index=False)
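For reference, one way to build that filenames list is with glob (the SH*.csv pattern is an assumption based on the question's naming):
import glob

filenames = sorted(glob.glob("SH*.csv"))  # assumed pattern; adjust to your actual file names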
fout = open("out.csv", "a")
for num in range(1, 201):
    for line in open("sh" + str(num) + ".csv"):
        fout.write(line)
fout.close()
It depends what you mean by "merging"—do they have the same columns? Do they have headers? For example, if they all have the same columns and no headers, simple concatenation is sufficient (open the destination file for writing, loop over the sources opening each for reading, use shutil.copyfileobj from the open-for-reading source into the open-for-writing destination, close the source, keep looping—use the with statement to do the closing on your behalf). If they have the same columns but also headers, you'll need a readline on each source file except the first, after you open it for reading and before you copy it into the destination, to skip the header line.
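A minimal sketch of that concatenation-with-headers approach, assuming the sh1.csv ... sh200.csv naming used in other answers here (the output file name is also an assumption):
import shutil

with open("merged.csv", "w") as dest:
    for num in range(1, 201):
        with open("sh" + str(num) + ".csv") as src:
            if num > 1:
                src.readline()  # skip the header line on every file but the first
            shutil.copyfileobj(src, dest)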
If the CSV files don't all have the same columns then you need to define in what sense you're "merging" them (like a SQL JOIN? or "horizontally" if they all have the same number of lines? Etc., etc.)—it's hard for us to guess what you mean in that case.
I'm just going to throw another code example into the basket:
from glob import glob

with open('singleDataFile.csv', 'a') as singleFile:
    for csvFile in glob('*.csv'):
        for line in open(csvFile, 'r'):
            singleFile.write(line)
It is quite easy to combine all files in a directory and merge them:
import glob
import csv

# Note: this is Python 2 code (print statement, fin.next(), binary file modes).
# Open result file
with open('output.txt', 'wb') as fout:
    wout = csv.writer(fout, delimiter=',')
    interesting_files = glob.glob("*.csv")
    h = True
    for filename in interesting_files:
        print 'Processing', filename
        # Open and process file
        with open(filename, 'rb') as fin:
            if h:
                h = False
            else:
                fin.next()  # Skip header
            for line in csv.reader(fin, delimiter=','):
                wout.writerow(line)
You can simply use the built-in csv library. This solution will work even if some of your CSV files have slightly different column names or headers, unlike the other top-voted answers.
import csv
import glob

filenames = [i for i in glob.glob("SH*.csv")]

header_keys = []
merged_rows = []

for filename in filenames:
    with open(filename) as f:
        reader = csv.DictReader(f)
        merged_rows.extend(list(reader))
        header_keys.extend([key for key in reader.fieldnames if key not in header_keys])

with open("combined.csv", "w", newline="") as f:  # newline="" avoids blank rows on Windows
    w = csv.DictWriter(f, fieldnames=header_keys)
    w.writeheader()
    w.writerows(merged_rows)
The merged file will contain all possible columns (header_keys) that can be found in the files. Any column absent from a file will be rendered as blank/empty (while preserving the rest of the file's data).
Note:
- This won't work if your CSV files have no headers. In that case you can still use the csv library, but instead of using DictReader & DictWriter, you'll have to work with the basic reader & writer (see the sketch after these notes).
- This may run into issues when you are dealing with massive data, since the entirety of the content is being stored in memory (the merged_rows list).
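For the headerless case, a minimal sketch with the basic reader and writer (the SH*.csv pattern and the combined.csv output name are assumptions carried over from the code above):
import csv
import glob

# Concatenate headerless CSV files row by row; assumes every file has the same columns.
with open("combined.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for filename in sorted(glob.glob("SH*.csv")):
        with open(filename, newline="") as f:
            writer.writerows(csv.reader(f))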
fieldnames= can be any iterable, so a set or even a dict will do, and you can drop the keys.extend([... if ... not]) list comprehension in favor of keys.update(reader.fieldnames). – Limassol
If the merged CSV is going to be used in Python, then just use glob to get a list of the files to pass to fileinput.input() via its files argument, then use the csv module to read it all in one go.
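A minimal sketch of that approach (the SH*.csv pattern and the header-skipping logic are assumptions, not part of the answer above):
import csv
import fileinput
import glob

files = sorted(glob.glob("SH*.csv"))  # assumed naming from the question

with fileinput.input(files=files) as f:
    reader = csv.reader(f)
    # Keep every row, except repeated header lines from the second file onwards.
    rows = [row for row in reader
            if fileinput.lineno() == 1 or not fileinput.isfirstline()]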
A slight change to Norfeldt's code, as it does not actually work correctly. It should be as follows:
from glob import glob

with open('main.csv', 'a') as singleFile:
    for csv in glob('*.csv'):
        if csv == 'main.csv':
            pass
        else:
            for line in open(csv, 'r'):
                singleFile.write(line)
If you are working on Linux or Mac, you can do this.
from subprocess import call
script = "cat *.csv > merge.csv"
call(script, shell=True)
Or you could just do
cat sh*.csv > merged.csv
Building on the solution by Adders, later improved by varun, I implemented a small improvement to leave the whole merged CSV with only the main header:
from glob import glob

filename = 'main.csv'
with open(filename, 'a') as singleFile:
    first_csv = True
    for csv in glob('*.csv'):
        if csv == filename:
            pass
        else:
            header = True
            for line in open(csv, 'r'):
                if first_csv and header:
                    singleFile.write(line)
                    first_csv = False
                    header = False
                elif header:
                    header = False
                else:
                    singleFile.write(line)
    singleFile.close()  # Not strictly needed: the with statement closes the file
I have done it by implementing a function that expects an output file and the paths of the input files.
The function copies the content of the first file into the output file and then does the same for the rest of the input files, but without the header line.
def concat_files_with_header(output_file, *paths):
    for i, path in enumerate(paths):
        with open(path) as input_file:
            if i > 0:
                next(input_file)  # Skip header
            output_file.writelines(input_file)
Usage example of the function:
if __name__ == "__main__":
    paths = [f"sh{i}.csv" for i in range(1, 201)]
    with open("output.csv", "w") as output_file:
        concat_files_with_header(output_file, *paths)
You could import the csv module then loop through all the CSV files reading them into a list. Then write the list back out to disk.
import csv

rows = []
for f in (file1, file2, ...):
    reader = csv.reader(open(f, "r"))
    for row in reader:
        rows.append(row)

writer = csv.writer(open("some.csv", "w", newline=""))
writer.writerows(rows)
The above is not very robust, as it doesn't have any error handling, nor does it close any open files.
This should work whether or not the individual files have one or more rows of CSV data in them. Also, I did not run this code, but it should give you an idea of what to do.
First, you can iterate over all the files present in the root directory, then read each file and use the concat function to concatenate each CSV file:
import os
import pandas as pd

files = os.listdir(root_dir)  # root_dir is the directory containing the CSV files
final_data = pd.DataFrame()
for file in files:
    file_path = os.path.join(root_dir, file)
    df = pd.read_csv(file_path)
    final_data = pd.concat([final_data, df], ignore_index=True)
final_data.to_csv("Final_Data.csv", index=False)
I modified what wisty said to work with Python 3.x, for those of you who have an encoding problem. Also, I use the os module to avoid hard coding.
import os

def merge_all():
    os.chdir(r'C:\python\data')  # os.chdir returns None, so don't assign its result
    fout = open("merged_files.csv", "ab")
    # First file:
    for line in open("file_1.csv", 'rb'):
        fout.write(line)
    # Now the rest:
    file_list = os.listdir('.')
    number_files = len(file_list)
    for num in range(2, number_files):
        f = open("file_" + str(num) + ".csv", 'rb')
        f.__next__()  # Skip the header
        for line in f:
            fout.write(line)
        f.close()  # Not really needed
    fout.close()
Here is a script that:
- Concatenates CSV files named SH1.csv to SH200.csv
- Keeps the header (only from the first file)
import glob
import re

# Looking for filenames like 'SH1.csv' ... 'SH200.csv'
pattern = re.compile(r"^SH([1-9]|[1-9][0-9]|1[0-9][0-9]|200)\.csv$")
file_parts = [name for name in glob.glob('*.csv') if pattern.match(name)]

with open("file_merged.csv", "wb") as file_merged:
    for (i, name) in enumerate(file_parts):
        with open(name, "rb") as file_part:
            if i != 0:
                next(file_part)  # Skip headers if not the first file
            file_merged.write(file_part.read())
Updating wisty's answer for Python 3:
fout = open("out.csv", "a")
# First file:
for line in open("sh1.csv"):
    fout.write(line)
# Now the rest:
for num in range(2, 201):
    f = open("sh" + str(num) + ".csv")
    next(f)  # Skip the header
    for line in f:
        fout.write(line)
    f.close()  # Not really needed
fout.close()
Let's say you have two CSV files like these:
File csv1.csv
id,name
1,Armin
2,Sven
File csv2.csv
id,place,year
1,Reykjavik,2017
2,Amsterdam,2018
3,Berlin,2019
And you want the result to be like this (file csv3.csv):
id,name,place,year
1,Armin,Reykjavik,2017
2,Sven,Amsterdam,2018
3,,Berlin,2019
Then you can use the following snippet to do that:
import csv
import pandas as pd

# The file names
f1 = "csv1.csv"
f2 = "csv2.csv"
out_f = "csv3.csv"

# Read the files
df1 = pd.read_csv(f1)
df2 = pd.read_csv(f2)

# Get the keys
keys1 = list(df1)
keys2 = list(df2)

# Merge both files
for idx, row in df2.iterrows():
    data = df1[df1['id'] == row['id']]
    # If a row with such an id does not exist, add the whole row
    if data.empty:
        next_idx = len(df1)
        for key in keys2:
            df1.at[next_idx, key] = df2.at[idx, key]
    # If a row with such an id exists, add only the missing keys with their values
    else:
        i = int(data.index[0])
        for key in keys2:
            if key not in keys1:
                df1.at[i, key] = df2.at[idx, key]

# Save the merged file
df1.to_csv(out_f, index=False, encoding='utf-8', quotechar="", quoting=csv.QUOTE_NONE)
With the help of a loop, you can achieve the same result for multiple files, as in your case (200 CSV files).
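A rough sketch of such a loop, mirroring the same .at-based row merge as above and treating the first file as the accumulator (the SH1.csv ... SH200.csv names, the merged.csv output, and the presence of an id column in every file are assumptions):
import pandas as pd

paths = [f"SH{i}.csv" for i in range(1, 201)]  # assumed file naming
merged = pd.read_csv(paths[0])

for path in paths[1:]:
    df2 = pd.read_csv(path)
    existing_keys = list(merged)  # columns already present before this file
    for idx, row in df2.iterrows():
        match = merged[merged['id'] == row['id']]
        if match.empty:
            # Unknown id: append the whole row
            next_idx = len(merged)
            for key in list(df2):
                merged.at[next_idx, key] = df2.at[idx, key]
        else:
            # Known id: fill in only the columns this file adds
            i = int(match.index[0])
            for key in list(df2):
                if key not in existing_keys:
                    merged.at[i, key] = df2.at[idx, key]

merged.to_csv("merged.csv", index=False)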
If the files aren't numbered in order, take the hassle-free approach below:
Python 3.6 on a Windows machine:
import pandas as pd
from glob import glob
interesting_files = glob("C:/temp/*.csv") # It grabs all the csv files from
# the directory you mention here
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))

full_df = pd.concat(df_list)

# Save the final file in the same/different directory:
full_df.to_csv("C:/temp/merged_pandas.csv", index=False)
I'm getting an error (IndentationError: expected an indented block). Where did you copy it from? – Younts
An easy-to-use function:
def csv_merge(destination_path, *source_paths):
    '''
    Merges all csv files on source_paths to destination_path.
    :param destination_path: Path of a single csv file, doesn't need to exist
    :param source_paths: Paths of csv files to be merged into, needs to exist
    :return: None
    '''
    with open(destination_path, "a") as dest_file:
        # Copy the first file in full, including its header
        with open(source_paths[0]) as src_file:
            for src_line in src_file:
                dest_file.write(src_line)
        # Copy the remaining files, skipping their header lines
        for path in source_paths[1:]:
            with open(path) as src_file:
                next(src_file)
                for src_line in src_file:
                    dest_file.write(src_line)
import pandas as pd
import os

path = "e:\\data science\\kaggle assign\\monthly sales\\Pandas-Data-Science-Tasks-master\\SalesAnalysis\\Sales_Data"

df = pd.read_csv(os.path.join(path, "Sales_April_2019.csv"))
files = [file for file in os.listdir(path)]
for file in files:
    print(file)

all_data = pd.DataFrame()
for file in files:
    df = pd.read_csv(os.path.join(path, file))
    all_data = pd.concat([all_data, df])

all_data.head()