Convert a fixed-width file from text to CSV

I have a large data file in text format and I want to convert it to csv by specifying each column length.

number of columns = 5

column length

[4 2 5 1 1]

sample observations:

aasdfh9013512
ajshdj 2445df

Expected Output

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
Cordalia answered 17/2, 2015 at 12:42 Comment(0)
31

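GNU awk (gawk) supports this directly with FIELDWIDTHS, e.g.:

gawk '{ $1 = $1 } 1' FIELDWIDTHS='4 2 5 1 1' OFS=, infile

Assigning $1 to itself makes gawk rebuild the record with OFS, and the trailing 1 prints every line. (A bare '$1=$1' pattern is shorter but skips lines whose first field is empty or evaluates to 0.)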

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
Cofferdam answered 17/2, 2015 at 13:13 Comment(6)
Nice! I wasn't aware of this functionality. Big +1! Related link: Reading Fixed-width Data. – Jefferyjeffie
The FIELDWIDTHS argument only works for me on Ubuntu 14.04.3 if I install and use gawk. – Gaussmeter
@Arthur: According to GNU awk's feature history, FIELDWIDTHS has been available since gawk 2.13, i.e. July 2010. – Cofferdam
@Cofferdam Yes, I'm sure that's right. However, if gawk is not installed it won't matter. At least for me, on Ubuntu 14.04.3, awk was installed but NOT gawk. – Gaussmeter
@Arthur: Yes, this is a GNU awk (gawk) specific answer; I will make that clearer. Many Debian-derived systems, for some reason, have mawk as their default awk alternative, maybe because it is faster. – Cofferdam
Note that on Windows this solution works, but you need to use double quotes instead of single quotes: gawk "$1=$1" FIELDWIDTHS="1 4 8 5 3" OFS=, sample-fixed.csv – Crabb
5

I would use sed and catch the groups with the given length:

$ sed -r 's/^(.{4})(.{2})(.{5})(.{1})(.{1})$/\1,\2,\3,\4,\5/' file
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
Jefferyjeffie answered 17/2, 2015 at 12:47 Comment(2)
First of all, thanks for answering this. But in the actual file I have to break each line into 80 columns, and the sed command only works for up to 9 columns. Please help. – Cordalia
@AshishKumar then you probably must use Thor's answer with awk. – Jefferyjeffie
3

Here's a solution that works with regular awk (does not require gawk).

awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'

It uses awk's substr function to define each field's start position and length. OFS defines what the output field separator is (in this case, a comma).

(Side note: This only works if the source data does not have any commas. If the data has commas, then you have to escape them to be proper CSV, which is beyond the scope of this question.)

Demo:

echo 'aasdfh9013512
ajshdj 2445df' | 
awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
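
The side note about commas in the data can be handled by letting a CSV library do the quoting. Below is a minimal Python sketch, not part of the awk answer: the function name fixed_width_to_csv and the third sample line (which contains a comma) are made up for illustration, and the widths are the ones from the question.

```python
import csv
import io


def fixed_width_to_csv(lines, widths):
    """Slice each line by the given column widths and return CSV text.

    Hypothetical helper for illustration; csv.writer quotes any field
    that contains the delimiter, so embedded commas stay inside one field.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\r\n")  # drop the line ending before slicing
        row, pos = [], 0
        for width in widths:
            row.append(line[pos:pos + width])
            pos += width
        writer.writerow(row)
    return out.getvalue()


# The two lines from the question plus an invented line containing a comma.
sample = ["aasdfh9013512", "ajshdj 2445df", "a,b fh9013512"]
print(fixed_width_to_csv(sample, [4, 2, 5, 1, 1]), end="")
```

With the default minimal quoting, only the field containing the comma should end up wrapped in double quotes; the other fields, including the one with a leading space, are written unchanged.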
Dentation answered 30/8, 2017 at 23:1 Comment(0)
1

A generic alternative to the FIELDWIDTHS option, where the substring positions need not be hardcoded, could be as follows (written and tested in GNU awk). Pass the column lengths, as shown in the question, separated by spaces in the awk variable colLength; a comma is inserted after every column except the last.

awk -v colLength="4 2 5 1 1" '
BEGIN{
  num=split(colLength,arr," ")
}
{
  j=sum=0
  while(++j<num){
    if(length($0)>sum){
      sub("^.{"(arr[j]+sum)"}","&,")
    }
    sum+=arr[j]+1
  }
}
1
' Input_file

Explanation: The awk variable colLength holds the column lengths. The BEGIN section splits it into the array arr and stores the number of columns in num.

For each line, the main block resets j and sum, then loops over every column except the last. sum counts the characters already processed, including the commas inserted so far, so arr[j]+sum is the position after which the next comma belongs. On the first pass the regex is ^.{4}, on the second ^.{7} (4 characters, 1 comma, 2 characters), and so on; sub replaces the matched prefix with itself (&) followed by a comma. The length check skips the substitution when the line has no characters left beyond those already processed. The final 1 prints every line, edited or not.
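
The position arithmetic can also be sketched in Python. This is a variant of the idea rather than a translation of the awk program: it computes the cumulative end offset of each column and inserts commas from right to left, so earlier offsets do not shift and no correction for already-inserted commas is needed. The name insert_commas is hypothetical.

```python
from itertools import accumulate


def insert_commas(line, widths):
    """Insert a comma after every column except the last (illustrative helper)."""
    # Cumulative end offsets of all columns but the last,
    # e.g. [4, 6, 11, 12] for widths [4, 2, 5, 1, 1].
    cuts = list(accumulate(widths))[:-1]
    # Work from the right so that earlier offsets stay valid.
    for cut in reversed(cuts):
        if len(line) > cut:  # skip columns the line is too short to contain
            line = line[:cut] + "," + line[cut:]
    return line


print(insert_commas("aasdfh9013512", [4, 2, 5, 1, 1]))  # aasd,fh,90135,1,2
print(insert_commas("ajshdj 2445df", [4, 2, 5, 1, 1]))  # ajsh,dj, 2445,d,f
```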

Swedenborgianism answered 16/11, 2021 at 22:2 Comment(0)
0

If anyone is still looking for a solution, I have developed a small script in Python. It's easy to use, provided you have Python 3.5.

https://github.com/just10minutes/FixedWidthToDelimited/blob/master/FixedWidthToDelimiter.py

"""
This script will convert a fixed-width file into a delimited file; tried on Python 3.5 only
Sample run: (Order of arguments doesn't matter)
python ConvertFixedToDelimiter.py -i SrcFile.txt -o TrgFile.txt -c Config.txt -d "|"
Inputs are as follows
1. Input File - Mandatory (Argument -i) - File which has fixed-width data in it
2. Config File - Optional (Argument -c, if not provided will look for Config.txt file on same path, if not present script will not run)
    Should have format as
    FieldName,fieldLength
    eg:
    FirstName,10
    SecondName,8
    Address,30
    etc:
3. Output File - Optional (Argument -o, if not provided will be used as InputFIleName plus Delimited.txt)
4. Delimiter - Optional (Argument -d, if not provided default value is "|" (pipe))
"""
from collections import OrderedDict
import argparse
from argparse import ArgumentParser
import os.path
import sys


def slices(s, args):
    position = 0
    for length in args:
        length = int(length)
        yield s[position:position + length]
        position += length

def extant_file(x):
    """
    'Type' for argparse - checks that file exists but does not open.
    """
    if not os.path.exists(x):
        # Argparse uses the ArgumentTypeError to give a rejection message like:
        # error: argument input: x does not exist
        raise argparse.ArgumentTypeError("{0} does not exist".format(x))
    return x





parser = ArgumentParser(description="Please provide your Inputs as -i InputFile -o OutPutFile -c ConfigFile")
parser.add_argument("-i", dest="InputFile", required=True,    help="Provide your Input file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE", type=extant_file)
parser.add_argument("-o", dest="OutputFile", required=False,    help="Provide your Output file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE")
parser.add_argument("-c", dest="ConfigFile", required=False,   help="Provide your Config file name here,File should have value as fieldName,fieldLength. if file is on different path than where this script resides then provide full path of the file", metavar="FILE",type=extant_file)
parser.add_argument("-d", dest="Delimiter", required=False,   help="Provide the delimiter string you want",metavar="STRING", default="|")

args = parser.parse_args()

#Input file mandatory
InputFile = args.InputFile
#Delimiter by default "|"
DELIMITER = args.Delimiter

#Output file checks
if args.OutputFile is None:
    OutputFile = str(InputFile) + "Delimited.txt"
    print ("Setting Output file as "+ OutputFile)
else:
    OutputFile = args.OutputFile

#Config file check
if args.ConfigFile is None:
    if not os.path.exists("Config.txt"):
        print ("There is no Config File provided exiting the script")
        sys.exit()
    else:
        ConfigFile = "Config.txt"
        print ("Taking Config.txt file on this path as Default Config File")
else:
    ConfigFile = args.ConfigFile

fieldNames = []
fieldLength = []
myvars = OrderedDict()


with open(ConfigFile) as myfile:
    for line in myfile:
        name, var = line.partition(",")[::2]
        myvars[name.strip()] = int(var)
for key,value in myvars.items():
    fieldNames.append(key)
    fieldLength.append(value)

with open(OutputFile, 'w') as f1:
    fieldNames = DELIMITER.join(map(str, fieldNames))
    f1.write(fieldNames + "\n")
    with open(InputFile, 'r') as f:
        for line in f:
            # Strip the line ending so it cannot leak into the last field
            rec = list(slices(line.rstrip("\r\n"), fieldLength))
            myLine = DELIMITER.join(map(str, rec))
            f1.write(myLine + "\n")
Schoenberg answered 9/7, 2016 at 18:6 Comment(0)
0

Portable awk

Generate an awk script with the appropriate substr commands

cat cols
4
2
5
1
1
<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1

Output:

substr($0,1,4)
substr($0,5,2)
substr($0,7,5)
substr($0,12,1)
substr($0,13,1)

Combine lines and make it a valid awk-script:

<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1 |
paste -sd, | sed 's/^/{ print /; s/$/ }/'

Output:

{ print substr($0,1,4),substr($0,5,2),substr($0,7,5),substr($0,12,1),substr($0,13,1) }

Redirect the above to a file, e.g. /tmp/t.awk and run it on the input-file:

<infile awk -f /tmp/t.awk

Output:

aasd fh 90135 1 2
ajsh dj  2445 d f

Or with comma as the output separator:

<infile awk -f /tmp/t.awk OFS=,

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
Cofferdam answered 15/11, 2021 at 15:22 Comment(0)
