I need to repeatedly remove the first line from a huge text file using a bash script.
Right now I am using sed -i -e "1d" $FILE
- but it takes around a minute to do the deletion.
Is there a more efficient way to accomplish this?
Try tail:
tail -n +2 "$FILE"
-n x: Just print the last x lines.

tail -n 5 would give you the last 5 lines of the input. The + sign kind of inverts the argument and makes tail print anything but the first x-1 lines. tail -n +1 would print the whole file, tail -n +2 everything but the first line, etc.
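For instance, on a five-line file the two forms behave like this (the sample file name is just for illustration):

```shell
# Create a small sample to contrast the two meanings of -n.
printf '%s\n' one two three four five > sample.txt

tail -n 2 sample.txt    # last two lines: four, five
tail -n +2 sample.txt   # everything but the first line: two, three, four, five
```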
GNU tail is much faster than sed. tail is also available on BSD, and the -n +2 flag is consistent across both tools. Check the FreeBSD or OS X man pages for more.

The BSD version can be much slower than sed, though. I wonder how they managed that; tail should just read a file line by line while sed does pretty complex operations involving interpreting a script, applying regular expressions and the like.
Note: You may be tempted to use
# THIS WILL GIVE YOU AN EMPTY FILE!
tail -n +2 "$FILE" > "$FILE"
but this will give you an empty file. The reason is that the redirection (>) happens before tail is invoked by the shell:

1. The shell truncates the file $FILE
2. The shell creates a new process for tail
3. The shell redirects the stdout of the tail process to $FILE
4. tail reads from the now empty $FILE
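The effect is easy to reproduce on a throwaway file; the shell truncates demo.txt before tail even starts, so there is nothing left to read:

```shell
printf 'one\ntwo\nthree\n' > demo.txt
# The > redirection empties demo.txt first; tail then reads an empty file.
tail -n +2 demo.txt > demo.txt
wc -c < demo.txt    # 0 bytes left
```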
If you want to remove the first line inside the file, you should use:
tail -n +2 "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE"
The &&
will make sure that the file doesn't get overwritten when there is a problem.
…the -r option. Maybe there's a buffer setting somewhere in the system? Or -n is a 32-bit signed number? – Sluggish

tail will work for any file size. – Navar

sed has an internal buffer for the current line while tail can get away by just remembering the offset of the N last newline characters (note that I didn't actually look at the sources). – Navar

tail -n +2 "$FILE" > newfile – Electrosurgery

…1d just matches the first line, but I'm not sure that sed optimizes this case, for example). – Navar

…echo | lpr. I don't have the time to debug tail, so I don't know why it's slower in your case. My gut feeling is that it's the long lines, but I don't know. – Navar

time cat sample.txt > /dev/null takes 0.06s (just IO from the cache). time sed -e "1d" sample.txt > /dev/null takes 1.12s, and time tail -n +2 sample.txt > /dev/null takes 0.22s. sed is roughly 6 times slower than tail. – Navar

One advantage of sed is that you can use it to edit files in place, which you cannot do with tail (as far as I am aware; please correct me if I am wrong). If you would like to delete the first line in all files in a directory, you could do something like sed -i "1d" *. I guess you could also automate tail by using it in combination with find or by making a script, but I am not sure which one performs better. I know the OP mentioned they were using -i, but I thought this might help clarify its use. – Sensationalism

You can use sponge instead of a temporary file, as mentioned in this answer. – Eyelet

With sed, the pattern '1d' will delete the first line. Additionally, the -i flag can be used to update the file "in place".[1]

sed -i '1d' filename

[1] sed -i automatically creates a temporary file with the desired changes, and then replaces the original file.
…unterminated transform source string – Nygaard

sed -i '1,2d' filename – Waverly

…tail -n +2. Not sure why it isn't the top answer. – Peepul

…tail compared to sed, it should be noted that despite the -i option, sed needs to create a copy of the file anyway, so this solution won't be more helpful than tail when facing limited disk space issues. – Wrath

…sed -i '' '1d' filename. Per #16746488 – Hie

cat filename | sed '1,3d' > filename will empty your file before the pipeline even starts :-) Use a different filename for the output, then move it: cat filename | sed '1,3d' > file.tmp && mv -i file.tmp filename – Diann

For those who are on SunOS, whose sed is non-GNU, the following code will help:
sed '1d' test.dat > tmp.dat
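To make the deletion stick, move the temporary file back over the original afterwards (same file names as above):

```shell
# Recreate a small test.dat, strip its first line into tmp.dat,
# then replace the original only if sed succeeded.
printf 'first\nsecond\nthird\n' > test.dat
sed '1d' test.dat > tmp.dat && mv tmp.dat test.dat
```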
You can easily do this with:
cat filename | sed 1d > filename_without_first_line
on the command line; or to remove the first line of a file permanently, use the in-place mode of sed with the -i
flag:
sed -i 1d <filename>
The -i option technically takes an argument specifying the file suffix to use when making a backup of the file (e.g. sed -i .bak 1d filename creates a copy called filename.bak of the original file with the first line intact). While GNU sed lets you specify -i without an argument to skip the backup, BSD sed, as found on macOS, requires an empty string argument as a separate shell word (e.g. sed -i '' ...). – Terms

The sponge util avoids the need for juggling a temp file:

tail -n +2 "$FILE" | sponge "$FILE"
sponge is indeed much cleaner and more robust than the accepted solution (tail -n +2 "$FILE" > "$FILE.tmp" && mv "$FILE.tmp" "$FILE") – Smriti

Doesn't sponge buffer the whole file in memory? That won't work if it's hundreds of GB. – Amathist

sponge will soak it up, since it uses a /tmp file as an intermediate step, which is then used to replace the original afterward. – Dupleix

No, that's about as efficient as you're going to get. You could write a C program which could do the job a little faster (less startup time and processing arguments), but it will probably tend towards the same speed as sed as files get large (and I assume they're large if it's taking a minute).
But your question suffers from the same problem as so many others in that it pre-supposes the solution. If you were to tell us in detail what you're trying to do rather than how, we may be able to suggest a better option.
For example, if this is a file A that some other program B processes, one solution would be to not strip off the first line, but modify program B to process it differently.
Let's say all your programs append to this file A and program B currently reads and processes the first line before deleting it.
You could re-engineer program B so that it didn't try to delete the first line but maintains a persistent (probably file-based) offset into the file A so that, next time it runs, it could seek to that offset, process the line there, and update the offset.
Then, at a quiet time (midnight?), it could do special processing of file A to delete all lines currently processed and set the offset back to 0.
It will certainly be faster for a program to open and seek a file rather than open and rewrite. This discussion assumes you have control over program B, of course. I don't know if that's the case but there may be other possible solutions if you provide further information.
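A minimal sketch of that offset-keeping idea, assuming hypothetical names (data.log for file A, offset.txt for the persisted position); program B would run something like this instead of deleting lines:

```shell
# Set up a stand-in for file A and a zeroed offset store.
printf 'first\nsecond\n' > data.log
echo 0 > offset.txt

offset=$(cat offset.txt)
# tail -c +N starts output at byte N, so this skips the bytes
# that were already handled on a previous run.
tail -c +"$((offset + 1))" data.log > batch.txt
# ... process batch.txt here instead of rewriting data.log ...
# Persist the new offset so the next run resumes after these bytes.
wc -c < data.log | tr -d ' ' > offset.txt
```

A real implementation would also need to guard against a crash between processing the batch and recording the offset, e.g. by writing the offset to a temp file and renaming it into place.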
awk FNR-1 *.csv is probably faster. – Ronen

If you want to modify the file in place, you could always use the original ed
instead of its streaming successor sed
:
ed "$FILE" <<<$'1d\nwq\n'
The ed
command was the original UNIX text editor, before there were even full-screen terminals, much less graphical workstations. The ex
editor, best known as what you're using when typing at the colon prompt in vi
, is an extended version of ed
, so many of the same commands work. While ed
is meant to be used interactively, it can also be used in batch mode by sending a string of commands to it, which is what this solution does.
The sequence <<<$'1d\nwq\n'
takes advantage of modern shells' support for here-strings (<<<
) and ANSI quotes ($'
...'
) to feed input to the ed
command consisting of two lines: 1d
, which deletes line 1, and then wq
, which writes the file back out to disk and then quits the editing session.
As Pax said, you probably aren't going to get any faster than this. The reason is that almost no filesystems support truncating from the beginning of a file, so this is going to be an O(n) operation where n is the size of the file. What you can do much faster, though, is overwrite the first line with the same number of bytes (maybe with spaces or a comment), which might work for you depending on exactly what you are trying to do (what is that, by the way?).
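A sketch of the overwrite idea (file name and contents are illustrative): the first line is blanked with spaces, so the file's size and every later byte offset stay the same:

```shell
printf 'header\ndata1\ndata2\n' > records.txt
# Length of line 1, including its trailing newline.
bytes=$(head -n 1 records.txt | wc -c)
# Overwrite all of line 1 except the newline with spaces;
# conv=notrunc leaves the rest of the file untouched.
printf '%*s' "$((bytes - 1))" '' | dd of=records.txt conv=notrunc 2>/dev/null
```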
You can edit files in place: just use perl's -i flag, like this:
perl -ni -e 'print unless $. == 1' filename.txt
This makes the first line disappear, as you ask. Perl will need to read and copy the entire file, but it arranges for the output to be saved under the name of the original file.
This should show all the lines except the first line:
cat textfile.txt | tail -n +2
Could use vim to do this:
vim -u NONE +'1d' +'wq!' /tmp/test.txt
This should be faster, since vim won't read the whole file while processing.
You may want to quote +wq! if your shell is bash. Probably not, since the ! is not at the beginning of a word, but getting in the habit of quoting things is probably good all around. (And if you're going for super-efficiency by not quoting unnecessarily, you don't need the quotes around the 1d either.) – Terms

How about using csplit?
man csplit
csplit -k file 1 '{1}'
csplit file /^.*$/1. Or more simply: csplit file //1. Or even more simply: csplit file 2. – Blackwell

This one-liner will do:
echo "$(tail -n +2 "$FILE")" > "$FILE"
It works because the command substitution $(...) is expanded, reading the whole remainder of the file into memory, before the shell performs the > redirection that truncates the file, hence no need for a temp file.
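One caveat: command substitution strips trailing newlines (and the whole remainder of the file is held in memory), so trailing blank lines are silently lost:

```shell
printf 'one\ntwo\n\n\n' > f.txt      # file ends with two blank lines
echo "$(tail -n +2 f.txt)" > f.txt   # $(...) eats the trailing newlines
wc -l < f.txt                        # 1 line: only "two" survives
```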
Since it sounds like I can't speed up the deletion, I think a good approach might be to process the file in batches like this:
while [ -s file1 ]; do
    head -n 1000 file1 > file2
    # ... process file2 here ...
    sed -i -e '1,1000d' file1    # drop the 1000 lines just processed
done
The drawback of this is that if the program gets killed in the middle (or if there's some bad sql in there - causing the "process" part to die or lock-up), there will be lines that are either skipped, or processed twice.
(file1 contains lines of sql code)
tail +2 path/to/your/file
works for me, no need to specify the -n
flag. For reasons, see Aaron's answer.
You can use the sed
command to delete arbitrary lines by line number
# create a multi-line txt file
printf '%s\n' '1. first' '2. second' '3. third' > file.txt
Deleting lines and printing to stdout:
$ sed '1d' file.txt
2. second
3. third
$ sed '2d' file.txt
1. first
3. third
$ sed '3d' file.txt
1. first
2. second
# delete multiple lines
$ sed '1,2d' file.txt
3. third
# delete the last line
$ sed '$d' file.txt
1. first
2. second
Use the -i option to edit the file in place:
$ cat file.txt
1. first
2. second
3. third
$ sed -i '1d' file.txt
$ cat file.txt
2. second
3. third
If what you are looking to do is recover after failure, you could just build up a file that has what you've done so far.
if [[ -f $tmpf ]]; then
    rm -f "$tmpf"
fi
while IFS= read -r line; do
    # process line here
    echo "$line" >> "$tmpf"
done < "$srcf"
Based on 3 other answers, I came up with this syntax that works perfectly in my macOS bash shell:
line=$(head -n1 list.txt && echo "$(tail -n +2 list.txt)" > list.txt)
Test case:
~> printf "Line #%2d\n" {1..3} > list.txt
~> cat list.txt
Line # 1
Line # 2
Line # 3
~> line=$(head -n1 list.txt && echo "$(tail -n +2 list.txt)" > list.txt)
~> echo $line
Line # 1
~> cat list.txt
Line # 2
Line # 3
Also check these ways:
mapfile -t lines < 1.txt && printf "%s\n" "${lines[@]:1}" > new.txt
#OR
awk 'NR>1' old.txt > new.txt
#OR
sed -n '2,$p' old.txt > new.txt
For truly in-place deletion of lines at the head of a file:
$ cat file
1
2
3
4
5
$ bytes=$(head -1 file |wc -c)
$ dd if=file bs="$bytes" skip=1 conv=notrunc of=file
4+0 records in
4+0 records out
8 bytes copied, 0.0002447 s, 32.7 kB/s
$ truncate -s "-$bytes" file
$ cat file
2
3
4
5
It will be orders of magnitude slower than using sed -i '1d' or similar approaches that use a temp file, though, so only use it if you don't have enough disk space to make a copy of the input file.
Would using tail on N-1 lines and directing that into a file, followed by removing the old file and renaming the new file to the old name, do the job?

If I were doing this programmatically, I would read through the file and remember the file offset after reading each line, so I could seek back to that position to read the file with one less line in it.