Efficient way to get n middle lines from a very big file

I have a big file, around 60 GB.

I need to get n lines from the middle of the file. I am using a command with head and tail like

tail -m file | head -n > output.txt

where m and n are numbers.

The general structure of the file is shown below: a set of records with comma-separated columns. Each line can be of a different length (say, max 5000 chars).

col1,col2,col3,col4...col10

Is there any other way to get n middle lines in less time? The current command takes a long time to execute.
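
For reference, the usual form of such a head/tail pipeline selects lines by position with tail -n +START; the numbers below are placeholders, not the values from the question:

tail -n +600000 file | head -n 100001 > output.txt   # lines 600000 through 700000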

Phonologist answered 9/12, 2013 at 7:7 Comment(2)
Can you tell us more about the data in your file, like its general structure? How are the lines separated? What is the max size of each line? That way we can try to seek directly to the required line. If your lines are not all the same length, we'll have to parse the file character by character, and in that case you are already using the best possible way.Echo
Added the general structure of the record to the question.Phonologist

With sed you can at least remove the pipeline:

sed -n '600000,700000p' file > output.txt

will print lines 600000 through 700000.

Sophister answered 9/12, 2013 at 9:16 Comment(1)
If there are a lot of lines after the last requested line, it might help to also add a 'q' command: sed -n '600000,700000p;700000q' file. Otherwise, sed will keep running until the last line of the file is read (even if nothing is printed).Weave

awk 'FNR>=n && FNR<=m'

followed by the name of the file.
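
For example, with the line numbers used in the sed answer, and with an early exit added (my addition, not part of the original one-liner) so awk stops reading once the range has been printed:

awk 'FNR>=600000 && FNR<=700000; FNR>700000{exit}' file > output.txt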

Pauli answered 9/12, 2013 at 10:28 Comment(0)

It might be more efficient to use the split utility, because with tail and head in a pipe you scan some parts of the file twice.

Example

split -l <k> <file> <prefix>

Where k is the number of lines you want to have in each file, and the (optional) prefix is added to each output file name.
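
As a rough sketch (the file name and prefix here are made up):

split -l 1000000 bigfile.csv part_    # produces part_aa, part_ab, ... with 1,000,000 lines each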

Decibel answered 9/12, 2013 at 9:20 Comment(1)
Yes, I thought of using this command, but my machine doesn't have enough space to store the split files :(Phonologist

The only possible solution I can think of to speed up the search is to build an index of your lines, something like:

 0 00000000
 1 00000013
 2 00000045
   ...
 N 48579344

Then, knowing the length of the index, you could jump quickly to the middle of your data file (or wherever you like...). Of course, you should keep the index updated when the file changes...
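
A rough sketch of this idea in shell (the file names and index layout are made up; it assumes plain \n line endings and single-byte characters, so length($0) + 1 is the line's size in bytes):

awk '{print NR, off; off += length($0) + 1}' off=0 bigfile.csv > bigfile.idx   # line number, byte offset of line start

off=$(awk '$1 == 600000 {print $2; exit}' bigfile.idx)            # offset of line 600000
tail -c +$((off + 1)) bigfile.csv | head -n 100000 > output.txt   # jump there, take 100000 lines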

Obviously, the canonical solution for such a problem would be to keep the data in a DB (see for example SQLite), and not in a plain file... :-)

Epiphany answered 9/12, 2013 at 9:0 Comment(1)
My intention is to move this data to a DB. Because a few of the records are not properly structured, and due to some other issues, I am moving them to the DB in chunks.Phonologist

Having the same problem (mine is an Asterisk Master.csv file), I am afraid there is no trivial solution: to reach the 10,000,000-th line of a file (a file, not a database record or an in-memory representation of the file), whatever you use has to count from 0 to 10,000,000... :-(

Convery answered 24/4, 2023 at 6:1 Comment(0)

Open the file in binary random-access mode, seek to the middle, move forward sequentially until you reach a \n (or \r\n), and, starting from the following character, dump N lines to your result file (one \n = one line). Job done.
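
A minimal shell sketch of that approach (GNU stat assumed; the file name is made up; the first, probably partial, line after the seek point is thrown away):

size=$(stat -c %s bigfile.csv)     # total size in bytes
mid=$(( size / 2 ))                # byte offset of the middle
tail -c +$(( mid + 1 )) bigfile.csv | tail -n +2 | head -n 1000 > output.txt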

If your file is sorted and you need the data between two keys, you can use the method described above plus bisection.

Janeejaneen answered 9/12, 2013 at 9:26 Comment(0)
