We have a nice command for cutting columns from data files: we give it a 
list of columns to retain, selected either by character position or by 
field (fields being defined by some delimiter).  For example, if I have a 
tab-delimited file with 164 columns and I want to retain columns 1-6, 15, 
19-25, 74-80 and 97-164, I can do it like this:

cut -f -6,15,19-25,74-80,97- file

So GNU cut is good at dealing with fields (columns), but what about 
records (lines/rows)?  What tool can do the same thing with line numbers? 
I don't think we have one.  Suppose I have a file with 164 lines and I 
want to retain lines 1-6, 15, 19-25, 74-80 and 97-164.  Now what?

It is possible to write something that does this, but isn't there already 
a nice, fast utility, written in C say, for the job?
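
For the record, sed can take a list of line ranges directly; a sketch for 
the example above:

sed -n '1,6p;15p;19,25p;74,80p;97,164p' file

That works, but as I explain next, it reads the whole file even though 
nothing after line 164 is wanted.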

One interesting thing I have figured out is that the obvious awk or sed 
command for grabbing certain lines by number reads the entire file.  That 
can be extremely slow if you only need a few lines from the beginning of 
a huge file.  It is much faster to use head and tail.  Suppose you just 
want lines 27 to 35; then do this...

head -n 35 file | tail -n +27

...or this:

tail -n +27 file | head -n 9

The latter method will be a little faster, since tail discards the 
unwanted leading lines itself instead of pushing them all through the 
pipe, and that can make a big difference when the first line number is 
huge; but you have to do some arithmetic to get the count for head 
(last - first + 1, here 35 - 27 + 1 = 9).

Of course, that head/tail approach would have to be repeated for each 
comma-delimited member of the list, which is a huge waste of effort 
compared to what a program written for this purpose would do.
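
Until such a program shows up, awk can fake one in a single pass.  Here 
is a sketch; it only handles fully specified ranges (not cut-style 
open-ended ones like 97-), and the exit makes it stop reading at the 
last wanted line:

awk -v list='1-6,15,19-25,74-80,97-164' '
BEGIN {
    n = split(list, range, ",")          # break the list on commas
    last = 0
    for (i = 1; i <= n; i++) {
        m = split(range[i], b, "-")      # "19-25" -> 19,25; "15" -> 15
        lo[i] = b[1] + 0
        hi[i] = (m == 2) ? b[2] + 0 : lo[i]
        if (hi[i] > last) last = hi[i]
    }
}
{
    for (i = 1; i <= n; i++)
        if (NR >= lo[i] && NR <= hi[i]) { print; break }
    if (NR >= last) exit                 # quit; skip the rest of the file
}' file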

By the way, if the job is to grab every Nth line, I use awk for that. 
Here are two examples where I grab every fiftieth line, either starting 
from the 50th line or starting from the first line:

$ seq 300 | awk 'NR%50==0'
50
100
150
200
250
300

$ seq 300 | awk 'NR%50==1'
1
51
101
151
201
251
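
The same trick generalizes to every nth line starting from an arbitrary 
line m, with the two numbers passed in as awk variables:

$ seq 300 | awk -v n=50 -v m=13 'NR>=m && (NR-m)%n==0'
13
63
113
163
213
263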

In 1999 I asked some friends on a LUG (MLUG, at U Missouri) about this 
problem, and one of them wrote a Perl script.  Another suggested some 
revisions.  It seemed like they were going to improve it further, but I 
guess that never happened.  It probably works, though, and maybe I should 
be using it.  We were calling it "rowcut".  The old MLUG server is down, 
but here is that old thread:

http://genetsim.org/rowcut/

I haven't done any testing.  I'm not sure how fast it is, or whether it 
needs to read every line of the input (like sed or awk) even when the 
last wanted line number is small and the file is long.  In such cases, 
using head to drop the unneeded lines of the file and piping the usable 
part to perl would probably be a lot faster.
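
Something like this, assuming rowcut reads standard input and takes the 
range list as an argument (I haven't checked its actual interface, so 
the usage here is a guess):

head -n 164 file | rowcut '1-6,15,19-25,74-80,97-164'

head quits after line 164, so nothing past it is ever read.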

Mike