Here's my script to try:


#!/usr/bin/gawk -f
# print_ranges.awk
# usage: 2nd arg is a string of ranges enclosed by quotes, like "from1-to1 from2-to2 ..."
# ex: print_ranges.awk file "92-97 5-8 23-42 55-71"

BEGIN {
    range_cnt = split(ARGV[2], ranges, " ")

    # slurp the file into an array indexed by line number
    while ((getline < ARGV[1]) > 0) {
        line_arr[++n] = $0
    }
    close(ARGV[1])

    for (i = 1; i <= range_cnt; i++) {
        split(ranges[i], start_stop, "-")

        start = start_stop[1]
        stop  = start_stop[2]

        for (j = start; j <= stop; j++) {
            print line_arr[j]
        }
        print ""    # blank line between ranges
    }
    # everything happens in BEGIN, so awk never tries to read
    # ARGV[1] or ARGV[2] as ordinary input files
}
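
For example, a quick (untested) run against the output of seq; the file name here is just a placeholder and the script is assumed to have been made executable:

$ seq 100 > nums.txt
$ ./print_ranges.awk nums.txt "92-97 5-8"
92
93
94
95
96
97

5
6
7
8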




-jack

-----Original Message-----
From: Mike Miller [mailto:mbmiller+l at gmail.com]
Sent: Saturday, June 1, 2013 01:05 PM
To: 'TCLUG List'
Subject: [tclug-list] line cut: another missing coreutil

We have a nice command for cutting columns from data files where we can give it a list of columns to retain by character or field (defined by some delimiter). For example, if I have a tab-delimited file with 164 columns and I want to retain columns 1-6, 15, 19-25, 74-80 and 97-164, I can do it like this:

cut -f -6,15,19-25,74-80,97-

So GNU cut is good at dealing with fields (columns), but what about records (lines/rows)? What can do this with line numbers? I don't think we have anything. Suppose I have a file with 164 lines and I want to retain lines 1-6, 15, 19-25, 74-80 and 97-164. Now what?

It is possible to write something that does this, but isn't there a nice, fast utility written in C, say, that will do this very quickly?

One interesting thing I have figured out is that if we use awk or sed for grabbing certain lines by number, it must read the entire file. That can be extremely slow if you only need a few lines from the beginning of a huge file. It is much faster to use head and tail. Suppose you just want lines 27 to 35, then do this...

head -35 file | tail -n+27

...or this:

tail -n+27 file | head -9

The latter method will be a little faster, and it might make a big difference when the first line number is huge, but you have to do some arithmetic to get the number for head.

Of course, that head/tail approach would have to be used with each comma-delimited member of the list, which is a huge waste of effort compared to what would be done by a program written for this purpose.

By the way, if the job is to grab every Nth line, I use awk for that. Here are two examples where I grab every fiftieth line, either starting from the 50th line or starting from the first line:

$ seq 300 | awk 'NR%50==0'
50
100
150
200
250
300

$ seq 300 | awk 'NR%50==1'
1
51
101
151
201
251

In 1999 I asked some friends on a LUG (MLUG, at U Missouri) about this problem and one of them wrote a perl script. Another suggested some revisions. It seemed like they were going to improve it further but I guess that never happened. It probably works, though, and maybe I should be using it. We were calling it "rowcut". The old MLUG server is down, but here is that old thread:

http://genetsim.org/rowcut/

I haven't done any testing. I'm not sure how fast it is or if it needs to read every line of the input (like sed or awk) even if the last line number is small and the file is long. In such cases, use of head to drop the unneeded lines of the file and pipe the usable part to perl would probably be a lot faster.

Mike
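
Here's a rough, untested sketch of the cut-style interface described above, done as a gawk one-liner; the range list and file name are just placeholders, and it only takes closed ranges (no open-ended "97-" like cut allows). It exits as soon as it has passed the last requested line, so it doesn't have to read a huge file to the end:

gawk -v list="1-6,15,19-25,74-80,97-164" '
BEGIN {
    n = split(list, r, ",")
    for (i = 1; i <= n; i++) {
        m = split(r[i], se, "-")
        start[i] = se[1] + 0
        stop[i]  = (m > 1 ? se[2] : se[1]) + 0   # a bare "15" means 15-15
        if (stop[i] > last) last = stop[i]
    }
}
{
    for (i = 1; i <= n; i++)
        if (NR >= start[i] && NR <= stop[i]) { print; break }
    if (NR >= last) exit   # nothing past here is wanted, so stop reading
}
' file

Like cut, it prints lines in file order no matter how the ranges are listed.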