Manipulate csv file using bash script or python, converting certain columns into rows

I have a csv file with 6 columns. I want to write a script using Python or Bash so that I can generate a new file, maybe a txt file, where for each row the first three columns remain the same (that is column 1, 2 and 3 remain in same rows), but the next three columns (i.e. columns 4, 5 ,6) get inserted into a new row, just below their previous row.

My sample file looks like this:

Column1 Column2 Column3 Column4 Column5 Column6
chr16 2697996 2697997 chr16 2086192 2086193
chr5 112801382 112801383 chr5 138445682 138445683
chr15 26869333 26869334 chr8 119416385 119416386

My final file should look like this:

Column1 Column2 Column3
chr16 2697996 2697997
chr16 2086192 2086193
chr5 112801382 112801383
chr5 138445682 138445683
chr15 26869333 26869334
chr8 119416385 119416386

2 Answers

Here's an awk approach:

$ awk 'NR>1{for(i=1;i<=NF;i++){ if((i%4)==0) printf "\n"; printf "%s%s",$i,FS;}printf "\n"} END{printf "\n"}' input.csv
chr16 2697996 2697997
chr16 2086192 2086193
chr5 112801382 112801383
chr5 138445682 138445683
chr15 26869333 26869334
chr8 119416385 119416386

Assuming the "Columns" is present in the file, it'd be easier to process the file without them, hence NR>1 part. After that is a simple for-loop, which inserts a newline whenever there's no remainder in integer division by 4 in the column number ( the (i%4) == 0 part ). This is a very common trick to deal with short counters or data that needs to occur at specific intervals. Newline is necessary because we're printing data via printf directly as plain string, not a full line.

Since you've requested python, here's a python script as well:

#!/usr/bin/env python3
import sys
with open(sys.argv[1]) as fd: for index,line in enumerate(fd): items=line.strip().split() if index == 0: print(" ".join(items[:len(items)//2])) continue left = items[:len(items)//2] right = items[len(items)//2:] print( " ".join(left) ) print( " ".join(right) )

Works as follows:

$ ./break_columns.py ./input.csv
Column1 Column2 Column3
chr16 2697996 2697997
chr16 2086192 2086193
chr5 112801382 112801383
chr5 138445682 138445683
chr15 26869333 26869334
chr8 119416385 119416386

You can use Miller () and a bash script. Starting from input

Column1 Column2 Column3 Column4 Column5 Column6
chr16 2697996 2697997 chr16 2086192 2086193
chr5 112801382 112801383 chr5 138445682 138445683
chr15 26869333 26869334 chr8 119416385 119416386

and running

<input tail -n +2 | mlr --nidx --ifs ' ' --repifs cut -f 1,2,3 >out_01
<input tail -n +2 | mlr --nidx --ifs ' ' --repifs cut -f 4,5,6 >out_02
mlr --nidx --ifs ' ' --repifs cat out_01 out_02 >out
mlr -I --n2c --fs ' ' cat then label Column1,Column2,Column3 out

you will have

Column1 Column2 Column3
chr16 2697996 2697997
chr5 112801382 112801383
chr15 26869333 26869334
chr16 2086192 2086193
chr5 138445682 138445683
chr8 119416385 119416386

If you want a pretty printed output, change the final command in mlr -I --n2p --fs ' ' cat then label Column1,Column2,Column3 out

Column1 Column2 Column3
chr16 2697996 2697997
chr5 112801382 112801383
chr15 26869333 26869334
chr16 2086192 2086193
chr5 138445682 138445683
chr8 119416385 119416386

Pop Feed Daily

Manipulate csv file using bash script or python, converting certain columns into rows

2 Answers

Your Answer

Sign up or log in

Post as a guest

You Might Also Like

Downloading an Xbox 360 update fails every time

Is there a way to recover the undertale genocide route after fighting mettaton NEO

My PS2 can't play music CD's

Why does the server say "connection lost, timed out" after joining a minigame? [duplicate]