Joy of Getting things done with Unix
MoT EDS
NYU/Poly
Problem Statement
I have a really, really large file and I want to
import all of it into a database.
What are the issues?
I want to isolate errors so I don't have to rerun the
entire batch.
I want to minimize manual intervention.
You can use Python or PHP, but take care of batching within the script by committing
every 1000 records, provided the project is not at risk.
While NumPy is present in this environment, SciPy is not, although I believe it ought to be;
I am working with IBM to correct this situation.
The gift of Unix
Splitting a file
The split command does the magic.
Like every other Unix command, split allows you to customize
its behavior with several different options.
In this case I want to split the file 201407.csv into 1000-line files.
And because I will have many files, I am customizing split to use
5 characters to create the temporary file names.
In Unix, man <any command> will always tell you what to do.
The original file 201407.csv
was split into 972 files, each
with 1000 records.
cat x* > orig will recreate the
original file. With “ls -l” and diff
I can verify that the recreated file
and the original are the same.
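The commands from the slide, roughly (the x prefix for the chunk names is split's default; orig is just a scratch name for the reassembled copy):

  split -l 1000 -a 5 201407.csv   # 1000-line chunks, 5-character suffixes: xaaaaa, xaaaab, ...
  ls x* | wc -l                   # should report 972 chunks
  cat x* > orig                   # stitch the chunks back together
  ls -l 201407.csv orig           # sizes should match
  diff 201407.csv orig            # no output means the files are identical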
Access to the file
File_priv in MySQL requires a lot of setup.
MySQL should be configured to allow file I/O.
The file should be specified with an absolute path.
Since the mysqld daemon runs as a
different user than the mysql client (shell), the file
needs read/write permission for all users;
“chmod 666 filename” does that.
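A sketch of that setup, with an illustrative account name and path (and remember the slide's caveat that the server must also be configured to allow file I/O):

  # grant the global FILE privilege to the loading account (run as a MySQL admin)
  mysql -u root -p -e "GRANT FILE ON *.* TO 'loader'@'localhost'"
  # make the chunk readable by the mysqld user as well as by my own account
  chmod 666 /home/me/data/xaaaaa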
Repeating – loops
Every form of shell (sh, ksh, csh, tcsh, bash and
so on) has primitives for loops, with minor
differences in syntax.
“for i in *” is the sh/ksh/bash form.
The general pattern is: for i in *; do …; done
Here I am repeating a block of actions on each
file,
i.e. I am loading each file into mysql.
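The bare pattern over the split chunks looks like this (the echo is only a placeholder for the real action, which the next slides build up):

  for i in x*
  do
      echo "processing ${i}"   # placeholder; the real body loads ${i} into mysql
  done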
Mysql options
mysql takes many parameters:
-p password – I don't want my password on the
command line
− So I have stored my password in a file
− The subcommand $(cat dbpwd) reads that file
− And supplies it to mysql
-D database – I want all my imports to go to
tables in this database (akin to USE db in the
mysql shell)
-e to execute a valid SQL statement
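Put together, the invocation looks roughly like this (the account name loader and database mydb are illustrative; dbpwd is the password file from the slide):

  # note: with -p the value must follow immediately, with no space
  mysql -u loader -p"$(cat dbpwd)" -D mydb -e "SHOW TABLES"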
The file I want to import changes, so
I have created a template for the LOAD DATA INFILE command.
The Unix command “sed” – the stream editor – does the substitution:
the for loop iterates over all files matching x*, with ${i} the loop
variable;
${i} is set to each file in turn;
sed returns the modified LOAD DATA statement with the correct file name
substituted for the placeholder (fileX) in the template;
the output of that $(sub-command) is passed to mysql via -e.
Each iteration, one file is loaded.
Running subcommands
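A minimal sketch of that loop, assuming the template lives in a file called load.tpl and uses fileX as the file-name placeholder (both names are illustrative):

  # load.tpl might contain something like:
  #   LOAD DATA INFILE 'fileX' INTO TABLE imports FIELDS TERMINATED BY ','
  for i in x*
  do
      chmod 666 "${i}"                                  # readable by the mysqld user
      mysql -u loader -p"$(cat dbpwd)" -D mydb \
            -e "$(sed "s|fileX|$PWD/${i}|" load.tpl)"   # absolute path, one file per pass
  done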
Verifying your work
OOPS! We are loading only half the records!
But now we can check the smaller batch files.
Using SELECT with a BETWEEN clause, we see
that only half the records are loaded. Note that
the from/to direction matters (BETWEEN expects low AND high).
The Unix commands tail, head and wc give me the count
I should get and the criteria for selecting.
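Roughly, the check looks like this (the table imports, the key column id, and the assumption that the key is the first CSV column are all illustrative):

  wc -l 201407.csv                           # how many records we should have
  low=$(head -1 201407.csv | cut -d, -f1)    # key of the first record
  high=$(tail -1 201407.csv | cut -d, -f1)   # key of the last record
  mysql -u loader -p"$(cat dbpwd)" -D mydb \
        -e "SELECT COUNT(*) FROM imports WHERE id BETWEEN ${low} AND ${high}"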
Troubleshooting with Unix
Why are half the records being dropped?
Examine the file with “od -c”.
Thanks to the hacks who have written incompatible systems, in DOS a line ends with \r\n,
whereas in Unix it is \n.
And the standard technique to convert from one environment to the other is to run
dos2unix on the file.
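For example, on one of the chunks (whichever name split produced):

  od -c xaaaaa | head -3   # \r \n at the end of each line means DOS line endings
  dos2unix xaaaaa          # rewrite the file with Unix \n line endings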
Loading all the records
Fix the data
Rerun the LOAD DATA INFILE
Success!
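Converting every chunk and rerunning is just the same loop again; a sketch, assuming the load loop from earlier was saved as loadall.sh (an illustrative name):

  for i in x*; do dos2unix "${i}"; done   # strip the \r from every chunk
  sh loadall.sh                           # rerun the LOAD DATA loop shown earlier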
Many thanks to Bill Sweeney, Mitch Haviv, Gary
Maier, Byron Nicas, Wes Iwanski, Brian Lara,
Brian Driscoll, Tilak Singh, David Montgomery,
Vadim Dosyche, Peter DeLucia, John Sabini, Ed
Jerlin, Mike Savettiere, Luis Soto, Dan Marcellus
and countless students who have either taught
me or encouraged me over the years!
And to Unix, an environment with a full complement of tools for every text-processing
scenario.
And to IBM Academic Initiatives for their support.
