Tag Archives: perl

newbie picking away at awk

I’ve been looking into awk off and on.
This is kind of a weird thing to do, coming from perl.
I’m not a perl genius, but, I’m intermediate. Which is saying a lot.
Perl is a beast. It’s a madman in the forest tripping on lsd- who commands power over small countries and speaks to aliens. If you can talk to perl, it can do everything from giving your wife an orgasm to putting the baby to sleep.

So, awk- it would seem silly to be interested in it. Since perl could do all of this already. Why learn it?

Well, I’ve been learning it. Because if there’s anything cooler than perl, it’s unix.
And awk is .. well.. unixy.

So.. awk. I’m a total awk noob, please.. keep in mind.

Awk seems to be cool for parsing line output, for one.
I often do ls -lha for listing sizes of things. And I may not be interested in, you know.. permissions. Because maybe I just want to know the size of things..

Playing with ls output

Regular ls

$ ls -lh
total 76K
-rw-rw-r-- 1 leo leo 3.9K 2010-07-13 08:17 build-meta-refresh-substitutes.pl
-rw-rw-r-- 1 leo leo 138 2010-07-02 03:27 convert-old-new.urls-to-htaccess-entries.sh
-rw-r--r-- 1 leo leo 4.1K 2010-07-08 12:31 htaccess
-rw-r--r-- 1 leo leo 3.7K 2010-07-08 12:31 htaccess2
-rw-r--r-- 1 leo leo 3.5K 2010-07-08 12:30 htaccess3
-rw-r--r-- 1 leo leo 4.0K 2010-07-08 12:35 htaccess4
drwxrwxr-x 2 leo leo 4.0K 2010-07-13 08:18 meta-refresh-versions
-rw-rw-r-- 1 leo leo 5.0K 2010-07-02 02:44 old-new.urls
-rw-r--r-- 1 leo leo 20K 2010-07-13 08:18 refreshpages.zip
-rw-rw-r-- 1 leo leo 6.0K 2010-07-02 02:27 sitemap.new
-rw-rw-r-- 1 leo leo 5.0K 2010-07-02 02:43 sitemap.old

Yes, but what would jesus awk say?

$ ls -lh | awk '{ printf "%s %s\n", $5, $8 }'

 3.9K build-meta-refresh-substitutes.pl
 138 convert-old-new.urls-to-htaccess-entries.sh
 4.1K htaccess
 3.7K htaccess2
 3.5K htaccess3
 4.0K htaccess4
 4.0K meta-refresh-versions
 5.0K old-new.urls
 20K refreshpages.zip
 6.0K sitemap.new
 5.0K sitemap.old

Get it? No?
The lines are treated one by one. Each field on a line is $1, $2, $3, etc.
The delimiter is, by default, the same as the shell’s. That’s whitespace (tabs and spaces).
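
A few more lines to poke at field splitting. This is just a sketch. The sample line is copied from the listing above and the passwd-style line is made up:

```shell
# Sample line in the same shape as the ls -lh output above.
line='-rw-r--r-- 1 leo leo 4.1K 2010-07-08 12:31 htaccess'

# Default splitting: runs of spaces or tabs separate the fields.
echo "$line" | awk '{ print $5, $8 }'    # → 4.1K htaccess

# NF holds the number of fields, so $NF is always the last one.
echo "$line" | awk '{ print NF, $NF }'   # → 8 htaccess

# -F changes the delimiter; here ":" splits a passwd-style line (made-up data).
echo 'leo:x:500:500::/home/leo:/bin/bash' | awk -F: '{ print $1, $6 }'   # → leo /home/leo
```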

looking for text in files and editing in vim

I often need to find something in code or text. Maybe I’m messing with wordpress stuff, and need to find a php function.

For example, finding a php function..

I want to look for a function called get_author* in the php files around here..

html $ find ~/public_html/ -name "*php" | xargs grep 'function get_author'
/home/leocharre/public_html/wp-includes/link-template.php:function get_author_feed_link( $author_id, $feed = '' ) {
/home/leocharre/public_html/wp-includes/rewrite.php: function get_author_permastruct() {
/home/leocharre/public_html/wp-includes/author-template.php:function get_author_posts_url($author_id, $author_nicename = '') {
/home/leocharre/public_html/wp-includes/theme.php:function get_author_template() {

Yes, but what would awk say?

Automating this somewhat..
The cool thing would be to automatically go there, or at least print the commands so I can call up vim by cut and paste.

Ok.. not the easiest thing as it turns out… making use of this..

html $ find ~/public_html/ -name "*php" | xargs grep -s 'function get_author' | sed 's/:\s\+/:/' | sed "s/'.\+//" | grep2vim
vim '/home/leocharre/public_html/wp-includes/link-template.php' /'function get_author_feed_link( $author_id, $feed = '
vim '/home/leocharre/public_html/wp-includes/rewrite.php' /'function get_author_permastruct() {'
vim '/home/leocharre/public_html/wp-includes/author-template.php' /'function get_author_posts_url($author_id, $author_nicename = '
vim '/home/leocharre/public_html/wp-includes/theme.php' /'function get_author_template() {'

Where grep2vim is an awk script inside my bin dir..

html $ cat ~/bin/grep2vim
#!/bin/awk -f
BEGIN { FS=":" }
{ printf "vim '%s' /'%s'\n", $1, $2 }


The output is pretty cool, it’s cut and paste, for example.. and then vim gets the command to search for that string, that’s what the / fuss is all about.
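
One caveat, as far as I can tell: the form vim documents for a start-up search is +/pattern, and a bare /pattern argument gets read as one more file name. So here is a hedged variant that sidesteps the quoting altogether. It assumes grep was run with -n, so each line looks like file:line:text, and it jumps by line number with vim +N. The name grep2vim_n is made up for this sketch:

```shell
# Hypothetical variant of grep2vim: expects "file:line:text" as produced by
# grep -n and prints a vim command that opens the file at that line.
grep2vim_n() {
    awk -F: '{ printf "vim +%s '\''%s'\''\n", $2, $1 }'
}

# Made-up sample line in grep -n format.
printf '%s\n' './wp-includes/theme.php:734:function get_author_template() {' | grep2vim_n
# → vim +734 './wp-includes/theme.php'
```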

Okkkaaaaay…. Putting it all together..

html $ cat ~/bin/findphpfunction2vim

#!/bin/sh
BASEDIR=$1
FUNCTIONNAME=$2

if [ -z "$BASEDIR" ]; then
 echo "$0 missing DIR path"
 exit 1
fi

if [ -z "$FUNCTIONNAME" ]; then
 echo "$0 missing function name"
 exit 1
fi

find "$BASEDIR" -name "*.php" | xargs grep -s "function $FUNCTIONNAME" | sed 's/:\s\+/:/' | sed "s/'.\+//" | grep2vim

Example usage:

html $ findphpfunction2vim ./ is_user
vim './wp-includes/ms-functions.php' /'function is_user_member_of_blog( $user_id, $blog_id = 0 ) {'
vim './wp-includes/ms-functions.php' /'function is_user_spammy( $username = 0 ) {'
vim './wp-includes/ms-functions.php' /'function is_user_option_local( $key, $user_id = 0, $blog_id = 0 ) {'
vim './wp-includes/pluggable.php' /'function is_user_logged_in() {'
vim './wp-admin/includes/class-wp-importer.php' /'function is_user_over_quota() {'

Great, using my terminal emulator, I can just double click and middle click to cut and paste, automatically executed since select works including the carriage return.
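
The find | xargs combo can trip over file names with spaces in them. A sketch of the same search in one step, assuming GNU grep (for -r and --include). The name findphpfunction is made up here, and the output is plain file:line:text, not vim commands:

```shell
# Hypothetical helper: list "file:line:text" for PHP function definitions
# whose name starts with $2, searching recursively under directory $1.
# Assumes GNU grep, which provides -r and --include.
findphpfunction() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: findphpfunction DIR NAME" >&2
        return 1
    fi
    grep -rn --include='*.php' "function $2" "$1"
}
```

Call it the same way as the script above, for example findphpfunction ./ is_user.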

generating bugzilla simple bug summary text report

I installed bugzilla on our network to track development in my office. It’s wonderful.
I figured it was pretty easy for non-techs to use, if they want to make notes, report, etc.

But the reality is most people in the office will not be handling it, it’s overkill.

I needed to be able to summarize the bugs, the bug history.
I needed to generate a simple outline/ text.. that says something like..

bug 1: title

bug 2: title

That kind of thing.
Because long text format (where you show whole bug detail history) is overkill for these people.
I needed a text summary of bugzilla, that I could print out and put on my boss’ desk every now and then.

I thought, there must be some script already out there to be able to get an array of structs/hashes for each bug.. so you can iterate and print … or whatever.
After discussion on perlmonks.. I realized .. if so.. then it’s not distributed. Probably some people must have scripts doing this.. But .. Seriously.. I don’t ever want somebody else to solve this freaking problem if I already have. And I don’t ever want to solve it again!!! THAT’S WHY WE DISTRIBUTE.

So.. I made a proper distro and uploaded to cpan.

It’s called bugzillareport.

Some example output:

   AP links show wrong file counts (DMS WUI)
   Status: RESOLVED
   Bug ID: 11
   "This past week there were 2 cases where the information in the
   hearders did not match what was in the folders."...
   WJA, shows 11 files listed as Invoices Pending Payment, but going to
   place shows 11 and 2 APIE files.
   The  various screens in the dms ap process were designed to be for use
   inhouse. The users were supposed to be able to approve, and disapprove
   The managers were to merge an approved invoice with a scanned check.

   We did not design view screens specifically for users in these various
   steps of the ap process.
   These various screens have to be designed.

   Need additional ap process view screens for users (DMS WUI)
   Status: RESOLVED
   Bug ID: 16
   The screens needed are

   1) invoices pending approval

   2) invoices pending payment

   3) invoices disapproved

   4) invoices and checks pending filing

   5) vendor history

   New screens have been made, new logic for navigation bar for ap has
   been made and deployed.

Instant mysql connection with perl.

Reading about the exciting concept of code speed vs programmer time. That is, balancing what it costs to make a machine run slow code, vs what it costs to let a coder work easier.


So.. I noticed I was coding stuff that uses a mysql connection, and I store the connection variables, the credentials- in a YAML configuration file. Often. It’s good practice. YAML files are simple, and you can store them anywhere- this helps keep sensitive data, or .. the data that makes the same app act different- apart from the executable application.

I was helping a co-worker learn some perl, and he wanted to connect to a db. He was sort of impatient- because the details of establishing a connection with a new language are a little bit… beneath him? This is someone highly trained and skilled in other languages and systems. And he wanted to just.. *have* a connection handle to a database.

So… YAML.. DBH…. I wrote and released YAML::DBH.

As simple as it gets. You write a config file:

password: super
user: myself
database: superstuff

In your script:

use YAML::DBH 'yaml_dbh';

my $dbh = yaml_dbh('./credentials.yml');
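
The same flat file is easy to read from the shell too, say for a quick mysql command line session. A minimal sketch, assuming only the simple key: value layout shown above (no nesting, no ": " inside a value). This is not how YAML::DBH works inside, and yaml_get is a made-up name:

```shell
# Hypothetical helper: print the value for one key from a flat "key: value" file.
# Only handles the simple, un-nested layout shown above, not general YAML.
yaml_get() {
    awk -F': *' -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Possible use with the mysql client (left commented out, it needs a real server):
# mysql --user="$(yaml_get ./credentials.yml user)" \
#       --password="$(yaml_get ./credentials.yml password)" \
#       "$(yaml_get ./credentials.yml database)"
```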

gnu coreutils md5sum 3 times faster than Digest::MD5::File

I use md5 sums for identifying files under FileArchiveIndexer.
One of the things that needs improvement is updating the files list- the speed of this procedure.

Now, with getting md5 sums of a few files, there is no issue. Even if we have a few files of large size.
When we are dealing with a few gigs, this becomes more important.

In running benchmarks of FileArchiveIndexer::Update , I notice that the cpu and memory consumption hover at about 10% and 12% or so- This is really not making good use of the machine.

I set about making some tests that would benchmark different ways of getting the md5 sum for files.
There are the following ways that I’ve looked at..


  • The most basic that seems to make sense here is Digest::MD5::File. This module provides various means of getting digests with a simple path argument to the file.
  • Another method is to read in the file data, all of it at once (watch out.. this eats memory)- and use Digest::MD5 to get the md5_hex() sum.
  • I also considered a lazy approach to reading in only the first 25k or so of a file, and getting a digest from that- as a sort of.. lazy and dangerous way of getting sums. There are reasons, or situations in which this can actually be useful.
  • The fourth way- which I thought would be slow- but I wanted to test it anyhow- is using gnu coreutils md5sum via the command line. That is- making a system-ish call to md5sum.
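
For reference, the last two approaches look roughly like this from the shell side. This is only a sketch, not the code from the benchmark suite. The function names are borrowed from the profile output for readability, and 25600 bytes is my reading of "25k or so":

```shell
# Full digest via coreutils. Reading from stdin makes md5sum print "<hash>  -",
# so the first field is the hash.
md5_cli() {
    md5sum < "$1" | awk '{ print $1 }'
}

# "Lazy" digest: hash only the first 25k (25600 bytes, an assumed cutoff).
# Two files that differ only after that point get the same sum.
md5_lazy() {
    head -c 25600 "$1" | md5sum | awk '{ print $1 }'
}
```

For a file smaller than the cutoff both functions give the same sum. For bigger files the lazy one is only a hint that two files might be the same.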

    [leo@localhost LEOCHARRE-MD5-Benchmark]$ dprofpp -I ./tmon.out
    Total Elapsed Time = 89.05905 Seconds
      User+System Time = 22.93905 Seconds
    Inclusive Times
    %Time ExclSec CumulS #Calls sec/call Csec/c  Name
     100.   0.020 23.021      5   0.0040 4.6042  main::bench_a_dir
     99.5   0.288 22.826     20   0.0144 1.1413  main::get_file_md5s
     66.2   0.134 15.200   5225   0.0000 0.0029  main::_md5_Digest_MD5_File
     65.6   9.682 15.066   5225   0.0019 0.0029  Digest::MD5::File::file_md5_hex
     20.6   4.737  4.737   5225   0.0009 0.0009  Digest::MD5::add
     19.8   1.634  4.551   5225   0.0003 0.0009  main::_md5_Digest_MD5
     15.7   3.614  3.614  10450   0.0003 0.0003  Digest::MD5::md5_hex
     7.79   1.787  1.787   5225   0.0003 0.0003  main::_md5_cli
     4.36   0.304  1.001   5225   0.0001 0.0002  main::_md5_lazy
     2.25   0.517  0.517   5225   0.0001 0.0001  Digest::MD5::File::__ANON__
     0.78   0.020  0.178      9   0.0022 0.0198  main::BEGIN
     0.65   0.150  0.150      5   0.0300 0.0300  main::ls
     0.44   0.020  0.100      5   0.0040 0.0199  Digest::MD5::File::BEGIN
     0.38   0.087  0.087   5225   0.0000 0.0000  Digest::MD5::hexdigest
     0.35   0.060  0.080     10   0.0060 0.0080  LWP::UserAgent::BEGIN

    Now, what interests us here are the following ..

    [leo@localhost LEOCHARRE-MD5-Benchmark]$ dprofpp -I ./tmon.out  | grep '::_md5'
     66.2   0.134 15.200   5225   0.0000 0.0029  main::_md5_Digest_MD5_File
     19.8   1.634  4.551   5225   0.0003 0.0009  main::_md5_Digest_MD5
     7.79   1.787  1.787   5225   0.0003 0.0003  main::_md5_cli
     4.36   0.304  1.001   5225   0.0001 0.0002  main::_md5_lazy

    The total amount of data processed is about 250 megs in this test.
    As it turns out, making a call to gnu coreutils md5sum, is about… THREE TIMES FASTER than using Digest::MD5::File- which is what I was using before.
    This means the files/locations update on a few gigs of data that normally take 30 minutes to do- can take 10- if I make calls to cli md5sum.


    FileArchiveIndexer takes care of using OCR on a massive amount of scanned in hard copy documents. These documents may change location, be renamed, enter the system- leave the system- be copied.. etc. The faster we can register their existence- the better we can track user changes.

    You are welcome to download my test suite: leocharre-md5-benchmark-01tar.gz

problems installing DBD::mysql

So I was doing a fresh install of my customized database api package for perl. LEOCHARRE::Database.
Goodly enough, my perl Makefile.PL let me know that I was missing DBD::mysql. Great.

I fire up cpan install DBD::mysql, and alas.. No go! How come??

Turns out you need to install mysql-client and mysql-devel.
I’m on a fedora core gui, so I use yum..

yum -y install mysql-client mysql-devel

Now let’s try cpan again..
cpan install DBD::mysql

It works better.. but oops.. still ..
2 tests skipped.
Failed 31/34 test scripts, 8.82% okay. 473/478 subtests failed, 1.05% okay.
make: *** [test_dynamic] Error 255
/usr/bin/make test -- NOT OK
Running make install
make test had returned bad status, won't install without force

What’s up?
I think the mysql server’s not running on this machine. So we need to install it and start it for the test suite to pass under cpan.

yum -y install mysql-server

And then..
[root@localhost LEOCHARRE-Database]# /etc/init.d/mysqld status
mysqld is stopped
[root@localhost LEOCHARRE-Database]# /etc/init.d/mysqld start
Initializing MySQL database: Installing MySQL system tables...
Filling help tables...

Great. Let’s try that cpan again..

cpan install DBD::mysql

Haha! It works! :-)
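
Next time I could check for the pieces before firing up cpan. A rough sketch, assuming the client package provides the mysql command and the devel package provides mysql_config (as far as I know, that is what the DBD::mysql build looks for). missing_cmds is a made-up helper:

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Assumption: mysql comes with the client package, mysql_config with mysql-devel.
missing=$(missing_cmds mysql mysql_config)
if [ -n "$missing" ]; then
    echo "install first:" $missing
fi
```

This only covers the build side. The tests still want a running server, so /etc/init.d/mysqld status is worth a look as well.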