
The Anti-Kyte

Oracle - for when it was like that when you got there

Null is Odd…or Things I Used to Know about SQL Aggregate Functions

Mon, 2016-05-16 14:03

Brendan McCullum recently played his final Test for New Zealand.
That’s something of an understatement. In his last game he made a century in a mere 54 balls, a feat unmatched in 139 years of test cricket.
From the outside looking in, it seemed that McCullum had come to realise something he’d always known. Playing cricket is supposed to be fun.
What’s more, you can consider yourself quite fortunate if you get paid for doing something you enjoy, especially when that something is hitting a ball with a stick.

With the help of Mr McCullum, what follows will serve to remind me of something I’ve always known but may forget from time to time.
In my case, it’s the fact that NULL is odd. This is especially true when it comes to basic SQL aggregation functions.

Some test data

We’ve got a simple table that holds the number of runs scored by McCullum in each of his Test innings together with a nullable value to indicate whether or not he was dismissed in that innings.

This is relevant because one of the things we’re going to do is calculate his batting average.

In Cricket, the formula for this is :

Runs Scored / (Innings Batted – Times Not Out)

Anyway, here’s the table :

create table mccullum_inns
(
    score number not null,
    not_out number
)
/

…and the data…

insert into mccullum_inns( score,not_out) values (57, null);
insert into mccullum_inns( score,not_out) values (19, 1);
insert into mccullum_inns( score,not_out) values (13, null);
insert into mccullum_inns( score,not_out) values (55, null);
insert into mccullum_inns( score,not_out) values (3, null);
insert into mccullum_inns( score,not_out) values (5, null);
insert into mccullum_inns( score,not_out) values (96, null);
insert into mccullum_inns( score,not_out) values (54, null);
insert into mccullum_inns( score,not_out) values (20, null);
insert into mccullum_inns( score,not_out) values (21, null);
insert into mccullum_inns( score,not_out) values (4, null);
insert into mccullum_inns( score,not_out) values (143, null);
insert into mccullum_inns( score,not_out) values (17, 1);
insert into mccullum_inns( score,not_out) values (10, null);
insert into mccullum_inns( score,not_out) values (8, null);
insert into mccullum_inns( score,not_out) values (10, null);
insert into mccullum_inns( score,not_out) values (36, null);
insert into mccullum_inns( score,not_out) values (29, null);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (3, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (99, null);
insert into mccullum_inns( score,not_out) values (7, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (111, null);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (19, null);
insert into mccullum_inns( score,not_out) values (74, null);
insert into mccullum_inns( score,not_out) values (23, null);
insert into mccullum_inns( score,not_out) values (31, null);
insert into mccullum_inns( score,not_out) values (33, null);
insert into mccullum_inns( score,not_out) values (5, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (5, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (14, 1);
insert into mccullum_inns( score,not_out) values (43, null);
insert into mccullum_inns( score,not_out) values (17, null);
insert into mccullum_inns( score,not_out) values (9, null);
insert into mccullum_inns( score,not_out) values (26, null);
insert into mccullum_inns( score,not_out) values (13, null);
insert into mccullum_inns( score,not_out) values (21, null);
insert into mccullum_inns( score,not_out) values (7, null);
insert into mccullum_inns( score,not_out) values (40, null);
insert into mccullum_inns( score,not_out) values (51, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (85, null);
insert into mccullum_inns( score,not_out) values (9, null);
insert into mccullum_inns( score,not_out) values (42, null);
insert into mccullum_inns( score,not_out) values (97, null);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (11, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (9, null);
insert into mccullum_inns( score,not_out) values (71, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (2, null);
insert into mccullum_inns( score,not_out) values (66, null);
insert into mccullum_inns( score,not_out) values (8, null);
insert into mccullum_inns( score,not_out) values (3, null);
insert into mccullum_inns( score,not_out) values (30, null);
insert into mccullum_inns( score,not_out) values (84, 1);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (31, null);
insert into mccullum_inns( score,not_out) values (19, null);
insert into mccullum_inns( score,not_out) values (3, null);
insert into mccullum_inns( score,not_out) values (84, null);
insert into mccullum_inns( score,not_out) values (115, null);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (6, null);
insert into mccullum_inns( score,not_out) values (1, null);
insert into mccullum_inns( score,not_out) values (29, null);
insert into mccullum_inns( score,not_out) values (18, null);
insert into mccullum_inns( score,not_out) values (13, null);
insert into mccullum_inns( score,not_out) values (78, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (89, null);
insert into mccullum_inns( score,not_out) values (185, null);
insert into mccullum_inns( score,not_out) values (19, 1);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (104, null);
insert into mccullum_inns( score,not_out) values (5, null);
insert into mccullum_inns( score,not_out) values (51, null);
insert into mccullum_inns( score,not_out) values (65, null);
insert into mccullum_inns( score,not_out) values (11, 1);
insert into mccullum_inns( score,not_out) values (4, null);
insert into mccullum_inns( score,not_out) values (225, null);
insert into mccullum_inns( score,not_out) values (40, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (56, null);
insert into mccullum_inns( score,not_out) values (35, null);
insert into mccullum_inns( score,not_out) values (2, null);
insert into mccullum_inns( score,not_out) values (64, null);
insert into mccullum_inns( score,not_out) values (14, null);
insert into mccullum_inns( score,not_out) values (11, null);
insert into mccullum_inns( score,not_out) values (34, null);
insert into mccullum_inns( score,not_out) values (1, null);
insert into mccullum_inns( score,not_out) values (16, null);
insert into mccullum_inns( score,not_out) values (12, null);
insert into mccullum_inns( score,not_out) values (83, null);
insert into mccullum_inns( score,not_out) values (48, null);
insert into mccullum_inns( score,not_out) values (58, 1);
insert into mccullum_inns( score,not_out) values (61, null);
insert into mccullum_inns( score,not_out) values (5, null);
insert into mccullum_inns( score,not_out) values (31, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (84, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (19, null);
insert into mccullum_inns( score,not_out) values (22, null);
insert into mccullum_inns( score,not_out) values (42, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (23, null);
insert into mccullum_inns( score,not_out) values (68, null);
insert into mccullum_inns( score,not_out) values (13, null);
insert into mccullum_inns( score,not_out) values (4, null);
insert into mccullum_inns( score,not_out) values (35, null);
insert into mccullum_inns( score,not_out) values (7, null);
insert into mccullum_inns( score,not_out) values (51, null);
insert into mccullum_inns( score,not_out) values (13, null);
insert into mccullum_inns( score,not_out) values (11, null);
insert into mccullum_inns( score,not_out) values (74, null);
insert into mccullum_inns( score,not_out) values (69, null);
insert into mccullum_inns( score,not_out) values (38, null);
insert into mccullum_inns( score,not_out) values (67, 1);
insert into mccullum_inns( score,not_out) values (2, null);
insert into mccullum_inns( score,not_out) values (8, null);
insert into mccullum_inns( score,not_out) values (20, null);
insert into mccullum_inns( score,not_out) values (1, null);
insert into mccullum_inns( score,not_out) values (21, null);
insert into mccullum_inns( score,not_out) values (22, null);
insert into mccullum_inns( score,not_out) values (11, null);
insert into mccullum_inns( score,not_out) values (113, null);
insert into mccullum_inns( score,not_out) values (9, null);
insert into mccullum_inns( score,not_out) values (37, null);
insert into mccullum_inns( score,not_out) values (12, null);
insert into mccullum_inns( score,not_out) values (224, null);
insert into mccullum_inns( score,not_out) values (1, null);
insert into mccullum_inns( score,not_out) values (8, null);
insert into mccullum_inns( score,not_out) values (302, null);
insert into mccullum_inns( score,not_out) values (7, null);
insert into mccullum_inns( score,not_out) values (17, null);
insert into mccullum_inns( score,not_out) values (4, null);
insert into mccullum_inns( score,not_out) values (3, null);
insert into mccullum_inns( score,not_out) values (31, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (18, null);
insert into mccullum_inns( score,not_out) values (39, null);
insert into mccullum_inns( score,not_out) values (43, null);
insert into mccullum_inns( score,not_out) values (45, null);
insert into mccullum_inns( score,not_out) values (202, null);
insert into mccullum_inns( score,not_out) values (195, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (22, null);
insert into mccullum_inns( score,not_out) values (42, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (41, null);
insert into mccullum_inns( score,not_out) values (55, null);
insert into mccullum_inns( score,not_out) values (6, null);
insert into mccullum_inns( score,not_out) values (80, null);
insert into mccullum_inns( score,not_out) values (27, null);
insert into mccullum_inns( score,not_out) values (4, null);
insert into mccullum_inns( score,not_out) values (20, null);
insert into mccullum_inns( score,not_out) values (75, null);
insert into mccullum_inns( score,not_out) values (17, 1);
insert into mccullum_inns( score,not_out) values (18, null);
insert into mccullum_inns( score,not_out) values (18, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (10, null);
insert into mccullum_inns( score,not_out) values (145, null);
insert into mccullum_inns( score,not_out) values (25, null);
commit;

I’ve loaded this into my Oracle 11g Express Edition (XE) database.

Don’t count on COUNT()

Let’s just check the number of rows in the table :

select count(*), count(score), count(not_out)
from mccullum_inns
/

  COUNT(*) COUNT(SCORE) COUNT(NOT_OUT) 
---------- ------------ --------------
       176          176             9
       

Hmmm, that’s interesting. Whilst there are 176 rows in the table, a count of the NOT_OUT column only returns 9, which is the number of rows with a non-null value in this column.

The fact is that COUNT(*) behaves a bit differently from COUNT(some_column)…

with stick as
(
    select 1 as ball from dual
    union all select 2 from dual
    union all select null from dual
)    
select count(*), count(ball)
from stick
/

COUNT(*)                             COUNT(BALL)
---------- ---------------------------------------
         3                                       2

Tanel Poder provides the explanation as to why this happens here.
Due to this difference in behaviour, you may well consider that COUNT(*) is a completely different function to COUNT(column), at least where NULLS are concerned.

When all else fails, Read the Manual

From very early on, database developers learn to be wary of columns that may contain null values and code accordingly, making frequent use of the NVL function.
However, aggregate functions can prove to be something of a blind spot. This can lead to some interesting results.
Whilst we know (and can prove) that NULL + anything equals NULL…

select 3 + 1 + 4 + 1 + null as ball
from dual
/

     BALL
----------
         

…if we use an aggregate function…

with stick as 
(
    select 3 as ball from dual
    union all select 1 from dual
    union all select 4 from dual
    union all select 1 from dual
    union all select null from dual
)
select sum(ball) 
from stick
/

SUM(BALL)
----------
         9

…so, calculating an average may well lead to some confusion…

with stick as 
(
    select 3 as ball from dual
    union all select 1 from dual
    union all select 4 from dual
    union all select 1 from dual
    union all select null from dual
)
select avg(ball)
from stick
/

AVG(BALL)
----------
      2.25

…which is not what we would expect given :

with stick as 
(
    select 3 as ball from dual
    union all select 1 from dual
    union all select 4 from dual
    union all select 1 from dual
    union all select null from dual
)
select sum(ball)/count(*) as Average
from stick
/

   AVERAGE
----------
       1.8 

You can see similar behaviour with the MAX and MIN functions :

with stick as 
(
    select 3 as ball from dual
    union all select 1 from dual
    union all select 4 from dual
    union all select 1 from dual
    union all select null from dual
)
select max(ball), min(ball)
from stick
/

 MAX(BALL)  MIN(BALL)
---------- ----------
         4          1

Looking at the documentation, we can see that :

“All aggregate functions except COUNT(*), GROUPING, and GROUPING_ID ignore nulls. You can use the NVL function in the argument to an aggregate function to substitute a value for a null. COUNT and REGR_COUNT never return null, but return either a number or zero. For all the remaining aggregate functions, if the data set contains no rows, or contains only rows with nulls as arguments to the aggregate function, then the function returns null.”
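For instance, applying the NVL substitution that the documentation mentions to the earlier AVG example brings back the answer we were originally expecting – a quick sketch using the same values :

with stick as 
(
    select 3 as ball from dual
    union all select 1 from dual
    union all select 4 from dual
    union all select 1 from dual
    union all select null from dual
)
select avg(nvl(ball, 0)) as average
from stick
/

…which returns 1.8, in line with the SUM(ball)/COUNT(*) calculation above.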

So, if we want our aggregate functions to behave themselves, or at least, behave as we might expect, we need to account for situations where the column on which they are operating may be null.
Returning to COUNT…

select count(nvl(not_out, 0)) 
from mccullum_inns
/

                 COUNT(NVL(NOT_OUT,0))
---------------------------------------
                                    176
                    

Going back to our original task, i.e. finding McCullum’s final batting average, we could do this :

select count(*) as Inns, 
    sum(score) as Runs,
    sum(nvl(not_out,0)) as "Not Outs",
    round(sum(score)/(count(*) - count(not_out)),2) as Average
from mccullum_inns
/

INNS  RUNS   Not Outs AVERAGE
----- ----- ---------- -------
  176  6453          9   38.64
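
Just to sanity-check the arithmetic against the Cricket formula from earlier : 6453 / (176 – 9) = 6453 / 167, which works out at 38.64 to two decimal places.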
  

However, now we’ve re-learned how nulls are treated by aggregate functions, we could save ourselves a bit of typing…

select count(*) as Inns, 
    sum(score) as Runs,
    count(not_out) as "Not Outs",
    round(sum(score)/(count(*) - count(not_out)),2) as Average
from mccullum_inns
/

INNS  RUNS   Not Outs AVERAGE
----- ----- ---------- -------
  176  6453          9   38.64

Time to draw stumps.


Filed under: Oracle, SQL Tagged: avg, count(*), max, min, null and aggregate functions, NVL, sum

Standard SQL ? – Oracle REGEXP_LIKE

Thu, 2016-04-28 05:55

Is there any such thing as ANSI Standard SQL ?
Lots of databases claim to conform to this standard. Recent experience tends to make me wonder whether it’s more just a basis for negotiation.
This view is partly the result of having to juggle SQL between three different SQL parsers in the Cloudera Hadoop infrastructure, each with their own “quirks”.
It’s worth remembering however, that SQL differs across established Relational Databases as well, as a recent question from Simon (Teradata virtuoso and Luton Town Season Ticket Holder) demonstrates :

Is there an Oracle equivalent of the Teradata LIKE ANY operator when you want to match against a list of patterns, for example :

like any ('%a%', '%b%')

In other words, can you do a string comparison, including wildcards, within a single predicate in Oracle SQL ?

The short answer is yes, but the syntax is a bit different….

The test table

We’ve already established that we’re not comparing apples with apples, but I’m on a bit of a health kick at the moment, so…

create table fruits as
    select 'apple' as fruit from dual
    union all
    select 'banana' from dual
    union all
    select 'orange' from dual
    union all
    select 'lemon' from dual
/

The multiple predicate approach

Traditionally the search statement would look something like :

select fruit
from fruits
where fruit like '%a%'
or fruit like '%b%'
/

FRUIT 
------
apple 
banana
orange

REGEXP_LIKE

Using REGEXP_LIKE takes a bit less typing and – unusually for a regular expression – fewer non-alphanumeric characters …

select fruit
from fruits
where regexp_like(fruit, '(a)|(b)')
/

FRUIT 
------
apple 
banana
orange

We can also search for multiple substrings in the same way :

select fruit
from fruits
where regexp_like(fruit, '(an)|(on)')
/

FRUIT 
------
banana
orange
lemon 

I know, it doesn’t feel like a proper regular expression unless we’re using the top row of the keyboard.

Alright then, if we just want to get records that start with ‘a’ or ‘b’ :

select fruit
from fruits
where regexp_like(fruit, '(^a)|(^b)')
/

FRUIT 
------
apple 
banana

If instead, we want to match the end of the string…

select fruit
from fruits
where regexp_like(fruit, '(ge$)|(on$)')
/

FRUIT
------
orange
lemon

…and if you want to combine searching for patterns at the start, end or anywhere in a string, in this case searching for records that

  • start with ‘o’
  • or contain the string ‘ana’
  • or end with the string ‘on’

select fruit
from fruits
where regexp_like(fruit, '(^o)|(ana)|(on$)')
/

FRUIT
------
banana
orange
lemon

Finally on this whistle-stop tour of REGEXP_LIKE, for a case insensitive search…

select fruit
from fruits
where regexp_like(fruit, '(^O)|(ANA)|(ON$)', 'i')
/

FRUIT
------
banana
orange
lemon

There’s quite a bit more to regular expressions in Oracle SQL.
For a start, here’s an example of using REGEXP_LIKE to validate a UK Post Code.
There’s also a comprehensive guide here on the PSOUG site.
Now I’ve gone through all that fruit I feel healthy enough for a quick jog… to the nearest pub.
I wonder if that piece of lime they put in the top of a bottle of beer counts as one of my five a day ?


Filed under: Oracle, SQL Tagged: regexp_like

Getting the current SQL statement from SYS_CONTEXT using Fine Grained Auditing

Sun, 2016-04-17 13:44

The stand-off between Apple and the FBI has moved on. In essence both sides have taken it in turns to refuse to tell each other how to hack an iPhone.

Something else that tends to tell little or nothing in the face of repeated interrogation is SYS_CONTEXT(‘userenv’, ‘current_sql’).
If you’re fortunate enough to be running on Enterprise Edition however, a Fine Grained Auditing Policy will loosen its tongue.

Consider the following scenario.
You’ve recently got a job as a database specialist with Spectre.
They’ve been expanding their IT department recently as the result of their “Global Surveillance Initiative”.

There’s not much of a view from your desk as there are no windows in the hollowed out volcano that serves as the Company’s HQ.
The company is using Oracle 12c Enterprise Edition.

Everything seems to be going along nicely until you suddenly get a “request” from the Head of Audit, a Mr Goldfinger.
The requirement is that any changes to employee data in the HR system are recorded, together with the statement executed to change each record.
Reading between the lines, you suspect that Mr White – head of HR – is not entirely trusted by the hierarchy.

Whilst journalling triggers are common enough, capturing the actual SQL used to make DML changes is a bit more of a challenge.
Explaining this to Mr Goldfinger is unlikely to be a career-enhancing move. You’re going to have to be a bit creative if you want to avoid the dreaded “Exit Interview” (followed by a visit to the Piranha tank).

First of all though….

Fine Grained Auditing Configuration

You need to do a quick check to make sure that Fine Grained Auditing is available and configured in the way you would expect.

Access to Fine Grained Auditing

FGA is a feature of Oracle Enterprise Edition.
If you were working on any other edition of the database, Oracle would tell you that FGA is not enabled. For example, running the following on Oracle Express Edition 11g…

begin
    dbms_fga.add_policy
    (
        object_schema => 'HR',
        object_name => 'DEPARTMENTS',
        policy_name => 'WATCHING YOU',
        audit_condition => null,
        statement_types => 'INSERT, UPDATE, DELETE'
    );
end;
/

… will result in the Oracle Database telling you you’re fish-food …

ERROR at line 1:
ORA-00439: feature not enabled: Fine-grained Auditing
ORA-06512: at "SYS.DBMS_FGA", line 20
ORA-06512: at line 2

You can avoid this embarrassment simply by checking what edition of Oracle you’re running :

select banner
from v$version
/

In the case of Oracle 12c, you’ll get :

BANNER
--------------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
PL/SQL Release 12.1.0.2.0 - Production
CORE	12.1.0.2.0	Production
TNS for Linux: Version 12.1.0.2.0 - Production
NLSRTL Version 12.1.0.2.0 - Production

If you don’t happen to work for a worldwide crime syndicate and/or don’t have access to an Enterprise Edition database, you can still have a play around by means of a Developer Day Virtual Box image.

Unified Auditing

The other thing you need to check is just where audit records are going to be written to. This is not so much a requirement for the solution being implemented here, but it is relevant to some of the examples that follow.

By default, unified auditing is not implemented in 12c and you can confirm this by running :

select value
from v$option
where parameter = 'Unified Auditing'
/

If the query returns FALSE, then Unified Auditing has not been enabled.
Otherwise, it’s probably worth taking a look at the documentation to see how this affects auditing behaviour in the database.

Initialization Parameters

Assuming Unified Auditing has not been configured, the location of the audit records will be dictated by the AUDIT_TRAIL initialization parameter. You can check this value as follows :

select value
from v$parameter
where name = 'audit_trail'
/

If the value is set to DB or DB,EXTENDED then any FGA policies should write to the tables mentioned below.
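Should you need to change it, AUDIT_TRAIL is a static parameter, so (assuming you have the appropriate privileges and can bounce the instance) something along these lines would be required – a sketch, rather than something to run on the production Volcano without due care :

alter system set audit_trail=db,extended scope=spfile;

-- AUDIT_TRAIL is a static parameter, so the instance needs to be restarted
-- for the change to take effect
shutdown immediate
startup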

Now to take a closer look at FGA…

How long before SYS_CONTEXT cracks ?

To test exactly when you will be able to retrieve the DML statement you’re interested in, you can knock up a quick test.

First, you need a table to audit against for testing purposes :

create table trigger_messages
(
    message varchar2(4000)
)
/

Next, a simple procedure to insert a record :

create or replace procedure add_message( i_msg in trigger_messages.message%type)
is
begin
    insert into trigger_messages(message) values( i_msg);
end;
/

Now for a DML trigger on the table :

create or replace trigger trg_msg
    for insert or update or delete 
    on trigger_messages
    compound trigger
    
    l_action varchar2(10);
    before statement is
    begin
        l_action := case when inserting then 'INSERT' when updating then 'UPDATE' else 'DELETE' end;
        dbms_output.put_line('Before Statement '||l_action);
        dbms_output.put_line( nvl( sys_context('userenv', 'current_sql'), 'My lips are sealed'));
    end before statement;
    
    before each row is
    begin
        dbms_output.put_line('Before Row '||l_action);
        dbms_output.put_line( nvl( sys_context('userenv', 'current_sql'), 'My lips are sealed'));
    end before each row;
    
    after each row is
    begin
        dbms_output.put_line('After Row '||l_action);
        dbms_output.put_line( nvl( sys_context('userenv', 'current_sql'), 'My lips are sealed'));
    end after each row;
    
    after statement is
    begin
        dbms_output.put_line('After Statement '||l_action);
        dbms_output.put_line( nvl( sys_context('userenv', 'current_sql'), 'My lips are sealed'));
    end after statement;
end trg_msg;
/

Next up, you need a procedure to serve as a handler for a Fine Grained Auditing event. The reason for this will become apparent when we run the test. Note that the signature for an FGA handler procedure is mandated :

create or replace procedure trg_msg_fga
(
    object_schema varchar2,
    object_name varchar2,
    policy_name varchar2
)
is
begin
    dbms_output.put_line('FGA Policy');
    dbms_output.put_line(sys_context('userenv', 'current_sql'));
    dbms_output.put_line(sys_context('userenv', 'current_bind'));
    dbms_output.put_line(sys_context('userenv', 'current_sql_length'));
end;
/

Now all that’s left to do is to create an FGA policy on the table :

begin
    dbms_fga.add_policy
    (
        object_schema => 'MIKE',
        object_name => 'TRIGGER_MESSAGES',
        policy_name => 'FIRING_ORDER',
        statement_types => 'INSERT, UPDATE, DELETE',
        handler_schema => 'MIKE',
        handler_module => 'TRG_MSG_FGA'
    );
end;
/

You can confirm that the policy has been created successfully and is enabled by querying DBA_AUDIT_POLICIES…

select object_schema, object_name, enabled,
    sel, ins, upd, del
from dba_audit_policies
where policy_owner = user
and policy_name = 'FIRING_ORDER'
/

OBJECT_SCHEMA	OBJECT_NAME	     ENABLED	SEL   INS   UPD   DEL
--------------- -------------------- ---------- ----- ----- ----- -----
MIKE		TRIGGER_MESSAGES     YES	NO    YES   YES   YES

Now you’re ready to test…

set serveroutput on size unlimited

begin 
    add_message('Spectre - at the cutting-edge of laser technology');
end;
/

update trigger_messages set message = 'Spectre - coming to a browser near you'
/

delete from trigger_messages
/

The results are quite interesting…

Before Statement INSERT
My lips are sealed
Before Row INSERT
My lips are sealed
FGA Policy
INSERT INTO TRIGGER_MESSAGES(MESSAGE) VALUES( :B1 )
#1(49):Spectre - at the cutting-edge of laser technology
51
After Row INSERT
My lips are sealed
After Statement INSERT
My lips are sealed

PL/SQL procedure successfully completed.

Before Statement UPDATE
My lips are sealed
Before Row UPDATE
My lips are sealed
FGA Policy
update trigger_messages set message = 'Spectre - coming to a browser near you'
78
After Row UPDATE
My lips are sealed
After Statement UPDATE
My lips are sealed

1 row updated.

Before Statement DELETE
My lips are sealed
Before Row DELETE
My lips are sealed
After Row DELETE
My lips are sealed
FGA Policy
delete from trigger_messages
28
After Statement DELETE
My lips are sealed

1 row deleted.

From this you conclude that :

  • sys_context is only populated with the current statement inside the fga handler procedure
  • the handler procedure is invoked prior to the after row event for inserts and updates, but not for deletes

At this point, you consider that it might just be simpler to interrogate the DBA_FGA_AUDIT_TRAIL view, which has also captured the DML statements we’ve just run :

select sql_text
from dba_fga_audit_trail
where policy_name = 'FIRING_ORDER'
order by timestamp
/  

SQL_TEXT
----------------------------------------------------------------------------------------------------------------------------------
INSERT INTO TRIGGER_MESSAGES(MESSAGE) VALUES( :B1 )
update trigger_messages set message = 'Spectre - coming to a browser near you'
delete from trigger_messages

Note – the bind values for the procedure call can be found in the SQL_BIND column of this view.
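For example :

select sql_text, sql_bind
from dba_fga_audit_trail
where policy_name = 'FIRING_ORDER'
order by timestamp
/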

However, it’s worth noting that, although we haven’t actually committed the test transaction yet, these records are already present.
They will remain there, even if the transaction is rolled back.

In the end, you decide that the best approach is a journalling trigger…

The Unnecessarily Slow Dipping Mechanism – the DML trigger

Due to the nature of the organization, Spectre doesn’t have employees. It has associates. This is reflected in the table that you need to audit :

create table associates
(
    emp_id number,
    emp_name varchar2(100),
    job_title varchar2(30)
)
/

The table to hold the audit trail will probably look something like this :

create table assoc_audit
(
    action varchar2(6),
    changed_by varchar2(30),
    change_ts timestamp,
    emp_id number,
    emp_name varchar2(100),
    job_title varchar2(30),
    statement varchar2(4000),
    binds varchar2(4000)
)
/

It’s worth pausing at this point to note that SYS_CONTEXT can report up to 32k of a statement.
It does this by splitting the statement into eight 4k chunks, available in the USERENV context variables CURRENT_SQL, CURRENT_SQL1…CURRENT_SQL7.
It also provides the length of the statement it currently holds in the CURRENT_SQL_LENGTH variable.
Therefore, you may consider having a 32k varchar statement column in the audit table (if this is enabled on your database), or even a column for the contents of each of these variables.
For the sake of simplicity, plus the fact that none of the examples here are very large, you decide to stick with just the one 4k varchar column to hold the statement.
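If you did decide to capture statements longer than 4k, a rough sketch of stitching the chunks back together might look something like this – the function name and the CLOB return type are my own choices for illustration, not anything mandated :

create or replace function get_current_statement
    return clob
is
    l_statement clob;
begin
    -- CURRENT_SQL holds the first 4k chunk...
    l_statement := sys_context('userenv', 'current_sql');
    -- ...and CURRENT_SQL1 to CURRENT_SQL7 hold any subsequent 4k chunks
    for i in 1..7 loop
        l_statement := l_statement || sys_context('userenv', 'current_sql'||i);
    end loop;
    return l_statement;
end get_current_statement;
/

Bear in mind that, as we established earlier, these context values are only populated whilst we’re inside the FGA handler itself.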

There’s a procedure for adding new records to the table :

create or replace procedure add_associate
(
    i_emp_id in associates.emp_id%type,
    i_name in associates.emp_name%type,
    i_job_title in associates.job_title%type
)
is
begin
    insert into associates( emp_id, emp_name, job_title)
    values( i_emp_id, i_name, i_job_title);
end;
/
    

In the real world this would probably be in a package, but hey, you’re working for Spectre.

Now we need a handler for the FGA policy that we’re going to implement. In order for the context values that are captured to be accessible to the trigger, this handler is going to be part of a package which includes a couple of package variables :

create or replace package assoc_fga_handler
as

    g_statement varchar2(4000);
    g_binds varchar2(4000);
    
    -- The procedure to be invoked by the FGA policy.
    -- Note that the signature for this procedure is mandatory
    procedure set_statement
    (
        object_schema varchar2,
        object_name varchar2,
        policy_name varchar2
    );
end assoc_fga_handler;
/

create or replace package body assoc_fga_handler
as

    procedure set_statement
    (
        object_schema varchar2,
        object_name varchar2,
        policy_name varchar2
    )
    is
    begin
        g_statement := sys_context('userenv', 'current_sql');
        g_binds := sys_context('userenv', 'current_bind');
    end set_statement;
end assoc_fga_handler;
/

Now for the trigger. You may notice some compromises here …

create or replace trigger assoc_aud
    for insert or update or delete on associates
    compound trigger

    type typ_audit is table of assoc_audit%rowtype index by pls_integer;
    tbl_audit typ_audit;
    l_idx pls_integer := 0;
    
    after each row is
    begin
        l_idx := tbl_audit.count + 1;
        tbl_audit(l_idx).action := case when inserting then 'INSERT' when updating then 'UPDATE' else 'DELETE' end;
        tbl_audit(l_idx).changed_by := user;
        tbl_audit(l_idx).change_ts := systimestamp;
        tbl_audit(l_idx).emp_id := case when inserting then :new.emp_id else :old.emp_id end;
        tbl_audit(l_idx).emp_name := case when inserting then :new.emp_name else :old.emp_name end;
        tbl_audit(l_idx).job_title := case when inserting then :new.job_title else :old.job_title end;
    end after each row;
    
    after statement is
    begin
        for i in 1..tbl_audit.count loop
            tbl_audit(i).statement := assoc_fga_handler.g_statement;
            tbl_audit(i).binds := assoc_fga_handler.g_binds;
        end loop;
        forall j in 1..tbl_audit.count
            insert into assoc_audit values tbl_audit(j);
        -- cleardown the array
        tbl_audit.delete;    
    end after statement;
end assoc_aud;
/

Due to the fact that the FGA policy is not fired until after an AFTER ROW trigger for a DELETE, we are only guaranteed to capture the CURRENT_SQL value in an AFTER STATEMENT trigger.
The upshot is that we’re left with a PL/SQL array which is not constrained by a LIMIT clause. In these circumstances it’s not too much of an issue : Spectre has quite a small number of employees…er…associates, so you’re not likely to end up with an array large enough to cause memory issues.
On a potentially larger volume of records you may well consider splitting the INSERT and UPDATE portions of the trigger so that you can limit the size of the arrays generated by these operations (there’s a rough sketch of this idea below). For DELETEs however, it appears that we may well be stuck with this approach.
On a not entirely unrelated subject, Jeff Kemp has an interesting method of speeding up Journalling Triggers.
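For what it’s worth, a rough sketch of how the INSERT/UPDATE flushing might look – the 500 row threshold is entirely arbitrary and this hasn’t been anywhere near a Piranha tank, let alone production :

    after each row is
    begin
        l_idx := tbl_audit.count + 1;
        tbl_audit(l_idx).action := case when inserting then 'INSERT' when updating then 'UPDATE' else 'DELETE' end;
        tbl_audit(l_idx).changed_by := user;
        tbl_audit(l_idx).change_ts := systimestamp;
        tbl_audit(l_idx).emp_id := case when inserting then :new.emp_id else :old.emp_id end;
        tbl_audit(l_idx).emp_name := case when inserting then :new.emp_name else :old.emp_name end;
        tbl_audit(l_idx).job_title := case when inserting then :new.job_title else :old.job_title end;
        -- For INSERTs and UPDATEs the FGA handler has already fired by now,
        -- so the statement and binds are available and the array can be
        -- flushed early. DELETEs still have to wait for the after statement
        -- section.
        if (inserting or updating) and tbl_audit.count >= 500 then
            for i in 1..tbl_audit.count loop
                tbl_audit(i).statement := assoc_fga_handler.g_statement;
                tbl_audit(i).binds := assoc_fga_handler.g_binds;
            end loop;
            forall j in 1..tbl_audit.count
                insert into assoc_audit values tbl_audit(j);
            tbl_audit.delete;
        end if;
    end after each row;

The after statement section would then pick up any remaining rows (and all of the DELETEs) exactly as before.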

All that remains is for the FGA policy….

begin
    dbms_fga.add_policy
    (
        object_schema => 'MIKE',
        object_name => 'ASSOCIATES',
        policy_name => 'ASSOCIATES_DML',
        statement_types => 'INSERT, UPDATE, DELETE',
        handler_schema => 'MIKE',
        handler_module => 'ASSOC_FGA_HANDLER.SET_STATEMENT'
    );
end;
/

…and now you can test…

set serveroutput on size unlimited
--
-- Cleardown the tables before running the test
--
truncate table assoc_audit
/

truncate table associates
/

begin
    add_associate(1, 'Odd Job', 'HENCHMAN');
    add_associate(2, 'Jaws', 'HENCHMAN');
    add_associate(3, 'Mayday', 'HENCHWOMAN');
    add_associate(4, 'Ernst Stavro Blofeld', 'CRIMINAL MASTERMIND');
    add_associate(5, 'Emilio Largo', 'Deputy Evil Genius');
    
end;
/

insert into associates( emp_id, emp_name, job_title)
values(6, 'Hans', 'Bodyguard and Piranha keeper')
/

commit;

update associates
set job_title = 'VALET'
where emp_id = 1
/
commit;


delete from associates
where emp_id = 1
/

commit;

-- Spectre is an Equal Opportunities Employer...and I need a statement
-- affecting multiple rows to test so...
update associates
set job_title = 'HENCHPERSON'
where job_title in ('HENCHMAN', 'HENCHWOMAN')
/

commit;

It is with a sense of relief that, when you check the audit table after running this, you find …

select action, emp_name, 
    statement, binds
from assoc_audit
order by change_ts
/

ACTION EMP_NAME 	    STATEMENT							 BINDS
------ -------------------- ------------------------------------------------------------ ----------------------------------------
INSERT Odd Job		    INSERT INTO ASSOCIATES( EMP_ID, EMP_NAME, JOB_TITLE) VALUES(  #1(1):1 #2(7):Odd Job #3(8):HENCHMAN
			     :B3 , :B2 , :B1 )

INSERT Jaws		    INSERT INTO ASSOCIATES( EMP_ID, EMP_NAME, JOB_TITLE) VALUES(  #1(1):2 #2(4):Jaws #3(8):HENCHMAN
			     :B3 , :B2 , :B1 )

INSERT Mayday		    INSERT INTO ASSOCIATES( EMP_ID, EMP_NAME, JOB_TITLE) VALUES(  #1(1):3 #2(6):Mayday #3(10):HENCHWOMAN
			     :B3 , :B2 , :B1 )

INSERT Ernst Stavro Blofeld INSERT INTO ASSOCIATES( EMP_ID, EMP_NAME, JOB_TITLE) VALUES(  #1(1):4 #2(20):Ernst Stavro Blofeld #3(
			     :B3 , :B2 , :B1 )						 19):CRIMINAL MASTERMIND

INSERT Emilio Largo	    INSERT INTO ASSOCIATES( EMP_ID, EMP_NAME, JOB_TITLE) VALUES(  #1(1):5 #2(12):Emilio Largo #3(18):Depu
			     :B3 , :B2 , :B1 )						 ty Evil Genius

INSERT Hans		    insert into associates( emp_id, emp_name, job_title)
			    values(6, 'Hans', 'Bodyguard and Piranha keeper')

UPDATE Odd Job		    update associates
			    set job_title = 'VALET'
			    where emp_id = 1

DELETE Odd Job		    delete from associates
			    where emp_id = 1

UPDATE Jaws		    update associates
			    set job_title = 'HENCHPERSON'
			    where job_title in ('HENCHMAN', 'HENCHWOMAN')

UPDATE Mayday		    update associates
			    set job_title = 'HENCHPERSON'
			    where job_title in ('HENCHMAN', 'HENCHWOMAN')


10 rows selected.

Looks like the Piranhas will be going hungry…for now !


Filed under: Oracle, PL/SQL, SQL Tagged: audit_trail initialization parameter, compound trigger, dba_fga_audit_trail, dbms_fga.add_policy, SYS_CONTEXT, sys_context current_bind, sys_context current_sql, sys_context current_sql_length, v$option, v$parameter, v$version

Adrift in a Data Lake – an Oracle Developer’s first steps in Hadoop

Mon, 2016-03-28 10:49

We live in interesting times. As I write, the political life of a great nation is currently in thrall to a wealthy right-wing populist with a rather distinctive hairstyle.
But enough about Boris Johnson.

For someone used to the way things are done in Oracle, Hadoop can be something of a culture shock.
My initial introduction to Hadoop and its “vibrant ecosystem” was some internally written documentation.
As with most technical documentation, it was written with the assumption of a certain level of prior knowledge on the part of the reader. For this particular reader, it proved to be an erroneous assumption.

After half an hour of struggling through this text, I was left wondering what you would use a Khafka Banana Oozie for.

Maybe it’s for killing zombie processes ? Let’s face it, with all that Java running on your system there are bound to be a few knocking around.
I’m a little hazy on my Zombie Lore, so I’m not entirely clear as to why a rapid application of fresh fruit would pose an existential threat to your average zombie. Maybe they’re potassium intolerant ?

There are a bewildering array of tools associated with Hadoop, many of which appear to provide almost identical functionality.
For example, a relational database traditionally requires only one SQL Engine…and I’ll be taking a look at two of them.

Moving from Oracle to Hadoop can feel rather like trading your car in for a box of Lego.
If the box does contain any instructions they seem to have been translated from Java into English…by someone who doesn’t speak either.
Note to reader : please substitute your own language as appropriate.

Fortunately, there are Hadoop distributions available which bundle the core tools required to get up and running. We’ll gloss over the fact that these distributions – Cloudera and Hortonworks – don’t contain the exact same tools.

In my search to find a coherent explanation of how to use Hadoop, I’ve found that the VM provided by Cloudera, together with the introductory tutorial, is a valuable aid to familiarising myself with the basics.

To start with then, I’ll cover getting hold of the Cloudera VM and setting it up in VirtualBox.
Then I’ll go through some of the tools available and what they do.
I’ll do this from the perspective of an Oracle developer (mainly because I don’t have much option) and will point out the driftwood of familiarity that you might be able to cling to in order to stay afloat in your Data Lake.

What I’ll cover is :

  • The core components of Hadoop
  • HDFS commands
  • Transferring data between a relational database and hadoop using SQOOP
  • Querying structured data using Hive and Impala
  • Uploading data using Kite

Ready to Dive in ?

Cloudera VM setup for Virtualbox

Before we head off to get the VM, there are a couple of points worth mentioning.

First of all, the VM is running on a 64-bit version of Centos 6.4. Therefore you need to make sure that your host operating system (i.e. the one that you’ll be running Virtual Box on) is also 64-bit.

Secondly, the minimum recommended memory allocation for the VM is 4GB.

Finally, I’m using the Cloudera Quickstart 5.5 VM. The behaviour of some of the tools covered here differs between 5.4 and 5.5.
The Cloudera VM can be downloaded from here .

You’ll need to fill in some details about yourself before moving on.

Once you’re at the downloads page, select VirtualBox from the Platform drop-down list.
VMs for other platforms are also available.

Once you hit the “Download Now” button you’ll be asked for some further details. However, these do not appear to be validated.

The download is quite chunky, just over 5GB.

Once it’s completed you should have a zip file called :

cloudera-quickstart-vm-5.5.0-0-virtualbox.zip

You can use a standard unzip utility to extract this file (e.g. Winzip on Windows, plain old unzip on Linux).
Once extracted, you’ll see a directory called cloudera-quickstart-vm-5.5.0-0-virtualbox which contains two files :

cloudera-quickstart-vm-5.5.0-0-virtualbox-disk1.vmdk
cloudera-quickstart-vm-5.5.0-0-virtualbox.ovf

The .ovf file is the VirtualBox image and the .vmdk is the VM’s hard-drive image.

To set up the VM in Virtualbox, you’ll need to import the .ovf file as an Appliance.
The steps to do this are the same as those for importing an Oracle Developer Day 12c Image.

When you first start the VM it does take a while to come up. When it does, you should see

[Image : cloudera_vm_start – the VM desktop after start-up]

Tweaking the VM settings

If you don’t happen to live in Boston and/or you don’t have a US keyboard, you may want to make a couple of configuration changes to the VM.

To change the Location ( and consequently the Time Zone) :

  1. Click on the Date and Time on the Status Bar. It’s in the top right-hand corner of the screen.
  2. Hit the Edit button next to Locations
  3. Add your location by typing it in the Location Name dialog box (a drop-down list will appear as you start typing).
  4. Now highlight Boston and click the Remove button. Your new location should now show at the bottom of the Time window
  5. Move your mouse over your location and an Edit button should appear. Click this and the Date and Time should now reflect your new location

[Image : set_location – the Time &amp; Date settings dialog]

You will be prompted for a password to confirm this change. You need to enter cloudera

To change the keyboard layout :

  1. Go to the System Menu and select Preferences and then Keyboard
  2. Navigate to the Layouts tab and click Add
  3. Select your preferred layout from the drop-down list
  4. Once your new layout is shown, click the Default radio button next to it

[Image : set_keyboard – the Keyboard Preferences dialog]

These changes should take effect without the need to re-start the VM.

OK, so let’s get started….

The Core components of Hadoop

Hadoop itself comprises three main components :

  • HDFS – a distributed file system
  • The MapReduce framework
  • Yarn – a Job scheduler / resource manager for the parallel execution of MapReduce jobs

MapReduce

The MapReduce framework consists of a Mapper and a Reducer.
In SQL terms, the Mapper program applies a query predicate against the data you are querying – essentially, it does the filtering.
The Reducer then does any aggregation on the result set produced from the Mapper process.
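In other words, thinking about it in terms of a simple aggregate query (borrowing the ASSOCIATES table from the previous post purely for illustration) :

-- the WHERE clause does the filtering (the Mapper's job)...
-- ...and the COUNT/GROUP BY does the aggregation (the Reducer's job)
select job_title, count(*)
from associates
where job_title like 'HENCH%'
group by job_title;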

Yarn

Yarn is the default job scheduler and resource manager. It facilitates parallel execution of MapReduce jobs.

HDFS

HDFS – Hadoop File System – is a distributed file system. The idea is that datafiles are replicated across multiple nodes (physical servers) in a cluster. Essentially, any program can run on any node. By replicating the data to each node, network latency is minimised for these programs.
It comes with its own set of commands which you can use interactively.
These appear to be largely a subset of those you’d find on a Linux OS.

The format of these commands is :

hadoop fs -command [some arguments]

However, the way the VM is configured means that we need to run these commands as the hdfs user.
Therefore, in the examples that follow, the commands will take the format…

sudo -u hdfs hadoop fs -command [some arguments]

To start with, let’s see what we’d need to do to upload a csv file to HDFS.

The file in question is called tools.csv, which I’ve created in the cloudera user’s home directory on the vm.
It contains the following :

[Image : tools.csv – the file contents, shown in full in the cat output further down]

To start with, let’s see what’s currently in hdfs :

sudo -u hdfs hadoop fs -ls /

Found 5 items
drwxrwxrwx   - hdfs  supergroup          0 2015-11-18 10:57 /benchmarks
drwxr-xr-x   - hbase supergroup          0 2016-03-23 12:08 /hbase
drwxrwxrwt   - hdfs  supergroup          0 2016-03-22 12:23 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2015-11-18 11:01 /user
drwxr-xr-x   - hdfs  supergroup          0 2015-11-18 11:00 /var

This shows us the top-level directories. Let’s take a look at what’s in /user :

sudo -u hdfs hadoop fs -ls /user

Found 8 items
drwxr-xr-x   - cloudera cloudera            0 2015-11-18 10:56 /user/cloudera
drwxr-xr-x   - mapred   hadoop              0 2015-11-18 10:57 /user/history
drwxrwxrwx   - hive     supergroup          0 2015-11-18 11:01 /user/hive
drwxrwxrwx   - hue      supergroup          0 2015-11-18 10:58 /user/hue
drwxrwxrwx   - jenkins  supergroup          0 2015-11-18 10:58 /user/jenkins
drwxrwxrwx   - oozie    supergroup          0 2015-11-18 10:59 /user/oozie
drwxrwxrwx   - root     supergroup          0 2015-11-18 10:58 /user/root
drwxr-xr-x   - hdfs     supergroup          0 2015-11-18 11:01 /user/spark

For the purposes of this test, I’ll create a directory under the /user/cloudera directory, and then check that it’s been created as expected :

sudo -u hdfs hadoop fs -mkdir /user/cloudera/test
sudo -u hdfs hadoop fs -ls /user/cloudera
Found 1 items
drwxr-xr-x   - hdfs cloudera          0 2016-03-23 14:25 /user/cloudera/test

Notice that only the directory owner has write permissions on the directory.
As I’m feeling reckless, I want to grant write permissions to everyone.
This can be done as follows :

sudo -u hdfs hadoop fs -chmod a+w /user/cloudera/test
sudo -u hdfs hadoop fs -ls /user/cloudera
Found 1 items
drwxrwxrwx   - hdfs cloudera          0 2016-03-23 14:25 /user/cloudera/test

In HDFS, the chmod command seems to accept the same arguments as its Linux counterpart.
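For example, octal notation also appears to work, should you prefer it :

sudo -u hdfs hadoop fs -chmod 777 /user/cloudera/test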

To check that we can now see the directory :

sudo -u hdfs hadoop fs -ls /user/cloudera/test

The simplest way to load our csv is to use the put command :

sudo -u hdfs hadoop fs -put tools.csv /user/cloudera/test/put_tools.csv
sudo -u hdfs hadoop fs -ls /user/cloudera/test
Found 1 items
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 14:49 /user/cloudera/test/put_tools.csv

There is another way to do this :

sudo -u hdfs hadoop fs -copyFromLocal tools.csv /user/cloudera/test/tools.csv
sudo -u hdfs hadoop fs -ls /user/cloudera/test
Found 2 items
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 14:49 /user/cloudera/test/put_tools.csv
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 17:04 /user/cloudera/test/tools.csv

If we want to delete a file then :

sudo -u hdfs hadoop fs -rm /user/cloudera/test/put_tools.csv
16/03/23 17:06:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/cloudera/test/put_tools.csv
sudo -u hdfs hadoop fs -ls /user/cloudera/test
Found 1 items
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 17:04 /user/cloudera/test/tools.csv

You can display the contents of a file in hdfs by using the cat command :

sudo -u hdfs hadoop fs -cat /user/cloudera/test/tools.csv
tool,description
hue,Web-based UI for Hadoop
sqoop,Transfer structured data between an RDBMS and Hadoop
flume,stream a file into Hadoop
impala,a query engine
hive,another query engine
spark,a query engine that is not hive or impala
khafka,a scheduler
banana,a web UI framework
oozie,another scheduler

In order to demonstrate copying a file from hdfs to the local filesystem in the VM, we’ll need to create a directory that the hdfs user has access to :

mkdir test
chmod a+rw test
cd test
ls -ld
drwxrwxrwx 2 cloudera cloudera 4096 Mar 23 17:13 .

Now, as the hdfs user, we can retrieve our file from hdfs onto the local file system using copyToLocal :

sudo -u hdfs hadoop fs -copyToLocal /user/cloudera/test/tools.csv /home/cloudera/test/welcome_back.csv
ls -l welcome_back.csv
-rw-r--r-- 1 hdfs hdfs 301 Mar 23 17:18 welcome_back.csv

Another method of doing this is using get :

sudo -u hdfs hadoop fs -get /user/cloudera/test/tools.csv /home/cloudera/test/got_it.csv
ls -l got_it.csv
-rw-r--r-- 1 hdfs hdfs 301 Mar 23 17:21 got_it.csv

One final hdfs command that may come in useful is du, which shows the amount of space used by a directory or file on hdfs :

sudo -u hdfs hadoop fs -du /user/cloudera
301  301  /user/cloudera/test

SQOOP

The VM comes with a MySQL database from which data is loaded into Hadoop via SQOOP.
SQOOP is a tool for transferring structured data between an RDBMS and Hadoop.

The documentation does say that SQOOP is capable of importing data from an Oracle database using the command (run from the Local File System) :

sqoop import --connect jdbc:oracle:thin:@//db_name --table table_name

However, said documentation says that it’s been tested with Oracle 10.2 Express Edition, so you may want to have a play around with it before using it in anger.
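For what it’s worth, a fuller invocation built from the generic SQOOP import options might look something like this – the host, service name, credentials and target directory are all placeholders rather than anything taken from the documentation :

sqoop import \
    --connect jdbc:oracle:thin:@//myhost:1521/myservice \
    --username hr \
    --password my_password \
    --table EMPLOYEES \
    --target-dir /user/cloudera/employees \
    -m 1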

The tutorial directs us to use SQOOP to ingest all of the data from the MySQL database by running the following command :

sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import

There’s a fair amount going on here, we’re connecting to MySQL, then outputting the data as a compressed file onto hdfs in the /user/hive/warehouse directory.
The compression library being used is Snappy.

It’s instructive to see the output when we run this command as it shows both MapReduce and Yarn in action. You’ll probably see lines like :

...
16/03/23 17:36:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1458734644938_0001
16/03/23 17:36:08 INFO impl.YarnClientImpl: Submitted application application_1458734644938_0001
16/03/23 17:36:08 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1458734644938_0001/
16/03/23 17:36:08 INFO mapreduce.Job: Running job: job_1458734644938_0001
16/03/23 17:37:28 INFO mapreduce.Job: Job job_1458734644938_0001 running in uber mode : false
16/03/23 17:37:28 INFO mapreduce.Job:  map 0% reduce 0%
16/03/23 17:38:29 INFO mapreduce.Job:  map 100% reduce 0%
16/03/23 17:38:34 INFO mapreduce.Job: Job job_1458734644938_0001 completed successfully
...

After a fair amount of time, the command should end with :

...
16/03/23 17:53:53 INFO mapreduce.ImportJobBase: Transferred 46.1318 KB in 157.9222 seconds (299.1283 bytes/sec)
16/03/23 17:53:53 INFO mapreduce.ImportJobBase: Retrieved 1345 records.

If we now check, we can see that a directory has been created for each table :

hadoop fs -ls /user/hive/warehouse
Found 6 items
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:38 /user/hive/warehouse/categories
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:41 /user/hive/warehouse/customers
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:44 /user/hive/warehouse/departments
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:48 /user/hive/warehouse/order_items
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:51 /user/hive/warehouse/orders
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:53 /user/hive/warehouse/products

Digging further into the categories directory and its children, we find that the table structure has also been transferred across :

hadoop fs -cat /user/hive/warehouse/categories/.metadata/schemas/1.avsc
{
  "type" : "record",
  "name" : "categories",
  "doc" : "Sqoop import of categories",
  "fields" : [ {
    "name" : "category_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "category_id",
    "sqlType" : "4"
  }, {
    "name" : "category_department_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "category_department_id",
    "sqlType" : "4"
  }, {
    "name" : "category_name",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "category_name",
    "sqlType" : "12"
  } ],
  "tableName" : "categories"
}

The data is stored in a .parquet file :

hadoop fs -ls /user/hive/warehouse/categories
Found 3 items
drwxr-xr-x   - cloudera supergroup          0 2016-03-23 17:35 /user/hive/warehouse/categories/.metadata
drwxr-xr-x   - cloudera supergroup          0 2016-03-23 17:38 /user/hive/warehouse/categories/.signals
-rw-r--r--   1 cloudera supergroup       1956 2016-03-23 17:38 /user/hive/warehouse/categories/a3db2e78-2861-4906-a769-e8035c03d7d2.parquet

There are a number of file formats you can use with Hadoop, each with their own pros and cons.
You can see a discussion of some of these formats here.

NOTE – the first time I ran this, I shut down the VM after I got to here. When I started it again, Impala (see below) refused to see any databases. I had to trash and re-create the VM to get it to work. I’m not sure why this happened (I did have a hunt around) but just thought I’d mention it in case you’re thinking of doing the same.

Anyway, now we have data in Hadoop, it would be good if we could interrogate it…

Not Quite SQL – Hive and Impala

Cloudera comes with a Web based UI for Hadoop in the form of Hue.
Note that Hortonworks seem to be standardising on a different UI tool – Apache Ambari.

In the Cloudera VM, there is a link to Hue on the Bookmarks bar in Firefox.
Click on this link and then connect as cloudera (password cloudera).

The favoured query engine for interactive queries in Cloudera is called Impala.
I believe that, at the time of writing, Hortonworks are sticking with Hive.

As far as I can tell, it seems that Hive has been retained by Cloudera to handle what, in RDBMS terms, would be called the catalog.
In Oracle terms this would be the Data Dictionary.

This is essentially the metadata for the tables in the database.
This metadata seems to be read by all of the SQL Engines irrespective of where it is updated from.

Once you’re connected via Hue you can select either of these tools from the Query Editors drop-down.

Initially, the tutorial directs you to Impala.

The Data Dictionary

The first thing to note about Impala is that it doesn’t bother reading the table metadata unless you tell it to. Therefore, if you make any DML or DDL changes, you probably need to tell Impala to re-read the metadata in order for subsequent queries to pick up these changes.
For example, we know that we’ve got six tables in our database which we’ve created via SQOOP. However, if you ask Impala about it :

show tables;

…you get the not entirely helpful :

The operation has no results

By contrast, if you try this in Hive (Query Editors/Hive), the tables are all present and correct.

To persuade Impala to see these changes you need to run :

invalidate metadata;

Note that you can also run this command for individual tables should the need arise, e.g. :

invalidate metadata categories;

Anyway, now we can see the tables in Impala, we can run some queries against them.
Whilst we’re at it, we can do a simple comparison between Impala and Hive in terms of how they process the same query.

Comparative performance

The query in question (taken from the Getting Started Tutorial provided with the VM) is :

-- Most popular product categories
select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by count desc
limit 10;

The syntax looks reassuringly familiar.

In Impala this ran in around 35 seconds.
I then ran the same query in Hive, which took about 7 minutes.

Once the query is run, Hue shows an Explain button. Click on this and you can see the execution plan for the query.

In Hive the plan looks like this :

STAGE DEPENDENCIES:
  Stage-9 is a root stage
  Stage-3 depends on stages: Stage-9
  Stage-4 depends on stages: Stage-3
  Stage-0 depends on stages: Stage-4

STAGE PLANS:
  Stage: Stage-9
    Map Reduce Local Work
      Alias -> Map Local Tables:
        c
          Fetch Operator
            limit: -1
        p
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        c
          TableScan
            alias: c
            Statistics: Num rows: 24 Data size: 2550 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: category_id is not null (type: boolean)
              Statistics: Num rows: 12 Data size: 1275 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col10 (type: int)
                  1 category_id (type: int)
        p
          TableScan
            alias: p
            Statistics: Num rows: 5737 Data size: 45896 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (product_id is not null and product_category_id is not null) (type: boolean)
              Statistics: Num rows: 1435 Data size: 11479 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 order_item_product_id (type: int)
                  1 product_id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: oi
            Statistics: Num rows: 206028 Data size: 1648231 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: order_item_product_id is not null (type: boolean)
              Statistics: Num rows: 103014 Data size: 824115 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 order_item_product_id (type: int)
                  1 product_id (type: int)
                outputColumnNames: _col3, _col10
                Statistics: Num rows: 113315 Data size: 906526 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col10 (type: int)
                    1 category_id (type: int)
                  outputColumnNames: _col3, _col20
                  Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col20 (type: string), _col3 (type: int)
                    outputColumnNames: _col20, _col3
                    Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(_col3)
                      keys: _col20 (type: string)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
      Local Work:
        Map Reduce Local Work
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 62323 Data size: 498589 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-4
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col1 (type: bigint)
              sort order: -
              Statistics: Num rows: 62323 Data size: 498589 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col0 (type: string)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: string), KEY.reducesinkkey0 (type: bigint)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 62323 Data size: 498589 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 10
            Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink

The Impala plan looks a bit different :

Estimated Per-Host Requirements: Memory=4.16GB VCores=1
WARNING: The following tables are missing relevant table and/or column statistics.
default.categories, default.order_items, default.products

11:MERGING-EXCHANGE [UNPARTITIONED]
|  order by: count(order_item_quantity) DESC
|  limit: 10
|
06:TOP-N [LIMIT=10]
|  order by: count(order_item_quantity) DESC
|
10:AGGREGATE [FINALIZE]
|  output: count:merge(order_item_quantity)
|  group by: c.category_name
|
09:EXCHANGE [HASH(c.category_name)]
|
05:AGGREGATE
|  output: count(order_item_quantity)
|  group by: c.category_name
|
04:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: p.product_category_id = c.category_id
|
|--08:EXCHANGE [BROADCAST]
|  |
|  02:SCAN HDFS [default.categories c]
|     partitions=1/1 files=1 size=1.91KB
|
03:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: oi.order_item_product_id = p.product_id
|
|--07:EXCHANGE [BROADCAST]
|  |
|  01:SCAN HDFS [default.products p]
|     partitions=1/1 files=1 size=43.80KB
|
00:SCAN HDFS [default.order_items oi]
   partitions=1/1 files=1 size=1.57MB

As well as being somewhat shorter, the Impala plan will look rather more familiar to anyone used to reading Oracle query plans.

To check for the possible effect of caching, I then ran these queries again.
For Impala, the runtime dropped to around 13 seconds.
For Hive, the runtime was the same as for the first run.

The Impala plan was unchanged, despite the fact that the runtime was drastically reduced.
From this behaviour I would infer that there is some caching effect for Impala, although any such cache – something akin to the Buffer Cache in Oracle – is evidently not shared between the two query engines.

The explanation offered in the tutorial is that Hive compiles SQL queries into MapReduce jobs whereas Impala was designed as a SQL engine.
It would appear that Impala is better suited to small-scale interactive queries whereas Hive is more suited to large-scale ETL.

There’s a benchmarking comparison between Hive, Impala, and Spark that may be of interest here.

Optimizer Statistics

You’ll notice that, at the start of the Impala query plan, there’s a warning about missing statistics.

If you roll the mouse over one of the tables on the left-hand side of the Hue Query Editor Window, you’ll see an icon appear which enables you to explore the table’s metadata.

Looking at one of these tables we can see that there are no stats present :

[Screenshot: no_stats]

We can fix this easily enough in Impala by gathering stats for each of our tables. For example :

compute stats categories;

If we now check the metadata for the CATEGORIES table, we can see that stats are present :

[Screenshot: with_stats]

If we now re-run the original query, the plan will no longer show the warning. However, the rest of the plan remains unchanged.
Given that these are quite small tables, this is probably not surprising.

There’s more information about stats gathering in Impala here.

External Tables

One thing that Hive is apparently good for is creating External Tables.

In the tutorial, some unstructured data (a log file) is loaded and external tables are then created in Hive using the following code :


CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = '%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s')
LOCATION '/user/hive/warehouse/original_access_logs';

CREATE EXTERNAL TABLE tokenized_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR /usr/lib/hive/lib/hive-contrib.jar;

INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;

Once again, it looks familiar for the most part.
One point of interest is the INSERT OVERWRITE command used to insert data into the table.
This has the effect of removing any pre-existing data in the table before inserting the new data.

String functions and other SQL Stuff

We can query the new external table in Impala :

invalidate metadata tokenized_access_logs;
select * from tokenized_access_logs;

The values in the url column contain %20 characters instead of spaces.
Let’s see what Impala can do in terms of the standard SQL string manipulation functions.

Unlike Oracle, there’s no REPLACE function; there is, however, a REGEXP_REPLACE…

select regexp_replace(url, '%20', ' ')
from tokenized_access_logs;

When we run this, we can see that the REGEXP_REPLACE has done the job :

[Screenshot: regexp_replace output]

The logs we’re really interested in are where a specific product has been viewed.
If we can get the name of the product from the url, then maybe that will help when relating this data back to the most popular items in terms of sales…

select substr(regexp_replace( url, '%20', ' '), instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9)
from tokenized_access_logs
where url like '%\/product\/%';

When we run this, we can see that the INSTR function also works as expected…to a point. Note that the ‘/’ characters need to be escaped with a ‘\’.
The result looks like this :

[Screenshot: products_url]

Now, there are a couple of records with the product name and “/add_to_cart” appended. With INSTR in Oracle, you can search for a pattern starting at the end of a string – e.g. :

select instr('/product/Pelican Sunstream 100 Kayak/add_to_cart', '/',-1,1)
from dual;

INSTR('/PRODUCT/PELICANSUNSTREAM100KAYAK/ADD_TO_CART','/',-1,1)
---------------------------------------------------------------
							     37

There is no such option in the Impala equivalent. It simply searches the string from the start and reports the first occurrence.
Fortunately, we want to strip out these results for the stuff we’re going to do in a minute. Therefore, I’ve just amended the query to be :

select substr(regexp_replace( url, '%20', ' '), instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9)
from tokenized_access_logs
where url like '%\/product\/%'
and url not like '%\/add_to_cart%';

We can use fairly standard SQL to get a listing of the products by number of views :

select count(*), substr(regexp_replace( url, '%20', ' '), instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9)
from tokenized_access_logs
where url like '%\/product\/%'
and url not like '%\/add_to_cart%'
group by url
order by 1 desc;

In the tutorial, it mentions that there is one product that has a lot of views but no sales. I wonder if we can find this just using SQL ?

with viewed_products as
(
select count(*) as times_viewed,
substr(regexp_replace( url, '%20', ' '),
       instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9) as product_name
from tokenized_access_logs
where url like '%\/product\/%'
and url not like '%\/add_to_cart%'
group by url
)
select v.times_viewed, v.product_name
from viewed_products v
where upper(v.product_name) not in
(
  select upper(p.product_name)
  from products p
  inner join
  (
    select oi.order_item_product_id as product_id,
        count(oi.order_item_id) as units_sold
    from order_items oi
    inner join orders o
        on oi.order_item_order_id = o.order_id
    where o.order_status not in ('CANCELED', 'SUSPECTED_FRAUD')
    group by order_item_product_id
  ) s
    on p.product_id = s.product_id
);

OK, it’s not the most elegant SQL I’ve ever written but it does demonstrate that :

  • You can define in-line views using the WITH clause
  • You can use a NOT IN subquery

…better than that, it gives the correct answer :

adidas Kids' RG III Mid Football Cleat

In the tutorial, the reason for the lack of sales is a coding error in the product page. You can ask your own snarky question about Unit Testing practices here.

Going back to the csv files we were playing around with on hdfs earlier, how would we get those into our database ?

Go fly a Kite

Yes, it’s another tool. This one is called Kite. Its purpose, in this case at least, is to help us create a table based on the tools.csv file and then allow us to make use of it in Impala.

First of all, I’ve created the tools.csv file on the local file system :

ls -l tools.csv
-rw-rw-r-- 1 cloudera cloudera 303 Mar 26 17:21 tools.csv

Now to get kite to create a “table” definition based on the csv :

kite-dataset csv-schema tools.csv --class Tools -o tools.avsc
ls -l tools.*
-rw-rw-r-- 1 cloudera cloudera 373 Mar 26 17:22 tools.avsc
-rw-rw-r-- 1 cloudera cloudera 303 Mar 26 17:21 tools.csv

If we have a look at the new avro file that kite has generated, it looks similar in terms of format to the one that SQOOP generated earlier :

{
  "type" : "record",
  "name" : "Tools",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "tool",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from 'hue'",
    "default" : null
  }, {
    "name" : "description",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from 'Web-based UI for Hadoop'",
    "default" : null
  } ]
}

The next step is to create the table metadata ( in kite terminology this is a dataset)…

kite-dataset create tools -s tools.avsc

…and finally add the data itself to our table…

kite-dataset csv-import tools.csv tools
The url to track the job: http://localhost:8080/
Added 9 records to "tools"

To check in Impala, we can head over to Hue, select Impala from the Query Editors drop-down and run :

invalidate metadata tools;
select * from tools;

The result should look like this :

[Screenshot: tools_query]

Back on the command line, we can see that kite has added files to hdfs :

hadoop fs -ls /user/hive/warehouse/tools
Found 3 items
drwxr-xr-x   - cloudera supergroup          0 2016-03-26 17:32 /user/hive/warehouse/tools/.metadata
drwxr-xr-x   - cloudera supergroup          0 2016-03-26 17:35 /user/hive/warehouse/tools/.signals
-rw-r--r--   1 cloudera supergroup        622 2016-03-26 17:35 /user/hive/warehouse/tools/8baf8440-41b0-4f29-8750-16aeb3aec9b0.avro

The .avro file appears to have been compressed using the Snappy compression tool also used by SQOOP earlier.
You can however read the file by running :

hadoop fs -text /user/hive/warehouse/tools/8baf8440-41b0-4f29-8750-16aeb3aec9b0.avro
{"tool":{"string":"banana"},"description":{"string":"a web UI framework"}}
{"tool":{"string":"flume"},"description":{"string":"stream a file into Hadoop"}}
{"tool":{"string":"hive"},"description":{"string":"another query engine"}}
{"tool":{"string":"hue"},"description":{"string":"Web-based UI for Hadoop"}}
{"tool":{"string":"impala"},"description":{"string":"a query engine"}}
{"tool":{"string":"khafka"},"description":{"string":"a scheduler"}}
{"tool":{"string":"oozie"},"description":{"string":"another scheduler"}}
{"tool":{"string":"spark"},"description":{"string":"a query engine that is not hive or impala ?"}}
{"tool":{"string":"sqoop"},"description":{"string":"Transfer structured data between an RDBMS and Hadoop"}}

Summary

The Getting Started Tutorial goes on to cover various other tools available in the Cloudera distribution for doing data manipulation and analysis.
Additionally, you may find this presentation on Hadoop internals for Oracle Devs by Tanel Poder useful.

For now though, I have enough to keep my head above water.


Filed under: Oracle, SQL Tagged: Hadoop, hdfs, Hive, Impala, Kite, SQOOP

Live to Win – Motorhead Covers and Pythonic Irrigation

Tue, 2016-03-08 15:31

The recent passing of Lemmy has caused me to reflect on the career of one of the bands who made my growing up (and grown-up) years that much…well…louder.

Yes, I know that serious Python documentation should employ a sprinkling of Monty Python references but, let’s face it, what follows is more of a quick trawl through some basic Python constructs that I’ve found quite useful recently.
If I put them all here, at least I’ll know where to look when I need them again.

In any case, Michael Palin made a guest appearance on the album Rock ‘n’ Roll so that’s probably enough of a link to satisfy the Monty Python criteria.

I find Python a really good language to code in…especially when the alternative is writing a Windows Batch Script. However, there is a “but”.
Python 3 is not backward compatible with Python 2. This can make life rather interesting on occasion.

It is possible to write code that is compatible with both versions of the language and there’s a useful article here on that topic.

The code I’ve written here has been tested on both Python 2 (2.7.6) and Python 3 (3.4.3).

One of the great things about Python is that there are a number of modules supplied as standard, which greatly simplify some common programming tasks.
What I’m going to run through here is :

  • Getting information about the environment
  • Handling runtime arguments with the argparse module
  • Reading config files with configparser
  • Writing information to log files with the logging module

Existential Questions

There are a number of questions that you’ll want to answer programmatically, sooner rather than later…

Who am I

There are a couple of ways to find out which user you’re connected as from inside Python.
You could simply use the os.getlogin() function…

import os
print( os.getlogin())

…but according to the official documentation [link] this is probably a better option…

import os
import pwd
print( pwd.getpwuid(os.getuid())[0])

Additionally, we may want to know the name of the Python program we’re currently in. The following script – called road_crew.py – should do the job :

import os
print(os.path.basename(__file__))

Running this we get :

road_crew.py

Where am I

Step forward the platform module, as seen here in this code (saved as hammersmith.py) :

import platform

def main() :
    # Get the name of the host machine
    machine_name = platform.node()
    # Get the OS and architecture
    os_type = platform.system()
    if platform.machine() == 'x86_64' :
        os_arch = '64-bit'
    else :
        os_arch = '32-bit'

    print('Running on '+machine_name+' which is running '+ os_type + ' ' + os_arch)

    # Now get more detailed OS information using the appropriate function...
    if os_type == 'Linux' :
        print(platform.linux_distribution())
    elif os_type == 'Windows' :
        print(platform.win32_ver())
    elif os_type == 'Darwin' :
        # platform.system() reports 'Darwin' on OS X rather than 'Mac'
        #NOTE - I don't have a Mac handy so have no way of testing this statement
        print(platform.mac_ver())
    else :
        print("Sky high and 6000 miles away!")

if __name__ == '__main__' :
    main()

Running this on my Linux Mint machine produces :

Running on mike-TravelMate-B116-M which is running Linux 64-bit
('LinuxMint', '17.3', 'rosa')

As mentioned previously, you also may be quite keen to know the version of Python that your program is running on….

import sys

major = sys.version_info[0]
minor = sys.version_info[1]
micro = sys.version_info[2]

if major == 3 :
    print('Ace of Spades !')
else :
    print('Bomber !')

print('You are running Python ' + str(major) + '.' + str(minor) + '.' + str(micro))

On Python 3, this outputs…

Ace of Spades !
You are running Python 3.4.3

…whilst on Python 2…

Bomber !
You are running Python 2.7.6

When Am I

As for the current date and time, allow me to introduce another_perfect_day.py…

import time

today = time.strftime("%a %d %B %Y")
now = time.strftime("%H:%M:%S")

print("Today's date is " + today);
print("The time is now " + now);

…which gives us …

Today's date is Sun 06 March 2016
The time is now 19:15:19

Argument parsing

The argparse module makes handling arguments passed to the program fairly straightforward.
It allows you to provide a short or long switch for the argument, specify a default value, and even write some help text.
The program is called no_remorse.py and looks like this :

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-a", "--age", default = 40, help = "How old are you ? (defaults to 40 - nothing personal)")
args = vars(parser.parse_args())
age = args['age']
if int(age) > 39 :
    print('I remember when Motorhead had a number 1 album !')
else :
    print('Who are Motorhead ?')

The argparse module gives us a couple of things. First of all, if we want to know more about the required parameters, we can simply invoke the help :

python no_remorse.py -h
usage: no_remorse.py [-h] [-a AGE]

optional arguments:
  -h, --help         show this help message and exit
  -a AGE, --age AGE  How old are you ? (defaults to 40 - nothing personal)

If we run it without specifying a value for age, it will pick up the default….

python no_remorse.py
I remember when Motorhead had a number 1 album !

…and if I’m tempted to lie about my age (explicitly, as opposed to by omission in the previous example)…

python no_remorse.py -a 39
Who are Motorhead ?

As well as using the single-letter switch for the parameter, we can use the long version …

python no_remorse.py --age 48
I remember when Motorhead had a number 1 album !

One other point to note: the program will not accept arguments passed by position – either the long or short switch for the argument must be specified (though see the sketch below for how a positional argument could be declared explicitly). Either that or Python comes with its own outrageous lie detector…

python no_remorse.py 25
usage: no_remorse.py [-h] [-a AGE]
no_remorse.py: error: unrecognized arguments: 25
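
Incidentally, if you did want to accept the age by position, argparse will let you do that too – you just have to declare a positional argument. Here’s a minimal sketch of my own (a variation on no_remorse.py rather than part of the original), which also uses argparse’s type parameter to do the integer conversion :

import argparse

parser = argparse.ArgumentParser()
# A positional argument has no "-" prefix. nargs="?" makes it optional,
# and type=int gets argparse to convert the value for us.
parser.add_argument("age", type=int, nargs="?", default=40,
    help="How old are you ? (defaults to 40 - nothing personal)")
args = parser.parse_args()

if args.age > 39 :
    print('I remember when Motorhead had a number 1 album !')
else :
    print('Who are Motorhead ?')

Saved as, say, positional_age.py (my name, not the tutorial’s), this happily accepts python positional_age.py 48 as well as running with no argument at all.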

Reading a config file

There are times when you need a program to run on multiple environments, each with slightly different details ( machine name, directory paths etc).
Rather than having to pass these details in each time you run the program, you can dump them all into a file for your program to read at runtime.
Usually, you’ll pass in an argument to point the program at the appropriate section of your config file. A config file will look something like this :

[DEV]
db_name = dev01

[TEST]
db_name = test01

[PROD]
db_name = prod

In this example, your program will probably accept an argument specifying which environment it needs to run against and then read the appropriate section of the config file to set variables to the appropriate values.
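
Just to illustrate that pattern before moving on, here’s a minimal sketch of my own (it’s not from the original post) which combines argparse with configparser to pick up the db_name for whichever environment is specified at runtime. I’m assuming Python 3’s configparser here – see the conditional import trick below if you need it to run on Python 2 as well – and that the [DEV]/[TEST]/[PROD] file above has been saved as environments.cfg (again, my name for it) :

import argparse
# Assuming Python 3 - the conditional import covered below would make this
# work on Python 2 as well
import configparser

parser = argparse.ArgumentParser()
parser.add_argument("-e", "--env", default = "DEV",
    help = "Environment to run against - DEV, TEST or PROD (defaults to DEV)")
args = vars(parser.parse_args())
env = args['env'].upper()

config = configparser.ConfigParser()
# environments.cfg is my name for the [DEV]/[TEST]/[PROD] file shown above
config.read('environments.cfg')

db_name = config.get(env, 'db_name')
print('Connecting to database ' + db_name)

Run this with -e TEST and it should print Connecting to database test01, and so on for the other sections.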

My working example is slightly different and is based on cover versions that Motorhead have done of other artists’ tracks, together with a couple of my favourite covers of Motorhead songs by other bands :

[MOTORHEAD]
Tammy Wynette = Stand By Your Man
The Kingsmen = Louie Louie

[METALLICA]
Motorhead = Overkill

[CORDUROY]
Motorhead = Motorhead

Now, you could spend a fair amount of time trying to figure out how to read this file and get the appropriate values…or you could just use the configparser module…

Conditional Import – making sure you find Configparser

The configparser module was renamed in Python 3, so the import statement for it is different depending on which version of Python you’re using.
Fortunately, Python offers the ability to conditionally import modules as well as allowing you to alias them.
Therefore, this should solve your problem…

try:
    import configparser
except ImportError :
    import ConfigParser as configparser

So, if we’re running Python 3 the first import statement succeeds.
If we’re running Python2 we’ll get an ImportError, in which case we import the version 2 ConfigParser and alias it as configparser.
The alias means that we can refer to the module in the same way throughout the rest of the program without having to check which version we’ve actually imported.
As a result, our code should now run on either Python version :

try:
    import configparser
except ImportError :
    import ConfigParser as configparser

config = configparser.ConfigParser()
config.read('covers.cfg')

#Get a single value from the [CORDUROY] section of the config file
cover_artist = 'CORDUROY'
#Find the track they covered, originally recorded by Motorhead
# Pass the config section and the original artist ( the entry on the left-hand side of the "="
# in the config file)
track = config.get(cover_artist, 'Motorhead')
# cover_artist and track are string objects so we can use the title method to initcap the output
print(cover_artist.title() + ' covered ' + track.title() + ' by Motorhead')

# Loop through all of the entries in the [MOTORHEAD] section of the config file
for original_artist in config.options('MOTORHEAD') :
    print('Motorhead covered ' + config.get('MOTORHEAD', original_artist) + ' by ' + original_artist.upper())

Run this and we get…

Corduroy covered Motorhead by Motorhead
Motorhead covered Stand By Your Man by TAMMY WYNETTE
Motorhead covered Louie Louie by THE KINGSMEN

Dead Men Tell No Tales

…but fortunately the Python logging module will let your programs sing like a canary.

As with the configparser, there’s no need to write lots of code to open and write to a file.
There are five levels of logging message supported :

  • DEBUG
  • INFO
  • WARNING – the default
  • ERROR
  • CRITICAL

There is a separate call to write each message type. The message itself can be formatted to include information such as a timestamp and the program from which the message was written. There’s a detailed how-to on logging here.

For now though, we want a simple program (logger.py) to write messages to a file wittily and originally titled logger.log…

import logging

logging.basicConfig(
    filename='logger.log',
    level=logging.INFO,
    format='%(asctime)s:%(filename)s:%(levelname)s:%(message)s'
)

logging.debug('No Remorse')
logging.info('Overnight Sensation')
logging.warn('March or Die')
logging.error('Bad Magic')

There’s no output to the screen when we run this program but if we check, there should now be a file called logger.log in the same directory which contains :

2016-03-06 19:19:59,375:logger.py:INFO:Overnight Sensation
2016-03-06 19:19:59,375:logger.py:WARNING:March or Die
2016-03-06 19:19:59,375:logger.py:ERROR:Bad Magic

As you can see, the type of message in the log depends on the logging member invoked to write the message. Note also that the DEBUG message doesn’t appear at all – because we set level to logging.INFO in basicConfig, anything less severe than INFO is filtered out. Set the level to logging.DEBUG if you want to capture it.

If you want a more comprehensive/authoritative/coherent explanation of the features I’ve covered here, then have a look at the official Python documentation.
On the other hand, if you want to check out a rather unusual version of one of Motorhead’s signature tracks, this is definitely worth a look.


Filed under: python Tagged: argparse, configparser, logging, os.getlogin, os.path.basename, platform.node, platform.system, pwd.getpwuid, sys.version_info, time.strftime

Resolving Hardware Issues with a Kernel Upgrade in Linux Mint

Sun, 2016-02-07 11:40

One evening recently, whilst climbing the wooden hills with netbook in hand, I encountered a cat who had decided that halfway up the stairs was a perfect place to catch forty winks.
One startled moggy later, I had become the owner of what I can only describe as…an ex-netbook.

Now, finally, I’ve managed to get a replacement (netbook, not cat).

As usual when I get a new machine, the first thing I did was to replace Windows with Linux Mint…with the immediate result being that the wireless card stopped working.

The solution ? Don’t (kernel) panic, kernel upgrade !

Support for most of the hardware out there is included in the Linux Kernel. The kernel is enhanced and released every few months. However, distributions, such as Mint, tend to stick on one kernel version for a while in order to provide a stable base on which to develop.
This means that, if Linux is not playing nicely with your Wireless card/web-cam/any other aspect of your machine’s hardware, a kernel upgrade may resolve your problem.
Obviously it’s always good to do a bit of checking to see if this might be the case.
It’s also good to have a way of putting things back as they were should the change we’re making not have the desired effect.

What I’m going to cover here is the specific issue I encountered with my new Netbook and the steps I took to figure out what kernel version might fix the problem.
I’ll then detail the kernel upgrade itself.

Machine details

The machine in question is an Acer TravelMate-B116.
It has an 11.6 inch screen, 4GB RAM and a 500GB HDD.
For the purposes of the steps that follow, I was able to connect to the internet via a wired connection to my router. Well, up until I got the wireless working.
The Linux OS I’m using is Linux Mint 17.3 Cinnamon.
Note that I have disabled UEFI and am booting the machine in Legacy mode.

Standard Warning – have a backup handy !

In my particular circumstances, I was trying to configure a new machine. If it all went wrong, I could simply re-install Mint and be back where I started.
If you have stuff on your machine that you don’t want to lose, it’s probably a good idea to back it up onto separate media ( e.g. a USB stick).
Additionally, if you are not presented with a grub menu when you boot your machine, you may consider running the boot-repair tool.
This will ensure that you have the option of which kernel to use if you have more than one to choose from ( which will be the case once you’ve done the kernel upgrade).

It is possible that upgrading the kernel may cause issues with some of the hardware that is working fine with the kernel you currently have installed, so it’s probably wise to be prepared.

Identifying the card

The first step then, is to identify exactly which wireless network card is in the machine.
From a terminal window …

lspci

00:00.0 Host bridge: Intel Corporation Device 2280 (rev 21)
00:02.0 VGA compatible controller: Intel Corporation Device 22b1 (rev 21)
00:0b.0 Signal processing controller: Intel Corporation Device 22dc (rev 21)
00:13.0 SATA controller: Intel Corporation Device 22a3 (rev 21)
00:14.0 USB controller: Intel Corporation Device 22b5 (rev 21)
00:1a.0 Encryption controller: Intel Corporation Device 2298 (rev 21)
00:1b.0 Audio device: Intel Corporation Device 2284 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Device 22c8 (rev 21)
00:1c.2 PCI bridge: Intel Corporation Device 22cc (rev 21)
00:1c.3 PCI bridge: Intel Corporation Device 22ce (rev 21)
00:1f.0 ISA bridge: Intel Corporation Device 229c (rev 21)
00:1f.3 SMBus: Intel Corporation Device 2292 (rev 21)
02:00.0 Network controller: Intel Corporation Device 3165 (rev 81)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

It looks like the penultimate entry is our wireless card.
It is possible to get details of the card you have by using “Intel Corporation Device 3165” as a search term. However, we may be able to get the name of the card by running ….

lspci -vq |grep -i wireless -B 1 -A 4

In my case, this returns :

02:00.0 Network controller: Intel Corporation Wireless 3165 (rev 81)
	Subsystem: Intel Corporation Dual Band Wireless AC 3165
	Flags: bus master, fast devsel, latency 0, IRQ 200
	Memory at 91100000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: <access denied>

Further digging around reveals that, according to Intel, this card is supported in Linux starting at kernel version 4.2.

Now, which version of the Kernel are we actually running ?

Identifying the current kernel version and packages

This is relatively simple. In the Terminal just type :

uname -r

On Mint 17.3, the output is :

3.19.0-32-generic

At this point, we now know that an upgrade to the kernel may well solve our wireless problem. The question now is, which packages do we need to install to effect the upgrade ?

If you look in the repositories, there appear to be at least two distinct versions of kernel packages, the generic and something called low-latency.
In order to be confident of which packages we want to get, it’s probably a good idea to work out what we have now.
This can be achieved by searching the installed packages for the version number of the current kernel.
We can do this in the terminal :

dpkg --list |grep 3.19.0-32 |awk '{print $2}'

In my case, this returned :

linux-headers-3.19.0-32
linux-headers-3.19.0-32-generic
linux-image-3.19.0-32-generic
linux-image-extra-3.19.0-32-generic
linux-kernel-generic

As an alternative, you could use the graphical Synaptic Package Manager.
You can start this from the menu ( Administration/Synaptic Package Manager).

[Screenshot: synaptic1]

Now we know what we’ve got, the next step is to find the kernel version that we need…

Getting the new kernel packages

It may well be the case that the kernel version you’re after has already been added to the distro’s repository.
To see if this is the case, use Synaptic Package Manager to search as follows :

Start Synaptic Package Manager from the System Menu.
You will be prompted for your password.

Click the Status button and select Not Installed

[Screenshot: synaptic_search1]

In the Quick filter bar, enter the text : linux-headers-4.2*-generic

[Screenshot: synaptic_search2]

This should give you a list of any kernel 4.2 versions available in the repository.

If, as I did, you find the version you’re looking for, you need to select the packages that are equivalent to the ones you already have installed on your system.
Incidentally, there are a number of 4.2 kernel versions available, so I decided to go for the latest.
In my case then, I want to install :

  • linux-headers-4.2.0-25
  • linux-headers-4.2.0-25-generic
  • linux-image-4.2.0-25-generic
  • linux-image-extra-4.2.0-25-generic

NOTE – If you don’t find the kernel version you are looking for, you can always download the packages directly using these instructions.

Assuming we have found the version we want, we need to now search for the relevant packages.
In the Quick filter field in Synaptic, change the search string to : linux-*4.2.0-25

To Mark the packages for installation, right-click each one in turn and select Mark for Installation

[Screenshot: synaptic_select]

Once you’ve selected them all, hit the Apply button.

Once the installation is completed, you need to re-start your computer.

On re-start, you should find that the Grub menu has an entry for Advanced Options.
If you select this, you’ll see that you have a list of kernels to choose to boot into.
This comes in handy if you want to go back to running the previous kernel version.

For now though, we’ll boot into the kernel we’ve just installed.
We can confirm that the installation has been successful, once the machine starts, by opening a Terminal and running :

uname -r

If all has gone to plan, we should now see…

4.2.0-25-generic

Even better in my case, my wireless card has now been recognised.
Opening the systray icon, I can enable wireless and connect to my router.

Backing out of the Kernel Upgrade

If you find that the effects of the kernel upgrade are undesirable, you can always go back to the kernel you started with.
If at all possible, I’d recommend starting Mint using the old kernel before doing this.

If you’re running on the kernel for which you are deleting the packages, you may get some alarming warnings. However, once you re-start, you should be back to your original kernel version.

The command then, is :

sudo apt-get remove linux-headers-4.2* linux-image-4.2*

…where 4.2 is the version of the kernel you want to remove.
Run this and the output looks like this…

The following packages will be REMOVED
  linux-headers-4.2.0-25 linux-headers-4.2.0-25-generic
  linux-image-4.2.0-25-generic linux-image-extra-4.2.0-25-generic
  linux-signed-image-4.2.0-25-generic
0 to upgrade, 0 to newly install, 5 to remove and 7 not to upgrade.
After this operation, 294 MB disk space will be freed.
Do you want to continue? [Y/n]

Once the packages have been removed, the old kernel will be in use on the next re-boot.
After re-starting, you can check this with :

uname -r

Thankfully, these steps proved unnecessary in my case and the kernel upgrade has saved me from hardware cat-astrophe.


Filed under: Linux, Mint Tagged: Acer TravelMate-B116, apt-get remove, dpkg, Intel Corporation Dual Band Wireless AC 3165, kernel upgrade, lspci, synaptic package manager, uname -r