The dwh schema

The dwh schema is a fairly textbook star schema, with things like jobs and worker usage in fact tables, and the various things that you might use in a SQL WHERE clause in dimension tables.
The job_fact table, a typical "fact" table

For example, to get reports about various jobs over time, you'll be querying the dwh.job_fact table:
pfx=> \dS+ dwh.job_fact
                                      Table "dwh.job_fact"
Column | Type | Modifiers | Storage | Stats target | Description
------------------+--------------------------+-----------+----------+--------------+-------------
time_sk | integer | not null | plain | |
job_id | integer | | plain | |
job_pgrp | integer | | plain | |
job_name | character varying(255) | | extended | |
job_priority | integer | | plain | |
job_timesubmit | timestamp with time zone | | plain | |
job_timestart | timestamp with time zone | | plain | |
job_timecomplete | timestamp with time zone | | plain | |
jobstatus_sk | integer | | plain | |
user_sk | smallint | | plain | |
cluster_sk | smallint | | plain | |
account_sk | smallint | | plain | |
kind_sk | smallint | | plain | |
prototype_sk | smallint | | plain | |
prod_show_sk | smallint | | plain | |
prod_shot_sk | smallint | | plain | |
prod_seq_sk | smallint | | plain | |
prod_client_sk | smallint | | plain | |
prod_dept_sk | smallint | | plain | |
prod_custom1_sk | smallint | | plain | |
prod_custom2_sk | smallint | | plain | |
prod_custom3_sk | smallint | | plain | |
prod_custom4_sk | smallint | | plain | |
prod_custom5_sk | smallint | | plain | |
cpu_count | integer | not null | plain | |
cpu_seconds | integer | not null | plain | |
work_count | integer | not null | plain | |
worktime_min | integer | not null | plain | |
worktime_max | integer | not null | plain | |
worktime_avg | real | not null | plain | |
Indexes:
"job_id_unique" UNIQUE CONSTRAINT, btree (job_id)
"idx_time" btree (time_sk) |
Note: job_time* are stored as timestamp types

With Qube 7.x, which uses PostgreSQL as the backbone, the job_time* columns are stored as timestamp types, so, for example, you can fetch the time data for a specific job as in:

pfx=> SELECT job_id, job_timesubmit, job_timestart, job_timecomplete FROM dwh.job_fact WHERE job_id=57 ;
 job_id |     job_timesubmit     |     job_timestart      |    job_timecomplete
--------+------------------------+------------------------+------------------------
     57 | 2020-08-07 23:49:57-10 | 2020-08-07 23:49:57-10 | 2020-08-07 23:50:03-10
(1 row)
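Because these are true timestamp columns, interval arithmetic works on them directly. For instance, something along these lines (an illustrative sketch, not part of the original examples) reports how long each job waited in the queue and how long it ran:

pfx=> SELECT job_id
, job_timestart - job_timesubmit AS queue_wait
, job_timecomplete - job_timestart AS run_time
FROM dwh.job_fact
WHERE job_timecomplete IS NOT NULL
ORDER BY job_id
LIMIT 5
;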
"*_sk" columns can be used to do INNER JOINs to a similarly named dimension table
Any column that is named with an an _sk
suffix is a a Synthetic ynthetic Key that points to a corresponding dimension table, named with the part of the column before the the _sk
; the dimension table will have a a _dim
suffix suffix in the name. This way, it's easy to write the the JOIN
's, the column name is a clue to the dimension table, which will have a column of the same name. Almost every dimension table will consist of a a *_sk
PRIMARY PRIMARY KEY
and a a name column column.
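So, for instance, the cluster_sk column on job_fact joins to cluster_dim on its cluster_sk column. A sketch of the general pattern (assuming a cluster_dim table that follows this naming convention; it isn't shown in this document):

pfx=> SELECT cluster_dim.name
, COUNT(*) AS "job count"
FROM
job_fact
INNER JOIN cluster_dim
ON job_fact.cluster_sk = cluster_dim.cluster_sk
GROUP BY cluster_dim.name
;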
A typical dimension table, the "user_dim" table

For example, the user_sk column can be used to do a SQL INNER JOIN to the user_dim table.
pfx=> \dS+ dwh.user_dim
                                            Table "dwh.user_dim"
 Column  |          Type          |                          Modifiers                          | Storage  | Stats target | Description
---------+------------------------+-------------------------------------------------------------+----------+--------------+-------------
 user_sk | integer                | not null default nextval('user_dim_user_sk_seq'::regclass)  | plain    |              |
 name    | character varying(255) | not null                                                     | extended |              |
Indexes:
    "user_dim_pkey" PRIMARY KEY, btree (user_sk)
    "user_dim_name_key" UNIQUE CONSTRAINT, btree (name)

pfx=> SELECT name FROM user_dim;
     name
---------------
 shinya
 root
 administrator
 jburk
 joe
 kmac
(6 rows)
Get a count of all jobs for a particular user:
pfx=> SELECT COUNT(*)
FROM
job_fact
INNER JOIN user_dim
ON job_fact.user_sk = user_dim.user_sk
WHERE user_dim.name = 'shinya'
;
 count
-------
   370
(1 row)
The time dimension table
The dwh.time_dim table is provided so that you don't have to perform date/time operations on every row in a fact table (since fact tables can run into the hundreds of millions of rows); instead, you do a SQL INNER JOIN to it and use the values in the time_dim table in your WHERE clause. The time_sk column in every fact table has an identical value in the time_dim table, which has a single row with the primary key time_sk. The time_sk value is actually the unix epoch time in seconds:
pfx=> SELECT * FROM time_dim ORDER BY time_sk DESC LIMIT 1
;
  time_sk   |       date_time        | hour |    date    | dow | month_name | month | year
------------+------------------------+------+------------+-----+------------+-------+------
 1607570700 | 2020-12-09 17:25:00-10 |   17 | 2020-12-09 |   3 | December   |    12 | 2020
(1 row)

pfx=> SELECT to_timestamp(time_sk)
, date_time FROM time_dim ORDER BY time_sk DESC LIMIT 1
;
      to_timestamp      |       date_time
------------------------+------------------------
 2020-12-09 17:30:00-10 | 2020-12-09 17:30:00-10
(1 row)
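The pre-computed columns in time_dim (hour, dow, month, month_name, year) make time-bucketed reports cheap. As an illustrative sketch (not one of the original examples), you could count jobs per hour of the day like this:

pfx=> SELECT time_dim.hour
, COUNT(*) AS "job count"
FROM
job_fact
INNER JOIN time_dim
ON job_fact.time_sk = time_dim.time_sk
GROUP BY time_dim.hour
ORDER BY time_dim.hour
;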
The "job status" dimension table
The dwh.jobstatus_dim table is one of the few exceptions to the normal dimension table structure; it provides a mapping between the integer and human-readable job status values.
pfx=> SELECT * FROM jobstatus_dim;
 jobstatus_sk | status_int | status_char | effective_date | expiry_date
--------------+------------+-------------+----------------+-------------
            1 |         16 | complete    | 1999-12-31     | 9999-12-31
            2 |         32 | failed      | 1999-12-31     | 9999-12-31
            3 |         48 | killed      | 1999-12-31     | 9999-12-31
            4 |        261 | dying       | 1999-12-31     | 9999-12-31
            5 |        262 | exiting     | 1999-12-31     | 9999-12-31
            6 |        265 | registering | 1999-12-31     | 9999-12-31
            7 |        272 | blocked     | 1999-12-31     | 9999-12-31
            8 |        288 | waiting     | 1999-12-31     | 9999-12-31
            9 |        304 | suspended   | 1999-12-31     | 9999-12-31
           10 |        320 | pending     | 1999-12-31     | 9999-12-31
           11 |        323 | waiting     | 1999-12-31     | 9999-12-31
           12 |        325 | badlogin    | 1999-12-31     | 9999-12-31
           13 |        336 | running     | 1999-12-31     | 9999-12-31
(13 rows)
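A typical use is to group or filter on the human-readable status name rather than the raw integer. For example, a sketch (not one of the original examples) that counts all jobs by their final status:

pfx=> SELECT jobstatus_dim.status_char
, COUNT(*) AS "job count"
FROM
job_fact
INNER JOIN jobstatus_dim
ON job_fact.jobstatus_sk = jobstatus_dim.jobstatus_sk
GROUP BY jobstatus_dim.status_char
ORDER BY "job count" DESC
;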
Get a count of all jobs for a particular user for August, 2020:
pfx=> SELECT COUNT(*)
FROM
job_fact
INNER JOIN user_dim
ON job_fact.user_sk = user_dim.user_sk
INNER JOIN time_dim
ON job_fact.time_sk = time_dim.time_sk
WHERE user_dim.name = 'shinya'
AND time_dim.month = 8
AND time_dim.year = 2020
;
 count
-------
    96
(1 row)
Get a count of all jobs for each user for all of 2020:
pfx=> SELECT user_dim.name
, time_dim.month_name
, COUNT(*) as "job count"
FROM
job_fact
INNER JOIN user_dim
ON job_fact.user_sk = user_dim.user_sk
INNER JOIN time_dim
ON job_fact.time_sk = time_dim.time_sk
WHERE time_dim.year = 2020
GROUP BY user_dim.name
, time_dim.month
, time_dim.month_name
ORDER BY user_dim.name
, time_dim.month
;
  name  | month_name | job count
--------+------------+-----------
 root   | July       |         2
 root   | August     |         3
 root   | September  |        13
 root   | October    |         8
 root   | November   |         4
 shinya | July       |         9
 shinya | August     |        31
 shinya | September  |       157
 shinya | October    |        50
 shinya | November   |        75
(10 rows)
Get a count of all jobs for each user for all of 2020, broken down by month and the job's final status:
pfx=> SELECT
user_dim.name
, time_dim.year
, time_dim.month_name
, jobstatus_dim.status_char
, COUNT(*) as "job count"
FROM
job_fact
INNER JOIN
user_dim
ON
job_fact.user_sk=user_dim.user_sk
INNER JOIN
time_dim
ON
job_fact.time_sk=time_dim.time_sk
INNER JOIN
jobstatus_dim
ON
job_fact.jobstatus_sk=jobstatus_dim.jobstatus_sk
WHERE
time_dim.year = 2020
GROUP BY
user_dim.name
, time_dim.year
, time_dim.month
, time_dim.month_name
, jobstatus_dim.status_char
, jobstatus_dim.status_int
ORDER BY
user_dim.name
, time_dim.month
, jobstatus_dim.status_char
;
  name  | year | month_name | status_char | job count
--------+------+------------+-------------+-----------
 root   | 2020 | July       | complete    |         2
 root   | 2020 | August     | complete    |         3
 root   | 2020 | September  | complete    |        11
 root   | 2020 | September  | failed      |         2
 root   | 2020 | October    | complete    |         8
 root   | 2020 | November   | complete    |         4
 shinya | 2020 | July       | complete    |         6
 shinya | 2020 | July       | failed      |         1
 shinya | 2020 | July       | killed      |         2
 shinya | 2020 | August     | complete    |        25
 shinya | 2020 | August     | killed      |         6
 shinya | 2020 | September  | complete    |       139
 shinya | 2020 | September  | failed      |        11
 shinya | 2020 | September  | killed      |         7
 shinya | 2020 | October    | complete    |        47
 shinya | 2020 | October    | failed      |         2
 shinya | 2020 | October    | killed      |         1
 shinya | 2020 | November   | complete    |        64
 shinya | 2020 | November   | failed      |         8
 shinya | 2020 | November   | killed      |         3
(20 rows)
Get the sum total of cpu_seconds used for each user for the last 7 days, broken down by user, date, and the job's final status:
pfx=> SELECT
user_dim.name
, time_dim.date
, jobstatus_dim.status_char
, SUM(job_fact.cpu_seconds) as "cpu_time"
FROM
job_fact
INNER JOIN
user_dim
ON
job_fact.user_sk=user_dim.user_sk
INNER JOIN
time_dim
ON
job_fact.time_sk=time_dim.time_sk
INNER JOIN
jobstatus_dim
ON
job_fact.jobstatus_sk=jobstatus_dim.jobstatus_sk
WHERE
DATE_PART('day', CURRENT_DATE - time_dim.date_time) < 7
GROUP BY
user_dim.name,jobstatus_dim.status_char
, time_dim.date
, jobstatus_dim.status_int
ORDER BY
time_dim.date
, cpu_time DESC
, jobstatus_dim.status_char
;
name | date | status_char | cpu_time
--------+------------+-------------+----------
shinya | 2020-11-13 | complete | 144
shinya | 2020-11-16 | complete | 97
shinya | 2020-11-17 | failed | 906
shinya | 2020-11-17 | killed | 677
shinya | 2020-11-17 | complete | 102
shinya | 2020-11-18 | failed | 18695
root | 2020-11-18 | complete | 1199
shinya | 2020-11-18 | complete | 760
shinya | 2020-11-19 | failed | 8022
shinya | 2020-11-19 | complete | 606
root | 2020-11-19 | complete | 46
shinya | 2020-11-19 | killed | 0
(12 rows)
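The same pattern extends to the other *_sk dimensions on job_fact. For example, to break resource usage down per show you would join through prod_show_sk; the sketch below assumes a prod_show_dim table that follows the *_sk/*_dim naming convention described earlier (it isn't shown in this document):

pfx=> SELECT prod_show_dim.name
, SUM(job_fact.cpu_seconds) AS cpu_time
, COUNT(*) AS "job count"
FROM
job_fact
INNER JOIN prod_show_dim
ON job_fact.prod_show_sk = prod_show_dim.prod_show_sk
GROUP BY prod_show_dim.name
ORDER BY cpu_time DESC
;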