The dwh schema is a fairly textbook star schema, with things like jobs and worker usage in fact tables, and the various things that you might use in a SQL WHERE clause in dimension tables.

The job_fact table, a typical "fact" table

For example, to get reports about various jobs over time, you'll be querying the dwh.job_fact table:
Code Block |
---|
|
pfx=> \dS+ dwh.job_fact
Table "dwh.job_fact"
Column | Type | Modifiers | Storage | Stats target | Description
------------------+--------------------------+-----------+----------+--------------+-------------
time_sk | integer | not null | plain | |
job_id | integer | | plain | |
job_pgrp | integer | | plain | |
job_name | character varying(255) | | extended | |
job_priority | integer | | plain | |
job_timesubmit | timestamp with time zone | | plain | |
job_timestart | timestamp with time zone | | plain | |
job_timecomplete | timestamp with time zone | | plain | |
jobstatus_sk | integer | | plain | |
user_sk | smallint | | plain | |
cluster_sk | smallint | | plain | |
account_sk | smallint | | plain | |
kind_sk | smallint | | plain | |
prototype_sk | smallint | | plain | |
prod_show_sk | smallint | | plain | |
prod_shot_sk | smallint | | plain | |
prod_seq_sk | smallint | | plain | |
prod_client_sk | smallint | | plain | |
prod_dept_sk | smallint | | plain | |
prod_custom1_sk | smallint | | plain | |
prod_custom2_sk | smallint | | plain | |
prod_custom3_sk | smallint | | plain | |
prod_custom4_sk | smallint | | plain | |
prod_custom5_sk | smallint | | plain | |
cpu_count | integer | not null | plain | |
cpu_seconds | integer | not null | plain | |
work_count | integer | not null | plain | |
worktime_min | integer | not null | plain | |
worktime_max | integer | not null | plain | |
worktime_avg | real | not null | plain | |
Indexes:
"job_id_unique" UNIQUE CONSTRAINT, btree (job_id)
"idx_time" btree (time_sk) |
Note |
---|
title | job_time* are stored as timestamp types |
---|
|
With Qube 7.x, which uses PostgreSQL as the backbone, the job_time* columns are stored as timestamp types, so, for example, you can fetch time data for a specific job as in:

pfx=> SELECT job_id, job_timesubmit, job_timestart, job_timecomplete FROM dwh.job_fact WHERE job_id=57;
 job_id |     job_timesubmit     |     job_timestart      |    job_timecomplete
--------+------------------------+------------------------+------------------------
     57 | 2020-08-07 23:49:57-10 | 2020-08-07 23:49:57-10 | 2020-08-07 23:50:03-10
(1 row)
|
|
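On older, MySQL-backed Qube releases the job_time* columns held raw UNIX epoch integers instead; if you ever pull such data out and need to convert it client-side, here is a minimal Python sketch (the helper name is our own, and the epoch value is just the UTC equivalent of job 57's submit time above):

```python
from datetime import datetime, timezone

def epoch_to_timestamp(epoch_seconds):
    """Convert a raw UNIX epoch value (as stored by older, MySQL-backed
    Qube releases) into an ISO-8601 timestamp string, here in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()

# 1596880197 is the UTC equivalent of 2020-08-07 23:49:57-10.
print(epoch_to_timestamp(1596880197))  # 2020-08-08T09:49:57+00:00
```

In PostgreSQL itself the equivalent conversion is to_timestamp(), shown later on this page against the time_dim table.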
"*_sk" columns can be used to do INNER JOINs to a similarly named dimension table
Any column named with an _sk suffix is a Synthetic Key that points to a corresponding dimension table, named with the part of the column name before the _sk; the dimension table will have a _dim suffix in its name. This makes it easy to write the JOINs: the column name is a clue to the dimension table, which will have a column of the same name. Almost every dimension table will consist of a *_sk PRIMARY KEY and a name column.
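The pattern is easy to see on a toy schema. The following Python sketch (using the standard library's sqlite3, with made-up job rows, not real Qube data) builds a miniature job_fact/user_dim pair and joins them through user_sk:

```python
import sqlite3

# Toy versions of job_fact and user_dim, just to illustrate how a
# *_sk column joins to the dimension table of the same base name.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE user_dim (user_sk INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE job_fact (job_id INTEGER UNIQUE, user_sk INTEGER);
    INSERT INTO user_dim (user_sk, name) VALUES (1, 'shinya'), (2, 'root');
    INSERT INTO job_fact (job_id, user_sk) VALUES (101, 1), (102, 1), (103, 2);
""")

# The column name (user_sk) tells you which dimension table to join to.
rows = db.execute("""
    SELECT user_dim.name, COUNT(*)
    FROM job_fact
    INNER JOIN user_dim ON job_fact.user_sk = user_dim.user_sk
    GROUP BY user_dim.name
    ORDER BY user_dim.name
""").fetchall()
print(rows)  # [('root', 1), ('shinya', 2)]
```

The same SELECT, with dwh-qualified table names, works verbatim against the real warehouse.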
A typical dimension table, the "user_dim" table
For example, the user_sk column can be used to do a SQL INNER JOIN to the user_dim table.
Code Block |
---|
|
pfx=> \dS+ dwh.user_dim
                                              Table "dwh.user_dim"
 Column  |          Type          |                          Modifiers                          | Storage  | Stats target | Description
---------+------------------------+-------------------------------------------------------------+----------+--------------+-------------
 user_sk | integer                | not null default nextval('user_dim_user_sk_seq'::regclass)  | plain    |              |
 name    | character varying(255) | not null                                                     | extended |              |
Indexes:
    "user_dim_pkey" PRIMARY KEY, btree (user_sk)
    "user_dim_name_key" UNIQUE CONSTRAINT, btree (name)

pfx=> SELECT name FROM user_dim;
     name
---------------
 shinya
 root
 administrator
 jburk
 joe
 kmac
(6 rows)
|
Get a count of all jobs for a particular user:
Code Block |
---|
|
pfx=> SELECT COUNT(*)
FROM
job_fact
INNER JOIN user_dim
ON job_fact.user_sk = user_dim.user_sk
WHERE user_dim.name = 'shinya'
;
 count
-------
   370
(1 row)
|
The time dimension table
The dwh.time_dim table is provided so that you don't have to perform date/time operations on every row in a fact table (since they can run into the 100's of millions of rows); instead, you do a SQL INNER JOIN to it and use the values in the time_dim table in your WHERE clause. The time_sk column in every fact table has an identical value in the time_dim table, which has a single row with a primary key time_sk. The time_sk value is actually the unix epoch time in seconds:
Code Block |
---|
|
pfx=> SELECT * FROM time_dim ORDER BY time_sk DESC LIMIT 1;
  time_sk   |       date_time        | hour |    date    | dow | month_name | month | year
------------+------------------------+------+------------+-----+------------+-------+------
 1607570700 | 2020-12-09 17:25:00-10 |   17 | 2020-12-09 |   3 | December   |    12 | 2020
(1 row)

pfx=> SELECT to_timestamp(time_sk), date_time FROM time_dim ORDER BY time_sk DESC LIMIT 1;
      to_timestamp      |       date_time
------------------------+------------------------
 2020-12-09 17:30:00-10 | 2020-12-09 17:30:00-10
(1 row)
|
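In other words, each time_dim row is just a set of precomputed calendar attributes for one epoch value. Here is a hedged Python sketch of that derivation (done in UTC, so the same time_sk lands on 2020-12-10; the real table is built in the supervisor's local time zone, which is why the values above show a -10 offset):

```python
from datetime import datetime, timezone

def time_dim_row(time_sk):
    """Derive the calendar attributes that dwh.time_dim precomputes
    for one time_sk (a UNIX epoch, in seconds)."""
    dt = datetime.fromtimestamp(time_sk, tz=timezone.utc)
    return {
        "time_sk": time_sk,
        "date_time": dt.isoformat(),
        "hour": dt.hour,
        "date": dt.date().isoformat(),
        "dow": dt.isoweekday() % 7,  # assuming dow counts 0=Sunday..6=Saturday
        "month_name": dt.strftime("%B"),
        "month": dt.month,
        "year": dt.year,
    }

# The time_sk from the query above; in UTC it falls on 2020-12-10 03:25.
print(time_dim_row(1607570700))
```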
The "job status" dimension table
The dwh.jobstatus_dim table is one of the few exceptions to the normal dimension table structure; it provides a mapping between the integer and human-readable status values.
Code Block |
---|
|
pfx=> SELECT * FROM jobstatus_dim;
 jobstatus_sk | status_int | status_char | effective_date | expiry_date
--------------+------------+-------------+----------------+-------------
            1 |         16 | complete    | 1999-12-31     | 9999-12-31
            2 |         32 | failed      | 1999-12-31     | 9999-12-31
            3 |         48 | killed      | 1999-12-31     | 9999-12-31
            4 |        261 | dying       | 1999-12-31     | 9999-12-31
            5 |        262 | exiting     | 1999-12-31     | 9999-12-31
            6 |        265 | registering | 1999-12-31     | 9999-12-31
            7 |        272 | blocked     | 1999-12-31     | 9999-12-31
            8 |        288 | waiting     | 1999-12-31     | 9999-12-31
            9 |        304 | suspended   | 1999-12-31     | 9999-12-31
           10 |        320 | pending     | 1999-12-31     | 9999-12-31
           11 |        323 | waiting     | 1999-12-31     | 9999-12-31
           12 |        325 | badlogin    | 1999-12-31     | 9999-12-31
           13 |        336 | running     | 1999-12-31     | 9999-12-31
(13 rows)
|
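Outside the warehouse (e.g. when post-processing supervisor logs or API results), the same integer-to-name translation can be done with a plain lookup table. A sketch, using the values listed above (the function name and the "unknown" fallback are our own):

```python
# Mapping taken from the dwh.jobstatus_dim rows shown above.
JOB_STATUS = {
    16: "complete", 32: "failed", 48: "killed",
    261: "dying", 262: "exiting", 265: "registering",
    272: "blocked", 288: "waiting", 304: "suspended",
    320: "pending", 323: "waiting", 325: "badlogin",
    336: "running",
}

def status_name(status_int):
    """Translate a raw Qube status integer to its human-readable name."""
    return JOB_STATUS.get(status_int, "unknown")

print(status_name(16))   # complete
print(status_name(336))  # running
```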
Get a count of all jobs for a particular user for August, 2020:
Code Block |
---|
|
pfx=> SELECT COUNT(*)
FROM
job_fact
INNER JOIN user_dim
ON job_fact.user_sk = user_dim.user_sk
INNER JOIN time_dim
ON job_fact.time_sk = time_dim.time_sk
WHERE user_dim.name = 'shinya'
AND time_dim.month = 8
AND time_dim.year = 2020;
 count
-------
    96
(1 row)
|
Get a count of all jobs for each user for all of 2020:
Code Block |
---|
|
pfx=> SELECT user_dim.name
, time_dim.month_name
, COUNT(*) as "job count"
FROM
job_fact
INNER JOIN user_dim
ON job_fact.user_sk = user_dim.user_sk
INNER JOIN time_dim
ON job_fact.time_sk = time_dim.time_sk
WHERE time_dim.year = 2020
GROUP BY user_dim.name
, time_dim.month, time_dim.month_name
ORDER BY user_dim.name
, time_dim.month
;
  name  | month_name | job count
--------+------------+-----------
 root   | July       |         2
 root   | August     |         3
 root   | September  |        13
 root   | October    |         8
 root   | November   |         4
 shinya | July       |         9
 shinya | August     |        31
 shinya | September  |       157
 shinya | October    |        50
 shinya | November   |        75
(10 rows)
|
Get a count of all jobs for each user for all of 2020, broken down by month and the job's final status:
Code Block |
---|
|
pfx=> SELECT
user_dim.name
, time_dim.year
, time_dim.month_name
, jobstatus_dim.status_char
, COUNT(*) as "job count"
FROM
job_fact
INNER JOIN
user_dim
ON
job_fact.user_sk=user_dim.user_sk
INNER JOIN
time_dim
ON
job_fact.time_sk=time_dim.time_sk
INNER JOIN
jobstatus_dim
ON
job_fact.jobstatus_sk=jobstatus_dim.jobstatus_sk
WHERE
time_dim.year = 2020
GROUP BY
user_dim.name
, time_dim.year, time_dim.month_name, jobstatus_dim.status_char
, time_dim.month
, jobstatus_dim.status_int
ORDER BY
user_dim.name
, time_dim.month
, jobstatus_dim.status_char
;
  name  | year | month_name | status_char | job count
--------+------+------------+-------------+-----------
 root   | 2020 | July       | complete    |         2
 root   | 2020 | August     | complete    |         3
 root   | 2020 | September  | complete    |        11
 root   | 2020 | September  | failed      |         2
 root   | 2020 | October    | complete    |         8
 root   | 2020 | November   | complete    |         4
 shinya | 2020 | July       | complete    |         6
 shinya | 2020 | July       | failed      |         1
 shinya | 2020 | July       | killed      |         2
 shinya | 2020 | August     | complete    |        25
 shinya | 2020 | August     | killed      |         6
 shinya | 2020 | September  | complete    |       139
 shinya | 2020 | September  | failed      |        11
 shinya | 2020 | September  | killed      |         7
 shinya | 2020 | October    | complete    |        47
 shinya | 2020 | October    | failed      |         2
 shinya | 2020 | October    | killed      |         1
 shinya | 2020 | November   | complete    |        64
 shinya | 2020 | November   | failed      |         8
 shinya | 2020 | November   | killed      |         3
(20 rows)
|
Get the sum total of cpu_seconds used for each user for the last 7 days, broken down by user, date, and the job's final status:
Code Block |
---|
|
pfx=> SELECT
user_dim.name
, time_dim.date
, jobstatus_dim.status_char
, SUM(job_fact.cpu_seconds) as "cpu_time"
FROM
job_fact
INNER JOIN
user_dim
ON
job_fact.user_sk=user_dim.user_sk
INNER JOIN
time_dim
ON
job_fact.time_sk=time_dim.time_sk
INNER JOIN
jobstatus_dim
ON
job_fact.jobstatus_sk=jobstatus_dim.jobstatus_sk
WHERE
DATE_PART('day', CURRENT_DATE - time_dim.date_time) < 7
GROUP BY
user_dim.name,jobstatus_dim.status_char
, time_dim.date
, jobstatus_dim.status_int
ORDER BY
time_dim.date
, cpu_time DESC
, jobstatus_dim.status_char
;
name | date | status_char | cpu_time
--------+------------+-------------+----------
shinya | 2020-11-13 | complete | 144
shinya | 2020-11-16 | complete | 97
shinya | 2020-11-17 | failed | 906
shinya | 2020-11-17 | killed | 677
shinya | 2020-11-17 | complete | 102
shinya | 2020-11-18 | failed | 18695
root | 2020-11-18 | complete | 1199
shinya | 2020-11-18 | complete | 760
shinya | 2020-11-19 | failed | 8022
shinya | 2020-11-19 | complete | 606
root | 2020-11-19 | complete | 46
shinya | 2020-11-19 | killed | 0
(12 rows)
|