Oracle External Tables – Part 1

If Only I Could Write A SQL Query On Data in a File…

Imagine you have a text file with data in it, such as a CSV (comma-separated values) file or ASCII fixed-width fields. Wouldn’t it be great if you could treat it like a table and write SQL queries against it?  Well you can, and that’s exactly what Oracle’s External Tables functionality allows you to do!

External Tables are an extension to Oracle’s SQL*Loader – which pumps data directly into tables.  Both use the ORACLE_LOADER access driver, a utility for handling text data, but unlike SQL*Loader, External Tables expose the data in place for you to interrogate, project, join to other tables, use as the basis of inserts, updates, merges… and lots of other great things.

Sounds Good, So How Do I Use External Tables?

OK, so the best way to demonstrate the power of External Tables is to give it a try. Firstly, let’s get ourselves a CSV data source for our trial purposes.  There are many free sample CSV files available online; I picked this one, with 2006 crime statistics from Sacramento.  At around 700KB it’s quite a decent size, so it should allow us to test how quickly External Tables function.  It also has a header line, which helps describe the fields.

Sacramento 2006 Crime Data – Viewed in Notepad++

I can garner the following information from opening the file in Notepad++:

  • The file has a header line, which we’ll want to skip
  • The field delimiter (what separates the fields) is COMMA (,)
  • The record delimiter (what separates the records) is CR or CARRIAGE RETURN (ASCII 13)
  • The fields are not enclosed in double quotes (sometimes text data appears “like this” in files)
  • There are 10 fields to a record, with a mixture of datatypes – some are dates, some numbers and some strings

Some Setup – Oracle Directory

External Tables rely on files being placed in an Oracle Directory.  This is a named Oracle object which maps to a path on a drive or network share.  It differs from an operating system directory in that respect, and exists for security reasons, so that people can’t go reading, writing and deleting files willy-nilly.  So before we continue, we need to create an Oracle Directory.

Note : I’ve already copied the CSV file to the operating system folder, so it’ll be ready to play with.

You’ll need to log in as SYS (or another user with the CREATE ANY DIRECTORY privilege) to create an Oracle Directory, and here I’m granting access to PUBLIC so that other schemas and users can read and write from it.

CREATE OR REPLACE DIRECTORY PAULZIP_ORACLE AS 'C:\Users\Paul\Documents\Oracle';
GRANT READ, WRITE ON DIRECTORY PAULZIP_ORACLE TO PUBLIC;

Afterwards, simply log back into your working schema and you’re good to start.
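If you want to sanity check that the directory object exists and points where you expect, you can query the ALL_DIRECTORIES data dictionary view (just a quick optional check, not a required step):

```sql
-- Confirm the directory object exists and maps to the expected path
SELECT directory_name, directory_path
FROM   all_directories
WHERE  directory_name = 'PAULZIP_ORACLE';
```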

External Table Definition

Next I’ll make a first attempt at the External Table definition.  I’m basing it on a few observations of the data, as I don’t have a file format specification.  Often you’ll know the format of a CSV from a definition document, but this is good practice anyway, and it’ll show what happens if we define things incorrectly…

CREATE TABLE T_SACRAMENTO_2006 (
  CDATETIME     DATE,
  ADDRESS       VARCHAR2(30),
  DISTRICT      NUMBER(2),
  BEAT          VARCHAR2(10),
  GRID          NUMBER(5),
  CRIMEDESCR    VARCHAR2(50),
  UCR_NCIC_CODE NUMBER(5),
  LATITUDE      NUMBER(20, 8),
  LONGITUDE     NUMBER(20, 8),
  REC_NUM       INTEGER
)
ORGANIZATION external
(
  TYPE oracle_loader
  DEFAULT DIRECTORY PAULZIP_ORACLE
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY 0X'0D'
    SKIP 1
    BADFILE 'SacramentocrimeJanuary2006.bad'  
    LOGFILE 'SacramentocrimeJanuary2006.log'   
    FIELDS TERMINATED BY ","
    (
      CDATETIME CHAR(14) date_format DATE mask "MM/DD/YY HH24:MI",   
      ADDRESS      ,
      DISTRICT     ,
      BEAT         ,
      GRID         ,
      CRIMEDESCR   ,
      UCR_NCIC_CODE,
      LATITUDE     ,
      LONGITUDE    ,
      REC_NUM   RECNUM
    )    
  )
  LOCATION('SacramentocrimeJanuary2006.csv')
)
PARALLEL
REJECT LIMIT UNLIMITED;

Here’s a breakdown of some sections of the statement:

RECORDS DELIMITED BY 0X'0D'

My records are delimited by carriage return, which in hex is 0x’0D’


SKIP 1

I have one line that I want to skip (the header line)


BADFILE 'SacramentocrimeJanuary2006.bad'

I want oracle_loader to put any bad records which don’t meet my defined format into a BAD file.


LOGFILE 'SacramentocrimeJanuary2006.log'

I want oracle_loader to log information about how it mounted the file into a LOG file. This can be a useful reference when we have problems.


CDATETIME char(14) date_format DATE mask "MM/DD/YY HH24:MI",

I want Oracle to try to map the data to the datatypes in the first section, but I need to tell Oracle how to handle my dates, so this is the date mask used in the file for the crime date/time. The field is up to 14 characters in length, and the mask is a DATE format mask (rather than a TIMESTAMP mask).


REC_NUM RECNUM

I also want a special column to log the line number in the file each record came from. Notice the type is RECNUM? This is a special field type for exactly this purpose. I always recommend adding a RECNUM field to External Tables, as it helps you cross-reference file lines.
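For example, once the table is mounted, REC_NUM lets you trace any record of interest straight back to its line in the file (a sketch – the filter here is just for illustration):

```sql
-- Which file lines do the district 3 records come from?
SELECT rec_num, cdatetime, address
FROM   t_sacramento_2006
WHERE  district = 3;
```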


REJECT LIMIT UNLIMITED

Don’t abandon the load after a certain number of rejected bad records – allow an unlimited number of rejections if need be.
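If you’d rather a load fail fast instead, you can tighten the limit – REJECT LIMIT 0 (the default when the clause is omitted) raises an error on the first bad record. As a sketch:

```sql
-- Stricter alternative: error out as soon as one record is rejected
ALTER TABLE T_SACRAMENTO_2006 REJECT LIMIT 0;
```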


Let’s Give It A Try!

Oracle won’t mount the file until you try to select data from it.  This is a good thing, as sometimes you might define an External Table before the file is available; as long as your Oracle Directory exists, you won’t get an error when creating the External Table.

NOTE : If you are using a SQL development tool which displays results in a grid, those results are fetched piecemeal (in, say, 250 row chunks) on a need-to-display basis. As External Tables also have an internal buffer, later records in the file might not be processed until you scroll to them in your SQL tool, which means BAD records may not get flagged until that scrolling occurs.
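One way round this is to force Oracle to read the entire file, for instance with a simple count – every record then gets processed, so the BAD and LOG files reflect the whole data set:

```sql
-- Reading every record ensures all BAD records are detected
SELECT COUNT(*) FROM T_SACRAMENTO_2006;
```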

select * from T_SACRAMENTO_2006;


Our first External Table Results!

It worked! We now have a table to represent the data in the file.

Before we continue, we should check whether we have any bad data. This will be in our BAD file (which contains records that failed validation for some reason).


Bad data!

Oh dear, it appears we do.  Let’s have a look at the BAD file.  It’s in exactly the same format as the original file, although it has no header line – it only contains the records that erred.  The fact it’s in the same format can be useful: you might want to load a file knowing it holds faulty records, then extract those records, fix them and process them again later.  Great!

Bad File Contents

1/1/06 0:36,NORTHGATE BLVD / W SILVER EAGLE RD,1,1C        ,419,CASUALTY REPORT,7000,38.63036289,-121.4755269
1/2/06 18:46,NORTHGATE BLVD / W EL CAMINO AVE,1,1C        ,429,MISCELLANEOUS I RPT (ZMISC),7000,38.61124309,-121.4755625
1/2/06 22:58,NORTHGATE BLVD / RIO TIERRA AVE,1,1C        ,419,TRAFFIC-ACCIDENT INJURY,5400,38.62753184,-121.4755302
1/3/06 18:46,14TH AVE / MARTIN LUTHER KING JR BLVD,6,6A        ,1042,NARCOTICS SUSP/EVID/ACT- I RPT,7000,38.53952458,-121.4647617
1/3/06 20:07,5218 MARTIN LUTHER KING JR BLVD,6,6A        ,1081,PROTECTIVE CUSTODY-I RPT,7000,38.5272962,-121.4652065
1/4/06 6:25,COSUMNES RIVER BLVD / BRUCEVILLE RD,5,5C        ,1639,TRAFFIC-ACCIDENT-NON INJURY,5400,38.45768744,-121.4176675
1/4/06 16:15,FRANKLIN BLVD / COSUMNES RIVER BLVD,5,5B        ,1634,TOWED/STORED VEH-14602.6,7000,38.46223959,-121.4466002
1/5/06 2:19,MARTIN LUTHER KING JR BLVD / 14TH AVE,6,6A        ,1042,1203.2 PC VIOLATION OF PROBATI,5012,38.53952458,-121.4647617
1/5/06 6:23,W EL CAMINO AVE / NORTHGATE BLVD,1,1C        ,429,4140 BP  POSS OF HYPO NEEDLE,3599,38.61124309,-121.4755625
1/5/06 6:55,FLORIN PERKINS RD / UNSWORTH AVE,6,6C        ,1411,TRAFFIC-ACCIDENT INJURY,5400,38.52149079,-121.3907012
1/5/06 12:54,BROADWAY / MARTIN LUTHER KING JR BLVD,6,6A        ,1014,TOWED/STORED VEH-14602.6,7000,38.54718283,-121.4647218
1/5/06 15:18,3224 MARTIN LUTHER KING JR BLVD,6,6A        ,1031,TOWED/STORED VEH-14602.6,7000,38.5447238,-121.4650094
1/5/06 17:59,W EL CAMINO AVE / GATEWAY OAKS DR,1,1B        ,421,TOWED/STORED VEH-14602.6,7000,38.61437028,-121.5171426
1/5/06 19:56,COLLEGE TOWN DR / LA RIVIERA DR,3,3C        ,888,TRAFFIC-ACCIDENT-NON INJURY,5400,38.55529508,-121.4106183
1/5/06 22:31,W EL CAMINO AVE / NORTHGATE BLVD,1,1C        ,429,11364 HS POSS DRUG PARAPH,3599,38.61124309,-121.4755625
1/6/06 3:15,MARYSVILLE BLVD / RIO LINDA BLVD,2,2A        ,204,23152(A) VC DUI-ALCOHOL/DRUGS,5404,38.66207022,-121.4477449
1/6/06 10:30,W EL CAMINO AVE / GATEWAY OAKS DR,1,1B        ,421,459 PC  BURGLARY VEHICLE,2299,38.61437028,-121.5171426
1/6/06 16:00,3510 MARTIN LUTHER KING JR BLVD,6,6A        ,1041,484 PETTY THEFT/LICENSE PLATE,2399,38.54276486,-121.4653703
1/6/06 21:50,ALPINE FROST DR / COVEY CREEK WAY,5,5C        ,1637,20002(A) HIT/RUN,5401,38.46009782,-121.4197397
1/7/06 7:33,MARTIN LUTHER KING JR BLVD / 12TH AVE,6,6A        ,1042,12500(A) VC UNLIC DRIVER,5499,38.54119139,-121.4647667
1/7/06 11:10,MARYSVILLE BLVD / RIO LINDA BLVD,2,2A        ,204,23153(A) VC DUI CAUSE INJURY,5404,38.66207022,-121.4477449
1/7/06 14:00,4315 MARTIN LUTHER KING JR BLVD,6,6A        ,1062,459 PC  BURGLARY BUSINESS,2203,38.53580637,-121.4623494
1/8/06 1:57,65TH STREET EXPY / LEMON HILL AVE,6,6B        ,1403,23152(A) VC DUI-ALCOHOL/DRUGS,5404,38.51599171,-121.4274605
1/8/06 5:32,COSUMNES RIVER BLVD / CENTER PKWY,5,5C        ,1645,TOWED/STORED VEHICLE,7000,38.45814178,-121.428371
1/8/06 10:03,W EL CAMINO AVE / GATEWAY OAKS DR,1,1B        ,421,TRAFFIC-ACCIDENT-NON INJURY,5400,38.61437028,-121.5171426
1/8/06 12:58,MARTIN LUTHER KING JR BLVD / 23RD AVE,6,6A        ,1081,23153(A) VC DUI CAUSE INJURY,5404,38.53182194,-121.4647041

But why did they fail? Let’s open the log file and find out.

 LOG file opened at 04/17/17 13:43:03

Field Definitions for table T_SACRAMENTO_2006
  Record format DELIMITED, delimited by 0D
  Data in file has same endianness as the platform
  Rows with all null fields are accepted

  Fields in Data Source: 

    CDATETIME                       CHAR (14)
      Date datatype DATE, date mask MM/DD/YY HH24:MI
      Terminated by ","
      Trim whitespace same as SQL Loader
    ADDRESS                         CHAR (255)
      Terminated by ","
      Trim whitespace same as SQL Loader
    DISTRICT                        CHAR (255)
      Terminated by ","
      Trim whitespace same as SQL Loader
    BEAT                            CHAR (255)
      Terminated by ","
      Trim whitespace same as SQL Loader
    GRID                            CHAR (255)
      Terminated by ","
      Trim whitespace same as SQL Loader
    CRIMEDESCR                      CHAR (255)
      Terminated by ","
      Trim whitespace same as SQL Loader
    UCR_NCIC_CODE                   CHAR (255)
      Terminated by ","
      Trim whitespace same as SQL Loader
    LATITUDE                        CHAR (255)
      Terminated by ","
      Trim whitespace same as SQL Loader
    LONGITUDE                       CHAR (255)
      Terminated by ","
      Trim whitespace same as SQL Loader
    REC_NUM                         RECNUM
      Terminated by ","
      Trim whitespace same as SQL Loader
error processing column ADDRESS in row 57 for datafile C:\Users\Paul\Documents\Oracle\SacramentocrimeJanuary2006.csv
ORA-12899: value too large for column ADDRESS (actual: 34, maximum: 30)
error processing column ADDRESS in row 347 for datafile C:\Users\Paul\Documents\Oracle\SacramentocrimeJanuary2006.csv
ORA-12899: value too large for column ADDRESS (actual: 32, maximum: 30)
error processing column ADDRESS in row 387 for datafile C:\Users\Paul\Documents\Oracle\SacramentocrimeJanuary2006.csv
ORA-12899: value too large for column ADDRESS (actual: 31, maximum: 30)
error processing column ADDRESS in row 601 for datafile C:\Users\Paul\Documents\Oracle\SacramentocrimeJanuary2006.csv
ORA-12899: value too large for column ADDRESS (actual: 37, maximum: 30)
error processing column ADDRESS in row 621 for datafile C:\Users\Paul\Documents\Oracle\SacramentocrimeJanuary2006.csv
ORA-12899: value too large for column ADDRESS (actual: 31, maximum: 30)
error processing column ADDRESS in row 691 for datafile C:\Users\Paul\Documents\Oracle\SacramentocrimeJanuary2006.csv
ORA-12899: value too large for column ADDRESS (actual: 35, maximum: 30)
error processing column ADDRESS in row 817 for datafile C:\Users\Paul\Documents\Oracle\SacramentocrimeJanuary2006.csv
ORA-12899: value too large for column ADDRESS (actual: 35, maximum: 30)
error processing column ADDRESS in row 932 for datafile C:\Users\Paul\Documents\Oracle\SacramentocrimeJanuary2006.csv
ORA-12899: value too large for column ADDRESS (actual: 37, maximum: 30)
error processing column ADDRESS in row 944 for datafile C:\Users\Paul\Documents\Oracle\SacramentocrimeJanuary2006.csv
ORA-12899: value too large for column ADDRESS (actual: 32, maximum: 30)

It appears to be a problem with the ADDRESS field not being long enough. Checking through, I can see it needs to be at least 40 characters, so let’s increase the field to, say, 50 characters (no harm in going over to make it more future-proof).

We’ll need to drop and re-create the External Table to remount the file with the new definition.

drop table T_SACRAMENTO_2006;

Whilst the External Table is dropped, I’ll delete the BAD and LOG files; they’ll then be recreated afresh next time.

CREATE TABLE T_SACRAMENTO_2006 (
  [snip...]
  ADDRESS       VARCHAR2(50),
  [snip...]
;
select * from T_SACRAMENTO_2006;

When I check the BAD and LOG files again, there’s no BAD file, which means our External Table definition is now fine. Great!
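As a belt-and-braces check, we can also confirm the previously rejected rows now load – these were the rows the old 30-character ADDRESS limit was tripping over:

```sql
-- The rejected rows all had addresses longer than 30 characters
SELECT rec_num, address, LENGTH(address) AS addr_len
FROM   t_sacramento_2006
WHERE  LENGTH(address) > 30
ORDER  BY addr_len DESC;
```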

Powerful Stuff

OK, so finally, as a demonstration of the power of External Tables, let’s query the crime figures file to find out some information that certainly wouldn’t be obvious without being able to use SQL on it.

In district 3, during January to March, in which hour of the day did a burglary happen most frequently?

Note : I looked up the National Crime Information Center Codes (UCR_NCIC_CODE) for burglaries, which appear in the 2201-2299 range.

select to_char(CDATETIME, 'HH24') HR, count(*) CNT
from T_SACRAMENTO_2006
where DISTRICT = 3 and
      CDATETIME >= date '2006-01-01' and CDATETIME < date '2006-04-01' and
      UCR_NCIC_CODE in (2201, 2202, 2203, 2204, 2205, 2206, 2207, 2299)
group by to_char(CDATETIME, 'HH24')
order by 2 desc;

HR        CNT
-- ----------
00         24
18         20
20         19
19         16
22         14
21         13
12         13
02         13
23         13
17         11
07         10
15         10
01          7
04          6
10          6
03          5
13          5
08          5
09          5
06          5
14          5
11          4
05          4
16          3

24 rows selected.

NOTE: That query took 100 milliseconds to run, which is impressive given it came from a 775KB file with over 7500 records in it.

So the hour with the most burglaries was the midnight hour (12am up to 1am), and 4pm (up to 5pm) was the hour with the fewest.
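And because an External Table behaves like any other table in a query, you can just as easily materialise the file into an ordinary heap table – here’s a sketch (the target table name is my own invention):

```sql
-- Copy the file's contents into a regular table, so it can be
-- indexed and queried without re-reading the file each time
CREATE TABLE SACRAMENTO_2006_CRIMES AS
  SELECT * FROM T_SACRAMENTO_2006;
```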

In my next part on Oracle External Tables I’ll explore some more advanced External Table functionality….
