URLEscape and URLUnescape for CLOBs

URL escaping (encoding), is a method to encode arbitrary data in a Uniform Resource Identifier (URI) using only the limited US-ASCII characters legal within a URI. Certain characters are reserved in URLs, so are replaced with percent encoding, which is a % + two hex digits which represent the character’s ASCII code. For multibyte characters such as extended characters in UTF-8, codepoints are encoded as multiple escaped values – one for each octet. You can read more about URL encoding here or see the standard’s document relating to it : RFC 3986.

Oracle provides package functions for escaping (encoding) URLs using UTL_URL.ESCAPE and unescaping (decoding) using UTL_URL.UNESCAPE, however, these only support VARCHAR2 inputs up to 4000 bytes. I needed equivalents for CLOB inputs, so I wrote functions which leveraged the UTL_URL functions. Here’s what I came up with. Note the input parameters to both functions match the UTL_URL equivalents.

URLEscape

create or replace function URLEscape(url                   in clob
                                   , escape_reserved_chars in boolean default false
                                   , url_charset           in varchar2 default utl_http.get_body_charset) return clob as
  MAX_CHUNK_LEN constant number := 333; -- Max UTL_URL.Escape URL = 4000 bytes. Max multibyte char = 4 bytes. One escape char = 3 bytes. Max chunk len = 4000 / (4 * 3) = 333.33
  vOffset   number := 1;
  vURLLen   number := DBMS_LOB.Getlength(url);
  vResult   clob;
begin
  while vOffset <= vURLLen
  loop
    vResult := vResult || UTL_URL.Escape(DBMS_LOB.Substr(url, MAX_CHUNK_LEN, vOffset), escape_reserved_chars, url_charset);
    vOffset := vOffset + MAX_CHUNK_LEN;
  end loop;
  return vResult;
end;
/

URLUnescape

create or replace function URLUnescape(url         in clob
                                     , url_charset in varchar2 default utl_http.get_body_charset
                                      ) return clob as
  MAX_CHUNK_LEN constant number := 1000; -- 1000 chars max bytes = 4000 bytes
  URL_PARTIAL_MULTIBYTE exception;
  pragma exception_init (URL_PARTIAL_MULTIBYTE, -29275);
  PARTIAL_MULTIBYTE exception;
  pragma exception_init (PARTIAL_MULTIBYTE, -64204);
  vOffset   number := 1;
  vLen      number := DBMS_LOB.Getlength(url);
  vChunk    varchar2(32767);  
  vLastChunk varchar2(32767);
  vChunkLen number;
  vResult   clob;
begin
  while vOffset <= vLen
  loop
    vChunk := dbms_lob.substr(url, MAX_CHUNK_LEN, vOffset);
    while vChunk is not null
    loop
      vChunkLen := length(vChunk);
      begin
        vResult := vResult || UTL_URL.Unescape(vChunk, url_charset);
        exit;
      exception
        when UTL_URL.bad_url             -- Chunk ends in %
          or PARTIAL_MULTIBYTE           -- Unescaped string failed concatenation. Some partial multibytes can be unescaped, but aren't valid for string ops. e.g. the first octet of a two byte UTF-8,
          or URL_PARTIAL_MULTIBYTE then  -- Chunk ends with a unescapable partial multibyte
          vLastChunk := vChunk;
          vChunk := regexp_replace(vChunk, '(%[0-9A-F]{0,2})$', null, vChunkLen - 3, 1, 'i');  -- 'Remove invalid escaped / partial escaped char, that appears in last 3 chars
          if vChunk = vLastChunk then  -- If new chunk hasn't changed, we need to exit to prevent infinite loop. This happens if text isn't escaped, so return as is 
            return url; 
          end if; 
      end;
    end loop;
    vOffset := vOffset + vChunkLen;
  end loop;
  return vResult;
end;
/

Escaping CLOBs is a simple case of chunking and calling UTL_URL.ESCAPE for each chunk. Unescaping CLOBs is far more complicated, because you have to process in 1000 char chunks and the end of that chunk could be part of an escaped multibyte character, so I had to handle those situations. For each chunk, this is done by attempting to unescape and concatenating and trapping any expected exception, and using REGEX to remove the offending escaped char and trying again.

Most modern Oracle setups use a UTF-8 character (AL32UTF8) so to call pass ‘UTF-8’ as the url_charset,

e.g. select URLUnescape(<some escaped URL>, 'UTF-8') from dual;

Leave a comment