![]() |
Special June 2016 - Token splitting with getpos and substr |
---|
Category : User |
Level : Advanced |
Back to Overview |
Welcome to the june web special.
We will discover the advanced usage of two provided SQL string functions substr and getpos. To follow this, you should have already set up a cego database system with an online tableset. If you are still not ready for this, please refer to the cego documentation Getting Started and follow the instructions for setting up a database.
To illustrate the possibilities of the powerful substr and getpos string function, we will tokenize a retrieved string pattern into several tokens.
For this, we use the testing table called t1. The table can be created and filled with the following commands
lit > create table t1 ( fullname string(100)); Table t1 created ok ( 0.037 s ) lit > insert into t1 values ( 'Donald Duck'); 1 tuples inserted ok ( 0.037 s ) lit > insert into t1 values ( 'Hugo Habicht'); 1 tuples inserted ok ( 0.037 s ) |
Now the table actually contains two fullname entries, which can be retrieved with a simple select query.
lit > select fullname from t1; +-----------------------------------------------------------------------------------------------------+ | t1 | | fullname | +-----------------------------------------------------------------------------------------------------+ | Donald Duck | | Hugo Habicht | +-----------------------------------------------------------------------------------------------------+ 2 tuples ok ( 0.073 s ) |
The getpos function can be used to return the position of a contained whitespace character. To get the position for the first occurence we use the following query
lit > select getpos(fullname, ' ',1) from t1; +---------------------------+ | FUNC | | getpos(t1.fullname,' ',1) | +---------------------------+ | 7 | | 5 | +---------------------------+ 2 tuples ok ( 0.073 s ) |
Now we want to get the firstname and lastname as separated column results. For this, we have to split the fullname into two tokens using the whitespace character as separator. As shown above, the getpos function can return the first occurence of a whitespace. This position can be used inside the substr to retrieve the substring from the beginnung to first whitespace.
lit > select substr(fullname, 1, getpos(fullname, ' ',1)) from t1; +-------------------------------------------------+ | FUNC | | substr(t1.fullname,1,getpos(t1.fullname,' ',1)) | +-------------------------------------------------+ | Donald | | Hugo | +-------------------------------------------------+ 2 tuples ok ( 0.074 s ) |
In a second step, we use the position as a starting point for substr. Called with just two arguments, substr returns all characters from the starting position to the end
lit > select substr(fullname, getpos(fullname, ' ', 1)) from t1; +-----------------------------------------------+ | FUNC | | substr(t1.fullname,getpos(t1.fullname,' ',1)) | +-----------------------------------------------+ | Duck | | Habicht | +-----------------------------------------------+ 2 tuples |
The usage of getpos as show above is restricted to string values with two separated token.
A more general approach is discussed now in the following.
To provide more complex strings, we change our test table with the following queries.
lit > delete from t1; 2 tuples deleted ok ( 0.037 s ) lit > insert into t1 values ('this is a text with some words'); 1 tuples inserted ok ( 0.037 s ) |
To handle strings with more tokens, we have to to use getpos in a more advanced manner. An approach is given in the following where x defines the position of the corresponding token. The tokens can be filtered with the expression
substr(attr, getpos(fullname, ' ', 1,x) + 1, getpos(fullname, ' ', 1,x+1) - getpos(fullname, ' ', 1, x)-1)Using getpos with four parameters enables the function to return the x occurence of the whitespace separator. First we select the input string with a plain query
lit > select fullname from t1; +-----------------------------------------------------------------------------------------------------+ | t1 | | fullname | +-----------------------------------------------------------------------------------------------------+ | this is a text with some words | +-----------------------------------------------------------------------------------------------------+ 1 tuples ok ( 0.075 s ) |
To indicate any trailing white space characters, we enclose the result string with square brackets. So using the above expression, we can retrieve the second token with
lit > select '[' | substr(fullname, getpos(fullname, ' ', 1, 1) + 1, getpos(fullname, ' ', 1,2) - getpos(fullname, ' ', 1, 1)-1) | ']' from t1; +-------------------------------------------------------------------------------------------------------------------------------+ | EXPR | | '[' | substr(t1.fullname,getpos(t1.fullname,' ',1,1) + 1,getpos(t1.fullname,' ',1,2) - getpos(t1.fullname,' ',1,1) - 1) | ']' | +-------------------------------------------------------------------------------------------------------------------------------+ | [is] | +-------------------------------------------------------------------------------------------------------------------------------+ 1 tuples ok ( 0.075 s ) |
And the fifth token with
lit > select '[' | substr(fullname, getpos(fullname, ' ', 1, 4) + 1, getpos(fullname, ' ', 1,5) - getpos(fullname, ' ', 1, 4)-1) | ']' from t1; +-------------------------------------------------------------------------------------------------------------------------------+ | EXPR | | '[' | substr(t1.fullname,getpos(t1.fullname,' ',1,4) + 1,getpos(t1.fullname,' ',1,5) - getpos(t1.fullname,' ',1,4) - 1) | ']' | +-------------------------------------------------------------------------------------------------------------------------------+ | [with] | +-------------------------------------------------------------------------------------------------------------------------------+ 1 tuples ok ( 0.075 s ) |
If you want to use the tokenizing expression more comfortable, it might be useful to encapsulate the expression into a small procedure. For this, we create the stored procedure gettoken
drop if exists procedure gettoken; @ create procedure gettoken(pattern in string(100), pos in int) return string(100) begin var result string(100); :result = substr(:pattern, getpos(:pattern, ' ', 1, :pos) + 1, getpos(:pattern, ' ', 1,:pos+1) - getpos(:pattern, ' ', 1, :pos)-1); return :result; end; @ |
Now the procedure can be called with appropriate arguments. Please note, that the first token is retrieved with pos = 0, the second with pos =1 ans so on.
lit > select gettoken(fullname, 2) from t1; +-----------------------------------------------------------------------------------------------------+ | FUNC | | gettoken(t1.fullname,2) | +-----------------------------------------------------------------------------------------------------+ | a | +-----------------------------------------------------------------------------------------------------+ 1 tuples ok ( 0.073 s ) lit > select gettoken(fullname, 3) from t1; +-----------------------------------------------------------------------------------------------------+ | FUNC | | gettoken(t1.fullname,3) | +-----------------------------------------------------------------------------------------------------+ | text | +-----------------------------------------------------------------------------------------------------+ 1 tuples |
Have fun !