Lemke IT Homepage

Special June 2016 - Token splitting with getpos and substr
Category : User
Level : Advanced

Back to Overview

Welcome to the june web special.

We will discover the advanced usage of two provided SQL string functions substr and getpos. To follow this, you should have already set up a cego database system with an online tableset. If you are still not ready for this, please refer to the cego documentation Getting Started and follow the instructions for setting up a database.

To illustrate the possibilities of the powerful substr and getpos string function, we will tokenize a retrieved string pattern into several tokens.

For this, we use the testing table called t1. The table can be created and filled with the following commands

lit > create table t1 ( fullname string(100));
Table t1 created
ok ( 0.037 s )
lit > insert into t1 values ( 'Donald Duck');
1 tuples inserted
ok ( 0.037 s )
lit > insert into t1 values ( 'Hugo Habicht');
1 tuples inserted
ok ( 0.037 s )

Now the table actually contains two fullname entries, which can be retrieved with a simple select query.

lit > select fullname from t1;
+-----------------------------------------------------------------------------------------------------+
|                                                                                                  t1 |
|                                                                                            fullname |
+-----------------------------------------------------------------------------------------------------+
|                                                                                         Donald Duck |
|                                                                                        Hugo Habicht |
+-----------------------------------------------------------------------------------------------------+
2 tuples
ok ( 0.073 s )

The getpos function can be used to return the position of a contained whitespace character. To get the position for the first occurence we use the following query

lit > select getpos(fullname, ' ',1) from t1;
+---------------------------+
|                      FUNC |
| getpos(t1.fullname,' ',1) |
+---------------------------+
|                         7 |
|                         5 |
+---------------------------+
2 tuples
ok ( 0.073 s )

Now we want to get the firstname and lastname as separated column results. For this, we have to split the fullname into two tokens using the whitespace character as separator. As shown above, the getpos function can return the first occurence of a whitespace. This position can be used inside the substr to retrieve the substring from the beginnung to first whitespace.

lit > select substr(fullname, 1, getpos(fullname, ' ',1)) from t1;
+-------------------------------------------------+
|                                            FUNC |
| substr(t1.fullname,1,getpos(t1.fullname,' ',1)) |
+-------------------------------------------------+
|                                         Donald  |
|                                           Hugo  |
+-------------------------------------------------+
2 tuples
ok ( 0.074 s )

In a second step, we use the position as a starting point for substr. Called with just two arguments, substr returns all characters from the starting position to the end

lit > select substr(fullname, getpos(fullname, ' ', 1)) from t1;
+-----------------------------------------------+
|                                          FUNC |
| substr(t1.fullname,getpos(t1.fullname,' ',1)) |
+-----------------------------------------------+
|                                          Duck |
|                                       Habicht |
+-----------------------------------------------+
2 tuples

The usage of getpos as show above is restricted to string values with two separated token. A more general approach is discussed now in the following.
To provide more complex strings, we change our test table with the following queries.

lit > delete from t1;
2 tuples deleted
ok ( 0.037 s )
lit > insert into t1 values ('this is a text with some words');
1 tuples inserted
ok ( 0.037 s )

To handle strings with more tokens, we have to to use getpos in a more advanced manner. An approach is given in the following where x defines the position of the corresponding token. The tokens can be filtered with the expression

    substr(attr, getpos(fullname, ' ', 1,x) + 1, getpos(fullname, ' ', 1,x+1) - getpos(fullname, ' ', 1, x)-1)

Using getpos with four parameters enables the function to return the x occurence of the whitespace separator. First we select the input string with a plain query

lit > select fullname from t1;
+-----------------------------------------------------------------------------------------------------+
|                                                                                                  t1 |
|                                                                                            fullname |
+-----------------------------------------------------------------------------------------------------+
|                                                                      this is a text with some words |
+-----------------------------------------------------------------------------------------------------+
1 tuples
ok ( 0.075 s )

To indicate any trailing white space characters, we enclose the result string with square brackets. So using the above expression, we can retrieve the second token with

lit > select '[' | substr(fullname, getpos(fullname, ' ', 1, 1) + 1, getpos(fullname, ' ', 1,2) - getpos(fullname, ' ', 1, 1)-1) | ']'  from t1;
+-------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                          EXPR |
| '[' | substr(t1.fullname,getpos(t1.fullname,' ',1,1) + 1,getpos(t1.fullname,' ',1,2) - getpos(t1.fullname,' ',1,1) - 1) | ']' |
+-------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                          [is] |
+-------------------------------------------------------------------------------------------------------------------------------+
1 tuples
ok ( 0.075 s )

And the fifth token with

	  
lit > select '[' | substr(fullname, getpos(fullname, ' ', 1, 4) + 1, getpos(fullname, ' ', 1,5) - getpos(fullname, ' ', 1, 4)-1) | ']'  from t1;
+-------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                          EXPR |
| '[' | substr(t1.fullname,getpos(t1.fullname,' ',1,4) + 1,getpos(t1.fullname,' ',1,5) - getpos(t1.fullname,' ',1,4) - 1) | ']' |
+-------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                        [with] |
+-------------------------------------------------------------------------------------------------------------------------------+
1 tuples
ok ( 0.075 s )

If you want to use the tokenizing expression more comfortable, it might be useful to encapsulate the expression into a small procedure. For this, we create the stored procedure gettoken

drop if exists procedure gettoken;
@
create procedure gettoken(pattern in string(100), pos in int) return string(100)
begin
   var result string(100);
   :result =  substr(:pattern, getpos(:pattern, ' ', 1, :pos) + 1, getpos(:pattern, ' ', 1,:pos+1) - getpos(:pattern, ' ', 1, :pos)-1);
   return :result;
end;
@

Now the procedure can be called with appropriate arguments. Please note, that the first token is retrieved with pos = 0, the second with pos =1 ans so on.

lit > select gettoken(fullname, 2) from t1;
+-----------------------------------------------------------------------------------------------------+
|                                                                                                FUNC |
|                                                                             gettoken(t1.fullname,2) |
+-----------------------------------------------------------------------------------------------------+
|                                                                                                   a |
+-----------------------------------------------------------------------------------------------------+
1 tuples
ok ( 0.073 s )
lit > select gettoken(fullname, 3) from t1;
+-----------------------------------------------------------------------------------------------------+
|                                                                                                FUNC |
|                                                                             gettoken(t1.fullname,3) |
+-----------------------------------------------------------------------------------------------------+
|                                                                                                text |
+-----------------------------------------------------------------------------------------------------+
1 tuples

Have fun !