Topic
  • 2 replies
  • Latest Post - ‏2013-08-07T21:04:02Z by anost
anost
anost
13 Posts

Pinned topic find duplicates between xml cells

‏2013-08-05T21:13:11Z |

Hi,

I need to check a table if there exists duplicates between the xml cells (and delete them). Up to now I use xmltable. I want to ask if there is a "better" solution available. My table has around 10 Mio. rows, so maybe there is a faster,  more pure ;) xml query solution ...

I have a table with a ID column and a XML column + some more columns. Currently I use a xmltable like this example, assuming I have just 2 nodes:

DELETE FROM SCHEMA.TABLE WHERE NOT ID IN
(SELECT MAX(ID) FROM SCHEMA.TABLE o,
xmltable('$c/xmlpath' passing o.data as "c"
COLUMNS
SensoID INTEGER path 'SensoID',
Sequence_NR INTEGER path 'Sequence_NR'
) AS x
GROUP BY SensoID,Sequence_NR)

Thanks, Andreas
 

  • MatthiasNicola
    MatthiasNicola
    321 Posts

    Re: find duplicates between xml cells

    ‏2013-08-07T04:57:09Z  

    Hi Andreas,

    interesting question. For what it's worth, I have now stared at this for 10 minutes and I can't come up with a better idea. I think this is a pretty efficient way to code this. You can't avoid looking into each XML document at least once.

    - Matthias

  • anost
    anost
    13 Posts

    Re: find duplicates between xml cells

    ‏2013-08-07T21:04:02Z  

    Hi Andreas,

    interesting question. For what it's worth, I have now stared at this for 10 minutes and I can't come up with a better idea. I think this is a pretty efficient way to code this. You can't avoid looking into each XML document at least once.

    - Matthias

    Hi Matthias,

    I stopped the above query (which had an select count instead of delete) after runnning more than 36 hours. Now I concatenate the xml values into one varchar and use a distinct on it:

    SELECT count(distinct(xmlcast(xmlquery('$b/Block0/DendroRawID' passing data as "b") as varchar(128)) ||
    xmlcast(xmlquery('$b/Block0/DendroRawNamen' passing data as "b") as varchar(16)) ||
    xmlcast(xmlquery('$b/Block0/DendroRawtimestamp' passing data as "b") as varchar(128)) ||
    xmlcast(xmlquery('$b/Block0/DendroRawDate' passing data as "b") as varchar(128)) ||
    xmlcast(xmlquery('$b/Block0/DendroRawtime' passing data as "b") as varchar(128)) ||
    xmlcast(xmlquery('$b/Block0/DendroRawSensoID' passing data as "b") as varchar(16)) ||
    xmlcast(xmlquery('$b/Block0/DendroRawSequence_NR' passing data as "b") as varchar(128)) ||
    xmlcast(xmlquery('$b/Block0/DendroRawMeasurement' passing data as "b") as varchar(16))
    )) FROM SCHEMA.TABLE
    

    and it took around 23 minutes!

    At first I tried to use this varchar in a "group by" but unfortunately xmlquery is not supported here. So I will create a temp table and fill it with the distinct select above and some delimiters. Afterward I will rebuild the xml and fill my table. I think it should work.

    Unfortunataly my code was put in one line after saving the textbox. Do you know if there is a way to keep the structuring? Is it possible to have a colored code formating?

    Best,

    Andreas

    Updated on 2013-08-07T21:08:46Z at 2013-08-07T21:08:46Z by anost