Topic
  • 3 replies
  • Latest Post - ‏2013-09-09T12:44:14Z by JonPeck
redrose888
100 Posts

question about Python

2013-09-05T14:01:58Z

Dear all,

I have a question:

My file has 20 million cases. I ran the following syntax:

BEGIN PROGRAM PYTHON.
import spss

# open file
spss.Submit(r'''
GET FILE = 'myfile'
  /KEEP = varlist.
''')

cur = spss.Cursor(accessType='r')

# fetch the first 10 cases and print each value with its type
for i in xrange(10):
    row = cur.fetchone()
    for x in row:
        print type(x), x

cur.close()

END PROGRAM.

If I run the same program directly in Python, it finishes very quickly,

but if I run it inside SPSS, it reads all 20 million cases and is very slow!

Does anyone have an idea?

 

thanks!

 

  • JonPeck
    269 Posts

    Re: question about Python

    ‏2013-09-06T14:00:12Z  

    Please explain further what you mean by "in Python directly" versus "in SPSS". Are you referring to external vs. internal mode? Which version(s) of Statistics are you using? V22 has a new case-passing method for the Cursor class that is significantly faster but has to do an entire data pass. Setting isBinary to False uses the old method. So I wonder whether you have two Statistics installations and are actually using the old or the new implementation depending on internal vs. external mode.

    The optional Boolean argument isBinary (introduced in version 22) specifies the method that is used by
    the Cursor class to work with the data in the active dataset. It has no effect on Cursor functionality. By
    default isBinary is set to True, which typically provides the best performance but might require more
    temporary disk space. When isBinary is set to False, the Cursor class uses the same method for
    working with the data as in versions before version 22.
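
For readers on version 22 or later, opting out of the binary method might look roughly like the sketch below. The helper name `first_rows_legacy` is hypothetical, and the `spss` import sits inside the function only so that the file can be loaded where the plug-in is not installed. Since `isBinary` was introduced in version 22, this should not be expected to work in V19 or V21.

```python
def first_rows_legacy(n):
    """Return up to n cases, asking the Cursor for the pre-22 method.

    Hypothetical helper for illustration; assumes Statistics 22 or later.
    """
    import spss  # only available with the Statistics Python plug-in

    # isBinary=False requests the case-passing method used before version 22.
    cur = spss.Cursor(accessType='r', isBinary=False)
    rows = []
    try:
        for _ in range(n):
            row = cur.fetchone()
            if row is None:  # fewer than n cases in the active dataset
                break
            rows.append(row)
    finally:
        cur.close()  # always release the cursor, even on error
    return rows
```

Calling `first_rows_legacy(10)` after the GET FILE command would return a list of at most ten cases. Whether it is actually faster for a 20-million-case file is something to measure, not assume.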

  • redrose888
    100 Posts

    Re: question about Python

    ‏2013-09-09T06:40:18Z  

    Hello Jon,

    I am using V21, but I also tested in V19.

    By "Python directly" I meant that I save the code as a *.py file and run it in the Python GUI.

    Regards,

  • JonPeck
    269 Posts

    Re: question about Python

    ‏2013-09-09T12:44:14Z  

    I don't know of any reason why there would be a difference in case passing between internal and external modes.  How do you know that the code is passing all the cases in internal mode?
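
One way to check where the time goes is to time the individual cursor steps separately. The sketch below is only an illustration: `time_cursor_steps` and `make_cursor` are hypothetical names, and the function works with any object that offers `fetchone()` and `close()`.

```python
import time


def time_cursor_steps(make_cursor, n=10):
    """Time opening a cursor, fetching up to n cases, and closing it.

    make_cursor is any zero-argument callable returning an object with
    fetchone() and close(), for example lambda: spss.Cursor(accessType='r').
    """
    t0 = time.time()
    cur = make_cursor()
    t1 = time.time()
    fetched = 0
    try:
        for _ in range(n):
            if cur.fetchone() is None:  # ran out of cases early
                break
            fetched += 1
        t2 = time.time()
    finally:
        cur.close()
    t3 = time.time()
    return {
        'open_seconds': t1 - t0,
        'fetch_seconds': t2 - t1,
        'close_seconds': t3 - t2,
        'cases': fetched,
    }
```

Running `time_cursor_steps(lambda: spss.Cursor(accessType='r'))` in both internal and external mode and comparing the three timings would show whether the delay sits in opening the cursor, in the fetches, or in closing it.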

    Procedures use essentially the same data source that the Python plugin uses in order to pass the cases, and procedures normally do not stop part way through the data, so I could imagine that something like this could happen, but I can't think of a reason why this would be different in the two modes.

    One thing that comes to mind is that on the first data pass, Statistics may be realizing the data and has to complete this, but this wouldn't have anything directly to do with Python.

    If you know that you just want the first few cases, you might try using the Dataset class instead of the Cursor class.  It is slower on a per case basis, but since it is derived from the Data Editor code, it is likely to have different behavior.
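
A minimal sketch of the Dataset approach, assuming the active dataset has at least n cases and no pending transformations. The helper name `first_cases_via_dataset` is hypothetical, and the `spss` import is placed inside the function only so the file can be loaded outside Statistics.

```python
def first_cases_via_dataset(n):
    """Return the first n cases of the active dataset via spss.Dataset.

    Hypothetical helper for illustration; assumes at least n cases exist.
    """
    import spss  # only available with the Statistics Python plug-in

    spss.StartDataStep()  # Dataset objects must be used inside a data step
    try:
        ds = spss.Dataset()  # no name given: refers to the active dataset
        # Index cases one at a time instead of iterating the whole file.
        rows = [ds.cases[i] for i in range(n)]
    finally:
        spss.EndDataStep()  # always close the data step
    return rows
```

Calling `first_cases_via_dataset(10)` after the GET FILE command would return the first ten cases. Whether this avoids the full data pass on a 20-million-case file would need to be tested.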