Topic
10 replies Latest Post - ‏2012-10-29T19:39:10Z by SystemAdmin
Scruffylg
Scruffylg
18 Posts
ACCEPTED ANSWER

Pinned topic Expanding a file with variables (and data) for N most recent observations

‏2012-10-12T20:27:40Z |
I'm trying to accomplish something that is analogous to the following:

I have a list of (unique) Student ID's in one SPSS data file. In another file I have a list of student test scores over time, along with the dates of test administration. For each student record in the first file, I'd like to append the values representing the N most recent test dates and the corresponding N test scores. Both the dates and scores will be unique to to the student. Missing values need to be indicated as appropriate.

What the best way to perform this kind of data manipulation? Feel free to refer me to a web article or another post if something already exists.

Here's an example using N = 2 most recent scores

2nd file:

Student  Date   Score 1     1/01/12   85 1     2/02/12  100 1     3/03/12   90 2     4/04/12   65


I'd like the first file to look like the following:

Student  Date1  Score1   Date2  Score2 1    1/01/12   85    2/02/12   100 2    4/04/12   65    Missing Missing 3    Missing Missing Missing Missing

(In reality, this first file has lots of other variables in it as well, so I'd like to do an append operation with the variables if I can.)

Thanks.
Updated on 2012-10-29T19:39:10Z at 2012-10-29T19:39:10Z by SystemAdmin
  • SystemAdmin
    SystemAdmin
    445 Posts
    ACCEPTED ANSWER

    Re: Expanding a file with variables (and data) for N most recent observations

    ‏2012-10-16T17:26:34Z  in response to Scruffylg
    In outline,
    1) use ADD CASES to bring all the secondary dataset records into the first one
    2) use SHIFT VALUES to bring some lead values into the current case
    3) use XSAVE with a test on id to select the cases to write.

    HTH,
    Jon Peck
    • Scruffylg
      Scruffylg
      18 Posts
      ACCEPTED ANSWER

      Re: Expanding a file with variables (and data) for N most recent observations

      ‏2012-10-16T20:04:25Z  in response to SystemAdmin
      Thanks, Jon.

      I'll take a look at those commands and respond if I have any additional questions.
      • Scruffylg
        Scruffylg
        18 Posts
        ACCEPTED ANSWER

        Re: Expanding a file with variables (and data) for N most recent observations

        ‏2012-10-17T19:50:49Z  in response to Scruffylg
        Hi Jon--

        I think I've solved my problem, but I'd like to ask some questions about what you recommended...so I can advance my SPSS knowledge.

        1) Is there a particular reason you recommended that ADD CASES be done as the first, as opposed to last, operation to solve this problem? The first file has unique IDs, the second file does not. So it seems intuitive to me that one should do all the transformations on the "Score" data to generate a file that contains scores associated with each unique Student ID....and then merge the variables into one file. I'm not disagreeing with your recommendation...I just want to understand how you think about these things.

        2) How does XSAVE allow you save some cases and not others? I didn't find the SPSS documentation at all helpful in explaining out how to use it in this manner. Is there an example you an point me to?

        3) I appreciate your highlighting the SHIFT VALUES command; its really helpful for this problem. I discovered that if I combine it with the SPLIT FILE command, then I can get the shift to work only "within" each Student ID. You didn't explicity recommend that I use SPLIT FILE, so I was wondering...is there a way to do this without SPLIT FILE? If so, I'd be curious to know what how, because I couldn't figure out any other way to prevent a score from one Student ID being "shifted" into a different student ID without SPLIT FILE.

        Thanks for your help.
        • SystemAdmin
          SystemAdmin
          445 Posts
          ACCEPTED ANSWER

          Re: Expanding a file with variables (and data) for N most recent observations

          ‏2012-10-18T01:02:02Z  in response to Scruffylg
          1) I was assuming that the test score file had the student id on each record, so merging first would give you a contiguous block with everything for each student.

          2) XSAVE, unlike SAVE, is a transformation and is applied case by case just like, say, COMPUTE. So you would put it inside a DO IF block that controls what is saved.

          3) SHIFT VALUES was designed to work well with SPLIT FILES. It is common to want to work with blocks of cases, and the LAG function does not respect split boundaries. If you used LAG, you would test whether the lagged id matched the current id to determine whether to use the lagged value. SHIFT VALUES is actually a procedure, not a transformation, so it is subject to split file behavior.

          Glad you have now figured this out and earned your SPSS syntax merit badge. :-)

          Regards,
          Jon
  • Scruffylg
    Scruffylg
    18 Posts
    ACCEPTED ANSWER

    Re: Expanding a file with variables (and data) for N most recent observations

    ‏2012-10-18T17:53:52Z  in response to Scruffylg
    I worked on the second file (the one with multiple instances of each ID). First, I sorted the file by ID(A) and Date(descending). I used SPLIT FILE by ID command to restrict the "shifting" to being within each ID, and then I used the SHIFT VALUES command to accomplish the shift and place those values in new variables. To get the N most recent observations, you need to do (N-1) shifts. The shifts will result in the first instance of each ID containing all of the required data, so you need to remove all the duplicate IDs (and keep the first one).
    • Scruffylg
      Scruffylg
      18 Posts
      ACCEPTED ANSWER

      Re: Expanding a file with variables (and data) for N most recent observations

      ‏2012-10-18T17:56:47Z  in response to Scruffylg
      Jon...The above post is not in response to your comments...I simply marked the question as answered and documented what I did...I didn't know it would appear as a post in the conversation.
      • Scruffylg
        Scruffylg
        18 Posts
        ACCEPTED ANSWER

        Re: Expanding a file with variables (and data) for N most recent observations

        ‏2012-10-29T06:34:04Z  in response to Scruffylg
        Jon--

        I declared victory prematurely on this problem. It has turned out to be more complicated than I originally described, and I'm having trouble figuring out how to modify what I've done to address the complication.

        In the original problem, I described a set of students and a desire to capture the N most recent test scores for each student. I accomplished this using SPLIT FILE and SHIFT VALUES.

        Well, imagine the problem is a bit more complicated--each student may have taken the course multiple times, so each Student ID can be associated with multiple times through the course, each characterized with a unique begin date and end date. I still have a single list of test scores/dates and the ID of the corresponding student. But now I want to come up with the N most recent test scores for each unique combination of student and their iteration through the course.

        Instead of doing a SPLIT FILE on the student ID alone, I now need to do a SPLIT FILE on the combination of Student ID and beginning/ending dates of a particular iteration through the course.

        The problem I have is how to associate the test scores/dates, not just with the student name, but also with the relevant begin/end date associate with that student's iteration in the course.

        Here's sample data
        
        File 1: Student Tenure in Course Student    Start        End 1        8/30/2010   6/23/2011 1        9/05/2011   6/30/2012 2        9/15/2008   7/01/2009   File 2: Score Data Student  Date   Score 1     10/01/11   85 1     02/02/12  100 1     11/15/10   95 1 02/03/11   90 2     04/04/09   65
        


        Is there any way I can use File 1 as a "lookup" and associate with File 2 the relevant start and end dates, for each student, that "bookend" the score in question.

        
        File 3 Student   Date   Score     Start        End 1     10/01/11   85    9/05/2011   6/30/2012 1     02/02/12  100    9/05/2011   6/30/2012 1     11/15/10   95    8/30/2010   6/23/2011 1     02/03/11   90    8/30/2010   6/23/2011 2     04/04/09   65    9/15/2008   7/01/2009
        


        If I could get the file to look like this, then I could use the same approach and do a SPLIT FILE on the the Student ID, Start, and End Dates...then use SHIFT VALUES again.

        I'd like to be able to solve this problem using syntax, if possible.

        Any suggestions?
        • SystemAdmin
          SystemAdmin
          445 Posts
          ACCEPTED ANSWER

          Re: Expanding a file with variables (and data) for N most recent observations

          ‏2012-10-29T14:10:14Z  in response to Scruffylg
          Here is a solution using programmability. Besides the Python Essentials, you will need the extendedTransforms.py module and the SPSSINC TRANS extension command from the SPSS Community site. extendedTransforms.py is in the Python Modules collection and SPSSINC TRANS is in the Extension Commands collection. You will need the latest version of extendedTransforms.py. There is doubtless another way to do this, but this seems easiest at this time in the morning. The SPSSINC TRANS command has to be used twice, because the lookup function returns only one value per case.

          Note that the lookup dataset must be sorted by student and start.
          
          data list list /student(f5.0) start(adate9) end(adate9). begin data. 1        8/30/2010   6/23/2011 1        9/05/2011   6/30/2012 2        9/15/2008   7/01/2009 end data. dataset name file1. SORT CASES BY student(A) start(A).   data list list/student(F5.0) date(adate9) score(f8.0). begin data. 1     10/01/11   85 1     02/02/12  100 1     11/15/10   95 1     02/03/11   90 2     04/04/09   65 end data. dataset name file2.   dataset activate file2. SPSSINC TRANS result = start /initial 
          "extendedTransforms.vlookupgroupinterval('file1', 'student', 'start', 'end', 'start')" /formula 
          "func(student, date)". SPSSINC TRANS result = end /initial 
          "extendedTransforms.vlookupgroupinterval('file1', 'student', 'start', 'end', 'end')" /formula 
          "func(student, date)". format start(adate9) end(adate9).
          
          • Scruffylg
            Scruffylg
            18 Posts
            ACCEPTED ANSWER

            Re: Expanding a file with variables (and data) for N most recent observations

            ‏2012-10-29T19:29:19Z  in response to SystemAdmin
            Thanks, Jon.

            I got it running on my machine...it is indeed very slick.

            I noticed that the last line of the code uses a "format" command. Two questions:

            1) Why does it work? Isn't the correct SPSS Syntax spelled "formats" with an "s"?
            2) I've been using ALTER TYPE every time I need to make a variable a date or a variable's format. What's the difference between ALTER TYPE and FORMAT(s)?
            • SystemAdmin
              SystemAdmin
              445 Posts
              ACCEPTED ANSWER

              Re: Expanding a file with variables (and data) for N most recent observations

              ‏2012-10-29T19:39:10Z  in response to Scruffylg
              I'm glad that this is working for you.

              As for the format questions, older commands in Statistics can be abbreviated, so FORMAT works. We stopped allowing that for new commands some years ago, because a new command might be created that would be confused with some abbreviation of an older command. So it is good practice to write out the full name.

              As for why it works, date variables in Statistics are actually of type numeric. The date is just a format. It has no effect on the values. ALTER TYPE is mainly used when changing string length or converting between numeric and string. In those cases, values can change.

              The Data Editor confuses the issue by referring to some formats as types, but there are actually only two variable types in Statistics: numeric and string. Everything else is just a format on top of one of those types.

              Regards,
              Jon