National Social Science Association


The Assessment of Science Content Knowledge of Elementary
and Middle School Teachers in a Professional Development Program
Entitled NWO-TEAMS

Emilio Duran
Jacob Burgoon
Bowling Green State University

Background
     The lack of student achievement in science has been a national and international concern for over 30 years. Academic performance, as measured by the National Assessment of Educational Progress (NAEP), steadily declined after the administration of the first assessment in 1969. The publication of A Nation at Risk in 1983 by the National Commission on Excellence in Education (NCEE) increased public awareness of the educational crisis and pushed educational reform to the top of the policy agenda (NCEE, 1983). Reform efforts based on suggestions made by the NCEE helped to increase student achievement scores during the late 1980s. The improvements, however, were short-lived, and scores reached a plateau in the early 1990s.
     Today, student achievement scores reflect only a basic understanding of science. For example, only 29% of 4th and 8th grade students and 18% of 12th grade students achieved a “proficient” score on the 2005 NAEP science assessment (Academic Competitiveness Council [ACC], 2007). Proficiency is characterized by the ability to apply knowledge to real-world situations and demonstrate analytic skills and mastery of subject-matter knowledge (Brown, 2000). Further evidence of a lack of proficiency comes from NAEP long-term assessment scores, which show that from 1977 to 1999 fewer than one-half of 17-year-olds possessed the skills to analyze scientific procedures and data (Campbell, Hombo, & Mazzo, 2000).
     One of the suggestions made by the NCEE to improve student achievement was the improvement of teacher quality (NCEE, 1983). The National Commission on Teaching and America’s Future (NCTAF, 1996) argued that “what teachers know and can do is the most important influence on what students learn.” In order for students to achieve the success expected by state governments and local school boards, they need to be taught by professionals who are fully competent in subject-matter knowledge and able to motivate, encourage, and facilitate student learning. The importance of teacher competency is exemplified by a study comparing low-achieving and high-achieving elementary schools, which found that over 90% of the variance in student achievement could be explained by differences in teacher qualifications (Armour-Thomas et al., 1989). Despite the known importance of high-quality teachers, thousands of underqualified teachers currently teach in schools throughout the nation. In fact, 40 states allow their school districts to hire teachers who have not met basic requirements (NCTAF, 1996). Furthermore, Darling-Hammond (2000) showed that teacher characteristics, such as certification status and having a degree in the subject, have a positive impact on student achievement. However, the number of fully certified science teachers declined from 1990 to 2002, and in 2000, 23% of 7th to 12th grade science teachers did not have a major or a minor in the subject they were teaching (National Science Board [NSB], 2006).
     The national emphasis on science inquiry has created many challenges for classroom teachers, who frequently lack the knowledge and skills in specific content areas to deliver the challenging instructional approaches called for by the standards (Fuhrman, 2003; see National Research Council [NRC], 1996, and American Association for the Advancement of Science [AAAS], 1993, for national standards). Inquiry science lessons are designed to give students more freedom than traditional lessons, so teachers must be prepared to answer student questions about subject-matter that may be slightly beyond the scope of the original lesson. For this reason, inquiry-based teaching requires deeper and broader subject-matter knowledge than traditional teaching (Fishman et al., 2003). Teachers who are less competent in subject-matter knowledge may actually be harmful to their students by passing on inaccurate ideas or uncritically using or inappropriately altering textbooks (Ball & McDiarmid, 1990). Furthermore, teachers who are less competent in subject-matter knowledge may have misconceptions similar to those held by their students. Research about science misconceptions held by students (Driver et al., 1994) demonstrates that students are often reluctant to give up their prior conceptions about science because they function well in the real world (Glynn, Yeany, & Britton, 1991; Smith, 1991). Teachers who possess misconceptions are not likely to be able to help their students overcome misconceptions, thus hindering their students’ ability to learn science.
     Improving teacher quality can be accomplished by professional development programs, which provide teachers with opportunities to acquire and improve professional skills such as subject matter knowledge and pedagogical knowledge. Researchers (e.g., Fishman et al., 2003; Mizell, 2003) agree that the first step in designing professional development programs should be to identify areas where students need improvement. For example, national data on student achievement reveal that students lack the ability to apply and analyze scientific data. Since this ability is improved by inquiry-based curricula and teaching methods, professional development programs should train teachers to effectively use inquiry-based methods. Currently, many professional development programs do, in fact, aim to increase teachers’ subject-matter knowledge and inquiry-based teaching skills (e.g., Lotter, 2006; Marx et al., 2004; Supovitz, Mayer, & Kahle, 2000) due to a demand for teachers able to use inquiry-based teaching methods.
     Professional development program design differs depending on the purpose of the program and the targeted population. Programs may take many forms (e.g., seminars, summer institutes, workshops) and cover a variety of topics (e.g., subject-matter, teaching methods, student learning). Reformers almost unanimously agree that programs sustained over a long period of time have a greater impact than short-term programs, such as one- or two-day workshops. For example, one professional development program that lasted for six weeks focused on content knowledge and inquiry teaching and produced long-lasting increases in teachers’ attitudes, preparation, and use of inquiry-based teaching methods (Supovitz, Mayer, & Kahle, 2000). Further, programs that provide teachers opportunities for “hands-on” work and focus on subject matter and student learning are more likely to enhance knowledge and skills (Garet et al., 2001; Kennedy, 1996).
     Evaluation of professional development programs is an important factor in long-term program success since reflection on the results of the program contributes to the program’s continuous improvement (Loucks-Horsley et al., 1998). Many science, technology, engineering, and mathematics (STEM) education programs lack scientifically rigorous evaluations that result in evidence of their effectiveness. In fact, the Department of Education’s Institute of Education Science’s What Works Clearinghouse reviewed 75 middle school mathematics programs and found that only three had well-designed experimental studies that led to strong evidence of their effectiveness (ACC, 2007). Furthermore, despite the efforts of federal and state funding agencies to provide educational programs to improve student learning, the ACC (2007) stated:

“It is unclear which programs or activities are effective in generating positive outcomes. While many ideas have been tested in small case studies, few have been evaluated at the necessary scale to prove their efficacy for a broad range of students in an array of instructional settings. Without such evidence, it is nearly impossible for educators or administrators to know which activities, curricula, or materials to use to achieve the results that our nation demands.”

The first step in evaluating the effectiveness of professional development should be to assess teacher learning because improvement in teacher knowledge and skills results in increases in student achievement (see Marx et al., 2004). However, evaluating teacher learning (specifically content knowledge) in science and math is difficult, due to the lack of effective assessment instruments (Basile et al., 2006). Most tests that assess teacher knowledge use a multiple-choice format, which is “not very useful for assessing teachers’ ability to analyze and apply knowledge” (Darling-Hammond, 2000). Open-ended items (e.g., essay items) are better than multiple-choice items at assessing higher order thinking skills such as organization, integration, and application of knowledge (Gronlund, 2003; Kubiszyn & Borich, 2003), but the use of these items on teacher tests is rare due to grading difficulties (Kubiszyn & Borich, 2003). The alignment of an assessment to course curriculum is also an important factor in an instrument’s effectiveness. General knowledge assessments published by national centers are commonly used in the evaluation of professional development programs. However, content varies between programs, so the assessments may not align with the program’s curriculum, thus not adequately measuring what is being taught. Locally developed assessments that are directly targeted at course content result in more accurate demonstrations of teacher knowledge as compared to nationally developed assessments that are not directly tied to course content (Basile et al., 2006).
     The purpose of this study was to design science content knowledge instruments that effectively assessed the ability of a professional development program to increase elementary and middle school teachers’ science content knowledge. This paper outlines the development of those instruments and presents a model for the development of rigorous science content knowledge instruments.
Methods
Professional Development Program Entitled NWO-TEAMS
     NWO-TEAMS (Northwest Ohio–Teachers Enhancing Achievement in Mathematics and Science) is supported by a three-year (2006-2009) grant from the Ohio Department of Education MSP Program. This professional development program aims to increase the content knowledge and inquiry-based teaching skills of elementary and middle school teachers in northwest Ohio. The topics covered during the program are aligned with state and national standards, and were chosen by a local focus group comprised of curriculum experts and experienced teachers who selected the concepts that teachers have the most reluctance and/or difficulty teaching (e.g., electricity, physical/chemical changes, forces and motion).
     Physical, earth, and life science topics are taught over the course of the program with lessons designed to align with Ohio grade-level indicators. The teachers are separated into groups by grade-level (third through sixth) and receive instruction in the selected topics from facilitators, who are experienced teachers familiar with inquiry-based teaching methods, and scientists. The facilitator-scientist team teaching model (Ballone-Duran, Czerniak, & Haney, 2005; National Science Resources Center, 1997) allows teachers to experience science instruction that models how they should teach science in their own classrooms. Also, by allowing the teachers to ask the scientists complex scientific questions, the teaching model helps the teachers to develop the deep subject-matter knowledge that is needed to successfully use inquiry-based teaching methods. Program instruction is based on modified OSCI (Ohio Science Institute) modules, which are science inquiry lessons, supplemented by inquiry-based science kits (e.g., FOSS, STC), which are effective in preparing teachers to use inquiry-based methods and increasing their students’ science achievement (Mangrubang, 2004; Young & Lee, 2005).
     NWO-TEAMS is comprised of three sessions over a year-long period: Summer Institute I (SI-I), Academic Year (AY), and Summer Institute II (SI-II). Teachers are provided a total of 168 professional development hours throughout the program. During SI-I, teachers attend eight full days of science instruction that includes lessons that cover concepts in physical and earth sciences, including non-contact forces, weathering and erosion, electricity, and physical and chemical changes. During the AY phase, teachers attend eight monthly sessions (September to April) covering concepts in life, physical, and earth sciences. During SI-II, teachers attend four days of instruction in life, physical, and earth sciences as well as educational field trips to the Toledo Zoo and Fossil Park in Sylvania, Ohio. Teachers are encouraged to use these area resources in their own classrooms to enhance their students’ learning experience.
Participants
     The participants in this study include two cohorts of in-service elementary and middle school science teachers involved in NWO-TEAMS. The first NWO-TEAMS cohort was comprised of 65 teachers, including 14 third grade, 22 fourth grade, 16 fifth grade, and 11 sixth grade teachers. The second NWO-TEAMS cohort was comprised of 64 teachers, including 17 third grade, 20 fourth grade, 17 fifth grade, and 10 sixth grade teachers.
Content Test Development
     While content tests for each group (third grade through sixth grade) were created for every phase of the program to assess the program’s effect on teachers’ science content knowledge, we are only focusing on the development of the SI-I tests for cohorts one and two. The SI-I content tests for cohort one were created using items from state achievement tests and locally developed classroom tests. Items were chosen from achievement tests from Texas, California, Oregon, and Ohio, among others, that aligned with the Ohio grade-level indicators on which the instructional content was based. Developing the tests in this way ensured that the teachers would not be tested on a concept that was not covered during the program. After the administration of the first year tests, a software program called ClearStat (Stone, 2003) was used to analyze the test items. ClearStat provides information such as item difficulty according to the Rasch (1960) model, item point biserials (i.e., discrimination), the proportion of teachers answering the item correctly, and the proportion of teachers choosing each of the item’s distracters. Analysis revealed that many of the items were too easy for the teacher participants, and for some items, improvement on the posttest was impossible because all of the teachers answered correctly on the pretest.
     Due to their lack of difficulty, the SI-I content tests were extensively modified before the second year of the program. Most of the previously used items taken from student achievement tests were replaced with new items that aligned with the program content.  However, some items from the first year instruments were changed and used on the second year instruments. For some items, the stem, which is the part of the item where the question is posed, was reworded to make the intentions clearer to the teachers. Furthermore, some items on the first year tests that measured lower order cognitive abilities were reformatted so they measured higher order cognitive abilities on the second year tests.
     The content tests for cohort two were developed to establish higher difficulty by using Bloom’s taxonomy, tables of specifications, lesson plans from the first year of the program, and literature about science misconceptions.
     Bloom’s taxonomy of educational objectives (Bloom et al., 1956) was used to classify the test items by the cognitive level they measured. The cognitive domain of Bloom’s taxonomy consists of six cognitive levels: knowledge, comprehension, application, analysis, synthesis, and evaluation. Items that measured knowledge and comprehension comprised what will be termed “lower order” items, and items that measured application, analysis, synthesis, and evaluation comprised what will be termed “higher order” items. Bloom’s taxonomy was used as a guide to write a larger number of higher order items for the second year tests. The reasoning was that, because higher order items require more critical thinking skills than lower order items, adding these items would increase the content tests’ difficulty. See Appendix I for examples of lower and higher order items.
      Tables of specifications were used during the development of the tests to ensure the items were aligned with the instructional content and that lower order items did not comprise the majority of the test, as was the case with the first year tests. A table of specifications serves as a blueprint for a test in the form of a two-dimensional table, with content on one side and behavior or skill on the other. In this study, behavior was defined by the cognitive levels of Bloom’s taxonomy. The use of these tables ensures that a wide range of content is represented in the set of items, as well as higher order cognitive abilities (Notar et al., 2004). See Appendix II for an example of a table of specifications.
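     As a rough illustration of how a table of specifications can be tallied, the sketch below counts items per content topic and Bloom level and reports the share of higher order items. The item list, topic labels, and function names are hypothetical and are not taken from the NWO-TEAMS instruments; only the six cognitive levels and the lower/higher order split follow the definitions given above.

```python
from collections import Counter

# The six cognitive levels of Bloom's taxonomy, in order.
BLOOM_LEVELS = ["knowledge", "comprehension", "application",
                "analysis", "synthesis", "evaluation"]
# "Lower order" as defined in this paper; everything else is "higher order".
LOWER_ORDER = {"knowledge", "comprehension"}


def build_table(items):
    """Count items in each (content topic, Bloom level) cell."""
    for _, level in items:
        if level not in BLOOM_LEVELS:
            raise ValueError(f"unknown Bloom level: {level}")
    return Counter(items)


def percent_higher_order(items):
    """Percentage of items classified above knowledge/comprehension."""
    higher = sum(1 for _, level in items if level not in LOWER_ORDER)
    return 100.0 * higher / len(items)


# Hypothetical draft test: one (topic, level) pair per item.
items = [
    ("electricity", "knowledge"),
    ("electricity", "application"),
    ("weathering and erosion", "comprehension"),
    ("weathering and erosion", "analysis"),
    ("physical and chemical changes", "knowledge"),
    ("physical and chemical changes", "application"),
    ("non-contact forces", "comprehension"),
    ("non-contact forces", "evaluation"),
]

table = build_table(items)
for topic in sorted({t for t, _ in items}):
    row = [table[(topic, level)] for level in BLOOM_LEVELS]
    print(f"{topic:32s}", row)
print("higher order (%):", percent_higher_order(items))
```

     Printing one row per topic makes empty cells visible, which is the point of the blueprint: a topic with no higher order items, or a level with no items at all, shows up as a zero before the test is administered.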
     Lesson plans from the first year of the program were obtained from the facilitators and consulted during test development. Some of the grade-level indicators contain a number of more specific concepts not taught during the program. Newly written test items were compared to the lesson plans to make sure that the teachers were only being tested on the concepts taught during the program.
     Research literature about student and teacher misconceptions supplemented the use of Bloom’s taxonomy during test development. For the purposes of this study, science misconceptions were defined as ideas or concepts that are not scientifically accurate but are commonly held by teachers before and after science instruction.
     In order to increase the effectiveness of the second year tests, student and teacher misconceptions previously identified in literature guided the development of both multiple-choice and open-ended items. A major flaw in the first year tests was the existence of poor item distracters, which are the incorrect answer options in multiple-choice items. The purpose of using misconceptions in the development of multiple-choice items was to create distracters that seemed more plausible to teachers and would therefore distract uninformed teachers from the correct answer. Some of the open-ended items were also based on student and teacher misconceptions previously identified by research. The purpose of using misconceptions for the development of these open-ended items was to identify teacher misconceptions that were similar to those held by students. Other open-ended items were written to explore concepts about which little is known regarding student or teacher misconceptions. The open-ended items provided participants with the opportunity to explain the reasoning behind their answers, which would help us to better understand teachers’ knowledge base for a particular concept. In addition, several open-ended items were aligned with multiple-choice items that assessed the same concept, so we would be able to explore the reasoning behind the participants’ answers to the multiple-choice items.
Data Analysis
     The results of the first year instruments were analyzed by comparing the pretest and posttest means using dependent t-tests. Effect sizes were represented by Cohen’s d, where d = .2 is a small effect size, d = .5 is a medium effect size, and d = .8 is a large effect size (Cohen, 1988). The effect size shows how far apart the pretest mean is from the posttest mean in standard deviation units and demonstrates how important the difference is between the two means. Effect size, in contrast to statistical significance, is not affected by sample size so it provides a more practical interpretation of the difference between the means.
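     The two statistics described above can be sketched in a few lines. The paper does not state which standard deviation was used as the denominator of Cohen’s d, so the version below assumes the pooled standard deviation of the pretest and posttest scores; other conventions for paired data exist and give different values. The scores are made up for illustration, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(pre, post):
    """Dependent (paired) t statistic and its degrees of freedom.

    Assumes the pre/post differences are not all identical, since the
    standard deviation of the differences is the denominator.
    """
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def cohens_d(pre, post):
    """Mean gain in pooled standard deviation units.

    ASSUMPTION: pooled SD = sqrt((var_pre + var_post) / 2). The source
    does not say which denominator was used.
    """
    pooled_sd = math.sqrt((stdev(pre) ** 2 + stdev(post) ** 2) / 2)
    return (mean(post) - mean(pre)) / pooled_sd


# Hypothetical scores for five teachers (not study data).
pre = [10, 12, 9, 14, 11]
post = [13, 14, 12, 15, 15]

t, df = paired_t(pre, post)
d = cohens_d(pre, post)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```

     The t statistic grows with the number of teachers for a fixed mean gain, while d does not, which is the contrast between statistical significance and effect size drawn in the paragraph above.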
     It was hypothesized that items that measured higher order cognitive abilities would be more difficult than those that measured lower order cognitive abilities. Based on this hypothesis, it was predicted that instruments that contain a balanced number of higher and lower order items would be more difficult than instruments that contain a majority of lower order items. After the development of the second year tests, the percentage of higher order items on the second year tests was compared to the percentage of higher order items on the first year tests.
     Misconceptions that were held by teachers were identified by analyzing the responses to the open-ended items to look for incorrect answers that were common among the participants. Also, because misconceptions were used as distracters for multiple-choice items, pretest and posttest multiple-choice items were analyzed to find the proportion of teachers who chose each distracter. In addition to identifying misconceptions in each of the two item formats, the tests were analyzed to find connections between multiple-choice and open-ended items that assessed the same concept.
     After the administration of the second year tests, the results were analyzed by comparing the pretest and posttest means using dependent t-tests.
Results
     The pretest and posttest scores for all of the grades (three through six) in cohort one were compared using dependent t-tests. No significant changes in test scores were seen for grades three and five, but a significant increase in test scores was seen in grades four and six. The mean scores, standard deviations, t values, and effect sizes for each grade are presented in Table 1.
     The three major goals for the development of the second year instruments were to (a) write a larger number of higher order items, (b) reword or reformat items from the first year instruments, and (c) use misconceptions to guide the development of new items.
     The first year instruments were made up primarily of items that measured lower order cognitive abilities. In fact, over 80% of the items on every instrument measured lower order cognitive abilities. In order to balance the percentages of lower and higher order items, new items that measured higher order cognitive abilities were developed for the second year SI-I tests, and some of the lower order items were removed. After the new items were developed for the second year tests, the percentages of items measuring higher order cognitive abilities on every test had increased from the first year. The change in the percentages between the first and second year instruments is depicted in Table 2.
     Some of the items on the first year instruments were poorly worded and needed to be modified before being used again on the second year instruments. The item below is an example of an item from the first year tests that needed to be reworded. Option A is too general because it does not specify what about the soil appears to be the same. Option B might be interpreted as the same as option A; both options state that the soil is the same, only option B is more specific. Option C uses the word “material” to refer to chemical changes, while option D actually uses the words “chemical changes.” The item analysis showed that 71.4% of the teachers chose the correct answer (option C), while 14.3% chose each of options A and D, and 0% chose option B. The answer options needed to be reworded to be more specific and to use consistent terminology across options.

If you were to start digging in the schoolyard, what would you discover as you dug further down?

A. The soil appears to be the same until you reach rock – 14.3%

B. The soil appears to be the same thickness everywhere – 0%

C. The soil color and material appears to change until you reach rock – 71.4%

D. The soil color appears to change but there are no chemical changes – 14.3%

     The item below is the modified version of the item above that was used for the second year test. The stem of the item was changed to a situation that is more likely to be encountered by adults than digging in a schoolyard, which was the situation used in the previous version of the item. The options were rewritten so that they all used the same terminology. The item analysis showed that 47.1% of the teachers chose the correct answer (option C), while 11.8% chose option A, 23.5% chose option B, and 17.6% chose option D. The teacher responses to this item were more evenly distributed than the responses to the previous version. Therefore, it can be concluded that rewording the item’s distracters made the item a more accurate measure of teacher science content knowledge.

You take a core sample of soil in order to determine the ground composition before you pour foundation for an extension to your house. What are you most likely to see within the core sample?
A. The physical appearance and chemical composition of the soil remains the same until you reach rock – 11.8%
B. The chemical composition of the soil changes until you reach rock, but the physical appearance does not – 23.5%
C. The physical appearance and chemical composition of the soil appear to change until you reach rock – 47.1%
D. The physical appearance of the soil seems to change until you reach rock, but the chemical composition does not – 17.6%

     Misconceptions guided the development of new items on the second year tests. One multiple-choice item asked teachers to predict what would happen when two balls with different masses were pushed off a table at the same time. Students tend to think that the speed of an object depends on its weight (Halloun & Hestenes, 1985), and that misconception was used to develop the distracters for the item. Teachers had to decide if the balls would hit the floor at the same time, the heavier ball would hit the floor first, or the lighter ball would hit the floor first. Item analysis showed that 64.7% of the teachers chose that the heavier ball would hit the floor first, and only 29.4% chose the correct answer, that both balls would hit the floor at the same time.
     The development of the open-ended items was also guided by misconceptions. Some of the open-ended items were written to probe for new misconceptions that had not been previously identified in students or teachers. Table 3 presents some of the misconceptions that were identified in NWO-TEAMS teachers.
     After the administration of the second year instruments, the results were analyzed by comparing the pretest and posttest means using dependent t-tests. Significant increases in science content knowledge were seen for all of the grades, with large effect sizes present for three of the four grades. The mean scores, standard deviations, t values, and effect sizes for each grade are presented in Table 4.
Discussion
     The results of the second year instruments showed that all four grades significantly increased their science content knowledge (see Table 4). Since program instruction and the number of participants remained the same from year one to year two, it can be concluded that the increases in science content knowledge resulted from the increased effectiveness of the instruments to measure science content knowledge. The scores on the first year tests were all fairly high and thus not informative or representative of the teacher participants’ true science knowledge. Because the instruments were not difficult enough, teachers with low science knowledge acquired similar scores to teachers with high science knowledge. Therefore, determining the effectiveness of NWO-TEAMS based on the results from these tests would be inaccurate. However, writing higher order items, rewording items and their distracters, and using misconceptions in the instrument development process resulted in science content knowledge instruments that can effectively evaluate NWO-TEAMS.
     Rewording and writing new effective distracters with the help of misconceptions was an important method in this study. One of Gronlund’s (2003) 18 rules for writing effective multiple-choice items is to “make the distracters plausible and attractive to the uninformed.” One way he suggested for doing this is to “use the common misconceptions or errors of students as distracters.” A Mathematics and Science Partnership (MSP) program called MOSART (Misconception Oriented Standards-based Assessment Resource for Teachers) uses a similar method in creating multiple-choice assessment instruments that are based on national standards and research literature about science misconceptions (Coyle, 2006). Like the instruments developed for NWO-TEAMS, MOSART’s instruments are designed to look for changes in conceptual understanding as a result of professional development (Coyle, 2006).
     Using misconceptions for item development not only makes the instruments more effective, but also guides the program’s curriculum development by identifying what specific concepts the teacher participants struggle with. For example, the responses to a multiple-choice question that asked teachers “What happens to water molecules when liquid water changes to vapor?” showed that many teachers thought that the molecules change by becoming lighter. The next time this concept is addressed, the results of this example will guide curriculum development to make sure that teachers understand that the water molecules are not altered during a phase change.
     In addition to highlighting specific science concepts that need improvement, the results of the instruments could also be used to show which general content areas need the most improvement. Total test scores could be broken down into physical science and earth science scores, and dependent t-tests could be run to find the differences between pretest and posttest scores. Then, for example, if a significant difference was not found between the pretest and posttest earth science scores, the curriculum for that particular grade could be modified to better address the earth science indicators.
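     One way the breakdown described above could be carried out is sketched below: each teacher’s item scores are summed within a content area, and a dependent t statistic is computed per area. The item-to-area mapping, the scores, and the function names are all hypothetical; the real instruments contain more items and teachers.

```python
import math
from statistics import mean, stdev


def subscores(item_scores, item_area):
    """Sum one teacher's 0/1 item scores within each content area."""
    totals = {}
    for item, score in item_scores.items():
        area = item_area[item]
        totals[area] = totals.get(area, 0) + score
    return totals


def paired_t(pre, post):
    """Dependent t statistic; assumes the differences are not all equal."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical mapping of items to content areas.
item_area = {"q1": "physical", "q2": "physical", "q3": "earth", "q4": "earth"}

# Hypothetical pretest and posttest item scores, one dict per teacher.
pre = [
    {"q1": 1, "q2": 0, "q3": 0, "q4": 1},
    {"q1": 0, "q2": 0, "q3": 1, "q4": 0},
    {"q1": 1, "q2": 0, "q3": 1, "q4": 1},
    {"q1": 0, "q2": 1, "q3": 0, "q4": 0},
]
post = [
    {"q1": 1, "q2": 1, "q3": 0, "q4": 1},
    {"q1": 1, "q2": 1, "q3": 1, "q4": 0},
    {"q1": 1, "q2": 1, "q3": 1, "q4": 1},
    {"q1": 1, "q2": 1, "q3": 1, "q4": 0},
]

t_by_area = {}
for area in sorted(set(item_area.values())):
    pre_sub = [subscores(teacher, item_area)[area] for teacher in pre]
    post_sub = [subscores(teacher, item_area)[area] for teacher in post]
    t_by_area[area] = paired_t(pre_sub, post_sub)
    print(area, round(t_by_area[area], 2))
```

     In this made-up data the physical science gain is much larger than the earth science gain, which is the kind of contrast that would direct attention to the earth science lessons for that grade.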
Conclusion
     The results of this study are presented here as a model for the development of effective science knowledge instruments for professional development programs. The results have shown that developing instruments by focusing on cognitive abilities and common misconceptions is an effective way to assess the quality of professional development programs and inform decisions about curricular changes. Figure 1 presents the science knowledge instrument development model that was used for NWO-TEAMS.
     The model is presented in three parts: the preparation phase, the continuous improvement cycle, and outcomes. The first step of the preparation phase is the definition of the content and cognitive domains. Defining the content domain involves determining the science content that will be measured with the instruments. The lesson plans developed by program facilitators provide the specific content that will be taught during the program, so consultation of these plans ensures that the created items accurately align to program content. In addition, the lesson plans also show how the content is taught, which is necessary for the definition of the cognitive domain because ideally, instruments should measure program content at the same cognitive level as it was taught (Notar et al., 2004). Also, in order for test items to be classified as higher order, they need to measure the participants’ ability to apply their science knowledge to new situations that were not encountered during instruction; knowledge of how the content is taught guides the creation of these items. Tables of specifications are created during this step to ensure that the content and cognitive domains are accurately defined and aligned with the program’s instructional objectives.
     The next step in the preparation phase is the creation of test items, which are written with the guidance of Bloom’s taxonomy to ensure that they measure the desired cognitive ability. Also, literature is searched for common misconceptions about the program content. These misconceptions can be used in the instruments as distracters for multiple-choice items or as a basis for open-ended items. As items are created, they are inserted into a table of specifications. When the instrument is completed, the table of specifications will show how many items measure each content and cognitive objective (see Appendix II for an example).
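
As a rough sketch of how a table of specifications can be tallied, the snippet below records the indicator and Bloom level of each item from the Grade 4 example in Appendix II and counts items per level and per indicator. The dictionary layout and names are illustrative choices. Treating application and above as higher order is an assumption here, although it does reproduce the Year Two Grade 4 percentages in Table 2.

```python
from collections import Counter

# Item number -> (indicator, Bloom level), transcribed from Appendix II.
ITEMS = {
    1: ("ES 3", "Appl"), 2: ("ES 3", "Comp"), 3: ("PS 1", "Know"),
    4: ("ES 3", "Comp"), 5: ("PS 4", "Comp"), 6: ("ES 10", "Comp"),
    7: ("PS 1", "Know"), 8: ("ES 8", "Comp"), 9: ("PS 4", "Comp"),
    10: ("ES 8", "Know"), 11: ("ES 10", "Appl"), 12: ("PS 1", "Comp"),
    13: ("ES 10", "Comp"), 14: ("ES 8", "Comp"), 15: ("ES 10", "Comp"),
    16: ("ES 8", "Appl"), 17: ("ES 10", "Appl"), 18: ("ES 3", "Appl"),
    19: ("PS 4", "Appl"), 20: ("ES 2", "Know"), 21: ("ES 1", "Appl"),
    22: ("ES 1", "Appl"), 23: ("ES 9", "Appl"),
}

# Assumption: application and above count as higher order.
HIGHER_ORDER = {"Appl", "Anal", "Synth", "Eval"}

by_level = Counter(level for _, level in ITEMS.values())
by_indicator = Counter(indicator for indicator, _ in ITEMS.values())

higher = sum(n for level, n in by_level.items() if level in HIGHER_ORDER)
higher_pct = 100 * higher / len(ITEMS)
lower_pct = 100 - higher_pct

print(dict(by_level))
print(dict(by_indicator))
print(f"lower order: {lower_pct:.1f}%, higher order: {higher_pct:.1f}%")
```

Keeping the mapping in one place means the row and column totals of the table are recomputed automatically whenever an item is added, removed, or reclassified.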
     The next part of the model is the continuous improvement cycle, so named because the steps facilitate the improvement of the instruments and can be repeated as many times as the instrument is administered. The results are analyzed after each administration of the instrument to collect data such as the proportion of teachers answering each item correctly, item point biserials (i.e., discrimination), and the proportion of teachers choosing each of the item’s distracters. In this study, ClearStat software was used to obtain these data, but other software can be used, or the analysis can be done by hand, depending on the number of participants in the program.
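
For readers without item analysis software, the first two statistics can be computed in a few lines. The sketch below is an illustration only, not a description of what ClearStat does: the scored responses are hypothetical, and the point biserial is computed as the Pearson correlation between the 0/1 item score and the total score, with the item itself left in the total.

```python
import math
from statistics import mean, pstdev


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total score."""
    mx, my = mean(item_scores), mean(total_scores)
    sx, sy = pstdev(item_scores), pstdev(total_scores)
    if sx == 0 or sy == 0:
        return float("nan")  # no variance, so discrimination is undefined
    cov = mean([(x - mx) * (y - my)
                for x, y in zip(item_scores, total_scores)])
    return cov / (sx * sy)


# Hypothetical scored responses: one row per teacher, one column per item.
scored = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
totals = [sum(row) for row in scored]

for j in range(len(scored[0])):
    item = [row[j] for row in scored]
    p = mean(item)  # proportion answering the item correctly
    r = point_biserial(item, totals)
    print(f"item {j + 1}: proportion correct = {p:.2f}, point biserial = {r:.2f}")
```

With small groups, an item's own score inflates its correlation with the total, so some analysts remove the item from the total before correlating.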
     Next, the data from the analyses are used to modify the instruments. For example, if the analyses show that several multiple-choice items contain distracters that were not chosen by any of the participants, more plausible distracters should be considered before the items are used again. Also, if responses to an open-ended item are drastically different from what was expected, the item may need to be reworded to make its intent clearer to the participants. The modified instruments can then be administered to the next group of participants, beginning the cycle again.
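
The distracter check described above is simple to automate. The following sketch uses hypothetical function names and made-up responses to one multiple-choice item; it reports the proportion of participants choosing each option and lists the distracters that nobody selected.

```python
from collections import Counter


def option_proportions(responses, options):
    """Proportion of participants choosing each option of one item."""
    if not responses:
        raise ValueError("no responses to analyze")
    counts = Counter(responses)
    return {opt: counts[opt] / len(responses) for opt in options}


def unchosen_distracters(responses, options, key):
    """Incorrect options that no participant selected."""
    props = option_proportions(responses, options)
    return [opt for opt in options if opt != key and props[opt] == 0]


# Hypothetical responses of ten teachers to an item keyed "B".
responses = ["B", "B", "A", "B", "D", "B", "A", "B", "B", "D"]

print(option_proportions(responses, "ABCD"))
print(unchosen_distracters(responses, "ABCD", key="B"))
```

In this made-up example option C attracts no one, which would mark it as a candidate for replacement with a more plausible distracter.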
     The last part of the model is the outcomes. The data from the analyses of the test results are used to inform the program staff about the quality of the program. Total scores are broken into several subscores, one for each of the major content areas measured by the instrument, such as physical, earth, and life sciences. The participants’ scores in each of the content areas show the program staff which areas need the most work, thus leading to changes in the program’s curriculum. Also, the misconceptions that were identified by the instruments are presented to the facilitators so that those concepts can be better addressed for the next group. The curriculum is modified to meet the specific needs of the participants, thus improving the quality of the program by addressing concepts that are difficult to understand or commonly misunderstood. Therefore, continuous improvement of the instruments leads to continuous improvement of the program. The improvements in program quality will most likely lead to increases in the participants’ science subject matter knowledge, which plays a central role in their effectiveness as teachers (Ball & McDiarmid, 1990; Kennedy, 1998). The use of the model, as this study shows, results in science knowledge instruments that effectively assess professional development programs.

References
Academic Competitiveness Council. (2007). Report of the Academic Competitiveness Council. Washington, DC: The U.S.
       Department of Education.
American Association for the Advancement of Science. (1993). Benchmarks for science literacy, Project 2061. New York:
       Oxford University Press.
Armour-Thomas, E., Clay, C., Domanico, R., Bruno, K., & Allen, B. (1989). An outlier study of elementary and middle
       schools in New York City: Final report. New York: New York City Board of Education.
Ball, D. L., & McDiarmid, G. W. (1990). The subject-matter preparation of teachers. In W. R. Houston (Ed.), Handbook
       of research on teacher education (pp. 437-449). New York: Macmillan.
Ballone-Duran, L., Czerniak, C. M., & Haney, J. J. (2005). A descriptive study of the effects of a LSC project on scientists'
       teaching practices and beliefs. Journal of Science Teacher Education, 16, 159-184.
Basile, C., Koellner, K., Kimbrough, D., Jacobson, M., Morris, L., Heath, B., & Lakshmanan, A. (2006). The veritable
       quandary of measuring teacher content knowledge in a math and science partnership. Paper presented at the 2006
       MSP Evaluation Summit II, Minneapolis, MN.
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives,
       Handbook 1: Cognitive domain. New York: David McKay.
Brown, W. (2000). Reporting NAEP by achievement levels: An analysis of policy and external reviews. In M. L. Bourque
       & S. Byrd (Eds.), Student performance standards on the national assessment of educational progress: Affirmations
       and improvements (pp. 12-39). Washington, DC: National Assessment Governing Board.
Campbell, J. R., Hombo, C. M., & Mazzo, J. (2000). NAEP 1999 trends in academic performance: Three decades of
       student performance (NCES 2000-469). Washington, DC: U.S. Department of Education, National Center for
       Educational Statistics.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.
Coyle, H. (2006). Assessment instruments for MSPs from MOSART. Retrieved May 3, 2008, from
       http://mosart.mspnet.org/index.cfm/11773
Darling-Hammond, L. (2000). Teacher quality and student achievement: A review of state policy evidence. Education
       Policy Analysis Archives, 8(1). Retrieved September 11, 2007, from http://epaa.asu.edu/epaa/v8n1
Driver, R., Squires, A., Rushworth, P., & Wood-Robinson, V. (1994). Making sense of secondary science: Research into
       children’s ideas. New York: Routledge.
Fishman, B., Marx, R., Best, S., & Tal, R. (2003). Linking teacher and student learning to improve professional development
       in systemic reform. Teaching and Teacher Education, 19, 643-658.
Fuhrman, S. (2003). Riding waves, trading horses: The twenty-year effort to reform education. In D. T. Gordon (Ed.), A nation
       reformed? American education 20 years after A Nation at Risk. Cambridge, MA: Harvard Education Press.
Garet, M., Porter, A., Desimone, L., Birman, B., & Yoon, K. S. (2001). What makes professional development effective?
       Results from a national sample of teachers. American Educational Research Journal, 38, 915-945.
Glynn, S. M., Yeany, R. H., & Britton, B. K. (1991). A constructive view of learning science. In S. M. Glynn, R. H. Yeany,
       & B. K. Britton (Eds.), The psychology of learning science (pp. 3-19). Hillsdale, NJ: Lawrence Erlbaum.
Gronlund, N. (2003). Assessment of student achievement. Boston: Allyn & Bacon.
Halloun, A. H., & Hestenes, D. (1985). Common sense concepts about motion. American Journal of Physics, 53,
       1056-1065.
Kennedy, M. (1996). Form and substance in inservice teacher education (Research Monograph No. 13). Madison, WI:
       University of Wisconsin-Madison, National Institute for Science Education.
Kennedy, M. M. (1998). Education reforms and subject matter knowledge. Journal of Research in Science Teaching,
       35, 249-263.
Kubiszyn, T., & Borich, G. (2003). Educational testing and measurement: Classroom application and practice. New York:
       John Wiley & Sons.
Lotter, C. (2006). The impact of an inquiry professional development program on secondary science teachers’ enactment
       of inquiry-based pedagogies. Paper presented at the annual conference of the National Association of Research in
       Science Teaching, San Francisco.
Loucks-Horsley, S., Hewson, P. W., Love, N., & Stiles, K. (1998). Designing professional development for teachers of
       science and mathematics. Thousand Oaks, CA: Corwin Press.
Mangrubang, F. R. (2004). Preparing elementary education majors to teach science using an inquiry-based approach:
       The full option science system. American Annals of the Deaf, 149, 290-303.
Marx, R., Blumenfeld, P., Krajcik, J., Fishman, B., Soloway, E., Geier, R., & Tal, R. T. (2004). Inquiry-based science in the
       middle grades: Assessment of the learning in urban systemic reform. Journal of Research in Science Teaching, 41,
       1063-1080.
Mizell, H. (2003). Facilitator: 10 refreshments: 8 evaluation: 0. Journal of Staff Development, 24, 10-13.
National Commission on Excellence in Education. (1983). A nation at risk: The imperative for educational reform.
       Washington, DC: The U.S. Department of Education.
National Commission on Teaching and America’s Future. (1996). Doing what matters most: Teaching for America’s future.
       New York: Author.
National Research Council. (1996). National science education standards. Washington, DC: National Academy Press.
National Science Board. (2006). Science and engineering indicators 2006. Arlington, VA: National Science Foundation.
National Science Resources Center. (1997). Science for all children: A guide to improving elementary science education
       in your school district. Washington, DC: National Academy Press, National Academy of Sciences, Smithsonian
       Institution.
Notar, C. E., Zuelke, D. C., Wilson, J. D., & Yunker, B. D. (2004). The table of specifications: Insuring accountability in
       teacher made tests. Journal of Instructional Psychology, 31, 115-129.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske
       Institut.
Smith, E. L. (1991). A conceptual change model of learning science. In S. M. Glynn, R. H. Yeany, & B. K. Britton (Eds.),
       The psychology of learning science (pp. 3-19). Hillsdale, NJ: Lawrence Erlbaum.
Stone, G. (2003). ClearStat: Item analysis presentation software [Computer Software]. Sylvania, OH: MetriKs Amerique,
       LLC.
Supovitz, J., Mayer, D., & Kahle, J. (2000). Promoting inquiry-based instructional practice: The longitudinal impact of
       professional development in the context of systemic reform. Educational Policy, 14, 331-356.
Young, B. J., & Lee, S. K. (2005). The effects of a kit-based science curriculum and intensive science professional
       development on elementary student science achievement. Journal of Science Education and Technology, 14, 471-481.

Table 1: Changes in science content knowledge as measured by the first year instruments. The pretest and posttest means from the first year SI-I instruments were compared using dependent t-tests. Effect sizes were measured by Cohen’s d and calculated by using the means and standard deviations of each grade’s tests.

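| Grade | n | Total Possible Points | Mean Pretest Score ± SD | Mean Posttest Score ± SD | t | Effect Size (d) |
|-------|----|-----------------------|-------------------------|--------------------------|-------|-----------------|
| 3 | 14 | 21 | 18.0 ± 1.4 | 18.8 ± 1.0 | 1.55 | 0.66 (M) |
| 4 | 20 | 21 | 15.5 ± 2.4 | 17.8 ± 1.4 | 5.88* | 1.17 (L) |
| 5 | 20 | 22 | 15.4 ± 1.9 | 15.9 ± 3.7 | 0.63 | 0.17 |
| 6 | 12 | 26 | 18.3 ± 2.6 | 21.1 ± 1.5 | 4.37* | 1.31 (L) |

*p < .001, n = number of participants in each grade, L = large, M = medium.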
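
The caption does not spell out the formula behind the effect sizes. One common convention, assumed here, divides the mean gain by the root mean square of the pretest and posttest standard deviations; applied to the rounded values in Table 1 it lands within about 0.01 of each reported d. The size labels follow Cohen's (1988) conventional benchmarks of 0.2, 0.5, and 0.8. Function and variable names are illustrative.

```python
import math


def cohens_d(mean_pre, sd_pre, mean_post, sd_post):
    """Mean gain over the root mean square of the two SDs (assumed formula)."""
    pooled = math.sqrt((sd_pre ** 2 + sd_post ** 2) / 2)
    return (mean_post - mean_pre) / pooled


def size_label(d):
    """Cohen's (1988) conventional benchmarks for the magnitude of d."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return ""


# Grade -> (pretest mean, pretest SD, posttest mean, posttest SD), from Table 1.
TABLE_1 = {
    3: (18.0, 1.4, 18.8, 1.0),
    4: (15.5, 2.4, 17.8, 1.4),
    5: (15.4, 1.9, 15.9, 3.7),
    6: (18.3, 2.6, 21.1, 1.5),
}

for grade, stats in TABLE_1.items():
    d = cohens_d(*stats)
    print(f"Grade {grade}: d = {d:.2f} {size_label(d)}".rstrip())
```

Small discrepancies from the table are expected because the inputs here are already rounded to one decimal place.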

Table 2: Percentages of higher and lower order items for the first and second year instruments. The percentages of items measuring each cognitive level were calculated for each grade by counting the number of items measuring each cognitive level and dividing by the instruments’ total number of items. The total percentages of items measuring each cognitive level were calculated by combining the number of items from all four grades that measure each cognitive level and dividing by the total number of items.

| Grade | Year One: Lower Order | Year One: Higher Order | Year Two: Lower Order | Year Two: Higher Order |
|-------|-----------------------|------------------------|-----------------------|------------------------|
| 3 | 94.5% | 5.6% | 46.2% | 53.8% |
| 4 | 80.0% | 20.0% | 60.9% | 39.1% |
| 5 | 82.3% | 17.7% | 50.0% | 50.0% |
| 6 | 88.3% | 11.7% | 50.0% | 50.0% |
| Total | 86.1% | 13.9% | 53.1% | 46.9% |

Table 3: Misconceptions held by NWO-TEAMS teacher participants. Items that probed for misconceptions were based on misconceptions that had already been identified in students, already been identified in teachers, or had not already been identified. Teacher responses demonstrated that teachers held misconceptions in all three categories.

| Previously Identified in Students | Previously Identified in Teachers | Previously Unidentified |
|-----------------------------------|-----------------------------------|-------------------------|
| Gravity acts only on a body in motion | Chemical changes are not reversible | Gravity acts more when an object is further from the ground |
| All metals are attracted to magnets | Density is directly proportional to mass and/or volume | Heat and pressure provide the energy stored in fossil fuels |

Table 4: Changes in science content knowledge as measured by the second year instruments. The pretest and posttest means from the second year SI-I instruments were compared using dependent t-tests. Effect sizes were measured by Cohen’s d and calculated by using the means and standard deviations of each grade’s tests.

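| Grade | n | Total Possible Points | Mean Pretest Score ± SD | Mean Posttest Score ± SD | t | Effect Size (d) |
|-------|----|-----------------------|-------------------------|--------------------------|---------|-----------------|
| 3 | 13 | 22 | 9.9 ± 4.0 | 13.0 ± 3.3 | 3.65** | 0.84 (L) |
| 4 | 18 | 38 | 19.6 ± 4.8 | 22.8 ± 4.7 | 3.04** | 0.67 (M) |
| 5 | 17 | 25 | 9.9 ± 1.9 | 13.9 ± 3.4 | 4.54*** | 1.41 (L) |
| 6 | 12 | 25 | 9.2 ± 5.3 | 13.6 ± 4.1 | 2.66* | 0.91 (L) |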

*p < .05, **p < .01, ***p < .001, L = large, M = medium.

Figure 1: NWO-TEAMS Science Knowledge Instrument Development Model

APPENDIX I: Examples of Lower and Higher Order Items

Lower Order:

What units do scientists use to measure temperature?

When liquid water is heated to vapor, the water molecules are:

  A. changed, and they become lighter
  B. unchanged, and they exist as single molecules in the air
  C. changed, and they form compounds with other air molecules
  D. unchanged, and they are bonded together in groups

Higher Order:

[Picture of Beakers 1, 2, and 3 not reproduced.]

Look at the picture above. Water is poured into Beaker 1 and frozen. Water is poured into Beakers 2 and 3 to match the level of the ice in Beaker 1. Each beaker is tightly sealed. Beaker 2 is heated until all the water has evaporated. Beaker 3 is left at room temperature. Describe how the mass of each beaker relates to the others, and explain why. Be specific. Be sure to compare EVERY beaker.

The bathroom in your house has tile along the bottom of the wall. On a hot day in July, you touch the tile and find it to be surprisingly cold. You then touch the non-tiled wall above, and it seems to be warmer than the tile. How can this be explained?

  A. Warm air rises by convection, so the wall will be warmer than the tile
  B. Tile effectively reduces heat transfer by blocking UV radiation, so the tile does not get warm
  C. Tile is a good conductor of thermal energy, so heat is transferred from your body to the tile
  D. The wall underneath the tile acts as an additional layer of insulation, so the tile does not get warm

APPENDIX II: Example of a Table of Specifications

Cohort Two: Grade 4


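Columns Know through Eval are the levels of Bloom’s Taxonomy.

| Indicator | Item Type | Know | Comp | Appl | Anal | Synth | Eval | Total |
|-----------|-----------|------|------|------|------|-------|------|-------|
| PS 1 | MC | (3)(7) | (12) | | | | | 3 |
| PS 4 | MC, SA | | (5)(9) | (19) | | | | 3 |
| ES 1 | SA, MC | | | (21)(22) | | | | 2 |
| ES 2 | SA | (20) | | | | | | 1 |
| ES 3 | MC, SA | | (2)(4) | (1)(18) | | | | 4 |
| ES 8 | MC, SA | (10) | (8)(14) | (16) | | | | 4 |
| ES 9 | SA | | | (23) | | | | 1 |
| ES 10 | SA, MC | | (15)(6)(13) | (11)(17) | | | | 5 |
| Total | | 4 | 10 | 9 | | | | 23 |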

PS = Physical Science
ES = Earth Science
SA = Short Answer
MC = Multiple Choice


 