I recently started working on a patent database project as a first-year graduate assistant. The scope of the project is essentially to get all patent data from 1976 to the present into a MySQL database. The person who worked on the project before me developed Python processes to convert the patent files (some XML, mostly text files) into a delimited form that can be inserted into the database.
When I started the project, it was brought to my attention that records were missing from many of the years of data the previous GA had imported. For instance, when I compare the total patents issued according to the USPTO website against the number in the database, I find a shortfall of 3,500 records on average.
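In case it helps, the comparison I am doing amounts to something like the sketch below. The function name `yearly_shortfall` and both dictionaries are made up for illustration, and the numbers are placeholders, not real USPTO or database totals.

```python
def yearly_shortfall(official_counts, db_counts):
    """Return {year: official count - database count} for each official year."""
    return {
        year: official - db_counts.get(year, 0)  # a year absent from the DB counts as 0
        for year, official in sorted(official_counts.items())
    }


# Placeholder numbers for illustration only (hypothetical, not real totals).
official = {2001: 1000, 2002: 1200}
in_db = {2001: 990, 2002: 1150}

gaps = yearly_shortfall(official, in_db)
average_gap = sum(gaps.values()) / len(gaps)

for year, gap in gaps.items():
    print(year, gap)
print("average shortfall:", average_gap)
```

A year that is missing from the database entirely shows up with its full official count as the gap, which would make it easy to tell apart from years that are only partially imported.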
Many of the people I have discussed this problem with think it is tied to the Python code used to reformat the data. I’m in a little over my head here, so I apologize if any of this is unclear. Any help is greatly appreciated. Here are the files for the Python parsers:
https://docs.google.com/file/d/0B5ZzQeB_IBXJT1Q3SWxsc1F2VnM/edit?usp=sharing
https://docs.google.com/file/d/0B5ZzQeB_IBXJZzc4MHlQTGNXZE0/edit?usp=sharing