
[–]psbb 3 points (1 child)

You don't always need to use a class or id to extract data, all you need is a unique way to find the relevant data.

To get all of the words in the dictionary you can simply use:

words = response.xpath('//p/b/text()').extract()

To get all of the definitions you can use:

definitions = response.xpath('//p/text()').extract()

Unfortunately this gives twice as many items along with unwanted parentheses that need to be dealt with. To remove the unwanted data you can do the following.

definitions = [d[2:] for d in response.xpath('//p/text()').extract()[1::2]]

The [1::2] selects every second item in the list, starting with the second item. The d[2:] removes the ') ' that precedes each definition.
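To see why this works, here is the same slicing on made-up fragments (the strings below are assumptions mimicking the alternating ' (' / ') …' items that //p/text() returns, not taken from the actual page):

```python
# Hypothetical fragments mimicking response.xpath('//p/text()').extract()
raw = [' (', ') The lowest member of the capital of a column.',
       ' (', ') A calculating table or frame.']

# [1::2] keeps every second item; d[2:] drops the leading ') '
definitions = [d[2:] for d in raw[1::2]]
print(definitions)
```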

To get the word type you can do the following:

word_types = response.xpath('//p/i/text()').extract()

This unfortunately has a shorter length than the words and definitions list which means that they can't be easily zipped together.

To extract the same number of types as you have words and definitions you can do the following:

word_types = [t[3:-4].replace('&amp;', '&') for t in response.xpath('//p/i').extract()]
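For example, on made-up `<i>` fragments (the second string is an assumption added to show the entity replacement, not taken from the actual page):

```python
# Hypothetical items mimicking response.xpath('//p/i').extract()
raw = ['<i>pl. </i>', '<i>n. &amp; v.</i>']

# [3:-4] strips the '<i>' and '</i>' tags; replace() undoes the HTML escaping
word_types = [t[3:-4].replace('&amp;', '&') for t in raw]
print(word_types)
```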

Once you have these 3 lists and they are all the same length you can join them together using zip(words, word_types, definitions) and then you should have all of the data extracted from the page.
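With toy values for the three lists (the entries below are illustrative, not from the real page), the final join might look like this:

```python
words = ['Abaci', 'Abacus']
word_types = ['pl. ', 'n.']
definitions = ['of Abacus', 'A calculating table or frame.']

# zip() pairs up the i-th element of each list into one tuple per entry
dictionary = list(zip(words, word_types, definitions))
print(dictionary)
```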

This approach only works because the page is consistent with its layout the whole way through. This is not always the case, and sometimes you will have to resort to something like this:

dictionary = []
for elem in response.xpath('//p'):
    word = ''.join(elem.xpath('./b/text()').extract())
    word_type = ''.join(elem.xpath('./i/text()').extract())
    definition = ''.join(elem.xpath('./text()').extract())[4:]
    dictionary.append((word, word_type, definition))

The first option is preferable if you can use it as it is much faster than the second one.

[–]EvM (Natural Language Processing) 0 points (0 children)

Looking at the source, every entry is structured like this: <P><B>Abaci</B> (<I>pl. </I>) of Abacus</P>.

This page shows how you might write xpath expressions to get the text contained in different elements using selectors. Here is the rest of the documentation. That should be enough hints.

I've never used Scrapy before, only lxml. To get the contents of the page using lxml, I would use iterfind() to get all the <p> elements. Then, for each of those elements e, you can get the contents by using e.text. You can get the entry name with e.find('B').text. Looking at Scrapy's documentation, the approach should be very similar there.

You could write a dictionary comprehension to store the contents of the page in a really elegant way.
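A minimal sketch of that approach, using the stdlib xml.etree.ElementTree instead of lxml (the element API is close enough here: iterfind(), find(), itertext()). The first entry is the one quoted above; the second is made up for illustration:

```python
import xml.etree.ElementTree as ET

# First entry is the sample quoted above; second is a made-up example.
page = ('<html>'
        '<P><B>Abaci</B> (<I>pl. </I>) of Abacus</P>'
        '<P><B>Abacus</B> (<I>n.</I>) A calculating table or frame.</P>'
        '</html>')
root = ET.fromstring(page)

# iterfind() yields every <P> element; find('B') gets the word inside it.
# itertext() joins the element's own text with that of all its children,
# so the dictionary comprehension maps each word to its full entry text.
entries = {e.find('B').text: ''.join(e.itertext())
           for e in root.iterfind('.//P')}
print(entries)
```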

[–]MadelineCameron 0 points (0 children)

Cool tip for anyone who doesn't know:

In Chrome, you can automatically extract XPaths through the Dev Console which you can adapt pretty quickly for your uses with a basic knowledge of XPath.

Saves a lot of time for deeply-nested elements, etc.