
API documentation for calibre recipes


class calibre.web.feeds.news.BasicNewsRecipe(options, log, progress_reporter)

This base class contains all the logic needed in a recipe. By progressively overriding more of its functionality, you can create increasingly customized and powerful recipes.

Methods

abort_article(msg=None)

Call this method from within any of the preprocess methods to abort the download of the current article. Useful for skipping articles with unsuitable content, such as video-only articles.
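
A minimal sketch of how a recipe might use this. The video-detection heuristic below is invented for illustration and is not part of calibre; only abort_article() itself is the documented API:

```python
import re

def looks_like_video_page(html):
    # Hypothetical heuristic: treat pages whose markup embeds a video
    # player as video-only articles. Adjust the pattern per site.
    return bool(re.search(r'<(video|iframe)[^>]*player', html, re.I))

# Inside a recipe subclass one would then write, for example:
#
# def preprocess_raw_html(self, raw_html, url):
#     if looks_like_video_page(raw_html):
#         self.abort_article('Skipping video-only article')
#     return raw_html
```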

abort_recipe_processing(msg)

Causes the recipe download system to abort the download of this recipe, showing the user a simple feedback message.

add_toc_thumbnail(article, src)

Call this method from populate_article_metadata with the src attribute of an <img> tag from the current article that is suitable as the thumbnail representing the article in the Table of Contents. Currently only the Kindle displays these thumbnails.
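
For illustration, here is one way to pick a thumbnail candidate using only the standard library; the helper is an assumption for this sketch, while add_toc_thumbnail() and populate_article_metadata() are the documented hooks:

```python
from html.parser import HTMLParser

class _FirstImg(HTMLParser):
    """Records the src attribute of the first <img> tag seen."""
    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == 'img' and self.src is None:
            self.src = dict(attrs).get('src')

def first_image_src(html):
    parser = _FirstImg()
    parser.feed(html)
    return parser.src

# Sketch of the recipe-side usage:
#
# def populate_article_metadata(self, article, soup, first):
#     if first:
#         src = first_image_src(str(soup))
#         if src:
#             self.add_toc_thumbnail(article, src)
```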

adeify_images(soup)

This method is called by postprocess_html() to make images compatible with Adobe Digital Editions' rendering of images in EPUB files.

canonicalize_internal_url(url, is_link=True)

Return a set of canonical representations of url. The default implementation uses just the server hostname and the path of the URL, ignoring any query parameters, fragments, etc. See the urlparse.urlparse() function.

is_link
True: the URL comes from a link inside a downloaded HTML file
False: the URL is the one used to download an article
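
The documented default behavior can be mimicked with urlparse; this stand-alone sketch keeps only the hostname and path:

```python
from urllib.parse import urlparse

def canonical_key(url):
    # Keep only hostname and path, dropping query parameters and
    # fragments, mirroring the default described above.
    parsed = urlparse(url)
    return (parsed.hostname, parsed.path)
```

Two URLs that differ only in tracking parameters then map to the same key.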

cleanup()

Called after all articles have been downloaded. Use it to clean up any state, such as logging out of subscription sites.

clone_browser(br)

Clone the browser br. Cloned browsers are used for multi-threaded downloads, since mechanize is not thread safe. The default cloning routines should capture most browser customization, but if you do something exotic in your recipe, you should override this method in your recipe and clone manually.

Cloned browser instances use the same, thread-safe CookieJar by default, unless you have customized cookie handling.

default_cover(cover_file)

Create a generic cover for recipes that don’t have a cover

download()
Download and pre-process all articles from the feeds in this recipe. This method should be called only once on a particular Recipe instance. Calling it more than once will lead to undefined behavior. Returns the path to index.html.

extract_readable_article(html, url)
Extracts main article content from ‘html’, cleans up and returns as a (article_html, extracted_title) tuple. Based on the original readability algorithm by Arc90.

get_article_url(article)
Override in a subclass to customize extraction of the URL that points to the content for each article. Return the article URL. It is called with article, an object representing a parsed article from a feed. See feedparser. By default it looks for the original link (for feeds syndicated via a service like feedburner or pheedo) and if found, returns that or else returns article.link.
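
A stand-alone sketch of the described default. The attribute name feedburner_origlink is an assumption for illustration; the real implementation inspects the parsed feedparser entry:

```python
def pick_article_url(article):
    # Prefer the original link attached by syndication services
    # (feedburner/pheedo); fall back to the plain link.
    orig = getattr(article, 'feedburner_origlink', None)
    return orig or article.link
```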

get_browser(*args, **kwargs)
Return a browser instance used to fetch documents from the web. By default it returns a mechanize browser instance that supports cookies, ignores robots.txt, handles refreshes and has a mozilla firefox user agent.

If your recipe requires that you login first, override this method in your subclass. For example, the following code is used in the New York Times recipe to login for full access:

def get_browser(self):
    br = BasicNewsRecipe.get_browser(self)
    if self.username is not None and self.password is not None:
        br.open('https://www.nytimes.com/auth/login')
        br.select_form(name='login')
        br['USERID'] = self.username
        br['PASSWORD'] = self.password
        br.submit()
    return br
get_cover_url()
Return a URL to the cover image for this issue or None. By default it returns the value of the member self.cover_url which is normally None. If you want your recipe to download a cover for the e-book override this method in your subclass, or set the member variable self.cover_url before this method is called.

get_extra_css()

By default returns self.extra_css. Override if you want to programmatically generate the extra_css.

get_feeds()

Return a list of RSS feeds to fetch for this profile. Each element of the list must be a 2-element tuple of the form (title, url). If title is None or an empty string, the title from the feed is used. This method is useful if your recipe needs to do some processing to figure out the list of feeds to download. If so, override in your subclass.
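
For instance, a recipe whose sections follow a predictable URL pattern could compute the list; the section slugs and URL template here are hypothetical:

```python
def build_feeds(sections):
    # Produce the (title, url) 2-tuples that get_feeds() must return.
    return [(slug.title(), 'https://example.com/rss/%s.xml' % slug)
            for slug in sections]
```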

get_masthead_title()

Override in subclass to use something other than the recipe title

get_masthead_url()

Return a URL to the masthead image for this issue or None. By default it returns the value of the member self.masthead_url which is normally None. If you want your recipe to download a masthead for the e-book override this method in your subclass, or set the member variable self.masthead_url before this method is called. Masthead images are used in Kindle MOBI files.

get_obfuscated_article(url)

If you set articles_are_obfuscated this method is called with every article URL. It should return the path to a file on the filesystem that contains the article HTML. That file is processed by the recursive HTML fetching engine, so it can contain links to pages/images on the web.

This method is typically useful for sites that try to make it difficult to access article content automatically.
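
The only hard requirement is returning a filesystem path containing the article HTML. A minimal sketch of that part (how the HTML is actually obtained is site-specific and omitted here):

```python
import os
import tempfile

def save_html_for_engine(html):
    # Write the article HTML to a temp file and return its path, which
    # is what get_obfuscated_article() must return.
    fd, path = tempfile.mkstemp(suffix='.html')
    with os.fdopen(fd, 'w', encoding='utf-8') as f:
        f.write(html)
    return path
```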

classmethod image_url_processor(baseurl, url)

Perform some processing on image urls (perhaps removing size restrictions for dynamically generated images, etc.) and return the processed URL.
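
As an example, a site that appends size parameters to image URLs could be handled like this; the query-parameter names are assumptions about a hypothetical site:

```python
import re

def image_url_processor(baseurl, url):
    # Strip trailing width/height parameters such as ?w=300&h=200 so
    # the full-size image is fetched instead of a thumbnail.
    pattern = r'\?(?:[wh]|width|height)=\d+(?:&(?:[wh]|width|height)=\d+)*$'
    return re.sub(pattern, '', url)
```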

index_to_soup(url_or_raw, raw=False, as_tree=False)

Convenience method that takes a URL to the index page and returns a BeautifulSoup of it.

url_or_raw: Either a URL or the downloaded index page as a string

is_link_wanted(url, tag)

Return True if the link should be followed or False otherwise. By default, raises NotImplementedError which causes the downloader to ignore it.

Parameters:
url – The URL to be followed
tag – The tag from which the URL was derived
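
A sketch of a possible override; the filtering rule is invented for illustration:

```python
def is_link_wanted(url, tag):
    # Hypothetical rule: follow only article pages and skip binary
    # documents such as PDFs.
    return '/article/' in url and not url.lower().endswith('.pdf')
```
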
parse_feeds()
Create a list of articles from the list of feeds returned by BasicNewsRecipe.get_feeds(). Return a list of Feed objects.

parse_index()

This method should be implemented in recipes that parse a website instead of feeds to generate a list of articles. Typical uses are for news sources that have a “Print Edition” webpage that lists all the articles in the current print edition. If this function is implemented, it will be used in preference to BasicNewsRecipe.parse_feeds().

It must return a list. Each element of the list must be a 2-element tuple of the form (‘feed title’, list of articles).

Each list of articles must contain dictionaries of the form:

{
    'title'       : article title,
    'url'         : URL of print version,
    'date'        : The publication date of the article as a string,
    'description' : A summary of the article,
    'content'     : The full article (can be an empty string). Obsolete,
                    do not use; instead save the content to a temporary
                    file and pass a file:///path/to/temp/file.html as
                    the URL.
}
For an example, see the recipe for downloading The Atlantic. In addition, you can add ‘author’ for the author of the article.

If you want to abort processing for some reason and have calibre show the user a simple message instead of an error, call abort_recipe_processing().
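
To make the required shape concrete, this stand-alone helper builds the nested structure from plain tuples (the helper itself is not part of calibre):

```python
def make_index(sections):
    # sections maps a feed title to (title, url, date, description)
    # tuples; the result has the exact shape parse_index() must return.
    index = []
    for feed_title, rows in sections.items():
        articles = [{'title': t, 'url': u, 'date': d, 'description': s}
                    for t, u, d, s in rows]
        index.append((feed_title, articles))
    return index
```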

populate_article_metadata(article, soup, first)
Called when each HTML page belonging to article is downloaded. Intended to be used to get article metadata like author/summary/etc. from the parsed HTML (soup).

Parameters:
article – An object of class calibre.web.feeds.Article. If you change the summary, remember to also change the text_summary
soup – Parsed HTML belonging to this article
first – True iff the parsed HTML is the first page of the article.
postprocess_book(oeb, opts, log)
Run any needed post processing on the parsed downloaded e-book.

Parameters:
oeb – An OEBBook object
opts – Conversion options
postprocess_html(soup, first_fetch)
This method is called with the source of each downloaded HTML file, after it is parsed for links and images. It can be used to do arbitrarily powerful post-processing on the HTML. It should return soup after processing it.

Parameters:
soup – A BeautifulSoup instance containing the downloaded HTML.
first_fetch – True if this is the first page of an article.
preprocess_html(soup)
This method is called with the source of each downloaded HTML file, before it is parsed for links and images. It is called after the cleanup as specified by remove_tags etc. It can be used to do arbitrarily powerful pre-processing on the HTML. It should return soup after processing it.

soup: A BeautifulSoup instance containing the downloaded HTML.

preprocess_image(img_data, image_url)
Perform some processing on downloaded image data. This is called on the raw data before any resizing is done. Must return the processed raw data. Return None to skip the image.
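
A minimal sketch; the byte-size threshold is an arbitrary assumption for filtering out tracking pixels:

```python
def preprocess_image(img_data, image_url):
    # Return None to skip degenerate images, otherwise pass the raw
    # data through unchanged.
    if len(img_data) < 100:
        return None
    return img_data
```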

preprocess_raw_html(raw_html, url)
This method is called with the source of each downloaded HTML file, before it is parsed into an object tree. raw_html is a unicode string representing the raw HTML downloaded from the web. url is the URL from which the HTML was downloaded.

Note that this method acts before preprocess_regexps.

This method must return the processed raw_html as a unicode object.

classmethod print_version(url)
Take a url pointing to the webpage with article content and return the URL pointing to the print version of the article. By default does nothing. For example:

def print_version(self, url):
    return url + '?&pagewanted=print'
skip_ad_pages(soup)
This method is called with the source of each downloaded HTML file, before any of the cleanup attributes like remove_tags, keep_only_tags are applied. Note that preprocess_regexps will have already been applied. It is meant to allow the recipe to skip ad pages. If the soup represents an ad page, return the HTML of the real page. Otherwise return None.

soup: A BeautifulSoup instance containing the downloaded HTML.

sort_index_by(index, weights)
Convenience method to sort the titles in index according to weights. index is sorted in place. Returns index.

index: A list of titles.

weights: A dictionary that maps weights to titles. If any titles in index are not in weights, they are assumed to have a weight of 0.
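
The documented behavior amounts to the following stand-alone sketch:

```python
def sort_index_by(index, weights):
    # Sort index in place by weight; titles missing from weights are
    # treated as weight 0. Returns the same list.
    index.sort(key=lambda title: weights.get(title, 0))
    return index
```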

classmethod tag_to_string(tag, use_alt=True, normalize_whitespace=True)
Convenience method to take a BeautifulSoup Tag and extract the text from it recursively, including any CDATA sections and alt tag attributes. Return a possibly empty unicode string.

use_alt: If True try to use the alt attribute for tags that don’t have any textual content

tag: BeautifulSoup Tag

articles_are_obfuscated = False
Set to True and implement get_obfuscated_article() to handle websites that try to make it difficult to scrape content.

auto_cleanup = False
Automatically extract all the text from downloaded article pages. Uses the algorithms from the readability project. Setting this to True, means that you do not have to worry about cleaning up the downloaded HTML manually (though manual cleanup will always be superior).

auto_cleanup_keep = None
Specify elements that the auto cleanup algorithm should never remove. The syntax is a XPath expression. For example:

auto_cleanup_keep = '//div[@id="article-image"]' will keep all divs with
    id="article-image"
auto_cleanup_keep = '//*[@class="important"]' will keep all elements
    with class="important"
auto_cleanup_keep = '//div[@id="article-image"]|//span[@class="important"]'
    will keep all divs with id="article-image" and spans with class="important"
center_navbar = True
If True the navigation bar is center aligned, otherwise it is left aligned

compress_news_images = False
Set this to False to ignore all scaling and compression parameters and pass images through unmodified. If True and the other compression parameters are left at their default values, jpeg images will be scaled to fit in the screen dimensions set by the output profile and compressed to size at most (w * h)/16 where w x h are the scaled image dimensions.

compress_news_images_auto_size = 16
The factor used when auto compressing jpeg images. If set to None, auto compression is disabled. Otherwise, the images will be reduced in size to (w * h)/compress_news_images_auto_size bytes if possible by reducing the quality level, where w x h are the image dimensions in pixels. The minimum jpeg quality will be 5/100 so it is possible this constraint will not be met. This parameter can be overridden by the parameter compress_news_images_max_size which provides a fixed maximum size for images. Note that if you enable scale_news_images_to_device then the image will first be scaled and then its quality lowered until its size is less than (w * h)/factor where w and h are now the scaled image dimensions. In other words, this compression happens after scaling.

compress_news_images_max_size = None
Set jpeg quality so images do not exceed the size given (in KBytes). If set, this parameter overrides auto compression via compress_news_images_auto_size. The minimum jpeg quality will be 5/100 so it is possible this constraint will not be met.

conversion_options = {}
Recipe specific options to control the conversion of the downloaded content into an e-book. These will override any user or plugin specified values, so only use if absolutely necessary. For example:

conversion_options = {
    'base_font_size'   : 16,
    'tags'             : 'mytag1,mytag2',
    'title'            : 'My Title',
    'linearize_tables' : True,
}
cover_margins = (0, 0, '#ffffff')
By default, the cover image returned by get_cover_url() will be used as the cover for the periodical. Overriding this in your recipe instructs calibre to render the downloaded cover into a frame whose width and height are expressed as a percentage of the downloaded cover. cover_margins = (10, 15, '#ffffff') pads the cover with a white margin 10px on the left and right, 15px on the top and bottom. Color names are defined at https://www.imagemagick.org/script/color.php Note that for some reason, white does not always work on Windows; use #ffffff instead.

delay = 0
Delay between consecutive downloads in seconds. The argument may be a floating point number to indicate a more precise time.

description = u''
A couple of lines that describe the content this recipe downloads. This will be used primarily in a GUI that presents a list of recipes.

encoding = None
Specify an override encoding for sites that have an incorrect charset specification. The most common being specifying latin1 and using cp1252. If None, try to detect the encoding. If it is a callable, the callable is called with two arguments: The recipe object and the source to be decoded. It must return the decoded source.
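
For the common latin1-vs-cp1252 mislabeling mentioned above, the callable form could look like this sketch:

```python
def decode_source(recipe, raw):
    # The site declares latin1 but actually serves cp1252, so decode
    # with cp1252 (which also defines the 0x80-0x9F range).
    return raw.decode('cp1252')

# In a recipe one would then set (sketch):
# encoding = decode_source
```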

extra_css = None
Specify any extra CSS that should be added to downloaded HTML files. It will be inserted into <style> tags, just before the closing </head> tag, thereby overriding all CSS except that which is specified using the style attribute on individual HTML tags.
