feat: 重写 ADS 搜索脚本为 REST API,新增 Obsidian 转换 skill,修复路径

- ads_metadata_search: 移除 ads 库依赖,改用 requests 直连 ADS REST API;
  移除硬编码 API Key,改为 .env 文件/环境变量加载
- 新增 ads_html_to_obsidian skill:将下载的 HTML 文献批量转换为
  Obsidian Markdown 笔记(BS4 提取正文 + Pandoc 转换 + 清洗后处理)
- 两个 SKILL.md 中的 Windows 绝对路径改为相对路径
This commit is contained in:
Asfmq 2026-05-26 17:30:36 +08:00
parent 0663091691
commit dfd0a980a5
5 changed files with 401 additions and 47 deletions

View File

@ -0,0 +1,89 @@
---
name: ads_html_to_obsidian
description: "将 ADS 下载的 HTML/PDF 天体物理文献批量转换为 Obsidian Markdown 笔记格式。当用户要求将下载的论文转为 Obsidian 笔记、将 HTML 文献转为 Markdown、或者在使用 ads_literature_downloader 之后需要整理文献到 Obsidian 知识库时,务必触发并使用本技能。当用户提到 '转换文献'、'导入 Obsidian'、'整理论文'、'放到笔记库' 等关键词时也应触发。"
---
# ADS HTML to Obsidian (ADS 文献转 Obsidian 笔记)
本技能将 `ads_literature_downloader` 下载的 HTML 文献批量转换为 Obsidian Markdown 格式的笔记文件。它会:
1. 从 HTML 中提取论文正文(使用 BeautifulSoup
2. 通过 Pandoc 将 HTML 转为干净的 Markdown
3. 清理转换残留标记CSS class、div、脚注等
4. 生成带有 YAML frontmatter 的 Obsidian 笔记文件
5. 对于无法获取全文的文献,自动回退为仅保存摘要
## 依赖
- Python 虚拟环境中需安装 `beautifulsoup4``uv pip install beautifulsoup4`
- 系统需安装 `pandoc``sudo apt install pandoc`
## 运行方式
本技能通过附带脚本 `scripts/convert.py` 执行批量转换:
```bash
python .claude/skills/ads_html_to_obsidian/scripts/convert.py \
<metadata.json> \
<download_dir> \
<output_dir>
```
### 参数说明
- `metadata.json`:由 `ads_metadata_search` 技能生成的文献元数据 JSON 文件,包含每篇论文的 `bibcode`、`title`、`author`、`year`、`abstract`、`pub`、`doi` 等字段
- `download_dir`:由 `ads_literature_downloader` 技能创建的下载目录,内含 `HTML/``PDF/` 子目录
- `output_dir`Obsidian 笔记库的目标目录(如 Obsidian vault 中的主题文件夹)
### 输出格式
每个文献生成一个以 bibcode 命名的 `.md` 文件,格式如下:
```yaml
---
title: "论文标题"
author: ["作者1", "作者2"]
publisher: "期刊名"
source: "https://ui.adsabs.harvard.edu/abs/BIBCODE/abstract"
date: "2025-01-01"
tags: "Astrophysics-Solar-and-Stellar-Astrophysics"
---
# 论文标题
## [ADS: ADS链接](ADS链接)
论文正文Markdown 格式)...
```
如果 HTML 全文无法转换如会议摘要、星表、HST 提案等),文件中会包含 `## Abstract` 部分和摘要文本。
## 典型工作流
本技能通常与其他 ADS 技能配合使用:
1. **搜索文献**:使用 `ads_metadata_search` 搜索并保存元数据到 `results.json`
2. **下载文献**:使用 `ads_literature_downloader` 下载 PDF/HTML 到 `download_dir/`
3. **转换笔记**:使用本技能将下载的文献转为 Obsidian 笔记
```bash
# 完整工作流示例
# Step 1: 搜索
python .claude/skills/ads_metadata_search/scripts/search.py \
--query '"hot subdwarf"' --output results.json --rows 50 --year_range 2025-2026
# Step 2: 提取 bibcodes 并下载
python .claude/skills/ads_literature_downloader/scripts/download.py \
--bibcode_file bibcodes.txt --output_dir ./papers --threads 3
# Step 3: 转换为 Obsidian 笔记
python .claude/skills/ads_html_to_obsidian/scripts/convert.py \
results.json ./papers /path/to/obsidian/vault/TopicFolder
```
## 脚本输出
运行时会显示每篇文献的转换状态:
- `[HTML->MD]`:成功从 HTML 转为 Markdown含全文
- `[Abstract only]`:无法获取全文,仅保存摘要
结束后统计总数、成功转换数和仅摘要数。

View File

@ -0,0 +1,236 @@
"""Convert downloaded HTML papers to Obsidian Markdown format."""
import json
import os
import subprocess
import re
import sys
from bs4 import BeautifulSoup
def extract_main_content(html_text):
"""Use BeautifulSoup to extract main article content from HTML."""
soup = BeautifulSoup(html_text, "html.parser")
# Remove scripts, styles, nav, header, footer
for tag in soup.find_all(["script", "style", "nav", "header", "footer", "noscript"]):
tag.decompose()
# Try to find the main content area
# Ar5iv/LaTeX HTML: look for ltx_page_main or ltx_page_content
main = soup.find("div", class_="ltx_page_main")
if main:
return str(main)
# ArXiv abstract page: look for #content or .leftcolumn
main = soup.find("div", id="content")
if main:
# Further extract just the abstract area if it's the arxiv abs page
abs_div = main.find("blockquote", class_="abstract")
if abs_div:
return str(abs_div)
return str(main)
# Generic: look for <main>, <article>, or role="main"
for selector in [
lambda: soup.find("main"),
lambda: soup.find("article"),
lambda: soup.find(attrs={"role": "main"}),
lambda: soup.find("div", class_="article"),
lambda: soup.find("div", class_="paper"),
]:
main = selector()
if main:
return str(main)
# Fallback: use body
body = soup.find("body")
if body:
return str(body)
return html_text
def html_to_markdown(html_path):
"""Convert HTML to clean markdown using BS4 pre-processing + pandoc."""
try:
with open(html_path, "r", encoding="utf-8", errors="replace") as f:
raw_html = f.read()
except Exception:
return ""
# Pre-process: extract main content only
clean_html = extract_main_content(raw_html)
# Pipe through pandoc via stdin
try:
result = subprocess.run(
["pandoc", "-f", "html", "-t", "markdown",
"--wrap=none", "--markdown-headings=atx"],
input=clean_html, capture_output=True, text=True, timeout=30
)
if result.returncode == 0 and result.stdout.strip():
md = result.stdout.strip()
md = postprocess_markdown(md)
if len(md) > 200: # must have meaningful content
return md
except Exception:
pass
return ""
def postprocess_markdown(text):
"""Clean up pandoc markdown output to remove artifacts."""
# Remove pandoc div markers (::: with attributes)
text = re.sub(r'^:::.*$', '', text, flags=re.MULTILINE)
# Remove {#id} and {.class} attribute blocks
text = re.sub(r'\{[#\.][^}]*\}', '', text)
# Remove leftover HTML tags
text = re.sub(r'<div[^>]*>', '', text)
text = re.sub(r'</div>', '', text)
text = re.sub(r'<span[^>]*>', '', text)
text = re.sub(r'</span>', '', text)
text = re.sub(r'<br\s*/?>', '\n', text)
# Remove raw HTML comments
text = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)
# Remove ar5iv footnote artifacts: [^N[^N[institutetext: ...]]
# After bracket cleanup, may look like [^N^^N^[institutetext: ...]
text = re.sub(r'\[\^[0-9]+\^?\^?[0-9]*\^?\[institutetext:\s*', '', text)
# Clean up any remaining [^N^ patterns
text = re.sub(r'\[\^[0-9]+\^?', '', text)
# Close brackets for above
text = re.sub(r'\]\]\]', '', text)
text = re.sub(r'\]\]', '', text)
# Remove empty brackets []
text = re.sub(r'\[\]', '', text)
# Remove trailing author note numbers: ][1122 or ][77
text = re.sub(r'\]\[\d+', '', text)
# Remove orphan ] from cleaned brackets
text = re.sub(r'\](?=\s)', ' ', text)
# Remove pandoc raw HTML markers: ``{=html} or ```{=html}
text = re.sub(r'``+\{=html\}', '', text)
# Remove footnote reference numbers after author names: [1234]
text = re.sub(r'\[\d{1,6}\]', '', text)
# Clean up author line artifacts: [ [Name ][1122]] -> Name
text = re.sub(r'\[\s*\[\s*', '', text)
text = re.sub(r'\s*\]\s*\]', '', text)
# Remove [] spacing artifacts
text = re.sub(r'\[\]', ' ', text)
# Remove hidden="" attributes text
text = re.sub(r'hidden=""', '', text)
# Remove [Submitted on ...] datelines
text = re.sub(r'\[Submitted on[^\]]*\]', '', text)
# Clean up excessive whitespace
text = re.sub(r'\n{4,}', '\n\n\n', text)
# Remove lines that are just whitespace
text = re.sub(r'^\s+$', '', text, flags=re.MULTILINE)
# Clean up leading/trailing whitespace on lines
lines = text.split('\n')
lines = [line.rstrip() for line in lines]
text = '\n'.join(lines)
return text.strip()
def create_obsidian_file(paper, markdown_content, output_dir):
"""Create an Obsidian markdown file with YAML frontmatter."""
bibcode = paper["bibcode"]
title = paper.get("title", bibcode)
authors = paper.get("author", [])
publisher = paper.get("pub", "")
year = paper.get("year", "")
abstract = paper.get("abstract", "")
author_str = json.dumps(authors, ensure_ascii=False)
source_url = f"https://ui.adsabs.harvard.edu/abs/{bibcode}/abstract"
date_str = f"{year}-01-01" if year else ""
frontmatter = f"""---
title: "{title}"
author: {author_str}
publisher: "{publisher}"
source: "{source_url}"
date: "{date_str}"
tags: "Astrophysics-Solar-and-Stellar-Astrophysics"
---
# {title}
## [ADS: {source_url}]({source_url})
"""
if markdown_content:
body = markdown_content
elif abstract:
body = f"## Abstract\n\n{abstract}"
else:
body = "Full text not available."
content = frontmatter + body + "\n"
output_path = os.path.join(output_dir, f"{bibcode}.md")
with open(output_path, "w", encoding="utf-8") as f:
f.write(content)
return output_path
def main():
if len(sys.argv) < 4:
print("Usage: python convert_to_obsidian.py <metadata.json> <download_dir> <output_dir>")
sys.exit(1)
metadata_path = sys.argv[1]
download_dir = sys.argv[2]
output_dir = sys.argv[3]
os.makedirs(output_dir, exist_ok=True)
with open(metadata_path, encoding="utf-8") as f:
papers = json.load(f)
html_dir = os.path.join(download_dir, "HTML")
stats = {"html_converted": 0, "abstract_only": 0, "total": 0}
for paper in papers:
bibcode = paper["bibcode"]
stats["total"] += 1
html_path = os.path.join(html_dir, f"{bibcode}.html")
markdown_content = ""
if os.path.isfile(html_path):
markdown_content = html_to_markdown(html_path)
if markdown_content:
stats["html_converted"] += 1
if not markdown_content:
stats["abstract_only"] += 1
output_path = create_obsidian_file(paper, markdown_content, output_dir)
status = "HTML->MD" if markdown_content else "Abstract only"
print(f" [{status}] {bibcode}")
print(f"\nDone! {stats['total']} papers processed.")
print(f" HTML converted: {stats['html_converted']}")
print(f" Abstract only: {stats['abstract_only']}")
print(f" Output: {output_dir}")
if __name__ == "__main__":
main()

View File

@ -12,7 +12,7 @@ description: "用于根据 ADS Bibcode 批量下载天体物理学文献。当
由于解析及下载逻辑较为复杂,我们将所有操作封装在了附带的 Python 脚本 `scripts/download.py` 中。在需要下载大量文献时,请调用它。
```bash
python c:\Users\fmq\Documents\astro\Article\.agents\skills\ads_literature_downloader\scripts\download.py \
python .claude/skills/ads_literature_downloader/scripts/download.py \
--bibcodes "2023ApJ...955...13H,2022MNRAS.510.4582S" \
--output_dir "./ads_papers_output" \
--threads 3

View File

@ -14,7 +14,7 @@ description: "用于在 ADS 中搜索天体物理文献,提取元数据信息
你可以通过执行该脚本来工作:
```bash
python c:\Users\fmq\Documents\astro\Article\.agents\skills\ads_metadata_search\scripts\search.py \
python .claude/skills/ads_metadata_search/scripts/search.py \
--query "author:\"Hawking, S.\"" \
--output "results.json" \
--rows 10

View File

@ -1,10 +1,29 @@
import ads
import json
import argparse
import os
import sys
# 如果你没有在环境变量里设置 ADS_DEV_KEY将使用以下的硬编码 Token
ads.config.token = "dpJWki7eHJ48TwlKz2AUyhXAxBgZrKo6AjE8hZwp"
import requests
# Load .env from project root if ADS_API_KEY not already set
def _load_token():
token = os.environ.get("ADS_API_KEY", "")
if token and token != "your_api_key_here":
return token
project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..", "..", ".."))
env_path = os.path.join(project_root, ".env")
if os.path.isfile(env_path):
with open(env_path, encoding="utf-8") as f:
for line in f:
line = line.strip()
if line and not line.startswith("#") and "=" in line:
k, _, v = line.partition("=")
k, v = k.strip(), v.strip()
if k == "ADS_API_KEY" and v and v != "your_api_key_here":
return v
return ""
ADS_API_URL = "https://api.adsabs.harvard.edu/v1/search/query"
def main():
parser = argparse.ArgumentParser(description="Search ADS and return metadata")
@ -12,55 +31,65 @@ def main():
parser.add_argument("--output", required=True, help="Output JSON file path")
parser.add_argument("--rows", type=int, default=10, help="Number of rows to return")
parser.add_argument("--year_range", help="Year range to filter, e.g. 2018-2023 or 2020")
args = parser.parse_args()
print(f"Searching ADS for query: {args.query}")
query_params = {
"q": args.query,
"rows": args.rows,
"fl": ["bibcode", "title", "author", "year", "abstract", "citation_count", "reference_count", "pub", "doi"]
}
token = _load_token()
if not token:
print("Error: ADS_API_KEY not configured. Edit .env in project root or set env var.")
sys.exit(1)
q = args.query
if args.year_range:
if '-' in args.year_range:
start_year, end_year = args.year_range.split('-')
query_params["fq"] = f"year:[{start_year} TO {end_year}]"
if "-" in args.year_range:
start, end = args.year_range.split("-", 1)
q += f" year:[{start} TO {end}]"
else:
query_params["fq"] = f"year:{args.year_range}"
q += f" year:{args.year_range}"
print(f"Searching ADS for query: {q}")
params = {
"q": q,
"rows": args.rows,
"fl": "bibcode,title,author,year,abstract,citation_count,reference_count,pub,doi",
}
headers = {"Authorization": f"Bearer {token}"}
try:
papers = list(ads.SearchQuery(**query_params))
resp = requests.get(ADS_API_URL, params=params, headers=headers, timeout=30)
resp.raise_for_status()
data = resp.json()
except Exception as e:
print(f"Query Failed: {e}")
sys.exit(1)
docs = data.get("response", {}).get("docs", [])
results = []
for p in papers:
record = {
"bibcode": getattr(p, "bibcode", "") or "",
"title": getattr(p, "title", [""])[0] if getattr(p, "title", None) else "",
"author": getattr(p, "author", []),
"year": getattr(p, "year", "") or "",
"abstract": getattr(p, "abstract", "") or "",
"citation_count": getattr(p, "citation_count", 0) or 0,
"reference_count": getattr(p, "reference_count", 0) or 0,
"pub": getattr(p, "pub", "") or "",
"doi": getattr(p, "doi", [""])[0] if getattr(p, "doi", None) else ""
}
results.append(record)
for d in docs:
title_list = d.get("title", [])
doi_list = d.get("doi", [])
results.append({
"bibcode": d.get("bibcode", ""),
"title": title_list[0] if title_list else "",
"author": d.get("author", []),
"year": d.get("year", ""),
"abstract": d.get("abstract", ""),
"citation_count": d.get("citation_count", 0),
"reference_count": d.get("reference_count", 0),
"pub": d.get("pub", ""),
"doi": doi_list[0] if doi_list else "",
})
with open(args.output, "w", encoding="utf-8") as f:
json.dump(results, f, ensure_ascii=False, indent=2)
print(f"Found {len(results)} papers. Saved metadata to {args.output}.")
# 打印简单摘要到终端
for i, r in enumerate(results[:5]):
print(f"\n[{i+1}] {r['title']} ({r['year']})")
print(f" Bibcode: {r['bibcode']} | Citations: {r['citation_count']}")
authors = ", ".join(r['author'][:3]) + (" et al." if len(r['author']) > 3 else "")
print(f" Authors: {authors}")
except Exception as e:
print(f"Query Failed: {e}")
sys.exit(1)
if __name__ == "__main__":
main()