ArchiveBox - Self-Hosted Website Archiving Service

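ArchiveBox is an open-source, self-hosted website archiving service. It automatically crawls the URLs you give it and archives their HTML, media files, JS, PDF files, and more for convenient offline viewing.

The following is quoted from the Background & Motivation section of the official site: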

The aim of ArchiveBox is to enable more of the internet to be archived by empowering people to self-host their own archives. The intent is for all the web content you care about to be viewable with common software in 50 – 100 years without needing to run ArchiveBox or other specialized software to replay it.

Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.

Whether it’s to resist censorship by saving articles before they get taken down or edited, or just to save a collection of early 2010’s flash games you love to play, having the tools to archive internet content enables you to save the stuff you care most about before it disappears.

[Image from "WTF is Link Rot?"]

The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don’t think everything should be preserved in an automated fashion–making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.

Because modern websites are complicated and often rely on dynamic content, ArchiveBox archives the sites in several different formats beyond what public archiving services like Archive.org/Archive.is save. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.

The rest of this post shows how to set up ArchiveBox on Debian 10 using the package manager.

Environment

  • Debian 10

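Steps

Unless noted otherwise, the commands below are run as root; prepend sudo as needed when working as another user.

Install dependencies

The official installation instructions do not mention installing npm separately, but ArchiveBox needs it at runtime.

The default version from the Debian 10 package repositories is sufficient, and it pulls in node as a dependency.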

apt update
apt install npm
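
Before moving on, it may be worth confirming that both tools actually ended up on the PATH. The following is a minimal sketch; the helper name `require_cmd` is made up for illustration and is not part of ArchiveBox or Debian.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on the PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1 found"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check the two tools the apt step above is expected to provide.
for cmd in node npm; do
    require_cmd "$cmd" || true
done
```

If either line reports "missing", rerunning the apt commands above is a reasonable first step.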

Install ArchiveBox

Add the repository

Run

echo "deb http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | tee /etc/apt/sources.list.d/archivebox.list
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C258F79DCC02E369
apt update

Install the ArchiveBox package

The official documentation offers two methods; installing with pip is recommended.

apt install archivebox
# or
python3 -m pip install --upgrade --ignore-installed archivebox

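ArchiveBox installed through the package manager may fail to run. If that happens, simply run the pip install command instead.

The package-manager install fails because the Django version it provides is too old.

Set up ArchiveBox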
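
To see which Django version python3 actually picks up, for instance before and after the pip install, a check along these lines may help. It is only a sketch: the function name `django_version` is hypothetical, and since the minimum version ArchiveBox needs is not stated here, it only reports what is installed.

```shell
#!/bin/sh
# Hypothetical helper: print the Django version that python3 imports,
# or a short note if Django (or python3 itself) is not available.
django_version() {
    python3 -c 'import django; print(django.get_version())' 2>/dev/null \
        || echo "Django is not importable from python3"
}

echo "Django: $(django_version)"
```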

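  1. Switch to a non-root user (all commands in this section are run as that user)
    • Run
      su - [Username]
  2. Create an empty directory for ArchiveBox
    • Run
      mkdir ~/archivebox
  3. Initialize ArchiveBox
    • Run
      cd ~/archivebox
      archivebox init --setup

    • During setup you are prompted to create an administrator account for the web interface; enter a password and an email address
    • Setup is then complete
  4. Start the ArchiveBox web UI
    • Once setup has finished, start the web UI by running
      archivebox server 0.0.0.0:[port]

      Note: running as a non-root user, ArchiveBox is not permitted to listen on privileged ports (those below 1024)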
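
The port restriction can be checked before starting the server. On Linux, ports below 1024 are by default reserved for root, so a non-root user needs a port in the range 1024 to 65535. The sketch below assumes that default; `is_unprivileged_port` and the example value `PORT=8000` are illustrative names and values, not something ArchiveBox provides.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for ports a non-root user can normally bind.
is_unprivileged_port() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 1024 ] && [ "$1" -le 65535 ]
}

PORT=8000   # assumed example value; pick any free port

if is_unprivileged_port "$PORT"; then
    echo "port $PORT is usable without root"
    # To start the web UI, run this from inside ~/archivebox:
    # archivebox server "0.0.0.0:$PORT"
else
    echo "port $PORT is privileged or invalid" >&2
fi
```

The server command is left commented out so that the check can be run on its own; uncomment it once the port has been chosen.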

Result

[Screenshots]
